[PyTorch] Using the ImageNet dataset and building miniImageNet
1. ImageNet: download and overview
ImageNet is a large-scale computer vision dataset that Stanford University and other institutions began building in 2007. Since its release in 2009 it has been widely used for benchmarking in computer vision. It currently contains more than 14 million images and is one of the most commonly used datasets for image classification, detection and localization in deep learning.
The ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) is a competition based on ImageNet; it uses a subset of the full dataset. The official site provides the ILSVRC2011 to ILSVRC2017 datasets, of which ILSVRC2012 is the most commonly used.
It has 1000 classes with roughly 1000 images per class: about 1.2 million images form the training set, 50,000 the validation set and 100,000 the test set (unlabeled).
1.1 Download links
The official ImageNet site is https://image-net.org/. The dataset is no longer open to the public: to download it you must register and be verified with an .edu e-mail address. When I downloaded it, the official site only reached about 300 KB/s, which is far too slow. Instead, you can use a torrent client such as Xunlei (Thunder) with the torrents below:
Validation set
http://academictorrents.com/download/5d6d0df7ed81efd49ca99ea4737e0ae5e3a5f2e5.torrent
Training set
http://academictorrents.com/download/a306397ccf9c2ead27155983c254227c0fd938e2.torrent
I also recommend https://academictorrents.com/, which collects download links for most of the commonly used datasets.
1.2 Initial processing
After the download you will have two files: ILSVRC2012_img_train.tar and ILSVRC2012_img_val.tar. In the directory containing them, open Git Bash and run:
md5sum ILSVRC2012_img_train.tar
md5sum ILSVRC2012_img_val.tar
This prints the MD5 checksum of each file. If the downloads are complete and correct, the checksums should match the ones published on the official site:
Training set: 1d675b47d978889d74fa0da5fadfb00e
Validation set: 29b22e2961454d5413ddabcf34fc5622
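If md5sum is not available, the same check can be done with a short Python sketch using hashlib (the archives are assumed to sit in the current directory):

import hashlib

def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 of a large file by reading it in chunks."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

print(md5_of('ILSVRC2012_img_train.tar'))  # expect 1d675b47d978889d74fa0da5fadfb00e
print(md5_of('ILSVRC2012_img_val.tar'))    # expect 29b22e2961454d5413ddabcf34fc5622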
Next, extract the archives. The phase parameter below selects the training set (train) or the validation set (val).
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--tar_dir', type=str)
args = parser.parse_args()


def untarring(phase):
    if args.tar_dir is None:
        raise ValueError("tar_dir must not be None")
    print('Untarring ILSVRC2012 ' + phase + ' package')
    imagenet_dir = './ImageNet/' + phase
    if not os.path.exists(imagenet_dir):
        os.makedirs(imagenet_dir)
    os.system('tar xvf ' + str(args.tar_dir) + ' -C ' + imagenet_dir)
    return imagenet_dir
After extraction the directory structure looks like this:
|-ImageNet
  |-train
    |-class0.tar
    |-class1.tar
    |-...
  |-val
    |-img1.JPEG
    |-...
1.3 The devkit
Besides the dataset itself, the site also provides ILSVRC2012_devkit_t12, which contains additional information about the dataset.
The file .\ILSVRC2012_devkit_t12\data\ILSVRC2012_validation_ground_truth.txt lists the label of each validation sample: for example, the label of the first validation image is 490, the second is 361, and so on.
The same directory also contains a meta file, which records the class information of the dataset.
The synsets field of meta contains, among other things, ILSVRC2012_ID and WNID. The WNID is the name of each class and is also the name of the corresponding class folder. ILSVRC2012_ID is the class identifier used in ILSVRC2012, and the validation labels use this identifier. Entries with ILSVRC2012_ID <= 1000 are the leaf classes, while entries with ILSVRC2012_ID > 1000 group several related classes together, so the whole structure forms a tree.
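As a small illustration (a sketch, assuming the devkit is extracted into the working directory; the positional field access mirrors the code later in this post), the validation labels can be mapped back to WNIDs like this:

from scipy.io import loadmat

# the synsets table from the devkit: 1860 entries, the first 1000 are the leaf classes
meta = loadmat('./ILSVRC2012_devkit_t12/data/meta.mat')['synsets'].reshape(-1)
# field 0 is ILSVRC2012_ID, field 1 is WNID
id_to_wnid = {int(row[0].item()): str(row[1].item()) for row in meta[:1000]}

# one ILSVRC2012_ID per line, one line per validation image
with open('./ILSVRC2012_devkit_t12/data/ILSVRC2012_validation_ground_truth.txt') as f:
    val_ids = [int(line) for line in f]

print(val_ids[0], id_to_wnid[val_ids[0]])  # e.g. 490 and its WNID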
2. miniImageNet
miniImageNet is a subset of ILSVRC2012_img_train. It was first proposed by Vinyals et al. in Matching Networks for One Shot Learning for few-shot learning. Because the original split was not released at first, Ravi et al. used their own split in Optimization as a Model for Few-Shot Learning, and this has become one of the most commonly used miniImageNet splits.
2.1 The miniImageNet split
Ravi's split (https://github.com/twitter-research/meta-learning-lstm/tree/master/data/miniImagenet) randomly selects 100 classes from ILSVRC2012: 64 classes form the training set, 16 the validation set and 20 the test set, and every image is resized to 84×84. This post uses this split.
Ravi's csv files list each filename and its label. Note that these filenames do not correspond one-to-one with the names in the original ILSVRC2012 dataset: the number that follows the label is the (1-based) position of that sample within its class.
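As a quick illustration (a sketch, assuming Ravi's train.csv has been placed under ./split_csv/miniImageNet/, the path used later in this post, and that its two columns are named filename and label):

import pandas as pd

split = pd.read_csv('./split_csv/miniImageNet/train.csv')
print(split.head(3))

name = split['filename'].iloc[0]
wnid = split['label'].iloc[0]
# same slice as the generation script below: the last four digits before the
# extension are the 1-based position of this sample within its class
ordinal = int(name[name.index('.') - 4:name.index('.')])
print(wnid, ordinal)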
Below is a Python implementation for building miniImageNet (adapted from https://github.com/yaoyao-liu/mini-imagenet-tools):
import argparse
import os
import numpy as np
import pandas as pd
import glob
import cv2
from shutil import copyfile
from tqdm import tqdm
# argument parser
parser = argparse.ArgumentParser(description='')
parser.add_argument('--tar_dir', type=str)
parser.add_argument('--phase', type=str, choices=['train', 'val'])
parser.add_argument('--imagenet_dir', type=str)
parser.add_argument('--miniImageNet_dir', type=str)
parser.add_argument('--split_filepath', type=str)
parser.add_argument('--image_resize', type=int, default=84)
args = parser.parse_args()
def untarring(phase):
    if args.tar_dir is None:
        raise ValueError("tar_dir must not be None")
    print('Untarring ILSVRC2012 ' + phase + ' package')
    imagenet_dir = './ImageNet/' + phase
    if not os.path.exists(imagenet_dir):
        os.makedirs(imagenet_dir)
    os.system('tar xvf ' + str(args.tar_dir) + ' -C ' + imagenet_dir)
    return imagenet_dir
class MiniImageNetGenerator(object):
    def __init__(self, input_args):
        self.processed_img_dir = './miniImageNet'
        self.mini_keys = None
        self.input_args = input_args
        self.imagenet_dir = input_args.imagenet_dir
        self.raw_mini_dir = './miniImageNet_raw'
        self.csv_paths = input_args.split_filepath
        if not os.path.exists(self.raw_mini_dir):
            os.mkdir(self.raw_mini_dir)
        self.image_resize = self.input_args.image_resize

    def untar_mini(self):
        self.mini_keys = ['n02110341', 'n01930112', 'n04509417', 'n04067472', 'n04515003', 'n02120079', 'n03924679',
'n02687172', 'n03075370', 'n07747607', 'n09246464', 'n02457408', 'n04418357', 'n03535780',
'n04435653', 'n03207743', 'n04251144', 'n03062245', 'n02174001', 'n07613480', 'n03998194',
'n02074367', 'n04146614', 'n04243546', 'n03854065', 'n03838899', 'n02871525', 'n03544143',
'n02108089', 'n13133613', 'n03676483', 'n03337140', 'n03272010', 'n01770081', 'n09256479',
'n02091244', 'n02116738', 'n04275548', 'n03773504', 'n02606052', 'n03146219', 'n04149813',
'n07697537', 'n02823428', 'n02089867', 'n03017168', 'n01704323', 'n01532829', 'n03047690',
'n03775546', 'n01843383', 'n02971356', 'n13054560', 'n02108551', 'n02101006', 'n03417042',
'n04612504', 'n01558993', 'n04522168', 'n02795169', 'n06794110', 'n01855672', 'n04258138',
'n02110063', 'n07584110', 'n02091831', 'n03584254', 'n03888605', 'n02113712', 'n03980874',
'n02219486', 'n02138441', 'n02165456', 'n02108915', 'n03770439', 'n01981276', 'n03220513',
'n02099601', 'n02747177', 'n01749939', 'n03476684', 'n02105505', 'n02950826', 'n04389033',
'n03347037', 'n02966193', 'n03127925', 'n03400231', 'n04296562', 'n03527444', 'n04443257',
'n02443484', 'n02114548', 'n04604644', 'n01910747', 'n04596742', 'n02111277', 'n03908618',
'n02129165', 'n02981792']
        for idx, keys in enumerate(self.mini_keys):
            print('Untarring ' + keys)
            os.system('tar xvf ' + self.imagenet_dir + '/' + keys + '.tar -C ' + self.raw_mini_dir)
        print('All the tar files are untarred')
    def process_original_files(self):
        split_lists = ['train', 'val', 'test']
        if not os.path.exists(self.processed_img_dir):
            os.makedirs(self.processed_img_dir)

        for this_split in split_lists:
            filename = os.path.join(self.csv_paths, this_split + '.csv')
            this_split_dir = self.processed_img_dir + '/' + this_split
            if not os.path.exists(this_split_dir):
                os.makedirs(this_split_dir)
            with open(filename) as csvfile:
                csv = pd.read_csv(csvfile, delimiter=',')
                images = {}
                print('Reading IDs....')
                for row in csv.values:
                    if row[1] in images.keys():
                        images[row[1]].append(row[0])
                    else:
                        images[row[1]] = [row[0]]
                print('Writing photos....')
                for cls in tqdm(images.keys()):
                    this_cls_dir = this_split_dir + '/' + cls
                    if not os.path.exists(this_cls_dir):
                        os.makedirs(this_cls_dir)
                    # find files whose name matches '.../...cls...'
                    lst_files = glob.glob(self.raw_mini_dir + "/*" + cls + "*")
                    # sort file names, get index
                    lst_index = [int(i[i.rfind('_') + 1:i.rfind('.')]) for i in lst_files]
                    index_sorted = np.argsort(np.array(lst_index))
                    # the name in the csv encodes the file index within the miniImageNet class
                    index_selected = [int(i[i.index('.') - 4:i.index('.')]) for i in images[cls]]
                    # note that indexes in the csv start from 1, not 0
                    selected_images = index_sorted[np.array(index_selected) - 1]
                    for i in np.arange(len(selected_images)):
                        if self.image_resize == 0:
                            copyfile(lst_files[selected_images[i]], os.path.join(this_cls_dir, images[cls][i]))
                        else:
                            im = cv2.imread(lst_files[selected_images[i]])
                            im_resized = cv2.resize(im, (self.image_resize, self.image_resize),
                                                    interpolation=cv2.INTER_AREA)
                            cv2.imwrite(os.path.join(this_cls_dir, images[cls][i]), im_resized)


if __name__ == "__main__":
    dataset_generator = MiniImageNetGenerator(args)
    dataset_generator.untar_mini()
    dataset_generator.process_original_files()
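Assuming the script above is saved as make_miniimagenet.py (the file name is arbitrary) and Ravi's csv files sit under ./split_csv/miniImageNet, a typical invocation would look like:

python make_miniimagenet.py --imagenet_dir ./ImageNet/train --split_filepath ./split_csv/miniImageNet --image_resize 84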
After running the script, the miniImageNet_raw folder holds the raw, unsorted samples, and the miniImageNet folder has the following structure:
|-miniImageNet
  |-train
    |-class1
      |-img1.jpg
      |-...
    |-...
  |-val
  |-test
The folder structures of val and test are the same as that of train.
3. Building the dataset class with ImageFolder
PyTorch provides the very convenient ImageFolder class for building image datasets.
dataset = ImageFolder(root='./miniImageNet/train')
However, a dataset built this way assigns labels according to the order of the folders: for example, all samples in the first folder are labeled as class 0. If we want the sample labels to match the dataset's original labels, we need to override a few methods.
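A quick way to see the default behaviour is to print class_to_idx, which ImageFolder builds by sorting the folder names (this assumes the miniImageNet folder built in section 2):

from torchvision.datasets import ImageFolder

dataset = ImageFolder(root='./miniImageNet/train')
# folders are sorted alphabetically and numbered 0..N-1, regardless of their ILSVRC2012_ID
print(dataset.class_to_idx)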
3.1 Overriding DatasetFolder methods
ImageFolder is a subclass of DatasetFolder, which provides two methods intended to be overridden: find_classes and make_dataset. find_classes must return the list of class names and a dict mapping class names to labels; make_dataset then builds the dataset from what find_classes returns.
The overridden version looks like this:
import os

import numpy as np
import pandas as pd
from scipy.io import loadmat
from torchvision.datasets import ImageFolder

# cache the ILSVRC2012_ID / WNID mapping from the devkit as meta_info.npy
meta_dir = os.path.join(os.getcwd(), 'meta_info')
if not os.path.exists(meta_dir):
    os.makedirs(meta_dir)
meta_info_path = os.path.join(meta_dir, "meta_info.npy")
if not os.path.exists(meta_info_path):
    meta = loadmat('./ILSVRC2012_devkit_t12/data/meta.mat')
    meta = meta.get('synsets')
    meta = meta.reshape(1860)
    meta_id = [[i[0].item(), i[1].item()] for i in meta]
    meta_info = np.array(meta_id[:1000])
    np.save(meta_info_path, meta_info, allow_pickle=True)
else:
    meta_info = np.load(meta_info_path, allow_pickle=True)
class MiniImageNetFolder(ImageFolder):
    """
    miniImageNet dataset. This is a subclass of ImageFolder <- DatasetFolder.
    find_classes() is overridden so that the labels match the ILSVRC2012_ID.
    -----------------------------------------------------------------------------------------
    Parameters:
        root: root directory of the image dataset, with a structure such as
            root/dog/xxx.png
            root/dog/xxy.png
            root/dog/[...]/xxz.png
            root/cat/123.png
            root/cat/nsdf3.png
            root/cat/[...]/asd932_.png
        phase: which split to use, one of {"train", "val", "test"}
    """

    def __init__(self, root, phase="train", transformer=None):
        self.meta_info = meta_info
        self.phase = phase
        super(MiniImageNetFolder, self).__init__(root=root, transform=transformer)

    def find_classes(self, directory):
        # map each WNID appearing in the split csv to its ILSVRC2012_ID
        dic = {}
        names = np.unique(np.array(pd.read_csv('./split_csv/miniImageNet/' + self.phase + '.csv')['label']))
        for i in self.meta_info:
            if i[1] in names:
                dic[i[1]] = int(i[0])
        return list(names), dic
Here meta_info stores the correspondence between every class in ILSVRC2012 and its label.
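As a quick sanity check (a sketch, assuming the miniImageNet folders built in section 2 and the csv path used above), the labels produced by this class are now ILSVRC2012_IDs rather than folder positions:

dataset = MiniImageNetFolder(root='./miniImageNet/train', phase='train')
# class_to_idx now maps each WNID to the ILSVRC2012_ID taken from meta_info
print(dataset.class_to_idx)
print(dataset[0][1])  # the target of the first sample is an ILSVRC2012_ID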
3.2 Episode sampling with BatchSampler
In few-shot learning, a commonly used strategy is episodic training. Each training step (episode) covers K different classes, and for each class it contains several support samples and query samples: the support samples are used to fit the model and the query samples to evaluate it. This is usually denoted K-way N-shot, where N is the number of support samples per class. To run 5-way 5-shot episodes (assuming 1 query sample per class), each episode therefore samples 5×(5+1)=30 images from the training set.
This sampling strategy can be implemented with a BatchSampler:
import torch


class PrototypicalBatchSampler(object):
    """
    Adapted from
    https://github.com/orobix/Prototypical-Networks-for-Few-shot-Learning-PyTorch/blob/master/src/prototypical_batch_sampler.py
    Yield a batch of indexes at each iteration.
    Indexes are calculated by keeping in account 'classes_per_episode' and 'sample_per_class';
    at each iteration the batch indexes refer to the support + query samples
    of 'classes_per_episode' random classes.
    __len__ returns the number of episodes per epoch (same as 'self.iterations').
    ----------------------------------------------------------------------------------------------
    Parameters:
        labels: ndarray, all labels for the current dataset
        classes_per_episode: int, number of classes in one episode
        sample_per_class: int, number of samples per class in one episode
        iterations: int, number of episodes in one epoch
    """

    def __init__(self, labels, classes_per_episode, sample_per_class, iterations, dataset_name="miniImageNet_train"):
        """
        Initialize the PrototypicalBatchSampler object
        Args:
        - labels: an iterable containing all the labels for the current dataset;
          sample indexes will be inferred from this iterable.
        - classes_per_episode: number of random classes for each episode
        - sample_per_class: number of samples per class in each episode (support + query)
        - iterations: number of iterations (episodes) per epoch
        """
        super(PrototypicalBatchSampler, self).__init__()
        self.labels = labels
        self.classes_per_it = classes_per_episode
        self.sample_per_class = sample_per_class
        self.iterations = iterations
        self.dataset_name = dataset_name
        # np.unique removes duplicates and returns the sorted unique labels (and their counts)
        self.classes, self.counts = np.unique(self.labels, return_counts=True)
        self.classes = torch.LongTensor(self.classes)

        # create a matrix, indexes, of dim: classes X max(elements per class), filled with nans;
        # for every class c, fill the corresponding row with the indices of the samples belonging to c;
        # numel_per_class stores the number of samples for each class/row
        indexes_path = os.path.join(os.getcwd(), 'episode_idx', self.dataset_name + '_indexes.npy')
        numel_per_class_path = os.path.join(os.getcwd(), 'episode_idx', self.dataset_name + '_numel_per_class.npy')
        if not os.path.exists(indexes_path) and not os.path.exists(numel_per_class_path):
            print("Creating dataset indexes")
            self.idxs = range(len(self.labels))
            self.indexes = np.empty((len(self.classes), max(self.counts)), dtype=int) * np.nan
            self.indexes = torch.Tensor(self.indexes)
            self.numel_per_class = torch.zeros_like(self.classes)
            for idx, label in enumerate(self.labels):
                label_idx = np.argwhere(self.classes == label).item()
                # np.where(condition) returns the coordinates of the non-zero (True) elements as a tuple,
                # one array per dimension. Here it returns the positions of the nan entries in row label_idx;
                # [0][0] picks the first empty (nan) slot, which is then filled with this sample's idx
                # (its index in the labels array).
                self.indexes[label_idx, np.where(np.isnan(self.indexes[label_idx]))[0][0]] = idx
                self.numel_per_class[label_idx] += 1
            save_path = os.path.join(os.getcwd(), 'episode_idx')
            if not os.path.exists(save_path):
                os.makedirs(save_path)
            np.save(os.path.join(save_path, self.dataset_name) + "_indexes.npy", self.indexes)
            np.save(os.path.join(save_path, self.dataset_name) + "_numel_per_class.npy", self.numel_per_class)
        else:
            print("Reading dataset indexes.")
            self.indexes = torch.tensor(np.load(indexes_path))
            self.numel_per_class = torch.tensor(np.load(numel_per_class_path))
    def __iter__(self):
        """
        Yield a batch (episode) of indexes.
        """
        spc = self.sample_per_class
        cpi = self.classes_per_it

        for it in range(self.iterations):
            batch_size = spc * cpi
            batch = torch.LongTensor(batch_size)
            # randomly pick cpi classes
            c_idxs = torch.randperm(len(self.classes))[:cpi]
            for i, c in enumerate(self.classes[c_idxs]):
                # the slice of the batch reserved for the i-th chosen class
                s = slice(i * spc, (i + 1) * spc)
                # FIXME: use torch.argwhere once it is available
                # find the row index of class c
                label_idx = torch.arange(len(self.classes)).long()[self.classes == c].item()
                # randomly choose spc samples from class label_idx
                sample_idxs = torch.randperm(self.numel_per_class[label_idx])[:spc]
                # write the indices of these samples into the batch
                batch[s] = self.indexes[label_idx][sample_idxs]
            # shuffle the batch
            batch = batch[torch.randperm(len(batch))]
            yield batch

    def __len__(self):
        """
        Returns the number of iterations (episodes, batches) per epoch.
        """
        return self.iterations
In this way, in each training epoch we draw iterations batches, and every batch contains a K-way N-shot set of samples, so the total number of episodes the model is trained on is epochs×iterations.
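As a rough skeleton (model, loss and the support/query split are omitted; parameter values are placeholders), the sampler plugs into a DataLoader like this, and the model sees epochs × iterations episodes in total:

from torch.utils.data import DataLoader
from torchvision.transforms import Compose, ToTensor

dataset = MiniImageNetFolder(root='./miniImageNet/train', phase='train',
                             transformer=Compose([ToTensor()]))
sampler = PrototypicalBatchSampler(dataset.targets,
                                   classes_per_episode=5,
                                   sample_per_class=6,   # 5 support + 1 query per class
                                   iterations=100)
dataloader = DataLoader(dataset=dataset, batch_sampler=sampler)

epochs = 10  # placeholder
for epoch in range(epochs):
    for x, y in dataloader:   # one episode: 5 classes x 6 samples = 30 images
        pass                  # split x, y into support and query sets and update the model here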
3.3 Batch visualization
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader
from torchvision.transforms import Compose, ToTensor

trans = Compose([ToTensor()])
dataset = MiniImageNetFolder(root='F:/processed_images/train/', phase="train", transformer=trans)
dataloader = DataLoader(dataset=dataset, batch_sampler=PrototypicalBatchSampler(dataset.targets, 5, 5, 10))


def visual_batch(dataloader, dataset_name):
    """
    Visualize the first batch (episode) yielded by the dataloader and save it as an image.
    :param dataloader: DataLoader yielding x with shape [batch_size, 3, h, w] and y with shape [batch_size]
    :param dataset_name: str, used to name the saved figure
    """
    x, y = next(iter(dataloader))
    plt.figure(figsize=(12, 12))
    for i in range(x.shape[0]):
        plt.subplot(5, 5, i + 1)
        # labels are ILSVRC2012_IDs, which start from 1, so subtract 1 to index meta_info
        idx = y[i].item() - 1
        plt.title(meta_info[idx, 1])
        plt.imshow(x[i].permute(1, 2, 0))
        plt.axis('off')
    if not os.path.exists(os.path.join(os.getcwd(), 'imgs')):
        os.makedirs(os.path.join(os.getcwd(), 'imgs'))
    plt.savefig('./imgs/visual_batch_' + dataset_name + '.png')


visual_batch(dataloader, "miniImageNet_train")
The visualization of the first batch is shown below: