[PyTorch] Using the ImageNet dataset and building miniImageNet
1. ImageNet: download and overview
ImageNet is a large-scale computer vision dataset that Stanford University and other institutions began building in 2007. Since its release in 2009 it has been widely used for benchmarking in computer vision. It currently contains more than 14 million images and is one of the most commonly used datasets for image classification, detection and localization in deep learning.
The ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) is a competition based on ImageNet; it uses a subset of the full dataset. The official site provides the ILSVRC2011 to ILSVRC2017 datasets, of which ILSVRC2012 is the most commonly used.
It has 1000 classes with roughly 1000 images per class: about 1.2 million images form the training set, 50,000 the validation set and 100,000 the test set (unlabeled).
1.1 Download links
The official ImageNet site is https://image-net.org/. The dataset is no longer open to the public: to download it you must register and be verified with an .edu e-mail address. When I downloaded it, the official site only reached about 300 KB/s, which is far too slow. Instead, you can use a torrent client such as Xunlei (Thunder) with the torrents below:
Validation set
http://academictorrents.com/download/5d6d0df7ed81efd49ca99ea4737e0ae5e3a5f2e5.torrent
Training set
http://academictorrents.com/download/a306397ccf9c2ead27155983c254227c0fd938e2.torrent
I also recommend https://academictorrents.com/, which collects download links for most of the commonly used datasets.
1.2 Initial processing
After the download you will have two files: ILSVRC2012_img_train.tar and ILSVRC2012_img_val.tar. In the directory containing them, open Git Bash and run:
md5sum ILSVRC2012_img_train.tar
md5sum ILSVRC2012_img_val.tar
This prints the MD5 checksum of each file. If the downloads are complete and correct, the checksums should match the ones published on the official site:
Training set: 1d675b47d978889d74fa0da5fadfb00e
Validation set: 29b22e2961454d5413ddabcf34fc5622
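If md5sum is not available, the same check can be done with a short Python sketch using hashlib (the archives are assumed to sit in the current directory):

import hashlib

def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 of a large file by reading it in chunks."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

print(md5_of('ILSVRC2012_img_train.tar'))  # expect 1d675b47d978889d74fa0da5fadfb00e
print(md5_of('ILSVRC2012_img_val.tar'))    # expect 29b22e2961454d5413ddabcf34fc5622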
Next, extract the archives. The phase parameter below selects the training set (train) or the validation set (val).
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--tar_dir', type=str)
args = parser.parse_args()


def untarring(phase):
    if args.tar_dir is None:
        raise ValueError("tar_dir must not be None")
    print('Untarring ILSVRC2012 ' + phase + ' package')
    imagenet_dir = './ImageNet/' + phase
    if not os.path.exists(imagenet_dir):
        os.makedirs(imagenet_dir)
    os.system('tar xvf ' + str(args.tar_dir) + ' -C ' + imagenet_dir)
    return imagenet_dir
After extraction the directory structure looks like this:
|-ImageNet
  |-train
    |-class0.tar
    |-class1.tar
    |-...
  |-val
    |-img1.JPEG
    |-...
1.3 The devkit
Besides the dataset itself, the site also provides ILSVRC2012_devkit_t12, which contains additional information about the dataset.
The file .\ILSVRC2012_devkit_t12\data\ILSVRC2012_validation_ground_truth.txt lists the label of each validation sample: for example, the label of the first validation image is 490, the second is 361, and so on.
The same directory also contains a meta file, which records the class information of the dataset.
The synsets field of meta contains, among other things, ILSVRC2012_ID and WNID. The WNID is the name of each class and is also the name of the corresponding class folder. ILSVRC2012_ID is the class identifier used in ILSVRC2012, and the validation labels use this identifier. Entries with ILSVRC2012_ID <= 1000 are the leaf classes, while entries with ILSVRC2012_ID > 1000 group several related classes together, so the whole structure forms a tree.
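As a small illustration (a sketch, assuming the devkit is extracted into the working directory; the positional field access mirrors the code later in this post), the validation labels can be mapped back to WNIDs like this:

from scipy.io import loadmat

# the synsets table from the devkit: 1860 entries, the first 1000 are the leaf classes
meta = loadmat('./ILSVRC2012_devkit_t12/data/meta.mat')['synsets'].reshape(-1)
# field 0 is ILSVRC2012_ID, field 1 is WNID
id_to_wnid = {int(row[0].item()): str(row[1].item()) for row in meta[:1000]}

# one ILSVRC2012_ID per line, one line per validation image
with open('./ILSVRC2012_devkit_t12/data/ILSVRC2012_validation_ground_truth.txt') as f:
    val_ids = [int(line) for line in f]

print(val_ids[0], id_to_wnid[val_ids[0]])  # e.g. 490 and its WNID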
2. miniImageNet
miniImageNet is a subset of ILSVRC2012_img_train. It was first proposed by Vinyals et al. in Matching Networks for One Shot Learning for few-shot learning. Because the original split was not released at first, Ravi et al. used their own split in Optimization as a Model for Few-Shot Learning, and this has become one of the most commonly used miniImageNet splits.
2.1 The miniImageNet split
Ravi's split (https://github.com/twitter-research/meta-learning-lstm/tree/master/data/miniImagenet) randomly selects 100 classes from ILSVRC2012: 64 classes form the training set, 16 the validation set and 20 the test set, and every image is resized to 84×84. This post uses this split.
Ravi's csv files list each filename and its label. Note that these filenames do not correspond one-to-one with the names in the original ILSVRC2012 dataset: the number that follows the label is the (1-based) position of that sample within its class.
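As a quick illustration (a sketch, assuming Ravi's train.csv has been placed under ./split_csv/miniImageNet/, the path used later in this post, and that its two columns are named filename and label):

import pandas as pd

split = pd.read_csv('./split_csv/miniImageNet/train.csv')
print(split.head(3))

name = split['filename'].iloc[0]
wnid = split['label'].iloc[0]
# same slice as the generation script below: the last four digits before the
# extension are the 1-based position of this sample within its class
ordinal = int(name[name.index('.') - 4:name.index('.')])
print(wnid, ordinal)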
Below is a Python implementation for building miniImageNet (adapted from https://github.com/yaoyao-liu/mini-imagenet-tools):
import argparse
import os
import numpy as np
import pandas as pd
import glob
import cv2
from shutil import copyfile
from tqdm import tqdm
# argument parser
parser = argparse.ArgumentParser(description='')
parser.add_argument('--tar_dir', type=str)
parser.add_argument('--phase', type=str, choices=['train', 'val'])
parser.add_argument('--imagenet_dir', type=str)
parser.add_argument('--miniImageNet_dir', type=str)
parser.add_argument('--split_filepath', type=str)
parser.add_argument('--image_resize', type=int, default=84)
args = parser.parse_args()
def untarring(phase):
    if args.tar_dir is None:
        raise ValueError("tar_dir must not be None")
    print('Untarring ILSVRC2012 ' + phase + ' package')
    imagenet_dir = './ImageNet/' + phase
    if not os.path.exists(imagenet_dir):
        os.makedirs(imagenet_dir)
    os.system('tar xvf ' + str(args.tar_dir) + ' -C ' + imagenet_dir)
    return imagenet_dir
class MiniImageNetGenerator(object):
    def __init__(self, input_args):
        self.processed_img_dir = './miniImageNet'
        self.mini_keys = None
        self.input_args = input_args
        self.imagenet_dir = input_args.imagenet_dir
        self.raw_mini_dir = './miniImageNet_raw'
        self.csv_paths = input_args.split_filepath
        if not os.path.exists(self.raw_mini_dir):
            os.mkdir(self.raw_mini_dir)
        self.image_resize = self.input_args.image_resize

    def untar_mini(self):
        self.mini_keys = ['n02110341', 'n01930112', 'n04509417', 'n04067472', 'n04515003', 'n02120079', 'n03924679',
'n02687172', 'n03075370', 'n07747607', 'n09246464', 'n02457408', 'n04418357', 'n03535780',
'n04435653', 'n03207743', 'n04251144', 'n03062245', 'n02174001', 'n07613480', 'n03998194',
'n02074367', 'n04146614', 'n04243546', 'n03854065', 'n03838899', 'n02871525', 'n03544143',
'n02108089', 'n13133613', 'n03676483', 'n03337140', 'n03272010', 'n01770081', 'n09256479',
'n02091244', 'n02116738', 'n04275548', 'n03773504', 'n02606052', 'n03146219', 'n04149813',
'n07697537', 'n02823428', 'n02089867', 'n03017168', 'n01704323', 'n01532829', 'n03047690',
'n03775546', 'n01843383', 'n02971356', 'n13054560', 'n02108551', 'n02101006', 'n03417042',
'n04612504', 'n01558993', 'n04522168', 'n02795169', 'n06794110', 'n01855672', 'n04258138',
'n02110063', 'n07584110', 'n02091831', 'n03584254', 'n03888605', 'n02113712', 'n03980874',
'n02219486', 'n02138441', 'n02165456', 'n02108915', 'n03770439', 'n01981276', 'n03220513',
'n02099601', 'n02747177', 'n01749939', 'n03476684', 'n02105505', 'n02950826', 'n04389033',
'n03347037', 'n02966193', 'n03127925', 'n03400231', 'n04296562', 'n03527444', 'n04443257',
'n02443484', 'n02114548', 'n04604644', 'n01910747', 'n04596742', 'n02111277', 'n03908618',
'n02129165', 'n02981792']
        for idx, keys in enumerate(self.mini_keys):
            print('Untarring ' + keys)
            os.system('tar xvf ' + self.imagenet_dir + '/' + keys + '.tar -C ' + self.raw_mini_dir)
        print('All the tar files are untarred')
    def process_original_files(self):
        split_lists = ['train', 'val', 'test']
        if not os.path.exists(self.processed_img_dir):
            os.makedirs(self.processed_img_dir)

        for this_split in split_lists:
            filename = os.path.join(self.csv_paths, this_split + '.csv')
            this_split_dir = self.processed_img_dir + '/' + this_split
            if not os.path.exists(this_split_dir):
                os.makedirs(this_split_dir)
            with open(filename) as csvfile:
                csv = pd.read_csv(csvfile, delimiter=',')
                images = {}
                print('Reading IDs....')
                for row in csv.values:
                    if row[1] in images.keys():
                        images[row[1]].append(row[0])
                    else:
                        images[row[1]] = [row[0]]
                print('Writing photos....')
                for cls in tqdm(images.keys()):
                    this_cls_dir = this_split_dir + '/' + cls
                    if not os.path.exists(this_cls_dir):
                        os.makedirs(this_cls_dir)
                    # find files whose name matches '.../...cls...'
                    lst_files = glob.glob(self.raw_mini_dir + "/*" + cls + "*")
                    # sort file names, get index
                    lst_index = [int(i[i.rfind('_') + 1:i.rfind('.')]) for i in lst_files]
                    index_sorted = np.argsort(np.array(lst_index))
                    # the name in the csv encodes the file index within the miniImageNet class
                    index_selected = [int(i[i.index('.') - 4:i.index('.')]) for i in images[cls]]
                    # note that indexes in the csv start from 1, not 0
                    selected_images = index_sorted[np.array(index_selected) - 1]
                    for i in np.arange(len(selected_images)):
                        if self.image_resize == 0:
                            copyfile(lst_files[selected_images[i]], os.path.join(this_cls_dir, images[cls][i]))
                        else:
                            im = cv2.imread(lst_files[selected_images[i]])
                            im_resized = cv2.resize(im, (self.image_resize, self.image_resize),
                                                    interpolation=cv2.INTER_AREA)
                            cv2.imwrite(os.path.join(this_cls_dir, images[cls][i]), im_resized)


if __name__ == "__main__":
    dataset_generator = MiniImageNetGenerator(args)
    dataset_generator.untar_mini()
    dataset_generator.process_original_files()
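Assuming the script above is saved as make_miniimagenet.py (the file name is arbitrary) and Ravi's csv files sit under ./split_csv/miniImageNet, a typical invocation would look like:

python make_miniimagenet.py --imagenet_dir ./ImageNet/train --split_filepath ./split_csv/miniImageNet --image_resize 84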
After running the script, the miniImageNet_raw folder holds the raw, unsorted samples, and the miniImageNet folder has the following structure:
|-miniImageNet
  |-train
    |-class1
      |-img1.jpg
      |-...
    |-...
  |-val
  |-test
The folder structures of val and test are the same as that of train.
3. Building the dataset class with ImageFolder
PyTorch provides the very convenient ImageFolder class for building image datasets.
dataset = ImageFolder(root='./miniImageNet/train')
However, a dataset built this way assigns labels according to the order of the folders: for example, all samples in the first folder are labeled as class 0. If we want the sample labels to match the dataset's original labels, we need to override a few methods.
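A quick way to see the default behaviour is to print class_to_idx, which ImageFolder builds by sorting the folder names (this assumes the miniImageNet folder built in section 2):

from torchvision.datasets import ImageFolder

dataset = ImageFolder(root='./miniImageNet/train')
# folders are sorted alphabetically and numbered 0..N-1, regardless of their ILSVRC2012_ID
print(dataset.class_to_idx)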
3.1 Overriding DatasetFolder methods
ImageFolder is a subclass of DatasetFolder, which provides two methods intended to be overridden: find_classes and make_dataset. find_classes must return the list of class names and a dict mapping class names to labels; make_dataset then builds the dataset from what find_classes returns.
The overridden version looks like this:
import os

import numpy as np
import pandas as pd
from scipy.io import loadmat
from torchvision.datasets import ImageFolder

# cache the ILSVRC2012_ID / WNID mapping from the devkit as meta_info.npy
meta_dir = os.path.join(os.getcwd(), 'meta_info')
if not os.path.exists(meta_dir):
    os.makedirs(meta_dir)
meta_info_path = os.path.join(meta_dir, "meta_info.npy")
if not os.path.exists(meta_info_path):
    meta = loadmat('./ILSVRC2012_devkit_t12/data/meta.mat')
    meta = meta.get('synsets')
    meta = meta.reshape(1860)
    meta_id = [[i[0].item(), i[1].item()] for i in meta]
    meta_info = np.array(meta_id[:1000])
    np.save(meta_info_path, meta_info, allow_pickle=True)
else:
    meta_info = np.load(meta_info_path, allow_pickle=True)
class MiniImageNetFolder(ImageFolder):
    """
    miniImageNet dataset. This is a subclass of ImageFolder <- DatasetFolder.
    find_classes() is overridden so that the labels match the ILSVRC2012_ID.
    -----------------------------------------------------------------------------------------
    Parameters:
        root: root directory of the image dataset, with a structure such as
            root/dog/xxx.png
            root/dog/xxy.png
            root/dog/[...]/xxz.png
            root/cat/123.png
            root/cat/nsdf3.png
            root/cat/[...]/asd932_.png
        phase: which split to use, one of {"train", "val", "test"}
    """

    def __init__(self, root, phase="train", transformer=None):
        self.meta_info = meta_info
        self.phase = phase
        super(MiniImageNetFolder, self).__init__(root=root, transform=transformer)

    def find_classes(self, directory):
        # map each WNID appearing in the split csv to its ILSVRC2012_ID
        dic = {}
        names = np.unique(np.array(pd.read_csv('./split_csv/miniImageNet/' + self.phase + '.csv')['label']))
        for i in self.meta_info:
            if i[1] in names:
                dic[i[1]] = int(i[0])
        return list(names), dic
Here meta_info stores the correspondence between every class in ILSVRC2012 and its label.
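As a quick sanity check (a sketch, assuming the miniImageNet folders built in section 2 and the csv path used above), the labels produced by this class are now ILSVRC2012_IDs rather than folder positions:

dataset = MiniImageNetFolder(root='./miniImageNet/train', phase='train')
# class_to_idx now maps each WNID to the ILSVRC2012_ID taken from meta_info
print(dataset.class_to_idx)
print(dataset[0][1])  # the target of the first sample is an ILSVRC2012_ID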
3.2 Episode sampling with BatchSampler
In few-shot learning, a commonly used strategy is episodic training. Each training step (episode) covers K different classes, and for each class it contains several support samples and query samples: the support samples are used to fit the model and the query samples to evaluate it. This is usually denoted K-way N-shot, where N is the number of support samples per class. To run 5-way 5-shot episodes (assuming 1 query sample per class), each episode therefore samples 5×(5+1)=30 images from the training set.
This sampling strategy can be implemented with a BatchSampler:
import torch


class PrototypicalBatchSampler(object):
    """
    Adapted from
    https://github.com/orobix/Prototypical-Networks-for-Few-shot-Learning-PyTorch/blob/master/src/prototypical_batch_sampler.py
    Yield a batch of indexes at each iteration.
    Indexes are calculated by keeping in account 'classes_per_episode' and 'sample_per_class';
    at each iteration the batch indexes refer to the support + query samples
    of 'classes_per_episode' random classes.
    __len__ returns the number of episodes per epoch (same as 'self.iterations').
    ----------------------------------------------------------------------------------------------
    Parameters:
        labels: ndarray, all labels for the current dataset
        classes_per_episode: int, number of classes in one episode
        sample_per_class: int, number of samples per class in one episode
        iterations: int, number of episodes in one epoch
    """

    def __init__(self, labels, classes_per_episode, sample_per_class, iterations, dataset_name="miniImageNet_train"):
        """
        Initialize the PrototypicalBatchSampler object
        Args:
        - labels: an iterable containing all the labels for the current dataset;
          sample indexes will be inferred from this iterable.
        - classes_per_episode: number of random classes for each episode
        - sample_per_class: number of samples per class in each episode (support + query)
        - iterations: number of iterations (episodes) per epoch
        """
        super(PrototypicalBatchSampler, self).__init__()
        self.labels = labels
        self.classes_per_it = classes_per_episode
        self.sample_per_class = sample_per_class
        self.iterations = iterations
        self.dataset_name = dataset_name
        # np.unique removes duplicates and returns the sorted unique labels (and their counts)
        self.classes, self.counts = np.unique(self.labels, return_counts=True)
        self.classes = torch.LongTensor(self.classes)

        # create a matrix, indexes, of dim: classes X max(elements per class), filled with nans;
        # for every class c, fill the corresponding row with the indices of the samples belonging to c;
        # numel_per_class stores the number of samples for each class/row
        indexes_path = os.path.join(os.getcwd(), 'episode_idx', self.dataset_name + '_indexes.npy')
        numel_per_class_path = os.path.join(os.getcwd(), 'episode_idx', self.dataset_name + '_numel_per_class.npy')
        if not os.path.exists(indexes_path) and not os.path.exists(numel_per_class_path):
            print("Creating dataset indexes")
            self.idxs = range(len(self.labels))
            self.indexes = np.empty((len(self.classes), max(self.counts)), dtype=int) * np.nan
            self.indexes = torch.Tensor(self.indexes)
            self.numel_per_class = torch.zeros_like(self.classes)
            for idx, label in enumerate(self.labels):
                label_idx = np.argwhere(self.classes == label).item()
                # np.where(condition) returns the coordinates of the non-zero (True) elements as a tuple,
                # one array per dimension. Here it returns the positions of the nan entries in row label_idx;
                # [0][0] picks the first empty (nan) slot, which is then filled with this sample's idx
                # (its index in the labels array).
                self.indexes[label_idx, np.where(np.isnan(self.indexes[label_idx]))[0][0]] = idx
                self.numel_per_class[label_idx] += 1
            save_path = os.path.join(os.getcwd(), 'episode_idx')
            if not os.path.exists(save_path):
                os.makedirs(save_path)
            np.save(os.path.join(save_path, self.dataset_name) + "_indexes.npy", self.indexes)
            np.save(os.path.join(save_path, self.dataset_name) + "_numel_per_class.npy", self.numel_per_class)
        else:
            print("Reading dataset indexes.")
            self.indexes = torch.tensor(np.load(indexes_path))
            self.numel_per_class = torch.tensor(np.load(numel_per_class_path))
    def __iter__(self):
        """
        Yield a batch (episode) of indexes.
        """
        spc = self.sample_per_class
        cpi = self.classes_per_it

        for it in range(self.iterations):
            batch_size = spc * cpi
            batch = torch.LongTensor(batch_size)
            # randomly pick cpi classes
            c_idxs = torch.randperm(len(self.classes))[:cpi]
            for i, c in enumerate(self.classes[c_idxs]):
                # the slice of the batch reserved for the i-th chosen class
                s = slice(i * spc, (i + 1) * spc)
                # FIXME: use torch.argwhere once it is available
                # find the row index of class c
                label_idx = torch.arange(len(self.classes)).long()[self.classes == c].item()
                # randomly choose spc samples from class label_idx
                sample_idxs = torch.randperm(self.numel_per_class[label_idx])[:spc]
                # write the indices of these samples into the batch
                batch[s] = self.indexes[label_idx][sample_idxs]
            # shuffle the batch
            batch = batch[torch.randperm(len(batch))]
            yield batch

    def __len__(self):
        """
        Returns the number of iterations (episodes, batches) per epoch.
        """
        return self.iterations
In this way, in each training epoch we draw iterations batches, and every batch contains a K-way N-shot set of samples, so the total number of episodes the model is trained on is epochs×iterations.
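As a rough skeleton (model, loss and the support/query split are omitted; parameter values are placeholders), the sampler plugs into a DataLoader like this, and the model sees epochs × iterations episodes in total:

from torch.utils.data import DataLoader
from torchvision.transforms import Compose, ToTensor

dataset = MiniImageNetFolder(root='./miniImageNet/train', phase='train',
                             transformer=Compose([ToTensor()]))
sampler = PrototypicalBatchSampler(dataset.targets,
                                   classes_per_episode=5,
                                   sample_per_class=6,   # 5 support + 1 query per class
                                   iterations=100)
dataloader = DataLoader(dataset=dataset, batch_sampler=sampler)

epochs = 10  # placeholder
for epoch in range(epochs):
    for x, y in dataloader:   # one episode: 5 classes x 6 samples = 30 images
        pass                  # split x, y into support and query sets and update the model here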
3.3 Batch visualization
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader
from torchvision.transforms import Compose, ToTensor

trans = Compose([ToTensor()])
dataset = MiniImageNetFolder(root='F:/processed_images/train/', phase="train", transformer=trans)
dataloader = DataLoader(dataset=dataset, batch_sampler=PrototypicalBatchSampler(dataset.targets, 5, 5, 10))


def visual_batch(dataloader, dataset_name):
    """
    Visualize the first batch (episode) yielded by the dataloader and save it as an image.
    :param dataloader: DataLoader yielding x with shape [batch_size, 3, h, w] and y with shape [batch_size]
    :param dataset_name: str, used to name the saved figure
    """
    x, y = next(iter(dataloader))
    plt.figure(figsize=(12, 12))
    for i in range(x.shape[0]):
        plt.subplot(5, 5, i + 1)
        # labels are ILSVRC2012_IDs, which start from 1, so subtract 1 to index meta_info
        idx = y[i].item() - 1
        plt.title(meta_info[idx, 1])
        plt.imshow(x[i].permute(1, 2, 0))
        plt.axis('off')
    if not os.path.exists(os.path.join(os.getcwd(), 'imgs')):
        os.makedirs(os.path.join(os.getcwd(), 'imgs'))
    plt.savefig('./imgs/visual_batch_' + dataset_name + '.png')


visual_batch(dataloader, "miniImageNet_train")
The visualization of the first batch is shown below: