Preface

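There are two types of dataset objects: a regular Dataset and an ✨IterableDataset✨. A Dataset provides fast random access to rows and uses memory mapping, so loading even a large dataset takes only a relatively small amount of device memory. For really, really big datasets that won't even fit on disk or in memory, an IterableDataset lets you access and use the dataset without waiting for it to download completely!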

This tutorial shows you how to load and access a Dataset and an IterableDataset.

Operating System: Ubuntu 22.04.4 LTS

References

  1. Know your dataset

Dataset

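When you load a dataset split, you get a Dataset object. You can do many things with a Dataset object, which is why it is important to learn how to manipulate and interact with the data stored inside.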

This tutorial uses the rotten_tomatoes dataset, but feel free to load any dataset you like and follow along!

>>> from datasets import load_dataset

>>> dataset = load_dataset("rotten_tomatoes", split="train")

Indexing

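A Dataset contains columns of data, and each column can hold a different type of data. The index, or axis label, is used to access examples in the dataset. For example, indexing by row returns a dictionary holding one example from the dataset: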

# Get the first row in the dataset
>>> dataset[0]
{'label': 1,
'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}

Use the - operator (a negative index) to count from the end of the dataset:

# Get the last row in the dataset
>>> dataset[-1]
{'label': 0,
'text': 'things really get weird , though not particularly scary : the movie is all portent and no content .'}

Indexing by column name returns a list of all the values in that column:

>>> dataset["text"]
['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .',
'effective but too-tepid biopic',
...,
'things really get weird , though not particularly scary : the movie is all portent and no content .']

You can combine row and column-name indexing to return the specific value at a position:

>>> dataset[0]["text"]
'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'

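Keep in mind that the order of indexing matters, especially when working with large audio and image datasets. Indexing by column name first returns every value in the column and only then picks out the value at the requested position. For large datasets, indexing by column name first can therefore be slower.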

>>> import time

>>> start_time = time.time()
>>> text = dataset[0]["text"]
>>> end_time = time.time()
>>> print(f"Elapsed time: {end_time - start_time:.4f} seconds")
Elapsed time: 0.0031 seconds

>>> start_time = time.time()
>>> text = dataset["text"][0]
>>> end_time = time.time()
>>> print(f"Elapsed time: {end_time - start_time:.4f} seconds")
Elapsed time: 0.0094 seconds
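
To make the reason for that gap concrete, here is a minimal sketch that counts how many values each access order touches. ToyColumnStore is a hypothetical stand-in written for this illustration; it is not the datasets API and it only mimics the idea that a column access materializes the whole column.

```python
class ToyColumnStore:
    """Hypothetical stand-in for a column-oriented dataset (not the datasets API)."""

    def __init__(self, columns):
        self._columns = columns
        self.decoded = 0  # how many cell values have been materialized so far

    def _decode(self, value):
        self.decoded += 1
        return value

    def __getitem__(self, key):
        if isinstance(key, int):
            # Row access: materialize one value per column.
            return {name: self._decode(col[key]) for name, col in self._columns.items()}
        # Column access: materialize every value in that column.
        return [self._decode(value) for value in self._columns[key]]


store = ToyColumnStore({
    "text": [f"review {i}" for i in range(1000)],
    "label": [i % 2 for i in range(1000)],
})

row_first = store[0]["text"]        # row, then column
row_first_cost = store.decoded

store.decoded = 0
column_first = store["text"][0]     # column, then row
column_first_cost = store.decoded

print(row_first_cost, column_first_cost)  # prints: 2 1000
```

Both orders return the same value, but the column-first order touches every row. With decoded audio or images in place of short strings, that extra work is what makes the difference noticeable.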

Slicing

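Slicing returns a slice, or subset, of the dataset, which is useful for viewing several rows at once. To slice a dataset, use the : operator to specify a range of positions.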

# Get the first three rows
>>> dataset[:3]
{'label': [1, 1, 1],
'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .',
'effective but too-tepid biopic']}

# Get rows between three and six
>>> dataset[3:6]
{'label': [1, 1, 1],
'text': ['if you sometimes like to go to the movies to have fun , wasabi is a good place to start .',
"emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one .",
'the film provides some great insight into the neurotic mindset of all comics -- even those who have reached the absolute top of the game .']}
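
Note that a slice comes back as one dictionary of column lists rather than a list of rows. If you prefer to loop over rows, a small helper can transpose it. The sketch below uses a hand-written batch with placeholder review strings, and batch_to_rows is a hypothetical helper name, not something the library provides.

```python
# A batch shaped like the output of dataset[:3]: one list per column.
# The review strings here are shortened placeholders.
batch = {
    "label": [1, 1, 1],
    "text": ["first review", "second review", "effective but too-tepid biopic"],
}


def batch_to_rows(batch):
    """Turn a dict of equal-length column lists into a list of row dicts."""
    names = list(batch)
    return [dict(zip(names, values)) for values in zip(*batch.values())]


rows = batch_to_rows(batch)
for row in rows:
    print(row["label"], row["text"])
```

This assumes every column list has the same length, which holds for the dictionaries shown above.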

IterableDataset

An IterableDataset is loaded when you set the streaming parameter to True in load_dataset():

>>> from datasets import load_dataset

>>> iterable_dataset = load_dataset("food101", split="train", streaming=True)
>>> for example in iterable_dataset:
...     print(example)
...     break
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F0681F5C520>, 'label': 6}

You can also create an IterableDataset from an existing Dataset. This is faster than streaming mode because the dataset is streamed from local files:

>>> from datasets import load_dataset

>>> dataset = load_dataset("rotten_tomatoes", split="train")
>>> iterable_dataset = dataset.to_iterable_dataset()

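An IterableDataset steps through the data one example at a time, so you don't have to wait for the whole dataset to download before you can use it. As you can imagine, this is very useful for large datasets you want to start using right away!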

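However, this means an IterableDataset behaves differently from a regular Dataset. You cannot randomly access examples in an IterableDataset. Instead, you iterate over its elements, for example by calling next(iter()) or by using a for loop to return the next item from the IterableDataset: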

>>> next(iter(iterable_dataset))
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F0681F59B50>,
'label': 6}

>>> for example in iterable_dataset:
...     print(example)
...     break
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F7479DE82B0>, 'label': 6}
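
Since there is no random access, reaching the example at position n means consuming the n examples before it. A minimal sketch of that idea with the standard library's itertools.islice follows. fake_stream is a hypothetical generator standing in for a streamed dataset so the snippet runs without any download, and nth_example is an illustrative helper, not part of the datasets library.

```python
from itertools import islice


def fake_stream():
    """Hypothetical stand-in for a streamed dataset: yields examples lazily."""
    for i in range(1_000_000):
        yield {"id": i, "label": i % 101}


def nth_example(iterable, n):
    """Return the example at position n by skipping over the first n items."""
    return next(islice(iterable, n, None))


example = nth_example(fake_stream(), 5)
print(example)  # prints: {'id': 5, 'label': 5}
```

The cost grows with n, which is the trade-off for not having to hold or download the full dataset first.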

You can use IterableDataset.take() to return a subset of the dataset containing a specific number of examples:

# Get first three examples
>>> list(iterable_dataset.take(3))
[{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F7479DEE9D0>,
'label': 6},
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x7F7479DE8190>,
'label': 6},
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x383 at 0x7F7479DE8310>,
'label': 6}]

Unlike slicing, however, IterableDataset.take() creates a new IterableDataset.
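
One way to picture what "creates a new IterableDataset" implies: take() can hand back another lazy object without reading anything, and the data is only pulled once you iterate. The sketch below is a toy model of that behavior under my own assumptions. ToyIterable and source are hypothetical names, and this is not how the datasets library is implemented.

```python
from itertools import islice


class ToyIterable:
    """Hypothetical miniature of a lazy dataset (not the datasets implementation)."""

    def __init__(self, make_iter):
        self._make_iter = make_iter  # zero-argument callable returning a fresh iterator

    def __iter__(self):
        return self._make_iter()

    def take(self, n):
        # Wrap the source lazily; nothing is read until iteration starts.
        return ToyIterable(lambda: islice(iter(self), n))


reads = []  # records which examples were actually pulled from the source


def source():
    for i in range(10):
        reads.append(i)
        yield {"id": i}


stream = ToyIterable(source)
first_three = stream.take(3)
reads_before = len(reads)    # still 0: take() has not touched the source

items = list(first_three)    # iteration is what triggers the reads
print(reads_before, len(reads), items)
```

In this toy model only three examples are ever read, whereas a slice on a regular Dataset returns the materialized values directly.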

Next steps

Interested in learning more about the differences between these two types of dataset? Read the Differences between Dataset and IterableDataset conceptual guide.

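To get more hands-on with these dataset types, check out the Process guide to learn how to preprocess a Dataset, or the Stream guide to learn how to preprocess an IterableDataset.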

Closing remarks

That's my 170th blog post done. So happy!!!!

Today, too, is a day full of hope.