import webdataset as wds
import torchvision
import sys
Creating a WebDataset
Using tar
Since WebDatasets are just regular tar files, you can usually create them by just using the tar
command. All you have to do is to arrange for any files that should be in the same sample to share the same basename. Many datasets already come that way. For those, you can simply create a WebDataset with
$ tar --sort=name -cf dataset.tar dataset/
If your dataset has some other directory layout, you may need a different file name in the archive from the name on disk. You can use the --transform
argument to GNU tar to transform file names. You can also use the -T
argument to read the files from a text file and embed other options in that text file.
The tarp create
Command
The tarp
command is a little utility for manipulating tar
archives. Its create
subcommand makes it particularly simple to construct tar archives from files. The tarp create
command takes a recipe for building
a tar archive that contains lines of the form:
archive-name-1 source-name-1
archive-name-2 source-name-2
...
The source name can either be a file, "text:something", or "pipe:something".
Programmatically in Python
You can also create a WebDataset with library functions in this library:
webdataset.TarWriter
takes dictionaries containing key value pairs and writes them to diskwebdataset.ShardWriter
takes dictionaries containing key value pairs and writes them to disk as a series of shards
Direct Conversion of Any Dataset
Here is a quick way of converting an existing dataset into a WebDataset; this will store all tensors as Python pickles:
dataset = torchvision.datasets.MNIST(root="./temp", download=True)
sink = wds.TarWriter("mnist.tar")
for index, (input, output) in enumerate(dataset):
if index%1000==0:
print(f"{index:6d}", end="\r", flush=True, file=sys.stderr)
sink.write({
"__key__": "sample%06d" % index,
"input.pyd": input,
"output.pyd": output,
})
sink.close()
59000
!ls -l mnist.tar
!tar tvf mnist.tar | head
-rw-rw-r-- 1 tmb tmb 276490240 Oct 31 14:05 mnist.tar
-r--r--r-- bigdata/bigdata 845 2020-10-31 14:05 sample000000.input.pyd
-r--r--r-- bigdata/bigdata 5 2020-10-31 14:05 sample000000.output.pyd
-r--r--r-- bigdata/bigdata 845 2020-10-31 14:05 sample000001.input.pyd
-r--r--r-- bigdata/bigdata 5 2020-10-31 14:05 sample000001.output.pyd
-r--r--r-- bigdata/bigdata 845 2020-10-31 14:05 sample000002.input.pyd
-r--r--r-- bigdata/bigdata 5 2020-10-31 14:05 sample000002.output.pyd
-r--r--r-- bigdata/bigdata 845 2020-10-31 14:05 sample000003.input.pyd
-r--r--r-- bigdata/bigdata 5 2020-10-31 14:05 sample000003.output.pyd
-r--r--r-- bigdata/bigdata 845 2020-10-31 14:05 sample000004.input.pyd
-r--r--r-- bigdata/bigdata 5 2020-10-31 14:05 sample000004.output.pyd
tar: write error
Storing data as Python pickles allows most common Python datatypes to be stored, it is lossless, and the format is fast to decode. However, it is uncompressed and cannot be read by non-Python programs. It's often better to choose other storage formats, e.g., taking advantage of common image compression formats.
Direct Conversion of Any Dataset with Compression
If you know that the input is an image and the output is an integer class, you can also write something like this:
dataset = torchvision.datasets.MNIST(root="./temp", download=True)
sink = wds.TarWriter("mnist.tar")
for index, (input, output) in enumerate(dataset):
if index%1000==0:
print(f"{index:6d}", end="\r", flush=True, file=sys.stderr)
sink.write({
"__key__": "sample%06d" % index,
"ppm": input,
"cls": output,
})
sink.close()
59000
!ls -l mnist.tar
!tar tvf mnist.tar | head
-rw-rw-r-- 1 tmb tmb 276490240 Oct 31 14:05 mnist.tar
-r--r--r-- bigdata/bigdata 1 2020-10-31 14:05 sample000000.cls
-r--r--r-- bigdata/bigdata 797 2020-10-31 14:05 sample000000.ppm
-r--r--r-- bigdata/bigdata 1 2020-10-31 14:05 sample000001.cls
-r--r--r-- bigdata/bigdata 797 2020-10-31 14:05 sample000001.ppm
-r--r--r-- bigdata/bigdata 1 2020-10-31 14:05 sample000002.cls
-r--r--r-- bigdata/bigdata 797 2020-10-31 14:05 sample000002.ppm
-r--r--r-- bigdata/bigdata 1 2020-10-31 14:05 sample000003.cls
-r--r--r-- bigdata/bigdata 797 2020-10-31 14:05 sample000003.ppm
-r--r--r-- bigdata/bigdata 1 2020-10-31 14:05 sample000004.cls
-r--r--r-- bigdata/bigdata 797 2020-10-31 14:05 sample000004.ppm
tar: write error
All we needed to do was to change the key from .input.pyd
to .ppm
; this will trigger using an image compressor (in this case, writing the image in PPM format). You can use different image types depending on what speed, compression, and quality tradeoffs you want to make. If you want to encode data yourself, you can simply convert it to a byte string yourself, store it under the desired key in the sample, and that binary string will get written out.
Using TarWriter
/ShardWriter
with Binary Data (Lossless Writing)
The assert
statements in that loop are not necessary, but they document and illustrate the expectations for this
particular dataset. Generally, the ".jpg" encoder can actually encode a wide variety of array types as images. The
".cls" encoder always requires an integer for encoding.
Here is how you can use TarWriter
for writing a dataset without using an encoder:
sink = wds.TarWriter("dest.tar", encoder=False)
for basename in basenames:
with open(f"{basename}.png", "rb") as stream):
image = stream.read()
cls = lookup_cls(basename)
sample = {
"__key__": basename,
"input.png": image,
"target.cls": cls
}
sink.write(sample)
sink.close()
Since no encoder is used, if you want to be able to read this data with the default decoder, image
must contain a byte string corresponding to a PNG image (as indicated by the ".png" extension on its dictionary key), and cls
must contain an integer encoded in ASCII (as indicated by the ".cls" extension on its dictionary key).