WebDataset API
Fluid Interfaces
The FluidInterface mixin class provides a fluent interface for chaining operations on datasets; most of the operations documented below are defined on it.
with_epoch sets the epoch size (the number of samples per epoch), effectively applying an itertools.islice over the dataset.
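The epoch semantics can be sketched in a few lines of plain Python (a simplified model, not the library's implementation; the toy three-sample dataset below is hypothetical):

```python
from itertools import islice

def with_epoch(dataset, nsamples):
    """Yield exactly nsamples items per epoch, restarting the
    underlying (possibly finite) stream as needed."""
    def infinite():
        while True:
            yield from dataset()
    return islice(infinite(), nsamples)

# A toy "dataset" of three samples, cycled to fill a 5-sample epoch.
epoch = list(with_epoch(lambda: iter([1, 2, 3]), 5))
# epoch == [1, 2, 3, 1, 2]
```

This is why with_epoch is useful with very large or resampled datasets: the epoch boundary becomes independent of the true dataset length.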
webdataset.WebDataset
Bases: DataPipeline, FluidInterface
Create a WebDataset pipeline for efficient data loading.
This class sets up a data pipeline for loading and processing WebDataset-format data. It handles URL generation, shard shuffling, caching, and sample grouping.
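The kind of shard-URL generation involved can be illustrated with a minimal brace-range expander (the library itself uses the braceexpand package; this sketch handles only a single numeric {lo..hi} range and is not the library's code):

```python
import re

def expand_shards(pattern):
    """Expand one {AAAA..BBBB} brace range into a list of shard URLs,
    preserving the zero-padding width of the range endpoints."""
    m = re.search(r"\{(\d+)\.\.(\d+)\}", pattern)
    if not m:
        return [pattern]
    lo, hi = m.group(1), m.group(2)
    width = len(lo)
    return [
        pattern[:m.start()] + str(i).zfill(width) + pattern[m.end():]
        for i in range(int(lo), int(hi) + 1)
    ]

urls = expand_shards("data/shard-{000000..000002}.tar")
# urls == ["data/shard-000000.tar", "data/shard-000001.tar", "data/shard-000002.tar"]
```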
Source code in webdataset/compat.py, lines 332-529.
__enter__()
Enter the runtime context for the WebDataset.
Source code in webdataset/compat.py, lines 515-521.
__exit__(*args)
Exit the runtime context for the WebDataset.
Source code in webdataset/compat.py, lines 523-529.
create_url_iterator(args)
Create an appropriate URL iterator based on the input type.
This method determines the type of URL input and creates the corresponding iterator for the dataset.
Source code in webdataset/compat.py, lines 468-513.
update_cache_info(args)
Update cache information based on arguments and environment variables.
Source code in webdataset/compat.py, lines 452-466.
webdataset.WebLoader
Bases: DataPipeline, FluidInterface
A wrapper for DataLoader that adds a fluid interface.
Source code in webdataset/compat.py, lines 540-544.
webdataset.FluidInterface
Source code in webdataset/compat.py, lines 16-306.
batched(batchsize, collation_fn=filters.default_collation_fn, partial=True)
Create batches of the given size.
This method forwards to the filters.batched function.
Source code in webdataset/compat.py, lines 17-35.
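The grouping behavior can be sketched as follows (a simplified model of filters.batched; the real default collation_fn collates tuples and dicts of tensors rather than returning plain lists):

```python
def batched(stream, batchsize, collation_fn=list, partial=True):
    """Group a stream of samples into batches of at most batchsize;
    with partial=True the final short batch is kept, else dropped."""
    batch = []
    for sample in stream:
        batch.append(sample)
        if len(batch) == batchsize:
            yield collation_fn(batch)
            batch = []
    if batch and partial:
        yield collation_fn(batch)

batches = list(batched(range(5), 2))
# batches == [[0, 1], [2, 3], [4]]
```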
decode(*args, pre=None, post=None, only=None, partial=False, handler=reraise_exception)
Decode data based on the decoding functions given as arguments.
This method creates a decoder using autodecode.Decoder and applies it using filters.map.
Source code in webdataset/compat.py, lines 115-145.
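A much-simplified model of extension-keyed decoding (the real autodecode.Decoder matches dotted extensions, handles image and tensor formats, and supports pre/post hooks; the decoders dict below is a stand-in for illustration):

```python
def decode(sample, decoders):
    """Apply a per-extension decoder function to each field of a
    sample dict; fields without a matching decoder pass through."""
    out = {}
    for key, value in sample.items():
        ext = key.split(".")[-1]
        out[key] = decoders.get(ext, lambda v: v)(value)
    return out

sample = {"__key__": "a/b", "txt": b"hello", "cls": b"7"}
decoded = decode(sample, {"txt": bytes.decode, "cls": int})
# decoded == {"__key__": "a/b", "txt": "hello", "cls": 7}
```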
extract_keys(*args, **kw)
Extract specific keys from samples.
This method forwards to the filters.extract_keys function.
Source code in webdataset/compat.py, lines 256-268.
listed(batchsize, partial=True)
Create lists of samples without collation.
This method forwards to the filters.batched function with collation_fn set to None.
Source code in webdataset/compat.py, lines 47-59.
lmdb_cached(*args, **kw)
Cache samples using LMDB.
This method forwards to the filters.LMDBCached class.
Source code in webdataset/compat.py, lines 294-306.
log_keys(logfile=None)
Log keys of samples passing through the pipeline.
This method forwards to the filters.log_keys function.
Source code in webdataset/compat.py, lines 71-82.
map(f, handler=reraise_exception)
Apply a function to each sample in the stream.
This method forwards to the filters.map function.
Source code in webdataset/compat.py, lines 101-113.
map_dict(handler=reraise_exception, **kw)
Map the entries in a dict sample with individual functions.
This method forwards to the filters.map_dict function.
Source code in webdataset/compat.py, lines 147-159.
map_tuple(*args, handler=reraise_exception)
Map the entries of a tuple with individual functions.
This method forwards to the filters.map_tuple function.
Source code in webdataset/compat.py, lines 189-201.
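The per-position mapping can be sketched like this (a simplified model of filters.map_tuple semantics, assuming one function per tuple element; a None entry leaves that element unchanged):

```python
def map_tuple(stream, *fns):
    """Apply fns[i] to element i of every tuple in the stream."""
    for sample in stream:
        yield tuple(
            f(x) if f is not None else x
            for f, x in zip(fns, sample)
        )

pairs = [("a", 1), ("b", 2)]
mapped = list(map_tuple(pairs, str.upper, lambda x: x * 10))
# mapped == [("A", 10), ("B", 20)]
```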
mcached()
Cache samples in memory.
This method forwards to the filters.Cached class.
Source code in webdataset/compat.py, lines 284-292.
rename(**kw)
Rename samples based on keyword arguments.
This method forwards to the filters.rename function.
Source code in webdataset/compat.py, lines 216-227.
rename_keys(*args, **kw)
Rename keys in samples based on patterns.
This method forwards to the filters.rename_keys function.
Source code in webdataset/compat.py, lines 242-254.
rsample(p=0.5)
Randomly subsample a stream of data.
This method forwards to the filters.rsample function.
Source code in webdataset/compat.py, lines 229-240.
select(predicate, **kw)
Select samples based on a predicate.
This method forwards to the filters.select function.
Source code in webdataset/compat.py, lines 161-173.
shuffle(size, **kw)
Shuffle the data in the stream.
This method forwards to the filters.shuffle function if size > 0.
Source code in webdataset/compat.py, lines 84-99.
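Streaming shuffle with a bounded buffer can be sketched as follows (an approximate model of what filters.shuffle does, not the library code; with a buffer of `size` samples the shuffle is only approximate, which is why shard shuffling is used in addition):

```python
import random

def shuffle(stream, size, rng=random):
    """Approximately shuffle a stream using a buffer of `size` samples:
    once the buffer is full, emit a random element for each new one."""
    buf = []
    for sample in stream:
        buf.append(sample)
        if len(buf) >= size:
            yield buf.pop(rng.randrange(len(buf)))
    rng.shuffle(buf)
    yield from buf

random.seed(0)
out = list(shuffle(range(10), 4))
# `out` is a permutation of 0..9
```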
slice(*args)
Slice the data stream.
This method forwards to the filters.slice function.
Source code in webdataset/compat.py, lines 203-214.
to_tuple(*args, **kw)
Convert dict samples to tuples.
This method forwards to the filters.to_tuple function.
Source code in webdataset/compat.py, lines 175-187.
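The field selection can be sketched per sample (a simplified model of filters.to_tuple; each key may list ';'-separated alternatives such as "jpg;png", as the real function allows):

```python
def to_tuple(sample, *keys):
    """Pick fields out of a dict sample, trying each ';'-separated
    alternative in order and raising KeyError if none matches."""
    def pick(spec):
        for k in spec.split(";"):
            if k in sample:
                return sample[k]
        raise KeyError(spec)
    return tuple(pick(k) for k in keys)

sample = {"__key__": "a/b", "png": b"IMG", "cls": 3}
pair = to_tuple(sample, "jpg;png", "cls")
# pair == (b"IMG", 3)
```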
unbatched()
Turn batched data back into unbatched data.
This method forwards to the filters.unbatched function.
Source code in webdataset/compat.py, lines 37-45.
unlisted()
Turn listed data back into individual samples.
This method forwards to the filters.unlisted function.
Source code in webdataset/compat.py, lines 61-69.
xdecode(*args, **kw)
Decode data based on file extensions.
This method forwards to the filters.xdecode function.
Source code in webdataset/compat.py, lines 270-282.
webdataset.with_epoch
Bases: IterableDataset
Change the actual and nominal length of an IterableDataset.
This will continuously iterate through the original dataset, but impose new epoch boundaries at the given length/nominal. This exists mainly as a workaround for the odd logic in DataLoader. It is also useful for choosing smaller nominal epoch sizes with very large datasets.
Source code in webdataset/extradatasets.py, lines 72-126.
__getstate__()
Return the pickled state of the dataset.
This resets the dataset iterator, since that can't be pickled.
Source code in webdataset/extradatasets.py, lines 91-101.
invoke(dataset)
Return an iterator over the dataset.
This iterator returns as many samples as given by the length parameter.
Source code in webdataset/extradatasets.py, lines 103-126.
Writing WebDatasets
webdataset.ShardWriter
Like TarWriter but splits into multiple shards.
Source code in webdataset/writer.py, lines 502-619.
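How samples are distributed across shards by maxcount can be illustrated with a small helper (hypothetical, for illustration only; the real ShardWriter also rolls over when a shard exceeds maxsize bytes, and formats shard names with %-formatting on the shard index as shown):

```python
def plan_shards(nsamples, maxcount, pattern="shard-%06d.tar"):
    """Return (shard_name, sample_count) pairs for splitting nsamples
    into shards of at most maxcount samples each."""
    shards = []
    for start in range(0, nsamples, maxcount):
        count = min(maxcount, nsamples - start)
        shards.append((pattern % len(shards), count))
    return shards

shards = plan_shards(250, 100)
# shards == [("shard-000000.tar", 100), ("shard-000001.tar", 100), ("shard-000002.tar", 50)]
```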
__enter__()
Enter context.
Source code in webdataset/writer.py, lines 609-615.
__exit__(*args, **kw)
Exit context.
Source code in webdataset/writer.py, lines 617-619.
__init__(pattern, maxcount=100000, maxsize=3000000000.0, post=None, start_shard=0, verbose=1, opener=None, **kw)
Create a ShardWriter.
Source code in webdataset/writer.py, lines 516-553.
close()
Close the stream.
Source code in webdataset/writer.py, lines 601-607.
finish()
Finish all writing (use close instead).
Source code in webdataset/writer.py, lines 592-599.
next_stream()
Close the current stream and move to the next.
Source code in webdataset/writer.py, lines 555-573.
write(obj)
Write a sample.
Source code in webdataset/writer.py, lines 575-590.
webdataset.TarWriter
A class for writing dictionaries to tar files.
If encoder is True, an encoder is used that behaves similarly to the automatic decoder for Dataset. False disables encoding and expects byte strings (except for metadata, which must be strings). The encoder argument can also be a callable, or a dictionary mapping extensions to encoders.
The following code adds two files to the tar archive, a/b.png and a/b.output.png:
tarwriter = TarWriter(stream)
image = imread("b.jpg")
image2 = imread("b.out.jpg")
sample = {"__key__": "a/b", "png": image, "output.png": image2}
tarwriter.write(sample)
Source code in webdataset/writer.py, lines 336-499.
__enter__()
Enter context.
Source code in webdataset/writer.py, lines 416-422.
__exit__(exc_type, exc_val, exc_tb)
Exit context.
Source code in webdataset/writer.py, lines 424-426.
__init__(fileobj, user='bigdata', group='bigdata', mode=292, compress=None, encoder=True, keep_meta=False, mtime=None, format=None)
Create a tar writer.
Source code in webdataset/writer.py, lines 373-414.
close()
Close the tar file.
Source code in webdataset/writer.py, lines 428-433.
write(obj)
Write a dictionary to the tar file.
Source code in webdataset/writer.py, lines 435-480.
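The key-plus-extension naming scheme can be illustrated with the standard tarfile module (a sketch only; TarWriter itself additionally handles value encoding, metadata, compression, and tar header options like user/group/mode):

```python
import io
import tarfile
import time

def write_sample(tar, sample):
    """Write one {__key__, ext: bytes} sample dict as key.ext tar
    members, mirroring TarWriter's naming convention."""
    key = sample["__key__"]
    for ext, data in sample.items():
        if ext == "__key__":
            continue
        info = tarfile.TarInfo(name=f"{key}.{ext}")
        info.size = len(data)
        info.mtime = int(time.time())
        tar.addfile(info, io.BytesIO(data))

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    write_sample(tar, {"__key__": "a/b", "txt": b"hello", "cls": b"7"})

buf.seek(0)
with tarfile.open(fileobj=buf) as tar:
    names = tar.getnames()
# names contains "a/b.txt" and "a/b.cls"
```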
Low Level I/O
webdataset.gopen.gopen(url, mode='rb', bufsize=8192, **kw)
Open the URL using various schemes and protocols.
This function provides a unified interface for opening resources specified by URLs,
supporting multiple schemes and protocols. It uses the gopen_schemes dispatch table
to handle different URL schemes.
Built-in support is provided for the following schemes:
- pipe: for opening named pipes
- file: for local file system access
- http, https: for web resources
- sftp, ftps: for secure file transfer
- scp: for secure copy protocol
When no scheme is specified in the URL, it is treated as a local file path.
Environment Variables:
- GOPEN_VERBOSE: set to a non-zero value to enable verbose logging of file operations (e.g. GOPEN_VERBOSE=1)
- USE_AIS_FOR: specifies which cloud storage services should use AIS (and its cache) for access (e.g. USE_AIS_FOR=aws:gs:s3)
- GOPEN_BUFFER: sets the buffer size for file operations, in bytes (e.g. GOPEN_BUFFER=8192)
Note:
- For stdin/stdout operations, use "-" as the URL.
- The function applies URL rewriting based on the GOPEN_REWRITE environment variable before processing.
Source code in webdataset/gopen.py, lines 487-550.
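The dispatch-table idea can be sketched as follows (the handler names below are placeholders for illustration, not real gopen handlers; note that URLs without a scheme fall back to local-file handling, as described above):

```python
from urllib.parse import urlparse

def choose_handler(url, schemes):
    """Pick a handler from a gopen_schemes-style dispatch table based
    on the URL scheme; schemeless URLs are treated as local files."""
    scheme = urlparse(url).scheme
    return schemes.get(scheme or "file", schemes["__default__"])

schemes = {
    "file": "open-local-file",
    "pipe": "open-subprocess-pipe",
    "https": "open-via-curl",
    "__default__": "error",
}
handler = choose_handler("https://example.com/shard.tar", schemes)
# handler == "open-via-curl"
```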
Error Handling
webdataset.ignore_and_continue(exn)
Ignore the exception and continue processing.
Source code in webdataset/handlers.py, lines 34-43.
webdataset.ignore_and_stop(exn)
Ignore the exception and stop further processing.
Source code in webdataset/handlers.py, lines 60-69.
webdataset.reraise_exception(exn)
Re-raise the given exception.
Source code in webdataset/handlers.py, lines 22-31.
webdataset.warn_and_continue(exn)
Issue a warning for the exception and continue processing.
Source code in webdataset/handlers.py, lines 46-57.
webdataset.warn_and_stop(exn)
Issue a warning for the exception and stop further processing.
Source code in webdataset/handlers.py, lines 72-83.
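The handler protocol shared by these functions can be sketched as follows: a handler either re-raises, or returns True to continue or False to stop the stream. This is a simplified model; safe_map below is illustrative, not a library function:

```python
import warnings

def reraise_exception(exn):
    """Propagate the error (the default handler)."""
    raise exn

def ignore_and_continue(exn):
    """Skip the bad sample and keep going."""
    return True

def warn_and_continue(exn):
    """Like ignore_and_continue, but emit a warning first."""
    warnings.warn(repr(exn))
    return True

def ignore_and_stop(exn):
    """Silently end the stream."""
    return False

def safe_map(stream, f, handler=reraise_exception):
    """Apply f to each sample, consulting the handler on errors."""
    for sample in stream:
        try:
            yield f(sample)
        except Exception as exn:
            if not handler(exn):
                break

# 10 // 0 raises, and ignore_and_stop ends the stream there.
out = list(safe_map([1, 0, 5], lambda x: 10 // x, handler=ignore_and_stop))
# out == [10]
```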