WIDS API
wids.ShardListDataset
Bases: Dataset[T]
An indexable dataset based on a list of shards.
The dataset is either given as a list of shards with optional options and name, or as a URL pointing to a JSON descriptor file.
Datasets can reference other datasets via source_url.
Shard references within a dataset are resolved relative to an explicitly given base property, or relative to the URL from which the dataset descriptor was loaded.
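The resolution rule above can be illustrated with a small standalone sketch. This is not the wids implementation: the helper name `resolve_shard_url` and the example URLs are hypothetical, and it relies on `urllib.parse.urljoin`, which only resolves relative references for schemes it knows (such as http and https).

```python
from urllib.parse import urljoin


def resolve_shard_url(shard_url, base=None, descriptor_url=None):
    """Resolve a shard reference (hypothetical helper, for illustration only).

    An explicitly given base takes precedence; otherwise the URL the
    descriptor was loaded from is used. Absolute shard URLs pass through.
    """
    anchor = base if base is not None else descriptor_url
    if anchor is None:
        return shard_url
    return urljoin(anchor, shard_url)


# Relative to the descriptor's own location:
print(resolve_shard_url(
    "shard-000000.tar",
    descriptor_url="https://example.com/data/train.json",
))

# Relative to an explicit base, which overrides the descriptor location:
print(resolve_shard_url(
    "shard-000000.tar",
    base="https://mirror.example.org/shards/",
    descriptor_url="https://example.com/data/train.json",
))
```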
Source code in wids/wids.py
__getitem__(index)
Return the sample corresponding to the given index.
__init__(shards, *, cache_size=int(1000000000000.0), cache_dir=None, lru_size=10, dataset_name=None, localname=None, transformations='PIL', keep=False, base=None, options=None)
Create a ShardListDataset.
Note that there are two caches: an on-disk directory, and an in-memory LRU cache.
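As a rough illustration of the in-memory side, the sketch below shows a generic LRU cache that also counts accesses and misses. It is a simplified stand-in, not the wids code: the class name `TinyLRU` and the `loader` callback (which plays the role of the slower on-disk lookup) are assumptions made for this example.

```python
from collections import OrderedDict


class TinyLRU:
    """Minimal LRU cache with access/miss counters (illustrative only)."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader          # called on a miss, e.g. to open a shard
        self.items = OrderedDict()    # least recently used entry comes first
        self.accesses = 0
        self.misses = 0

    def get(self, key):
        self.accesses += 1
        if key in self.items:
            self.items.move_to_end(key)      # mark as most recently used
            return self.items[key]
        self.misses += 1
        value = self.loader(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict least recently used
        return value


cache = TinyLRU(capacity=2, loader=lambda name: name.upper())
for name in ["a", "b", "a", "c", "b"]:
    cache.get(name)
print(cache.accesses, cache.misses)
```

A high ratio of misses to accesses in a cache like this usually means the access pattern jumps between more shards than the cache can hold.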
__len__()
Return the total number of samples in the dataset.
add_transform(transform)
Add a transformation to the dataset.
check_cache_misses()
Check if the cache miss rate is too high.
close()
Close the dataset.
get_shard(index)
Get the shard and index within the shard corresponding to the given index.
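One common way to perform this kind of lookup is a binary search over cumulative shard lengths. The sketch below assumes that approach; the function name `locate` and the example shard sizes are hypothetical and are not taken from wids.

```python
import bisect
from itertools import accumulate


def locate(index, shard_lengths):
    """Map a global sample index to (shard number, index within that shard)."""
    cum = list(accumulate(shard_lengths))    # e.g. [3, 5, 2] -> [3, 8, 10]
    total = cum[-1] if cum else 0
    if not 0 <= index < total:
        raise IndexError(f"index {index} out of range for {total} samples")
    shard_idx = bisect.bisect_right(cum, index)
    shard_start = cum[shard_idx - 1] if shard_idx > 0 else 0
    return shard_idx, index - shard_start


# Three shards holding 3, 5, and 2 samples respectively.
for i in (0, 2, 3, 7, 8, 9):
    print(i, locate(i, [3, 5, 2]))
```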
get_stats()
Return the number of cache accesses and misses.
wids.ChunkedSampler
Bases: Sampler
A sampler that samples in chunks and then shuffles the samples within each chunk.
This preserves locality of reference while still shuffling the data.
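The idea can be sketched as a plain generator. This is an illustration under stated assumptions, not the ChunkedSampler code: the function name `chunked_indices` and its parameters are hypothetical, and shuffling the order of the chunks themselves is an assumption added here.

```python
import random


def chunked_indices(num_samples, chunksize, seed=0, shuffle_chunks=True):
    """Yield every index once, shuffled only within fixed-size chunks."""
    rng = random.Random(seed)
    starts = list(range(0, num_samples, chunksize))
    if shuffle_chunks:
        rng.shuffle(starts)               # assumption: chunk order is shuffled too
    for start in starts:
        chunk = list(range(start, min(start + chunksize, num_samples)))
        rng.shuffle(chunk)                # shuffle inside the chunk only
        yield from chunk


order = list(chunked_indices(10, chunksize=4, seed=1))
print(order)
```

Because neighbouring indices stay together, consecutive reads tend to hit the same shard, which keeps shard caches effective.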
wids.DistributedChunkedSampler(dataset, *, num_replicas=None, num_samples=None, rank=None, shuffle=True, shufflefirst=False, seed=0, drop_last=None, chunksize=1000000)
Return a ChunkedSampler for the current worker in distributed training.
Reverts to a simple ChunkedSampler if not running in distributed mode.
Since the split among workers takes place before the chunk shuffle, workers end up with a fixed set of shards they need to download. The more workers, the fewer shards are used by each worker.
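The effect of splitting before shuffling can be seen in a small sketch. It assumes each worker receives one contiguous block of indices and that all shards hold the same number of samples; both helper names and the sizes used are hypothetical, not the actual wids logic.

```python
def worker_range(num_samples, num_replicas, rank):
    """Return the contiguous [start, end) index range assumed for one worker."""
    per_worker = (num_samples + num_replicas - 1) // num_replicas  # ceiling division
    start = min(rank * per_worker, num_samples)
    end = min(start + per_worker, num_samples)
    return start, end


def shards_touched(start, end, samples_per_shard):
    """List the shard numbers covering [start, end), assuming equal-sized shards."""
    if end <= start:
        return []
    return list(range(start // samples_per_shard, (end - 1) // samples_per_shard + 1))


# 100 samples stored in shards of 10 samples each.
for replicas in (2, 5):
    start, end = worker_range(100, replicas, rank=0)
    print(replicas, (start, end), shards_touched(start, end, 10))
```

With two workers, rank 0 covers five shards in this setup; with five workers, it covers only two.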