Hugging Face Introduces Dataset Streaming for Machine Learning

Blazing-Fast Model Training Starts Now


If you’ve ever been frustrated by long waits to download massive datasets for model training, you’re not alone. Hugging Face has introduced a groundbreaking way to stream multi-terabyte datasets directly into your training loop, making slow downloads and storage headaches a thing of the past. With just a single line of code, you can now start training instantly, regardless of your hardware or storage limitations.

Seamless Integration with Existing Workflows

One of the standout features of this enhancement is its backward compatibility. Developers using the datasets library can simply add streaming=True to their load_dataset calls. There is no need to learn new APIs or rework existing code; you get faster, more reliable data access immediately. This approach lets thousands of AI developers adopt the improvement without disrupting their workflows.
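
For instance, here is a minimal sketch of that one-line change (the dataset id is purely illustrative):

    from datasets import load_dataset

    # streaming=True swaps the full download for an iterable that fetches
    # samples over the network as they are consumed.
    ds = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

    for example in ds.take(3):
        print(example)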

Overcoming Scalability Bottlenecks

Previously, streaming was mostly used for quick dataset previews, not for large-scale model training. Developers often had to download datasets to local or cloud storage, leading to delays and rate limit errors. During the nanoVLM project, it became clear that redundant requests from DataLoader workers were overwhelming the Hugging Face Hub, resulting in IP blocks and instability.

Hugging Face tackled these issues by reducing startup requests by 100x, accelerating file resolution by 10x, and doubling streaming speed. Now, even with 256 concurrent workers, streaming remains stable and twice as fast, letting teams scale up without fear of crashes.
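
As a rough sketch of what that scaling looks like in practice (the dataset id, worker count, and batch size are illustrative, not the nanoVLM configuration):

    from datasets import load_dataset
    from torch.utils.data import DataLoader

    # Each DataLoader worker streams its own shard; the cached file list keeps
    # startup requests low even as the worker count grows.
    ds = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
    loader = DataLoader(ds, batch_size=32, num_workers=8)

    for batch in loader:
        ...  # training step goes here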

The Technology Behind the Speed

  • Persistent Data Files Cache: The initial worker fetches the file list from the Hub while others use a cached version. This slashes startup latency and eliminates request surges.

  • Optimized Resolution Logic: API calls at startup are minimized and bundled, making initialization noticeably quicker.

  • Prefetching for Parquet Files: As one data chunk is processed, the next is fetched in the background, ensuring GPUs are never left idle.

  • Customizable Buffering: Advanced users can adjust buffer sizes and prefetch settings to match their hardware and networking environment, boosting throughput and minimizing bottlenecks.

By increasing request block sizes and fine-tuning prefetch limits, users can push the streaming pipeline to deliver even faster results.
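
One way this kind of tuning can be expressed is through pyarrow's Parquet caching options. Treat the snippet below as a sketch: it assumes recent datasets releases forward fragment_scan_options when streaming Parquet, so check your installed version before relying on the exact parameter name.

    import pyarrow
    import pyarrow.dataset
    from datasets import load_dataset

    # Assumption: datasets forwards these pyarrow scan options when streaming Parquet.
    scan_options = pyarrow.dataset.ParquetFragmentScanOptions(
        cache_options=pyarrow.CacheOptions(
            prefetch_limit=2,            # how many ranges to keep in flight ahead of the reader
            range_size_limit=64 << 20,   # coalesce reads into ~64 MB requests
        )
    )

    ds = load_dataset(
        "HuggingFaceFW/fineweb",         # illustrative Parquet dataset id
        split="train",
        streaming=True,
        fragment_scan_options=scan_options,
    )

Larger range sizes mean fewer, bigger requests, which tends to help on high-bandwidth links, while a higher prefetch limit keeps more chunks in flight at the cost of memory.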

Why Hugging Face Outperforms S3

Hugging Face’s use of Xet, a deduplication-based storage backend, means only unique data is transferred during uploads and downloads, making transfers much faster than with traditional object storage like S3. For Parquet datasets, Content Defined Chunking (CDC) delivers even greater efficiency, and these gains are accessible through tools such as pyspark_huggingface.
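
As a hedged sketch of how that access might look with pyspark_huggingface (the repo ids are placeholders, and read/write options may vary by version):

    from pyspark.sql import SparkSession
    import pyspark_huggingface  # registers the "huggingface" data source with Spark

    spark = SparkSession.builder.appName("hf-datasets").getOrCreate()

    # Read a Hub dataset straight into a Spark DataFrame.
    df = spark.read.format("huggingface").load("HuggingFaceFW/fineweb")

    # Write results back to the Hub; Xet deduplication means only unique chunks are transferred.
    df.write.format("huggingface").mode("overwrite").save("my-username/my-dataset")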

Flexible Pipelines for Any Data Format

Not every dataset fits a standard mold. Hugging Face supplies the tools needed to build custom streaming pipelines for specialized cases, like video sampling or TAR archive streaming. The HfFileSystem in the huggingface_hub library enables efficient random access and file listing via local caching, making advanced workflows straightforward.
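
A short sketch using HfFileSystem (the repo id is illustrative):

    from huggingface_hub import HfFileSystem

    fs = HfFileSystem()

    # List the files in a dataset repo; listings are cached locally after the first call.
    files = fs.ls("datasets/HuggingFaceFW/fineweb", detail=False)
    print(files[:5])

    # Open one file with random access, e.g. to sample from an archive
    # without downloading it in full.
    with fs.open(files[0], "rb") as f:
        f.seek(0)
        header = f.read(512)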

Real-World Impact: Streaming at SSD Speeds

These improvements are already powering next-generation AI projects. Teams training models like nanoVLM now see streaming performance rivaling local SSD reads, removing the need for extensive data downloads and significantly reducing time-to-train. On many clusters, streaming is now faster than even high-end disk setups.

Getting Started Is Easy

All of these upgrades are available today in the latest datasets and huggingface_hub releases; updating takes a single command: pip install --upgrade datasets huggingface_hub. To make things even simpler, Hugging Face offers preprocessed, pre-shuffled datasets such as FineVisionMax, eliminating the need for manual merging or shuffling.
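
For example, a hypothetical snippet (check the Hub for the exact FineVisionMax repo id):

    from datasets import load_dataset

    # Already merged and shuffled upstream, so it can feed a training loop directly.
    finevision = load_dataset("HuggingFaceM4/FineVisionMax", split="train", streaming=True)

    for sample in finevision.take(1):
        print(sample.keys())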

Final Thoughts: A New Era for AI Training

Hugging Face’s advances in dataset streaming set a new benchmark for AI development. Whether you work with standard formats or build custom pipelines, you’ll enjoy faster starts, smoother scaling, and higher performance—no matter your dataset size. The future of large-scale model training is streaming, not downloading.

Source: Hugging Face Blog


Joshua Berkowitz November 7, 2025