Kicking off a massive machine learning project used to mean a race against the clock and your hard drive: hours of downloads and frustrating bottlenecks were just part of the job. Hugging Face’s latest updates to its streaming datasets are designed to make that pain a thing of the past. By letting users stream enormous datasets instantly, they turn high-speed data access into a simple, click-and-go reality.
Why Streaming Matters: Key Improvements
With the simple addition of the `streaming=True` flag in the `load_dataset` API, users can tap into the Hugging Face Hub’s resources without changing their existing codebase. The upgrade is fully backwards compatible, so current workflows benefit automatically from:
- Lightning-fast startup: Smarter request handling and persistent caching speed up data file resolution by up to 10x.
- Radical efficiency: Startup requests are reduced by up to 100x, minimizing the risk of IP bans and network congestion.
- Double the throughput: Streaming can deliver up to 2x more samples per second, even in large-scale parallel training jobs.
- Zero worker crashes: The robust design stays stable across distributed and multi-worker environments.
Under the Hood: Smarter Caching and Prefetching
Hugging Face tackled two main pain points: the slow discovery of data files and inefficient streaming throughput. The new persistent cache system shares the resolved file list across all DataLoader workers, meaning only the first worker needs to contact the Hub while the rest draw from the local cache. This drastically reduces latency and network load for large teams or clusters.
During streaming, Parquet prefetching and configurable buffering ensure that GPUs stay busy and data pipelines remain full. The library fetches upcoming data chunks in the background, and advanced users can tweak buffer sizes and prefetch settings to match their hardware or workload needs.
- Parquet dataset prefetching keeps training pipelines efficient.
- Customizable buffer sizes allow tailored data flow.
Next-Level Storage: Powered by Xet
Backing these advances is Xet, a deduplication-based storage technology. By transferring only unique data chunks, Xet paired with Parquet Content Defined Chunking (CDC) makes uploads and streaming dramatically faster than traditional cloud storage platforms. Data engineers can also leverage pyspark_huggingface to accelerate Spark-based ETL workflows, maximizing efficiency in big data environments.
Custom Streaming for Advanced Needs
For specialized scenarios such as unique file types or complex video data, the upgraded HfFileSystem allows for easy construction of custom streaming pipelines. This flexibility has powered projects like LeRobot and WebDataset, enabling efficient random data access straight from remote repositories.
- Persistent cache reuse eliminates redundant file listing requests.
- Seamless integration with libraries like PyTorch’s DataLoader.
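A custom pipeline typically starts from `HfFileSystem`, the fsspec-compatible filesystem over Hub repositories. A minimal sketch; the repo id in the comments is illustrative, and the network-bound calls are shown but left commented so the snippet runs offline:

```python
from huggingface_hub import HfFileSystem

# HfFileSystem speaks the fsspec interface, so anything built on fsspec
# (pandas, pyarrow, WebDataset, ...) can read Hub files directly, with
# random access and no full download.
fs = HfFileSystem()

# Paths follow the "datasets/<repo_id>/<path-in-repo>" convention.
# These calls hit the network, so they are commented out here:
# files = fs.ls("datasets/HuggingFaceFW/fineweb", detail=False)
# with fs.open("datasets/HuggingFaceFW/fineweb/README.md", "r") as f:
#     print(f.read(200))

print(type(fs).__name__)
```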
Real-World Results: Training at Local SSD Speeds
In real projects like nanoVLM, streaming from Hugging Face now matches the speed of local SSD reads, with the bonus of instant dataset access. No more waiting hours to copy terabytes of data: developers can launch and scale massive training jobs right away, even on multi-node clusters.
Getting Started Is Simple
To unlock these benefits, users just need to update their `datasets` and `huggingface_hub` libraries. With these in place, streaming advanced datasets like FineVision is as easy as a single line of code, making it accessible for vision-language models and many other use cases.
Ultimately, Hugging Face’s revamped streaming datasets infrastructure shifts the focus from data wrangling to model innovation. With faster, more reliable, and highly flexible data access, machine learning teams can accelerate development and unlock the full potential of large-scale datasets.
Ready to Harness This Kind of Power?
Innovations like Hugging Face's streaming datasets are game-changers, showing how complex technology can solve massive efficiency problems. It’s this exact principle, using cutting-edge tech to streamline operations, that I've dedicated my career to. Seeing these tools evolve is exciting, but what's even more powerful is applying that mindset to solve your specific business challenges.
With over 20 years of experience, I partner with businesses to do just that. Whether it's building intelligent automation workflows to free your team from manual tasks or developing custom software that turns your data into a strategic asset, my focus is on delivering practical, high-value solutions. If you're wondering how AI and automation can transform your operations, I’d love to connect.
If you're ready to stop drowning in manual work, let's book a consultation and discuss your strategy.
Source: Hugging Face Blog
Hugging Face Introduces Large-Scale ML with Streaming Datasets