How Dicer Is Revolutionizing Auto-Sharding for Distributed Systems

Scaling Distributed Systems: The Persistent Challenge

Engineering teams often struggle to scale distributed systems efficiently while maintaining performance and reliability. Dicer, Databricks’ open-source dynamic auto-sharder, aims to close this gap. It automates sharding decisions so that organizations can build scalable, highly available services without the trade-offs of older approaches.

Moving Beyond Stateless and Static Sharding

Historically, many architectures depended on stateless models or static sharding. Stateless designs, which push all state to external databases or caches, suffer from persistent network latency, increased CPU usage, and inefficient data retrieval. Static sharding, which keeps state in memory, reduces latency but introduces fragility:

  • Downtime during scaling or restarts, caused by poor coordination.
  • Split-brain failures when pods hold inconsistent state after disruptions.
  • Hot key bottlenecks from lack of dynamic load balancing, leading to uneven workloads.

As Databricks scaled, these issues made static sharding unsustainable, while reverting to stateless architectures would have increased latency and costs. Dicer was developed to solve these dilemmas by adding intelligence and resilience to sharded services.
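
To make the static-sharding problems above concrete, here is a minimal sketch in Python. It assumes a hash-mod-N placement rule, which is a common static scheme but is not confirmed by the source as what Databricks used; `static_shard` and `moved_fraction` are hypothetical helper names introduced only for this illustration.

```python
import hashlib
from collections import Counter


def static_shard(key: str, num_pods: int) -> int:
    """Assumed static rule: stable hash of the key, modulo the pod count."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_pods


def moved_fraction(keys, old_pods: int, new_pods: int) -> float:
    """Fraction of keys whose owning pod changes when the pod count changes."""
    moved = sum(
        1 for k in keys if static_shard(k, old_pods) != static_shard(k, new_pods)
    )
    return moved / len(keys)


# Scaling: adding a fifth pod remaps most keys, so their in-memory state
# is either lost or has to be moved.
keys = [f"user-{i}" for i in range(10_000)]
fraction = moved_fraction(keys, old_pods=4, new_pods=5)
print(f"{fraction:.0%} of keys change pods when scaling from 4 to 5 pods")

# Hot key: 90% of requests hit one key, and a static rule pins all of
# them to a single pod with no way to rebalance.
requests = ["hot-key"] * 900 + [f"user-{i}" for i in range(100)]
load = Counter(static_shard(k, 4) for k in requests)
print("requests per pod:", dict(load))
```

With a uniform hash, a key keeps its pod only when its hash gives the same remainder modulo 4 and modulo 5, which holds for roughly one key in five. The other four in five move, and the pod that owns the hot key absorbs at least 900 of the 1,000 requests.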

The Dicer Approach: Dynamic, Intelligent Sharding

Dicer introduces a centralized control plane that continuously updates shard assignments based on real-time signals such as health, load, and environmental changes. Its key capabilities include:

  • Splitting, merging, and reassigning key ranges (Slices) for optimal availability and load balancing.

  • Detecting and isolating hot keys, distributing them to prevent overload.

  • Coordinating updates using server-side (Slicelet) and client-side (Clerk) libraries, ensuring local cache freshness and low-latency lookups.

  • Maintaining high availability and rapid recovery with eventually consistent shard assignments.

This enables Databricks services to remain robust during restarts, failures, autoscaling, and workload spikes. Dicer also supports multi-tenant environments, serving diverse applications within a region.
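
The capabilities described in this section can be sketched as a small assignment table over sorted key ranges. This is an illustrative toy model, not Dicer's actual API: the `Assignment` class, its method names, and the pod names are all hypothetical, and merging slices is omitted for brevity. The `version` counter stands in for the idea that clients may briefly hold an older copy of an eventually consistent assignment.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List


@dataclass
class Assignment:
    """Hypothetical slice table: slice i covers keys in [starts[i], starts[i+1])."""

    starts: List[str]  # sorted lower bounds; starts[0] must be "" (the minimum key)
    owners: List[str]  # owners[i] is the pod serving slice i
    version: int = 0   # bumped on every change so stale copies can be detected

    def lookup(self, key: str) -> str:
        """Local, in-memory routing lookup (what a client-side cache would do)."""
        return self.owners[bisect_right(self.starts, key) - 1]

    def _cut(self, at: str) -> int:
        """Ensure a boundary exists at `at`; return the index of the slice starting there."""
        i = bisect_right(self.starts, at)
        if self.starts[i - 1] == at:
            return i - 1
        self.starts.insert(i, at)
        self.owners.insert(i, self.owners[i - 1])  # new slice inherits the owner
        return i

    def split(self, at: str, new_owner: str) -> None:
        """Split a slice at `at` and hand the upper part to new_owner."""
        self.owners[self._cut(at)] = new_owner
        self.version += 1

    def isolate(self, key: str, owner: str) -> None:
        """Give one hot key its own single-key slice on a dedicated owner."""
        self._cut(key + "\x00")  # smallest string greater than key
        self.owners[self._cut(key)] = owner
        self.version += 1

    def reassign(self, failed: str, replacement: str) -> None:
        """Move every slice owned by an unhealthy pod to a replacement."""
        self.owners = [replacement if o == failed else o for o in self.owners]
        self.version += 1


table = Assignment(starts=["", "m"], owners=["pod-a", "pod-b"])
table.split("t", "pod-c")          # load balancing: carve off keys >= "t"
table.isolate("hot", "pod-d")      # hot key gets a slice of its own
table.reassign("pod-a", "pod-b")   # pod-a failed its health check
print(table.lookup("hot"), table.lookup("hotel"), table.lookup("zebra"))
```

Keeping the table as sorted range boundaries means a lookup is a binary search over a local copy, and a split or reassignment touches only the affected ranges rather than reshuffling every key.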

Wide-Ranging Use Cases

Dicer’s versatility is evident across a range of demanding scenarios, including:

  • In-memory and GPU serving: Delivers sub-millisecond latency for key-value stores and AI inference.

  • Cluster management and query orchestration: Maintains consistent state for resource management and scheduling.

  • Remote caching: Enables distributed caches that autoscale and handle hot keys without imbalance.

  • Work partitioning: Efficiently assigns background tasks, reducing resource contention.

  • Batch aggregation: Groups related writes for in-memory batching, improving throughput and reducing IOPS.

  • Soft leader selection: Facilitates affinity-based, lightweight leader election within pods.

  • Real-time rendezvous: Directs clients to the same pod for fast state synchronization in coordination-heavy apps like chat rooms.

Proven Impact at Databricks

Key Databricks services have already benefited from Dicer:

  • Unity Catalog: Shifted to a Dicer-powered in-memory cache, reducing database load and achieving 90–95% cache hit rates.

  • SQL Query Orchestration Engine: Moved from static sharding to Dicer, eliminating downtime and smoothing CPU performance during scaling.

  • Softstore Remote Cache: Used Dicer's state transfer to sustain high cache hit rates (~85%) during rolling restarts, avoiding typical cache drop-offs.

Open Source and Community Empowerment

Dicer is now open source, allowing engineers everywhere to build robust and efficient sharded services. Comprehensive documentation and demos help new users integrate Dicer quickly. Planned additions include Java and Rust libraries and enhanced state transfer, and more technical deep dives are expected to follow.

Takeaway

With Dicer, Databricks is giving the tech community a tool for the persistent hurdles of scaling distributed systems. Its dynamic auto-sharding aims to deliver high performance, strong reliability, and cost efficiency together, rather than trading one for another.

Source: Databricks Blog


By Joshua Berkowitz, January 15, 2026