
simdjson: JSON Parsing at the Speed of Silicon

How SIMD and an On-Demand API Reshape JSON Performance
Daniel Lemire, Geoff Langdale, John Keiser

JSON parsing is everywhere, but it's rarely fast enough at massive scales. Web servers process millions of API requests daily. Analytics pipelines transform terabytes of log data. Trading systems parse market feeds in microseconds. 

Traditional JSON parsers process data one byte at a time, creating a performance bottleneck that scales linearly with document size. The open source project simdjson breaks this limitation by leveraging the full width of modern CPUs. Using SIMD (single-instruction multiple-data) instructions and a two-stage parsing architecture, it processes JSON at gigabytes per second, often 4x faster than popular alternatives. In this analysis, I explore the repository structure, the On-Demand API design, and the architectural decisions that turn CPU vector units into JSON accelerators.

Clever Solutions to Standard Problems

JSON is ubiquitous and deceptively expensive. Traditional parsers advance one byte at a time, branching frequently and thrashing caches. The result is predictable: lots of CPU, little throughput. simdjson's solution is to process many bytes per instruction, collapse control-flow into data-parallel operations, and separate parsing into two predictable passes (find structural characters and validate; then materialize only what you need). 

The kicker is the On-Demand API: instead of building full trees eagerly, you iterate the JSON text like a cursor and only pay for what you touch. That is why simdjson can be both fast and strict about correctness.
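
To make that concrete, here is a minimal sketch in the exception-based style shown in doc/basics.md; the inline array and its "x" field are invented for illustration.

#include <iostream>
#include "simdjson.h"
using namespace simdjson;

int main() {
  ondemand::parser parser;
  // A small inline document; the _padded literal adds the padding bytes the parser requires.
  auto json = R"( [ {"x": 1}, {"x": 2}, {"x": 3} ] )"_padded;
  ondemand::document doc = parser.iterate(json);
  int64_t sum = 0;
  ondemand::array arr = doc.get_array();
  for (ondemand::object obj : arr) {   // forward-only cursor over the array
    int64_t x = obj["x"];              // only the touched field is materialized
    sum += x;
  }
  std::cout << sum << std::endl;
  return 0;
}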

Key Features at a Glance

  • On-Demand parsing that treats a document as a forward iterator; validate as you traverse (doc/basics.md).

  • Runtime implementation selection across AVX-512, AVX2, SSE4.2, ARM NEON, POWER VSX, and LoongArch, plus a portable fallback (doc/implementation-selection.md).

  • Multithreaded streaming for NDJSON/JSONL via iterate_many/parse_many, sustaining multi-GB/s on commodity hardware (doc/parse_many.md).

  • Single-header distribution for painless adoption: just include singleheader/simdjson.h and compile the companion .cpp.

  • Strict conformance: full RFC 8259 JSON and UTF-8 validation with precise number parsing, plus JSON Pointer/JSONPath helpers (basics: JSON Pointer); see the short sketch after this list.

  • Builder API for high-performance JSON output; C++20/26 integrations for custom types and static reflection (doc/builder.md).
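
As a quick illustration of the JSON Pointer helper, the following sketch resolves a nested field in a single call; twitter.json is the sample document bundled with the repository, and the same lookup appears in spelled-out form in the practical example below.

#include <iostream>
#include "simdjson.h"
using namespace simdjson;

int main() {
  ondemand::parser parser;
  padded_string json = padded_string::load("twitter.json");
  ondemand::document doc = parser.iterate(json);
  // at_pointer resolves an RFC 6901 JSON Pointer against the cursor.
  uint64_t count = doc.at_pointer("/search_metadata/count");
  std::cout << count << std::endl;
  return 0;
}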

Why I Like It

Three things stand out. First, the ergonomics of the On-Demand API: documents are iterators, so you write tight, linear code that maps well to the underlying machine. 

Second, the runtime CPU dispatch: without any flags, the library picks a tailored implementation for your CPU at startup and falls back cleanly when needed (doc/implementation-selection.md). 

Third, the documentation and research: design choices are justified in peer-reviewed work and meticulous docs, not just microbenchmarks (Keiser & Lemire, 2024; Langdale & Lemire, 2019; Keiser & Lemire, 2021).

Under the Hood

The repository structure reflects careful engineering. Public headers live under include/simdjson/, with a convenient single-header amalgamation in singleheader/ for easy deployment. 

The documentation is comprehensive: basics.md covers parsing idioms, error handling strategies, JSON Pointer navigation, and dynamic number type handling; performance.md details memory management, huge page optimizations, and number parsing costs; parse_many.md explains the streaming architecture with its two-stage pipeline and background thread coordination.

The core innovation lies in the two-stage parsing approach. Stage 1 performs structural discovery: it scans the input using wide SIMD operations to identify all JSON tokens (brackets, braces, quotes, commas, colons) while simultaneously validating UTF-8 encoding. This produces a compressed index—essentially a map of where every structural character appears. 
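
As a rough mental model only (not simdjson's actual SIMD code), a scalar sketch of that structural scan might look like this:

#include <cstdint>
#include <string_view>
#include <vector>

// Deliberately simplified: record the position of every structural character
// outside of strings. simdjson's real Stage 1 builds the same kind of index
// with wide SIMD comparisons and bitmask arithmetic, while also validating
// UTF-8 and handling escape sequences exactly.
std::vector<uint32_t> structural_index(std::string_view json) {
  std::vector<uint32_t> index;
  bool in_string = false;
  for (uint32_t i = 0; i < json.size(); i++) {
    char c = json[i];
    if (c == '"' && (i == 0 || json[i - 1] != '\\')) {  // naive escape handling
      in_string = !in_string;
      index.push_back(i);  // quotes delimit strings, so they count as structural
    } else if (!in_string && (c == '{' || c == '}' || c == '[' || c == ']' ||
                              c == ':' || c == ',')) {
      index.push_back(i);
    }
  }
  return index;
}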

Stage 2 operates on this index, walking through the JSON skeleton with a forward-only cursor. The On-Demand API rides on this cursor architecture: when you access doc["trade"]["price"], the parser validates only the necessary path and materializes just that value, not the entire document tree.
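
The On-Demand API also supports an exception-free style in which errors are returned as codes and propagate through chained lookups; a minimal sketch, with trades.json as a hypothetical input file:

#include <iostream>
#include "simdjson.h"
using namespace simdjson;

int main() {
  ondemand::parser parser;
  padded_string json;
  if (auto error = padded_string::load("trades.json").get(json)) {
    std::cerr << error << std::endl;
    return 1;
  }
  ondemand::document doc;
  if (auto error = parser.iterate(json).get(doc)) {
    std::cerr << error << std::endl;
    return 1;
  }
  double price;
  // Errors propagate through the chained lookups; nothing throws.
  if (auto error = doc["trade"]["price"].get(price)) {
    std::cerr << error << std::endl;
    return 1;
  }
  std::cout << price << std::endl;
  return 0;
}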

The runtime dispatch system deserves special attention. simdjson compiles multiple implementations targeting different CPU instruction sets: AVX-512, AVX2, SSE4.2, ARM NEON, POWER VSX, and LoongArch, plus a portable fallback. At startup, it automatically detects your hardware capabilities and selects the optimal implementation. 

For development and benchmarking, you can introspect this choice: get_available_implementations() lists options, supported_by_runtime_system() checks compatibility, and you can manually override the selection when measuring specific performance characteristics (implementation-selection.md).
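
A short sketch of that introspection, using the calls documented in doc/implementation-selection.md:

#include <iostream>
#include "simdjson.h"

int main() {
  // Enumerate the implementations compiled into this build of the library.
  for (auto implementation : simdjson::get_available_implementations()) {
    std::cout << implementation->name() << ": " << implementation->description()
              << (implementation->supported_by_runtime_system() ? "" : " (unsupported here)")
              << std::endl;
  }
  // Report which implementation the runtime dispatcher picked.
  std::cout << "active: " << simdjson::get_active_implementation()->name() << std::endl;
  // Pin a specific implementation, e.g. the portable fallback, for benchmarking.
  if (auto fallback = simdjson::get_available_implementations()["fallback"]) {
    simdjson::get_active_implementation() = fallback;
  }
  return 0;
}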

A Practical Example

Here's a minimal On-Demand flow that demonstrates the cursor-based approach. This example loads a Twitter API response and extracts search metadata, exactly the kind of operation you'd see in social media analytics or API response processing.

#include <iostream>
#include "simdjson.h"
using namespace simdjson;

int main() {
  ondemand::parser parser;
  // Load the document; padded_string guarantees the padding the parser needs.
  padded_string json = padded_string::load("twitter.json");
  ondemand::document doc = parser.iterate(json);
  // Only the accessed path is validated and materialized.
  uint64_t count = doc["search_metadata"]["count"];
  std::cout << count << " results." << std::endl;
  return 0;
}

Notice what doesn't happen here: no memory allocations for unused parts of the document, no tree construction beyond the accessed path, no recursive descent through unneeded branches. The parser validates search_metadata.count and extracts the integer value, but leaves everything else untouched. Scale this pattern to gigabyte-sized log files, and the savings become significant in both memory usage and CPU cycles.
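
With the single-header distribution, building such an example is a single command along the lines of the following (file name assumed):

c++ -O3 -o parse_twitter parse_twitter.cpp simdjson.cpp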

Use Cases and Who Uses It

simdjson excels in scenarios where JSON parsing becomes a bottleneck. High-frequency trading systems parse market data feeds at microsecond latencies. Log aggregation platforms like those processing Apache access logs or GitHub events need to sustain multi-GB/s ingestion rates. 

Analytics databases extract fields from semi-structured data without materializing full document trees. The On-Demand iterator model shines when you only need specific fields like extracting timestamps from log entries or prices from trading messages, since you pay only for what you access.

For streaming NDJSON workloads (newline-delimited JSON), iterate_many processes files at 3+ GB/s with bounded memory usage. This makes it ideal for ETL pipelines that transform clickstream data, IoT sensor readings, or financial tick data. The multithreaded approach overlaps structural discovery on background threads while the main thread validates and extracts values (parse_many.md).
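
A minimal sketch of that streaming path, following the iterate_many pattern from doc/parse_many.md (events.ndjson and its "type" field are hypothetical stand-ins for a real feed):

#include <iostream>
#include "simdjson.h"
using namespace simdjson;

int main() {
  ondemand::parser parser;
  padded_string json = padded_string::load("events.ndjson");
  // iterate_many yields one On-Demand document per line; when threads are
  // enabled, Stage 1 for the next batch runs in the background while this
  // loop consumes the current batch.
  ondemand::document_stream docs = parser.iterate_many(json);
  size_t count = 0;
  for (auto doc : docs) {
    std::string_view type = doc["type"];  // touch only the field you need
    (void)type;
    ++count;
  }
  std::cout << count << " documents" << std::endl;
  return 0;
}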

Real-world adoption spans critical infrastructure where performance matters. Node.js leverages simdjson for faster JavaScript object parsing. QuestDB uses simdjson in their json_extract() function to pull fields from trading data: extracting quantities, prices, and execution timestamps from JSON trade records stored as VARCHAR columns. ClickHouse integrates it for analytical queries over JSON columns, enabling SQL operations on semi-structured web logs and event data at warehouse scale.

On the compute-intensive side, Meta's Velox vectorized execution engine uses simdjson for columnar analytics, while Dgraph applies it to JSON document ingestion in its graph database. Developer tooling benefits too: Clang Build Analyzer parses compilation trace data, and Shopify HeapProfiler processes memory allocation traces. The ecosystem spans 20+ language bindings, from Rust ports to Python bindings, making SIMD-accelerated parsing accessible across technology stacks.

Community and Contribution

The project is active, research-backed, and well-maintained. Contribution guidelines in CONTRIBUTING.md and build/architecture notes in HACKING.md make it approachable. Issues and PRs are responsive, and the maintainers invest in portability and API clarity. If you are building bindings, the single-header distribution and stable On-Demand API remove a lot of friction.

Usage and License Terms

simdjson is dual-licensed: you may choose the Apache-2.0 license or the MIT license, both permissive and business-friendly. The repository also bundles a few small components under permissive terms (Boost for string_view-lite, MIT-licensed Grisu2 code for number serialization, BSD-3-Clause snippets for runtime dispatch) as noted in the README's license section. Practically, this means you can embed simdjson directly in commercial or open-source software, statically or dynamically, with minimal obligations beyond attribution.

Impact and What Comes Next

This project has already shifted expectations for JSON performance; the emphasis on correctness, UTF-8 validation, and exact number parsing sets a useful bar for the whole ecosystem. Looking forward, two areas stand out. 

First, deeper integrations with data engines (columnar readers, vectorized execution) where the On-Demand cursor can feed operators without intermediate trees. 

Second, expanded tooling around generation and transformation: the Builder API is a great base for high-speed emitters, and C++20/26 features (concepts, static reflection) point to ergonomic, zero-overhead serialization round-trips. If you work with NDJSON in distributed systems, the streaming path is already a compelling building block.

About the Project

simdjson is maintained under the simdjson GitHub organization and led by researchers and engineers including Daniel Lemire and collaborators. The work is supported by grants from the Natural Sciences and Engineering Research Council of Canada (NSERC), as acknowledged in the README. The team publishes regularly and treats the library as both a production tool and a research vehicle, which shows in the documentation and benchmarks (Langdale & Lemire, 2019; Keiser & Lemire, 2024).

Conclusion

If you care about throughput, strict validation, or both, simdjson is an easy recommendation. Start with the single-header distribution, skim doc/basics.md to internalize the On-Demand iterator model, and try a streaming parse with iterate_many on a real dataset. The payoff is immediate: fewer branches, fewer allocations, and more useful CPU cycles. For further context, the JSON spec is a short read (Bray, 2017). Then come back to the repo's performance tips and tune for your hardware.


Publication Title: simdjson: Parsing Gigabytes of JSON per Second
Authors: Daniel Lemire, Geoff Langdale, John Keiser
Number of Pages: 14
Joshua Berkowitz September 23, 2025