ONNX Runtime: Inference Runtime for Portability, Performance, and Scale

Deploying machine learning models efficiently is as important as training them. ONNX Runtime, an open-source accelerator from Microsoft, promises fast, portable inference across operating systems and...

Tags: deployment, inference, ONNX Runtime, TensorFlow Serving, Triton
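The teaser is truncated, but the core workflow it alludes to is small: load an exported .onnx graph into an InferenceSession and feed it named tensors. A minimal sketch in Python, assuming a hypothetical image model `model.onnx` with a single 1×3×224×224 float32 input (the file name and shape are illustrative, not from the article):

```python
import numpy as np
import onnxruntime as ort

# Load the exported graph; the providers list controls which backend runs it.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# Discover the model's input name instead of hard-coding it.
input_name = session.get_inputs()[0].name

# Feed a dummy batch and run inference; None means "return all outputs".
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {input_name: dummy})
print(outputs[0].shape)
```

The same code targets a GPU by passing `providers=["CUDAExecutionProvider", "CPUExecutionProvider"]`, with ONNX Runtime falling back to CPU if CUDA is unavailable, which is the portability story the title refers to.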
BitNet: 1-bit LLMs Land With Practical Inference on CPUs and GPUs

BitNet from Microsoft Research is the official C++ inference stack for native 1-bit large language models, centered on BitNet b1.58. The repo ships fast, lossless ternary kernels for CPUs, a CUDA W2A8...

Tags: 1-bit LLM, BitNet, CPU, GGUF, GPU, inference, llama.cpp, quantization, T-MAC
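To ground what "ternary" means here: BitNet b1.58 constrains weights to {-1, 0, +1} with a per-tensor scale, using the absmean quantization scheme described in the b1.58 paper. A minimal NumPy sketch of that quantization step (the function name and shapes are illustrative; this is the arithmetic idea, not the repo's optimized kernels):

```python
import numpy as np

def absmean_ternary_quantize(w: np.ndarray, eps: float = 1e-8):
    """Quantize a weight matrix to {-1, 0, +1} plus a per-tensor scale,
    following the absmean scheme from the BitNet b1.58 paper."""
    scale = np.mean(np.abs(w)) + eps            # per-tensor absmean scale
    w_q = np.clip(np.round(w / scale), -1, 1)   # round, then clamp to ternary
    return w_q.astype(np.int8), scale

# Dequantized matmul: y ≈ x @ (w_q * scale)
w = np.random.randn(256, 256).astype(np.float32)
w_q, s = absmean_ternary_quantize(w)
x = np.random.randn(1, 256).astype(np.float32)
y = x @ (w_q.astype(np.float32) * s)
```

The actual kernels in the repo avoid this dequantize-then-matmul round trip by operating directly on packed low-bit representations; the sketch only shows what the blurb's "ternary" weights look like numerically.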