Databricks Delivers Fast, Scalable PEFT Model Serving for Enterprise AI
Enterprises aiming to deploy AI agents tailored to their proprietary data face the challenge of delivering high-performance inference that can scale with complex, fragmented workloads. Parameter-Efficient...
Tags: Databricks, enterprise AI, GPU optimization, inference, LoRA, model serving, PEFT, quantization
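For context on what PEFT inference looks like in practice, here is a minimal sketch of loading a LoRA adapter on top of a shared base model with the Hugging Face transformers and peft libraries. This illustrates the general technique, not Databricks' serving stack; the model ID, adapter path, and prompt are placeholders.

```python
# Sketch: attach a LoRA adapter to a frozen base model for inference.
# Not Databricks' implementation; model/adapter paths are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-2-7b-hf"   # placeholder base model
adapter_path = "./my-lora-adapter"     # placeholder fine-tuned LoRA weights

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)

# Wrap the base model with the adapter; only the small low-rank LoRA
# matrices differ per tenant, so many adapters can share one base model.
model = PeftModel.from_pretrained(base, adapter_path)
model.eval()

inputs = tokenizer("Summarize our Q3 sales data:", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Because the base weights are shared and each adapter is tiny, a serving system can keep one copy of the base model resident and hot-swap adapters per request, which is what makes multi-tenant PEFT serving economical.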
ONNX Runtime: Inference Runtime for Portability, Performance, and Scale
Deploying machine learning models efficiently is as important as training them. ONNX Runtime, an open-source accelerator from Microsoft, promises fast, portable inference across operating systems and...
Tags: deployment, inference, ONNX runtime, TensorFlow Serving, Triton
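The portability claim comes down to a small, uniform API: load an exported .onnx graph into a session, pick execution providers in preference order, and run. A minimal sketch using ONNX Runtime's Python API follows; "model.onnx" and the input shape are placeholders for whatever a real exported model defines.

```python
# Sketch of ONNX Runtime inference. "model.onnx" and the input tensor
# shape are placeholders; real models define their own I/O names/shapes.
import numpy as np
import onnxruntime as ort

# Execution providers are tried in order: prefer CUDA if available,
# fall back to CPU. The same code runs unchanged on either.
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

input_name = session.get_inputs()[0].name
x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # example image batch

# Passing None as the first argument returns all model outputs.
outputs = session.run(None, {input_name: x})
print(outputs[0].shape)
```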
BitNet: 1-bit LLMs Land With Practical Inference on CPUs and GPUs
BitNet from Microsoft Research is the official C++ inference stack for native 1-bit large language models, centered on BitNet b1.58. The repo ships fast, lossless ternary kernels for CPUs, a CUDA W2A8...
Tags: 1-bit LLM, BitNet, CPU, GGUF, GPU, inference, llama.cpp, quantization, T-MAC
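To make the "1.58-bit" idea concrete, here is a NumPy sketch of the absmean ternary quantization described in the BitNet b1.58 paper: weights are scaled by their mean absolute value, then rounded and clipped to {-1, 0, +1}. This shows the numeric idea only; the repo's actual speedups come from hand-tuned C++/CUDA ternary kernels, not anything like this.

```python
# Sketch of absmean ternary quantization (BitNet b1.58 style).
# Illustrative only; the official repo uses optimized native kernels.
import numpy as np

def absmean_ternary(w: np.ndarray, eps: float = 1e-5):
    """Quantize a weight matrix to {-1, 0, +1} plus a per-tensor scale."""
    scale = np.abs(w).mean() + eps               # per-tensor absmean scale
    w_q = np.clip(np.round(w / scale), -1, 1)    # ternary weight values
    return w_q.astype(np.int8), scale

def ternary_matmul(x: np.ndarray, w_q: np.ndarray, scale: float):
    """Matmul against ternary weights with a single float rescale."""
    return (x @ w_q.astype(np.float32)) * scale

w = np.random.randn(256, 256).astype(np.float32)
w_q, s = absmean_ternary(w)
x = np.random.randn(1, 256).astype(np.float32)
print(ternary_matmul(x, w_q, s).shape)
```

With weights restricted to three values, the inner product reduces to additions and subtractions plus one scale, which is why dedicated ternary kernels can outrun general 8- or 16-bit matmuls on CPUs.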