vLLM TPU’s Unified Backend is Revolutionizing LLM Inference
The latest vLLM TPU release enables developers to run open-source LLMs on TPUs with strong performance and flexibility. Powered by the tpu-inference backend, this innovation ensures a smooth, h...
Tags: attention kernels, JAX, LLM inference, open source, PyTorch, TPU, tpu-inference, vLLM
Agent Lightning: Decoupled RL Training for Any AI Agent
Agent Lightning is a Microsoft Research project that turns existing agents into trainable systems with minimal code changes. Instead of rewriting your agent to fit a trainer loop, you attach a lightwe...
Tags: AI agents, AutoGen, DPO, LangGraph, OpenAI Agents, reinforcement learning, RLHF, VERL, vLLM
Qwen3-Omni: Native Any-to-Any Multimodality, Now Practical
Qwen3-Omni is a natively end-to-end, multilingual, omni-modal foundation model from the Qwen team at Alibaba Cloud. It can understand text, images, audio, and video, and respond in real time with both...
Tags: ASR, Docker, multimodal, Omni, Qwen, Qwen3, speech, Transformers, vLLM
vLLM Is Transforming High-Performance LLM Deployment
Deploying large language models at scale is no small feat, but vLLM is rapidly emerging as a solution for organizations seeking robust, efficient inference engines. Originally developed at UC Berkeley...
Tags: AI inference, GPU optimization, Kubernetes, large language models, memory management, model deployment, vLLM
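The memory-management tag above refers to vLLM's signature idea, paged attention: the KV cache is split into fixed-size blocks, and each request holds a block table mapping its logical token positions to physical blocks, so memory is allocated on demand rather than reserved up front. As a rough illustration only (a toy sketch of the concept, not vLLM's actual code or API; all names here are hypothetical):

```python
# Toy sketch of a paged KV-cache block allocator, illustrating the idea
# behind vLLM's memory management. Not vLLM's real implementation.

class PagedKVCache:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # pool of physical block ids
        self.tables = {}                      # seq_id -> list of physical blocks
        self.lengths = {}                     # seq_id -> tokens cached so far

    def append_token(self, seq_id: str) -> None:
        """Reserve KV-cache space for one new token of a sequence.

        A new physical block is grabbed only when the current one fills,
        so memory grows with actual generation length, not a preallocated max.
        """
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # first token, or last block is full
            if not self.free:
                raise MemoryError("KV cache exhausted")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(20):                           # 20 tokens span two 16-token blocks
    cache.append_token("req-1")
print(len(cache.tables["req-1"]))             # -> 2 blocks in use
cache.release("req-1")                        # all 4 blocks free again
```

Because blocks are released as soon as a request finishes, many concurrent sequences can share one pool with little waste, which is why vLLM sustains high batch sizes at serving time.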