DEV Community

# vllm

Xavier Rey-Robert

Jun 19

Qwen3.6-27B + vLLM + Hermes on 24GB VRAM: May 2026 Recipe

#ai #llm #vllm #agents

4 min read

The Cyber Sidekick

Jun 18

AI Inference at the Edge: Running Real-Time LLMs in Kubernetes Without a GPU Farm

#edgeai #kubernetes #llminference #vllm

3 min read

Creeta

Jun 18

Qwen3.6-35B NVFP4 runs on one H100 — A100 owners are out

#qwen3 #nvfp4 #vllm #nvidia

8 min read

GaeaRuiW

Jun 9

I built an open-source alternative to Microsoft's KAITO that works on ANY Kubernetes cluster

#kubernetes #vllm #devops #opensource

2 min read

Jun 7

Prefix caching at scale: when it saves you 80% of prefill cost, and the eviction policies that quietly turn it into 5%

#llm #ai #infrastructure #vllm

9 min read

Jun 6

KV cache quantization: what FP8/INT8 K and V actually buy you, and where they break

#llm #ai #vllm #performance

8 min read

Devashish

Jun 16

Two Qwen3 Models on One DGX Spark: The Residency Math for Local LLM Coding

#localllm #vllm #ai #nvidia

5 min read

xbill for Google Developer Experts

May 30

Gemma 4 Benchmarking NVIDIA Blackwell RTX 6000 vs L4 on Google Cloud Run

#googleantigravity #vllm #googlecloudrun #gemma4

14 min read

May 8

vLLM's V1 Release Fixes the Silent Killer in RL Training

#vllm #machinelearning #python

2 min read

Matthew Gladding

Apr 24

The 70B Threshold: How the RTX 5090 Rewrites the Home Lab Equation

#model #memory #models #vllm

8 min read

May 26

How RunPod FlashBoot Actually Works (4-Request Test)

#runpod #flashboot #serverless #vllm

10 min read

Grace

May 21

Rethinking Open Source Contribution in the Age of AI Agents, featuring vLLM Core Maintainer Roger Wang at MLSys'26

#vllm #ai #machinelearning #llm

3 min read

May 20

Ollama vs llama.cpp vs vLLM: Which Should You Use in 2026?

#ollama #llamacpp #vllm #comparison

5 min read

May 13

72B Parameters, Zero Quantization, One GPU: Benchmarking Qwen2-VL on AMD MI300X

#vllm #rocm #mi300x #genai

13 min read

Apr 1

From one model to seven — what it took to make TurboQuant model-portable

#python #vllm #gpu #triton

3 min read
