DEV Community
#vllm
Posts
Qwen3.6-27B + vLLM + Hermes on 24GB VRAM: May 2026 Recipe
Xavier Rey-Robert
Jun 19
#ai #llm #vllm #agents
4 min read
AI Inference at the Edge: Running Real-Time LLMs in Kubernetes Without a GPU Farm
The Cyber Sidekick
Jun 18
#edgeai #kubernetes #llminference #vllm
3 min read
Qwen3.6-35B NVFP4 runs on one H100 — A100 owners are out
Creeta
Jun 18
#qwen3 #nvfp4 #vllm #nvidia
8 min read
I built an open-source alternative to Microsoft's KAITO that works on ANY Kubernetes cluster
GaeaRuiW
Jun 9
#kubernetes #vllm #devops #opensource
2 min read
Prefix caching at scale: when it saves you 80% of prefill cost, and the eviction policies that quietly turn it into 5%
Tech_Nuggets
Jun 7
#llm #ai #infrastructure #vllm
9 min read
KV cache quantization: what FP8/INT8 K and V actually buy you, and where they break
Tech_Nuggets
Jun 6
#llm #ai #vllm #performance
1 reaction
8 min read
Two Qwen3 Models on One DGX Spark: The Residency Math for Local LLM Coding
Devashish
Jun 16
#localllm #vllm #ai #nvidia
5 min read
Gemma 4 Benchmarking NVIDIA Blackwell RTX 6000 vs L4 on Google Cloud Run
xbill for Google Developer Experts
May 30
#googleantigravity #vllm #googlecloudrun #gemma4
4 reactions
14 min read
vLLM's V1 Release Fixes the Silent Killer in RL Training
Aamer Mihaysi
May 8
#vllm #machinelearning #python
2 min read
The 70B Threshold: How the RTX 5090 Rewrites the Home Lab Equation
Matthew Gladding
Apr 24
#model #memory #models #vllm
8 min read
How RunPod FlashBoot Actually Works (4-Request Test)
Sergey Shmakov
May 26
#runpod #flashboot #serverless #vllm
1 reaction
10 min read
Rethinking Open Source Contribution in the Age of AI Agents, featuring vLLM Core Maintainer Roger Wang at MLSys'26
Grace
May 21
#vllm #ai #machinelearning #llm
8 reactions
6 comments
3 min read
Ollama vs llama.cpp vs vLLM: Which Should You Use in 2026?
Thurmon Demich
May 20
#ollama #llamacpp #vllm #comparison
1 comment
5 min read
72B Parameters, Zero Quantization, One GPU: Benchmarking Qwen2-VL on AMD MI300X
Manikandan T
May 13
#vllm #rocm #mi300x #genai
13 min read
From one model to seven — what it took to make TurboQuant model-portable
Alberto Nieto
Apr 1
#python #vllm #gpu #triton
3 min read