Optimization and tuning for vLLM

This guide details optimization and performance tuning for vLLM V1, covering preemption handling, chunked prefill behavior, parallelism strategies, input processing, and multi-modal caching.

It explains how vLLM handles preemption when KV cache space is insufficient and shows the warning emitted when sequences are preempted. Practical mitigations for frequent preemption include increasing gpu_memory_utilization, reducing max_num_seqs or max_num_batched_tokens, and adjusting tensor_parallel_size or pipeline_parallel_size. vLLM V1 uses RECOMPUTE as its default preemption mode rather than swapping, and preemption with recomputation can increase end-to-end latency. Prometheus metrics and logging options are available for monitoring preemption counts.
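The mitigations above map directly onto engine arguments. A minimal sketch, assuming the offline LLM API; the model name and specific values are illustrative and need tuning per workload:

```python
from vllm import LLM

# Sketch: reduce preemption frequency by giving the KV cache more room
# and shrinking the scheduling pressure per step.
# Model name and values are illustrative, not recommendations.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.95,   # raise from the 0.9 default for more KV cache headroom
    max_num_seqs=128,              # fewer concurrent sequences per scheduler step
    max_num_batched_tokens=4096,   # smaller per-step token budget
)
```

The same knobs are available as CLI flags on vllm serve (for example --gpu-memory-utilization), so the trade-off applies identically to online serving.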

The guide explains chunked prefill, which vLLM V1 always enables by default. Chunked prefill processes large prefills in smaller chunks and batches them with decode requests, balancing compute-bound prefills against memory-bound decodes. The scheduler prioritizes decode requests, batching all pending decodes before scheduling prefills; pending prefills that exceed the max_num_batched_tokens budget are automatically chunked. Tuning max_num_batched_tokens trades latency against throughput: smaller values (for example, 2048) improve inter-token latency, larger values improve time to first token and throughput, and values above 8192 are recommended for optimal throughput with smaller models on large-memory GPUs. If max_num_batched_tokens equals max_model_len, scheduling resembles the vLLM V0 default, except that decodes are still prioritized.
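The two ends of that trade-off can be sketched as engine configurations; the model name is illustrative and the budgets follow the guideline values quoted above:

```python
from vllm import LLM

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative

# Favor inter-token latency: a small token budget means long prefills are
# split into fine chunks and interleave with decodes more often.
llm_latency = LLM(model=MODEL, max_num_batched_tokens=2048)

# Favor TTFT and throughput: a large budget (>8192 is suggested for
# smaller models with ample GPU memory) lets prefills complete in fewer steps.
llm_throughput = LLM(model=MODEL, max_num_batched_tokens=16384)
```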

Parallelism strategies supported by vLLM are covered in detail. Tensor parallelism shards model parameters across multiple GPUs; it is useful when a model cannot fit on a single GPU or to free memory for KV cache. Pipeline parallelism distributes layers across GPUs and can be combined with tensor parallelism for very large models. Expert parallelism targets mixture-of-experts (MoE) models and is enabled via enable_expert_parallel=True. Data parallelism replicates the entire model across sets of GPUs to scale throughput and can be combined with the other strategies. The guide also covers input-processing scale-out for online serving and the multi-modal processor cache, which is enabled by default and configurable via mm_processor_cache_gb (default 4 GiB per API process plus 4 GiB per engine core process).
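These strategies compose through engine arguments. A sketch combining tensor, pipeline, and expert parallelism for an MoE model; the model name and degree values are illustrative, and the total GPU count is the product of the tensor and pipeline degrees:

```python
from vllm import LLM

# Sketch: combined parallelism for a large MoE model on 8 GPUs.
# Model name and sizes are illustrative assumptions, not recommendations.
llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=4,        # shard each layer's weights across 4 GPUs
    pipeline_parallel_size=2,      # split layers into 2 stages (4 x 2 = 8 GPUs)
    enable_expert_parallel=True,   # distribute MoE experts across the GPUs
)
```

For multimodal models, the processor cache size is passed the same way via mm_processor_cache_gb; assuming the documented 4 GiB default, raising it trades memory for fewer repeated preprocessing passes, and setting it to 0 disables the cache.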
