vLLM server brings OpenAI compatible APIs to local and cloud models

vLLM exposes an OpenAI compatible HTTP server for text, chat, embeddings, audio, and multimodal workloads, while adding its own extensions for pooling, scoring, and re-ranking. It is designed to let existing OpenAI clients talk to local or self-hosted models with minimal code changes.

vLLM offers an HTTP server that closely follows OpenAI's Completions, Chat Completions, Responses, Embeddings, Transcriptions, Translations, and Realtime APIs, so existing clients can target self-hosted models by changing only the base_url and api_key. The server is started with the vllm serve command pointing at a model such as "NousResearch/Meta-Llama-3-8B-Instruct", and by default it applies generation_config.json from the model's Hugging Face repository unless this is disabled with --generation-config vllm. The platform also exposes custom endpoints for tokenization, pooling, classification, scoring, and re-ranking, enabling a wide range of language and multimodal workflows behind a largely OpenAI-compatible surface.
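As a minimal sketch of the drop-in compatibility described above, a chat request to a locally served model can be assembled with nothing but the standard library. The port and endpoint path are vLLM's defaults; the payload shape is the standard OpenAI chat schema, and the API key is a placeholder (vLLM does not check it unless one is configured):

```python
import json
import urllib.request

# vLLM's default listen address; only base_url and api_key differ from
# a request aimed at api.openai.com.
BASE_URL = "http://localhost:8000/v1"

payload = {
    "model": "NousResearch/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Say hello."}],
}
req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer EMPTY",  # placeholder key
    },
)
# urllib.request.urlopen(req) would return the usual OpenAI-style JSON,
# with the reply under choices[0].message.content, once a server is running.
```

An OpenAI SDK client pointed at the same BASE_URL would produce an equivalent request.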

Chat support depends on a Jinja2 chat template defined in the tokenizer configuration or supplied via --chat-template, and vLLM can auto-detect whether templates expect a simple string content field or OpenAI's newer list-of-dictionaries schema such as [{"type": "text", "text": "Hello world!"}]. Extra parameters beyond the OpenAI specification can be passed through extra_body or merged directly into JSON payloads, covering advanced sampling controls like top_k, min_p, repetition_penalty, stop_token_ids, and prompt_logprobs, as well as extensions such as structured_outputs, cache_salt for prefix-cache salting, kv_transfer_params for disaggregated serving, vllm_xargs for custom extensions, and repetition_detection to terminate runs when repetitive patterns like "abcdabcdabcd…" or "emoji emoji emoji …" occur. For chat, additional toggles like echo, add_generation_prompt, continue_final_message, documents for retrieval-augmented generation, and media_io_kwargs or mm_processor_kwargs for multimodal processing allow fine-grained control of how prompts are formatted and executed.
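The extra-parameter mechanism above amounts to plain payload merging. In this sketch the field names come from the text, while the values are purely illustrative:

```python
import json

# Standard OpenAI chat fields.
base = {
    "model": "NousResearch/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello world!"}],
}

# vLLM extensions beyond the OpenAI spec (values illustrative).
extra_body = {
    "top_k": 20,
    "min_p": 0.05,
    "repetition_penalty": 1.1,
    "cache_salt": "tenant-a",  # salts the prefix cache, e.g. per tenant
}

# An OpenAI SDK client performs this merge when you pass extra_body=...;
# when sending raw JSON, you merge the dicts yourself.
payload = {**base, **extra_body}
body = json.dumps(payload)
```

The server reads the extended keys from the same flat JSON object as the standard ones, which is why both extra_body and direct merging work.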

Beyond text generation, vLLM's Embeddings and pooling stack supports both plain-text and chat-style inputs, plus multimodal embedding for models like TIGER-Lab/VLM2Vec-Full and MrLight/dse-qwen2-2b-mrl-v1, configured via custom chat templates and the --runner pooling flag. Extra parameters for embeddings include truncate_prompt_tokens, request_id, priority-based scheduling, mm_processor_kwargs, cache_salt, add_special_tokens, embed_dtype such as "float32", endianness such as "native", and use_activation. The Transcriptions and Translations APIs mirror OpenAI's audio endpoints for models like openai/whisper-large-v3-turbo, with support for uploading files via the Python client or curl, enforcing limits through the VLLM_MAX_AUDIO_CLIP_FILESIZE_MB environment variable, and advanced sampling options such as temperature values between 0 and 1, top_p, top_k, min_p, seed, and penalties for frequency, repetition, and presence. A Realtime WebSocket protocol streams base64-encoded PCM16 audio at 16 kHz via input_audio_buffer.append and input_audio_buffer.commit events, returning transcription.delta and transcription.done messages for low-latency speech-to-text.
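The Realtime framing described above (base64-encoded PCM16 at 16 kHz inside input_audio_buffer.append events, followed by a commit) can be sketched with the standard library; the sample values below are arbitrary:

```python
import base64
import json
import struct

SAMPLE_RATE = 16_000  # Hz, per the protocol above

# A short burst of illustrative little-endian PCM16 samples.
samples = [0, 1000, -1000, 500]
pcm16 = struct.pack("<%dh" % len(samples), *samples)

# One append event followed by a commit, as the protocol expects;
# these JSON frames would be sent over the WebSocket connection.
append_event = {
    "type": "input_audio_buffer.append",
    "audio": base64.b64encode(pcm16).decode("ascii"),
}
commit_event = {"type": "input_audio_buffer.commit"}

frames = [json.dumps(append_event), json.dumps(commit_event)]
```

After the commit, the server streams back transcription.delta messages and a final transcription.done.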

Higher-level semantic tasks are handled by dedicated APIs. The Classification endpoint wraps Hugging Face sequence-classification and generic transformer models to output per-class probabilities and supports both string and batched inputs, with optional use_activation and shared controls like truncate_prompt_tokens, request_id, priority, mm_processor_kwargs, cache_salt, and add_special_tokens for text- or messages-based inputs. The Score API evaluates sentence or multimodal pairs with cross-encoder or embedding models, returning similarity scores such as 1 or 0.001094818115234375, and can be combined with score templates that use query and document roles inside Jinja. The Re-rank API, available at /rerank, /v1/rerank, and /v2/rerank, implements Jina AI- and Cohere-compatible interfaces so a single query can be matched against multiple documents, returning ranked results with relevance_score values like 0.99853515625 and 0.0005860328674316406 and preserving original indices. For distributed and production deployments, Ray Serve LLM integrates the engine with autoscaling, load balancing, and observability, while still exposing an OpenAI-compatible HTTP and Pythonic API across single-GPU and multi-node clusters.
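A re-rank exchange in the Jina/Cohere-compatible shape described above might look like the following sketch. The model name and query are placeholders, and the response is mocked (using the relevance_score values quoted in the text) to show how ranked results keep their original document indices:

```python
# Request body for POST /v1/rerank (model name is a placeholder).
request = {
    "model": "example/reranker",
    "query": "What is the capital of France?",
    "documents": [
        "The capital of Brazil is Brasilia.",
        "The capital of France is Paris.",
    ],
}

# Mocked response in the documented shape: results arrive sorted by
# relevance_score, but each entry keeps its original document index.
response = {
    "results": [
        {"index": 1, "relevance_score": 0.99853515625},
        {"index": 0, "relevance_score": 0.0005860328674316406},
    ]
}

# The preserved index maps the top result back to the original document.
best = request["documents"][response["results"][0]["index"]]
```

Preserving indices lets a caller re-order its own document list without re-sending the texts.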


SK hynix debuts 1c LPDDR6 memory with 16 Gb capacity and higher speeds

SK hynix has developed 1c-node LPDDR6 memory with 16 Gb capacity, targeting speeds beyond 10.7 Gbps and improved power efficiency for next-generation devices. The company plans to start mass production in the first half of the year and ship to customers in the second half.

Nvidia debuts RTX Mega Geometry with next-gen ray tracing demos

Nvidia introduced RTX Mega Geometry at GDC 2026 alongside its GeForce RTX 50 series, showcasing new techniques for handling extreme geometric detail in ray-traced scenes. Early demos in Alan Wake 2 and The Witcher 4 highlight performance gains and memory savings from nested triangle clusters.
