Speech-to-Text (Transcription/Translation) Support
This document walks you through the steps to add support for speech-to-text (ASR) models to vLLM’s transcription and translation APIs by implementing SupportsTranscription. Please refer to the supported models for further guidance.
1. Update the base vLLM model
It is assumed you have already implemented your model in vLLM according to the basic model guide. Extend your model with the SupportsTranscription interface and implement the following class attributes and methods.
- Declare supported languages and capabilities:

from typing import ClassVar, Mapping, Optional, Literal

import numpy as np
import torch
from torch import nn

from vllm.config import ModelConfig, SpeechToTextConfig
from vllm.inputs.data import PromptType
from vllm.model_executor.models.interfaces import SupportsTranscription


class YourASRModel(nn.Module, SupportsTranscription):
    # Map of ISO 639-1 language codes to language names
    supported_languages: ClassVar[Mapping[str, str]] = {
        "en": "English",
        "it": "Italian",
        # ... add more as needed
    }

    # If your model only supports audio-conditioned generation
    # (no text-only generation), enable this flag.
    supports_transcription_only: ClassVar[bool] = True
- The supported_languages mapping is validated at init time.
- Set supports_transcription_only=True if the model should not serve text generation (e.g., Whisper).
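The text does not say how the mapping is validated. As a rough illustration only, the sketch below assumes the check runs when the subclass is defined and rejects codes outside a known set. `TranscriptionBase` and `KNOWN_LANGUAGE_CODES` are hypothetical stand-ins, not vLLM names.

```python
from typing import ClassVar, Mapping

# Hypothetical stand-in for the set of language codes the server recognizes.
KNOWN_LANGUAGE_CODES = {"en", "it", "de", "fr", "es"}


class TranscriptionBase:
    """Stand-in for SupportsTranscription; this is not the vLLM class."""

    supported_languages: ClassVar[Mapping[str, str]] = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Collect declared codes that are not in the known set.
        unknown = sorted(set(cls.supported_languages) - KNOWN_LANGUAGE_CODES)
        if unknown:
            raise ValueError(f"{cls.__name__}: unknown language codes {unknown}")


class GoodModel(TranscriptionBase):
    supported_languages = {"en": "English", "it": "Italian"}


def define_bad_model() -> str:
    """Return the error message raised when a class declares an unknown code."""
    try:
        class BadModel(TranscriptionBase):
            supported_languages = {"en": "English", "xx": "Made-up"}
    except ValueError as exc:
        return str(exc)
    return ""
```

Failing at class-definition time, as assumed here, surfaces a typo in a language code as soon as the module is imported rather than on the first request.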
- Provide an ASR configuration via get_speech_to_text_config. This controls the general behavior of the API when serving your model:

class YourASRModel(nn.Module, SupportsTranscription):
    ...

    @classmethod
    def get_speech_to_text_config(
        cls,
        model_config: ModelConfig,
        task_type: Literal["transcribe", "translate"],
    ) -> SpeechToTextConfig:
        return SpeechToTextConfig(
            sample_rate=16_000,
            max_audio_clip_s=30,
            # Set to None to disable server-side chunking if your
            # model/processor handles it already
            min_energy_split_window_size=None,
        )
See the “Audio preprocessing and chunking” section for what each field controls.
- Implement the prompt construction via get_generation_prompt. The server passes you the resampled waveform and task parameters; you return a valid PromptType. There are two common patterns:
A. Multimodal LLM with audio embeddings (e.g., Voxtral, Gemma3n)
Return a dict containing multi_modal_data with the audio, and either a prompt string or prompt_token_ids:

class YourASRModel(nn.Module, SupportsTranscription):
    ...

    @classmethod
    def get_generation_prompt(
        cls,
        audio: np.ndarray,
        stt_config: SpeechToTextConfig,
        model_config: ModelConfig,
        language: Optional[str],
        task_type: Literal["transcribe", "translate"],
        request_prompt: str,
        to_language: Optional[str],
    ) -> PromptType:
        # Example with a free-form instruction prompt
        task_word = "Transcribe" if task_type == "transcribe" else "Translate"
        prompt = (
            "<start_of_turn>user\n"
            f"{task_word} this audio: <audio_soft_token>"
            "<end_of_turn>\n<start_of_turn>model\n"
        )

        return {
            "multi_modal_data": {"audio": (audio, stt_config.sample_rate)},
            "prompt": prompt,
        }
For more detail on multimodal inputs, please refer to Multi-Modal Inputs.
B. Encoder–decoder audio-only (e.g., Whisper)
Return a dict with separate encoder_prompt and decoder_prompt entries:

from typing import cast  # needed for the cast() call below


class YourASRModel(nn.Module, SupportsTranscription):
    ...

    @classmethod
    def get_generation_prompt(
        cls,
        audio: np.ndarray,
        stt_config: SpeechToTextConfig,
        model_config: ModelConfig,
        language: Optional[str],
        task_type: Literal["transcribe", "translate"],
        request_prompt: str,
        to_language: Optional[str],
    ) -> PromptType:
        if language is None:
            raise ValueError("Language must be specified")

        prompt = {
            # The encoder only sees the audio; its text prompt stays empty.
            "encoder_prompt": {
                "prompt": "",
                "multi_modal_data": {
                    "audio": (audio, stt_config.sample_rate),
                },
            },
            # The decoder prompt carries the optional context and task tokens.
            "decoder_prompt": (
                (f"<|prev|>{request_prompt}" if request_prompt else "")
                + f"<|startoftranscript|><|{language}|>"
                + f"<|{task_type}|><|notimestamps|>"
            ),
        }
        return cast(PromptType, prompt)
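To see what the decoder prompt looks like for a given request, the string assembly can be pulled out into a standalone function. This is a minimal sketch that copies the concatenation from the snippet above; `build_whisper_decoder_prompt` is a hypothetical helper name, and the task check is an added assumption.

```python
def build_whisper_decoder_prompt(
    language: str,
    task_type: str,
    request_prompt: str = "",
) -> str:
    """Assemble a Whisper-style decoder prompt from the request parameters."""
    # Added assumption: reject anything other than the two documented tasks.
    if task_type not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task_type: {task_type!r}")
    # Optional user-supplied context goes before the start-of-transcript token.
    prefix = f"<|prev|>{request_prompt}" if request_prompt else ""
    return (
        prefix
        + f"<|startoftranscript|><|{language}|>"
        + f"<|{task_type}|><|notimestamps|>"
    )


print(build_whisper_decoder_prompt("en", "transcribe"))
# <|startoftranscript|><|en|><|transcribe|><|notimestamps|>
```

Keeping the assembly in a pure function like this makes it easy to unit-test the prompt format without loading the model.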
- (Optional) Language validation via validate_language
If your model requires a language and you want a default, override this method (see Whisper).
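A default-language override might look like the following sketch. The one-argument classmethod signature follows the server call `self.model_cls.validate_language(request.language)` shown later in this document. The base-class behavior and the `TranscriptionBase` name are assumptions made for illustration, so check the Whisper implementation for the real logic.

```python
import logging
from typing import ClassVar, Mapping, Optional

logger = logging.getLogger(__name__)


class TranscriptionBase:
    """Stand-in for SupportsTranscription; this is not the vLLM class."""

    supported_languages: ClassVar[Mapping[str, str]] = {}

    @classmethod
    def validate_language(cls, language: Optional[str]) -> Optional[str]:
        # Assumed base behavior: pass None through, reject unsupported codes.
        if language is not None and language not in cls.supported_languages:
            raise ValueError(f"Unsupported language: {language!r}")
        return language


class YourASRModel(TranscriptionBase):
    supported_languages = {"en": "English", "it": "Italian"}

    @classmethod
    def validate_language(cls, language: Optional[str]) -> Optional[str]:
        # Fall back to English when the request omits the language field.
        if language is None:
            logger.warning("No language given; defaulting to 'en'.")
            language = "en"
        return super().validate_language(language)
```

Delegating to the base method after applying the default keeps the supported-language check in one place.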
- (Optional) Token accounting for streaming via get_num_audio_tokens
Provide a fast duration→token estimate to improve streaming usage statistics:

class YourASRModel(nn.Module, SupportsTranscription):
    ...

    @classmethod
    def get_num_audio_tokens(
        cls,
        audio_duration_s: float,
        stt_config: SpeechToTextConfig,
        model_config: ModelConfig,
    ) -> Optional[int]:
        # Return None if unknown; otherwise return an estimate.
        return int(audio_duration_s * stt_config.sample_rate // 320)  # example
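To make the example estimate concrete: at a 16 kHz sample rate, one token per 320 samples is one token per 20 ms, or 50 tokens per second of audio. The divisor 320 is only the example's value; the right ratio depends on your model's feature extractor and encoder. The sketch below repeats the same arithmetic as a standalone function (`estimate_audio_tokens` is a hypothetical name).

```python
def estimate_audio_tokens(
    audio_duration_s: float,
    sample_rate: int,
    samples_per_token: int = 320,  # example value, model-specific in practice
) -> int:
    """Estimate the audio token count from duration alone."""
    return int(audio_duration_s * sample_rate // samples_per_token)


for seconds in (1.0, 12.5, 30.0):
    print(seconds, estimate_audio_tokens(seconds, 16_000))
# 1.0 50
# 12.5 625
# 30.0 1500
```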
2. Audio preprocessing and chunking
The API server takes care of basic audio I/O and optional chunking before building prompts:
- Resampling: Input audio is resampled to SpeechToTextConfig.sample_rate using librosa.
- Chunking: If SpeechToTextConfig.allow_audio_chunking is True and the duration exceeds max_audio_clip_s, the server splits the audio into overlapping chunks and generates a prompt per chunk. Overlap is controlled by overlap_chunk_second.
- Energy-aware splitting: When min_energy_split_window_size is set, the server finds low-energy regions to minimize cutting within words.
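The idea behind energy-aware splitting can be sketched in plain Python. This is a simplified illustration, not the server's implementation: it treats the overlap region at the end of each clip as the search area for a cut point and picks the window with the lowest mean squared amplitude. The function names and the `window_s` parameter are hypothetical.

```python
from typing import List, Sequence


def quietest_window_start(
    samples: Sequence[float], start: int, end: int, window: int
) -> int:
    """Return the start index of the lowest-energy window inside [start, end)."""
    best_index, best_energy = start, float("inf")
    last_start = max(start, end - window)
    for i in range(start, last_start + 1, window):
        frame = samples[i:i + window]
        energy = sum(x * x for x in frame) / len(frame)
        if energy < best_energy:
            best_index, best_energy = i, energy
    return best_index


def split_audio(
    samples: Sequence[float],
    sample_rate: int,
    max_clip_s: float,
    overlap_s: float,
    window_s: float,
) -> List[Sequence[float]]:
    """Split samples into clips of at most max_clip_s, cutting at quiet spots."""
    chunk = int(max_clip_s * sample_rate)
    overlap = int(overlap_s * sample_rate)
    window = max(1, int(window_s * sample_rate))
    if not 0 < overlap < chunk:
        raise ValueError("overlap must be positive and shorter than the clip")

    chunks: List[Sequence[float]] = []
    start = 0
    while start < len(samples):
        if start + chunk >= len(samples):
            chunks.append(samples[start:])  # the remainder fits in one clip
            break
        # Search only the tail of the clip for a quiet place to cut.
        cut = quietest_window_start(
            samples, start + chunk - overlap, start + chunk, window
        )
        chunks.append(samples[start:cut])
        start = cut
    return chunks


# A loud signal with a short silence at samples 24-25.
signal = [1.0] * 70
signal[24:26] = [0.0, 0.0]
chunks = split_audio(signal, sample_rate=10, max_clip_s=3, overlap_s=1, window_s=0.2)
print([len(c) for c in chunks])  # [24, 20, 26]
```

In the demo, the first cut lands on the silent window at sample 24 rather than at the hard 30-sample limit, which is the behavior the energy-aware option is meant to provide.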
Relevant server logic:
# vllm/entrypoints/openai/speech_to_text.py
async def _preprocess_speech_to_text(...):
    language = self.model_cls.validate_language(request.language)
    ...
    y, sr = librosa.load(bytes_, sr=self.asr_config.sample_rate)
    duration = librosa.get_duration(y=y, sr=sr)
    do_split_audio = (self.asr_config.allow_audio_chunking
                      and duration > self.asr_config.max_audio_clip_s)
    chunks = [y] if not do_split_audio else self._split_audio(y, int(sr))
    prompts = []
    for chunk in chunks:
        prompt = self.model_cls.get_generation_prompt(
            audio=chunk,
            stt_config=self.asr_config,
            model_config=self.model_config,
            language=language,
            task_type=self.task_type,
            request_prompt=request.prompt,
            to_language=to_language,
        )
        prompts.append(prompt)
    return prompts, duration
3. Exposing tasks automatically
- vLLM automatically advertises transcription support if your model implements the interface:
if supports_transcription(model):
    if model.supports_transcription_only:
        return ["transcription"]
    supported_tasks.append("transcription")
- When enabled, the server initializes the transcription and translation handlers:
state.openai_serving_transcription = OpenAIServingTranscription(...) if "transcription" in supported_tasks else None
state.openai_serving_translation = OpenAIServingTranslation(...) if "transcription" in supported_tasks else None
No extra registration is required beyond having your model class available via the model registry and implementing SupportsTranscription.
4. Examples in-tree
- Whisper encoder–decoder (audio-only): vllm/model_executor/models/whisper.py
- Voxtral decoder-only (audio embeddings + LLM): vllm/model_executor/models/voxtral.py
- Gemma3n decoder-only with fixed instruction prompt: vllm/model_executor/models/gemma3n_mm.py
5. Test with the API
Once your model implements SupportsTranscription, you can test the endpoints (the API mimics OpenAI's):
- Transcription (ASR): send the same multipart request as in the translation example, but to /v1/audio/transcriptions.
- Translation (source → English unless otherwise supported):
curl -s -X POST \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@/path/to/audio.wav" \
  -F "model=$MODEL_ID" \
  http://localhost:8000/v1/audio/translations
Or check out more examples in examples/online_serving.
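For testing from Python without extra dependencies, the same multipart request can be built with the standard library. This is a minimal sketch: `build_audio_request` is a hypothetical helper, the /v1/audio/transcriptions path is assumed to mirror OpenAI's API in the same way /v1/audio/translations does, and sending the request requires a running server.

```python
import io
import urllib.request
import uuid


def build_audio_request(
    base_url: str,
    endpoint: str,  # "transcriptions" or "translations"
    model: str,
    filename: str,
    audio_bytes: bytes,
    api_key: str = "",
) -> urllib.request.Request:
    """Build a multipart/form-data POST carrying the model name and audio file."""
    boundary = uuid.uuid4().hex
    body = io.BytesIO()

    # Plain text form field: model
    body.write(f"--{boundary}\r\n".encode())
    body.write(b'Content-Disposition: form-data; name="model"\r\n\r\n')
    body.write(model.encode() + b"\r\n")

    # File form field: file
    file_header = (
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    body.write(f"--{boundary}\r\n".encode())
    body.write(file_header.encode())
    body.write(audio_bytes + b"\r\n")

    # Closing boundary
    body.write(f"--{boundary}--\r\n".encode())

    request = urllib.request.Request(
        url=f"{base_url}/v1/audio/{endpoint}",
        data=body.getvalue(),
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
    if api_key:
        request.add_header("Authorization", f"Bearer {api_key}")
    return request


# Usage (needs a running server, so it is left commented out):
# with open("/path/to/audio.wav", "rb") as f:
#     req = build_audio_request("http://localhost:8000", "transcriptions",
#                               "your-model-id", "audio.wav", f.read())
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode())
```

Switching between the two endpoints only changes the final path segment, so one helper covers both the transcription and translation checks.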
Note
- If your model handles chunking internally (e.g., via its processor or encoder), set min_energy_split_window_size=None in the returned SpeechToTextConfig to disable server-side chunking.
- Implementing get_num_audio_tokens improves accuracy of streaming usage metrics (prompt_tokens) without an extra forward pass.
- For multilingual behavior, keep supported_languages aligned with actual model capabilities.