Whisper es el modelo de reconocimiento de voz de propósito general de OpenAI entrenado con 680,000 horas de datos multilingües. Soporta transcripción, traducción, detección de idioma y marca de tiempo a nivel de palabra.
Instalación
# Install OpenAI Whisper
pip install openai-whisper
# Install with PyTorch CUDA support
pip install openai-whisper torch torchvision torchaudio
# Install faster-whisper (CTranslate2 backend, 4x faster)
pip install faster-whisper
# Install WhisperX (with alignment and diarization)
pip install whisperx
# Install ffmpeg (required dependency)
# macOS
brew install ffmpeg
# Ubuntu
sudo apt install ffmpeg
# Verify
whisper --help
Tamaños de modelos
| Model | Parameters | English-only | VRAM Required | Relative Speed |
|---|
| tiny | 39M | tiny.en | ~1 GB | ~10x |
| base | 74M | base.en | ~1 GB | ~7x |
| small | 244M | small.en | ~2 GB | ~4x |
| medium | 769M | medium.en | ~5 GB | ~2x |
| large-v3 | 1550M | No | ~10 GB | 1x |
| turbo | 809M | No | ~6 GB | ~8x |
Uso del CLI
# Basic transcription
whisper audio.mp3
# Specify model size
whisper audio.mp3 --model large-v3
# Specify language
whisper audio.mp3 --model large-v3 --language en
# Translate to English
whisper audio.mp3 --model large-v3 --task translate
# Specify output formats
whisper audio.mp3 --model large-v3 \
--output_format all # txt, vtt, srt, tsv, json
# Output to specific directory
whisper audio.mp3 --model large-v3 \
--output_dir ./transcripts
# Multiple files
whisper audio1.mp3 audio2.wav audio3.m4a --model medium
# Word-level timestamps
whisper audio.mp3 --model large-v3 \
--word_timestamps True
# Use specific device
whisper audio.mp3 --model large-v3 --device cuda
# CPU with specific threads
whisper audio.mp3 --model medium --device cpu --threads 8
API de Python
import whisper
# Load model
model = whisper.load_model("large-v3") # tiny, base, small, medium, large-v3, turbo
# Basic transcription
result = model.transcribe("audio.mp3")
print(result["text"])
# With options
result = model.transcribe(
"audio.mp3",
language="en",
task="transcribe", # or "translate" for English translation
fp16=True, # Use FP16 (GPU only)
beam_size=5,
best_of=5,
temperature=0.0,
word_timestamps=True,
verbose=True,
)
# Access segments with timestamps
for segment in result["segments"]:
print(f"[{segment['start']:.2f} - {segment['end']:.2f}] {segment['text']}")
# Access word-level timestamps
for segment in result["segments"]:
for word in segment.get("words", []):
print(f" [{word['start']:.2f} - {word['end']:.2f}] {word['word']}")
Detección de idioma
import whisper
model = whisper.load_model("large-v3")
# Load and pad audio
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)
# Compute log-Mel spectrogram
mel = whisper.log_mel_spectrogram(audio).to(model.device)
# Detect language
_, probs = model.detect_language(mel)
detected_lang = max(probs, key=probs.get)
print(f"Detected language: {detected_lang} ({probs[detected_lang]:.2%})")
Faster-Whisper
from faster_whisper import WhisperModel
# Load model (uses CTranslate2 - significantly faster)
model = WhisperModel(
"large-v3",
device="cuda",
compute_type="float16", # float16, int8_float16, int8
)
# Transcribe
segments, info = model.transcribe(
"audio.mp3",
language="en",
beam_size=5,
word_timestamps=True,
vad_filter=True, # Filter out silence
)
print(f"Detected language: {info.language} (prob: {info.language_probability:.2f})")
for segment in segments:
print(f"[{segment.start:.2f} - {segment.end:.2f}] {segment.text}")
if segment.words:
for word in segment.words:
print(f" [{word.start:.2f} - {word.end:.2f}] {word.word}")
# Faster-Whisper CLI
faster-whisper audio.mp3 --model large-v3 --language en
WhisperX (alineación y diarización)
import whisperx
# Load model
model = whisperx.load_model("large-v3", device="cuda", compute_type="float16")
# Transcribe
audio = whisperx.load_audio("audio.mp3")
result = model.transcribe(audio, batch_size=16)
# Align timestamps at word level
model_a, metadata = whisperx.load_align_model(language_code="en", device="cuda")
result = whisperx.align(
result["segments"], model_a, metadata, audio, device="cuda"
)
# Speaker diarization (requires HuggingFace token)
diarize_model = whisperx.DiarizationPipeline(
use_auth_token="YOUR_HF_TOKEN", device="cuda"
)
diarize_segments = diarize_model(audio)
# Assign speakers to words
result = whisperx.assign_word_speakers(diarize_segments, result)
for segment in result["segments"]:
speaker = segment.get("speaker", "UNKNOWN")
print(f"[{speaker}] {segment['text']}")
# Generate all output formats
whisper audio.mp3 --model large-v3 --output_format all
Generated files include:
| Formato | Extension | Descripción |
|---|
| Text | .txt | Plain text transcript |
| SRT | .srt | SubRip subtitle format |
| VTT | .vtt | WebVTT subtitle format |
| TSV | .tsv | Tab-separated with timestamps |
| JSON | .json | Full results with metadata |
Escritura de salida programática
import whisper
from whisper.utils import get_writer
model = whisper.load_model("large-v3")
result = model.transcribe("audio.mp3")
# Write SRT file
writer = get_writer("srt", "./output")
writer(result, "audio.mp3")
# Write VTT file
writer = get_writer("vtt", "./output")
writer(result, "audio.mp3")
# Write JSON
writer = get_writer("json", "./output")
writer(result, "audio.mp3")
Procesamiento por lotes
import os
import whisper
model = whisper.load_model("large-v3")
audio_dir = "./audio_files"
for filename in os.listdir(audio_dir):
if filename.endswith((".mp3", ".wav", ".m4a", ".flac")):
filepath = os.path.join(audio_dir, filename)
result = model.transcribe(filepath)
print(f"\n--- {filename} ---")
print(result["text"])
Opciones comunes
| Opción | Descripción | Predeterminado |
|---|
--model | Model size | turbo |
--language | Audio language code | Auto-detect |
--task | transcribe or translate | transcribe |
--output_format | txt, vtt, srt, tsv, json, all | all |
--output_dir | Output directory | . |
--device | cuda or cpu | Auto |
--word_timestamps | Enable word-level timestamps | False |
--beam_size | Beam search width | 5 |
--temperature | Sampling temperature | 0 |
--fp16 | Use FP16 inference | True on GPU |