Whisper is OpenAI's general-purpose speech recognition model, trained on 680,000 hours of multilingual data. It supports transcription, translation into English, language identification, and word-level timestamps across 99 languages.
Installation
# Install OpenAI Whisper
pip install openai-whisper
# Install with PyTorch CUDA support
pip install openai-whisper torch torchvision torchaudio
# Install faster-whisper (CTranslate2 backend, up to 4x faster)
pip install faster-whisper
# Install WhisperX (with alignment and diarization)
pip install whisperx
# Install ffmpeg (required dependency)
# macOS
brew install ffmpeg
# Ubuntu
sudo apt install ffmpeg
# Verify
whisper --help
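Before transcribing anything, it can help to confirm that the external tools are actually on the PATH. The following is a minimal sketch using only the standard library; the helper name `missing_tools` is hypothetical and not part of Whisper.

```python
import shutil


def missing_tools(tools=("ffmpeg", "whisper")):
    """Return the names from `tools` that cannot be found on the PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("ffmpeg and whisper are both available")
```

An empty result means both commands resolved; anything else points at an install step that should be repeated.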
Model Sizes
| Model | Parameters | English-only variant | VRAM Required | Relative Speed (vs. large) |
|---|---|---|---|---|
| tiny | 39M | tiny.en | ~1 GB | ~10x |
| base | 74M | base.en | ~1 GB | ~7x |
| small | 244M | small.en | ~2 GB | ~4x |
| medium | 769M | medium.en | ~5 GB | ~2x |
| large-v3 | 1550M | none | ~10 GB | 1x |
| turbo | 809M | none | ~6 GB | ~8x |
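The table can be turned into a small lookup for choosing a model from the GPU memory you have. This is only a sketch: the VRAM numbers are the approximate figures from the table, and both the preference order and the `pick_model` helper are assumptions made for illustration.

```python
# Approximate VRAM needs in GB, copied from the table above.
MODEL_VRAM_GB = {
    "tiny": 1,
    "base": 1,
    "small": 2,
    "medium": 5,
    "turbo": 6,
    "large-v3": 10,
}

# Hypothetical preference order (assumed most capable first); adjust to taste.
PREFERENCE = ["large-v3", "turbo", "medium", "small", "base", "tiny"]


def pick_model(available_vram_gb):
    """Return the first preferred model whose approximate VRAM need fits."""
    for name in PREFERENCE:
        if MODEL_VRAM_GB[name] <= available_vram_gb:
            return name
    # Nothing fits: fall back to the smallest model (for example on CPU).
    return "tiny"


print(pick_model(8))  # turbo
```

The returned name can be passed directly to `whisper.load_model()` or to `--model` on the command line.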
CLI Usage
# Basic transcription
whisper audio.mp3
# Specify model size
whisper audio.mp3 --model large-v3
# Specify language
whisper audio.mp3 --model large-v3 --language en
# Translate to English
whisper audio.mp3 --model large-v3 --task translate
# Specify output formats
whisper audio.mp3 --model large-v3 \
--output_format all # txt, vtt, srt, tsv, json
# Output to specific directory
whisper audio.mp3 --model large-v3 \
--output_dir ./transcripts
# Multiple files
whisper audio1.mp3 audio2.wav audio3.m4a --model medium
# Word-level timestamps
whisper audio.mp3 --model large-v3 \
--word_timestamps True
# Use specific device
whisper audio.mp3 --model large-v3 --device cuda
# CPU with specific threads
whisper audio.mp3 --model medium --device cpu --threads 8
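When the CLI is driven from a script, assembling the argument list in one place keeps the flags consistent. The sketch below only builds the command and does not run it; `build_whisper_command` is a hypothetical helper, and it uses only flags shown in the commands above.

```python
import shlex


def build_whisper_command(files, model="turbo", language=None,
                          task="transcribe", output_dir=None,
                          word_timestamps=False):
    """Assemble an argument list for the `whisper` CLI (does not run it)."""
    cmd = ["whisper", *files, "--model", model, "--task", task]
    if language is not None:
        cmd += ["--language", language]
    if output_dir is not None:
        cmd += ["--output_dir", output_dir]
    if word_timestamps:
        cmd += ["--word_timestamps", "True"]
    return cmd


cmd = build_whisper_command(["audio.mp3"], model="large-v3", language="en")
print(shlex.join(cmd))
# whisper audio.mp3 --model large-v3 --task transcribe --language en
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`; passing a list rather than a string avoids shell quoting problems with file names that contain spaces.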
Python API
import whisper
# Load model
model = whisper.load_model("large-v3") # tiny, base, small, medium, large-v3, turbo
# Basic transcription
result = model.transcribe("audio.mp3")
print(result["text"])
# With options
result = model.transcribe(
    "audio.mp3",
    language="en",
    task="transcribe",  # or "translate" for English translation
    fp16=True,  # FP16 on GPU; on CPU Whisper warns and uses FP32
    beam_size=5,
    best_of=5,  # only used when sampling (temperature > 0)
    temperature=0.0,
    word_timestamps=True,
    verbose=True,
)
# Access segments with timestamps
for segment in result["segments"]:
    print(f"[{segment['start']:.2f} - {segment['end']:.2f}] {segment['text']}")
# Access word-level timestamps
for segment in result["segments"]:
    for word in segment.get("words", []):
        print(f"    [{word['start']:.2f} - {word['end']:.2f}] {word['word']}")
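Because the result is a plain dictionary, it is easy to post-process. As an illustration, the sketch below finds the segments in which a keyword is spoken. `find_keyword` is a hypothetical helper, and the sample data is made up to mirror the `segments` structure used above.

```python
def find_keyword(result, keyword):
    """Return (start, end, text) for each segment whose text contains keyword."""
    needle = keyword.lower()
    return [
        (seg["start"], seg["end"], seg["text"].strip())
        for seg in result["segments"]
        if needle in seg["text"].lower()
    ]


# Made-up sample shaped like the dictionary returned by model.transcribe().
sample = {
    "text": " Welcome to the show. Today we talk about Whisper.",
    "segments": [
        {"start": 0.0, "end": 1.8, "text": " Welcome to the show."},
        {"start": 1.8, "end": 4.2, "text": " Today we talk about Whisper."},
    ],
}

for start, end, text in find_keyword(sample, "whisper"):
    print(f"[{start:.2f} - {end:.2f}] {text}")
# [1.80 - 4.20] Today we talk about Whisper.
```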
Language Detection
import whisper
model = whisper.load_model("large-v3")
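# Load audio and pad or trim it to 30 seconds
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)
# Compute the log-Mel spectrogram with the model's own mel-bin count
# (large-v3 expects 128 bins, while the function defaults to 80)
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)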
# Detect language
_, probs = model.detect_language(mel)
detected_lang = max(probs, key=probs.get)
print(f"Detected language: {detected_lang} ({probs[detected_lang]:.2%})")
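Detection only sees the 30-second window produced by `pad_or_trim`, so the top language can be uncertain for short or noisy clips. A small sketch of how the probability dictionary might be inspected follows. The helper names and the 0.5 threshold are assumptions, and the example probabilities are invented rather than real model output.

```python
def top_languages(probs, k=3):
    """Return the k most probable (language, probability) pairs, best first."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


def choose_language(probs, threshold=0.5, fallback=None):
    """Return the best language, or `fallback` if it is below the threshold."""
    best, prob = top_languages(probs, k=1)[0]
    return best if prob >= threshold else fallback


# Example probabilities (made up for illustration).
example = {"en": 0.91, "de": 0.05, "nl": 0.03, "fr": 0.01}
print(top_languages(example, k=2))  # [('en', 0.91), ('de', 0.05)]
print(choose_language(example))     # en
```

Returning `None` as the fallback fits the `language` argument of `transcribe`, where `None` leaves the choice to automatic detection.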
Faster-Whisper
from faster_whisper import WhisperModel
# Load model (uses CTranslate2 - significantly faster)
model = WhisperModel(
    "large-v3",
    device="cuda",
    compute_type="float16",  # float16, int8_float16, int8
)
# Transcribe
segments, info = model.transcribe(
    "audio.mp3",
    language="en",
    beam_size=5,
    word_timestamps=True,
    vad_filter=True,  # Filter out silence
)
print(f"Detected language: {info.language} (prob: {info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f} - {segment.end:.2f}] {segment.text}")
    if segment.words:
        for word in segment.words:
            print(f"    [{word.start:.2f} - {word.end:.2f}] {word.word}")
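# Command-line use: faster-whisper itself does not ship a CLI; the separate
# whisper-ctranslate2 package provides a Whisper-compatible one built on it
pip install whisper-ctranslate2
whisper-ctranslate2 audio.mp3 --model large-v3 --language en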
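One detail worth knowing is that faster-whisper returns `segments` as a generator, so transcription runs lazily while you iterate and the results can be read only once. The sketch below shows one way to materialize them for reuse. `collect_segments` and `fake_segments` are hypothetical, with the latter standing in for real model output so that the example runs without a GPU.

```python
from types import SimpleNamespace


def collect_segments(segments):
    """Consume a segment iterator and return plain dicts that can be reused."""
    return [
        {"start": seg.start, "end": seg.end, "text": seg.text.strip()}
        for seg in segments
    ]


def fake_segments():
    """Stand-in generator; real segments come from model.transcribe()."""
    yield SimpleNamespace(start=0.0, end=2.0, text=" Hello there.")
    yield SimpleNamespace(start=2.0, end=4.5, text=" General remarks.")


stream = fake_segments()
first_pass = collect_segments(stream)
second_pass = collect_segments(stream)  # the generator is already exhausted
print(len(first_pass), len(second_pass))  # 2 0
```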
WhisperX (Alignment and Diarization)
import whisperx
# Load model
model = whisperx.load_model("large-v3", device="cuda", compute_type="float16")
# Transcribe
audio = whisperx.load_audio("audio.mp3")
result = model.transcribe(audio, batch_size=16)
# Align timestamps at word level
model_a, metadata = whisperx.load_align_model(language_code="en", device="cuda")
result = whisperx.align(
    result["segments"], model_a, metadata, audio, device="cuda"
)
# Speaker diarization (requires a Hugging Face token)
# Note: depending on the installed WhisperX version, this class may have to be
# referenced as whisperx.diarize.DiarizationPipeline instead
diarize_model = whisperx.DiarizationPipeline(
    use_auth_token="YOUR_HF_TOKEN", device="cuda"
)
diarize_segments = diarize_model(audio)
# Assign speakers to words
result = whisperx.assign_word_speakers(diarize_segments, result)
for segment in result["segments"]:
    speaker = segment.get("speaker", "UNKNOWN")
    print(f"[{speaker}] {segment['text']}")
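Diarized output often contains several consecutive segments from the same speaker. A possible post-processing step, sketched here under the assumption that each segment is a dictionary with `start`, `end`, `text`, and an optional `speaker` key as in the loop above, is to merge them into speaker turns. `merge_speaker_turns` is a hypothetical helper and the sample segments are invented.

```python
def merge_speaker_turns(segments):
    """Merge consecutive segments that share a speaker into single turns."""
    turns = []
    for seg in segments:
        speaker = seg.get("speaker", "UNKNOWN")
        text = seg["text"].strip()
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] += " " + text
        else:
            turns.append({"speaker": speaker, "start": seg["start"],
                          "end": seg["end"], "text": text})
    return turns


# Invented sample data for illustration.
sample_segments = [
    {"start": 0.0, "end": 1.5, "text": " Hi.", "speaker": "SPEAKER_00"},
    {"start": 1.5, "end": 3.0, "text": " How are you?", "speaker": "SPEAKER_00"},
    {"start": 3.0, "end": 4.0, "text": " Fine, thanks.", "speaker": "SPEAKER_01"},
]

for turn in merge_speaker_turns(sample_segments):
    print(f"[{turn['speaker']}] {turn['text']}")
# [SPEAKER_00] Hi. How are you?
# [SPEAKER_01] Fine, thanks.
```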
Output Formats
# Generate all output formats
whisper audio.mp3 --model large-v3 --output_format all
Generated files include:
| Format | Extension | Description |
|---|---|---|
| Text | .txt | Plain text transcript |
| SRT | .srt | SubRip subtitle format |
| VTT | .vtt | WebVTT subtitle format |
| TSV | .tsv | Tab-separated with timestamps |
| JSON | .json | Full results with metadata |
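To make the SRT row concrete, the sketch below renders segments by hand: a running index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time line, and the text, with blank lines between entries. It is an illustration of the file layout only, not Whisper's own writer, and both helper names are hypothetical.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms_total = round(seconds * 1000)
    hours, rem = divmod(ms_total, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render segment dicts (start, end, text) as SRT subtitle text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        time_line = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{index}\n{time_line}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


print(srt_timestamp(3661.5))  # 01:01:01,500
print(to_srt([
    {"start": 0.0, "end": 2.5, "text": " Hello."},
    {"start": 2.5, "end": 4.0, "text": " World."},
]))
```

For real files, the writer utilities bundled with Whisper are the safer choice, since they also handle options such as line width.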
Writing Output Programmatically
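import os
import whisper
from whisper.utils import get_writer
model = whisper.load_model("large-v3")
result = model.transcribe("audio.mp3")
# The writer does not create the output directory, so make sure it exists
os.makedirs("./output", exist_ok=True)
# Write SRT file
writer = get_writer("srt", "./output")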
writer(result, "audio.mp3")
# Write VTT file
writer = get_writer("vtt", "./output")
writer(result, "audio.mp3")
# Write JSON
writer = get_writer("json", "./output")
writer(result, "audio.mp3")
Batch Processing
import os
import whisper
model = whisper.load_model("large-v3")
audio_dir = "./audio_files"
for filename in os.listdir(audio_dir):
    if filename.endswith((".mp3", ".wav", ".m4a", ".flac")):
        filepath = os.path.join(audio_dir, filename)
        result = model.transcribe(filepath)
        print(f"\n--- {filename} ---")
        print(result["text"])
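For long-running batches it is often useful to skip files that already have a transcript, so an interrupted run can be resumed. The sketch below covers only the file selection step and uses empty placeholder files, so it runs without Whisper installed. `pending_audio_files`, the `.txt` naming scheme, and the case-insensitive extension check are assumptions rather than part of the loop above.

```python
import tempfile
from pathlib import Path

AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac"}


def pending_audio_files(audio_dir, output_dir):
    """Return audio files in audio_dir that have no <stem>.txt in output_dir."""
    audio_dir, output_dir = Path(audio_dir), Path(output_dir)
    pending = []
    for path in sorted(audio_dir.iterdir()):
        if path.suffix.lower() not in AUDIO_EXTENSIONS:
            continue  # not an audio file
        if (output_dir / f"{path.stem}.txt").exists():
            continue  # already transcribed
        pending.append(path)
    return pending


# Self-contained demo using empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_audio = Path(tmp) / "audio_files"
    demo_out = Path(tmp) / "transcripts"
    demo_audio.mkdir()
    demo_out.mkdir()
    for name in ("a.mp3", "b.WAV", "notes.txt"):
        (demo_audio / name).touch()
    (demo_out / "a.txt").touch()  # pretend a.mp3 was already transcribed
    demo_pending = [p.name for p in pending_audio_files(demo_audio, demo_out)]

print(demo_pending)  # ['b.WAV']
```

Each returned path can be passed to `model.transcribe(str(path))`, with the text then written to the matching `.txt` file.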
Common Options
| Option | Description | Default |
|---|---|---|
| --model | Model size | turbo |
| --language | Audio language code | Auto-detect |
| --task | transcribe or translate | transcribe |
| --output_format | txt, vtt, srt, tsv, json, all | all |
| --output_dir | Output directory | . |
| --device | cuda or cpu | cuda if available, otherwise cpu |
| --word_timestamps | Enable word-level timestamps | False |
| --beam_size | Beam search width | 5 |
| --temperature | Sampling temperature | 0 |
| --fp16 | Use FP16 inference | True (falls back to FP32 on CPU) |