# faster-whisper
Speech-to-text powered by SkillBoss API Hub (cloud mode) or local faster-whisper (local mode). When SKILLBOSS_API_KEY is set, transcription is automatically routed to the best available STT model via https://api.heybossai.com/v1/pilot — no local model download required. In local mode, faster-whisper runs 4-6x faster than OpenAI's original Whisper with identical accuracy; with GPU acceleration, expect ~20x realtime transcription (a 10-minute audio file in ~30 seconds).
## When to Use

Use this skill when you need to:

- Transcribe audio/video files — meetings, interviews, podcasts, lectures, YouTube videos
- Generate subtitles — SRT, VTT, ASS, LRC, or TTML broadcast-standard subtitles
- Identify speakers — diarization labels who said what (--diarize)
- Transcribe from URLs — YouTube links and direct audio URLs (auto-downloads via yt-dlp)
- Transcribe podcast feeds — --rss
CORE RULE: The default command (./scripts/transcribe audio.mp3) is the fastest path — add flags only when the user explicitly asks for that capability.
Transcription:
- Only add --diarize if the user asks "who said what" / "identify speakers" / "label speakers"
- Only add --format srt/vtt/ass/lrc/ttml if the user asks for subtitles/captions in that format
- Only add --format csv if the user asks for CSV or spreadsheet output
- Only add --word-timestamps if the user needs word-level timing
- Only add --initial-prompt if there's domain-specific jargon to prime
- Only add --translate if the user wants non-English audio translated to English
- Only add --normalize/--denoise if the user mentions bad audio quality or noise
- Only add --stream if the user wants live/progressive output for long files
- Only add --clip-timestamps if the user wants a specific time range
- Only add --temperature 0.0 if the model is hallucinating on music/silence
- Only add --vad-threshold if VAD is aggressively cutting speech or including noise
- Only add --min-speakers/--max-speakers when you know the speaker count
- Only add --hf-token if the token is not cached at ~/.cache/huggingface/token
- Only add --max-words-per-line for subtitle readability on long segments
- Only add --filter-hallucinations if the transcript contains obvious artifacts (music markers, duplicates)
- Only add --merge-sentences if the user asks for sentence-level subtitle cues
- Only add --clean-filler if the user asks to remove filler words (um, uh, you know, I mean, hesitation sounds)
- Only add --channel left|right if the user mentions stereo tracks, dual-channel recordings, or asks for a specific channel
- Only add --max-chars-per-line N when the user specifies a character limit per subtitle line (e.g., "Netflix format", "42 chars per line"); takes priority over --max-words-per-line
- Only add --detect-paragraphs if the user asks for paragraph breaks or structured text output; --paragraph-gap (default 3.0s) only if they want a custom gap
- Only add --speaker-names "Alice,Bob" when the user provides real names to replace SPEAKER_1/2 — always requires --diarize
- Only add --hotwords WORDS when the user names specific rare terms not well served by --initial-prompt; prefer --initial-prompt for general domain jargon
- Only add --prefix TEXT when the user knows the exact words the audio starts with
- Only add --detect-language-only when the user only wants to identify the language, not transcribe
- Only add --stats-file PATH if the user asks for performance stats, RTF, or benchmark info
- Only add --parallel N for large CPU batch jobs; GPU handles one file efficiently on its own — don't add for single files or small batches
- Only add --retries N for unreliable inputs (URLs, network files) where transient failures are expected
- Only add --burn-in OUTPUT when the user explicitly asks to embed/burn subtitles into the video; requires ffmpeg and a video file input
- Only add --keep-temp when the user may re-process the same URL, to avoid re-downloading
- Only add --output-template when the user specifies a custom naming pattern in batch mode
- Multi-format output (--format srt,text): only when the user explicitly wants multiple formats in one pass; always pair with -o
Search:
- Only add --search "term" when the user asks to find/locate/search for a specific word or phrase in audio
- --search replaces the normal transcript output — it prints only matching segments with timestamps
- Add --search-fuzzy only when the user mentions approximate/partial matching or typos
- To save search results to a file, use -o results.txt

Chapter detection:

- Only add --detect-chapters when the user asks for chapters, sections, a table of contents, or "where does the topic change"
- Default --chapter-gap 8 (8-second silence = new chapter) works for most podcasts/lectures; tune down for dense content
- --chapter-format youtube (default) outputs YouTube-ready timestamps; use json for programmatic use
- Always use --chapters-file PATH when combining chapters with a transcript output — avoids mixing chapter markers into the transcript text
- If the user only wants chapters (not the transcript), discard the transcript with -o /dev/null and use --chapters-file
- Batch mode limitation: --chapters-file takes a single path — in batch mode, each file's chapters overwrite the previous. For batch chapter detection, omit --chapters-file (chapters print to stdout under === CHAPTERS (N) ===) or use a separate run per file

Speaker audio export:

- Only add --export-speakers DIR when the user explicitly asks to save each speaker's audio separately
- Always pair with --diarize — it silently skips if no speaker labels are present
- Requires ffmpeg; outputs SPEAKER_1.wav, SPEAKER_2.wav, etc. (or real names if --speaker-names is set)

Language map:

- Only add --language-map in batch mode when the user has confirmed different languages across files
- Inline format: "interview*.mp3=en,lecture*.mp3=fr" — fnmatch globs on filename
- JSON file format: @/path/to/map.json where the file is {"pattern": "lang_code"} — a minimal sketch of a map file follows this list

RSS / Podcast:

- Only add --rss URL when the user provides a podcast RSS feed URL
- Default fetches 5 newest episodes; --rss-latest 0 for all; --skip-existing to resume safely
- Always use -o
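For the JSON form of --language-map, the file is a plain {pattern: lang} object. A minimal sketch that writes one — the langs.json name and the patterns are illustrative:

```python
import json

# fnmatch globs on filename or stem → language codes, mirroring the
# documented {"pattern": "lang_code"} shape. Patterns here are examples.
language_map = {
    "interview*.mp3": "en",
    "lecture*.mp3": "fr",
    "keynote.wav": "de",
}

with open("langs.json", "w") as f:
    json.dump(language_map, f, indent=2)

# Then: ./scripts/transcribe *.mp3 --language-map @langs.json -o ./out/
```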
## Quick Reference

| Task | Command | Notes |
|---|---|---|
| Basic transcription | `./scripts/transcribe audio.mp3` | Batched inference, VAD on, distil-large-v3.5 |
| SRT subtitles | `./scripts/transcribe audio.mp3 --format srt -o subs.srt` | Word timestamps auto-enabled |
| VTT subtitles | `./scripts/transcribe audio.mp3 --format vtt -o subs.vtt` | WebVTT format |
| Word timestamps | `./scripts/transcribe audio.mp3 --word-timestamps --format srt` | wav2vec2 aligned (~10ms) |
| Speaker diarization | `./scripts/transcribe audio.mp3 --diarize` | Requires pyannote.audio |
| Translate → English | `./scripts/transcribe audio.mp3 --translate` | Any language → English |
| Stream output | `./scripts/transcribe audio.mp3 --stream` | Live segments as transcribed |
| Clip time range | `./scripts/transcribe audio.mp3 --clip-timestamps "30,60"` | Only 30s–60s |
| Denoise + normalize | `./scripts/transcribe audio.mp3 --denoise --normalize` | Clean up noisy audio first |
| Reduce hallucination | `./scripts/transcribe audio.mp3 --hallucination-silence-threshold 1.0` | Skip hallucinated silence |
| YouTube/URL | `./scripts/transcribe https://youtube.com/watch?v=...` | Auto-downloads via yt-dlp |
| Batch process | `./scripts/transcribe *.mp3 -o ./transcripts/` | Output to directory |
| Batch with skip | `./scripts/transcribe *.mp3 --skip-existing -o ./out/` | Resume interrupted batches |
| Domain terms | `./scripts/transcribe audio.mp3 --initial-prompt 'Kubernetes gRPC'` | Boost rare terminology |
| Hotwords boost | `./scripts/transcribe audio.mp3 --hotwords 'JIRA Kubernetes'` | Bias decoder toward specific words |
| Prefix conditioning | `./scripts/transcribe audio.mp3 --prefix 'Good morning,'` | Seed the first segment with known opening words |
| Pin model version | `./scripts/transcribe audio.mp3 --revision v1.2.0` | Reproducible transcription with a pinned revision |
| Debug library logs | `./scripts/transcribe audio.mp3 --log-level debug` | Show faster_whisper internal logs |
| Turbo model | `./scripts/transcribe audio.mp3 -m turbo` | Alias for large-v3-turbo |
| Faster English | `./scripts/transcribe audio.mp3 --model distil-medium.en -l en` | English-only, 6.8x faster |
| Maximum accuracy | `./scripts/transcribe audio.mp3 --model large-v3 --beam-size 10` | Full model |
| JSON output | `./scripts/transcribe audio.mp3 --format json -o out.json` | Programmatic access with stats |
| Filter noise | `./scripts/transcribe audio.mp3 --min-confidence 0.6` | Drop low-confidence segments |
| Hybrid quantization | `./scripts/transcribe audio.mp3 --compute-type int8_float16` | Save VRAM, minimal quality loss |
| Reduce batch size | `./scripts/transcribe audio.mp3 --batch-size 4` | If OOM on GPU |
| TSV output | `./scripts/transcribe audio.mp3 --format tsv -o out.tsv` | OpenAI Whisper–compatible TSV |
| Fix hallucinations | `./scripts/transcribe audio.mp3 --temperature 0.0 --no-speech-threshold 0.8` | Lock temperature + skip silence |
| Tune VAD sensitivity | `./scripts/transcribe audio.mp3 --vad-threshold 0.6 --min-silence-duration 500` | Tighter speech detection |
| Known speaker count | `./scripts/transcribe meeting.wav --diarize --min-speakers 2 --max-speakers 3` | Constrain diarization |
| Subtitle word wrapping | `./scripts/transcribe audio.mp3 --format srt --word-timestamps --max-words-per-line 8` | Split long cues |
| Private/gated model | `./scripts/transcribe audio.mp3 --hf-token hf_xxx` | Pass token directly |
| Show version | `./scripts/transcribe --version` | Print faster-whisper version |
| Upgrade in-place | `./setup.sh --update` | Upgrade without full reinstall |
| System check | `./setup.sh --check` | Verify GPU, Python, ffmpeg, venv, yt-dlp, pyannote |
| Detect language only | `./scripts/transcribe audio.mp3 --detect-language-only` | Fast language ID, no transcription |
| Detect language JSON | `./scripts/transcribe audio.mp3 --detect-language-only --format json` | Machine-readable language detection |
| LRC subtitles | `./scripts/transcribe audio.mp3 --format lrc -o lyrics.lrc` | Timed lyrics format for music players |
| ASS subtitles | `./scripts/transcribe audio.mp3 --format ass -o subtitles.ass` | Advanced SubStation Alpha (Aegisub, mpv, VLC) |
| Merge sentences | `./scripts/transcribe audio.mp3 --format srt --merge-sentences` | Join fragments into sentence chunks |
| Stats sidecar | `./scripts/transcribe audio.mp3 --stats-file stats.json` | Write perf stats JSON after transcription |
| Batch stats | `./scripts/transcribe *.mp3 --stats-file ./stats/` | One stats file per input in dir |
| Template naming | `./scripts/transcribe audio.mp3 -o ./out/ --output-template "{stem}_{lang}.{ext}"` | Custom batch output filenames |
| Stdin input | `ffmpeg -i input.mp4 -f wav - \| ./scripts/transcribe -` | Pipe audio directly from stdin |
| Custom model dir | `./scripts/transcribe audio.mp3 --model-dir ~/my-models` | Custom HuggingFace cache dir |
| Local model | `./scripts/transcribe audio.mp3 -m ./my-model-ct2` | CTranslate2 model dir |
| HTML transcript | `./scripts/transcribe audio.mp3 --format html -o out.html` | Confidence-colored |
| Burn subtitles | `./scripts/transcribe video.mp4 --burn-in output.mp4` | Requires ffmpeg + video input |
| Name speakers | `./scripts/transcribe audio.mp3 --diarize --speaker-names "Alice,Bob"` | Replaces SPEAKER_1/2 |
| Filter hallucinations | `./scripts/transcribe audio.mp3 --filter-hallucinations` | Removes artifacts |
| Keep temp files | `./scripts/transcribe https://... --keep-temp` | For URL re-processing |
| Parallel batch | `./scripts/transcribe *.mp3 --parallel 4 -o ./out/` | CPU multi-file |
| RTX 3070 recommended | `./scripts/transcribe audio.mp3 --compute-type int8_float16` | Saves ~1GB VRAM, minimal quality loss |
| CPU thread count | `./scripts/transcribe audio.mp3 --threads 8` | Force CPU thread count (default: auto) |
| Podcast RSS (latest 5) | `./scripts/transcribe --rss https://feeds.example.com/podcast.xml` | Downloads & transcribes newest 5 episodes |
| Podcast RSS (all episodes) | `./scripts/transcribe --rss https://... --rss-latest 0 -o ./episodes/` | All episodes, one file each |
| Podcast + SRT subtitles | `./scripts/transcribe --rss https://... --format srt -o ./subs/` | Subtitle all episodes |
| Retry on failure | `./scripts/transcribe *.mp3 --retries 3 -o ./out/` | Retry up to 3× with backoff on error |
| CSV output | `./scripts/transcribe audio.mp3 --format csv -o out.csv` | Spreadsheet-ready with header row; properly quoted |
| CSV with speakers | `./scripts/transcribe audio.mp3 --diarize --format csv -o out.csv` | Adds speaker column |
| Language map (inline) | `./scripts/transcribe *.mp3 --language-map "interview*.mp3=en,lecture.wav=fr"` | Per-file language in batch |
| Language map (JSON) | `./scripts/transcribe *.mp3 --language-map @langs.json` | JSON file: {"pattern": "lang"} |
| Batch with ETA | `./scripts/transcribe *.mp3 -o ./out/` | Automatic ETA shown for each file in batch |
| TTML subtitles | `./scripts/transcribe audio.mp3 --format ttml -o subtitles.ttml` | Broadcast-standard DFXP/TTML (Netflix, BBC, Amazon) |
| TTML with speaker labels | `./scripts/transcribe audio.mp3 --diarize --format ttml -o subtitles.ttml` | Speaker-labeled TTML |
| Search transcript | `./scripts/transcribe audio.mp3 --search "keyword"` | Find timestamps where keyword appears |
| Search to file | `./scripts/transcribe audio.mp3 --search "keyword" -o results.txt` | Save search results |
| Fuzzy search | `./scripts/transcribe audio.mp3 --search "aproximate" --search-fuzzy` | Approximate/partial matching |
| Detect chapters | `./scripts/transcribe audio.mp3 --detect-chapters` | Auto-detect chapters from silence gaps |
| Chapter gap tuning | `./scripts/transcribe audio.mp3 --detect-chapters --chapter-gap 5` | Chapters on gaps ≥5s (default: 8s) |
| Chapters to file | `./scripts/transcribe audio.mp3 --detect-chapters --chapters-file ch.txt` | Save YouTube-format chapter list |
| Chapters JSON | `./scripts/transcribe audio.mp3 --detect-chapters --chapter-format json` | Machine-readable chapter list |
| Export speaker audio | `./scripts/transcribe audio.mp3 --diarize --export-speakers ./speakers/` | Save each speaker's audio to separate WAV files |
| Multi-format output | `./scripts/transcribe audio.mp3 --format srt,text -o ./out/` | Write SRT + TXT in one pass |
| Remove filler words | `./scripts/transcribe audio.mp3 --clean-filler` | Strip um/uh/er/ah/hmm and discourse markers |
| Left channel only | `./scripts/transcribe audio.mp3 --channel left` | Extract left stereo channel before transcribing |
| Right channel only | `./scripts/transcribe audio.mp3 --channel right` | Extract right stereo channel |
| Max chars per line | `./scripts/transcribe audio.mp3 --format srt --max-chars-per-line 42` | Character-based subtitle wrapping |
| Detect paragraphs | `./scripts/transcribe audio.mp3 --detect-paragraphs` | Insert paragraph breaks in text output |
| Paragraph gap tuning | `./scripts/transcribe audio.mp3 --detect-paragraphs --paragraph-gap 5.0` | Tune gap threshold (default 3.0s) |
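When driving the skill from another tool, JSON is the easiest format to consume. A minimal sketch using the documented CLI flags and the JSON schema shown under Output Formats (filenames are illustrative; per-segment fields can vary by version):

```python
import json
import subprocess

# Invoke the documented CLI and read the JSON result back.
subprocess.run(
    ["./scripts/transcribe", "audio.mp3", "--format", "json", "-o", "result.json"],
    check=True,
)

with open("result.json") as f:
    result = json.load(f)

print(result["language"], f'{result["stats"]["realtime_factor"]}x realtime')
for seg in result["segments"]:
    # start/end/text are typical Whisper segment fields — verify against
    # your own output before relying on them.
    print(seg.get("start"), seg.get("end"), seg.get("text"))
```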
## Model Selection
Choose the right model for your needs:
```dot
digraph model_selection {
rankdir=LR;
node [shape=box, style=rounded];
start [label="Start", shape=doublecircle];
need_accuracy [label="Need maximum\naccuracy?", shape=diamond];
multilingual [label="Multilingual\ncontent?", shape=diamond];
resource_constrained [label="Resource\nconstraints?", shape=diamond];
large_v3 [label="large-v3\nor\nlarge-v3-turbo", style="rounded,filled", fillcolor=lightblue];
large_turbo [label="large-v3-turbo", style="rounded,filled", fillcolor=lightblue];
distil_large [label="distil-large-v3.5\n(default)", style="rounded,filled", fillcolor=lightgreen];
distil_medium [label="distil-medium.en", style="rounded,filled", fillcolor=lightyellow];
distil_small [label="distil-small.en", style="rounded,filled", fillcolor=lightyellow];
start -> need_accuracy;
need_accuracy -> large_v3 [label="yes"];
need_accuracy -> multilingual [label="no"];
multilingual -> large_turbo [label="yes"];
multilingual -> resource_constrained [label="no (English)"];
resource_constrained -> distil_small [label="mobile/edge"];
resource_constrained -> distil_medium [label="some limits"];
resource_constrained -> distil_large [label="no"];
}
```
## Model Table

### Standard Models (Full Whisper)

| Model | Size | Speed | Accuracy | Use Case |
|---|---|---|---|---|
| tiny / tiny.en | 39M | Fastest | Basic | Quick drafts |
| base / base.en | 74M | Very fast | Good | General use |
| small / small.en | 244M | Fast | Better | Most tasks |
| medium / medium.en | 769M | Moderate | High | Quality transcription |
| large-v1/v2/v3 | 1.5GB | Slower | Best | Maximum accuracy |
| large-v3-turbo | 809M | Fast | Excellent | High accuracy (slower than distil) |
### Distilled Models (~6x Faster, <1% WER Difference)

| Model | Size | Speed vs Standard | Accuracy | Use Case |
|---|---|---|---|---|
| distil-large-v3.5 | 756M | 6.3x faster | 7.08% WER | Default, best balance |
| distil-large-v3 | 756M | 5.8x faster | 10.1% WER | Fallback |
| distil-large-v2 | 756M | 6.3x faster | 7.53% WER | Previous default |
| distil-medium.en | 394M | 6.8x faster | 11.1% WER | English-only, resource-constrained |
| distil-small.en | 166M | ~5.6x faster | 12.1% WER | Mobile/edge devices |
.en models are English-only and slightly faster/better for English content.
Note for distil models: HuggingFace recommends disabling condition_on_previous_text for all distil models to prevent repetition loops. The script auto-applies --no-condition-on-previous-text whenever a distil-* model is detected. Pass --condition-on-previous-text to override if needed.
## Custom & Fine-tuned Models

WhisperModel accepts local CTranslate2 model directories and HuggingFace repo names — no code changes needed.

```bash
# Load a local CTranslate2 model
./scripts/transcribe audio.mp3 --model /path/to/my-model-ct2
```
```bash
# Convert a HuggingFace model to CTranslate2
pip install ctranslate2
ct2-transformers-converter \
  --model openai/whisper-large-v3 \
  --output_dir whisper-large-v3-ct2 \
  --copy_files tokenizer.json preprocessor_config.json \
  --quantization float16

./scripts/transcribe audio.mp3 --model ./whisper-large-v3-ct2
```
```bash
# Load a model by HuggingFace repo name (auto-downloads)
./scripts/transcribe audio.mp3 --model username/whisper-large-v3-ct2
```
Custom model cache directory: by default, models are cached in ~/.cache/huggingface/. Use --model-dir to override:

```bash
./scripts/transcribe audio.mp3 --model-dir ~/my-models
```
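To skip the CLI entirely, the same paths load through faster-whisper's Python API. A minimal sketch (model directory and audio filename are illustrative):

```python
from faster_whisper import WhisperModel

# A local CTranslate2 directory or a HuggingFace repo name both work here.
model = WhisperModel("./whisper-large-v3-ct2", device="auto")

segments, info = model.transcribe("audio.mp3")
print(f"Detected language: {info.language} ({info.language_probability:.2f})")
for seg in segments:
    print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")
```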
## Setup

Linux / macOS / WSL2:
```bash
# Base install (creates venv, installs deps, auto-detects GPU)
./setup.sh
# With speaker diarization support
./setup.sh --diarize
```
Requirements:

- Python 3.10+
- ffmpeg is not required for basic transcription — PyAV (bundled with faster-whisper) handles audio decoding. ffmpeg is only needed for --burn-in, --normalize, and --denoise.
- Optional: yt-dlp (for URL/YouTube input)
- Optional: pyannote.audio (for --diarize, installed via setup.sh --diarize)
## Platform Support

| Platform | Acceleration | Speed |
|---|---|---|
| Linux + NVIDIA GPU | CUDA | 20x realtime 🚀 |
| WSL2 + NVIDIA GPU | CUDA | 20x realtime 🚀 |
| macOS Apple Silicon | CPU* | 3-5x realtime |
| macOS Intel | CPU | 1-2x realtime |
| Linux (no GPU) | CPU | ~1x realtime |
\*faster-whisper uses CTranslate2, which is CPU-only on macOS, but Apple Silicon is fast enough for practical use.
## GPU Support (IMPORTANT!)

The setup script auto-detects your GPU and installs PyTorch with CUDA. Always use GPU if available — CPU transcription is extremely slow.

| Hardware | Speed | 9-min video |
|---|---|---|
| RTX 3070 (GPU) | 20x realtime | 27 sec |
| CPU (int8) | 0.3x realtime | 30 min |
RTX 3070 tip: Use --compute-type int8_float16 for hybrid quantization — saves ~1GB VRAM with minimal quality loss. Ideal for running diarization alongside transcription.
If setup didn't detect your GPU, manually install PyTorch with CUDA:
```bash
# For CUDA 12.x
uv pip install --python .venv/bin/python torch --index-url https://download.pytorch.org/whl/cu121
# For CUDA 11.x
uv pip install --python .venv/bin/python torch --index-url https://download.pytorch.org/whl/cu118
```

WSL2 users: Ensure you have the NVIDIA CUDA drivers for WSL installed on Windows.

## Usage
```bash
# Basic transcription
./scripts/transcribe audio.mp3
# SRT subtitles
./scripts/transcribe audio.mp3 --format srt -o subtitles.srt
# WebVTT subtitles
./scripts/transcribe audio.mp3 --format vtt -o subtitles.vtt
# Transcribe from YouTube URL
./scripts/transcribe https://youtube.com/watch?v=dQw4w9WgXcQ --language en
# Speaker diarization
./scripts/transcribe meeting.wav --diarize
# Diarized VTT subtitles
./scripts/transcribe meeting.wav --diarize --format vtt -o meeting.vtt
# Prime with domain terminology
./scripts/transcribe lecture.mp3 --initial-prompt "Kubernetes, gRPC, PostgreSQL, NGINX"
# Batch process a directory
./scripts/transcribe ./recordings/ -o ./transcripts/
# Batch with glob, skip already-done files
./scripts/transcribe *.mp3 --skip-existing -o ./transcripts/
# Filter low-confidence segments
./scripts/transcribe noisy-audio.mp3 --min-confidence 0.6
# JSON output with full metadata
./scripts/transcribe audio.mp3 --format json -o result.json
# Specify language (faster than auto-detect)
./scripts/transcribe audio.mp3 --language en
```
## Options

Input:

- `AUDIO` Audio file(s), directory, glob pattern, or URL
- Accepts: mp3, wav, m4a, flac, ogg, webm, mp4, mkv, avi, wma, aac
- URLs auto-download via yt-dlp (YouTube, direct links, etc.)

Model & Language:

- `-m, --model NAME` Whisper model (default: distil-large-v3.5; "turbo" = large-v3-turbo)
- `--revision REV` Model revision (git branch/tag/commit) to pin a specific version
- `-l, --language CODE` Language code, e.g. en, es, fr (auto-detects if omitted)
- `--initial-prompt TEXT` Prompt to condition the model (terminology, formatting style)
- `--prefix TEXT` Prefix to condition the first segment (e.g. known starting words)
- `--hotwords WORDS` Space-separated hotwords to boost recognition
- `--translate` Translate any language to English (instead of transcribing)
- `--multilingual` Enable multilingual/code-switching mode (helps smaller models)
- `--hf-token TOKEN` HuggingFace token for private/gated models and diarization
- `--model-dir PATH` Custom model cache directory (default: ~/.cache/huggingface/)

Output Format:

- `-f, --format FMT` text | json | srt | vtt | tsv | csv | lrc | html | ass | ttml (default: text). Accepts a comma-separated list: --format srt,text writes both in one pass; multi-format requires -o
- `--output-template` Custom output filename pattern in batch mode. Example: "{stem}_{lang}.{ext}" → "interview_en.srt"
Inference Tuning:

- `--beam-size N` Beam search size; higher = more accurate but slower (default: 5)
- `--temperature T` Sampling temperature or comma-separated fallback list, e.g. '0.0' or '0.0,0.2,0.4' (default: faster-whisper's schedule)
- `--no-speech-threshold PROB` Probability threshold to mark segments as silence (default: 0.6)
- `--batch-size N` Batched inference batch size (default: 8; reduce if OOM)
- `--no-vad` Disable voice activity detection (on by default)
- `--vad-threshold T` VAD speech probability threshold (default: 0.5)
- `--vad-neg-threshold T` VAD negative threshold for ending speech (default: auto)
- `--vad-onset T` Alias for --vad-threshold (legacy)
- `--vad-offset T` Alias for --vad-neg-threshold (legacy)
- `--min-speech-duration MS` Minimum speech segment duration in ms (default: 0)
- `--max-speech-duration SEC` Maximum speech segment duration in seconds (default: unlimited)
- `--min-silence-duration MS` Minimum silence before splitting a segment in ms (default: 2000)
- `--speech-pad MS` Padding around speech segments in ms (default: 400)
- `--no-batch` Disable batched inference (use standard WhisperModel)
- `--hallucination-silence-threshold SEC` Skip silent sections where the model hallucinates (e.g. 1.0)
- `--no-condition-on-previous-text` Don't condition on previous text (reduces repetition/hallucination loops; auto-enabled for distil models per HuggingFace recommendation)
- `--condition-on-previous-text` Force-enable conditioning on previous text (overrides auto-disable for distil models)
- `--compression-ratio-threshold RATIO` Filter segments above this compression ratio (default: 2.4)
- `--log-prob-threshold PROB` Filter segments below this avg log probability (default: -1.0)
- `--max-new-tokens N` Maximum tokens per segment (prevents runaway generation)
- `--clip-timestamps RANGE` Transcribe specific time ranges: '30,60' or '0,30;60,90' (seconds)
- `--progress` Show transcription progress bar
- `--best-of N` Candidates when sampling with non-zero temperature (default: 5)
- `--patience F` Beam search patience factor (default: 1.0)
- `--repetition-penalty F` Penalty for repeated tokens (default: 1.0)
- `--no-repeat-ngram-size N` Prevent n-gram repetitions of this size (default: 0 = off)

Advanced Inference:

- `--no-timestamps` Output text without timing info (faster; incompatible with --word-timestamps, --format srt/vtt/tsv, --diarize)
- `--chunk-length N` Audio chunk length in seconds for batched inference (default: auto)
- `--language-detection-threshold T` Confidence threshold for language auto-detection (default: 0.5)
- `--language-detection-segments N` Audio segments to sample for language detection (default: 1)
- `--length-penalty F` Beam search length penalty; >1 favors longer, <1 favors shorter (default: 1.0)
- `--prompt-reset-on-temperature T` Reset initial prompt when temperature fallback hits threshold (default: 0.5)
- `--no-suppress-blank` Disable blank token suppression (may help soft/quiet speech)
- `--suppress-tokens IDS` Comma-separated token IDs to suppress in addition to default -1
- `--max-initial-timestamp T` Maximum timestamp for the first segment in seconds (default: 1.0)
- `--prepend-punctuations CHARS` Punctuation characters merged into preceding word (default: "'¿([{-)
- `--append-punctuations CHARS` Punctuation characters merged into following word (default: "'.。,,!!??::")]}、")
Preprocessing:
- `--normalize` Normalize audio volume (EBU R128 loudnorm) before transcription
- `--denoise` Apply noise reduction (high-pass + FFT denoise) before transcription
Advanced:
- `--diarize` Speaker diarization (requires pyannote.audio)
- `--min-speakers N` Minimum number of speakers hint for diarization
- `--max-speakers N` Maximum number of speakers hint for diarization
- `--speaker-names NAMES` Comma-separated names to replace SPEAKER_1, SPEAKER_2 (e.g. 'Alice,Bob'); requires --diarize
- `--min-confidence PROB` Filter segments below this avg word confidence (0.0–1.0)
- `--skip-existing` Skip files whose output already exists (batch mode)
- `--detect-language-only` Detect language and exit (no transcription). Output: "Language: en (probability: 0.984)". With --format json: {"language": "en", "language_probability": 0.984}
- `--stats-file PATH` Write JSON stats sidecar after transcription (processing time, RTF, word count, etc.). Directory path → writes {stem}.stats.json inside; file path → exact path
- `--burn-in OUTPUT` Burn subtitles into the original video (single-file mode only; requires ffmpeg)
- `--filter-hallucinations` Filter common Whisper hallucinations: music/applause markers, duplicate segments, 'Thank you for watching', lone punctuation, etc.
- `--keep-temp` Keep temp files from URL downloads (useful for re-processing without re-downloading)
- `--parallel N` Number of parallel workers for batch processing (default: sequential)
- `--retries N` Retry failed files up to N times with exponential backoff (default: 0; incompatible with --parallel)

Batch ETA: automatically shown for sequential batch jobs (no flag needed). After each file completes, the next file's progress line includes `[current/total] filename | ETA: Xm Ys`. The ETA is average time per file × remaining files, printed to stderr (surfaced to users via OpenClaw/Clawdbot output).

Language Map (per-file language override):

- `--language-map MAP` Per-file language override for batch mode. Two forms — inline: "interview*.mp3=en,lecture.wav=fr,keynote.wav=de"; JSON file: "@/path/to/map.json" (must be a {pattern: lang} dict)
- Patterns support fnmatch globs on filename or stem
- Priority: exact filename > exact stem > glob on filename > glob on stem > fallback
- Files not matched fall back to --language (or auto-detect if not set)

Transcript Search:

- `--search TERM` Search the transcript for TERM and print matching segments with timestamps. Replaces the normal transcript output (use -o to save results to a file). Case-insensitive exact substring match by default
- `--search-fuzzy` Enable fuzzy/approximate matching with --search (useful for typos, phonetic near-misses, or partial words; uses SequenceMatcher ratio ≥ 0.6)

Chapter Detection:

- `--detect-chapters` Auto-detect chapter/section breaks from silence gaps and print chapter markers. Output is printed after the transcript (or to --chapters-file)
- `--chapter-gap SEC` Minimum silence gap in seconds between consecutive segments to start a new chapter (default: 8.0). Tune down for dense speech, up for sparse content
- `--chapters-file PATH` Write chapter markers to this file (default: stdout after transcript)
- `--chapter-format FMT` youtube | text | json (default: youtube). youtube: "0:00 Chapter 1" (YouTube description ready); text: "Chapter 1: 00:00:00"; json: JSON array with chapter, start, title fields — a parsing sketch follows this options list

Speaker Audio Export:

- `--export-speakers DIR` After diarization, export each speaker's audio turns concatenated into separate WAV files saved in DIR. Requires --diarize and ffmpeg. Output: SPEAKER_1.wav, SPEAKER_2.wav, … (or real names if --speaker-names is set)

RSS / Podcast:

- `--rss URL` Podcast RSS feed URL — extracts audio enclosures and transcribes them. The AUDIO positional is optional when --rss is used
- `--rss-latest N` Number of most-recent episodes to process (default: 5; 0 = all episodes)
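A minimal sketch for consuming --chapter-format json output. The documented fields are chapter, start, and title; treating start as seconds is an assumption (it matches the youtube rendering's 0:00 style):

```python
import json

# Written via:
# ./scripts/transcribe audio.mp3 --detect-chapters --chapter-format json --chapters-file chapters.json
with open("chapters.json") as f:
    chapters = json.load(f)  # documented fields: chapter, start, title

for ch in chapters:
    start = int(ch["start"])  # assumed to be seconds
    print(f"{start // 60}:{start % 60:02d} {ch['title']}")
```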
Device:
- `--device DEV` auto | cpu | cuda (default: auto)
- `--compute-type TYPE` auto | int8 | int8_float16 | float16 | float32 (default: auto). int8_float16 = hybrid mode for GPU (saves VRAM, minimal quality loss)
- `--threads N` CPU thread count for CTranslate2 (default: auto)
- `-q, --quiet` Suppress progress and status messages
- `--log-level LEVEL` faster_whisper library logging level: debug | info | warning | error (default: warning; use debug to see CTranslate2/VAD internals)
Utility:
- `--version` Print installed faster-whisper version and exit
- `--update` Upgrade faster-whisper in the skill venv and exit

## Output Formats

### Text (default)

Plain transcript text. With --diarize, speaker labels are inserted:

```
[SPEAKER_1] Hello, welcome to the meeting.
[SPEAKER_2] Thanks for having me.
```

### JSON (--format json)

Full metadata including segments, timestamps, language detection, and performance stats:

```json
{
  "file": "audio.mp3",
  "text": "Hello, welcome...",
  "language": "en",
  "language_probability": 0.98,
  "duration": 600.5,
  "segments": [...],
  "speakers": ["SPEAKER_1", "SPEAKER_2"],
  "stats": {"processing_time": 28.3, "realtime_factor": 21.2}
}
```

### SRT (--format srt)

Standard subtitle format for video players:

```
1
00:00:00,000 --> 00:00:02,500
[SPEAKER_1] Hello, welcome to the meeting.

2
00:00:02,800 --> 00:00:04,200
[SPEAKER_2] Thanks for having me.
```

### VTT (--format vtt)

WebVTT format for web video players:

```
WEBVTT

1
00:00:00.000 --> 00:00:02.500
[SPEAKER_1] Hello, welcome to the meeting.

2
00:00:02.800 --> 00:00:04.200
[SPEAKER_2] Thanks for having me.
```

### TSV (--format tsv)

Tab-separated values, OpenAI Whisper–compatible. Columns: start_ms, end_ms, text. No header row. Useful for piping into other tools or spreadsheets:

```
0	2500	Hello, welcome to the meeting.
2800	4200	Thanks for having me.
```

### ASS/SSA (--format ass)

Advanced SubStation Alpha format — supported by Aegisub, VLC, mpv, MPC-HC, and most video editors. Offers richer styling than SRT (font, size, color, position) via the [V4+ Styles] section:

```
[Script Info]
ScriptType: v4.00+
...

[V4+ Styles]
Style: Default,Arial,20,&H00FFFFFF,...

[Events]
Format: Layer, Start, End, Style, Name, ..., Text
Dialogue: 0,0:00:00.00,0:00:02.50,Default,,[SPEAKER_1] Hello, welcome.
Dialogue: 0,0:00:02.80,0:00:04.20,Default,,[SPEAKER_2] Thanks for having me.
```

Timestamps use H:MM:SS.cc (centiseconds). Edit the [V4+ Styles] block in Aegisub to customise font, color, and position without re-transcribing.

### LRC (--format lrc)

Timed lyrics format used by music players (e.g., Foobar2000, VLC, AIMP). Timestamps use [mm:ss.xx] where xx = centiseconds:

```
[00:00.50]Hello, welcome to the meeting.
[00:02.80]Thanks for having me.
```

With diarization, speaker labels are included:

```
[00:00.50][SPEAKER_1] Hello, welcome to the meeting.
[00:02.80][SPEAKER_2] Thanks for having me.
```

Default file extension: .lrc. Useful for music transcription, karaoke, and any workflow requiring timed text with music-player compatibility.

## Speaker Diarization

Identifies who spoke when using pyannote.audio.
Setup:
```bash
./setup.sh --diarize
```
Requirements:
- HuggingFace token at ~/.cache/huggingface/token (huggingface-cli login)
- Accepted model agreements:
  - https://hf.co/pyannote/speaker-diarization-3.1
  - https://hf.co/pyannote/segmentation-3.0
Usage:
```bash
# Basic diarization (text output)
./scripts/transcribe meeting.wav --diarize
# Diarized subtitles
./scripts/transcribe meeting.wav --diarize --format srt -o meeting.srt
# Diarized JSON (includes speakers list)
./scripts/transcribe meeting.wav --diarize --format json
```
Speakers are labeled SPEAKER_1, SPEAKER_2, etc. in order of first appearance. Diarization runs on GPU automatically if CUDA is available.

## Precise Word Timestamps

Whenever word-level timestamps are computed (--word-timestamps, --diarize, or --min-confidence), a wav2vec2 forced alignment pass automatically refines them from Whisper's ~100-200ms accuracy to ~10ms. No extra flag needed.
```bash
# Word timestamps with automatic wav2vec2 alignment
./scripts/transcribe audio.mp3 --word-timestamps --format json
# Diarization also gets precise alignment automatically
./scripts/transcribe meeting.wav --diarize
# Precise subtitles
./scripts/transcribe audio.mp3 --word-timestamps --format srt -o subtitles.srt
```
Uses the MMS (Massively Multilingual Speech) model from torchaudio — supports 1000+ languages. The model is cached after first load, so batch processing stays fast.

## URL & YouTube Input

Pass any URL as input — audio is downloaded automatically via yt-dlp:
```bash
# YouTube video
./scripts/transcribe https://youtube.com/watch?v=dQw4w9WgXcQ
# Direct audio URL
./scripts/transcribe https://example.com/podcast.mp3
# With options
./scripts/transcribe https://youtube.com/watch?v=... --language en --format srt -o subs.srt
```
Requires yt-dlp (checks PATH and ~/.local/share/pipx/venvs/yt-dlp/bin/yt-dlp).

## Batch Processing

Process multiple files at once with glob patterns, directories, or multiple paths:
```bash
# All MP3s in current directory
./scripts/transcribe *.mp3
# Entire directory (auto-filters audio files)
./scripts/transcribe ./recordings/
# Output to directory (one file per input)
./scripts/transcribe *.mp3 -o ./transcripts/
# Skip already-transcribed files (resume interrupted batch)
./scripts/transcribe *.mp3 --skip-existing -o ./transcripts/
# Mixed inputs
./scripts/transcribe file1.mp3 file2.wav ./more-recordings/
# Batch SRT subtitles
./scripts/transcribe *.mp3 --format srt -o ./subtitles/
```
When outputting to a directory, files are named {input-stem}.{ext} (e.g., audio.mp3 → audio.srt). Batch mode prints a summary after all files complete:

📊 Done: 12 files, 3h24m audio in 10m15s (19.9× realtime)

## Workflows

End-to-end pipelines for common use cases.

### Podcast Transcription Pipeline

Fetch and transcribe the latest 5 episodes from any podcast RSS feed:
```bash
# Transcribe latest 5 episodes → one .txt per episode
./scripts/transcribe --rss https://feeds.megaphone.fm/mypodcast -o ./transcripts/
# All episodes, as SRT subtitles
./scripts/transcribe --rss https://... --rss-latest 0 --format srt -o ./subtitles/
# Skip already-done episodes (safe to re-run)
./scripts/transcribe --rss https://... --skip-existing -o ./transcripts/
# With diarization (who said what) + retry on flaky network
./scripts/transcribe --rss https://... --diarize --retries 2 -o ./transcripts/
```
### Meeting Notes Pipeline

Transcribe a meeting recording with speaker labels, then output clean text:
```bash
# Diarize + name speakers (replace SPEAKER_1/2 with real names)
./scripts/transcribe meeting.wav --diarize --speaker-names "Alice,Bob" -o meeting.txt
# Diarized JSON for post-processing (summaries, action items)
./scripts/transcribe meeting.wav --diarize --format json -o meeting.json
# Stream live while it transcribes (long meetings)
./scripts/transcribe meeting.wav --stream
```
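Diarized JSON is the natural input for meeting summaries and action items. A minimal sketch grouping text by speaker — the per-segment speaker field is an assumption, so check a sample of your own --diarize JSON output first:

```python
import json
from collections import defaultdict

# meeting.json written via:
# ./scripts/transcribe meeting.wav --diarize --format json -o meeting.json
with open("meeting.json") as f:
    result = json.load(f)

# Group segment text per speaker. The per-segment "speaker" key is an
# assumption here — verify against your own diarized output.
by_speaker = defaultdict(list)
for seg in result["segments"]:
    by_speaker[seg.get("speaker", "UNKNOWN")].append(seg["text"].strip())

for speaker, lines in by_speaker.items():
    print(f"=== {speaker} ===")
    print(" ".join(lines))
```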
### Video Subtitle Pipeline

Generate ready-to-use subtitles for a video file:
```bash
# SRT subtitles with sentence merging (better readability)
./scripts/transcribe video.mp4 --format srt --merge-sentences -o subtitles.srt
# Burn subtitles directly into the video
./scripts/transcribe video.mp4 --format srt --burn-in video_subtitled.mp4
# Word-level SRT (karaoke-style), capped at 8 words per cue
./scripts/transcribe video.mp4 --format srt --word-timestamps --max-words-per-line 8 -o subs.srt
```
### YouTube Batch Pipeline

Transcribe multiple YouTube videos at once:
```bash
# One-liner: transcribe a playlist video + output SRT
./scripts/transcribe "https://youtube.com/watch?v=abc123" --format srt -o subs.srt
# Batch from a text file of URLs (one per line)
cat urls.txt | xargs ./scripts/transcribe -o ./transcripts/
# Download audio first, then transcribe (for re-use without re-downloading)
./scripts/transcribe https://youtube.com/watch?v=abc123 --keep-temp
```
### Noisy Audio Pipeline

Clean up poor-quality recordings before transcribing:
```bash
# Denoise + normalize, then transcribe
./scripts/transcribe interview.mp3 --denoise --normalize -o interview.txt
# Noisy batch with aggressive hallucination filtering
./scripts/transcribe *.mp3 --denoise --filter-hallucinations -o ./out/
```
### Batch Recovery Pipeline

Process a large folder with retries — safe to re-run after failures:
```bash
# Retry each failed file up to 3 times, skip already-done
./scripts/transcribe ./recordings/ --skip-existing --retries 3 -o ./transcripts/
# Check what failed (printed in batch summary at the end)
# Re-run the same command — skips successes, retries failures
```
## Cloud STT via SkillBoss API Hub

If you can't run faster-whisper locally (cloud environment, no GPU), call cloud speech-to-text through the SkillBoss API Hub's unified /v1/pilot endpoint. It automatically routes to the best available STT model — no need to specify a model name.

Quick call (curl):
```bash
# Send audio for transcription via SkillBoss API Hub (auto-routes to the best STT model)
# base64 -w0 is GNU coreutils; on macOS, `base64 -i audio.mp3` gives the same unwrapped output
AUDIO_B64=$(base64 -w0 audio.mp3)
curl -X POST https://api.heybossai.com/v1/pilot \
  -H "Authorization: Bearer $SKILLBOSS_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{\"type\": \"stt\", \"inputs\": {\"audio_data\": \"$AUDIO_B64\", \"filename\": \"audio.mp3\"}}"
```
Python example:
```python
import requests, os, base64
SKILLBOSS_API_KEY = os.environ["SKILLBOSS_API_KEY"]
API_BASE = "https://api.heybossai.com/v1"
def pilot(body: dict) -> dict:
    r = requests.post(
        f"{API_BASE}/pilot",
        headers={"Authorization": f"Bearer {SKILLBOSS_API_KEY}", "Content-Type": "application/json"},
        json=body,
        timeout=60,
    )
    r.raise_for_status()  # surface HTTP errors instead of parsing an error body
    return r.json()
audio_b64 = base64.b64encode(open("audio.mp3", "rb").read()).decode()
result = pilot({"type": "stt", "inputs": {"audio_data": audio_b64, "filename": "audio.mp3"}})
text = result["data"]["result"]["text"]
print(text)
```
This suits cloud environments, machines without a local GPU, and cases where you want to integrate transcription into other tools (Home Assistant, n8n, custom applications).
## Common Mistakes

| Mistake | Problem | Solution |
|---|---|---|
| Using CPU when GPU available | 10-20x slower transcription | Check nvidia-smi; verify CUDA installation |
| Not specifying language | Wastes time auto-detecting on known content | Use --language en when you know the language |
| Using wrong model | Unnecessary slowness or poor accuracy | Default distil-large-v3.5 is excellent; only use large-v3 if accuracy issues |
| Ignoring distilled models | Missing 6x speedup with <1% accuracy loss | Try distil-large-v3.5 before reaching for standard models |
| Forgetting ffmpeg | Setup fails or audio can't be processed | Setup script handles this; manual installs need ffmpeg separately |
| Out of memory errors | Model too large for available VRAM/RAM | Use smaller model, --compute-type int8, or --batch-size 4 |
| Over-engineering beam size | Diminishing returns past beam-size 5-7 | Default 5 is fine; try 10 for critical transcripts |
| --diarize without pyannote | Import error at runtime | Run setup.sh --diarize first |
| --diarize without HuggingFace token | Model download fails | Run huggingface-cli login and accept model agreements |
| URL input without yt-dlp | Download fails | Install: pipx install yt-dlp |
| --min-confidence too high | Drops good segments with natural pauses | Start at 0.5, adjust up; check JSON output for probabilities |
| Using --word-timestamps for basic transcription | Adds ~5-10s overhead for negligible benefit | Only use when word-level precision matters |
| Batch without -o directory | All output mixed in stdout | Use -o ./transcripts/ to write one file per input |
## Performance Notes
First run: Downloads model to ~/.cache/huggingface/ (one-time)
Batched inference: Enabled by default via BatchedInferencePipeline — ~3x faster than standard mode; VAD on by default
GPU: Automatically uses CUDA if available
Quantization: INT8 used on CPU for ~4x speedup with minimal accuracy loss
Performance stats: Every transcription shows audio duration, processing time, and realtime factor; --stats-file writes a JSON sidecar with these numbers (see the sketch below)
Benchmark (RTX 3070, 21-min file): ~24s with batched inference (both distil-large-v3 and v3.5) vs ~69s without
Precise alignment overhead: Adds ~5-10s for wav2vec2 model load + alignment (model cached for batch)
Diarization overhead: Adds ~10-30s depending on audio length (runs on GPU if available)
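A minimal sketch for aggregating --stats-file sidecars from a batch run. The key names mirror the stats block in the JSON output example; verify them against a real sidecar, and the ./stats/ directory is illustrative:

```python
import json
from pathlib import Path

# One {stem}.stats.json per input, written via:
# ./scripts/transcribe *.mp3 --stats-file ./stats/
total_time = 0.0
factors = []
for path in Path("./stats").glob("*.stats.json"):
    stats = json.loads(path.read_text())
    total_time += stats["processing_time"]   # key names as in the JSON "stats" block
    factors.append(stats["realtime_factor"])

if factors:
    print(f"{len(factors)} files, {total_time:.1f}s total, "
          f"avg {sum(factors) / len(factors):.1f}x realtime")
```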
Memory:
distil-large-v3: ~2GB RAM / ~1GB VRAM
large-v3-turbo: ~4GB RAM / ~2GB VRAM
tiny/base: <1GB RAM
Diarization: additional ~1-2GB VRAM
OOM: Lower --batch-size (try 4) if you hit out-of-memory errors
Pre-convert to WAV (optional): ffmpeg -i input.mp3 -ar 16000 -ac 1 input.wav converts to 16kHz mono WAV before transcription. The benefit is minimal (~5%) for one-off use since PyAV decodes efficiently — most useful when re-processing the same file multiple times (research/experiments) or when a format causes PyAV decode issues. Note: --normalize and --denoise already perform this conversion automatically.
Silero VAD V6: faster-whisper 1.2.1 upgraded to Silero VAD V6 (improved speech detection). Run ./setup.sh --update to get it.
Batched silence removal: faster-whisper 1.2.0+ automatically removes silence in BatchedInferencePipeline (used by default). Upgrade with ./setup.sh --update if you installed before August 2024.

## Why faster-whisper?
Speed: ~4-6x faster than OpenAI's original Whisper
Accuracy: Identical (uses same model weights)
Efficiency: Lower memory usage via quantization
Production-ready: Stable C++ backend (CTranslate2)
Distilled models: ~6x faster with <1% accuracy loss
Subtitles: Native SRT/VTT/HTML output
Precise alignment: Automatic wav2vec2 refinement (~10ms word boundaries)
Diarization: Optional speaker identification via pyannote; --speaker-names maps to real names
URLs: Direct YouTube/URL input; --keep-temp preserves downloads for re-use
Custom models: Load local CTranslate2 dirs or HuggingFace repos; --model-dir controls cache
Quality control: --filter-hallucinations strips music/applause markers and duplicates
Parallel batch: --parallel N for multi-threaded batch processing
Subtitle burn-in: --burn-in overlays subtitles directly into video via ffmpeg

## v1.5.0 New Features

Multi-format output: --format srt,text — write multiple formats in one pass (e.g. SRT + plain text simultaneously). Comma-separated lists accepted: srt,vtt,json, srt,text, etc. Requires -o.
Setup:
setup.sh --check — System diagnostic: GPU, CUDA, Python, ffmpeg, pyannote, HuggingFace token (completes in ~12s)
ffmpeg no longer required for basic transcription (PyAV handles decoding); skill.json updated to reflect this (ffmpeg is now listed under optionalBins)

## Troubleshooting

"CUDA not available — using CPU": Install PyTorch with CUDA (see GPU Support above)
Setup fails: Make sure Python 3.10+ is installed
Out of memory: Use smaller model, --compute-type int8, or --batch-size 4
Slow on CPU: Expected — use GPU for practical transcription
Model download fails: Check ~/.cache/huggingface/ permissions
Diarization model fails: Ensure HuggingFace token exists and model agreements accepted; or pass token directly with --hf-token hf_xxx
URL download fails: Check yt-dlp is installed (pipx install yt-dlp)
No audio files in batch: Check file extensions match supported formats
Check installed version: Run ./scripts/transcribe --version
Upgrade faster-whisper: Run ./setup.sh --update (upgrades in-place, no full reinstall)
Hallucinations on silence/music: Try --temperature 0.0 --no-speech-threshold 0.8
VAD splits speech incorrectly: Tune with --vad-threshold 0.3 (lower) or --min-silence-duration 300
Improve speech detection: Run ./setup.sh --update to upgrade faster-whisper to the latest version (includes Silero VAD V6)

## References

- faster-whisper GitHub
- Distil-Whisper Paper
- HuggingFace Models
- pyannote.audio (diarization)
- yt-dlp (URL/YouTube download)