Video Whisper — Local Video/Audio Transcription
Transcribe videos and audio locally on Apple Silicon using MLX Whisper. Supports YouTube, Bilibili, Xiaohongshu, Douyin, podcasts, and local files.
Runs entirely on-device. No API keys. No cloud. No cost.
Requirements
- Apple Silicon Mac (M1/M2/M3/M4)
- Homebrew packages: yt-dlp, ffmpeg
- Python venv with mlx-whisper
Installation
# 1. Install system dependencies
brew install yt-dlp ffmpeg
# 2. Create Python venv and install mlx-whisper
python3 -m venv ~/.openclaw/venvs/whisper
~/.openclaw/venvs/whisper/bin/pip install mlx-whisper
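To confirm the two installation steps worked, a small check like the one below can help. It is a sketch, not part of the skill: `missing_tools` and `check_whisper_setup` are hypothetical helper names, and the venv path is the one used in step 2.

```bash
# Print the name of every command in the argument list that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Hypothetical helper: verify the Homebrew tools and the venv from the steps above.
check_whisper_setup() {
  local venv_python="$HOME/.openclaw/venvs/whisper/bin/python3"
  local missing
  missing=$(missing_tools yt-dlp ffmpeg)
  if [ -n "$missing" ]; then
    echo "Missing tools: $missing" >&2
    return 1
  fi
  # find_spec only checks that the package is importable; it does not load a model.
  "$venv_python" -c 'import importlib.util, sys; sys.exit(importlib.util.find_spec("mlx_whisper") is None)' || {
    echo "mlx-whisper not found via $venv_python" >&2
    return 1
  }
  echo "Setup looks OK"
}
```

Source the snippet and run `check_whisper_setup`; a non-zero exit status means one of the steps above needs repeating.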
Usage
CLI
bash scripts/transcribe.sh "<URL_or_FILE>" [model]
- URL: YouTube, Bilibili, Xiaohongshu, Douyin, or any yt-dlp supported site
- Local file: /path/to/video.mp4, /path/to/audio.wav, etc.
- model (optional): defaults to mlx-community/whisper-medium-mlx
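A wrapper that calls the script may want to tell the two input types apart before invoking it. The sketch below shows one plausible way to do that; `input_kind` is a hypothetical helper, and scripts/transcribe.sh may make this decision differently.

```bash
# Classify an argument as a remote URL or a local path.
# Assumption: anything starting with http:// or https:// is treated as a URL.
input_kind() {
  case "$1" in
    http://*|https://*) echo "url" ;;
    *)                  echo "file" ;;
  esac
}

# Hypothetical pre-flight check: fail early if a local file does not exist.
check_input() {
  if [ "$(input_kind "$1")" = "file" ] && [ ! -f "$1" ]; then
    echo "No such file: $1" >&2
    return 1
  fi
}
```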
Output:
- /tmp/whisper_output.txt: plain text transcript
- /tmp/whisper_output.json: JSON with timestamps per segment
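The JSON output can be turned into a timestamped listing with a few lines of shell. This is a sketch under two assumptions that are not stated above: that jq is installed (it is not among the requirements), and that the file follows the usual Whisper layout of a top-level `segments` array whose entries carry `start` (seconds) and `text`. `fmt_ts` and `print_segments` are hypothetical names.

```bash
# Convert a number of seconds (fractions are dropped) to HH:MM:SS.
fmt_ts() {
  local secs="${1%.*}"
  secs="${secs:-0}"
  printf '%02d:%02d:%02d\n' $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60))
}

# Print "[HH:MM:SS] text" for each segment in the given JSON file.
# Assumes the .segments[].start and .segments[].text fields exist.
print_segments() {
  jq -r '.segments[] | [.start, .text] | @tsv' "$1" |
    while IFS="$(printf '\t')" read -r start text; do
      printf '[%s] %s\n' "$(fmt_ts "$start")" "$text"
    done
}
```

For example, `print_segments /tmp/whisper_output.json` lists every segment with its start time; check the actual file first if the field names differ.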
Examples
# YouTube video
bash scripts/transcribe.sh "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
# Bilibili video
bash scripts/transcribe.sh "https://www.bilibili.com/video/BV1xx411c7mD"
# Local file
bash scripts/transcribe.sh ~/Downloads/podcast.mp3
# Use large model for better accuracy
bash scripts/transcribe.sh "https://youtu.be/xxx" mlx-community/whisper-large-v3-mlx
Custom Python Path
If your mlx-whisper is installed in a non-standard location:
export WHISPER_PYTHON=/path/to/your/venv/bin/python3
bash scripts/transcribe.sh "<URL>"
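An override like this is usually implemented with a shell default expansion. The sketch below illustrates the pattern; it assumes the script falls back to the venv created during installation, which is not confirmed here, and `resolve_whisper_python` is a hypothetical name.

```bash
# Return WHISPER_PYTHON if it is set and non-empty,
# otherwise the venv interpreter from the Installation section (assumed default).
resolve_whisper_python() {
  printf '%s\n' "${WHISPER_PYTHON:-$HOME/.openclaw/venvs/whisper/bin/python3}"
}
```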
Available Models
| Model | Size | Speed (10 min video) | Best For |
|---|---|---|---|
| mlx-community/whisper-small-mlx | ~460MB | ~20s | Quick drafts, English |
| mlx-community/whisper-medium-mlx | ~1.5GB | ~60-90s | Recommended: good balance |
| mlx-community/whisper-large-v3-mlx | ~3GB | ~90-120s | Best accuracy, multilingual |
The first run downloads the model to ~/.cache/huggingface/hub/, where it is cached for later runs.
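To see whether a model has already been downloaded, you can look for its folder in the Hugging Face cache, which stores each repository as `models--<org>--<name>`. The helpers below are a sketch with hypothetical names; they assume the default cache location and ignore any `HF_HOME` override. An existing folder may also be a partial download, so treat the result as a hint.

```bash
# Map a model id such as "mlx-community/whisper-medium-mlx"
# to its expected folder in the default Hugging Face hub cache.
model_cache_dir() {
  local model_id="$1"
  local hub_dir="$HOME/.cache/huggingface/hub"
  printf '%s/models--%s\n' "$hub_dir" "$(printf '%s' "$model_id" | sed 's|/|--|g')"
}

# Succeed if the cache folder for the model exists.
is_model_cached() {
  [ -d "$(model_cache_dir "$1")" ]
}
```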
Performance (Mac mini M4, 16GB)
| Video Length | medium | large-v3 |
|---|---|---|
| 5 min | ~30-40s | ~50-60s |
| 10 min | ~60-90s | ~90-120s |
| 30 min | ~3-4 min | ~5-6 min |
| 60 min | ~6-8 min | ~10-12 min |
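Most rows in the table work out to roughly 6-8 seconds of processing per minute of video for medium and 10-12 seconds for large-v3. The sketch below turns that into a rough estimator; the per-minute rates are read off the table for this one machine, and `estimate_runtime` is a hypothetical helper, so treat the numbers as ballpark figures.

```bash
# Estimate transcription time in seconds as "low-high" for a video length in minutes.
# Rates are approximations derived from the Mac mini M4 table above.
estimate_runtime() {
  local minutes="$1" model="$2" lo hi
  case "$model" in
    medium)   lo=6;  hi=8  ;;
    large-v3) lo=10; hi=12 ;;
    *) echo "unknown model: $model" >&2; return 1 ;;
  esac
  echo "$((minutes * lo))-$((minutes * hi))s"
}
```

For a 30-minute video, `estimate_runtime 30 medium` prints `180-240s`, which matches the 3-4 min row.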
OpenClaw Integration
Drop this skill into your OpenClaw workspace:
cp -r video-whisper ~/.openclaw/workspace/skills/
Then ask your agent: "Help me transcribe this video https://..."
The agent will run the script, read the output, and summarize or analyze as needed.
Notes
- Chinese content: use medium or large-v3 (small is weak on Chinese)
- Xiaohongshu/Douyin: may need browser cookies (--cookies-from-browser chrome)
- Long videos (>1h): consider running in background
- Temp files are written to /tmp/ and cleaned up automatically
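The cookie and background notes can be sketched as two small helpers. Both names are hypothetical. Whether scripts/transcribe.sh forwards extra yt-dlp flags is not stated here, so the first helper downloads the audio separately with yt-dlp and leaves the local file for the script; the second runs any command detached with its output captured in a log.

```bash
# Download audio using cookies from Chrome, then transcribe the local file.
# Writes <out>.wav; the default output prefix is an arbitrary choice.
download_audio_with_cookies() {
  local url="$1" out="${2:-/tmp/cookie_audio}"
  yt-dlp --cookies-from-browser chrome -x --audio-format wav -o "$out.%(ext)s" "$url"
}

# Run a command in the background, logging stdout and stderr to a file.
# Prints the PID so the job can be checked or stopped later.
run_in_background() {
  local log="$1"
  shift
  nohup "$@" >"$log" 2>&1 </dev/null &
  echo "$!"
}
```

For a long video, `run_in_background /tmp/transcribe.log bash scripts/transcribe.sh "<URL>"` returns immediately; follow progress with `tail -f /tmp/transcribe.log`.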
License
MIT