Video Whisper — Local Video/Audio Transcription

Transcribe videos and audio locally on Apple Silicon using MLX Whisper. Supports YouTube, Bilibili, Xiaohongshu, Douyin, podcasts, and local files.

Runs entirely on-device. No API keys. No cloud. No cost.

Requirements

  • Apple Silicon Mac (M1/M2/M3/M4)
  • Homebrew packages: yt-dlp, ffmpeg
  • Python venv with mlx-whisper

Installation

# 1. Install system dependencies
brew install yt-dlp ffmpeg

# 2. Create Python venv and install mlx-whisper
python3 -m venv ~/.openclaw/venvs/whisper
~/.openclaw/venvs/whisper/bin/pip install mlx-whisper
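Before the first run it can help to confirm that the command-line tools are on PATH. The helper below is a minimal sketch; `missing_tools` is a hypothetical name and is not part of this skill.

```shell
# Print the name of every given tool that is not on PATH (one per line).
# Returns non-zero if at least one tool is missing.
missing_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (checks the two Homebrew packages named above):
#   missing_tools yt-dlp ffmpeg || echo "install the tools listed above" >&2
```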

Usage

CLI

bash scripts/transcribe.sh "<URL_or_FILE>" [model]
  • URL: YouTube, Bilibili, Xiaohongshu, Douyin, or any yt-dlp supported site
  • Local file: /path/to/video.mp4, /path/to/audio.wav, etc.
  • model (optional): defaults to mlx-community/whisper-medium-mlx
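The first argument can be either a URL or a local path. One way to tell the two apart is sketched below; `classify_input` is a hypothetical helper, and the real scripts/transcribe.sh may decide differently.

```shell
# Classify an input argument.
# Prints "url" for http(s) links, "file" for an existing regular file,
# and "missing" for anything else.
classify_input() {
  case "$1" in
    http://*|https://*)
      echo "url"
      ;;
    *)
      if [ -f "$1" ]; then
        echo "file"
      else
        echo "missing"
      fi
      ;;
  esac
}
```

A URL would then be handed to yt-dlp for download, while a local file could go straight to ffmpeg or the transcriber.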

Output:

  • /tmp/whisper_output.txt — plain text transcript
  • /tmp/whisper_output.json — JSON with timestamps per segment
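To skim the timestamped output, something like the following can print one line per segment. This is a sketch under an assumption: it expects the usual Whisper-style layout, a top-level "segments" list whose items have "start", "end" (in seconds) and "text". The exact schema written by this script is not documented here, so check the file first. `print_segments` is a hypothetical helper name.

```shell
# Print "[start -> end] text" for each segment of a Whisper-style JSON file.
# Assumed schema: {"segments": [{"start": ..., "end": ..., "text": ...}, ...]}
print_segments() {
  python3 - "$1" <<'PY'
import json
import sys

with open(sys.argv[1], encoding="utf-8") as f:
    data = json.load(f)

for seg in data.get("segments", []):
    print(f"[{seg['start']:.2f} -> {seg['end']:.2f}] {seg['text'].strip()}")
PY
}

# Usage:
#   print_segments /tmp/whisper_output.json
```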

Examples

# YouTube video
bash scripts/transcribe.sh "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

# Bilibili video
bash scripts/transcribe.sh "https://www.bilibili.com/video/BV1xx411c7mD"

# Local file
bash scripts/transcribe.sh ~/Downloads/podcast.mp3

# Use large model for better accuracy
bash scripts/transcribe.sh "https://youtu.be/xxx" mlx-community/whisper-large-v3-mlx

Custom Python Path

If your mlx-whisper is installed in a non-standard location:

export WHISPER_PYTHON=/path/to/your/venv/bin/python3
bash scripts/transcribe.sh "<URL>"
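An override like this is typically resolved with a shell default expansion. The sketch below assumes the fallback is the python3 inside the venv created during installation; `resolve_whisper_python` is a hypothetical helper, not a function from the script.

```shell
# Print the Python interpreter to use: $WHISPER_PYTHON if it is set and
# non-empty, otherwise the venv path from the Installation section
# (the fallback path is an assumption).
resolve_whisper_python() {
  echo "${WHISPER_PYTHON:-$HOME/.openclaw/venvs/whisper/bin/python3}"
}
```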

Available Models

Model | Size | Speed (10 min video) | Best For
--- | --- | --- | ---
mlx-community/whisper-small-mlx | ~460 MB | ~20 s | Quick drafts, English
mlx-community/whisper-medium-mlx | ~1.5 GB | ~60-90 s | Recommended (good balance)
mlx-community/whisper-large-v3-mlx | ~3 GB | ~90-120 s | Best accuracy, multilingual

First run downloads the model to ~/.cache/huggingface/hub/ (cached for future use).

Performance (Mac mini M4, 16GB)

Video Length | medium | large-v3
--- | --- | ---
5 min | ~30-40 s | ~50-60 s
10 min | ~60-90 s | ~90-120 s
30 min | ~3-4 min | ~5-6 min
60 min | ~6-8 min | ~10-12 min
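The figures above scale roughly linearly with media length: about 6-9 seconds of processing per minute of media for medium and about 9-12 seconds for large-v3, taking the widest range in the table. The sketch below turns that into a rough estimate. The rates are read off this table only, so treat them as a rule of thumb for similar hardware; `estimate_seconds` is a hypothetical helper.

```shell
# Rough processing-time range in seconds for a given media length in minutes.
# Rates (seconds per media minute) are derived from the table above.
estimate_seconds() {
  minutes=$1
  model=${2:-medium}
  case "$model" in
    medium)   low=6; high=9 ;;
    large-v3) low=9; high=12 ;;
    *)
      echo "unknown model: $model" >&2
      return 1
      ;;
  esac
  echo "$((minutes * low))-$((minutes * high))"
}

# Usage:
#   estimate_seconds 10 large-v3
```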

OpenClaw Integration

Drop this skill into your OpenClaw workspace:

cp -r video-whisper ~/.openclaw/workspace/skills/

Then ask your agent: "Help me transcribe this video https://..."

The agent will run the script, read the output, and summarize or analyze as needed.

Notes

  • Chinese content: use medium or large-v3; small is weak on Chinese.
  • Xiaohongshu/Douyin: downloads may need browser cookies (yt-dlp option --cookies-from-browser chrome).
  • Long videos (>1h): consider running the script in the background.
  • Temp files are written to /tmp/ and cleaned up automatically.
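Automatic cleanup of this kind is commonly done with a scratch directory and an EXIT trap. The sketch below shows the pattern; it is not taken from scripts/transcribe.sh, and `with_scratch_dir` is a hypothetical helper.

```shell
# Run a command with a private scratch directory that is removed afterwards,
# even if the command fails. The directory path is appended as the command's
# last argument. The subshell keeps the trap from leaking into the caller.
with_scratch_dir() {
  (
    scratch=$(mktemp -d "${TMPDIR:-/tmp}/whisper.XXXXXX") || exit 1
    trap 'rm -rf "$scratch"' EXIT
    "$@" "$scratch"
  )
}

# Usage:
#   with_scratch_dir ls -la
```

For long inputs, the whole invocation can be pushed to the background with the shell's `&` operator or `nohup`.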

License

MIT