video-whisper

ylongw/video-whisper

🎙️ Local video/audio transcription on Apple Silicon using MLX Whisper. No API keys, no cloud, no cost.

Video Whisper — Local Video/Audio Transcription

Transcribe videos and audio locally on Apple Silicon using MLX Whisper. Supports YouTube, Bilibili, Xiaohongshu, Douyin, podcasts, and local files.

Runs entirely on-device. No API keys. No cloud. No cost.

Requirements

Apple Silicon Mac (M1/M2/M3/M4)
Homebrew packages: yt-dlp, ffmpeg
Python venv with mlx-whisper

Installation

# 1. Install system dependencies
brew install yt-dlp ffmpeg

# 2. Create Python venv and install mlx-whisper
python3 -m venv ~/.openclaw/venvs/whisper
~/.openclaw/venvs/whisper/bin/pip install mlx-whisper
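To confirm the two installation steps worked, a small check like the following may help. It is only a sketch: `check_cmd` and `check_setup` are hypothetical helpers, not part of this repository, and the venv path is assumed to be the one used in step 2.

```shell
# Hypothetical helper: report whether a command is available on PATH.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Hypothetical helper: check both Homebrew tools and the mlx_whisper module.
# Assumes the venv path from step 2; adjust it if yours differs.
check_setup() {
    status=0
    check_cmd yt-dlp || status=1
    check_cmd ffmpeg || status=1
    py="$HOME/.openclaw/venvs/whisper/bin/python3"
    if [ -x "$py" ] && "$py" -c 'import mlx_whisper' >/dev/null 2>&1; then
        echo "ok: mlx_whisper"
    else
        echo "missing: mlx_whisper (checked $py)"
        status=1
    fi
    return "$status"
}
```

Calling `check_setup` prints one line per dependency and returns non-zero if anything is missing.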

Usage

CLI

bash scripts/transcribe.sh "<URL_or_FILE>" [model]

URL: YouTube, Bilibili, Xiaohongshu, Douyin, or any other site supported by yt-dlp
Local file: /path/to/video.mp4, /path/to/audio.wav, etc.
model (optional): defaults to mlx-community/whisper-medium-mlx
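The internals of transcribe.sh are not shown here, so the following is only a sketch of how the two arguments described above could be interpreted. `classify_input` and `pick_model` are hypothetical names; the default model string is the one documented above.

```shell
# Hypothetical helper: decide whether the first argument is a URL or a local file.
classify_input() {
    case "$1" in
        http://*|https://*)
            echo "url"
            ;;
        *)
            if [ -f "$1" ]; then
                echo "file"
            else
                echo "unknown"
                return 1
            fi
            ;;
    esac
}

# Hypothetical helper: fall back to the documented default when no model is given.
pick_model() {
    echo "${1:-mlx-community/whisper-medium-mlx}"
}
```

Under this sketch, a URL would be handed to yt-dlp for download, while a local file would go straight to audio extraction and transcription.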

Output:

/tmp/whisper_output.txt — plain text transcript
/tmp/whisper_output.json — JSON with timestamps per segment
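Because the output paths are fixed, a later run presumably overwrites the previous transcript. If you want to keep results, a helper along these lines could copy them somewhere permanent. `save_transcript` and the `SRC_DIR` variable are hypothetical and not part of the script.

```shell
# Hypothetical helper: copy the fixed-name outputs to <dest>/<name>.txt and .json.
# SRC_DIR defaults to /tmp, where the outputs listed above are written.
save_transcript() {
    name="$1"
    dest="${2:-.}"
    src="${SRC_DIR:-/tmp}"
    mkdir -p "$dest" || return 1
    for ext in txt json; do
        cp "$src/whisper_output.$ext" "$dest/$name.$ext" || return 1
    done
    # Print the path of the saved plain-text transcript.
    echo "$dest/$name.txt"
}
```

For example, `save_transcript podcast-ep1 ~/Transcripts` run right after a transcription would leave podcast-ep1.txt and podcast-ep1.json in ~/Transcripts.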

Examples

# YouTube video
bash scripts/transcribe.sh "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

# Bilibili video
bash scripts/transcribe.sh "https://www.bilibili.com/video/BV1xx411c7mD"

# Local file
bash scripts/transcribe.sh ~/Downloads/podcast.mp3

# Use large model for better accuracy
bash scripts/transcribe.sh "https://youtu.be/xxx" mlx-community/whisper-large-v3-mlx

Custom Python Path

If your mlx-whisper is installed in a non-standard location:

export WHISPER_PYTHON=/path/to/your/venv/bin/python3
bash scripts/transcribe.sh "<URL>"
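One common way a script honors an override like this is shell parameter expansion with a default. The sketch below assumes the fallback is the venv created in the Installation steps; `resolve_python` is a hypothetical name and the real script may resolve the interpreter differently.

```shell
# Hypothetical sketch: prefer WHISPER_PYTHON when set, else the default venv.
resolve_python() {
    echo "${WHISPER_PYTHON:-$HOME/.openclaw/venvs/whisper/bin/python3}"
}
```

With this pattern an unset or empty WHISPER_PYTHON falls back to the default path, so the variable only needs exporting when the venv lives elsewhere.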

Available Models

Model	Size	Time (10 min video)	Best For
`mlx-community/whisper-small-mlx`	~460MB	~20s	Quick drafts, English
`mlx-community/whisper-medium-mlx`	~1.5GB	~60-90s	Recommended — good balance
`mlx-community/whisper-large-v3-mlx`	~3GB	~90-120s	Best accuracy, multilingual

The first run downloads the model to ~/.cache/huggingface/hub/, where it is cached for later runs.

Performance (Mac mini M4, 16GB)

Video Length	medium	large-v3
5 min	~30-40s	~50-60s
10 min	~60-90s	~90-120s
30 min	~3-4 min	~5-6 min
60 min	~6-8 min	~10-12 min
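The table is roughly linear in video length: the 10-minute row works out to about 6-9 seconds per video minute for medium and 9-12 for large-v3. As a rough extrapolation only (those per-minute rates are derived from that row, not measured separately, and apply to the Mac mini M4 above), a hypothetical `estimate_seconds` helper could look like this:

```shell
# Hypothetical helper: estimate a transcription time range in seconds.
# Rates (seconds per video minute) are derived from the 10-minute table row.
estimate_seconds() {
    minutes="$1"
    case "${2:-medium}" in
        medium)   lo=6; hi=9 ;;
        large-v3) lo=9; hi=12 ;;
        *) echo "unknown model" >&2; return 1 ;;
    esac
    echo "$((minutes * lo))-$((minutes * hi))"
}
```

For instance, `estimate_seconds 45 large-v3` prints `405-540`, that is, roughly 7 to 9 minutes. The estimate runs slightly wide of the table at some lengths, so treat it as a ballpark.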

OpenClaw Integration

Drop this skill into your OpenClaw workspace:

cp -r video-whisper ~/.openclaw/workspace/skills/

Then ask your agent: "Transcribe this video for me: https://..."

The agent will run the script, read the output, and summarize or analyze as needed.

Notes

Chinese content: use medium or large-v3 (small is weak on Chinese)
Xiaohongshu/Douyin: downloads may need browser cookies (yt-dlp option --cookies-from-browser chrome)
Long videos (>1h): consider running the script in the background
Temp files: written to /tmp/ and cleaned up automatically
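For the long-video case, one way to run a job in the background is with nohup and a log file. This is a generic shell sketch; `run_in_background` and the log path are hypothetical and not provided by this skill.

```shell
# Hypothetical helper: run a command detached from the terminal,
# sending stdout and stderr to a log file, and print the job's PID.
run_in_background() {
    log="$1"
    shift
    nohup "$@" >"$log" 2>&1 &
    echo "$!"
}

# Example usage (commented out so the block has no side effects):
#   run_in_background /tmp/transcribe.log bash scripts/transcribe.sh "<URL>"
#   tail -f /tmp/transcribe.log
```

The transcript still lands in the usual /tmp/whisper_output files once the job finishes; the log only captures the script's console output.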

License

MIT

Installing as an Agent Skill

Claude Code

Option 1: Use the slash command

/install-skill https://github.com/ylongw/video-whisper

Option 2: Clone to the skills directory

# Global (all projects)
git clone https://github.com/ylongw/video-whisper ~/.claude/skills/video-whisper

# Project-specific
git clone https://github.com/ylongw/video-whisper .claude/skills/video-whisper

Cursor

Add an MCP server entry to .cursor/mcp.json:

{
  "mcpServers": {
    "skillz": {
      "command": "npx",
      "args": ["-y", "skillz-mcp", "https://github.com/ylongw/video-whisper"]
    }
  }
}

Restart Cursor after adding the configuration.

Gemini CLI

Option 1: Use the Gemini CLI command

gemini extensions install https://github.com/ylongw/video-whisper

Option 2: Clone to extensions directory

git clone https://github.com/ylongw/video-whisper ~/.gemini/extensions/video-whisper
