Audio Processing
357 skills in Content & Media > Audio Processing
ai-multimodal
Process and generate multimedia content using the Google Gemini API for better vision capabilities. Capabilities include analyzing audio files (transcription with timestamps, summarization, speech understanding, music/sound analysis up to 9.5 hours), understanding images (better image analysis than Claude models, captioning, reasoning, object detection, design extraction, OCR, visual Q&A, segmentation, handling multiple images), processing videos (scene detection, Q&A, temporal analysis, YouTube URLs, up to 6 hours), extracting from documents (PDF tables, forms, charts, diagrams, multi-page), generating images (text-to-image with Imagen 4, editing, composition, refinement), and generating videos (text-to-video with Veo 3, 8-second clips with native audio). Use when working with audio/video files, analyzing images or screenshots (instead of Claude's default vision capabilities; fall back to Claude's vision only if needed), processing PDF documents, extracting structured data from media, creating images/videos from text pr
drafting-til
Drafts a TIL blog post in the user's voice and creates it in Notion with Status="Claude Draft". Contains voice guide for matching the user's writing style. Use when user approves a TIL topic and wants a draft created.
benswift-writer
Writes and edits content in Ben Swift's distinctive voice for any type of writing including blog posts, emails, technical documentation, and academic content. Use when the user wants writing in Ben's voice or style.
tts
Implement text-to-speech (TTS) capabilities using the z-ai-web-dev-sdk. Use this skill when the user needs to convert text into natural-sounding speech, create audio content, build voice-enabled applications, or generate spoken audio files. Supports multiple voices, adjustable speed, and various audio formats.
asr
Implement speech-to-text (ASR/automatic speech recognition) capabilities using the z-ai-web-dev-sdk. Use this skill when the user needs to transcribe audio files, convert speech to text, build voice input features, or process audio recordings. Supports base64 encoded audio files and returns accurate text transcriptions.
writing-pr-descriptions
Voice guide for writing PR descriptions. Use this skill for ALL PR creation and updates (via gh CLI or editing existing PRs). Contains your specific voice rules, anti-patterns, examples, and workflow. Never write PR descriptions without invoking this skill first.
brand-positioning
Define your brand's core identity - purpose, values, personality, and positioning statement. Creates the strategic foundation that informs voice, design, and all brand decisions.
train-fasttext
This skill provides guidance for training FastText text classification models with constraints on accuracy and model size. It should be used when training fastText supervised models, optimizing model size while maintaining accuracy thresholds, or when hyperparameter tuning for text classification tasks.
brand-voice
Codify your brand's writing style into a reusable voice guide. Analyzes existing content to extract patterns, then generates a comprehensive style document for consistent messaging across all channels.
train-fasttext
Guidance for training FastText text classification models with constraints on model size and accuracy. This skill should be used when training FastText models, optimizing hyperparameters, or balancing trade-offs between model size and classification accuracy.
plain-language
Simplification and readability techniques. Use when writing for broad audiences or simplifying complex content. Covers active voice, short sentences, jargon elimination, and accessibility principles from the Plain Language Movement.
philosopher-analyst
Analyzes fundamental questions and concepts through a philosophical lens using logic, epistemology, metaphysics, and critical analysis frameworks. Provides insights on meaning, truth, knowledge, existence, reasoning, and conceptual clarity. Use when: Conceptual ambiguity, logical arguments, foundational assumptions, meaning questions. Evaluates: Validity, soundness, coherence, assumptions, implications, conceptual clarity.
voicevox-narration-system
Generate Yukkuri-style voice narration from Git commits using VOICEVOX Engine. Use when creating development progress audio guides, YouTube content, or team reports from Git history.
talon-development
Expert guidance for Talon voice control development. Use when creating voice commands, defining actions, writing .talon files, testing Talon config, or debugging Talon issues.
frontend-production-quality
Use before implementing UI changes or frontend PRs. Enforces TodoWrite with 18+ items. Triggers: "accessibility audit", "WCAG", "Lighthouse", "screen reader", "a11y", "NVDA", "VoiceOver", "keyboard navigation", "focus indicator". For "Core Web Vitals" in frontend/UI context, use this skill. For pure backend/API performance optimization, use performance-optimization instead. If thinking "WIP doesn't need this" - use it anyway.
writing-auth0-docs
Use when authoring new documentation or fixing style/formatting violations in the Auth0 docs-v2 repository. Enforces the Auth0 Docs Style Guide for terminology, voice/tone, admonitions, placeholders, capitalization, and translation readiness (not for reading/understanding docs).
assemblyai-streaming
This skill should be used when working with AssemblyAI’s Speech-to-Text and LLM Gateway APIs, especially for streaming/live transcription, meeting notetakers, and voice agents that need low-latency transcripts and audio analysis.
writing-enhancer
Rephrase or completely rewrite content matching user's preferred tone, voice, and style.
openai-api
Complete guide for OpenAI APIs: Chat Completions (GPT-5.2, GPT-4o), Embeddings, Images (GPT-Image-1.5), Audio (Whisper + TTS + Transcribe), Moderation. Includes Node.js SDK and fetch approaches.
id-generator
Generate intelligent session IDs based on detected content source type. Analyzes ContentSummary and creates meaningful IDs (podcast-xyz, transcript-abc, etc.).