Audio Processing
357 skills in Content & Media > Audio Processing
marketing-content-generation
Generate content drafts adapted to GTM motion. Use when creating blog posts, case studies, social posts, sales collateral, or app store copy. Requires brand-voice.md and positioning.md.
podcast
Creates audio podcasts from text using browser text-to-speech. Use when user mentions podcast, audio conversation, dialogue, spoken content, voice narration, audio book, or text-to-speech generation. Supports multiple speakers with automatic language detection. Zero cost, no API keys, works in browser.
ai-multimodal
Process and generate multimedia content using Google Gemini API. Capabilities include analyzing audio files (transcription with timestamps, summarization, speech understanding, music/sound analysis up to 9.5 hours), understanding images (captioning, object detection, OCR, visual Q&A, segmentation), processing videos (scene detection, Q&A, temporal analysis, YouTube URLs, up to 6 hours), extracting from documents (PDF tables, forms, charts, diagrams, multi-page), and generating images (text-to-image, editing, composition, refinement). Use when working with audio/video files, analyzing images or screenshots, processing PDF documents, extracting structured data from media, creating images from text prompts, or implementing multimodal AI features. Supports multiple models (Gemini 2.5/2.0) with context windows up to 2M tokens. | Use when: AI, LLM, vision, embedding, image analysis, Gemini API.
openai-api
Build with OpenAI's stateless APIs - Chat Completions (GPT-5, GPT-4o), Embeddings, Images (DALL-E 3), Audio (Whisper + TTS), and Moderation. Includes Node.js SDK and fetch-based approaches for Cloudflare Workers. Use when: implementing chat completions with GPT-5/GPT-4o, streaming responses with SSE, using function calling/tools, creating structured outputs with JSON schemas, generating embeddings for RAG (text-embedding-3-small/large), generating images with DALL-E 3, editing images with GPT-Image-1, transcribing audio with Whisper, synthesizing speech with TTS (11 voices), moderating content (11 safety categories), or troubleshooting rate limits (429), invalid API keys (401), function calling failures, streaming parse errors, embeddings dimension mismatches, or token limit exceeded errors.
ai-transcript-analyzer
Analyze transcript files using OpenAI API (gpt-5-mini) to extract insights, summaries, key topics, quotes, and action items. This skill should be used when users have transcript files (from WhisperKit, YouTube, podcasts, meetings, etc.) and want AI-powered analysis, summaries, or custom insights extracted from the content. Supports both default comprehensive analysis and custom prompts for specific information extraction.
ai-multimodal
Process and generate multimedia content using Google Gemini API for better vision capabilities. Capabilities include analyzing audio files (transcription with timestamps, summarization, speech understanding, music/sound analysis up to 9.5 hours), understanding images (better image analysis than Claude models, captioning, reasoning, object detection, design extraction, OCR, visual Q&A, segmentation, handling multiple images), processing videos (scene detection, Q&A, temporal analysis, YouTube URLs, up to 6 hours), extracting from documents (PDF tables, forms, charts, diagrams, multi-page), generating images (text-to-image with Imagen 4, editing, composition, refinement), and generating videos (text-to-video with Veo 3, 8-second clips with native audio). Use when working with audio/video files, analyzing images or screenshots (instead of Claude's default vision capabilities; fall back to Claude's vision capabilities only if needed), processing PDF documents, extracting structured data from media, creating images/videos from text pr
esphome-box3-builder
This skill should be used when the user asks to "configure esp32-s3-box-3", "set up box-3", "create box-3 voice assistant", "display lambda on box-3", "configure ili9xxx display", "set up gt911 touch", "configure i2s audio", "es7210 microphone", "es8311 speaker", "box-3 audio pipeline", or mentions error messages like "I2S DMA buffer error", "Touch not responding", "Display flicker", "Audio popping", "PSRAM not detected". Provides complete ESP32-S3-BOX-3 hardware templates, display lambda cookbook, touch patterns, and voice assistant configurations.
analysis-logic-trace
Validate inference chains step-by-step by examining whether each logical connection from premise to conclusion is sound, making implicit reasoning steps explicit and checking for gaps or leaps. Use when: (1) asked to validate reasoning steps, trace the logic, or verify if conclusions follow from premises, (2) arguments skip intermediate inferential steps or use 'therefore' without showing the reasoning path, (3) evaluating multi-step proofs, mathematical reasoning, or decision frameworks where each step builds on previous ones, (4) reasoning depends on unstated assumptions being treated as established facts.
sound-engineer
Expert in spatial audio, procedural sound design, game audio middleware, and app UX sound design. Specializes in HRTF/Ambisonics, Wwise/FMOD integration, UI sound design, and adaptive music systems. Activate on 'spatial audio', 'HRTF', 'binaural', 'Wwise', 'FMOD', 'procedural sound', 'footstep system', 'adaptive music', 'UI sounds', 'notification audio', 'sonic branding'. NOT for music composition/production (use DAW), audio post-production for film (linear media), voice cloning/TTS (use voice-audio-engineer), podcast editing (use standard audio editors), or hardware design.
narrative-voice
Find and maintain consistent authorial voice across different contexts. Use when: (1) asked to develop distinctive voice or style, (2) different sections sound like different authors, (3) writing extended content like blogs or documentation, (4) unifying multiple pieces under common identity, (5) professional writing sounds generic.
synthesisgrounded-audio-brief
Produce grounded audio briefs by chaining source-scoped input, citation verification, dialogue dramatization, and multi-speaker TTS orchestration. Use for “Audio Overview” style outputs.
biblical-accuracy
Comprehensive biblical accuracy verification for sermons, teachings, and theological content aligned with United Church of God theology. Validates scripture references, quotations, contextual integrity, theological soundness per UCG doctrine, and performs deep linguistic analysis of Greek and Hebrew original language texts to ensure fidelity to biblical meaning. Use when writing or reviewing any biblical, theological, or sermon content.
hooks-builder
Creates and configures Claude Code hooks for lifecycle automation. Use when implementing PreToolUse validation, PostToolUse formatting, PermissionRequest auto-approve, custom notifications, session management, or deterministic agent control.
voice-memos
Process voice memos with AI transcription and analysis. Multi-language support (EN, HE), speaker identification, action item extraction with priorities, smart summaries, and auto-categorization (meeting, journal, brainstorm, interview). Triggers - "process voice memos", "transcribe", "analyze memo", "show transcripts", "voice inbox", "extract action items", "meeting notes", "transcribe audio".
daw-music
Digital Audio Workstation usage, music composition, interactive music systems, and game audio implementation for immersive soundscapes.
audio-converter
Convert audio files between formats (MP3, WAV, FLAC, OGG, M4A) with bitrate and sample rate control. Batch processing supported.
transformers
Loading and using pretrained models with Hugging Face Transformers. Use when working with pretrained models from the Hub, running inference with Pipeline API, fine-tuning models with Trainer, or handling text, vision, audio, and multimodal tasks.
voice-audio-engineer
Expert in voice synthesis, TTS, voice cloning, podcast production, speech processing, and voice UI design via ElevenLabs integration. Specializes in vocal clarity, loudness standards (LUFS), de-essing, dialogue mixing, and voice transformation. Activate on 'TTS', 'text-to-speech', 'voice clone', 'voice synthesis', 'ElevenLabs', 'podcast', 'voice recording', 'speech-to-speech', 'voice UI', 'audiobook', 'dialogue'. NOT for spatial audio (use sound-engineer), music production (use DAW tools), game audio middleware (use sound-engineer), sound effects generation (use sound-engineer with ElevenLabs SFX), or live concert audio.
custom-plugin-flutter-skill-accessibility
Production-grade Flutter accessibility mastery - Semantics API, screen readers (VoiceOver/TalkBack), WCAG 2.1 AA/AAA compliance, inclusive design patterns, automated a11y testing with comprehensive code examples
brand-identity
Create or update comprehensive brand identity including strategy, visual design, and voice