creators · concept · sha:c7ff2353cdf48e7a · manual
ai-voiceover-pipeline
Use when scripting → AI voice → matched-timing video — ElevenLabs/PlayHT TTS aligned to HyperFrames or Remotion captions for screencast and tutorial-style videos at scale.
Tutorials · creator-attached
One-line install
curl --create-dirs -fsSL https://skillmake.xyz/i/ai-voiceover-pipeline -o ~/.claude/skills/ai-voiceover-pipeline/SKILL.md
The hash above pins this exact content. The file we serve at /api/marketplace/ai-voiceover-pipeline-c7ff2353/raw always matches sha:c7ff2353cdf48e7a.
3,282 chars · ~821 tokens
---
name: ai-voiceover-pipeline
description: Use when scripting → AI voice → matched-timing video — ElevenLabs/PlayHT TTS aligned to HyperFrames or Remotion captions for screencast and tutorial-style videos at scale.
source: https://elevenlabs.io/docs/quickstart
generated: 2026-05-07T21:42:05.385Z
category: concept
audience: creators
---

## Tutorials

- https://skillmake.xyz/v/ai-voiceover-pipeline.mp4

## When to use

- Producing voiceover for tutorials, explainers, and product demos without recording
- Generating consistent narration across many videos with the same voice
- Aligning generated audio to on-screen captions or Remotion / HyperFrames timing
- Localising one script into multiple voices for multilingual variants

## Key concepts

### voice cloning vs preset voices

Cloned voices need 1–10 minutes of clean source audio (mono, no music, ≤44.1 kHz). Preset voices ship with the platform — start there. Use a clone only if a specific creator brand voice matters; ElevenLabs Pro and PlayHT both offer this.

### alignment data (per-word timestamps)

ElevenLabs returns alignment JSON: per-character offsets in seconds (`character_start_times_seconds`). Use this to drive caption timing, scene cuts, or HyperFrames data-start values. Without alignment, you're back to manual sync.

### script chunking

TTS quality drops on multi-paragraph scripts; chunk by sentence or paragraph (≤300 chars), generate each independently, then concat. This lets you re-roll one bad sentence without redoing the whole render.

## API reference

### ElevenLabs text-to-speech with alignment

POST a script chunk + voice id, get back MP3 + per-character alignment timestamps in one round trip.
```js
// pnpm add @elevenlabs/elevenlabs-js
import { ElevenLabsClient } from '@elevenlabs/elevenlabs-js';

const client = new ElevenLabsClient({ apiKey: process.env.ELEVEN_API_KEY });

const result = await client.textToSpeech.convertWithTimestamps('21m00Tcm4TlvDq8ikWAM', {
  text: 'In this tutorial, we will…',
  modelId: 'eleven_multilingual_v2',
});

// result.audio_base64 → base64-encoded MP3
// result.alignment.character_start_times_seconds → per-character offsets in seconds
```

### alignment → caption timing

Group character offsets into word-level cues, then render as VTT or as HyperFrames data-start attributes.

```js
function toWordCues(text, charStarts) {
  const cues = [];
  let i = 0;
  for (const word of text.split(/\s+/)) {
    const start = charStarts[i];
    i += word.length + 1; // +1 for the separating space
    const end = charStarts[i] ?? charStarts[charStarts.length - 1];
    cues.push({ word, start, end });
  }
  return cues;
}
```

## Gotchas

- Background music must be added AFTER TTS so it doesn't bleed into the cloned voice training set.
- Punctuation drives prosody — periods make pauses, commas don't. Adjust the script for cadence, not just grammar.
- Numbers, units, and acronyms are pronounced inconsistently — expand them in the script ('100 GB' → 'one hundred gigabytes' or 'GB' depending on which the voice handles cleanly).
- ElevenLabs throttles low tiers aggressively at concurrent generations; chunk + sequence rather than parallelise on a Starter plan.

---

Generated by SkillMake from https://elevenlabs.io/docs/quickstart on 2026-05-07T21:42:05.385Z. Verify against source before relying on details.
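The script-chunking step described in the skill can be sketched as a small helper. This is a minimal sketch, not part of the skill or the ElevenLabs SDK: the 300-character limit and the sentence-boundary regex are assumptions, and real scripts may need smarter splitting (abbreviations, decimals).

```javascript
// Sketch: split a script into TTS-sized chunks at sentence boundaries,
// so each chunk can be generated (and re-rolled) independently.
// maxLen = 300 and the sentence regex are assumptions, not platform limits.
function chunkScript(script, maxLen = 300) {
  // Match runs of text ending in ., !, or ? (plus trailing whitespace),
  // or a final fragment with no terminator.
  const sentences = script.match(/[^.!?]+[.!?]+\s*|[^.!?]+$/g) ?? [];
  const chunks = [];
  let current = '';
  for (const sentence of sentences) {
    // Start a new chunk when adding this sentence would exceed the limit.
    if (current && (current + sentence).length > maxLen) {
      chunks.push(current.trim());
      current = '';
    }
    current += sentence;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```

Each chunk then goes through `convertWithTimestamps` on its own; concatenate the resulting MP3s and offset each chunk's alignment timestamps by the running audio duration.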
File: ~/.claude/skills/ai-voiceover-pipeline/SKILL.md