creators · concept · sha:c7ff2353cdf48e7a · manual

ai-voiceover-pipeline

Use when scripting → AI voice → matched-timing video — ElevenLabs/PlayHT TTS aligned to HyperFrames or Remotion captions for screencast and tutorial-style videos at scale.

Tutorials · creator-attached
One-line install
curl --create-dirs -fsSL https://skillmake.xyz/i/ai-voiceover-pipeline -o ~/.claude/skills/ai-voiceover-pipeline/SKILL.md

The hash above pins this exact content. The file we serve at /api/marketplace/ai-voiceover-pipeline-c7ff2353/raw always matches sha:c7ff2353cdf48e7a.

3,282 chars · ~821 tokens
---
name: ai-voiceover-pipeline
description: Use when scripting → AI voice → matched-timing video — ElevenLabs/PlayHT TTS aligned to HyperFrames or Remotion captions for screencast and tutorial-style videos at scale.
source: https://elevenlabs.io/docs/quickstart
generated: 2026-05-07T21:42:05.385Z
category: concept
audience: creators
---

## Tutorials

- https://skillmake.xyz/v/ai-voiceover-pipeline.mp4

## When to use

- Producing voiceover for tutorials, explainers, and product demos without recording
- Generating consistent narration across many videos with the same voice
- Aligning generated audio to on-screen captions or Remotion / HyperFrames timing
- Localising one script into multiple voices for multilingual variants

## Key concepts

### voice cloning vs preset voices

Cloned voices need 1–10 minutes of clean source audio (mono, no music, ≤44.1kHz). Preset voices ship with the platform — start there. Use a clone only if a specific creator brand voice matters; ElevenLabs Pro and PlayHT both offer this.

### alignment data (per-word timestamps)

ElevenLabs returns alignment JSON alongside the audio: per-character start and end offsets in seconds. Use these to drive caption timing, scene cuts, or HyperFrames data-start values. Without alignment, you're back to manual sync.

### script chunking

TTS quality drops on multi-paragraph scripts; chunk by sentence or paragraph (≤300 chars), generate each chunk independently, then concatenate the audio. This lets you re-roll one bad sentence without redoing the whole render.
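
A minimal sketch of that chunking step. The sentence regex and the 300-char budget are illustrative assumptions, not part of any SDK:

```js
// Split a script into TTS-sized chunks without breaking sentences.
// A chunk is flushed just before it would exceed the character budget.
function chunkScript(script, maxChars = 300) {
  // Naive sentence split: text up to .!? (plus a trailing fragment, if any).
  const sentences = script.match(/[^.!?]+[.!?]+|[^.!?]+$/g) ?? [];
  const chunks = [];
  let current = '';
  for (const sentence of sentences) {
    const next = (current + sentence).trim();
    if (current && next.length > maxChars) {
      chunks.push(current.trim()); // flush the full chunk
      current = sentence;          // start a new one with this sentence
    } else {
      current = next;
    }
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```

Each chunk then becomes one TTS request, so a bad read only costs you one short re-generation.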

## API reference

### ElevenLabs text-to-speech with alignment

POST a script chunk + voice id, get back MP3 + per-character alignment timestamps in one round trip.

```ts
// pnpm add @elevenlabs/elevenlabs-js
import { ElevenLabsClient } from '@elevenlabs/elevenlabs-js';

const client = new ElevenLabsClient({ apiKey: process.env.ELEVEN_API_KEY });

const result = await client.textToSpeech.convertWithTimestamps('21m00Tcm4TlvDq8ikWAM', {
  text: 'In this tutorial, we will…',
  modelId: 'eleven_multilingual_v2',
});

// result.audioBase64 → base64-encoded MP3
// result.alignment.characterStartTimesSeconds → per-character start offsets (seconds)
```

### Alignment → caption timing

Group character offsets into word-level cues, then render as VTT or as HyperFrames data-start attributes.

```js
// Group per-character start times into word-level cues.
// Assumes charStarts[i] is the start time (seconds) of character i in text.
function toWordCues(text, charStarts) {
  const cues = [];
  let i = 0;
  for (const word of text.split(/\s+/)) {
    const start = charStarts[i];
    i += word.length + 1; // advance past the word and the following space
    // A word "ends" where the next one starts; clamp at the final character.
    const end = charStarts[i] ?? charStarts[charStarts.length - 1];
    cues.push({ word, start, end });
  }
  return cues;
}
```
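
If you go the VTT route, those word cues can be serialized with a small helper. This is a sketch assuming the cue times are in seconds; `formatTime` and `toVtt` are hypothetical names, not library calls:

```js
// Format seconds as WebVTT's HH:MM:SS.mmm timestamp layout.
function formatTime(seconds) {
  const ms = Math.round(seconds * 1000);
  const h = String(Math.floor(ms / 3600000)).padStart(2, '0');
  const m = String(Math.floor((ms % 3600000) / 60000)).padStart(2, '0');
  const s = String(Math.floor((ms % 60000) / 1000)).padStart(2, '0');
  const frac = String(ms % 1000).padStart(3, '0');
  return `${h}:${m}:${s}.${frac}`;
}

// Render [{ word, start, end }] cues as a complete WebVTT track.
function toVtt(cues) {
  const body = cues
    .map(({ word, start, end }) => `${formatTime(start)} --> ${formatTime(end)}\n${word}`)
    .join('\n\n');
  return `WEBVTT\n\n${body}\n`;
}
```

One cue per word gives karaoke-style captions; merge consecutive cues if you want phrase-level subtitles instead.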

## Gotchas

- Background music must be added AFTER TTS so it doesn't bleed into the cloned voice training set.
- Punctuation drives prosody — periods make pauses, commas don't. Adjust the script for cadence, not just grammar.
- Numbers, units, and acronyms are pronounced inconsistently — expand them in the script ('100 GB' → 'one hundred gigabytes' or 'GB' depending on which the voice handles cleanly).
- ElevenLabs throttles low tiers aggressively at concurrent generations; chunk + sequence rather than parallelise on a Starter plan.
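
Sequencing instead of parallelising can be as simple as awaiting each generation in turn. A sketch with a generic `runSequentially` helper (a hypothetical name; `tasks` would wrap calls like the `convertWithTimestamps` request above):

```js
// Run async thunks one at a time, e.g.
// tasks = chunks.map(text => () => generateChunk(text));
// Keeps exactly one request in flight, which low-tier rate limits tolerate.
async function runSequentially(tasks) {
  const results = [];
  for (const task of tasks) {
    results.push(await task()); // next request starts only after this one resolves
  }
  return results;
}
```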

---
Generated by SkillMake from https://elevenlabs.io/docs/quickstart on 2026-05-07T21:42:05.385Z.
Verify against source before relying on details.

File: ~/.claude/skills/ai-voiceover-pipeline/SKILL.md