
Replicating a Language-Learning Comedy Short with Claude Code — Gemini as a Multimodal Sub-Agent

Dev.to Machine Learning

About This Tutorial

Gemini 3.1 Pro Preview (orchestrator)
  ↓ system prompt enforces 4-6 scenes + 2-character fixed cast + vertical
plan.json {scenes: [{speaker, script, tts_language, ltx_prompt, renderer
  ↓ XTTS (, port 8880) generates audio per scene
  ↓ scene_NN.wa
renderer routing:
  ├─ Ditto-TalkingHead (, port 8881): normal dialogue ~1-2s/scene
  └─ LTX-2 A2V (, port 8892): reaction_only scenes only ~100s
  ↓ scene_NN.mp4
ffmpeg concat (libx264 + aac, 512x768 vertical)
final.mp4
Gemini 3.1 Pro Preview (reviewer)
  ↓ multimodal evaluation of video + plan summary.
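The plan checks, renderer routing, and final concat step in the pipeline above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tutorial's actual code. The ports, the plan.json field names, and the libx264 + aac + 512x768 settings come from the diagram. Everything else is hypothetical: the helper names, the idea that `renderer` holds the string "reaction_only" for LTX-2 scenes, the 1-based two-digit scene numbering, and the use of ffmpeg's `scale` filter to reach 512x768.

```python
from dataclasses import dataclass

# Ports taken from the pipeline diagram; hosts are not given there.
DITTO_PORT = 8881  # Ditto-TalkingHead: normal dialogue
LTX_PORT = 8892    # LTX-2 A2V: reaction_only scenes

# Field names as listed for plan.json (the real schema may have more).
FIELDS = ("speaker", "script", "tts_language", "ltx_prompt", "renderer")


@dataclass
class Scene:
    speaker: str
    script: str
    tts_language: str
    ltx_prompt: str
    renderer: str  # assumption: "reaction_only" selects LTX-2


def validate_plan(plan: dict) -> list[Scene]:
    """Check the constraints the system prompt is said to enforce."""
    scenes = [Scene(**{name: raw[name] for name in FIELDS})
              for raw in plan["scenes"]]
    if not 4 <= len(scenes) <= 6:
        raise ValueError(f"expected 4-6 scenes, got {len(scenes)}")
    cast = {scene.speaker for scene in scenes}
    if len(cast) > 2:
        raise ValueError(f"expected at most 2 characters, got {sorted(cast)}")
    return scenes


def renderer_port(scene: Scene) -> int:
    """Route reaction_only scenes to LTX-2, everything else to Ditto."""
    return LTX_PORT if scene.renderer == "reaction_only" else DITTO_PORT


def concat_inputs(n_scenes: int) -> str:
    """Build the input list for ffmpeg's concat demuxer."""
    return "".join(f"file 'scene_{i:02d}.mp4'\n"
                   for i in range(1, n_scenes + 1))


def ffmpeg_concat_cmd(list_path: str, out_path: str = "final.mp4") -> list[str]:
    """Build (but do not run) the ffmpeg command for the final video."""
    return ["ffmpeg", "-f", "concat", "-safe", "0", "-i", list_path,
            "-vf", "scale=512:768", "-c:v", "libx264", "-c:a", "aac",
            out_path]
```

To use the sketch, load plan.json with `json.load`, pass the result to `validate_plan`, send each scene to the port returned by `renderer_port`, write the text from `concat_inputs` to a list file, and hand the command from `ffmpeg_concat_cmd` to `subprocess.run`. Because the Ditto path takes about 1-2 s per scene and the LTX-2 path about 100 s, keeping the routing decision in one small function makes it easy to see how many slow scenes a plan will trigger.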