Rocco (AI-Powered Podcast Generator)


Overview
Rocco is an autonomous podcast generation platform that transforms topics and research queries into fully produced single- or dual-host podcast episodes. Built as a personal solution for consuming educational content and daily news briefings, Rocco combines agent orchestration, multi-modal AI integration, and end-to-end content production automation. It targets the same use case as commercial solutions like NotebookLM while offering customizable scheduling and deeper research capabilities.
Technical Architecture
Rocco's architecture integrates multiple best-in-class AI services into a cohesive pipeline. The system leverages GPT-5 and GPT-5-mini for text inference and script generation, Tavily for web research, Cartesia.ai for natural voice synthesis, and Gemini for image generation. The platform supports both user-initiated and scheduled autonomous podcast creation, enabling daily briefings and recurring deep-dives on specific topics without manual intervention.

The technical stack demonstrates careful API selection: Tavily provides high-quality research results with better filtering than traditional search APIs, Cartesia delivers emotionally nuanced voice synthesis through SSML tag support, and the dual GPT model approach balances performance with cost efficiency across different pipeline stages.
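
The dual-model split can be pictured as a small routing table from pipeline stage to model. This is only a sketch: the stage names and the stage-to-model assignments are assumptions made for illustration, since the description above states only that GPT-5 and GPT-5-mini are used at different stages.

```python
from enum import Enum


class Stage(Enum):
    # Hypothetical stage names, chosen to mirror the pipeline described in this document.
    RESEARCH_PLANNING = "research_planning"
    SOURCE_FILTERING = "source_filtering"
    OUTLINING = "outlining"
    SCRIPTING = "scripting"
    EDITING = "editing"


# Assumed assignment: stages that need deeper reasoning use the larger model,
# high-volume stages use the cheaper one. The real split is not documented here.
MODEL_FOR_STAGE = {
    Stage.RESEARCH_PLANNING: "gpt-5",
    Stage.SOURCE_FILTERING: "gpt-5-mini",
    Stage.OUTLINING: "gpt-5",
    Stage.SCRIPTING: "gpt-5",
    Stage.EDITING: "gpt-5-mini",
}


def model_for(stage: Stage) -> str:
    """Return the model name configured for a pipeline stage."""
    return MODEL_FOR_STAGE[stage]
```

Keeping the mapping in one place means a cost or quality experiment only touches a single table rather than every agent.
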
Research and Content Pipeline
Rocco's most sophisticated component is its multi-stage research and scripting pipeline. When a topic is submitted, either manually or by a scheduled agent, the system begins with an exploratory research phase. The research agent runs initial searches through Tavily to establish context, then develops a research strategy: a structured plan that identifies the key subtopics and the specific search queries needed to cover the subject comprehensively.
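
A research strategy of this kind could be represented as a small data structure. The class and field names below are hypothetical; the sketch only shows the shape of a plan made of subtopics and search queries, not Rocco's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Subtopic:
    title: str
    queries: list[str] = field(default_factory=list)


@dataclass
class ResearchStrategy:
    topic: str
    subtopics: list[Subtopic] = field(default_factory=list)

    def all_queries(self) -> list[str]:
        """Flatten queries in plan order, dropping case-insensitive duplicates."""
        seen: set[str] = set()
        ordered: list[str] = []
        for sub in self.subtopics:
            for query in sub.queries:
                cleaned = query.strip()
                key = cleaned.lower()
                if key and key not in seen:
                    seen.add(key)
                    ordered.append(cleaned)
        return ordered


# Example plan (illustrative content only).
strategy = ResearchStrategy(
    topic="dark matter",
    subtopics=[
        Subtopic("Evidence", ["galaxy rotation curves", "gravitational lensing"]),
        Subtopic("Candidates", ["WIMP searches", "Gravitational lensing "]),
    ],
)
```

Deduplicating queries before they are sent keeps the targeted search phase from paying twice for the same lookup.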

This strategy drives targeted information gathering, with results filtered and ranked to retain only the most relevant sources. Extracted content is parsed and indexed into a vector database, creating a queryable knowledge base specific to each episode. This approach ensures research depth while maintaining focus on the user's original topic.
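
The filter, rank, and index steps can be sketched with an in-memory stand-in. Everything here is an assumption for illustration: the `relevance` field, the threshold, and especially the toy word-count "embedding", which takes the place of a real embedding model and vector database (neither is named in this document).

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Source:
    url: str
    text: str
    relevance: float  # hypothetical relevance score in [0, 1]


def keep_top_sources(sources: list[Source], min_relevance: float = 0.5,
                     limit: int = 5) -> list[Source]:
    """Drop low-relevance sources, then keep the best `limit` by score."""
    kept = [s for s in sources if s.relevance >= min_relevance]
    return sorted(kept, key=lambda s: s.relevance, reverse=True)[:limit]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class EpisodeIndex:
    """Per-episode knowledge base: add sources, then query by similarity."""

    def __init__(self) -> None:
        self._entries: list[tuple[Counter, Source]] = []

    def add(self, source: Source) -> None:
        self._entries.append((embed(source.text), source))

    def search(self, query: str, k: int = 3) -> list[Source]:
        q = embed(query)
        ranked = sorted(self._entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [source for _, source in ranked[:k]]
```

Building one index per episode, as described above, keeps retrieval scoped to the research gathered for that topic.
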
The scripting process operates in distinct phases. First, an outlining agent determines episode structure, selecting which aspects of the research to cover based on the chosen podcast style (educational deep-dive, news briefing, conversational discussion, etc.) and host configuration (single or dual-host format). The outline represents a curated subset of the research, tailored to the episode format and time constraints.

For each outline section, the system performs vector searches against the indexed research to retrieve relevant context, then generates a detailed script segment. This section-by-section approach allows fine-grained control over content development but introduces a critical challenge: maintaining narrative coherence across independently generated segments.
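
The section-by-section loop might look like the following sketch. The `retrieve` and `generate` callables are hypothetical placeholders for the vector search and the language-model call, and the prompt layout is an assumption; only the overall shape (retrieve context per section, then generate that section) comes from the description above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class OutlineSection:
    heading: str
    talking_points: list[str]


def build_prompt(section: OutlineSection, context: list[str],
                 style: str, hosts: int) -> str:
    """Assemble a plain-text prompt for one script segment."""
    lines = [
        f"Style: {style}",
        f"Hosts: {hosts}",
        f"Section: {section.heading}",
        "Talking points:",
        *[f"- {point}" for point in section.talking_points],
        "Research context:",
        *[f"- {chunk}" for chunk in context],
    ]
    return "\n".join(lines)


def write_script(outline: list[OutlineSection],
                 retrieve: Callable[[str, int], list[str]],
                 generate: Callable[[str], str],
                 style: str = "educational deep-dive",
                 hosts: int = 2,
                 k: int = 3) -> list[str]:
    """Generate one segment per outline section, each from its own retrieved context."""
    segments = []
    for section in outline:
        query = " ".join([section.heading, *section.talking_points])
        context = retrieve(query, k)
        segments.append(generate(build_prompt(section, context, style, hosts)))
    return segments
```

Because each call sees only its own section and context, nothing in this loop ties one segment to the next, which is exactly where the coherence problem comes from.
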
Quality Assurance and Natural Dialogue
To address coherence challenges, Rocco implements an editorial review stage. An editor agent analyzes the complete script, identifying inconsistencies, tonal shifts, or logical gaps between sections. The original scriptwriting agent receives this feedback and applies revisions, ensuring the final episode flows naturally despite its segmented generation process.
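
The editor and scriptwriter exchange can be sketched as a bounded feedback loop. The `review` and `revise` callables stand in for the two agents, and the `max_rounds` cap is an assumption; the document does not say how many revision passes Rocco runs.

```python
from typing import Callable


def review_and_revise(script: str,
                      review: Callable[[str], list[str]],
                      revise: Callable[[str, list[str]], str],
                      max_rounds: int = 2) -> str:
    """Alternate editor review and writer revision until the editor has no notes.

    `review` returns a list of editorial notes (empty when the script is clean);
    `revise` returns a new script that addresses those notes.
    """
    for _ in range(max_rounds):
        notes = review(script)
        if not notes:
            break
        script = revise(script, notes)
    return script
```

Capping the rounds keeps an already long pipeline from looping indefinitely when the editor and writer disagree.
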
Achieving natural-sounding dialogue was a significant early challenge. The system now incorporates automated evaluation frameworks that assess script quality across multiple dimensions: conversational flow, factual accuracy, tonal consistency, and emotional appropriateness. These evaluations guide iterative refinement of prompting strategies and generation parameters, which has substantially improved output quality.
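
A per-script evaluation over the four dimensions named above could be summarized as follows. The 1 to 5 scale, the pass threshold, and the function name are assumptions; how each score is produced (for example by a grading model) is outside this sketch.

```python
from statistics import mean

DIMENSIONS = (
    "conversational_flow",
    "factual_accuracy",
    "tonal_consistency",
    "emotional_appropriateness",
)


def summarize(scores: dict[str, float], threshold: float = 3.5) -> dict:
    """Summarize per-dimension scores (assumed 1-5 scale) for one script."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing scores for: {', '.join(missing)}")
    return {
        "overall": mean(scores[d] for d in DIMENSIONS),
        "weakest": min(DIMENSIONS, key=lambda d: scores[d]),
        "passed": all(scores[d] >= threshold for d in DIMENSIONS),
    }
```

Reporting the weakest dimension alongside the average points a prompt revision at the specific failure rather than at the script as a whole.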

A critical innovation is the integration of SSML (Speech Synthesis Markup Language) tags directly into generated scripts. The scriptwriting agent annotates dialogue with emotional cues, pacing instructions, and prosodic markers, enabling Cartesia's voice synthesis to deliver genuinely expressive performances rather than monotone narration. This attention to emotional nuance distinguishes Rocco's output from simpler text-to-speech implementations.
Development Challenges and Solutions
The primary technical challenge in developing Rocco was debugging a long-horizon task whose execution time extends up to an hour for a full-length episode. Traditional debugging approaches are ineffective when each iteration takes up to an hour rather than seconds. This necessitated building robust logging, intermediate state inspection, and component-level testing frameworks to isolate issues without running the full pipeline.
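
One common way to get intermediate state inspection is to checkpoint each stage's output to disk, so a later stage can be rerun without repeating the earlier ones. The sketch below assumes JSON-serializable stage outputs and a hypothetical `run_stage` helper; it is not Rocco's actual tooling.

```python
import json
import logging
from pathlib import Path
from typing import Any, Callable

logger = logging.getLogger("pipeline")  # hypothetical logger name


def run_stage(name: str, compute: Callable[[], Any], cache_dir: str,
              force: bool = False) -> Any:
    """Run a pipeline stage, reusing its saved output when one exists.

    Outputs are stored as JSON files so they can be opened and inspected by hand.
    """
    path = Path(cache_dir) / f"{name}.json"
    if path.exists() and not force:
        logger.info("stage %s: loaded checkpoint %s", name, path)
        return json.loads(path.read_text(encoding="utf-8"))
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result, indent=2), encoding="utf-8")
    logger.info("stage %s: computed and saved %s", name, path)
    return result
```

With research and outline outputs cached this way, a change to the scripting prompt can be tested in minutes against saved inputs instead of waiting for a full run.
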
The automated evaluation system emerged as both a quality assurance tool and a development accelerator. By establishing quantitative metrics for script quality, the system enables rapid assessment of changes to prompting, model selection, or pipeline architecture without subjective human review of every test output.
Results and Impact
Rocco has been in daily production use for several weeks, generating personalized content ranging from cosmology and physics deep-dives for evening listening to morning news briefings. The system's ability to autonomously research, script, and produce episodes on schedule has transformed content consumption from passive browsing to curated, personally relevant audio programming.

The project demonstrates agent orchestration, multi-stage reasoning, and cross-modal AI integration while solving a genuine personal need. The architecture patterns developed, particularly the research-outline-script-edit pipeline and automated quality evaluation, are reusable frameworks applicable to other long-form content generation challenges.
Future Directions
Rocco establishes a foundation for exploring more sophisticated podcast formats, multi-episode series with narrative continuity, and integration of user feedback loops for continuous personalization. The vector database architecture enables potential features like cross-episode knowledge graphs and recurring segment automation based on emerging topics in saved research areas.