ClipForge
**ClipForge** is a free, self-hosted tool that turns long YouTube videos into ready-to-publish vertical Shorts, Reels, and TikToks. It combines YouTube's viewer heatmap data, local Whisper transcription, and local LLMs (via Ollama) to pinpoint the most engaging hook and story segment, reframes that segment to a 9:16 vertical crop with FFmpeg, and hands you a high-quality video file. Everything runs on your local machine, with zero API costs.

The Challenge
Repurposing long-form video content into highly engaging, vertical short-form clips (9:16 format) has become a necessity for modern content creators looking to grow their audience. However, executing this workflow at scale presents significant technical and economic challenges:
1. The High Cost of Automated SaaS Platforms
Most existing tools that automate the clipping process rely on proprietary cloud APIs (like OpenAI's GPT-4 or Whisper API). These platforms charge high subscription fees or bill in proportion to video duration. For creators processing hours of podcasts or live streams daily, these costs quickly become unsustainable.
2. Algorithmic Ambiguity in Content Selection
Selecting a viral segment is not as simple as clipping the most viewed section of a video. Traditional automated tools often rely on basic metrics (like audio volume spikes or raw YouTube viewer heatmaps) that select generic, low-quality moments. True virality requires a multi-stage editorial approach:
- The Hook (First 2 Seconds): Capturing viewer attention instantly to prevent scroll-away.
- Narrative Coherence: Ensuring the segment tells a complete, satisfying story with a clear setup and payoff, rather than cutting off mid-sentence.
- Engagement Flow: Identifying and eliminating "dead zones" (awkward silences, filler words like "um" or "like") that disrupt retention.
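As a rough illustration of the "dead zone" idea, the sketch below flags long silences between timestamped transcript segments and measures how much of a segment is filler. It assumes segments carry `start`, `end`, and `text` fields (similar in spirit to Whisper's timestamped output). The `Segment` class, the `min_gap` threshold, and the filler list are hypothetical choices for this example, not ClipForge's actual implementation.

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the beginning of the video
    end: float
    text: str


def find_dead_zones(segments, min_gap=1.5):
    """Return (start, end) silences of at least min_gap seconds between segments."""
    ordered = sorted(segments, key=lambda s: s.start)
    gaps = []
    for prev, nxt in zip(ordered, ordered[1:]):
        if nxt.start - prev.end >= min_gap:
            gaps.append((prev.end, nxt.start))
    return gaps


def filler_ratio(text, fillers=("um", "uh")):
    """Fraction of words that are filler words (illustrative filler list)."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(word in fillers for word in words) / len(words)


segments = [
    Segment(0.0, 2.0, "Here is the setup."),
    Segment(2.2, 5.0, "Um, so, uh, this works"),
    Segment(8.0, 10.0, "And here is the payoff."),
]
print(find_dead_zones(segments))        # silences long enough to cut
print(filler_ratio(segments[1].text))   # share of filler words
```

A word like "like" is deliberately left out of the default list, since it is often a legitimate word and would need context to classify.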
3. Context Window and Performance Limits of Local LLMs
Moving away from expensive cloud APIs to local open-source models (like gemma3:4b or qwen2.5:7b via Ollama) introduces hardware constraints. On consumer-grade laptops, local LLMs are limited by available memory and by their context window, which is often restricted to 4,096 or 8,192 tokens. Long video transcripts easily exceed those limits. Sending a full-length transcript to a local LLM in one request causes context overflow, severe generation slowdowns, or API timeouts, and the pipeline fails.
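One common way to stay inside a small context window is to split the transcript into chunks that each fit a token budget and evaluate them separately. The sketch below is a minimal illustration of that approach, not ClipForge's actual code. The function names are hypothetical, and the 4-characters-per-token estimate is a rough assumption; a real pipeline would use the model's own tokenizer.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token.
    return max(1, len(text) // 4)


def chunk_transcript(lines, max_tokens):
    """Greedily pack consecutive transcript lines into chunks under max_tokens.

    A single line that exceeds the budget on its own becomes its own chunk.
    """
    chunks, current, used = [], [], 0
    for line in lines:
        cost = estimate_tokens(line)
        # Start a new chunk when adding this line would exceed the budget.
        if current and used + cost > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(line)
        used += cost
    if current:
        chunks.append(current)
    return chunks


# Five lines of roughly 10 tokens each, packed into a 25-token budget.
lines = ["a" * 40] * 5
print([len(chunk) for chunk in chunk_transcript(lines, max_tokens=25)])
```

In practice the budget should be well below the model's context window, because the prompt instructions and the generated answer consume tokens too.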
4. Heavy Local Processing Bottlenecks
Orchestrating transcription (Whisper), AI evaluation (Ollama), and video encoding/reframing (FFmpeg) on a single local machine requires a highly efficient, non-blocking pipeline. Without proper concurrency limits and timeout management, a single 15-minute video can freeze the user interface or cause local API requests to time out prematurely, rendering the tool ineffective for longer content like podcasts.
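The concurrency and timeout requirement can be sketched with `asyncio`: a semaphore caps how many heavy jobs run at once, and `asyncio.wait_for` bounds each job's runtime. This is an illustrative pattern under stated assumptions, not ClipForge's actual pipeline. `run_limited`, its default limits, and the stand-in jobs are hypothetical names and values.

```python
import asyncio


async def run_limited(jobs, max_concurrent=2, timeout_s=600.0):
    """Run zero-argument coroutine functions with a concurrency cap and per-job timeout.

    Returns results in input order; a job that times out yields None.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(job):
        async with semaphore:  # wait for a free slot before starting
            try:
                return await asyncio.wait_for(job(), timeout=timeout_s)
            except asyncio.TimeoutError:
                return None

    return await asyncio.gather(*(guarded(job) for job in jobs))


async def fast_job():
    return "ok"


async def slow_job():
    await asyncio.sleep(1)  # stands in for a long-running stage
    return "late"


if __name__ == "__main__":
    print(asyncio.run(run_limited([fast_job, slow_job], max_concurrent=1, timeout_s=0.05)))
```

CPU-bound or blocking stages would still need to run off the event loop, for example through `asyncio.to_thread` or `asyncio.create_subprocess_exec`, so that the interface stays responsive while they work.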