{"id":475494,"date":"2026-04-11T10:52:48","date_gmt":"2026-04-11T10:52:48","guid":{"rendered":"https:\/\/savepearlharbor.com\/?p=475494"},"modified":"-0001-11-30T00:00:00","modified_gmt":"-0001-11-29T21:00:00","slug":"","status":"publish","type":"post","link":"https:\/\/savepearlharbor.com\/?p=475494","title":{"rendered":"How I Built an \u201cAnime Factory\u201d: a System That Automatically Turns Episodes into YouTube Shorts"},"content":{"rendered":"<div xmlns=\"http:\/\/www.w3.org\/1999\/xhtml\">\n<figure class=\"bordered full-width \"><img decoding=\"async\" src=\"https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/00d\/fc6\/12a\/00dfc612a4a25cc6133f598a10177c25.png\" width=\"1536\" height=\"1024\" sizes=\"auto, (max-width: 780px) 100vw, 50vw\" srcset=\"https:\/\/habrastorage.org\/r\/w780\/getpro\/habr\/upload_files\/00d\/fc6\/12a\/00dfc612a4a25cc6133f598a10177c25.png 780w,&#10;       https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/00d\/fc6\/12a\/00dfc612a4a25cc6133f598a10177c25.png 781w\" loading=\"lazy\" decode=\"async\"\/><\/figure>\n<p>Hi, Habr!<\/p>\n<p>Over the past few months, I have been building a system that I internally call an <strong>\u201canime factory\u201d<\/strong>: it takes a source episode as input and produces a ready-to-publish YouTube Short with dynamic reframing, subtitles, post-processing, and metadata.<\/p>\n<p>What makes it interesting is not just the fact that editing can be automated, but that a significant part of this work can be decomposed into engineering stages: transcription, audio and scene analysis, strong-moment discovery, \u201cvirtual camera\u201d control, and a feedback loop based on performance metrics.<\/p>\n<p>In this article, I will show how this pipeline is structured, why I chose a modular architecture instead of an end-to-end black box, where the system broke, and which decisions eventually made it actually usable.<\/p>\n<h4>Where the idea came from<\/h4>\n<p>For a long time, I kept 
running into the same problem: any digital product without users is effectively dead. You can build backend systems, automation, and pipelines all day long, but if the project has no distribution channel and no audience attention, it barely moves forward.<\/p>\n<p>My first attempts to automate content-related tasks started back in 2020. At that time, they were simpler ideas around TikTok, Telegram, and content promotion. But manual work hits a ceiling very quickly: finding moments, cutting clips, adding subtitles, converting to vertical format, packaging, publishing \u2014 all of that takes too much time and barely scales. One person can produce a few videos per day. A system can produce dozens or hundreds.<\/p>\n<p>At some point, I formulated the problem correctly for myself: I did not need an \u201cediting script.\u201d I needed an actual production loop that turns long-form video into a stream of short clips with minimal manual involvement.<\/p>\n<p>That is how the \u201canime factory\u201d was born.<\/p>\n<h4>What problem the system actually solves<\/h4>\n<p>In simplified form, the task sounds like this: take a long horizontal episode and automatically turn it into a short vertical video that works as a self-contained Short.<\/p>\n<p>But once you decompose it into engineering subproblems, a whole set of non-obvious requirements appears immediately:<\/p>\n<ul>\n<li>\n<p>You need to understand where the episode contains potentially strong moments.<\/p>\n<\/li>\n<li>\n<p>You need to select fragments that work as a micro-story, not just as a random chunk torn out of context.<\/p>\n<\/li>\n<li>\n<p>You need to adapt 16:9 into 9:16 without losing the main character, the emotion, or the visual focus of the scene.<\/p>\n<\/li>\n<li>\n<p>You need subtitles that are quick to read and do not kill the image.<\/p>\n<\/li>\n<li>\n<p>You need to assemble all of this into a stable batch pipeline where individual stages can be restarted independently.<\/p>\n<\/li>\n<li>\n<p>You 
need to teach the system to analyze publishing results and adjust future selection logic.<\/p>\n<\/li>\n<\/ul>\n<p>At that point, it becomes clear that this is no longer \u201cjust a little editing script,\u201d but a fairly mature engineering system with its own artifacts, errors, quality degradation modes, fallback mechanisms, and feedback loops.<\/p>\n<h4>Why simple automatic clipping does not work<\/h4>\n<p>From the outside, it looks like the problem should be easy to solve. For example:<\/p>\n<ul>\n<li>\n<p>split the video into equal 30-second chunks;<\/p>\n<\/li>\n<li>\n<p>pick the loudest moments;<\/p>\n<\/li>\n<li>\n<p>crop to the center;<\/p>\n<\/li>\n<li>\n<p>overlay auto-generated subtitles.<\/p>\n<\/li>\n<\/ul>\n<p>In practice, that approach almost always produces garbage.<\/p>\n<p>A loud moment is not necessarily an interesting one. An interesting moment does not necessarily have a good visual focus. A line can be strong only in the context of the previous five seconds. A character\u2019s face can drift out of a centered crop. A scene with two characters falls apart completely if you simply keep a static window in the middle.<\/p>\n<p>So the core idea behind my pipeline was this: do not rely on a single signal. Do not select moments only by text. Do not crop only by center. Do not try to make one model guess the entire process end to end. 
Instead, combine several relatively independent signal sources into a decision-making system.<\/p>\n<h4>Architecture: what the \u201cfactory\u201d consists of<\/h4>\n<figure class=\"full-width \"><img decoding=\"async\" src=\"https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/80f\/e0e\/c56\/80fe0ec56ed6a4ac4745fdc3c5b8f906.png\" alt=\"Overall block diagram of the pipeline: Episode -&gt; Transcription -&gt; Audio Analysis -&gt; Scene\/Face Detection -&gt; Candidate Scoring -&gt; Dynamic Crop -&gt; Subtitles\/Post-processing -&gt; Export\/Publish -&gt; Analytics Feedback Loop.\" title=\"Overall block diagram of the pipeline: Episode -&gt; Transcription -&gt; Audio Analysis -&gt; Scene\/Face Detection -&gt; Candidate Scoring -&gt; Dynamic Crop -&gt; Subtitles\/Post-processing -&gt; Export\/Publish -&gt; Analytics Feedback Loop.\" width=\"1014\" height=\"211\" sizes=\"auto, (max-width: 780px) 100vw, 50vw\" srcset=\"https:\/\/habrastorage.org\/r\/w780\/getpro\/habr\/upload_files\/80f\/e0e\/c56\/80fe0ec56ed6a4ac4745fdc3c5b8f906.png 780w,&#10;       https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/80f\/e0e\/c56\/80fe0ec56ed6a4ac4745fdc3c5b8f906.png 781w\" loading=\"lazy\" decode=\"async\"\/><\/p>\n<div><figcaption><em>Overall block diagram of the pipeline: Episode -&gt; Transcription -&gt; Audio Analysis -&gt; Scene\/Face Detection -&gt; Candidate Scoring -&gt; Dynamic Crop -&gt; Subtitles\/Post-processing -&gt; Export\/Publish -&gt; Analytics Feedback Loop.<\/em><\/figcaption><\/div>\n<\/figure>\n<div>\n<div class=\"table\">\n<table>\n<tbody>\n<tr>\n<th>\n<p align=\"left\">Loop<\/p>\n<\/th>\n<th>\n<p align=\"left\">Purpose<\/p>\n<\/th>\n<th>\n<p align=\"left\">Main output<\/p>\n<\/th>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">Production<\/p>\n<\/td>\n<td>\n<p align=\"left\">Generate videos from the source episode<\/p>\n<\/td>\n<td>\n<p align=\"left\">A ready Short<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">R&amp;D \/ 
Analytics<\/p>\n<\/td>\n<td>\n<p align=\"left\">Analyze published videos and update heuristics<\/p>\n<\/td>\n<td>\n<p align=\"left\">New weights and trigger dictionaries<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">Community<\/p>\n<\/td>\n<td>\n<p align=\"left\">Automate interaction around the channel<\/p>\n<\/td>\n<td>\n<p align=\"left\">Replies, warm-up, engagement<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<p>At a high level, my system breaks down into three major loops:<\/p>\n<ol>\n<li>\n<p><strong>Production loop<\/strong> \u2014 the main line that generates videos.<\/p>\n<\/li>\n<li>\n<p><strong>R&amp;D \/ Analytics loop<\/strong> \u2014 analysis of already published videos and heuristic updates.<\/p>\n<\/li>\n<li>\n<p><strong>Community \/ Interaction loop<\/strong> \u2014 additional automation around audience interaction.<\/p>\n<\/li>\n<\/ol>\n<p>Let\u2019s go through each of them in more detail.<\/p>\n<h4>1. Production loop: from episode to finished Short<\/h4>\n<p>This is the heart of the whole system. This is where the source media content goes through all processing stages and becomes a final vertical video.<\/p>\n<h3>Stage 1. Getting the source material<\/h3>\n<p>To make the pipeline easier to debug, I intentionally avoided the \u201cone giant script that does everything\u201d approach and instead went for explicit intermediate artifacts.<\/p>\n<pre><code>episode_001\/\n  source.mp4\n  transcript.json\n  audio_features.json\n  scene_cuts.json\n  faces.json\n  candidates.json\n  crop_path.json\n  subtitles.srt\n  metadata.json\n  final_short_01.mp4<\/code><\/pre>\n<p>This structure is important not for aesthetics, but because it allows individual stages to be recomputed independently. 
For example, I can rebuild <code>crop_path<\/code> without retranscribing the entire episode, or change subtitle logic without rerunning scene analysis.<\/p>\n<figure class=\"bordered full-width \"><img decoding=\"async\" src=\"https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/c0c\/bf0\/063\/c0cbf00634456eed9d56ba6d81526b89.png\" width=\"745\" height=\"494\" sizes=\"auto, (max-width: 780px) 100vw, 50vw\" srcset=\"https:\/\/habrastorage.org\/r\/w780\/getpro\/habr\/upload_files\/c0c\/bf0\/063\/c0cbf00634456eed9d56ba6d81526b89.png 780w,&#10;       https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/c0c\/bf0\/063\/c0cbf00634456eed9d56ba6d81526b89.png 781w\" loading=\"lazy\" decode=\"async\"\/><\/figure>\n<p>At the pipeline entrance, an episode arrives. For the system, it is just raw material: a video file that must be parsed, indexed, scored, and turned into several potential short-clip candidates.<\/p>\n<p>Even at this stage, it was important not to build something that simply \u201cdownloads the file and moves on,\u201d but to introduce a proper artifact structure. For each episode, the system stores separate intermediate results: metadata, transcripts, timestamped clip candidates, CV analysis results, detected faces, crop parameters, and final renders. That may sound like a boring infrastructure detail, but it is exactly what makes the system maintainable.<\/p>\n<p>If I had to rerun the entire episode from scratch every time, development would have been painful. With this design, I can recompute only dynamic cropping or only subtitle logic without touching the rest of the pipeline.<\/p>\n<h3>Stage 2. Transcription and working with speech<\/h3>\n<p>The next layer is turning audio into timestamped text. At this point, the system gets not just one continuous transcript, but speech segments tied to time. 
This matters for two reasons:<\/p>\n<ul>\n<li>\n<p>First, the text itself already provides a strong signal about scene content.<\/p>\n<\/li>\n<li>\n<p>Second, the same segments are later used for subtitles and for binding semantic fragments back to the video.<\/p>\n<\/li>\n<\/ul>\n<p>But I quickly discovered that \u201ctake the transcript and search for interesting lines\u201d is not enough.<\/p>\n<p>Multimedia content has an unpleasant property: the emotional force of a scene is not always in the text. Sometimes the text is neutral, but the scene has powerful music, a tense pause, a camera cut, or a strong facial expression. Sometimes it is the opposite: the line itself is strong, but without visual context it does not work.<\/p>\n<p>So for me, the transcript is one signal \u2014 not the single source of truth.<\/p>\n<h3>Stage 3. Audio analysis<\/h3>\n<p>In simplified form, one of the internal audio passes looks like this:<\/p>\n<pre><code class=\"python\">def extract_audio_signal(window):\n    speech_density = measure_speech_density(window)\n    loudness_peak = detect_loudness_peak(window)\n    energy_delta = detect_energy_change(window)\n    return (\n        0.45 * speech_density +\n        0.35 * loudness_peak +\n        0.20 * energy_delta\n    )<\/code><\/pre>\n<p>Of course, the real implementation is more complex: it includes normalization, thresholds, protection against false spikes, and combinations with other signals. 
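As an illustration of that "protection against false spikes," here is a minimal sketch of one way to do it. The function name mirrors the `detect_loudness_peak` call from the snippet, but the body and constants are my own assumptions, not the article's real code: a peak only counts if loudness stays above a multiple of the window's median for several consecutive frames, so a single noisy frame cannot register as an emotional spike.

```python
from statistics import median

def detect_loudness_peak(samples, ratio=1.5, min_frames=3):
    """Return 1.0 only if loudness stays above ratio x the window's
    median for at least min_frames consecutive frames; a single
    noisy frame therefore cannot register as a peak."""
    base = median(samples)
    if base <= 0:
        return 0.0
    run, best = 0, 0
    for x in samples:
        run = run + 1 if x > ratio * base else 0
        best = max(best, run)
    return 1.0 if best >= min_frames else 0.0
```

With these illustrative constants, a lone 5.0 spike in an otherwise quiet window is rejected, while a sustained loud section is kept.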
But the core idea is the same: audio is not used as a standalone oracle, but as another layer in evaluating a moment.<\/p>\n<figure class=\"bordered full-width \"><img decoding=\"async\" src=\"https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/7e7\/4ca\/00b\/7e74ca00b32034400aa5a5015124503c.png\" alt=\"A timeline with audio peaks and highlighted windows where the system sees increased emotional density.\" title=\"A timeline with audio peaks and highlighted windows where the system sees increased emotional density.\" width=\"1536\" height=\"1024\" sizes=\"auto, (max-width: 780px) 100vw, 50vw\" srcset=\"https:\/\/habrastorage.org\/r\/w780\/getpro\/habr\/upload_files\/7e7\/4ca\/00b\/7e74ca00b32034400aa5a5015124503c.png 780w,&#10;       https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/7e7\/4ca\/00b\/7e74ca00b32034400aa5a5015124503c.png 781w\" loading=\"lazy\" decode=\"async\"\/><\/p>\n<div><figcaption><em>A timeline with audio peaks and highlighted windows where the system sees increased emotional density.<\/em><\/figcaption><\/div>\n<\/figure>\n<p>In parallel with text, the system analyzes the audio track itself. I look not only at the presence of speech, but also at the energy structure: loudness peaks, emotional spikes, transitions, sections with pronounced sound dynamics, musical pressure, and so on.<\/p>\n<p>The purpose of this stage is not to blindly choose the loudest chunk, but to add another axis of evaluation. In real videos, what often works is the combination of:<\/p>\n<ul>\n<li>\n<p>a strong short line,<\/p>\n<\/li>\n<li>\n<p>a pronounced audio transition,<\/p>\n<\/li>\n<li>\n<p>a visual accent in the frame.<\/p>\n<\/li>\n<\/ul>\n<p>If you use only text, you miss these scenes. If you use only audio, you collect meaningless explosions and screams. Together, the signals work much better.<\/p>\n<h3>Stage 4. 
Computer Vision: scenes, faces, and visual events<\/h3>\n<p>In simplified form, useful visual signal detection looks something like this:<\/p>\n<pre><code class=\"python\">def analyze_frame(frame):\n    faces = detect_faces(frame)\n    scene_score = detect_scene_change(frame)\n    face_focus_score = estimate_face_focus(faces, frame)\n    return {\n        \"faces\": faces,\n        \"scene_score\": scene_score,\n        \"face_focus_score\": face_focus_score,\n    }<\/code><\/pre>\n<p>In practice, what matters here is not just the fact that face detection exists, but how that data is used downstream: can we confidently build a vertical crop window, does it make sense to hold on one character, is there a transition between characters, does the composition fall apart?<\/p>\n<figure class=\"bordered full-width \"><img decoding=\"async\" src=\"https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/504\/8f0\/2f7\/5048f02f73e82f4f42034bce79499f1b.png\" width=\"784\" height=\"344\" sizes=\"auto, (max-width: 780px) 100vw, 50vw\" srcset=\"https:\/\/habrastorage.org\/r\/w780\/getpro\/habr\/upload_files\/504\/8f0\/2f7\/5048f02f73e82f4f42034bce79499f1b.png 780w,&#10;       https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/504\/8f0\/2f7\/5048f02f73e82f4f42034bce79499f1b.png 781w\" loading=\"lazy\" decode=\"async\"\/><\/figure>\n<p>The next major block is computer vision. 
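The `estimate_face_focus` call in the earlier sketch carries a lot of weight. One plausible way to score it, purely as an illustration (the box format, constants, and weighting are my assumptions, not the pipeline's actual metric): check whether the largest face fits inside a 9:16 window cut at full frame height, and reward larger, centerable faces.

```python
def estimate_face_focus(faces, frame_w, frame_h):
    """Score 0..1 for how well a 9:16 crop could hold a face.
    faces: (x, y, w, h) pixel boxes. Illustrative heuristic only."""
    if not faces:
        return 0.0
    crop_w = frame_h * 9 / 16                           # vertical window at full height
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # largest face = main character
    if w > crop_w:                                      # face wider than the window
        return 0.0
    # Bigger face relative to the crop area -> stronger vertical focus.
    size_score = min(1.0, (w * h) / (crop_w * frame_h * 0.25))
    cx = x + w / 2
    # Penalize faces so close to the edge that the window cannot center on them.
    fits = crop_w / 2 <= cx <= frame_w - crop_w / 2
    return size_score if fits else size_score * 0.5
```

Under this sketch, a face hugging the left border of a 1920x1080 frame scores half of what the same face scores near the center, which is exactly the kind of signal candidate scoring can consume.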
Here the system solves several tasks at once:<\/p>\n<ul>\n<li>\n<p>detects scene changes;<\/p>\n<\/li>\n<li>\n<p>determines whether there is a face in the frame and where it is;<\/p>\n<\/li>\n<li>\n<p>estimates whether the frame is suitable for vertical focus;<\/p>\n<\/li>\n<li>\n<p>extracts visual features that later participate in candidate scoring.<\/p>\n<\/li>\n<\/ul>\n<p>In practice, this turned out to be one of the most useful layers in the whole system. Without faces and scene analysis, vertical adaptation was too crude. A centered crop destroys a large part of the image\u2019s meaning: one character may stand on the left, another on the right, while the center of the frame contains almost nothing interesting.<\/p>\n<p>Once the system started tracking faces and their positions, it became possible to build a <strong>\u201cvirtual camera\u201d<\/strong> \u2014 not just crop the video, but imitate camera work within the original frame.<\/p>\n<h3>Stage 5. Finding clip candidates<\/h3>\n<div>\n<div class=\"table\">\n<table>\n<tbody>\n<tr>\n<th>\n<p align=\"left\">Signal<\/p>\n<\/th>\n<th>\n<p align=\"left\">What it evaluates<\/p>\n<\/th>\n<th>\n<p align=\"left\">Why it matters<\/p>\n<\/th>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">Transcript signal<\/p>\n<\/td>\n<td>\n<p align=\"left\">Density and meaningfulness of lines<\/p>\n<\/td>\n<td>\n<p align=\"left\">To understand whether there is a semantic hook<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">Audio signal<\/p>\n<\/td>\n<td>\n<p align=\"left\">Emotional peaks and dynamics<\/p>\n<\/td>\n<td>\n<p align=\"left\">To avoid missing strong audio-driven moments<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">Face signal<\/p>\n<\/td>\n<td>\n<p align=\"left\">Presence of the main character in frame<\/p>\n<\/td>\n<td>\n<p align=\"left\">To determine whether vertical focus is feasible<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">Scene signal<\/p>\n<\/td>\n<td>\n<p align=\"left\">Scene changes and visual 
density<\/p>\n<\/td>\n<td>\n<p align=\"left\">To avoid empty or visually weak windows<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">Pacing signal<\/p>\n<\/td>\n<td>\n<p align=\"left\">Tempo and internal rhythm of the fragment<\/p>\n<\/td>\n<td>\n<p align=\"left\">To filter out sluggish or overly chaotic parts<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<p>After text, audio, and CV signals are collected, the system forms candidate clips.<\/p>\n<p>This is not one timeline pass with a simple rule like \u201cevery 30 seconds take the best fragment.\u201d Instead, the video is decomposed into potential windows, a feature set is computed for each one, and then a final score is calculated.<\/p>\n<p>In simplified form, the logic looks like this:<\/p>\n<pre><code class=\"python\">score = (\n    transcript_weight * transcript_signal +\n    audio_weight * audio_signal +\n    face_weight * face_signal +\n    scene_weight * scene_signal +\n    pacing_weight * pacing_signal\n)<\/code><\/pre>\n<p>Naturally, the real system is messier: it has penalties, thresholds, fallback heuristics, length limits, empty-fragment checks, duplicate filtering, and re-evaluation of neighboring windows. But the core idea is exactly this: do not make the decision from a single feature, but combine several relatively weak signals into one more stable score.<\/p>\n<p>An important nuance: the system is not looking for simply \u201can interesting 20 seconds,\u201d but for fragments that have a chance to feel like a complete micro-episode. That strongly affects output quality. A Shorts viewer does not need to know the context of the full episode, so the clip should still hold together on its own.<\/p>\n<h3>Stage 6. 
Dynamic reframing \u2014 the \u201cvirtual camera\u201d<\/h3>\n<p>Internally, this is closer to a constrained state machine than to magic:<\/p>\n<pre><code class=\"python\">def update_crop_window(prev_window, target_focus, dt):\n    desired_window = build_window_around_focus(target_focus)\n    smoothed_window = smooth_transition(prev_window, desired_window, dt)\n    limited_window = limit_shift_speed(smoothed_window, prev_window, dt)\n    return clamp_to_frame(limited_window)<\/code><\/pre>\n<p>Three things are fundamentally important here:<\/p>\n<ul>\n<li>\n<p>the system must not twitch because of noisy detections;<\/p>\n<\/li>\n<li>\n<p>the window must not move faster than a visually comfortable speed;<\/p>\n<\/li>\n<li>\n<p>when the face disappears, a fallback must activate instead of a chaotic jump.<\/p>\n<figure class=\"bordered full-width \"><img decoding=\"async\" src=\"https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/015\/f94\/0b5\/015f940b5daafe5139776259fad4eb5f.png\" width=\"981\" height=\"438\" sizes=\"auto, (max-width: 780px) 100vw, 50vw\" srcset=\"https:\/\/habrastorage.org\/r\/w780\/getpro\/habr\/upload_files\/015\/f94\/0b5\/015f940b5daafe5139776259fad4eb5f.png 780w,&#10;       https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/015\/f94\/0b5\/015f940b5daafe5139776259fad4eb5f.png 781w\" loading=\"lazy\" decode=\"async\"\/><\/figure>\n<\/li>\n<\/ul>\n<figure class=\"bordered full-width \"><img decoding=\"async\" src=\"https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/cf6\/aed\/725\/cf6aed72571ae3d885291faed45c5fc6.png\" alt=\"A mini-diagram of the crop window moving across several adjacent frames.\" title=\"A mini-diagram of the crop window moving across several adjacent frames.\" 
width=\"1536\" height=\"1024\" sizes=\"auto, (max-width: 780px) 100vw, 50vw\" srcset=\"https:\/\/habrastorage.org\/r\/w780\/getpro\/habr\/upload_files\/cf6\/aed\/725\/cf6aed72571ae3d885291faed45c5fc6.png 780w,&#10;       https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/cf6\/aed\/725\/cf6aed72571ae3d885291faed45c5fc6.png 781w\" loading=\"lazy\" decode=\"async\"\/><\/p>\n<div><figcaption><em>A mini-diagram of the crop window moving across several adjacent frames.<\/em><\/figcaption><\/div>\n<\/figure>\n<p>This is probably the most interesting and also the most temperamental part of the whole system.<\/p>\n<p>If we have a horizontal video and want to turn it into a vertical Short, there are several options:<\/p>\n<ul>\n<li>\n<p>do a dumb centered crop;<\/p>\n<\/li>\n<li>\n<p>choose one static focus area;<\/p>\n<\/li>\n<li>\n<p>try to control the crop window dynamically.<\/p>\n<\/li>\n<\/ul>\n<p>The first two approaches quickly showed their limitations, so I moved to the third.<\/p>\n<p>The \u201cvirtual camera\u201d logic is roughly as follows:<\/p>\n<ul>\n<li>\n<p>if there is one obvious character in frame, the camera tries to keep them in focus;<\/p>\n<\/li>\n<li>\n<p>if there are multiple faces, it chooses a strategy somewhere between holding the main object and smoothly shifting between characters;<\/p>\n<\/li>\n<li>\n<p>if faces disappear temporarily, fallback logic kicks in so that the camera does not jerk around;<\/p>\n<\/li>\n<li>\n<p>all movement is smoothed to avoid the feel of broken auto-tracking.<\/p>\n<\/li>\n<\/ul>\n<p>From an engineering perspective, this turned out to be much closer to state control than to \u201cmagical AI.\u201d Inertia, stabilization, shift-speed limits, protection against shaky detections, and proper handling of object disappearance are all crucial.<\/p>\n<p>The most annoying part of this module is that a formally \u201ccorrect\u201d solution does not always look good visually. 
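One concrete example of that tension, with made-up constants rather than the production values: a dead zone plus capped easing for the horizontal window center, so sub-threshold face jitter is ignored entirely and large moves are spread over several frames instead of snapping.

```python
def follow_focus(prev_x, face_x, dead_zone=40, alpha=0.15, max_step=25):
    """Per-frame update of the crop window's horizontal center.
    dead_zone: ignore face jitter smaller than this (pixels);
    alpha: exponential easing toward the target;
    max_step: hard cap on per-frame movement (pixels).
    All constants here are illustrative, not the production values."""
    error = face_x - prev_x
    if abs(error) < dead_zone:       # noise-sized motion: hold the camera still
        return prev_x
    step = alpha * error             # ease toward the face, do not snap
    step = max(-max_step, min(max_step, step))
    return prev_x + step
```

A 10-pixel detection wobble leaves the window untouched, while a 200-pixel jump is traversed at no more than `max_step` pixels per frame.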
The camera can mathematically follow the face perfectly and still make the clip unpleasant to watch. So I had to balance tracking precision against visual smoothness.<\/p>\n<h3>Stage 7. Subtitles and post-processing<\/h3>\n<p>For subtitles, it is important not only <em>what<\/em> is written, but <em>how<\/em> the text is split into lines and timed. In simplified form, the packing logic looks like this:<\/p>\n<pre><code class=\"python\">def build_subtitle_lines(segment, max_chars=24):\n    words = segment[\"text\"].split()\n    lines = wrap_words(words, max_chars=max_chars)\n    return highlight_keywords(lines)<\/code><\/pre>\n<p>In reality, this layer also accounts for line breaks, line length, readability on a mobile screen, synchronization with speech, and visual emphasis of key words.<\/p>\n<div>\n<div class=\"table\">\n<table>\n<tbody>\n<tr>\n<th>\n<p align=\"left\">Poor version<\/p>\n<\/th>\n<th>\n<p align=\"left\">Better version<\/p>\n<\/th>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">Long lines taking up half the screen<\/p>\n<\/td>\n<td>\n<p align=\"left\">Short, readable lines<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">Random line breaks<\/p>\n<\/td>\n<td>\n<p align=\"left\">Meaningful breaks by phrase<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">Tiny text<\/p>\n<\/td>\n<td>\n<p align=\"left\">Phone-readable size<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">Uniform presentation<\/p>\n<\/td>\n<td>\n<p align=\"left\">Highlighting key words<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<figure class=\"bordered full-width \"><img decoding=\"async\" src=\"https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/da6\/825\/61d\/da682561d2c3b5e7d8af8ab83e9606bd.png\" width=\"1536\" height=\"1024\" 
sizes=\"auto, (max-width: 780px) 100vw, 50vw\" srcset=\"https:\/\/habrastorage.org\/r\/w780\/getpro\/habr\/upload_files\/da6\/825\/61d\/da682561d2c3b5e7d8af8ab83e9606bd.png 780w,&#10;       https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/da6\/825\/61d\/da682561d2c3b5e7d8af8ab83e9606bd.png 781w\" loading=\"lazy\" decode=\"async\"\/><\/figure>\n<p>After selecting the moment and building the virtual camera trajectory, the system moves into final packaging.<\/p>\n<p>At this stage, more familiar steps kick in:<\/p>\n<ul>\n<li>\n<p>subtitle rendering by timecode;<\/p>\n<\/li>\n<li>\n<p>line-length limits and line-break control;<\/p>\n<\/li>\n<li>\n<p>visual emphasis of important words;<\/p>\n<\/li>\n<li>\n<p>loudness normalization;<\/p>\n<\/li>\n<li>\n<p>watermarking;<\/p>\n<\/li>\n<li>\n<p>speed correction and additional video effects;<\/p>\n<\/li>\n<li>\n<p>final export.<\/p>\n<\/li>\n<\/ul>\n<p>Subtitles, by the way, turned out not to be a decorative detail, but a part of the attention-retention mechanics. Poorly typeset auto-subtitles kill perception very quickly. Well-assembled ones, on the contrary, hold the viewer\u2019s gaze even when the person is watching without sound or only half-paying attention.<\/p>\n<p>That is why this layer is not just \u201cburn the transcript onto the video.\u201d It has its own composition and presentation logic.<\/p>\n<h3>Stage 8. Metadata and release<\/h3>\n<p>After rendering, the clip receives packaging data: title, description, set of tags, and auxiliary fields for publishing and notifications. An important detail here is that video production does not end with the <code>mp4<\/code> file. 
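To make this concrete, here is what a per-clip metadata artifact could look like. Apart from the title, description, and tags fields mentioned above, the field names and structure are hypothetical, not the pipeline's actual schema.

```python
import json

def build_metadata(candidate, episode_id):
    # Hypothetical schema for the metadata.json artifact; only
    # title/description/tags come from the article, the rest is assumed.
    return {
        "episode": episode_id,
        "source_window": [candidate["start"], candidate["end"]],  # seconds in the episode
        "title": candidate["title"],
        "description": candidate["description"],
        "tags": candidate["tags"],
        "score": candidate["score"],  # kept so the analytics loop can correlate later
    }

meta = build_metadata(
    {"start": 412.0, "end": 441.5, "title": "Clip title",
     "description": "Clip description", "tags": ["shorts", "anime"], "score": 0.82},
    episode_id="episode_001",
)
payload = json.dumps(meta)  # persisted next to the rendered mp4
```

Keeping the candidate score inside the artifact is a deliberate choice: it is what lets published performance be joined back to selection decisions later.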
For a normal content pipeline, you also need a packaging and delivery layer that moves the result further through the system.<\/p>\n<p>That is why the pipeline has separate steps for preparing metadata and notifications, so that the process does not get stuck at a manual \u201cI\u2019ll title and upload it later.\u201d<\/p>\n<h4>Why I chose a modular architecture instead of one big ML model<\/h4>\n<p>Whenever you describe a project like this, the question comes up almost immediately: why not make it end-to-end? For example, feed the video to a model and ask it to output a ready-made Short.<\/p>\n<p>The answer is very practical: because from an engineering standpoint, that would be much less convenient.<\/p>\n<p>A modular architecture provides several critically important advantages:<\/p>\n<ul>\n<li>\n<p>each stage can be debugged independently;<\/p>\n<\/li>\n<li>\n<p>a weak module can be replaced quickly without rewriting everything else;<\/p>\n<\/li>\n<li>\n<p>intermediate artifacts can be stored and reused;<\/p>\n<\/li>\n<li>\n<p>it becomes much easier to understand why the system made a particular decision;<\/p>\n<\/li>\n<li>\n<p>fallback scenarios and fail-soft behavior become possible.<\/p>\n<\/li>\n<\/ul>\n<p>If face detection performs poorly, I improve the CV layer. If the selected moments are weak, I change scoring. If the videos are jerky, I refine the virtual camera. It is a very engineering-driven approach: less magic, more observability and control.<\/p>\n<p>For a production system, this path turned out to be much more practical than one opaque black box.<\/p>\n<h4>Architectural principles without which this would quickly collapse<\/h4>\n<p>Over the course of building the system, I developed several principles without which a pipeline like this turns into an uncontrollable monolith very quickly.<\/p>\n<h3>1. Independent stages<\/h3>\n<p>Each stage should be able to work as an independent pipeline step. 
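A minimal sketch of what "independent stage" can mean in practice (the runner and stage registry here are illustrative, not the real code): each stage is gated on its output artifact, so deleting one file is enough to force exactly that stage, and only that stage, to rerun.

```python
from pathlib import Path

def run_stage(episode_dir, artifact, compute):
    """Run a stage only if its output artifact is missing."""
    out = Path(episode_dir) / artifact
    if not out.exists():
        out.write_text(compute())   # recompute just this stage
    return out

def run_pipeline(episode_dir, stages):
    # stages: ordered (artifact_name, compute_fn) pairs, e.g.
    # [("transcript.json", transcribe), ("crop_path.json", build_crop)]
    return [run_stage(episode_dir, name, fn) for name, fn in stages]
```

Running the pipeline twice over the same episode directory invokes each compute function only once; removing `crop_path.json` and rerunning redoes only the cropping stage.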
This allows me to rerun only the needed part of processing instead of wasting resources on the whole loop.<\/p>\n<figure class=\"bordered full-width \"><img decoding=\"async\" src=\"https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/1b8\/b77\/4ff\/1b8b774ff0524aefeb9968935946f5ef.png\" width=\"1536\" height=\"1024\" sizes=\"auto, (max-width: 780px) 100vw, 50vw\" srcset=\"https:\/\/habrastorage.org\/r\/w780\/getpro\/habr\/upload_files\/1b8\/b77\/4ff\/1b8b774ff0524aefeb9968935946f5ef.png 780w,&#10;       https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/1b8\/b77\/4ff\/1b8b774ff0524aefeb9968935946f5ef.png 781w\" loading=\"lazy\" decode=\"async\"\/><\/figure>\n<h3>2. Artifact persistence<\/h3>\n<p>Transcripts, detected faces, candidate windows, crop trajectories, final timecodes \u2014 all of that must be persisted between steps. Without that, any debugging process becomes torture.<\/p>\n<h3>3. Fail-soft instead of fail-fast<\/h3>\n<p>For an experimental pipeline, it is not enough to \u201cfail cleanly\u201d \u2014 it must be able to degrade into an acceptable result. If no face is found, use a fallback crop. If tracking is jerky, smooth it and limit the speed. If a confident signal disappears, reduce its weight and continue.<\/p>\n<h3>4. Simple heuristics are often more useful than \u201ccomplex magic\u201d<\/h3>\n<p>In many places, the most stable results did not come from heavy models, but from a combination of sane constraints, good thresholds, repeatable rules, and careful scoring.<\/p>\n<h4>2. Analytics loop: how the system learns from its own publications<\/h4>\n<p>If the story ended there, this would simply be a good clip generator. 
But for me, it was important to go further and build a loop that not only produces content, but also gradually adapts its heuristics based on what actually performs well.<\/p>\n<p>That is why I introduced a separate analytics worker.<\/p>\n<p>Its job is to periodically traverse published videos, collect data from the strongest-performing ones, and extract patterns that can later be used when forming the next batch of candidates.<\/p>\n<p>In practice, this layer solves tasks like these:<\/p>\n<ul>\n<li>\n<p>collecting successful videos from the channel;<\/p>\n<\/li>\n<li>\n<p>analyzing subtitle length, structure, and vocabulary;<\/p>\n<\/li>\n<li>\n<p>looking at which characters, words, scene types, and pacing patterns appear most often in successful publications;<\/p>\n<\/li>\n<li>\n<p>updating internal weights and trigger dictionaries;<\/p>\n<\/li>\n<li>\n<p>feeding those updates back into the production-loop scoring logic.<\/p>\n<\/li>\n<\/ul>\n<p>It is important to emphasize here: this is not \u201cfull self-learning\u201d in the academic sense. 
It is closer to an engineering feedback loop that allows the system to become less static.<\/p>\n<p>For example, if successful videos repeatedly feature specific characters, types of lines, or pacing styles, the system starts weighing those signals more heavily during the next selection cycle.<\/p>\n<pre><code class=\"python\">def update_trigger_weights(top_videos):\n    trigger_stats = collect_trigger_stats(top_videos)\n    return normalize_weights(trigger_stats)<\/code><\/pre>\n<p>This is not \u201ctraining a neural network from scratch,\u201d but an engineering mechanism for adjusting weights based on the observed behavior of already published videos.<\/p>\n<p>In essence, the loop looks like this:<\/p>\n<pre><code>production -&gt; publishing -&gt; metrics -&gt; analytics -&gt; heuristic updates -&gt; production<\/code><\/pre>\n<p>For me personally, this became one of the most interesting parts of the project, because this is exactly where an \u201cediting script\u201d turns into a system that accumulates applied knowledge about its domain.<\/p>\n<h4>How scoring works and why it changes all the time<\/h4>\n<p>Candidate scoring cannot be fixed once and then forgotten. On paper, you can always invent a beautiful formula, but the real viewer does not watch the formula \u2014 they watch the clip. 
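<\/p>\n<p>To make that kind of tuning cheap, it helps if the scoring function reads its weights, penalties, and limits from configuration instead of hard-coding them. Here is a minimal sketch of that idea (the signal names, weights, and thresholds below are hypothetical illustrations, not the actual production values):<\/p>\n<pre><code class=\"python\"># Hypothetical weighted scorer: the weights live in a config dict so they\n# can be re-tuned between batches without touching the selection code.\nDEFAULT_WEIGHTS = {\"speech_density\": 0.45, \"loudness_peak\": 0.35, \"energy_delta\": 0.20}\n\nMIN_LEN, MAX_LEN = 8.0, 45.0   # illustrative clip-length limits, in seconds\nEMPTY_PENALTY = 0.5            # downweight fragments with almost no speech\n\ndef score_candidate(signals, duration, weights=DEFAULT_WEIGHTS):\n    \"\"\"Combine per-window signals (each normalized to 0..1) into one score.\"\"\"\n    if duration &lt; MIN_LEN or duration &gt; MAX_LEN:\n        return 0.0  # hard length limit: reject out-of-range fragments\n    score = sum(w * signals.get(name, 0.0) for name, w in weights.items())\n    if signals.get(\"speech_density\", 0.0) &lt; 0.1:\n        score *= EMPTY_PENALTY  # penalty for near-empty fragments\n    return score<\/code><\/pre>\n<p>With this shape, the analytics loop only has to rewrite the weights dictionary; the selection code itself stays untouched.<\/p>\n<p>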
Audience behavior quickly shows which hypotheses worked and which did not.<\/p>\n<p>That is why my scoring layer was designed from the start to support continuous tuning.<\/p>\n<p>What changes there:<\/p>\n<ul>\n<li>\n<p>weights of different signals;<\/p>\n<\/li>\n<li>\n<p>penalties for weak or empty fragments;<\/p>\n<\/li>\n<li>\n<p>priorities for specific trigger dictionaries;<\/p>\n<\/li>\n<li>\n<p>length limits;<\/p>\n<\/li>\n<li>\n<p>selection rules and deduplication conditions for similar moments.<\/p>\n<\/li>\n<\/ul>\n<p>This is very different from the feeling of \u201cwrite the pipeline once and forget it.\u201d In reality, a system like this is a living mechanism that constantly requires heuristic revision.<\/p>\n<h4>3. Interaction loop: automation around comments and audience warm-up<\/h4>\n<p>As a separate direction, I experimented with an audience-interaction module. The idea was that publishing videos is not the only activity around a channel. For new accounts \u2014 and in general for engagement growth \u2014 behavior in comments also matters.<\/p>\n<p>So I built a separate layer that can generate replies in a more natural style instead of sounding like a typical soulless bot.<\/p>\n<p>For that, I used real conversation logs as a style dataset. 
The goal was not to \u201cdeceive the user,\u201d but to avoid the typical bot-like spam tone and make responses closer to a natural human pattern of short interaction.<\/p>\n<p>This is not the system\u2019s main module yet, but as an engineering experiment it turned out to be quite useful: it showed that additional automation gradually starts growing around content production \u2014 not only for the video itself, but also for accompanying audience touchpoints.<\/p>\n<h4>Technologies used<\/h4>\n<p>From a stack perspective, this is not one monolithic \u201cAI product,\u201d but a composition of several practical tools, each solving its own part of the pipeline.<\/p>\n<ul>\n<li>\n<p><strong>Python<\/strong> \u2014 the main orchestration language for the whole pipeline. It ties together video analysis, transcription, post-processing, subtitle generation, and supporting integrations.<\/p>\n<\/li>\n<li>\n<p><strong>MoviePy + imageio[ffmpeg]<\/strong> \u2014 clip assembly, work with video fragments, concatenation, export, and basic post-processing. At the low level, the whole story obviously rests on FFmpeg.<\/p>\n<\/li>\n<li>\n<p><strong>Whisper<\/strong> \u2014 audio transcription and timestamped speech segments. This layer is later used both for semantic analysis and subtitle rendering.<\/p>\n<\/li>\n<li>\n<p><strong>OpenCV + MediaPipe<\/strong> \u2014 frame analysis, face detection, scene-change handling, and signals for dynamic reframing. 
This layer helps the system understand where the main character is and how best to adapt a horizontal frame to a vertical format.<\/p>\n<\/li>\n<li>\n<p><strong>Pillow + Pilmoji<\/strong> \u2014 subtitle rendering, text styling over video, emoji handling, and visual packaging of the final clip.<\/p>\n<\/li>\n<li>\n<p><strong>NumPy<\/strong> \u2014 base computations, array handling, and numerical operations for signal analysis and intermediate processing.<\/p>\n<\/li>\n<li>\n<p><strong>PyYAML + python-dotenv<\/strong> \u2014 pipeline configuration, processing parameters, and environment management.<\/p>\n<\/li>\n<li>\n<p><strong>requests + lxml<\/strong> \u2014 obtaining and parsing source data, working with external sources, and automating the content-ingestion stage.<\/p>\n<\/li>\n<li>\n<p><strong>google-api-python-client + google-auth + google-auth-oauthlib<\/strong> \u2014 integrations with external Google services for surrounding automation around the pipeline.<\/p>\n<\/li>\n<li>\n<p><strong>Playwright<\/strong> \u2014 browser automation for cases where an interface-driven scenario is more convenient than an API.<\/p>\n<\/li>\n<li>\n<p><strong>inference-sdk + OpenAI API<\/strong> \u2014 separate AI layers and auxiliary inference tasks related to analysis and decision-making in the pipeline.<\/p>\n<\/li>\n<li>\n<p><strong>tqdm<\/strong> \u2014 a small operational detail, but useful: progress tracking for long batch jobs and easier debugging of long runs.<\/p>\n<\/li>\n<\/ul>\n<p>Why this stack specifically? Because this system is orchestration-heavy by nature. There is a lot of \u201cglue\u201d code, intermediate artifacts, batch processing, research iterations, and quick logic changes. For that kind of mode, Python turned out to be a natural choice.<\/p>\n<p>If the task were reduced to one narrow, high-load media service, some components might make more sense in a lower-level implementation. 
But at the active R&amp;D stage, speed of evolution, observability, and the ability to quickly change individual pipeline stages were more important to me than academic \u201cstack purity.\u201d<\/p>\n<h4>Which problems cost me the most time<\/h4>\n<p>From the outside, projects like this look flashy: \u201cthe system edits video by itself.\u201d But the real work is largely a fight against edge cases.<\/p>\n<h3>Problem 1. A strong text moment is not always a strong visual moment<\/h3>\n<p>Transcription alone was not enough. The system regularly found strong lines in scenes that worked poorly as clips without visual context. The solution was to combine text with audio and CV signals.<\/p>\n<h3>Problem 2. A face is detected, but the frame still looks bad<\/h3>\n<p>The presence of face detection does not automatically mean the clip will look good in 9:16. Sometimes the object is found, but the composition still falls apart. I had to introduce additional constraints and a fallback strategy.<\/p>\n<h3>Problem 3. An overactive virtual camera becomes annoying<\/h3>\n<p>A naive implementation of dynamic cropping starts twitching very quickly and looks like broken auto-tracking. This required a lot of work on smoothing, inertia, and crop-window speed limits.<\/p>\n<h3>Problem 4. The \u201cengineering-best\u201d clip is not always the best by metrics<\/h3>\n<p>This was probably the most sobering moment. Sometimes a clip that feels more polished, coherent, and \u201chigher quality\u201d from a technical perspective performs worse than a simpler and rougher version. That is exactly why the analytics loop became necessary: it watches real results instead of my internal sense of pipeline beauty.<\/p>\n<h3>Problem 5. Any fully automated system must know how to degrade gracefully<\/h3>\n<p>There is no perfect detection, perfect tracking, or perfect moment selection. 
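<\/p>\n<p>In practice, that means each stage carries an explicit chain of fallbacks. A sketch of how crop-center selection might degrade step by step (the helper structure and field names here are hypothetical, not the exact production code):<\/p>\n<pre><code class=\"python\"># Hypothetical fallback chain for the vertical crop window:\n# prefer a detected face, degrade to a motion centroid, then to a center crop.\ndef choose_crop_center(frame_width, faces=None, motion_center=None):\n    \"\"\"Return the horizontal center of the 9:16 crop window, in pixels.\"\"\"\n    if faces:\n        # Best case: center on the most confident detected face.\n        best = max(faces, key=lambda f: f[\"confidence\"])\n        return best[\"x\"] + best[\"w\"] \/ 2\n    if motion_center is not None:\n        # Weaker signal: follow where the motion is.\n        return motion_center\n    # Last resort: a static center crop that always yields *something*.\n    return frame_width \/ 2<\/code><\/pre>\n<p>No branch of that chain can raise its way out of producing a usable frame, which is exactly the point.<\/p>\n<p>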
The question is not how to avoid ever making mistakes, but how to ensure the mistake does not destroy the entire release. That is why a large part of the system\u2019s robustness is not accuracy in a vacuum, but competent fallback scenarios.<\/p>\n<h4>What was most surprising to me in this system<\/h4>\n<p>Probably the main unexpected conclusion was this: a significant part of \u201ccreative\u201d content actually consists of repeatable, formalizable operations.<\/p>\n<p>That does not mean taste, visual literacy, and a sense of rhythm are not important. On the contrary, they are. But it turned out that part of them can be transferred into a system of rules, constraints, priorities, scoring, and analytical feedback.<\/p>\n<p>In other words, the task stops being magic and becomes engineering quality control under noisy data.<\/p>\n<h4>Where the system is still limited<\/h4>\n<p>It would be dishonest to pretend that a pipeline like this can already do everything.<\/p>\n<p>It has clear weak spots:<\/p>\n<ul>\n<li>\n<p>complex scenes with fast action and chaotic motion;<\/p>\n<\/li>\n<li>\n<p>moments where the meaning depends on long context rather than a short fragment;<\/p>\n<\/li>\n<li>\n<p>scenes without pronounced facial focus;<\/p>\n<\/li>\n<li>\n<p>cases where \u201cvirality\u201d is determined by a very subtle cultural context rather than formal signals;<\/p>\n<\/li>\n<li>\n<p>the risk of overfitting heuristics to one type of content or one audience.<\/p>\n<\/li>\n<\/ul>\n<p>And in my opinion, that is normal. A system should not pretend to be all-powerful. 
It is much more useful to understand where it works confidently and where it still needs improvement.<\/p>\n<h4>Where this architecture can be applied beyond anime<\/h4>\n<p>Although the project grew out of anime episodes, the architectural idea itself is not tied to that domain.<\/p>\n<p>Essentially, it is a general template for any scenario where you have long-form source video and want to automatically produce short vertical clips:<\/p>\n<ul>\n<li>\n<p>streams and gaming broadcasts;<\/p>\n<\/li>\n<li>\n<p>podcasts and interviews;<\/p>\n<\/li>\n<li>\n<p>educational videos and lectures;<\/p>\n<\/li>\n<li>\n<p>music content;<\/p>\n<\/li>\n<li>\n<p>UGC platforms and media archives;<\/p>\n<\/li>\n<li>\n<p>internal clip factories for content teams.<\/p>\n<\/li>\n<\/ul>\n<p>So the value here is not only in one specific channel, but in the approach itself: build not \u201cone script for one video,\u201d but a reproducible content-production line.<\/p>\n<h4>What came out at the end<\/h4>\n<figure class=\"bordered full-width \"><img decoding=\"async\" src=\"https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/62a\/05c\/d2f\/62a05cd2f156116064f48381d3349fca.png\" width=\"1491\" height=\"631\" sizes=\"auto, (max-width: 780px) 100vw, 50vw\" srcset=\"https:\/\/habrastorage.org\/r\/w780\/getpro\/habr\/upload_files\/62a\/05c\/d2f\/62a05cd2f156116064f48381d3349fca.png 780w,&#10;       https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/62a\/05c\/d2f\/62a05cd2f156116064f48381d3349fca.png 781w\" loading=\"lazy\" decode=\"async\"\/><\/figure>\n<figure class=\"bordered full-width \"><img decoding=\"async\" src=\"https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/7f0\/54c\/f9a\/7f054cf9ad4e60792a1c21c3ea2bd522.png\" width=\"1495\" height=\"632\" sizes=\"auto, (max-width: 780px) 100vw, 50vw\" srcset=\"https:\/\/habrastorage.org\/r\/w780\/getpro\/habr\/upload_files\/7f0\/54c\/f9a\/7f054cf9ad4e60792a1c21c3ea2bd522.png 780w,&#10;       
https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/7f0\/54c\/f9a\/7f054cf9ad4e60792a1c21c3ea2bd522.png 781w\" loading=\"lazy\" decode=\"async\"\/><\/figure>\n<figure class=\"bordered full-width \"><img decoding=\"async\" src=\"https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/505\/b75\/6f2\/505b756f2ea8f85522d8df1ae3cb7df5.png\" width=\"1481\" height=\"626\" sizes=\"auto, (max-width: 780px) 100vw, 50vw\" srcset=\"https:\/\/habrastorage.org\/r\/w780\/getpro\/habr\/upload_files\/505\/b75\/6f2\/505b756f2ea8f85522d8df1ae3cb7df5.png 780w,&#10;       https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/505\/b75\/6f2\/505b756f2ea8f85522d8df1ae3cb7df5.png 781w\" loading=\"lazy\" decode=\"async\"\/><\/figure>\n<p><em>At the current stage, the system already works as an autonomous loop capable of going through the main production steps without manual editing: from episode analysis to final clip render.<\/em><\/p>\n<p>Some of the generated clips have reached tens and even hundreds of thousands of views.<\/p>\n<figure class=\"bordered full-width \"><img decoding=\"async\" src=\"https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/b72\/8d4\/60d\/b728d460d0025634a57258fc5ff65e63.png\" width=\"1560\" height=\"678\" sizes=\"auto, (max-width: 780px) 100vw, 50vw\" srcset=\"https:\/\/habrastorage.org\/r\/w780\/getpro\/habr\/upload_files\/b72\/8d4\/60d\/b728d460d0025634a57258fc5ff65e63.png 780w,&#10;       https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/b72\/8d4\/60d\/b728d460d0025634a57258fc5ff65e63.png 781w\" loading=\"lazy\" decode=\"async\"\/><\/figure>\n<p>For me, that became an important validation not only of the product hypothesis, but also of the engineering one: a well-designed automated pipeline really can compete with manual production if it contains proper decision-making logic instead of random timeline slicing.<\/p>\n<p>Even more importantly, this pipeline can be improved iteratively \u2014 not by 
intuition in a vacuum, but through measurable changes to individual modules.<\/p>\n<h4>Why I think systems like this will become a separate engineering direction<\/h4>\n<p>If you look more broadly, there is more and more long-form video around us, and the demand for short-form packaging keeps growing. At the same time, manual editing remains expensive, slow, and poorly scalable.<\/p>\n<p>Against that backdrop, systems that can automatically:<\/p>\n<ul>\n<li>\n<p>analyze source media,<\/p>\n<\/li>\n<li>\n<p>isolate potentially strong moments,<\/p>\n<\/li>\n<li>\n<p>adapt framing to the required format,<\/p>\n<\/li>\n<li>\n<p>package the result,<\/p>\n<\/li>\n<li>\n<p>and close the loop through metrics,<\/p>\n<\/li>\n<\/ul>\n<p>will increasingly become an applied engineering problem rather than just a curious hobby.<\/p>\n<p>In other words, this is no longer only about \u201ccontent generation,\u201d but about building automated media pipelines with observability, an R&amp;D cycle, and quality control.<\/p>\n<h4>Conclusion<\/h4>\n<p>This project started as an experiment: is it even possible to partially automate a task that is usually considered almost entirely manual and creative?<\/p>\n<p>Over time, it turned into something much more interesting \u2014 a system where content production is decomposed into engineering stages, and quality grows not only out of code, but also out of a feedback loop.<\/p>\n<p>The main takeaway for me is this: automation in media is not just about saving time. 
It is a way to turn scattered creative operations into a reproducible production line that can be scaled, measured, debugged, and improved.<\/p>\n<p>That is exactly the moment when a \u201cchannel with videos\u201d stops being a set of random publications and becomes a system.<\/p>\n<p>If there is interest, in the next article I can separately break down the technical details of one of the hardest modules \u2014 dynamic reframing \/ the \u201cvirtual camera\u201d: how the focus area is selected, how movement is smoothed, which fallback modes are used, and where such algorithms most often break.<\/p>\n<h4>Video demonstration<\/h4>\n<p>If you would rather first see the system in action and only then go through the architecture layer by layer, I recorded a separate demo showing the entire pipeline: from episode processing to the final vertical clip.<\/p>\n<p><a href=\"https:\/\/youtu.be\/xu8sg_mXEh0\" rel=\"noopener noreferrer nofollow\">Watch the system demo on YouTube<\/a><\/p>\n<\/div>\n<p>Link to the original article: <a href=\"https:\/\/habr.com\/ru\/articles\/1022250\/\">https:\/\/habr.com\/ru\/articles\/1022250\/<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Hi, Habr!Over the past few months, I have been building a system that I internally call an \u201canime factory\u201d: it takes a source episode as input and produces a ready-to-publish YouTube Short with dynamic reframing, subtitles, post-processing, and metadata.What makes it interesting is not just the fact that editing can be automated, but that a significant part of this work can be decomposed into engineering stages: transcription, audio and scene analysis, strong-moment discovery, \u201cvirtual camera\u201d control, and a feedback loop based on performance metrics.In this article, I will show how this pipeline is structured, why I chose a modular architecture instead of an end-to-end 
black box, where the system broke, and which decisions eventually made it actually usable.Where the idea came fromFor a long time, I kept running into the same problem: any digital product without users is effectively dead. You can build backend systems, automation, and pipelines all day long, but if the project has no distribution channel and no audience attention, it barely moves forward.My first attempts to automate content-related tasks started back in 2020. At that time, they were simpler ideas around TikTok, Telegram, and content promotion. But manual work hits a ceiling very quickly: finding moments, cutting clips, adding subtitles, converting to vertical format, packaging, publishing \u2014 all of that takes too much time and barely scales. One person can produce a few videos per day. A system can produce dozens or hundreds.At some point, I formulated the problem correctly for myself: I did not need an \u201cediting script.\u201d I needed an actual production loop that turns long-form video into a stream of short clips with minimal manual involvement.That is how the \u201canime factory\u201d was born.What problem the system actually solvesIn simplified form, the task sounds like this: take a long horizontal episode and automatically turn it into a short vertical video that works as a self-contained Short.But once you decompose it into engineering subproblems, a whole set of non-obvious requirements appears immediately:You need to understand where the episode contains potentially strong moments.You need to select fragments that work as a micro-story, not just as a random chunk torn out of context.You need to adapt 16:9 into 9:16 without losing the main character, the emotion, or the visual focus of the scene.You need subtitles that are quick to read and do not kill the image.You need to assemble all of this into a stable batch pipeline where individual stages can be restarted independently.You need to teach the system to analyze publishing results and adjust 
future selection logic.At that point, it becomes clear that this is no longer \u201cjust a little editing script,\u201d but a fairly mature engineering system with its own artifacts, errors, quality degradation modes, fallback mechanisms, and feedback loops.Why simple automatic clipping does not workFrom the outside, it looks like the problem should be easy to solve. For example:split the video into equal 30-second chunks;pick the loudest moments;crop to the center;overlay auto-generated subtitles.In practice, that approach almost always produces garbage.A loud moment is not necessarily an interesting one. An interesting moment does not necessarily have a good visual focus. A line can be strong only in the context of the previous five seconds. A character\u2019s face can drift out of a centered crop. A scene with two characters falls apart completely if you simply keep a static window in the middle.So the core idea behind my pipeline was this: do not rely on a single signal. Do not select moments only by text. Do not crop only by center. Do not try to make one model guess the entire process end to end. 
Instead, combine several relatively independent signal sources into a decision-making system.Architecture: what the \u201cfactory\u201d consists ofOverall block diagram of the pipeline: Episode -&gt; Transcription -&gt; Audio Analysis -&gt; Scene\/Face Detection -&gt; Candidate Scoring -&gt; Dynamic Crop -&gt; Subtitles\/Post-processing -&gt; Export\/Publish -&gt; Analytics Feedback Loop.LoopPurposeMain outputProductionGenerate videos from the source episodeA ready ShortR&amp;D \/ AnalyticsAnalyze published videos and update heuristicsNew weights and trigger dictionariesCommunityAutomate interaction around the channelReplies, warm-up, engagementAt a high level, my system breaks down into three major loops:Production loop \u2014 the main line that generates videos.R&amp;D \/ Analytics loop \u2014 analysis of already published videos and heuristic updates.Community \/ Interaction loop \u2014 additional automation around audience interaction.Let\u2019s go through each of them in more detail.1. Production loop: from episode to finished ShortThis is the heart of the whole system. This is where the source media content goes through all processing stages and becomes a final vertical video.Stage 1. Getting the source materialTo make the pipeline easier to debug, I intentionally avoided the \u201cone giant script that does everything\u201d approach and instead went for explicit intermediate artifacts.episode_001\/  source.mp4  transcript.json  audio_features.json  scene_cuts.json  faces.json  candidates.json  crop_path.json  subtitles.srt  metadata.json  final_short_01.mp4This structure is important not for aesthetics, but because it allows individual stages to be recomputed independently. For example, I can rebuild crop_path without retranscribing the entire episode, or change subtitle logic without rerunning scene analysis.At the pipeline entrance, an episode arrives. 
For the system, it is just raw material: a video file that must be parsed, indexed, scored, and turned into several potential short-clip candidates.Even at this stage, it was important not to build something that simply \u201cdownloads the file and moves on,\u201d but to introduce a proper artifact structure. For each episode, the system stores separate intermediate results: metadata, transcripts, timestamped clip candidates, CV analysis results, detected faces, crop parameters, and final renders. That may sound like a boring infrastructure detail, but it is exactly what makes the system maintainable.If I had to rerun the entire episode from scratch every time, development would have been painful. With this design, I can recompute only dynamic cropping or only subtitle logic without touching the rest of the pipeline.Stage 2. Transcription and working with speechThe next layer is turning audio into timestamped text. At this point, the system gets not just one continuous transcript, but speech segments tied to time. This matters for two reasons:First, the text itself already provides a strong signal about scene content.Second, the same segments are later used for subtitles and for binding semantic fragments back to the video.But I quickly discovered that \u201ctake the transcript and search for interesting lines\u201d is not enough.Multimedia content has an unpleasant property: the emotional force of a scene is not always in the text. Sometimes the text is neutral, but the scene has powerful music, a tense pause, a camera cut, or a strong facial expression. Sometimes it is the opposite: the line itself is strong, but without visual context it does not work.So for me, the transcript is one signal \u2014 not the single source of truth.Stage 3. 
Audio analysisIn simplified form, one of the internal audio passes looks like this:def extract_audio_signal(window):    speech_density = measure_speech_density(window)    loudness_peak = detect_loudness_peak(window)    energy_delta = detect_energy_change(window)    return (        0.45 * speech_density +        0.35 * loudness_peak +        0.20 * energy_delta    )Of course, the real implementation is more complex: it includes normalization, thresholds, protection against false spikes, and combinations with other signals. But the core idea is the same: audio is not used as a standalone oracle, but as another layer in evaluating a moment.A timeline with audio peaks and highlighted windows where the system sees increased emotional density.In parallel with text, the system analyzes the audio track itself. I look not only at the presence of speech, but also at the energy structure: loudness peaks, emotional spikes, transitions, sections with pronounced sound dynamics, musical pressure, and so on.The purpose of this stage is not to blindly choose the loudest chunk, but to add another axis of evaluation. In real videos, what often works is the combination of:a strong short line,a pronounced audio transition,a visual accent in the frame.If you use only text, you miss these scenes. If you use only audio, you collect meaningless explosions and screams. Together, the signals work much better.Stage 4. 
Computer Vision: scenes, faces, and visual eventsIn simplified form, useful visual signal detection looks something like this:def analyze_frame(frame):    faces = detect_faces(frame)    scene_score = detect_scene_change(frame)    face_focus_score = estimate_face_focus(faces, frame)    return {        &quot;faces&quot;: faces,        &quot;scene_score&quot;: scene_score,        &quot;face_focus_score&quot;: face_focus_score,    }In practice, what matters here is not just the fact that face detection exists, but how that data is used downstream: can we confidently build a vertical crop window, does it make sense to hold on one character, is there a transition between characters, does the composition fall apart?The next major block is computer vision. Here the system solves several tasks at once:detects scene changes;determines whether there is a face in the frame and where it is;estimates whether the frame is suitable for vertical focus;extracts visual features that later participate in candidate scoring.In practice, this turned out to be one of the most useful layers in the whole system. Without faces and scene analysis, vertical adaptation was too crude. 
A centered crop destroys a large part of the image\u2019s meaning: one character may stand on&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[],"tags":[],"class_list":["post-475494","post","type-post","status-publish","format-standard","hentry"],"_links":{"self":[{"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/posts\/475494","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=475494"}],"version-history":[{"count":0,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/posts\/475494\/revisions"}],"wp:attachment":[{"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=475494"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=475494"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=475494"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}