{"id":475534,"date":"2026-04-11T13:58:42","date_gmt":"2026-04-11T13:58:42","guid":{"rendered":"https:\/\/savepearlharbor.com\/?p=475534"},"modified":"-0001-11-30T00:00:00","modified_gmt":"-0001-11-29T21:00:00","slug":"","status":"publish","type":"post","link":"https:\/\/savepearlharbor.com\/?p=475534","title":{"rendered":"I Taught a Virtual Camera to Behave Like a Human Operator: How a Face Tracking Algorithm for Shorts\/Reels Works"},"content":{"rendered":"<div xmlns=\"http:\/\/www.w3.org\/1999\/xhtml\">\n<figure class=\"bordered full-width \"><img decoding=\"async\" src=\"https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/622\/e46\/f93\/622e46f9391982d6f4bffbb6ea37234e.png\" width=\"925\" height=\"258\" sizes=\"auto, (max-width: 780px) 100vw, 50vw\" srcset=\"https:\/\/habrastorage.org\/r\/w780\/getpro\/habr\/upload_files\/622\/e46\/f93\/622e46f9391982d6f4bffbb6ea37234e.png 780w,&#10;       https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/622\/e46\/f93\/622e46f9391982d6f4bffbb6ea37234e.png 781w\" loading=\"lazy\" decode=\"async\"\/><\/figure>\n<p>In <a href=\"https:\/\/habr.com\/ru\/articles\/1022250\" rel=\"noopener noreferrer nofollow\">the previous article<\/a> I described my \u201canime factory\u201d in detail \u2014 a pipeline that automatically turns episodes into finished Shorts. But inside that system there is one especially important module that deserves a separate deep dive: a virtual camera for automatic reframing.<\/p>\n<p>In this article, I will break down not just an \u201cauto-crop function,\u201d but a full virtual camera algorithm for vertical video. 
This is exactly the kind of task that looks simple at first glance: you have a horizontal video, you need to turn it into 9:16, keep a person in frame, and avoid making the result look like a jittery autofocus camera from the early 2010s.<\/p>\n<p>But as soon as you try to build it not for a demo, but for a real pipeline, engineering problems immediately show up:<\/p>\n<ul>\n<li>\n<p>the face detector is noisy;<\/p>\n<\/li>\n<li>\n<p>the face periodically disappears;<\/p>\n<\/li>\n<li>\n<p>the target moves unevenly;<\/p>\n<\/li>\n<li>\n<p>simply \u201cfollowing the center of the box\u201d is not enough;<\/p>\n<\/li>\n<li>\n<p>a perfectly accurate camera often looks unnatural and can even look worse than a slightly \u201chuman\u201d one.<\/p>\n<\/li>\n<\/ul>\n<p>In the end, I needed a system that behaves not like a soulless cropper, but like a camera operator: smoothly, with inertia, with motion prediction, with composition-aware corrections, and with a sane fallback mode for cases where there are no faces in the frame at all.<\/p>\n<p>In this article, we will go through the entire algorithm end to end:<\/p>\n<ul>\n<li>\n<p>a three-level face detection fallback: MediaPipe \u2192 YuNet \u2192 Haar Cascade;<\/p>\n<\/li>\n<li>\n<p>simple but practical face tracking between frames;<\/p>\n<\/li>\n<li>\n<p>anti-jerk and low-pass filtering;<\/p>\n<\/li>\n<li>\n<p>a virtual camera modeled as a damped oscillator;<\/p>\n<\/li>\n<li>\n<p>composition rules: rule of thirds, side bias, eye-level lift, face margin;<\/p>\n<\/li>\n<li>\n<p>a Ken Burns fallback when the face is lost or absent;<\/p>\n<\/li>\n<li>\n<p>camera path interpolation and applying the virtual crop to video.<\/p>\n<\/li>\n<\/ul>\n<h4>1. Why Making Vertical Video Is Harder Than It Looks<\/h4>\n<p>Let\u2019s say we have a regular 16:9 horizontal video. 
We want to convert it into 9:16 for YouTube Shorts, TikTok, or Reels.<\/p>\n<p>The naive approach looks like this:<\/p>\n<pre><code>take the center of the frame \u2192 cut out a vertical window \u2192 done<\/code><\/pre>\n<p>Formally, yes. In practice, no.<\/p>\n<p>If a person shifts to the left, the camera will crop them out. If there are two people in the frame, the composition falls apart. If the face is moving, the crop will either lag behind or jump around. If there are no faces, the video becomes either static or simply meaningless.<\/p>\n<p>So what we need here is not a crop, but a virtual camera \u2014 an entity that has:<\/p>\n<ul>\n<li>\n<p>an observation target;<\/p>\n<\/li>\n<li>\n<p>inertia;<\/p>\n<\/li>\n<li>\n<p>speed and acceleration limits;<\/p>\n<\/li>\n<li>\n<p>reaction delay;<\/p>\n<\/li>\n<li>\n<p>composition rules;<\/p>\n<\/li>\n<li>\n<p>fallback behavior.<\/p>\n<\/li>\n<\/ul>\n<p>That is exactly what turns \u201cauto-crop\u201d into a system that looks like the work of a real camera operator.<\/p>\n<h4>2. 
The Full Architecture of the Solution<\/h4>\n<p>If we remove the details, the entire pipeline looks like this:<\/p>\n<figure class=\"bordered \"><img decoding=\"async\" src=\"https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/88c\/dde\/2da\/88cdde2da5bf24178e1c67116677bf57.png\" width=\"334\" height=\"673\" sizes=\"auto, (max-width: 780px) 100vw, 50vw\" srcset=\"https:\/\/habrastorage.org\/r\/w780\/getpro\/habr\/upload_files\/88c\/dde\/2da\/88cdde2da5bf24178e1c67116677bf57.png 780w,&#10;       https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/88c\/dde\/2da\/88cdde2da5bf24178e1c67116677bf57.png 781w\" loading=\"lazy\" decode=\"async\"\/><\/figure>\n<p>The key idea: the algorithm does not make a decision \u201cfrom a single frame.\u201d It lives over time. And the quality here comes not so much from the detector\u2019s accuracy, but from how the system handles imperfect data.<\/p>\n<h4>3. Face Detection: A Three-Level Fallback<\/h4>\n<p>The first part of the system is not a single detector, but a cascade of three backends.<\/p>\n<h3>3.1 Why One Detector Is Not Enough<\/h3>\n<p>Any face detector makes mistakes from time to time:<\/p>\n<ul>\n<li>\n<p>it loses the face during head turns;<\/p>\n<\/li>\n<li>\n<p>it works worse in difficult lighting;<\/p>\n<\/li>\n<li>\n<p>it breaks on non-photorealistic faces;<\/p>\n<\/li>\n<li>\n<p>it may simply not be installed in the environment.<\/p>\n<\/li>\n<\/ul>\n<p>That is why in practice it is more useful to build not an \u201cideal detector,\u201d but a robust degradation system.<\/p>\n<h3>3.2 Fallback Chain Diagram<\/h3>\n<figure class=\"bordered \"><img decoding=\"async\" src=\"https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/89a\/994\/65c\/89a99465cd50d8467b3376b639e97ce4.png\" width=\"326\" height=\"636\" sizes=\"auto, (max-width: 780px) 100vw, 50vw\" srcset=\"https:\/\/habrastorage.org\/r\/w780\/getpro\/habr\/upload_files\/89a\/994\/65c\/89a99465cd50d8467b3376b639e97ce4.png 
780w,&#10;       https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/89a\/994\/65c\/89a99465cd50d8467b3376b639e97ce4.png 781w\" loading=\"lazy\" decode=\"async\"\/><\/figure>\n<p>This is a simple but very practical pattern: first use the best option, then a backup, then the \u201clast resort.\u201d<\/p>\n<h3>3.3 MediaPipe as the Primary Detector<\/h3>\n<p>MediaPipe is the default working option here.<\/p>\n<p>Advantages:<\/p>\n<ul>\n<li>\n<p>it runs fast on CPU;<\/p>\n<\/li>\n<li>\n<p>it provides confidence;<\/p>\n<\/li>\n<li>\n<p>it usually catches faces well even at an angle and in imperfect lighting;<\/p>\n<\/li>\n<li>\n<p>it returns a convenient bounding box in normalized coordinates.<\/p>\n<\/li>\n<\/ul>\n<p>Initialization example:<\/p>\n<pre><code class=\"python\">mp_face_detection = mp.solutions.face_detection.FaceDetection(\n    model_selection=1,\n    min_detection_confidence=0.75\n)\n\n# MediaPipe expects an RGB image, so convert the BGR frame first\nframe_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)\nresults = mp_face_detection.process(frame_rgb)\nif results.detections:\n    for detection in results.detections:\n        box = detection.location_data.relative_bounding_box\n        confidence = float(detection.score[0])<\/code><\/pre>\n<p>For a production pipeline, what matters is that at the output we normalize everything into a single shape: face center, size, confidence, and bounding box.<\/p>\n<h3>3.4 YuNet as the Backup Artillery<\/h3>\n<p>YuNet is needed not because MediaPipe is bad, but because production systems love \u201csomething went wrong\u201d scenarios.<\/p>\n<p>YuNet is useful when:<\/p>\n<ul>\n<li>\n<p>MediaPipe is unavailable in the environment;<\/p>\n<\/li>\n<li>\n<p>MediaPipe did not find a face, but the face is clearly there;<\/p>\n<\/li>\n<li>\n<p>you need an alternative ONNX backend through 
OpenCV.<\/p>\n<\/li>\n<\/ul>\n<p>Example:<\/p>\n<figure class=\"bordered \"><img decoding=\"async\" src=\"https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/adb\/9fd\/312\/adb9fd3125c15697b57e247f8aba8d0c.png\" width=\"307\" height=\"670\" sizes=\"auto, (max-width: 780px) 100vw, 50vw\" srcset=\"https:\/\/habrastorage.org\/r\/w780\/getpro\/habr\/upload_files\/adb\/9fd\/312\/adb9fd3125c15697b57e247f8aba8d0c.png 780w,&#10;       https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/adb\/9fd\/312\/adb9fd3125c15697b57e247f8aba8d0c.png 781w\" loading=\"lazy\" decode=\"async\"\/><\/figure>\n<p>YuNet is slower, but it provides a solid second line of defense.<\/p>\n<h3>3.5 Haar Cascade as \u201cAt Least Don\u2019t Go Completely Blind\u201d<\/h3>\n<p>Haar Cascade is not the best detector in terms of quality. But it:<\/p>\n<ul>\n<li>\n<p>is available almost everywhere;<\/p>\n<\/li>\n<li>\n<p>does not require heavy dependencies;<\/p>\n<\/li>\n<li>\n<p>sometimes saves the day when everything else has failed.<\/p>\n<\/li>\n<\/ul>\n<p>Example:<\/p>\n<pre><code class=\"python\"># Haar works on grayscale input\ngray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)\ncascade = cv2.CascadeClassifier(\n    cv2.data.haarcascades + \"haarcascade_frontalface_default.xml\"\n)\nfaces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5, minSize=(50, 50))<\/code><\/pre>\n<p>From an engineering point of view, the value of Haar is not accuracy, but the fact that even in degraded mode the system does not become completely blind.<\/p>\n<h3>3.6 Unified Detector Interface<\/h3>\n<p>Externally, the whole chain is hidden behind one interface:<\/p>\n<pre><code class=\"python\">backend = _DetectorBackend(min_confidence=0.75)\ndetections = backend.detect(frame_rgb, min_size=50)<\/code><\/pre>\n<p>At the output we get a list of detections with a unified structure:<\/p>\n<ul>\n<li>\n<p><code>center = (cx, cy)<\/code><\/p>\n<\/li>\n<li>\n<p><code>size = (w, h)<\/code><\/p>\n<\/li>\n<li>\n<p><code>score<\/code><\/p>\n<\/li>\n<li>\n<p><code>box<\/code><\/p>\n<\/li>\n<\/ul>\n<p>This is an important architectural point: the rest of the algorithm does not know who exactly found the face.<\/p>\n<h4>4. From Detection to Tracking: Following the Face Across Frames<\/h4>\n<p>The detector answers the question: \u201cwhat is in this frame?\u201d<\/p>\n<p>But the virtual camera needs a different answer: \u201cwhich object are we following over time?\u201d<\/p>\n<p>Without this layer, if there are two faces or if confidence jumps, the camera will keep switching between objects endlessly.<\/p>\n<h3>4.1 Simple Nearest-Neighbor Tracking<\/h3>\n<p>In this case, we do not need a heavy multi-object tracker. 
A simple rule is enough:<\/p>\n<ul>\n<li>\n<p>take the face center from the previous frame;<\/p>\n<\/li>\n<li>\n<p>compute distances to all current detections;<\/p>\n<\/li>\n<li>\n<p>pick the nearest one;<\/p>\n<\/li>\n<li>\n<p>if the distance is within tolerance, treat it as the same object.<\/p>\n<\/li>\n<\/ul>\n<p>Code:<\/p>\n<pre><code class=\"python\">if tracked_face is not None:\n    prev_center = tracked_face.center  # where the face used to be\n    # find the nearest detection\n    dists = [np.linalg.norm(d.center - prev_center) for d in dets]\n    j = int(np.argmin(dists))  # index of the nearest one\n    # if the distance is acceptable \u2014 update the track\n    if dists[j] &lt; match_tolerance:  # usually 0.15-0.20\n        tracked_face = dets[j]\n        new_center = dets[j].center\n        last_seen_time = t\n    # otherwise the track is lost, and Ken Burns will be used<\/code><\/pre>\n<p>Why does it work? Because for most talking-head and similar scenarios, face movement between neighboring analyzed frames is limited. So the nearest valid detection is almost always the continuation of the current track.<\/p>\n<h3>4.2 Handling Losses: Inertial Tracking<\/h3>\n<p>One of the most unpleasant problems in face tracking is short-term misses:<\/p>\n<ul>\n<li>\n<p>the person turned away;<\/p>\n<\/li>\n<li>\n<p>the light produced a glare;<\/p>\n<\/li>\n<li>\n<p>a hand occluded part of the face;<\/p>\n<\/li>\n<li>\n<p>the detector just blinked.<\/p>\n<\/li>\n<\/ul>\n<p>If at that moment we abruptly switch to fallback, the camera will jerk. 
So we need a grace period: for a short time, we continue to trust the last known position.<\/p>\n<pre><code class=\"python\">if new_center is None and tracked_face is not None and (t - last_seen_time) &lt; max_miss_time:\n    new_center = tracked_face.center<\/code><\/pre>\n<p>This improves subjective quality a lot. During short dropouts, the camera looks stable rather than nervous.<\/p>\n<h3>4.3 Why We Do Not Need a \u201cSmart AI Tracker\u201d Here<\/h3>\n<p>You could add optical flow, a Kalman filter, appearance embeddings, or a full MOT pipeline. But if the task is vertical auto-crop for videos with a limited number of faces, then simple distance-based tracking gives enough quality with minimal complexity and high reproducibility.<\/p>\n<p>Sometimes the best algorithm is not the one that looks smarter on paper, but the one that is easier to tune and fix in production.<\/p>\n<h4>5. Stabilizing the Input Signal: Anti-Jerk and Low-Pass Filter<\/h4>\n<p>Even if the detector finds the right face, the coordinates are still noisy. 
That is a fundamental property of the system.<\/p>\n<p>Typical problems:<\/p>\n<ul>\n<li>\n<p>the box center shakes slightly from frame to frame;<\/p>\n<\/li>\n<li>\n<p>the face size jumps;<\/p>\n<\/li>\n<li>\n<p>sometimes a false but \u201cconfident\u201d detection appears far from the previous point.<\/p>\n<\/li>\n<\/ul>\n<p>If we feed that directly into the camera, we get jitter.<\/p>\n<h3>5.1 Anti-Jerk: Hard Limiting of Jumps<\/h3>\n<p>First, we need to cut away completely unreasonable jumps.<\/p>\n<pre><code class=\"python\">if filtered_face_center is not None:\n    delta_face = new_center - filtered_face_center\n    max_face_step = 0.04\n    dist = float(np.linalg.norm(delta_face))\n    if dist &gt; max_face_step:\n        new_center = filtered_face_center + delta_face * (max_face_step \/ dist)<\/code><\/pre>\n<p>Formula:<\/p>\n<figure class=\"\"><img decoding=\"async\" src=\"https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/e89\/850\/4ef\/e898504ef142bd6ffd8593cda138b10e.png\" sizes=\"(max-width: 780px) 100vw, 50vw\" srcset=\"https:\/\/habrastorage.org\/r\/w780\/getpro\/habr\/upload_files\/e89\/850\/4ef\/e898504ef142bd6ffd8593cda138b10e.png 780w,&#10;       https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/e89\/850\/4ef\/e898504ef142bd6ffd8593cda138b10e.png 781w\" loading=\"lazy\" decode=\"async\"\/><\/figure>\n<p>The idea is simple: a face cannot teleport across half the screen between neighboring analysis frames. If \u201cthe detector says so,\u201d then it is noise or a false positive.<\/p>\n<h3>5.2 Low-Pass Filter: Smoothing Residual Noise<\/h3>\n<p>After the hard clamp, normal but still noisy fluctuations remain. 
Exponential smoothing works well for them. In code:<\/p>\n<pre><code class=\"python\">if filtered_face_center is None:\n    filtered_face_center = new_center.astype(np.float32)\nelse:\n    filtered_face_center = (\n        filtered_face_center * face_filter +\n        new_center.astype(np.float32) * (1.0 - face_filter)\n    )\n    new_center = filtered_face_center<\/code><\/pre>\n<p>To simplify the idea: we do not fully trust a single measurement. We carefully blend the new estimate with the already stabilized past.<\/p>\n<h3>5.3 Why a Filter Alone Is Not Enough<\/h3>\n<p>This is an important point.<\/p>\n<p>Anti-jerk alone is a crude instrument. It cuts large outliers, but does not remove small shaking.<\/p>\n<p>Low-pass alone is also not enough. 
It smooths noise, but with a large outlier it will still drag the signal to the side.<\/p>\n<p>So we need both steps together:<\/p>\n<pre><code>detection \u2192 hard clamp \u2192 low-pass \u2192 camera<\/code><\/pre>\n<p>This exact composition is what makes the input signal suitable for the downstream physical model.<\/p>\n<h4>6. The Virtual Camera as a Physical System<\/h4>\n<p>This is where the part begins that really separates a natural-looking result from a \u201csmart crop.\u201d<\/p>\n<p>If the camera instantly snaps to the target point, it looks robotic. A real camera operator does not work like that. There is inertia, acceleration limits, and natural damping.<\/p>\n<p>That is why it is convenient to model the camera as a damped oscillator.<\/p>\n<h3>6.1 Mathematical Model<\/h3>\n<p>We take the classic spring-damper system:<\/p>\n<figure class=\"\"><img decoding=\"async\" src=\"https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/c7d\/5e8\/c17\/c7d5e8c179802d189fb98eeb2c0bd57e.png\" sizes=\"(max-width: 780px) 100vw, 50vw\" srcset=\"https:\/\/habrastorage.org\/r\/w780\/getpro\/habr\/upload_files\/c7d\/5e8\/c17\/c7d5e8c179802d189fb98eeb2c0bd57e.png 780w,&#10;       https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/c7d\/5e8\/c17\/c7d5e8c179802d189fb98eeb2c0bd57e.png 781w\" loading=\"lazy\" decode=\"async\"\/><\/figure>\n<p>Where:<\/p>\n<ul>\n<li>\n<p><code>m<\/code> is the effective mass;<\/p>\n<\/li>\n<li>\n<p><code>k<\/code> is the spring stiffness;<\/p>\n<\/li>\n<li>\n<p><code>c<\/code> is damping;<\/p>\n<\/li>\n<li>\n<p><code>x<\/code> is the current camera position;<\/p>\n<\/li>\n<li>\n<p><code>x_target<\/code> is the target position;<\/p>\n<\/li>\n<li>\n<p><code>v<\/code> is the camera 
velocity.<\/p>\n<\/li>\n<\/ul>\n<p>Intuitively:<\/p>\n<ul>\n<li>\n<p>the spring pulls the camera toward the target;<\/p>\n<\/li>\n<li>\n<p>damping prevents it from oscillating forever;<\/p>\n<\/li>\n<li>\n<p>speed and acceleration limits make the motion believable.<\/p>\n<\/li>\n<\/ul>\n<h3>6.2 Numerical Integration Per Frame<\/h3>\n<p>At each analysis step:<\/p>\n<pre><code class=\"python\">error = target_center - prev_center\naccel = error * follow_stiffness - velocity * follow_damping\nacc_norm = np.linalg.norm(accel)\nif acc_norm &gt; max_center_accel:\n    accel = accel * (max_center_accel \/ acc_norm)\nvelocity = velocity + accel * dt\nvelocity *= velocity_soften\nvelocity *= (1.0 - velocity_decay)\nspeed = np.linalg.norm(velocity)\nif speed &gt; max_center_speed:\n    velocity = velocity * (max_center_speed \/ speed)\nnew_pos = prev_center + velocity * dt<\/code><\/pre>\n<p>This scheme is simple, but gives very good control over camera behavior.<\/p>\n<h3>6.3 Meaning of the Key Parameters<\/h3>\n<p>To avoid tuning blindly, it is important to understand what the parameters do.<\/p>\n<div>\n<div class=\"table\">\n<table>\n<tbody>\n<tr>\n<th>\n<p align=\"left\">Parameter<\/p>\n<\/th>\n<th>\n<p align=\"left\">Meaning<\/p>\n<\/th>\n<th>\n<p align=\"left\">Effect when increased<\/p>\n<\/th>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\"><code>follow_stiffness<\/code><\/p>\n<\/td>\n<td>\n<p align=\"left\">how strongly the camera is pulled toward the target<\/p>\n<\/td>\n<td>\n<p align=\"left\">faster reaction, but higher overshoot risk<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\"><code>follow_damping<\/code><\/p>\n<\/td>\n<td>\n<p align=\"left\">resistance to movement<\/p>\n<\/td>\n<td>\n<p align=\"left\">less oscillation, more conservative 
camera<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\"><code>max_center_accel<\/code><\/p>\n<\/td>\n<td>\n<p align=\"left\">acceleration limit<\/p>\n<\/td>\n<td>\n<p align=\"left\">the camera cannot burst forward abruptly<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\"><code>max_center_speed<\/code><\/p>\n<\/td>\n<td>\n<p align=\"left\">speed limit<\/p>\n<\/td>\n<td>\n<p align=\"left\">the camera will not fly faster than the allowed pace<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\"><code>velocity_soften<\/code><\/p>\n<\/td>\n<td>\n<p align=\"left\">additional velocity softening<\/p>\n<\/td>\n<td>\n<p align=\"left\">fewer high-frequency oscillations<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\"><code>velocity_decay<\/code><\/p>\n<\/td>\n<td>\n<p align=\"left\">exponential decay<\/p>\n<\/td>\n<td>\n<p align=\"left\">the camera settles down faster<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<p>A camera is not a coordinate. It is a dynamic system. That is exactly what makes it visually believable.<\/p>\n<h3>6.4 Predictive Lead: An Operator Does Not Look Exactly Where the Face Is Right Now<\/h3>\n<p>If a person moves quickly, the camera should not simply chase them. Otherwise, it will always be slightly behind.<\/p>\n<p>So it is useful to add a light prediction:<\/p>\n<pre><code class=\"python\"># sketch of the idea (exact production code may differ):\n# estimate the face velocity and aim slightly ahead of the face\nface_velocity = (new_center - prev_face_center) \/ dt\ntarget_center = new_center + face_velocity * predictive_lead  # e.g. 0.065 s<\/code><\/pre>\n<p>This is a small extrapolation based on the current motion. 
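<\/p>
<p>For intuition, a tiny numeric check of such a lead. The face velocity and position below are made-up illustration values; only <code>predictive_lead<\/code> is taken from the profile in section 10.<\/p>

```python
import numpy as np

# Made-up illustration values: a face drifting right at 0.30 frame-widths per second.
face_velocity = np.array([0.30, 0.00])   # normalized units per second
new_center = np.array([0.50, 0.45])      # current face center (normalized)
predictive_lead = 0.065                  # seconds of look-ahead

lead_target = new_center + face_velocity * predictive_lead
# the camera aims roughly 0.02 frame-widths ahead of the face
```

<p>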
Visually, it makes the tracking feel more \u201chuman.\u201d<\/p>\n<h3>6.5 Human Lag: Paradoxically, Sometimes You Need to Slow the Camera Down<\/h3>\n<p>A perfectly responsive system often looks worse than a live operator.<\/p>\n<p>A human camera operator does not teleport to a new point instantly. There is always a micro-delay in reaction. That is why a slight lag can actually improve perception:<\/p>\n<pre><code class=\"python\"># sketch: follow a slightly delayed target (exact production code may differ)\n# target_history \u2014 a short list of (time, center) pairs kept by the camera\ntarget_history.append((t, target_center))\nwhile len(target_history) &gt; 1 and target_history[1][0] &lt;= t - human_lag:\n    target_history.pop(0)\ntarget_center = target_history[0][1]  # e.g. human_lag = 0.02 s<\/code><\/pre>\n<p>This is a subtle but important point: realism is not always equal to maximum accuracy.<\/p>\n<h4>7. Composition: The Camera Should Not Only Track, but Frame Nicely<\/h4>\n<p>Even a perfectly stable camera can still produce a bad image if the composition is primitive.<\/p>\n<p>If you always place the face strictly in the center, the shot quickly starts to look flat and \u201cmachine-made.\u201d So we need to add composition heuristics on top of the physics.<\/p>\n<h3>7.1 Rule of Thirds and Side Bias<\/h3>\n<p>When there are multiple faces or the scene does not require strict centering, it is useful to shift the subject toward the rule-of-thirds lines.<\/p>\n<pre><code class=\"python\">if side_bias &gt; 0.0 and not single_face_active:\n    biased_target = 0.5 - side_bias if cx &lt; 0.5 else 0.5 + side_bias\n    edge_proximity = abs(cx - 0.5) * 2\n    adaptive_bias = side_bias_strength * (1 - edge_proximity)\n    cx = cx * (1.0 - adaptive_bias) + biased_target * adaptive_bias<\/code><\/pre>\n<p>What matters here is that the shift is adaptive. If the face is already near the edge, the bias weakens; otherwise, you can make the frame worse instead of better.<\/p>\n<h3>7.2 Single-Face Mode<\/h3>\n<p>If the video consistently contains one face, the camera should behave differently:<\/p>\n<ul>\n<li>\n<p>fewer composition experiments;<\/p>\n<\/li>\n<li>\n<p>more stabilization;<\/p>\n<\/li>\n<li>\n<p>more conservative speed;<\/p>\n<\/li>\n<li>\n<p>better retention of a talking-head shot.<\/p>\n<\/li>\n<\/ul>\n<p>Parameter adaptation example:<\/p>\n<pre><code class=\"python\">if single_face_active and stabilization_strength &gt; 0.0:\n    effective_face_filter = min(0.995, face_filter + stabilization_strength * 0.1)\n    effective_smoothing = min(0.985, smoothing + stabilization_strength * 0.08)\n    effective_center_dead_zone = min(0.35, center_dead_zone + stabilization_strength * 0.08)\n    effective_max_center_speed = max(0.05, max_center_speed * (1.0 - 0.25 * stabilization_strength))<\/code><\/pre>\n<p>These conditional tuning rules are exactly what usually separates a \u201cworking system\u201d from an abstract algorithm.<\/p>\n<h3>7.3 Eye-Level Lift: Aim at the Eyes, Not at the Geometric Center<\/h3>\n<p>A face bounding box is not composition yet. 
If we aim strictly at the center of the box, we often get a frame where the camera is looking at the nose.<\/p>\n<p>It is much better to raise the attention point slightly toward eye level:<\/p>\n<pre><code class=\"python\">cy = np.clip(cy - tracked_face.size[1] * eye_level_lift, 0.0, 1.0)<\/code><\/pre>\n<p>It is a small correction, but it has a strong effect on the subjective quality of the frame.<\/p>\n<h3>7.4 Dead Zone: Ignore Micromovements<\/h3>\n<p>If a person moved slightly or the detector produced micro-noise, the camera should not react to every tiny change.<\/p>\n<pre><code class=\"python\">delta = target_center - prev_center\ndist = np.linalg.norm(delta)\nif dist &lt; effective_center_dead_zone:\n    target_center = prev_center<\/code><\/pre>\n<p>Dead zone is one of the most underrated parameters. 
Without it, the camera looks nervous even with good detection.<\/p>\n<h3>7.5 Face Margin: Never Press the Face Against the Edge<\/h3>\n<p>Even if mathematics allows us to move the center anywhere, real composition needs a safety margin.<\/p>\n<pre><code class=\"python\">if face_margin &gt; 0.0:\n    half_crop_w = min(0.5, (target_width \/ sw) \/ (2.0 * z))\n    half_crop_h = min(0.5, (target_height \/ sh) \/ (2.0 * z))\n    guard_x = max(face_margin, half_crop_w)\n    guard_y = max(face_margin, half_crop_h)\n    cx = np.clip(cx, guard_x, 1.0 - guard_x)\n    cy = np.clip(cy, guard_y, 1.0 - guard_y)<\/code><\/pre>\n<p>This protects the face from unpleasant trimming of ears, hair, gestures, and generally makes the shot feel more \u201cairy.\u201d<\/p>\n<h4>8. What to Do When There Are No Faces: Ken Burns Fallback<\/h4>\n<p>A system that can only work in \u201cI see a face\u201d mode breaks on any more complex video.<\/p>\n<p>We need a fallback mode for when:<\/p>\n<ul>\n<li>\n<p>the face is lost;<\/p>\n<\/li>\n<li>\n<p>there is no face at all;<\/p>\n<\/li>\n<li>\n<p>the frame contains a static scene;<\/p>\n<\/li>\n<li>\n<p>the detector failed.<\/p>\n<\/li>\n<\/ul>\n<p>This is where the classic Ken Burns effect helps \u2014 gentle panning and zooming.<\/p>\n<h3>8.1 Ken Burns Motion Model<\/h3>\n<p>The simplest version is based on sines:<\/p>\n<pre><code class=\"python\"># sketch of the sine-based motion (exact production code may differ),\n# assuming pan\/tilt amplitudes are expressed in percent of the frame\nphase = 2.0 * np.pi * t \/ ken_burns_period\ncx = 0.5 + (ken_burns_pan_amplitude \/ 100.0) * np.sin(phase)\ncy = 0.5 + (ken_burns_tilt_amplitude \/ 100.0) * np.sin(0.5 * phase)\nz = 1.0 + ken_burns_zoom_amplitude * np.sin(0.7 * phase)<\/code><\/pre>\n<p>This gives controlled, smooth movement without random jerking.<\/p>\n<h3>8.2 Why Ken Burns Is Better Than a Static Center<\/h3>\n<p>A static crop when there are no faces looks like a bug. Ken Burns creates the impression that the system is still \u201cholding the scene\u201d instead of simply freezing.<\/p>\n<p>For many videos, that alone is enough for the fallback not to be perceived as degradation.<\/p>\n<h3>8.3 Soft Transition from Face Tracking to Fallback<\/h3>\n<p>The transition should also be smooth.<\/p>\n<pre><code class=\"python\">else:\n    kc, kz = _ken_burns_motion(...)\n    camera_states.append(CameraState(t, kc, kz, False))\n    velocity *= 0.5<\/code><\/pre>\n<p>We do not teleport into a new logic branch \u2014 we gradually damp the accumulated camera velocity.<\/p>\n<p>This is exactly the kind of detail that is not always immediately visible in code, but is very visible in the final result.<\/p>\n<h4>9. From Discrete States to a Continuous Camera Path<\/h4>\n<p>At the analysis stage, we compute camera states at, for example, 8 FPS. But the final video may be 30 FPS or 60 FPS.<\/p>\n<p>If we apply the camera path as-is, the motion will be step-like. 
So we need a continuous path via interpolation.<\/p>\n<h3>9.1 State Interpolation<\/h3>\n<p>The basic option is linear interpolation between the two neighboring points. A sketch, where <code>s0<\/code> and <code>s1<\/code> are the analyzed camera states bracketing time <code>t<\/code> (field names assumed):<\/p>\n<pre><code class=\"python\">alpha = (t - s0.t) \/ max(s1.t - s0.t, 1e-6)\ncenter = s0.center * (1.0 - alpha) + s1.center * alpha\nzoom = s0.zoom * (1.0 - alpha) + s1.zoom * alpha<\/code><\/pre>\n<p>That is enough, because the physical model itself already makes the trajectory sufficiently smooth.<\/p>\n<h3>9.2 Applying the Virtual Camera<\/h3>\n<p>Once we have a <code>path(t)<\/code> function, the rest is pretty straightforward:<\/p>\n<ol>\n<li>\n<p>get the source frame;<\/p>\n<\/li>\n<li>\n<p>take the center and zoom from <code>path(t)<\/code>;<\/p>\n<\/li>\n<li>\n<p>compute the ROI;<\/p>\n<\/li>\n<li>\n<p>cut out the region;<\/p>\n<\/li>\n<li>\n<p>scale it to the final 9:16 size.<\/p>\n<\/li>\n<\/ol>\n<p>At this point, the algorithm turns from an analytical model into a real video.<\/p>\n<h4>10.
Working Parameter Profile: Operator Mode<\/h4>\n<p>The most interesting thing in such systems is not just the formulas, but how they are actually tuned in practice.<\/p>\n<p>Below is a profile oriented toward a \u201chuman operator\u201d for talking-head and similar scenarios:<\/p>\n<pre><code class=\"yaml\">analysis_fps: 8\nmin_face_ratio: 0.06\nmin_face_confidence: 0.75\nmatch_tolerance: 0.18\nmax_miss_time: 2.7\nsmoothing: 0.93\nstabilization_strength: 0.26\nfollow_stiffness: 8.4\nfollow_damping: 2.35\nmax_center_accel: 0.85\npredictive_lead: 0.065\nhuman_lag: 0.02\nvelocity_soften: 0.86\nvelocity_decay: 0.10\nface_filter: 0.80\nface_margin: 0.085\nside_bias: 0.22\nside_bias_strength: 0.37\ncenter_dead_zone: 0.052\nmax_center_speed: 0.40\neye_level_lift: 0.10\nzoom_smoothing: 0.75\nken_burns_period: 12\nken_burns_pan_amplitude: 4.0\nken_burns_tilt_amplitude: 2.0\nken_burns_zoom_amplitude: 0.0<\/code><\/pre>\n<p>This configuration does not claim to be universal, but it illustrates an important idea well: the quality here is born from a balance of parameters, not from one magical neural network.<\/p>\n<h3>10.1 Profiles for Different Styles<\/h3>\n<div>\n<div class=\"table\">\n<table>\n<tbody>\n<tr>\n<th>\n<p align=\"left\">Style<\/p>\n<\/th>\n<th>\n<p align=\"left\">Stiffness<\/p>\n<\/th>\n<th>\n<p align=\"left\">Damping<\/p>\n<\/th>\n<th>\n<p align=\"left\">Max speed<\/p>\n<\/th>\n<th>\n<p align=\"left\">Idea<\/p>\n<\/th>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">Static \/ robotic<\/p>\n<\/td>\n<td>\n<p align=\"left\">2.0<\/p>\n<\/td>\n<td>\n<p align=\"left\">5.0<\/p>\n<\/td>\n<td>\n<p align=\"left\">0.05<\/p>\n<\/td>\n<td>\n<p align=\"left\">almost no movement<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">Operator mode<\/p>\n<\/td>\n<td>\n<p
align=\"left\">8.4<\/p>\n<\/td>\n<td>\n<p align=\"left\">2.35<\/p>\n<\/td>\n<td>\n<p align=\"left\">0.40<\/p>\n<\/td>\n<td>\n<p align=\"left\">lively but controlled movement<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">Action<\/p>\n<\/td>\n<td>\n<p align=\"left\">15.0<\/p>\n<\/td>\n<td>\n<p align=\"left\">1.2<\/p>\n<\/td>\n<td>\n<p align=\"left\">0.80<\/p>\n<\/td>\n<td>\n<p align=\"left\">fast reaction<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">Cinematic<\/p>\n<\/td>\n<td>\n<p align=\"left\">4.0<\/p>\n<\/td>\n<td>\n<p align=\"left\">4.0<\/p>\n<\/td>\n<td>\n<p align=\"left\">0.15<\/p>\n<\/td>\n<td>\n<p align=\"left\">slow and soft<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<p>This is a convenient way to think not in terms of separate numbers, but in terms of camera behavior profiles.<\/p>\n<h4>11. Practical Edge Cases<\/h4>\n<p>Any article about an algorithm feels incomplete if it does not explain where it breaks and how to fix it.<\/p>\n<h3>11.1 Face at an Angle or in Profile<\/h3>\n<p>If a person turns into profile, the detector may lose the track. 
What helps:<\/p>\n<ul>\n<li>\n<p>increase <code>max_miss_time<\/code>;<\/p>\n<\/li>\n<li>\n<p>loosen <code>match_tolerance<\/code>;<\/p>\n<\/li>\n<li>\n<p>connect YuNet as an additional detector;<\/p>\n<\/li>\n<li>\n<p>lower <code>min_face_confidence<\/code> if input quality is unstable.<\/p>\n<\/li>\n<\/ul>\n<h3>11.2 Anime and Non-Photorealistic Faces<\/h3>\n<p>If the model is trained on photographs while the input is anime, the problem is not \u201cbad code,\u201d but domain mismatch.<\/p>\n<p>Practical options:<\/p>\n<ul>\n<li>\n<p>loosen the thresholds;<\/p>\n<\/li>\n<li>\n<p>reduce <code>min_face_ratio<\/code>;<\/p>\n<\/li>\n<li>\n<p>use an alternative backend;<\/p>\n<\/li>\n<li>\n<p>if needed, switch to a specialized detector.<\/p>\n<\/li>\n<\/ul>\n<h3>11.3 Multiple Faces in the Frame<\/h3>\n<p>When there are two people in the frame, an overly aggressive single-face mode will only make things worse.<\/p>\n<p>Then it is better to widen the target: follow the midpoint of the two largest faces and zoom out until both fit. A sketch (<code>dets<\/code> is the list of current detections; <code>max_zoom_for_span<\/code> is a hypothetical helper that caps zoom so the crop covers both faces):<\/p>\n<pre><code class=\"python\">faces = sorted(dets, key=lambda d: d.size[0] * d.size[1], reverse=True)[:2]\nnew_center = np.mean([f.center for f in faces], axis=0)\nspan = float(np.linalg.norm(faces[0].center - faces[1].center))\nz = min(z, max_zoom_for_span(span))<\/code><\/pre>\n<p>And let composition hold two speakers more naturally.<\/p>\n<h3>11.4 Fast Motion<\/h3>\n<p>For sharp movement, an adaptive boost is useful:<\/p>\n<figure class=\"\"><img decoding=\"async\" src=\"https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/44d\/efe\/c8f\/44defec8fd8cf1601d6edc846ffc48c9.png\" sizes=\"(max-width: 780px) 100vw, 50vw\"
srcset=\"https:\/\/habrastorage.org\/r\/w780\/getpro\/habr\/upload_files\/44d\/efe\/c8f\/44defec8fd8cf1601d6edc846ffc48c9.png 780w,&#10;       https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/44d\/efe\/c8f\/44defec8fd8cf1601d6edc846ffc48c9.png 781w\" loading=\"lazy\" decode=\"async\"\/><\/figure>\n<p>This gives the camera a chance to \u201cwake up\u201d when motion becomes energetic, without making the whole system permanently hyperactive.<\/p>\n<h4>12. Why This Architecture Looks Good<\/h4>\n<p>In my opinion, the engineering value of this solution lies not in separate formulas, but in the overall system approach.<\/p>\n<p>What matters here:<\/p>\n<ul>\n<li>\n<p>Do not rely on one ideal component. Use a fallback architecture instead.<\/p>\n<\/li>\n<li>\n<p>Do not trust raw data. Detection goes through stabilization.<\/p>\n<\/li>\n<li>\n<p>Model dynamics explicitly. The camera is a physical system, not a pile of <code>if<\/code>s.<\/p>\n<\/li>\n<li>\n<p>Account for human perception. Human lag, eye-level lift, dead zone, composition rules.<\/p>\n<\/li>\n<li>\n<p>Design degradation. When there are no faces, the system still produces a reasonable result.<\/p>\n<\/li>\n<li>\n<p>Think not only about accuracy, but also about subjective quality.<\/p>\n<\/li>\n<\/ul>\n<h4>13. 
Performance and Computational Cost<\/h4>\n<p>For completeness, it is worth estimating the pipeline cost too.<\/p>\n<p>Let:<\/p>\n<ul>\n<li>\n<p><code>N<\/code> be the number of analyzed frames;<\/p>\n<\/li>\n<li>\n<p><code>D<\/code> be the face detection time per frame;<\/p>\n<\/li>\n<li>\n<p><code>P<\/code> be the tracking and camera physics time.<\/p>\n<\/li>\n<\/ul>\n<p>Then the analysis cost is:<\/p>\n<pre><code>T_analysis \u2248 N \u00b7 (D + P)<\/code><\/pre>\n<p>For a 5-minute video at <code>analysis_fps = 8<\/code>, we get:<\/p>\n<pre><code>N = 300 \u00d7 8 = 2400<\/code><\/pre>\n<p>Then we also add virtual camera application on every output frame:<\/p>\n<pre><code>T_apply \u2248 M \u00b7 R<\/code><\/pre>\n<p>Where:<\/p>\n<ul>\n<li>\n<p><code>M<\/code> is the number of output frames;<\/p>\n<\/li>\n<li>\n<p><code>R<\/code> is the cost of crop + resize.<\/p>\n<\/li>\n<\/ul>\n<p>In a real pipeline, this is quite practical, especially if the analysis runs on a downscaled copy of the video while the final crop is applied to the original.<\/p>\n<h3>13.1 How to Reduce Analysis Cost<\/h3>\n<p>The three most practical steps:<\/p>\n<ul>\n<li>\n<p>reduce resolution at the detection stage;<\/p>\n<\/li>\n<li>\n<p>reduce <code>analysis_fps<\/code> if the scene is
slow;<\/p>\n<\/li>\n<li>\n<p>do not keep all fallback backends active if the main one works stably.<\/p>\n<\/li>\n<\/ul>\n<p>These are boring optimizations, but they usually give the best ROI.<\/p>\n<h4>14. Final Result<\/h4>\n<p>If we put everything together, the algorithm looks like this:<\/p>\n<figure class=\"bordered full-width \"><img decoding=\"async\" src=\"https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/438\/cd2\/b68\/438cd2b688b5cead62f780261fa5dbbb.png\" width=\"1024\" height=\"1536\" sizes=\"auto, (max-width: 780px) 100vw, 50vw\" srcset=\"https:\/\/habrastorage.org\/r\/w780\/getpro\/habr\/upload_files\/438\/cd2\/b68\/438cd2b688b5cead62f780261fa5dbbb.png 780w,&#10;       https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/438\/cd2\/b68\/438cd2b688b5cead62f780261fa5dbbb.png 781w\" loading=\"lazy\" decode=\"async\"\/><\/figure>\n<p>In practice, this gives a system that:<\/p>\n<ul>\n<li>\n<p>does not jitter on detector noise;<\/p>\n<\/li>\n<li>\n<p>does not look robotic;<\/p>\n<\/li>\n<li>\n<p>degrades gracefully;<\/p>\n<\/li>\n<li>\n<p>handles talking-head scenes and regular videos much better than a simple \u201ccenter crop.\u201d<\/p>\n<\/li>\n<\/ul>\n<\/div>\n<p>Link to the original article: <a href=\"https:\/\/habr.com\/ru\/articles\/1022298\/\">https:\/\/habr.com\/ru\/articles\/1022298\/<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In the previous article I described my \u201canime factory\u201d in detail \u2014 a pipeline that automatically turns episodes into finished Shorts. But inside that system there is one especially important module that deserves a separate deep dive: a virtual camera for automatic reframing.In this article, I will break down not just an \u201cauto-crop function,\u201d but a full virtual camera algorithm for vertical video.
This is exactly the kind of task that looks simple at first glance: you have a horizontal video, you need to turn it into 9:16, keep a person in frame, and avoid making the result look like a jittery autofocus camera from the early 2010s.But as soon as you try to build it not for a demo, but for a real pipeline, engineering problems immediately show up:the face detector is noisy;the face periodically disappears;the target moves unevenly;simply \u201cfollowing the center of the box\u201d is not enough;a perfectly accurate camera often looks unnatural and can even look worse than a slightly \u201chuman\u201d one.In the end, I needed a system that behaves not like a soulless cropper, but like a camera operator: smoothly, with inertia, with motion prediction, with composition-aware corrections, and with a sane fallback mode for cases where there are no faces in the frame at all.In this article, we will go through the entire algorithm end to end:a three-level face detection fallback: MediaPipe \u2192 YuNet \u2192 Haar Cascade;simple but practical face tracking between frames;anti-jerk and low-pass filtering;a virtual camera modeled as a damped oscillator;composition rules: rule of thirds, side bias, eye-level lift, face margin;a Ken Burns fallback when the face is lost or absent;camera path interpolation and applying the virtual crop to video.1. Why Making Vertical Video Is Harder Than It LooksLet\u2019s say we have a regular 16:9 horizontal video. We want to convert it into 9:16 for YouTube Shorts, TikTok, or Reels.The naive approach looks like this:take the center of the frame \u2192 cut out a vertical window \u2192 doneFormally, yes. In practice, no.If a person shifts to the left, the camera will crop them out. If there are two people in the frame, the composition falls apart. If the face is moving, the crop will either lag behind or jump around. 
If there are no faces, the video becomes either static or simply meaningless.So what we need here is not a crop, but a virtual camera \u2014 an entity that has:an observation target;inertia;speed and acceleration limits;reaction delay;composition rules;fallback behavior.That is exactly what turns \u201cauto-crop\u201d into a system that looks like the work of a real camera operator.2. The Full Architecture of the SolutionIf we remove the details, the entire pipeline looks like this:The key idea: the algorithm does not make a decision \u201cfrom a single frame.\u201d It lives over time. And the quality here comes not so much from the detector\u2019s accuracy, but from how the system handles imperfect data.3. Face Detection: A Three-Level FallbackThe first part of the system is not a single detector, but a cascade of three backends.3.1 Why One Detector Is Not EnoughAny face detector makes mistakes from time to time:it loses the face during head turns;it works worse in difficult lighting;it breaks on non-photorealistic faces;it may simply not be installed in the environment.That is why in practice it is more useful to build not an \u201cideal detector,\u201d but a robust degradation system.3.2 Fallback Chain DiagramThis is a simple but very practical pattern: first use the best option, then a backup, then the \u201clast resort.\u201d3.3 MediaPipe as the Primary DetectorMediaPipe is the default working option here.Advantages:it runs fast on CPU;it provides confidence;it usually catches faces well even at an angle and in imperfect lighting;it returns a convenient bounding box in normalized coordinates.Initialization example:mp_face_detection = mp.solutions.face_detection.FaceDetection(    model_selection=1,    min_detection_confidence=0.75)results = mp_face_detection.process(frame_bgr)if results.detections:    for detection in results.detections:        box = detection.location_data.relative_bounding_box        confidence = float(detection.score[0])For a production 
pipeline, what matters is that at the output we normalize everything into a single shape: face center, size, confidence, and bounding box.3.4 YuNet as the Backup ArtilleryYuNet is needed not because MediaPipe is bad, but because production systems love \u201csomething went wrong\u201d scenarios.YuNet is useful when:MediaPipe is unavailable in the environment;MediaPipe did not find a face, but the face is clearly there;you need an alternative ONNX backend through OpenCV.Example:YuNet is slower, but it provides a solid second line of defense.3.5 Haar Cascade as \u201cAt Least Don\u2019t Go Completely Blind\u201dHaar Cascade is not the best detector in terms of quality. But it:is available almost everywhere;does not require heavy dependencies;sometimes saves the day when everything else has failed.Example:cascade = cv2.CascadeClassifier(    cv2.data.haarcascades + &#171;haarcascade_frontalface_default.xml&#187;)faces = cascade.detectMultiScale(gray, 1.1, 5, minSize=(50, 50))From an engineering point of view, the value of Haar is not accuracy, but the fact that even in degraded mode the system does not become completely blind.3.6 Unified Detector InterfaceExternally, the whole chain is hidden behind one interface:backend = _DetectorBackend(min_confidence=0.75)detections = backend.detect(frame_rgb, min_size=50)At the output we get a list of detections with a unified structure:center = (cx, cy)size = (w, h)scoreboxThis is an important architectural point: the rest of the algorithm does not know who exactly found the face.4. From Detection to Tracking: Following the Face Across FramesThe detector answers the question: \u201cwhat is in this frame?\u201dBut the virtual camera needs a different answer: \u201cwhich object are we following over time?\u201dWithout this layer, if there are two faces or if confidence jumps, the camera will keep switching between objects endlessly.4.1 Simple Nearest-Neighbor TrackingIn this case, we do not need a heavy multi-object tracker. 
A simple rule is enough:take the face center from the previous frame;compute distances to all current detections;pick the nearest one;if the distance is within tolerance, treat it as the same object.Code:mp_face_detection = mp.solutions.face_detection.FaceDetection(    model_selection=1,  # 1 = more accurate model, slightly slower    min_detection_confidence=0.75  # confidence threshold)results = mp_face_detection.process(frame_bgr)if results.detections:    for detection in results.detections:        # detection.location_data contains bounding box (x, y, w, h)        # detection.score[0] contains confidence \u2208 [0, 1]        box = detection.location_data.relative_bounding_box        confidence = float(detection.score[0])Why does it work? Because for most talking-head and similar scenarios, face movement between neighboring analyzed frames is limited. So the nearest valid detection is almost always the continuation of the current track.4.2 Handling Losses: Inertial TrackingOne of the most unpleasant problems in face tracking is short-term misses:the person turned away;the light produced a glare;a hand occluded part of the face;the detector just blinked.If at that moment we abruptly switch to fallback, the camera will jerk. So we need a grace period: for a short time, we continue to trust the last known position.if new_center is None and tracked_face is not None and (t &#8212; last_seen_time) &lt; max_miss_time:    new_center = tracked_face.centerThis improves subjective quality a lot. During short dropouts, the camera looks stable rather than nervous.4.3 Why We Do Not Need a \u201cSmart AI Tracker\u201d HereYou could add optical flow, a Kalman filter, appearance embeddings, or a full MOT pipeline. 
But if the task is vertical auto-crop for videos with a limited number of faces, then simple distance-based tracking gives enough quality with minimal complexity and high reproducibility.Sometimes the best algorithm is not the one that looks smarter on paper, but the one that is easier to tune and fix in production.5. Stabilizing the Input Signal: Anti-Jerk and Low-Pass FilterEven if the detector finds the right face, the coordinates are still noisy. That is a fundamental property of the system.Typical problems:the box center shakes slightly from frame to frame;the face size jumps;sometimes a false but \u201cconfident\u201d detection appears far from the previous point.If we feed that directly into the camera, we get jitter.5.1 Anti-Jerk: Hard Limiting of JumpsFirst, we need to cut away completely unreasonable jumps.if filtered_face_center is not None:    delta_face = new_center &#8212; filtered_face_center    max_face_step = 0.04    dist = float(np.linalg.norm(delta_face))    if dist &gt; max_face_step:        new_center = filtered_face_center + delta_face * (max_face_step \/ dist)Formula:The idea is simple: a face cannot teleport across half the screen between neighboring analysis frames. If \u201cthe detector says so,\u201d then it is noise or a false positive.5.2 Low-Pass Filter: Smoothing Residual NoiseAfter the hard clamp, normal but still noisy fluctuations remain. 
Exponential smoothing works well for them:yunet = cv2.FaceDetectorYN.create(    model=path_to_onnx,    config=&#187;&#187;,    input_size=(320, 320),    score_threshold=0.75,    nms_threshold=0.3,    top_k=5000)_, faces = yunet.detect(frame_bgr)  # faces \u2014 array [x, y, w, h, conf]In code:if filtered_face_center is None:    filtered_face_center = new_center.astype(np.float32)else:    filtered_face_center = (        filtered_face_center * face_filter +        new_center.astype(np.float32) * (1.0 &#8212; face_filter)    )    new_center = filtered_face_centerTo simplify the idea: we do not fully trust a single measurement. We carefully blend the new estimate with the already stabilized past.5.3 Why a Filter Alone Is Not EnoughThis is an important point.Anti-jerk alone is a crude instrument. It cuts large outliers,&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[],"tags":[],"class_list":["post-475534","post","type-post","status-publish","format-standard","hentry"],"_links":{"self":[{"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/posts\/475534","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=475534"}],"version-history":[{"count":0,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/posts\/475534\/revisions"}],"wp:attachment":[{"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=475534"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"ht
tps:\/\/savepearlharbor.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=475534"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=475534"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}