
In the previous article I described my “anime factory” in detail — a pipeline that automatically turns episodes into finished Shorts. But inside that system there is one especially important module that deserves a separate deep dive: a virtual camera for automatic reframing.
In this article, I will break down not just an “auto-crop function,” but a full virtual camera algorithm for vertical video. This is exactly the kind of task that looks simple at first glance: you have a horizontal video, you need to turn it into 9:16, keep a person in frame, and avoid making the result look like a jittery autofocus camera from the early 2010s.
But as soon as you try to build it not for a demo, but for a real pipeline, engineering problems immediately show up:
- the face detector is noisy;
- the face periodically disappears;
- the target moves unevenly;
- simply “following the center of the box” is not enough;
- a perfectly accurate camera often looks unnatural and can even look worse than a slightly “human” one.
In the end, I needed a system that behaves not like a soulless cropper, but like a camera operator: smoothly, with inertia, with motion prediction, with composition-aware corrections, and with a sane fallback mode for cases where there are no faces in the frame at all.
In this article, we will go through the entire algorithm end to end:
- a three-level face detection fallback: MediaPipe → YuNet → Haar Cascade;
- simple but practical face tracking between frames;
- anti-jerk and low-pass filtering;
- a virtual camera modeled as a damped oscillator;
- composition rules: rule of thirds, side bias, eye-level lift, face margin;
- a Ken Burns fallback when the face is lost or absent;
- camera path interpolation and applying the virtual crop to video.
1. Why Making Vertical Video Is Harder Than It Looks
Let’s say we have a regular 16:9 horizontal video. We want to convert it into 9:16 for YouTube Shorts, TikTok, or Reels.
The naive approach looks like this:
take the center of the frame → cut out a vertical window → done
Formally, yes. In practice, no.
If a person shifts to the left, the camera will crop them out. If there are two people in the frame, the composition falls apart. If the face is moving, the crop will either lag behind or jump around. If there are no faces, the video becomes either static or simply meaningless.
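For contrast, the naive approach fits in a few lines (a sketch; the function name and the 9:16 default are mine, not from the original code):

```python
import numpy as np

def naive_center_crop(frame: np.ndarray, target_aspect: float = 9 / 16) -> np.ndarray:
    """Cut a vertical window out of the center of the frame — the naive approach."""
    h, w = frame.shape[:2]
    crop_w = min(w, int(round(h * target_aspect)))  # width of the vertical window
    x0 = (w - crop_w) // 2                          # always anchored at the center
    return frame[:, x0:x0 + crop_w]

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)   # a 16:9 frame
crop = naive_center_crop(frame)
print(crop.shape)  # (1080, 608, 3)
```

Everything that follows in this article exists to replace that hard-coded `x0` with something smarter.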
So what we need here is not a crop, but a virtual camera — an entity that has:
- an observation target;
- inertia;
- speed and acceleration limits;
- reaction delay;
- composition rules;
- fallback behavior.
That is exactly what turns “auto-crop” into a system that looks like the work of a real camera operator.
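Those properties map naturally onto a small state object (a sketch; the field names are illustrative, not taken from the original code):

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class VirtualCameraState:
    # all coordinates are normalized to [0, 1] relative to the source frame
    center: np.ndarray = field(default_factory=lambda: np.array([0.5, 0.5], np.float32))
    velocity: np.ndarray = field(default_factory=lambda: np.zeros(2, np.float32))
    zoom: float = 1.0
    max_speed: float = 0.40      # speed limit, fraction of the frame per second
    max_accel: float = 0.85      # acceleration limit
    reaction_lag: float = 0.02   # deliberate "human" delay, in seconds
    has_target: bool = True      # False → Ken Burns fallback

cam = VirtualCameraState()
```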
2. The Full Architecture of the Solution
If we remove the details, the entire pipeline looks like this:

face detection (with fallback) → tracking → anti-jerk + low-pass filtering → camera physics → composition rules → Ken Burns fallback (when needed) → path interpolation → crop and resize
The key idea: the algorithm does not make a decision “from a single frame.” It lives over time. And the quality here comes not so much from the detector’s accuracy, but from how the system handles imperfect data.
3. Face Detection: A Three-Level Fallback
The first part of the system is not a single detector, but a cascade of three backends.
3.1 Why One Detector Is Not Enough
Any face detector makes mistakes from time to time:
- it loses the face during head turns;
- it works worse in difficult lighting;
- it breaks on non-photorealistic faces;
- it may simply not be installed in the environment.
That is why in practice it is more useful to build not an “ideal detector,” but a robust degradation system.
3.2 Fallback Chain Diagram

MediaPipe → YuNet → Haar Cascade
This is a simple but very practical pattern: first use the best option, then a backup, then the “last resort.”
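In code, the whole pattern is just a loop over backends in priority order (a sketch with stub callables; the real backends would be thin wrappers around MediaPipe, YuNet, and Haar):

```python
def detect_faces(frame_rgb, backends):
    """Try each backend in priority order; return the first non-empty result."""
    for backend in backends:
        try:
            detections = backend(frame_rgb)
        except Exception:
            continue  # backend unavailable or crashed — degrade to the next one
        if detections:
            return detections
    return []  # every backend failed; the Ken Burns fallback takes over

# stub backends: the primary finds nothing, the backup finds one face
faces = detect_faces(None, [lambda f: [], lambda f: [{"score": 0.9}]])
print(faces)  # [{'score': 0.9}]
```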
3.3 MediaPipe as the Primary Detector
MediaPipe is the default working option here.
Advantages:
- it runs fast on CPU;
- it provides confidence;
- it usually catches faces well even at an angle and in imperfect lighting;
- it returns a convenient bounding box in normalized coordinates.
Initialization example:
```python
import cv2
import mediapipe as mp

mp_face_detection = mp.solutions.face_detection.FaceDetection(
    model_selection=1,
    min_detection_confidence=0.75)

# MediaPipe expects RGB input, so convert from OpenCV's BGR first
frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
results = mp_face_detection.process(frame_rgb)
if results.detections:
    for detection in results.detections:
        box = detection.location_data.relative_bounding_box
        confidence = float(detection.score[0])
```
For a production pipeline, what matters is that at the output we normalize everything into a single shape: face center, size, confidence, and bounding box.
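That normalized shape can be a small dataclass (a sketch; `FaceDetection` matches the fields listed later in the article, while the `from_mediapipe` helper is purely illustrative):

```python
from dataclasses import dataclass

@dataclass
class FaceDetection:
    center: tuple  # (cx, cy), normalized to [0, 1]
    size: tuple    # (w, h), normalized
    score: float   # detector confidence
    box: tuple     # (x, y, w, h), normalized bounding box

def from_mediapipe(rel_box, confidence):
    # rel_box — normalized (xmin, ymin, width, height) of the bounding box
    x, y, w, h = rel_box
    return FaceDetection(center=(x + w / 2, y + h / 2), size=(w, h),
                         score=confidence, box=(x, y, w, h))

det = from_mediapipe((0.4, 0.3, 0.2, 0.2), 0.91)
print(det.center)  # ≈ (0.5, 0.4)
```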
3.4 YuNet as the Backup Artillery
YuNet is needed not because MediaPipe is bad, but because production systems love “something went wrong” scenarios.
YuNet is useful when:
- MediaPipe is unavailable in the environment;
- MediaPipe did not find a face, but the face is clearly there;
- you need an alternative ONNX backend through OpenCV.
Example:

```python
import cv2

yunet = cv2.FaceDetectorYN.create(
    model=path_to_onnx,
    config="",
    input_size=(320, 320),
    score_threshold=0.75,
    nms_threshold=0.3,
    top_k=5000)

_, faces = yunet.detect(frame_bgr)  # faces — array [x, y, w, h, conf]
```
YuNet is slower, but it provides a solid second line of defense.
3.5 Haar Cascade as “At Least Don’t Go Completely Blind”
Haar Cascade is not the best detector in terms of quality. But it:
- is available almost everywhere;
- does not require heavy dependencies;
- sometimes saves the day when everything else has failed.
Example:
```python
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
faces = cascade.detectMultiScale(gray, 1.1, 5, minSize=(50, 50))
```
From an engineering point of view, the value of Haar is not accuracy, but the fact that even in degraded mode the system does not become completely blind.
3.6 Unified Detector Interface
Externally, the whole chain is hidden behind one interface:
```python
backend = _DetectorBackend(min_confidence=0.75)
detections = backend.detect(frame_rgb, min_size=50)
```
At the output we get a list of detections with a unified structure:
- center = (cx, cy)
- size = (w, h)
- score
- box
This is an important architectural point: the rest of the algorithm does not know who exactly found the face.
4. From Detection to Tracking: Following the Face Across Frames
The detector answers the question: “what is in this frame?”
But the virtual camera needs a different answer: “which object are we following over time?”
Without this layer, if there are two faces or if confidence jumps, the camera will keep switching between objects endlessly.
4.1 Simple Nearest-Neighbor Tracking
In this case, we do not need a heavy multi-object tracker. A simple rule is enough:
- take the face center from the previous frame;
- compute distances to all current detections;
- pick the nearest one;
- if the distance is within tolerance, treat it as the same object.
Code:

```python
if tracked_face is not None:
    prev_center = tracked_face.center  # where the face used to be
    # find the nearest detection
    dists = [np.linalg.norm(d.center - prev_center) for d in dets]
    j = int(np.argmin(dists))  # index of the nearest one
    # if the distance is acceptable — update the track
    if dists[j] < match_tolerance:  # usually 0.15-0.20
        tracked_face = dets[j]
        new_center = dets[j].center
        last_seen_time = t
    # otherwise the track is lost, and Ken Burns will be used
```
Why does it work? Because for most talking-head and similar scenarios, face movement between neighboring analyzed frames is limited. So the nearest valid detection is almost always the continuation of the current track.
4.2 Handling Losses: Inertial Tracking
One of the most unpleasant problems in face tracking is short-term misses:
- the person turned away;
- the light produced a glare;
- a hand occluded part of the face;
- the detector just blinked.
If at that moment we abruptly switch to fallback, the camera will jerk. So we need a grace period: for a short time, we continue to trust the last known position.
```python
if (new_center is None and tracked_face is not None
        and (t - last_seen_time) < max_miss_time):
    new_center = tracked_face.center
```
This improves subjective quality a lot. During short dropouts, the camera looks stable rather than nervous.
4.3 Why We Do Not Need a “Smart AI Tracker” Here
You could add optical flow, a Kalman filter, appearance embeddings, or a full MOT pipeline. But if the task is vertical auto-crop for videos with a limited number of faces, then simple distance-based tracking gives enough quality with minimal complexity and high reproducibility.
Sometimes the best algorithm is not the one that looks smarter on paper, but the one that is easier to tune and fix in production.
5. Stabilizing the Input Signal: Anti-Jerk and Low-Pass Filter
Even if the detector finds the right face, the coordinates are still noisy. That is a fundamental property of the system.
Typical problems:
- the box center shakes slightly from frame to frame;
- the face size jumps;
- sometimes a false but “confident” detection appears far from the previous point.
If we feed that directly into the camera, we get jitter.
5.1 Anti-Jerk: Hard Limiting of Jumps
First, we need to cut away completely unreasonable jumps.
```python
if filtered_face_center is not None:
    delta_face = new_center - filtered_face_center
    max_face_step = 0.04
    dist = float(np.linalg.norm(delta_face))
    if dist > max_face_step:
        new_center = filtered_face_center + delta_face * (max_face_step / dist)
```
Formula:

new_center = filtered_center + Δ · (max_step / ‖Δ‖) if ‖Δ‖ > max_step, where Δ = raw_center − filtered_center
The idea is simple: a face cannot teleport across half the screen between neighboring analysis frames. If “the detector says so,” then it is noise or a false positive.
5.2 Low-Pass Filter: Smoothing Residual Noise
After the hard clamp, normal but still noisy fluctuations remain. Exponential smoothing works well for them:
filtered_center = α · filtered_center + (1 − α) · new_center, where α = face_filter
In code:
```python
if filtered_face_center is None:
    filtered_face_center = new_center.astype(np.float32)
else:
    filtered_face_center = (
        filtered_face_center * face_filter
        + new_center.astype(np.float32) * (1.0 - face_filter)
    )
new_center = filtered_face_center
```
To simplify the idea: we do not fully trust a single measurement. We carefully blend the new estimate with the already stabilized past.
5.3 Why a Filter Alone Is Not Enough
This is an important point.
Anti-jerk alone is a crude instrument. It cuts large outliers, but does not remove small shaking.
Low-pass alone is also not enough. It smooths noise, but with a large outlier it will still drag the signal to the side.
So we need both steps together:
detection → hard clamp → low-pass → camera
This exact composition is what makes the input signal suitable for the downstream physical model.
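Both steps fit into one small function (a sketch; the defaults mirror the max_face_step and face_filter values used above):

```python
import numpy as np

def stabilize(raw_center, filtered_center, max_step=0.04, alpha=0.80):
    """Anti-jerk (hard clamp) followed by low-pass (exponential smoothing)."""
    raw_center = np.asarray(raw_center, dtype=np.float32)
    if filtered_center is None:
        return raw_center                      # first measurement: accept as-is
    # 1) anti-jerk: limit the length of the jump
    delta = raw_center - filtered_center
    dist = float(np.linalg.norm(delta))
    if dist > max_step:
        raw_center = filtered_center + delta * (max_step / dist)
    # 2) low-pass: blend with the stabilized past
    return filtered_center * alpha + raw_center * (1.0 - alpha)

c = stabilize((0.9, 0.5), np.array([0.5, 0.5], np.float32))
# the 0.4 jump is clamped to 0.04 (→ 0.54), then blended with weight 0.2 → ≈ 0.508
```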
6. The Virtual Camera as a Physical System
This is where the part begins that really separates a natural-looking result from a “smart crop.”
If the camera instantly snaps to the target point, it looks robotic. A real camera operator does not work like that. There is inertia, acceleration limits, and natural damping.
That is why it is convenient to model the camera as a damped oscillator.
6.1 Mathematical Model
We take the classic spring-damper system:

m · a = k · (x_target − x) − c · v
Where:
- m is the effective mass;
- k is the spring stiffness;
- c is damping;
- x is the current camera position;
- x_target is the target position;
- v is the camera velocity.
Intuitively:
- the spring pulls the camera toward the target;
- damping prevents it from oscillating forever;
- speed and acceleration limits make the motion believable.
6.2 Numerical Integration Per Frame
At each analysis step:
```python
error = target_center - prev_center
accel = error * follow_stiffness - velocity * follow_damping
acc_norm = np.linalg.norm(accel)
if acc_norm > max_center_accel:
    accel = accel * (max_center_accel / acc_norm)
velocity = velocity + accel * dt
velocity *= velocity_soften
velocity *= (1.0 - velocity_decay)
speed = np.linalg.norm(velocity)
if speed > max_center_speed:
    velocity = velocity * (max_center_speed / speed)
new_pos = prev_center + velocity * dt
```
This scheme is simple, but gives very good control over camera behavior.
6.3 Meaning of the Key Parameters
To avoid tuning blindly, it is important to understand what the parameters do.
| Parameter | Meaning | Effect when increased |
|---|---|---|
| follow_stiffness | how strongly the camera is pulled toward the target | faster reaction, but higher overshoot risk |
| follow_damping | resistance to movement | less oscillation, more conservative camera |
| max_center_accel | acceleration limit | the camera cannot burst forward abruptly |
| max_center_speed | speed limit | the camera will not fly faster than the allowed pace |
| velocity_soften | additional velocity softening | fewer high-frequency oscillations |
| velocity_decay | exponential decay | the camera settles down faster |
A camera is not a coordinate. It is a dynamic system. That is exactly what makes it visually believable.
6.4 Predictive Lead: An Operator Does Not Look Exactly Where the Face Is Right Now
If a person moves quickly, the camera should not simply chase them. Otherwise, it will always be slightly behind.
So it is useful to add a light prediction. A minimal sketch, based on the predictive_lead parameter from the working profile (the exact form is an assumption):

```python
if predictive_lead > 0.0:
    # aim slightly ahead of the face, along the camera's current velocity
    target_center = target_center + velocity * predictive_lead
```

This is a small extrapolation based on the current motion. Visually, it makes the tracking feel more “human.”
6.5 Human Lag: Paradoxically, Sometimes You Need to Slow the Camera Down
A perfectly responsive system often looks worse than a live operator.
A human camera operator does not teleport to a new point instantly. There is always a micro-delay in reaction. That is why a slight lag can actually improve perception. A minimal sketch of the idea, modeled as a first-order lag over the human_lag parameter (the variable names are illustrative):

```python
# blend toward the new target over ~human_lag seconds instead of snapping to it
lag_blend = min(1.0, dt / max(human_lag, dt))
delayed_target = delayed_target * (1.0 - lag_blend) + target_center * lag_blend
target_center = delayed_target
```
This is a subtle but important point: realism is not always equal to maximum accuracy.
7. Composition: The Camera Should Not Only Track, but Frame Nicely
Even a perfectly stable camera can still produce a bad image if the composition is primitive.
If you always place the face strictly in the center, the shot quickly starts to look flat and “machine-made.” So we need to add composition heuristics on top of the physics.
7.1 Rule of Thirds and Side Bias
When there are multiple faces or the scene does not require strict centering, it is useful to shift the subject toward the rule-of-thirds lines.
```python
if side_bias > 0.0 and not single_face_active:
    biased_target = 0.5 - side_bias if cx < 0.5 else 0.5 + side_bias
    edge_proximity = abs(cx - 0.5) * 2
    adaptive_bias = side_bias_strength * (1 - edge_proximity)
    cx = cx * (1.0 - adaptive_bias) + biased_target * adaptive_bias
```
What matters here is that the shift is adaptive. If the face is already near the edge, the bias weakens; otherwise, you can make the frame worse instead of better.
7.2 Single-Face Mode
If the video consistently contains one face, the camera should behave differently:
- fewer composition experiments;
- more stabilization;
- more conservative speed;
- better retention of a talking-head shot.
Parameter adaptation example:
```python
if single_face_active and stabilization_strength > 0.0:
    effective_face_filter = min(0.995, face_filter + stabilization_strength * 0.1)
    effective_smoothing = min(0.985, smoothing + stabilization_strength * 0.08)
    effective_center_dead_zone = min(0.35, center_dead_zone + stabilization_strength * 0.08)
    effective_max_center_speed = max(0.05, max_center_speed * (1.0 - 0.25 * stabilization_strength))
```
These conditional tuning rules are exactly what usually separates a “working system” from an abstract algorithm.
7.3 Eye-Level Lift: Aim at the Eyes, Not at the Geometric Center
A face bounding box is not composition yet. If we aim strictly at the center of the box, we often get a frame where the camera is looking at the nose.
It is much better to raise the attention point slightly toward eye level:
```python
cy = np.clip(cy - tracked_face.size[1] * eye_level_lift, 0.0, 1.0)
```
It is a small correction, but it has a strong effect on the subjective quality of the frame.
7.4 Dead Zone: Ignore Micromovements
If a person moved slightly or the detector produced micro-noise, the camera should not react to every tiny change.
```python
delta = target_center - prev_center
dist = np.linalg.norm(delta)
if dist < effective_center_dead_zone:
    target_center = prev_center
```
Dead zone is one of the most underrated parameters. Without it, the camera looks nervous even with good detection.
7.5 Face Margin: Never Press the Face Against the Edge
Even if mathematics allows us to move the center anywhere, real composition needs a safety margin.
```python
if face_margin > 0.0:
    half_crop_w = min(0.5, (target_width / sw) / (2.0 * z))
    half_crop_h = min(0.5, (target_height / sh) / (2.0 * z))
    guard_x = max(face_margin, half_crop_w)
    guard_y = max(face_margin, half_crop_h)
    cx = np.clip(cx, guard_x, 1.0 - guard_x)
    cy = np.clip(cy, guard_y, 1.0 - guard_y)
```
This protects the face from unpleasant trimming of ears, hair, gestures, and generally makes the shot feel more “airy.”
8. What to Do When There Are No Faces: Ken Burns Fallback
A system that can only work in “I see a face” mode breaks on any more complex video.
We need a fallback mode for when:
- the face is lost;
- there is no face at all;
- the frame contains a static scene;
- the detector failed.
This is where the classic Ken Burns effect helps — gentle panning and zooming.
8.1 Ken Burns Motion Model
The simplest version is based on sines. A sketch using the ken_burns_* parameters from the working profile (treating the pan/tilt amplitudes as percent of the frame is an assumption):

```python
phase = 2.0 * np.pi * t / ken_burns_period
cx = 0.5 + (ken_burns_pan_amplitude / 100.0) * np.sin(phase)
cy = 0.5 + (ken_burns_tilt_amplitude / 100.0) * np.sin(phase * 0.5)
zoom = 1.0 + ken_burns_zoom_amplitude * 0.5 * (1.0 + np.sin(phase * 0.25))
```
This gives controlled, smooth movement without random jerking.
8.2 Why Ken Burns Is Better Than a Static Center
A static crop when there are no faces looks like a bug. Ken Burns creates the impression that the system is still “holding the scene” instead of simply freezing.
For many videos, that alone is enough for the fallback not to be perceived as degradation.
8.3 Soft Transition from Face Tracking to Fallback
The transition should also be smooth.
```python
else:
    kc, kz = _ken_burns_motion(...)
    camera_states.append(CameraState(t, kc, kz, False))
    velocity *= 0.5
```
We do not teleport into a new logic branch — we gradually damp the accumulated camera velocity.
This is exactly the kind of detail that is not always immediately visible in code, but is very visible in the final result.
9. From Discrete States to a Continuous Camera Path
At the analysis stage, we compute camera states at, for example, 8 FPS. But the final video may be 30 FPS or 60 FPS.
If we apply the camera path as-is, the motion will be step-like. So we need a continuous path via interpolation.
9.1 State Interpolation
The basic option is linear interpolation between neighboring points (a sketch; s0 and s1 are the camera states that bracket the output timestamp t):

```python
alpha = (t - s0.t) / (s1.t - s0.t)
center = s0.center * (1.0 - alpha) + s1.center * alpha
zoom = s0.zoom * (1.0 - alpha) + s1.zoom * alpha
```
That is enough because the physical model itself already makes the trajectory sufficiently smooth.
9.2 Applying the Virtual Camera
Once we have a path(t) function, the rest is pretty straightforward:
- get the source frame;
- take the center and zoom from path(t);
- compute the ROI;
- cut out the region;
- scale it to the final 9:16 size.
At this point, the algorithm turns from an analytical model into a real video.
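A sketch of that application step (the function name is mine; nearest-neighbor resizing stands in for cv2.resize to keep the example dependency-free):

```python
import numpy as np

def apply_camera(frame, center, zoom, out_w=1080, out_h=1920):
    """Cut the ROI described by (center, zoom) and scale it to 9:16 output."""
    h, w = frame.shape[:2]
    crop_h = int(h / zoom)                          # zoom > 1 narrows the window
    crop_w = min(w, int(crop_h * out_w / out_h))    # keep the 9:16 aspect
    cx, cy = int(center[0] * w), int(center[1] * h)
    x0 = int(np.clip(cx - crop_w // 2, 0, w - crop_w))
    y0 = int(np.clip(cy - crop_h // 2, 0, h - crop_h))
    roi = frame[y0:y0 + crop_h, x0:x0 + crop_w]
    # nearest-neighbor resize; in production: cv2.resize(roi, (out_w, out_h))
    ys = np.arange(out_h) * crop_h // out_h
    xs = np.arange(out_w) * crop_w // out_w
    return roi[ys][:, xs]

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
print(apply_camera(frame, (0.6, 0.5), 1.0).shape)  # (1920, 1080, 3)
```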
10. Working Parameter Profile: Operator Mode
The most interesting thing in such systems is not just the formulas, but how they are actually tuned in practice.
Below is a profile oriented toward a “human operator” for talking-head and similar scenarios:
```yaml
analysis_fps: 8
min_face_ratio: 0.06
min_face_confidence: 0.75
match_tolerance: 0.18
max_miss_time: 2.7
smoothing: 0.93
stabilization_strength: 0.26
follow_stiffness: 8.4
follow_damping: 2.35
max_center_accel: 0.85
predictive_lead: 0.065
human_lag: 0.02
velocity_soften: 0.86
velocity_decay: 0.10
face_filter: 0.80
face_margin: 0.085
side_bias: 0.22
side_bias_strength: 0.37
center_dead_zone: 0.052
max_center_speed: 0.40
eye_level_lift: 0.10
zoom_smoothing: 0.75
ken_burns_period: 12
ken_burns_pan_amplitude: 4.0
ken_burns_tilt_amplitude: 2.0
ken_burns_zoom_amplitude: 0.0
```
This configuration does not claim to be universal, but it illustrates an important idea well: the quality here is born from a balance of parameters, not from one magical neural network.
10.1 Profiles for Different Styles
| Style | Stiffness | Damping | Max speed | Idea |
|---|---|---|---|---|
| Static / robotic | 2.0 | 5.0 | 0.05 | almost no movement |
| Operator mode | 8.4 | 2.35 | 0.40 | lively but controlled movement |
| Action | 15.0 | 1.2 | 0.80 | fast reaction |
| Cinematic | 4.0 | 4.0 | 0.15 | slow and soft |
This is a convenient way to think not in terms of separate numbers, but in terms of camera behavior profiles.
11. Practical Edge Cases
Any article about an algorithm feels incomplete if it does not explain where it breaks and how to fix it.
11.1 Face at an Angle or in Profile
If a person turns into profile, the detector may lose the track. What helps:
- increase max_miss_time;
- loosen match_tolerance;
- connect YuNet as an additional detector;
- lower min_face_confidence if input quality is unstable.
11.2 Anime and Non-Photorealistic Faces
If the model is trained on photographs while the input is anime, the problem is not “bad code,” but domain mismatch.
Practical options:
- loosen the thresholds;
- reduce min_face_ratio;
- use an alternative backend;
- if needed, switch to a specialized detector.
11.3 Multiple Faces in the Frame
When there are two people in the frame, an overly aggressive single-face mode will only make things worse.
Then it is better to drop the single-face stabilization boosts (a minimal sketch):

```python
if len(detections) > 1:
    single_face_active = False  # back to the regular composition rules
```
And let composition hold two speakers more naturally.
11.4 Fast Motion
For sharp movement, an adaptive boost is useful:

This gives the camera a chance to “wake up” when motion becomes energetic, without making the whole system permanently hyperactive.
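One possible shape of such a boost (an assumption, not the original code; the threshold and factor are illustrative):

```python
import numpy as np

def adaptive_boost(velocity, max_center_speed=0.40, follow_stiffness=8.4,
                   fast_threshold=0.25, boost=1.5):
    """Temporarily raise the camera limits when the target moves fast."""
    speed = float(np.linalg.norm(velocity))
    if speed > fast_threshold * max_center_speed:
        return max_center_speed * boost, follow_stiffness * boost
    return max_center_speed, follow_stiffness

print(adaptive_boost(np.array([0.2, 0.0]))[0] > 0.40)  # True — the limit was raised
```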
12. Why This Architecture Looks Good
In my opinion, the engineering value of this solution lies not in separate formulas, but in the overall system approach.
What matters here:
- Do not rely on one ideal component. Use a fallback architecture instead.
- Do not trust raw data. Detection goes through stabilization.
- Model dynamics explicitly. The camera is a physical system, not a pile of ifs.
- Account for human perception. Human lag, eye-level lift, dead zone, composition rules.
- Design degradation. When there are no faces, the system still produces a reasonable result.
- Think not only about accuracy, but also about subjective quality.
13. Performance and Computational Cost
For completeness, it is worth estimating the pipeline cost too.
Let:
- N is the number of analyzed frames;
- D is the face detection time per frame;
- P is the tracking and camera physics time per frame.
Then:

T_analysis ≈ N · (D + P)
For a 5-minute video at analysis_fps = 8, we get:
N = 300 × 8 = 2400
Then we also add virtual camera application on every output frame:
T_apply ≈ M · R
Where:
- M is the number of output frames;
- R is the cost of crop + resize per frame.
In a real pipeline, this is quite practical, especially if the analysis runs on a downscaled copy of the video while the final crop is applied to the original.
13.1 How to Reduce Analysis Cost
The three most practical steps:
- reduce resolution at the detection stage;
- reduce analysis_fps if the scene is slow;
- do not keep all fallback backends active if the main one works stably.
These are boring optimizations, but they usually give the best ROI.
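The first step can be as simple as stride-based decimation before detection (a sketch; cv2.resize with INTER_AREA is the production choice — the coordinates survive the downscale because the whole pipeline works in normalized [0, 1] units):

```python
import numpy as np

def downscale_for_detection(frame, max_width=640):
    """Return a smaller copy of the frame for the detection stage."""
    h, w = frame.shape[:2]
    if w <= max_width:
        return frame
    step = int(np.ceil(w / max_width))  # decimation factor
    return frame[::step, ::step]

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
print(downscale_for_detection(frame).shape)  # (360, 640, 3)
```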
14. Final Result
If we put everything together, the algorithm looks like this:

detector fallback (MediaPipe → YuNet → Haar Cascade) → nearest-neighbor tracking → anti-jerk + low-pass filtering → damped-oscillator camera with composition corrections → Ken Burns fallback → path interpolation → crop and resize
In practice, this gives a system that:
- does not jitter on detector noise;
- does not look robotic;
- degrades gracefully;
- handles talking-head scenes and regular videos much better than a simple “center crop.”
Link to the original article: https://habr.com/ru/articles/1022298/