I Taught a Virtual Camera to Behave Like a Human Operator: How a Face Tracking Algorithm for Shorts/Reels Works

From the author

In the previous article I described my “anime factory” in detail — a pipeline that automatically turns episodes into finished Shorts. But inside that system there is one especially important module that deserves a separate deep dive: a virtual camera for automatic reframing.

In this article, I will break down not just an “auto-crop function,” but a full virtual camera algorithm for vertical video. This is exactly the kind of task that looks simple at first glance: you have a horizontal video, you need to turn it into 9:16, keep a person in frame, and avoid making the result look like a jittery autofocus camera from the early 2010s.

But as soon as you try to build it not for a demo, but for a real pipeline, engineering problems immediately show up:

  • the face detector is noisy;

  • the face periodically disappears;

  • the target moves unevenly;

  • simply “following the center of the box” is not enough;

  • a perfectly accurate camera often looks unnatural and can even look worse than a slightly “human” one.

In the end, I needed a system that behaves not like a soulless cropper, but like a camera operator: smoothly, with inertia, with motion prediction, with composition-aware corrections, and with a sane fallback mode for cases where there are no faces in the frame at all.

In this article, we will go through the entire algorithm end to end:

  • a three-level face detection fallback: MediaPipe → YuNet → Haar Cascade;

  • simple but practical face tracking between frames;

  • anti-jerk and low-pass filtering;

  • a virtual camera modeled as a damped oscillator;

  • composition rules: rule of thirds, side bias, eye-level lift, face margin;

  • a Ken Burns fallback when the face is lost or absent;

  • camera path interpolation and applying the virtual crop to video.

1. Why Making Vertical Video Is Harder Than It Looks

Let’s say we have a regular 16:9 horizontal video. We want to convert it into 9:16 for YouTube Shorts, TikTok, or Reels.

The naive approach looks like this:

take the center of the frame → cut out a vertical window → done

Formally, yes. In practice, no.

If a person shifts to the left, the camera will crop them out. If there are two people in the frame, the composition falls apart. If the face is moving, the crop will either lag behind or jump around. If there are no faces, the video becomes either static or simply meaningless.
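For illustration, the naive approach can be sketched in a few lines of NumPy (the frame size here is arbitrary):

```python
import numpy as np

def naive_center_crop(frame: np.ndarray, target_aspect: float = 9 / 16) -> np.ndarray:
    """Cut a vertical window out of the center of a horizontal frame."""
    h, w = frame.shape[:2]
    crop_w = int(h * target_aspect)      # width of the 9:16 window at full height
    x0 = (w - crop_w) // 2               # left edge of the centered window
    return frame[:, x0:x0 + crop_w]

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)  # a 16:9 frame
vertical = naive_center_crop(frame)
print(vertical.shape)  # (1080, 607, 3)
```

This is exactly the version that breaks as soon as the subject leaves the center.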

So what we need here is not a crop, but a virtual camera — an entity that has:

  • an observation target;

  • inertia;

  • speed and acceleration limits;

  • reaction delay;

  • composition rules;

  • fallback behavior.

That is exactly what turns “auto-crop” into a system that looks like the work of a real camera operator.

2. The Full Architecture of the Solution

If we remove the details, the entire pipeline looks like this:

detection → tracking → signal stabilization → camera physics → composition rules → Ken Burns fallback → path interpolation → crop

The key idea: the algorithm does not make a decision “from a single frame.” It lives over time. And the quality here comes not so much from the detector’s accuracy, but from how the system handles imperfect data.

3. Face Detection: A Three-Level Fallback

The first part of the system is not a single detector, but a cascade of three backends.

3.1 Why One Detector Is Not Enough

Any face detector makes mistakes from time to time:

  • it loses the face during head turns;

  • it works worse in difficult lighting;

  • it breaks on non-photorealistic faces;

  • it may simply not be installed in the environment.

That is why in practice it is more useful to build not an “ideal detector,” but a robust degradation system.

3.2 Fallback Chain Diagram

MediaPipe → YuNet → Haar Cascade

This is a simple but very practical pattern: first use the best option, then a backup, then the “last resort.”

3.3 MediaPipe as the Primary Detector

MediaPipe is the default working option here.

Advantages:

  • it runs fast on CPU;

  • it provides confidence;

  • it usually catches faces well even at an angle and in imperfect lighting;

  • it returns a convenient bounding box in normalized coordinates.

Initialization example:

mp_face_detection = mp.solutions.face_detection.FaceDetection(
    model_selection=1,
    min_detection_confidence=0.75
)
# note: MediaPipe expects an RGB image, so convert BGR frames first
results = mp_face_detection.process(frame_rgb)
if results.detections:
    for detection in results.detections:
        box = detection.location_data.relative_bounding_box
        confidence = float(detection.score[0])

For a production pipeline, what matters is that at the output we normalize everything into a single shape: face center, size, confidence, and bounding box.

3.4 YuNet as the Backup Artillery

YuNet is needed not because MediaPipe is bad, but because production systems love “something went wrong” scenarios.

YuNet is useful when:

  • MediaPipe is unavailable in the environment;

  • MediaPipe did not find a face, but the face is clearly there;

  • you need an alternative ONNX backend through OpenCV.

Example:

yunet = cv2.FaceDetectorYN.create(
    model=path_to_onnx,
    config="",
    input_size=(320, 320),
    score_threshold=0.75,
    nms_threshold=0.3,
    top_k=5000
)
_, faces = yunet.detect(frame_bgr)  # faces — array of [x, y, w, h, ..., conf]

YuNet is slower, but it provides a solid second line of defense.

3.5 Haar Cascade as “At Least Don’t Go Completely Blind”

Haar Cascade is not the best detector in terms of quality. But it:

  • is available almost everywhere;

  • does not require heavy dependencies;

  • sometimes saves the day when everything else has failed.

Example:

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)
faces = cascade.detectMultiScale(gray, 1.1, 5, minSize=(50, 50))

From an engineering point of view, the value of Haar is not accuracy, but the fact that even in degraded mode the system does not become completely blind.

3.6 Unified Detector Interface

Externally, the whole chain is hidden behind one interface:

backend = _DetectorBackend(min_confidence=0.75)
detections = backend.detect(frame_rgb, min_size=50)

At the output we get a list of detections with a unified structure:

  • center = (cx, cy)

  • size = (w, h)

  • score

  • box

This is an important architectural point: the rest of the algorithm does not know who exactly found the face.
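A minimal sketch of what such a unified interface can look like (the class names and backend callables here are illustrative, not the article's actual code):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class FaceDetection:
    center: np.ndarray   # (cx, cy), normalized to [0, 1]
    size: tuple          # (w, h), normalized
    score: float         # detector confidence
    box: tuple           # (x, y, w, h), normalized

class DetectorBackend:
    """Tries each backend in order and returns the first non-empty result."""
    def __init__(self, backends, min_confidence=0.75):
        self.backends = backends          # callables: frame -> list[FaceDetection]
        self.min_confidence = min_confidence

    def detect(self, frame):
        for backend in self.backends:
            detections = [d for d in backend(frame) if d.score >= self.min_confidence]
            if detections:
                return detections
        return []

# usage: the primary backend finds nothing, the fallback answers
failing = lambda frame: []
fallback = lambda frame: [
    FaceDetection(np.array([0.5, 0.4]), (0.2, 0.3), 0.9, (0.4, 0.25, 0.2, 0.3))
]
backend = DetectorBackend([failing, fallback])
print(len(backend.detect(None)))  # 1
```

The rest of the pipeline only ever sees `FaceDetection` objects, regardless of which backend produced them.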

4. From Detection to Tracking: Following the Face Across Frames

The detector answers the question: “what is in this frame?”

But the virtual camera needs a different answer: “which object are we following over time?”

Without this layer, if there are two faces or if confidence jumps, the camera will keep switching between objects endlessly.

4.1 Simple Nearest-Neighbor Tracking

In this case, we do not need a heavy multi-object tracker. A simple rule is enough:

  • take the face center from the previous frame;

  • compute distances to all current detections;

  • pick the nearest one;

  • if the distance is within tolerance, treat it as the same object.

Code:

if tracked_face is not None:
    prev_center = tracked_face.center  # where the face was last seen
    # distances to all current detections
    dists = [np.linalg.norm(d.center - prev_center) for d in dets]
    j = int(np.argmin(dists))  # index of the nearest one
    # if the distance is acceptable — continue the same track
    if dists[j] < match_tolerance:  # usually 0.15-0.20
        tracked_face = dets[j]
        new_center = dets[j].center
        last_seen_time = t
    # otherwise the track is lost, and Ken Burns will be used

Why does it work? Because for most talking-head and similar scenarios, face movement between neighboring analyzed frames is limited. So the nearest valid detection is almost always the continuation of the current track.

4.2 Handling Losses: Inertial Tracking

One of the most unpleasant problems in face tracking is short-term misses:

  • the person turned away;

  • the light produced a glare;

  • a hand occluded part of the face;

  • the detector just blinked.

If at that moment we abruptly switch to fallback, the camera will jerk. So we need a grace period: for a short time, we continue to trust the last known position.

if (new_center is None and tracked_face is not None
        and (t - last_seen_time) < max_miss_time):
    new_center = tracked_face.center

This improves subjective quality a lot. During short dropouts, the camera looks stable rather than nervous.

4.3 Why We Do Not Need a “Smart AI Tracker” Here

You could add optical flow, a Kalman filter, appearance embeddings, or a full MOT pipeline. But if the task is vertical auto-crop for videos with a limited number of faces, then simple distance-based tracking gives enough quality with minimal complexity and high reproducibility.

Sometimes the best algorithm is not the one that looks smarter on paper, but the one that is easier to tune and fix in production.

5. Stabilizing the Input Signal: Anti-Jerk and Low-Pass Filter

Even if the detector finds the right face, the coordinates are still noisy. That is a fundamental property of the system.

Typical problems:

  • the box center shakes slightly from frame to frame;

  • the face size jumps;

  • sometimes a false but “confident” detection appears far from the previous point.

If we feed that directly into the camera, we get jitter.

5.1 Anti-Jerk: Hard Limiting of Jumps

First, we need to cut away completely unreasonable jumps.

if filtered_face_center is not None:
    delta_face = new_center - filtered_face_center
    max_face_step = 0.04  # max movement per analysis frame (fraction of the screen)
    dist = float(np.linalg.norm(delta_face))
    if dist > max_face_step:
        new_center = filtered_face_center + delta_face * (max_face_step / dist)

Formula:

if ‖Δ‖ > s_max: new_center = filtered_center + Δ · (s_max / ‖Δ‖), where Δ = new_center − filtered_center

The idea is simple: a face cannot teleport across half the screen between neighboring analysis frames. If the detector claims it did, that is noise or a false positive.

5.2 Low-Pass Filter: Smoothing Residual Noise

After the hard clamp, normal but still noisy fluctuations remain. Exponential smoothing works well for them:

filtered_center = α · filtered_center + (1 − α) · new_center

where α = face_filter is the smoothing coefficient: the closer it is to 1, the heavier and slower the filter.

In code:

if filtered_face_center is None:
    filtered_face_center = new_center.astype(np.float32)
else:
    filtered_face_center = (
        filtered_face_center * face_filter +
        new_center.astype(np.float32) * (1.0 - face_filter)
    )
    new_center = filtered_face_center

To simplify the idea: we do not fully trust a single measurement. We carefully blend the new estimate with the already stabilized past.

5.3 Why a Filter Alone Is Not Enough

This is an important point.

Anti-jerk alone is a crude instrument. It cuts large outliers, but does not remove small shaking.

Low-pass alone is also not enough. It smooths noise, but with a large outlier it will still drag the signal to the side.

So we need both steps together:

detection → hard clamp → low-pass → camera

This exact composition is what makes the input signal suitable for the downstream physical model.
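Putting the two steps together, a self-contained version of the filter chain might look like this (a sketch, not the article's exact code):

```python
import numpy as np

def stabilize(filtered, measurement, max_step=0.04, alpha=0.80):
    """Hard clamp on the jump size, then exponential smoothing."""
    measurement = np.asarray(measurement, dtype=np.float32)
    if filtered is None:
        return measurement
    # 1) anti-jerk: limit how far a single measurement can move us
    delta = measurement - filtered
    dist = float(np.linalg.norm(delta))
    if dist > max_step:
        measurement = filtered + delta * (max_step / dist)
    # 2) low-pass: blend the clamped measurement with the stabilized past
    return filtered * alpha + measurement * (1.0 - alpha)

center = None
for raw in [(0.50, 0.50), (0.51, 0.50), (0.90, 0.90), (0.52, 0.51)]:
    center = stabilize(center, raw)
# the outlier (0.90, 0.90) shifted the filter by at most max_step * (1 - alpha)
print(np.round(center, 3))
```

The false detection at (0.90, 0.90) barely moves the stabilized center, while the small genuine movements still pass through.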

6. The Virtual Camera as a Physical System

This is where the part begins that really separates a natural-looking result from a “smart crop.”

If the camera instantly snaps to the target point, it looks robotic. A real camera operator does not work like that: there is inertia, there are acceleration limits, and there is natural damping.

That is why it is convenient to model the camera as a damped oscillator.

6.1 Mathematical Model

We take the classic spring-damper system:

m · a = k · (x_target − x) − c · v

Where:

  • m is the effective mass;

  • k is the spring stiffness;

  • c is damping;

  • x is the current camera position;

  • x_target is the target position;

  • v is the camera velocity.

Intuitively:

  • the spring pulls the camera toward the target;

  • damping prevents it from oscillating forever;

  • speed and acceleration limits make the motion believable.

6.2 Numerical Integration Per Frame

At each analysis step:

error = target_center - prev_center
accel = error * follow_stiffness - velocity * follow_damping

acc_norm = np.linalg.norm(accel)
if acc_norm > max_center_accel:
    accel = accel * (max_center_accel / acc_norm)

velocity = velocity + accel * dt
velocity *= velocity_soften
velocity *= (1.0 - velocity_decay)

speed = np.linalg.norm(velocity)
if speed > max_center_speed:
    velocity = velocity * (max_center_speed / speed)

new_pos = prev_center + velocity * dt

This scheme is simple, but gives very good control over camera behavior.
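The same integration scheme is easy to test in isolation. A self-contained sketch (parameter values borrowed from the "operator mode" profile later in the article; dt corresponds to 8 FPS analysis):

```python
import numpy as np

def camera_step(pos, vel, target, dt,
                stiffness=8.4, damping=2.35,
                max_accel=0.85, max_speed=0.40,
                soften=0.86, decay=0.10):
    """One integration step of the damped-oscillator camera."""
    error = target - pos
    accel = error * stiffness - vel * damping
    a = np.linalg.norm(accel)
    if a > max_accel:
        accel = accel * (max_accel / a)       # acceleration limit
    vel = (vel + accel * dt) * soften * (1.0 - decay)
    s = np.linalg.norm(vel)
    if s > max_speed:
        vel = vel * (max_speed / s)           # speed limit
    return pos + vel * dt, vel

pos = np.array([0.5, 0.5])
vel = np.zeros(2)
target = np.array([0.7, 0.4])
for _ in range(200):                          # 25 seconds at 8 FPS
    pos, vel = camera_step(pos, vel, target, dt=1 / 8)
print(np.round(pos, 3))  # settles at the target instead of snapping to it
```

Running this with different stiffness/damping pairs is the fastest way to build intuition for the table below.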

6.3 Meaning of the Key Parameters

To avoid tuning blindly, it is important to understand what the parameters do.

  • follow_stiffness — how strongly the camera is pulled toward the target; increasing it gives a faster reaction, but a higher overshoot risk;

  • follow_damping — resistance to movement; increasing it gives less oscillation and a more conservative camera;

  • max_center_accel — acceleration limit; the camera cannot burst forward abruptly;

  • max_center_speed — speed limit; the camera will not fly faster than the allowed pace;

  • velocity_soften — additional velocity softening; fewer high-frequency oscillations;

  • velocity_decay — exponential velocity decay; the camera settles down faster.

A camera is not a coordinate. It is a dynamic system. That is exactly what makes it visually believable.

6.4 Predictive Lead: An Operator Does Not Look Exactly Where the Face Is Right Now

If a person moves quickly, the camera should not simply chase them. Otherwise, it will always be slightly behind.

So it is useful to add a light prediction:

# sketch: shift the aim point slightly ahead along the current motion
target_center = target_center + velocity * predictive_lead

This is a small extrapolation based on the current motion. Visually, it makes the tracking feel more “human.”

6.5 Human Lag: Paradoxically, Sometimes You Need to Slow the Camera Down

A perfectly responsive system often looks worse than a live operator.

A human camera operator does not teleport to a new point instantly. There is always a micro-delay in reaction. That is why a slight lag can actually improve perception:

# sketch: add a micro-delay by moving only part of the way toward the new aim point
target_center = prev_target + (target_center - prev_target) * (1.0 - human_lag)

This is a subtle but important point: realism is not always equal to maximum accuracy.

7. Composition: The Camera Should Not Only Track, but Frame Nicely

Even a perfectly stable camera can still produce a bad image if the composition is primitive.

If you always place the face strictly in the center, the shot quickly starts to look flat and “machine-made.” So we need to add composition heuristics on top of the physics.

7.1 Rule of Thirds and Side Bias

When there are multiple faces or the scene does not require strict centering, it is useful to shift the subject toward the rule-of-thirds lines.

if side_bias > 0.0 and not single_face_active:
    biased_target = 0.5 - side_bias if cx < 0.5 else 0.5 + side_bias
    edge_proximity = abs(cx - 0.5) * 2
    adaptive_bias = side_bias_strength * (1 - edge_proximity)
    cx = cx * (1.0 - adaptive_bias) + biased_target * adaptive_bias

What matters here is that the shift is adaptive. If the face is already near the edge, the bias weakens; otherwise, you can make the frame worse instead of better.

7.2 Single-Face Mode

If the video consistently contains one face, the camera should behave differently:

  • fewer composition experiments;

  • more stabilization;

  • more conservative speed;

  • better retention of a talking-head shot.

Parameter adaptation example:

if single_face_active and stabilization_strength > 0.0:
    effective_face_filter = min(0.995, face_filter + stabilization_strength * 0.1)
    effective_smoothing = min(0.985, smoothing + stabilization_strength * 0.08)
    effective_center_dead_zone = min(0.35, center_dead_zone + stabilization_strength * 0.08)
    effective_max_center_speed = max(0.05, max_center_speed * (1.0 - 0.25 * stabilization_strength))

These conditional tuning rules are exactly what usually separates a “working system” from an abstract algorithm.

7.3 Eye-Level Lift: Aim at the Eyes, Not at the Geometric Center

A face bounding box is not composition yet. If we aim strictly at the center of the box, we often get a frame where the camera is looking at the nose.

It is much better to raise the attention point slightly toward eye level:

cy = np.clip(cy - tracked_face.size[1] * eye_level_lift, 0.0, 1.0)

It is a small correction, but it has a strong effect on the subjective quality of the frame.

7.4 Dead Zone: Ignore Micromovements

If a person moved slightly or the detector produced micro-noise, the camera should not react to every tiny change.

delta = target_center - prev_center
dist = np.linalg.norm(delta)
if dist < effective_center_dead_zone:
    target_center = prev_center

Dead zone is one of the most underrated parameters. Without it, the camera looks nervous even with good detection.

7.5 Face Margin: Never Press the Face Against the Edge

Even if mathematics allows us to move the center anywhere, real composition needs a safety margin.

if face_margin > 0.0:
    half_crop_w = min(0.5, (target_width / sw) / (2.0 * z))
    half_crop_h = min(0.5, (target_height / sh) / (2.0 * z))
    guard_x = max(face_margin, half_crop_w)
    guard_y = max(face_margin, half_crop_h)
    cx = np.clip(cx, guard_x, 1.0 - guard_x)
    cy = np.clip(cy, guard_y, 1.0 - guard_y)

This protects the face from unpleasant trimming of ears, hair, gestures, and generally makes the shot feel more “airy.”

8. What to Do When There Are No Faces: Ken Burns Fallback

A system that can only work in “I see a face” mode breaks on any more complex video.

We need a fallback mode for when:

  • the face is lost;

  • there is no face at all;

  • the frame contains a static scene;

  • the detector failed.

This is where the classic Ken Burns effect helps — gentle panning and zooming.

8.1 Ken Burns Motion Model

The simplest version is based on sines:

# sketch: slow sinusoidal pan/tilt/zoom around the frame center
# (the exact amplitude scaling and frequencies here are illustrative)
phase = 2.0 * np.pi * t / ken_burns_period
pan = np.sin(phase) * ken_burns_pan_amplitude / 100.0
tilt = np.sin(phase * 0.7) * ken_burns_tilt_amplitude / 100.0
zoom = 1.0 + np.sin(phase * 0.5) * ken_burns_zoom_amplitude
center = np.array([0.5 + pan, 0.5 + tilt])

This gives controlled, smooth movement without random jerking.

8.2 Why Ken Burns Is Better Than a Static Center

A static crop when there are no faces looks like a bug. Ken Burns creates the impression that the system is still “holding the scene” instead of simply freezing.

For many videos, that alone is enough for the fallback not to be perceived as degradation.

8.3 Soft Transition from Face Tracking to Fallback

The transition should also be smooth.

else:
    kc, kz = _ken_burns_motion(...)
    camera_states.append(CameraState(t, kc, kz, False))
    velocity *= 0.5  # damp the accumulated camera velocity

We do not teleport into a new logic branch — we gradually damp the accumulated camera velocity.

This is exactly the kind of detail that is not always immediately visible in code, but is very visible in the final result.

9. From Discrete States to a Continuous Camera Path

At the analysis stage, we compute camera states at, for example, 8 FPS. But the final video may be 30 FPS or 60 FPS.

If we apply the camera path as-is, the motion will be step-like. So we need a continuous path via interpolation.

9.1 State Interpolation

The basic option is linear interpolation between neighboring points:

# sketch: linear interpolation between the two neighboring camera states
alpha = (t - t0) / (t1 - t0)
center = center0 * (1.0 - alpha) + center1 * alpha
zoom = zoom0 * (1.0 - alpha) + zoom1 * alpha

That is enough because the physical model itself already makes the trajectory sufficiently smooth.

9.2 Applying the Virtual Camera

Once we have a path(t) function, the rest is pretty straightforward:

  1. get the source frame;

  2. take the center and zoom from path(t);

  3. compute the ROI;

  4. cut out the region;

  5. scale it to the final 9:16 size.

At this point, the algorithm turns from an analytical model into a real video.
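These steps map to a short NumPy sketch (the normalized center and zoom are assumptions about the path representation; in a real pipeline the final resize would be done with cv2.resize):

```python
import numpy as np

def apply_virtual_camera(frame, center, zoom, out_aspect=9 / 16):
    """Crop a zoomed 9:16 window around a normalized center point."""
    h, w = frame.shape[:2]
    crop_h = int(h / zoom)                       # zoom > 1 narrows the window
    crop_w = int(min(w, crop_h * out_aspect))
    cx, cy = int(center[0] * w), int(center[1] * h)
    # keep the ROI fully inside the frame
    x0 = np.clip(cx - crop_w // 2, 0, w - crop_w)
    y0 = np.clip(cy - crop_h // 2, 0, h - crop_h)
    roi = frame[y0:y0 + crop_h, x0:x0 + crop_w]
    return roi  # then scale to the final output size, e.g. 1080×1920

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
roi = apply_virtual_camera(frame, center=(0.6, 0.45), zoom=1.1)
print(roi.shape)
```

Note the clipping of the ROI: even if the interpolated path asks for a center near the edge, the crop never leaves the source frame.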

10. Working Parameter Profile: Operator Mode

The most interesting thing in such systems is not just the formulas, but how they are actually tuned in practice.

Below is a profile oriented toward a “human operator” for talking-head and similar scenarios:

analysis_fps: 8
min_face_ratio: 0.06
min_face_confidence: 0.75
match_tolerance: 0.18
max_miss_time: 2.7
smoothing: 0.93
stabilization_strength: 0.26
follow_stiffness: 8.4
follow_damping: 2.35
max_center_accel: 0.85
predictive_lead: 0.065
human_lag: 0.02
velocity_soften: 0.86
velocity_decay: 0.10
face_filter: 0.80
face_margin: 0.085
side_bias: 0.22
side_bias_strength: 0.37
center_dead_zone: 0.052
max_center_speed: 0.40
eye_level_lift: 0.10
zoom_smoothing: 0.75
ken_burns_period: 12
ken_burns_pan_amplitude: 4.0
ken_burns_tilt_amplitude: 2.0
ken_burns_zoom_amplitude: 0.0

This configuration does not claim to be universal, but it illustrates an important idea well: the quality here is born from a balance of parameters, not from one magical neural network.

10.1 Profiles for Different Styles

  • Static / robotic — stiffness 2.0, damping 5.0, max speed 0.05: almost no movement;

  • Operator mode — stiffness 8.4, damping 2.35, max speed 0.40: lively but controlled movement;

  • Action — stiffness 15.0, damping 1.2, max speed 0.80: fast reaction;

  • Cinematic — stiffness 4.0, damping 4.0, max speed 0.15: slow and soft.

This is a convenient way to think not in terms of separate numbers, but in terms of camera behavior profiles.

11. Practical Edge Cases

Any article about an algorithm feels incomplete if it does not explain where it breaks and how to fix it.

11.1 Face at an Angle or in Profile

If a person turns into profile, the detector may lose the track. What helps:

  • increase max_miss_time;

  • loosen match_tolerance;

  • connect YuNet as an additional detector;

  • lower min_face_confidence if input quality is unstable.

11.2 Anime and Non-Photorealistic Faces

If the model is trained on photographs while the input is anime, the problem is not “bad code,” but domain mismatch.

Practical options:

  • loosen the thresholds;

  • reduce min_face_ratio;

  • use an alternative backend;

  • if needed, switch to a specialized detector.

11.3 Multiple Faces in the Frame

When there are two people in the frame, an overly aggressive single-face mode will only make things worse.

Then it is better to do this:

# sketch: with several stable faces, drop the single-face stabilization boost
# and aim between the speakers instead of locking onto one face
if len(dets) >= 2:
    single_face_active = False
    centers = np.array([d.center for d in dets])
    weights = np.array([d.score for d in dets])
    target_center = (centers * weights[:, None]).sum(axis=0) / weights.sum()

And let composition hold two speakers more naturally.

11.4 Fast Motion

For sharp movement, an adaptive boost is useful:

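The original snippet is not reproduced here, but the idea can be sketched as follows (the threshold and scaling factors are illustrative):

```python
def boosted_params(target_speed, follow_stiffness=8.4, max_center_speed=0.40,
                   boost_threshold=0.15, boost_gain=2.0):
    """Temporarily raise stiffness and the speed limit when the target moves fast."""
    if target_speed > boost_threshold:
        # scale the boost with how far the motion exceeds the threshold
        factor = min(boost_gain, target_speed / boost_threshold)
        return follow_stiffness * factor, max_center_speed * factor
    return follow_stiffness, max_center_speed

print(boosted_params(0.05))  # calm scene: default parameters
print(boosted_params(0.45))  # energetic motion: boosted limits
```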
This gives the camera a chance to “wake up” when motion becomes energetic, without making the whole system permanently hyperactive.

12. Why This Architecture Looks Good

In my opinion, the engineering value of this solution lies not in separate formulas, but in the overall system approach.

What matters here:

  • Do not rely on one ideal component. Use a fallback architecture instead.

  • Do not trust raw data. Detection goes through stabilization.

  • Model dynamics explicitly. The camera is a physical system, not a pile of ifs.

  • Account for human perception. Human lag, eye-level lift, dead zone, composition rules.

  • Design degradation. When there are no faces, the system still produces a reasonable result.

  • Think not only about accuracy, but also about subjective quality.

13. Performance and Computational Cost

For completeness, it is worth estimating the pipeline cost too.

Let:

  • N be the number of analyzed frames;

  • D be the face detection time per frame;

  • P be the tracking and camera physics time.

Then:

T_analysis ≈ N · (D + P)

For a 5-minute video at analysis_fps = 8, we get:

N = 300 × 8 = 2400

Then we also add virtual camera application on every output frame:

T_apply ≈ M · R

Where:

  • M is the number of output frames;

  • R is the cost of crop + resize.

In a real pipeline, this is quite practical, especially if the analysis runs on a downscaled copy of the video while the final crop is applied to the original.
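As a quick sanity check, with assumed per-frame costs (say, D ≈ 15 ms for detection and P ≈ 1 ms for tracking and physics — illustrative numbers, not measurements), the analysis budget is easy to estimate:

```python
analysis_fps = 8
duration_s = 5 * 60                 # a 5-minute video
N = duration_s * analysis_fps       # analyzed frames
D, P = 0.015, 0.001                 # assumed seconds per frame (illustrative)

T_analysis = N * (D + P)
print(N, round(T_analysis, 1))      # 2400 frames, roughly 38.4 s of analysis
```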

13.1 How to Reduce Analysis Cost

The three most practical steps:

  • reduce resolution at the detection stage;

  • reduce analysis_fps if the scene is slow;

  • do not keep all fallback backends active if the main one works stably.

These are boring optimizations, but they usually give the best ROI.

14. Final Result

If we put everything together, the algorithm looks like this:

detect faces → track the target across frames → clamp and smooth the coordinates → run the camera physics → apply composition rules → fall back to Ken Burns when no face is found → interpolate the path → crop and scale every output frame

In practice, this gives a system that:

  • does not jitter on detector noise;

  • does not look robotic;

  • degrades gracefully;

  • handles talking-head scenes and regular videos much better than a simple “center crop.”

Link to the original article: https://habr.com/ru/articles/1022298/