Wildlife Tracking Uruguay: Solving the Tail-Wagging Paradox (Part 3)
Quick Recap
In Part 1, we introduced the 3-stage wildlife tracking pipeline. In Part 2, we calibrated MegaDetector to achieve 2x speedup + 15% recall gain by systematically testing 27 parameter configurations.
Now we tackle Stage 2: ByteTrack for Wildlife - the most challenging technical problem in the entire pipeline.
Stage 2: ByteTrack for Wildlife – The Tail-Wagging Paradox
Chapter Summary: Adapting ByteTrack to non-rigid wildlife required ultra-permissive IoU thresholds, longer memory, and strict track creation to avoid fragmentation.
Results Preview:
- Tracks per video: 4.2 → 1.2
- Key tweak: match_thresh=0.35 with a 2.5 s track buffer
- Outcome: Continuous animal IDs with minimal false merges
After MegaDetector gave me bounding boxes for each frame, I faced a new challenge: which detections belong to the same animal?
This is the tracking problem. If a capybara appears in frames 10, 15, 20, and 25, I need to recognize that these four detections represent one animal traversing the scene—not four independent sightings. Without tracking, I’d classify the same animal hundreds of times and lose all sense of individual identity.
I chose ByteTrack as my tracking algorithm. Originally designed for pedestrian tracking in crowded urban scenes, ByteTrack uses a clever two-pass approach:
- HIGH pass: Match high-confidence detections to existing tracks
- LOW pass: Recover lost tracks using lower-confidence detections
The intuition: sometimes an animal gets partially occluded or moves quickly, causing a brief confidence drop. Rather than losing the track, ByteTrack uses those LOW-confidence detections to bridge gaps.
Sounds great in theory. In practice, wildlife broke it spectacularly.
The Tail-Wagging Paradox
Here’s what happened when I ran ByteTrack with “textbook” parameters on this video of a cow:
Expected: 1 track (maybe 2). Actual: 4 tracks (!)
The algorithm was fragmenting a single animal into multiple IDs. After staring at the visualized tracks, I realized the problem:
Animals aren’t rigid objects.
ByteTrack (and most tracking algorithms) assumes pedestrians are roughly rigid: a person's bounding box doesn't change shape dramatically from frame to frame. But animals in camera traps constantly wag their tails, shift their heads, and change pose from standing to grazing. Each of these movements shrinks, stretches, or shifts the bounding box.
Traditional tracking uses IoU (Intersection over Union) to match detections across frames:
IoU = Area of Overlap / Area of Union
If IoU > threshold (typically 0.6), the detection “matches” an existing track. But here’s the paradox:
A cow wagging its tail can drop IoU below 0.6 even though it’s the same animal.
The bounding box shifts position and shape due to biological motion. The algorithm sees this as a “new object” and creates a new track. Hence: 4 fragmented tracks for 1 cow.
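To make this concrete, here is a toy calculation with made-up box coordinates (not taken from the dataset): the box shifts slightly and stretches as the cow lowers its head to graze, and the frame-to-frame IoU lands well under the 0.6 threshold.
IoU worked example (Python)
# Hypothetical boxes in [x1, y1, x2, y2] format -- illustrative values only
standing = [100, 100, 200, 180]   # frame t
grazing  = [120, 110, 230, 200]   # frame t+1: head down, body stretched
ix = max(0, min(standing[2], grazing[2]) - max(standing[0], grazing[0]))   # 80
iy = max(0, min(standing[3], grazing[3]) - max(standing[1], grazing[1]))   # 70
intersection = ix * iy                                                     # 5600
area_a = (standing[2] - standing[0]) * (standing[3] - standing[1])         # 8000
area_b = (grazing[2] - grazing[0]) * (grazing[3] - grazing[1])             # 9900
iou = intersection / (area_a + area_b - intersection)
print(f"IoU = {iou:.2f}")   # IoU = 0.46 -> below 0.6, so a new track is spawned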
Experiment 002: Tracking Parameter Tuning
I couldn’t accept fragmented tracks—they’d ruin the downstream pipeline. So I designed Experiment 002 to systematically tune ByteTrack for wildlife.
Test Setup
- Dataset: 16 videos with diverse species (cow, margay, capybara, bird, etc.)
- Metric: Average tracks per video (target: ≈1 for single-animal scenes; a computation sketch follows this list)
- Goal: Minimize fragmentation while avoiding false merges
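The fragmentation metric itself is easy to compute. Below is a minimal sketch, assuming one tracking JSON file per video with a top-level "tracks" list; the exact schema written to data/tracking_json may differ.
Tracks-per-video metric sketch (Python)
import json
from pathlib import Path

def avg_tracks_per_video(tracking_dir):
    """Average number of tracks across all per-video tracking JSON files."""
    counts = [len(json.loads(p.read_text())["tracks"])
              for p in Path(tracking_dir).glob("*.json")]
    return sum(counts) / len(counts) if counts else 0.0

print(f"{avg_tracks_per_video('data/tracking_json'):.1f} tracks/video")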
| Iteration | Config Highlights | Tracks / Video | Takeaway |
|---|---|---|---|
| Baseline | match_thresh=0.60, buffer 0.5 s | 4.2 | Pedestrian defaults shatter animal tracks |
| Conservative | match_thresh=0.50, buffer 1.5 s | 2.1 | Better, but tail motion still breaks continuity |
| Ultra-conservative | match_thresh=0.35, buffer 2.5 s | 1.2 | Stable IDs with minimal merges |
Iteration 1: Textbook Config (Baseline)
Baseline YAML config
tracking:
  track_thresh: 0.60     # HIGH confidence for new tracks
  det_thresh: 0.40       # LOW confidence threshold
  match_thresh: 0.60     # IoU for matching
  track_buffer_s: 0.5    # Memory for lost tracks
  min_track_len: 3       # Minimum frames to output
Result: 4.2 tracks per video (severe fragmentation)
Analysis: The match_thresh=0.6 IoU was too strict. Biological motion was breaking associations.
Iteration 2: Conservative Config
I tried loosening the IoU threshold and extending the track buffer:
Conservative YAML config
tracking:
  track_thresh: 0.65
  det_thresh: 0.35
  match_thresh: 0.50     # More permissive IoU
  track_buffer_s: 1.5    # Longer memory
  min_track_len: 5
Result: 2.1 tracks per video (better, but still fragmenting)
Analysis: Improvement, but still seeing multi-track splits on cows and birds. The issue: even IoU=0.5 is too strict for extreme pose changes.
Iteration 3: Ultra-Conservative Config (Final)
I went aggressive—prioritizing track consolidation over avoiding false merges:
Final production YAML config
tracking:
  track_thresh: 0.70     # Stricter new track creation
  det_thresh: 0.25       # Very permissive LOW recovery
  match_thresh: 0.35     # Ultra-permissive IoU
  track_buffer_s: 2.5    # Long memory (2.5 seconds)
  min_track_len: 8       # Filter spurious tracks
  nms_iou: 0.85          # Aggressive per-frame deduplication
Result: 1.2 tracks per video ✓
Success! This config consolidates nearly all single-animal scenes into 1 track, with minimal false merges.
Parameter Rationale
Let me unpack each choice:
1. track_thresh=0.70 (high)
- Creates new tracks ONLY from very confident detections
- Prevents spurious detections from spawning junk tracks
- Forces the system to be conservative about “new animals”
2. det_thresh=0.25 (low)
- Allows LOW-confidence detections to recover lost tracks
- Critical for maintaining continuity during occlusion or motion blur
- These detections never create new tracks, only extend existing ones
3. match_thresh=0.35 (ultra-permissive)
- The key to solving tail-wagging
- Accepts IoU as low as 0.35 for track-detection association
- Rationale: better to over-associate (merge what might be 2 animals) than under-associate (fragment 1 animal into many)
- Camera trap assumption: most videos have 0-1 animals, so false merges are rare
4. track_buffer_s=2.5 (long memory)
- Keeps “lost” tracks alive for 2.5 seconds
- Handles temporary occlusions (animal behind grass, tree)
- At ~15 effective FPS (frame_stride=2 @ 30fps), this is ~37 frames of buffer
5. min_track_len=8 (aggressive filtering)
- Spurious detections (wind-blown grass, shadows) rarely persist >8 frames
- Legitimate animals almost always appear for >8 frames
- This filter cleans up noise without losing real animals
6. nms_iou=0.85 (aggressive NMS)
- Per-frame Non-Maximum Suppression removes duplicate detections
- Prevents the tracker from seeing “2 overlapping detections” and creating 2 tracks
- Especially important for large animals where MegaDetector might output multiple boxes (a minimal sketch of this step follows the list)
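For reference, here is a minimal greedy version of that per-frame NMS step. It is a sketch rather than the pipeline's exact implementation, and it assumes detection objects with .bbox and .conf attributes plus the calculate_iou() helper shown later in this post.
Per-frame NMS sketch (Python)
def nms(detections, iou_thresh=0.85):
    """Keep the highest-confidence box; drop others that overlap a kept box beyond iou_thresh."""
    kept = []
    for det in sorted(detections, key=lambda d: d.conf, reverse=True):
        if all(calculate_iou(det.bbox, k.bbox) <= iou_thresh for k in kept):
            kept.append(det)
    return kept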
The ByteTrack Algorithm: A Deep Dive
Let me show you how ByteTrack’s two-pass algorithm works, adapted for wildlife.
Core Idea: HIGH vs LOW Detections
Tracking update loop (Python, simplified)
# From scripts/20_run_tracking.py (simplified)
def track_frame(detections, tracks, config):
    """
    Two-pass ByteTrack: HIGH creates tracks, LOW recovers them.
    """
    # Step 1: Split detections by confidence
    high_dets = [d for d in detections if d.conf > config['track_thresh']]
    low_dets = [d for d in detections
                if config['det_thresh'] < d.conf <= config['track_thresh']]
    # Step 2: HIGH pass - match to active tracks
    active_tracks = [t for t in tracks if t.state == 'active']
    matched_high, unmatched_high = hungarian_match(
        high_dets,
        active_tracks,
        iou_thresh=config['match_thresh']
    )
    # Update matched tracks
    for track, det in matched_high:
        track.update(det, source='HIGH')
    # Tracks with no HIGH match this frame go (or stay) lost and age by one frame
    matched_ids = {id(t) for t, _ in matched_high}
    for track in tracks:
        if track.state != 'finished' and id(track) not in matched_ids:
            track.mark_lost()
    # Create new tracks from unmatched HIGH
    for det in unmatched_high:
        tracks.append(Track(det))
    # Step 3: LOW pass - recover lost tracks ONLY
    lost_tracks = [t for t in tracks if t.state == 'lost']
    matched_low, _ = hungarian_match(
        low_dets,
        lost_tracks,
        iou_thresh=config['match_thresh']
    )
    # Recover lost tracks (no new track creation!)
    for track, det in matched_low:
        track.update(det, source='LOW')
    # Step 4: Age out dead tracks (track_buffer_s converted to frames upstream)
    for track in tracks:
        if track.frames_since_update > config['track_buffer_frames']:
            track.state = 'finished'
    return tracks
Hungarian Assignment
ByteTrack uses the Hungarian algorithm for optimal bipartite matching. Here’s the IoU cost matrix approach:
Hungarian matching helper (Python)
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(detections, tracks, iou_thresh):
    """
    Optimal assignment of detections to tracks via Hungarian algorithm.
    """
    if not detections or not tracks:
        return [], detections
    # Build cost matrix: 1 - IoU (lower is better)
    cost_matrix = np.zeros((len(detections), len(tracks)))
    for i, det in enumerate(detections):
        for j, track in enumerate(tracks):
            iou = calculate_iou(det.bbox, track.predict_bbox())
            cost_matrix[i, j] = 1 - iou  # Convert similarity to cost
    # Solve assignment problem
    det_indices, track_indices = linear_sum_assignment(cost_matrix)
    # Filter by IoU threshold
    matched = []
    unmatched_dets = set(range(len(detections)))
    for det_idx, track_idx in zip(det_indices, track_indices):
        iou = 1 - cost_matrix[det_idx, track_idx]
        if iou >= iou_thresh:
            matched.append((tracks[track_idx], detections[det_idx]))
            unmatched_dets.discard(det_idx)
    unmatched = [detections[i] for i in unmatched_dets]
    return matched, unmatched
Why Hungarian over greedy?
- Greedy matching can get stuck in local optima (match A→1, then B can't match optimally; see the toy example after this list)
- Hungarian guarantees globally optimal assignment in O(n³) time
- For camera traps (typically 1-3 animals), n is tiny, so it’s fast
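Here is a toy illustration with made-up IoU values (not from the dataset): if detection 0 greedily grabs track 0, detection 1 is left with an IoU of 0.10 and goes unmatched, while the Hungarian solution keeps both associations above a 0.35 threshold.
Greedy vs Hungarian toy example (Python)
import numpy as np
from scipy.optimize import linear_sum_assignment

# iou[i, j] = IoU between detection i and track j (illustrative values)
iou = np.array([[0.60, 0.55],
                [0.58, 0.10]])
rows, cols = linear_sum_assignment(1 - iou)   # minimize total cost (1 - IoU)
for det_idx, trk_idx in zip(rows, cols):
    print(f"detection {det_idx} -> track {trk_idx} (IoU {iou[det_idx, trk_idx]:.2f})")
# detection 0 -> track 1 (IoU 0.55)
# detection 1 -> track 0 (IoU 0.58)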
IoU Calculation
IoU helper (Python)
def calculate_iou(box1, box2):
    """
    Intersection over Union for two bounding boxes.
    box format: [x1, y1, x2, y2]
    """
    # Intersection coordinates
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])
    # Intersection area
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    # Union area
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection
    return intersection / union if union > 0 else 0
The Track Class: State Management
Each track maintains its own state, history, and metadata:
Track state container
import numpy as np

class Track:
    """
    Persistent track for a single animal across frames.
    """
    def __init__(self, detection):
        self.track_id = generate_id()
        self.frames = [detection.frame_num]
        self.boxes = [detection.bbox]
        self.confs = [detection.conf]
        self.sources = ['HIGH']  # Track detection source (HIGH/LOW)
        self.state = 'active'
        self.frames_since_update = 0

    def update(self, detection, source='HIGH'):
        """Update track with new detection."""
        self.frames.append(detection.frame_num)
        self.boxes.append(detection.bbox)
        self.confs.append(detection.conf)
        self.sources.append(source)
        self.state = 'active'
        self.frames_since_update = 0

    def predict_bbox(self):
        """
        Naive motion prediction: use last known bbox.
        (A Kalman filter would be better, but this works for static cameras)
        """
        return self.boxes[-1]

    def mark_lost(self):
        """Mark track as lost (but keep in buffer)."""
        self.state = 'lost'
        self.frames_since_update += 1

    def get_representative_frame(self):
        """Return frame index with highest confidence."""
        max_conf_idx = np.argmax(self.confs)
        return self.frames[max_conf_idx]
Key State Transitions:
active → lost (no match for 1 frame)
lost → active (LOW recovery succeeds)
lost → finished (exceeds track_buffer)
Lessons from the Tail-Wagging Paradox
This tracking challenge taught me several valuable lessons:
Domain Matters More Than Algorithm
ByteTrack is excellent for pedestrians in crowded urban scenes. But wildlife has fundamentally different motion patterns:
- Pedestrians: Rigid, predictable trajectories
- Wildlife: Biological motion (tails, wings, pose changes), stationary behavior (grazing, resting)
I couldn’t blindly use “state-of-the-art” tracking—I had to adapt it to the domain.
Tune for Your Failure Mode
In camera traps, the dominant failure mode is false splits (fragmenting 1 animal into many tracks), not false merges (combining 2 animals into 1 track). Why?
- Most videos have 0-1 animals (sparse scenes)
- False merges are rare when animals are sparse
- False splits destroy downstream data quality (duplicate the same animal)
So I tuned aggressively to avoid splits, even at the cost of occasional merges. This is the opposite of what you’d do in crowded pedestrian tracking!
Ultra-Permissive IoU Was Counterintuitive
My AI assistant (Claude Code) initially suggested match_thresh=0.4. I experimented and found 0.35 worked even better. This felt too permissive compared to literature (which uses 0.5-0.7), but the data validated it.
Lesson: Don’t be afraid to go outside published ranges if your domain justifies it.
LOW-Confidence Recovery Is Critical
21% of detections were LOW-confidence (between 0.25 and 0.70). Without the LOW pass, these would be discarded, causing track fragmentation whenever confidence dipped due to:
- Motion blur (fast head turn)
- Partial occlusion (behind grass)
- Lighting changes (shadow crossing the animal)
ByteTrack’s two-pass design is brilliant for wildlife precisely because of this recovery mechanism.
What About Kalman Filters?
Sharp readers will notice: my predict_bbox() function just returns the last known bbox. No motion model!
Why not use a Kalman filter? (a standard technique in tracking)
I chose to start simple. The config even has use_kalman: false. Here’s my reasoning:
- Camera traps are static: No camera motion to model (unlike vehicle-mounted cameras)
- Animals are non-linear: A grazing cow doesn’t follow linear motion—it wanders, stops, changes direction unpredictably
- Frame stride gaps: With stride=2, I’m skipping frames. Velocity estimates between sampled frames would be noisy
- “Start simple” philosophy: Get the baseline working first, add complexity only if needed
Kalman filtering assumes roughly linear motion between frames. Wildlife violates this assumption constantly. The naive “last known position” baseline turned out to be good enough—1.2 tracks/video without motion prediction.
Future work: If I see track fragmentation on fast-moving animals (birds in flight, fleeing deer), a Kalman filter or learned motion model might help. But for the current dataset (mostly stationary/slow animals), it wasn’t necessary.
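If fast movers ever force the issue, even a constant-velocity extrapolation would be a smaller step than a full Kalman filter. The sketch below is hypothetical and is not part of the current pipeline, which simply reuses the last known box.
Constant-velocity prediction sketch (Python)
def predict_bbox_constant_velocity(boxes):
    """Extrapolate the next box from the displacement between the last two boxes."""
    if len(boxes) < 2:
        return boxes[-1]
    prev, last = boxes[-2], boxes[-1]
    velocity = [l - p for l, p in zip(last, prev)]    # per-coordinate displacement
    return [l + v for l, v in zip(last, velocity)]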
Reproducibility: Experiment 002 Artifacts
All tracking experiments are archived in experiments/exp_002_tracking/:
- metrics/tracking_metrics.csv - Raw metrics for all 3 configs × 16 videos
- reports/evaluation_report.md - Comprehensive analysis with species breakdown
- Visualization plots - Track overlays, recovery effectiveness charts
You can reproduce this experiment:
# Run tracking with a specific config
python scripts/20_run_tracking.py \
--config config/pipeline.yaml \
--md-json data/md_json \
--video-root data/dataset-v1 \
--output data/tracking_json
# Evaluate tracking performance
python scripts/22_evaluate_tracking.py \
--config config/pipeline.yaml \
--tracks-json data/tracking_json \
--output experiments/exp_002_tracking
See scripts/22_evaluate_tracking.py for the evaluation script.
Series Navigation
This is Part 3 of 5 in the Wildlife Tracking Uruguay series.
- Part 1: Overview & Introduction
- Part 2: MegaDetector Calibration
- Part 3: The Tail-Wagging Paradox (You are here)
- Part 4: Weak Supervision & Training
- Part 5: Results & Impact
Previous: Part 2: MegaDetector Calibration
Next: Part 4: Weak Supervision & Training - Learn how I reduced labeling time from 76 hours to 3 hours using video-level weak supervision and quality guardrails to auto-label 912 crops.