Wildlife Tracking Uruguay: Solving the Tail-Wagging Paradox (Part 3)
Quick Recap
In Part 1, we introduced the 3-stage wildlife tracking pipeline. In Part 2, we calibrated MegaDetector to achieve 2x speedup + 15% recall gain by systematically testing 27 parameter configurations.
Now we tackle Stage 2: ByteTrack for Wildlife - the most challenging technical problem in the entire pipeline.
Stage 2: ByteTrack for Wildlife – The Tail-Wagging Paradox
Chapter Summary: Adapting ByteTrack to non-rigid wildlife required ultra-permissive IoU thresholds, longer memory, and strict track creation to avoid fragmentation.
Results Preview:
- Tracks per video: 4.2 → 1.2
- Key tweak: match_thresh=0.35 with a 2.5 s track buffer
- Outcome: Continuous animal IDs with minimal false merges
After MegaDetector gave me bounding boxes for each frame, I faced a new challenge: which detections belong to the same animal?
This is the tracking problem. If a capybara appears in frames 10, 15, 20, and 25, I need to recognize that these four detections represent one animal traversing the scene—not four independent sightings. Without tracking, I’d classify the same animal hundreds of times and lose all sense of individual identity.
I chose ByteTrack as my tracking algorithm. Originally designed for pedestrian tracking in crowded urban scenes, ByteTrack uses a clever two-pass approach:
- HIGH pass: Match high-confidence detections to existing tracks
- LOW pass: Recover lost tracks using lower-confidence detections
The intuition: sometimes an animal gets partially occluded or moves quickly, causing a brief confidence drop. Rather than losing the track, ByteTrack uses those LOW-confidence detections to bridge gaps.
Sounds great in theory. In practice, wildlife broke it spectacularly.
The Tail-Wagging Paradox
Here’s what happened when I ran ByteTrack with “textbook” parameters on this video of a cow:
Expected: 1 track (maybe 2). Actual: 4 tracks (!)
The algorithm was fragmenting a single animal into multiple IDs. After staring at the visualized tracks, I realized the problem:
Animals aren’t rigid objects.
ByteTrack (and most tracking algorithms) assumes pedestrians are roughly rigid: a person's bounding box doesn't change shape dramatically from frame to frame. But animals in camera traps constantly wag their tails, shift their heads, and change pose from standing to grazing. Each of these movements shrinks, stretches, or shifts the bounding box.
Traditional tracking uses IoU (Intersection over Union) to match detections across frames:
IoU = Area of Overlap / Area of Union
If IoU > threshold (typically 0.6), the detection “matches” an existing track. But here’s the paradox:
A cow wagging its tail can drop IoU below 0.6 even though it’s the same animal.
The bounding box shifts position and shape due to biological motion. The algorithm sees this as a “new object” and creates a new track. Hence: 4 fragmented tracks for 1 cow.
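To make this concrete, here is a toy calculation with made-up box coordinates (not taken from the dataset): the box shifts slightly and stretches as the cow lowers its head to graze, and the frame-to-frame IoU lands well under the 0.6 threshold.
IoU worked example (Python)
# Hypothetical boxes in [x1, y1, x2, y2] format -- illustrative values only
standing = [100, 100, 200, 180]   # frame t
grazing  = [120, 110, 230, 200]   # frame t+1: head down, body stretched
ix = max(0, min(standing[2], grazing[2]) - max(standing[0], grazing[0]))   # 80
iy = max(0, min(standing[3], grazing[3]) - max(standing[1], grazing[1]))   # 70
intersection = ix * iy                                                     # 5600
area_a = (standing[2] - standing[0]) * (standing[3] - standing[1])         # 8000
area_b = (grazing[2] - grazing[0]) * (grazing[3] - grazing[1])             # 9900
iou = intersection / (area_a + area_b - intersection)
print(f"IoU = {iou:.2f}")   # IoU = 0.46 -> below 0.6, so a new track is spawned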
Experiment 002: Tracking Parameter Tuning
I couldn’t accept fragmented tracks—they’d ruin the downstream pipeline. So I designed Experiment 002 to systematically tune ByteTrack for wildlife.
Test Setup
- Dataset: 16 videos with diverse species (cow, margay, capybara, bird, etc.)
- Metric: Average tracks per video (target: ≈1 for single-animal scenes; a computation sketch follows this list)
- Goal: Minimize fragmentation while avoiding false merges
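The fragmentation metric itself is easy to compute. Below is a minimal sketch, assuming one tracking JSON file per video with a top-level "tracks" list; the exact schema written to data/tracking_json may differ.
Tracks-per-video metric sketch (Python)
import json
from pathlib import Path

def avg_tracks_per_video(tracking_dir):
    """Average number of tracks across all per-video tracking JSON files."""
    counts = [len(json.loads(p.read_text())["tracks"])
              for p in Path(tracking_dir).glob("*.json")]
    return sum(counts) / len(counts) if counts else 0.0

print(f"{avg_tracks_per_video('data/tracking_json'):.1f} tracks/video")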
| Iteration | Config Highlights | Tracks / Video | Takeaway |
|---|---|---|---|
| Baseline | match_thresh=0.60, buffer 0.5 s | 4.2 | Pedestrian defaults shatter animal tracks |
| Conservative | match_thresh=0.50, buffer 1.5 s | 2.1 | Better, but tail motion still breaks continuity |
| Ultra-conservative | match_thresh=0.35, buffer 2.5 s | 1.2 | Stable IDs with minimal merges |
Iteration 1: Textbook Config (Baseline)
Baseline YAML config
tracking:
  track_thresh: 0.60     # HIGH confidence for new tracks
  det_thresh: 0.40       # LOW confidence threshold
  match_thresh: 0.60     # IoU for matching
  track_buffer_s: 0.5    # Memory for lost tracks
  min_track_len: 3       # Minimum frames to output
Result: 4.2 tracks per video (severe fragmentation)
Analysis: The match_thresh=0.6 IoU was too strict. Biological motion was breaking associations.
Iteration 2: Conservative Config
I tried loosening the IoU threshold and extending the track buffer:
Conservative YAML config
tracking:
  track_thresh: 0.65
  det_thresh: 0.35
  match_thresh: 0.50     # More permissive IoU
  track_buffer_s: 1.5    # Longer memory
  min_track_len: 5
Result: 2.1 tracks per video (better, but still fragmenting)
Analysis: Improvement, but still seeing multi-track splits on cows and birds. The issue: even IoU=0.5 is too strict for extreme pose changes.
Iteration 3: Ultra-Conservative Config (Final)
I went aggressive—prioritizing track consolidation over avoiding false merges:
Final production YAML config
tracking:
  track_thresh: 0.70     # Stricter new track creation
  det_thresh: 0.25       # Very permissive LOW recovery
  match_thresh: 0.35     # Ultra-permissive IoU
  track_buffer_s: 2.5    # Long memory (2.5 seconds)
  min_track_len: 8       # Filter spurious tracks
  nms_iou: 0.85          # Aggressive per-frame deduplication
Result: 1.2 tracks per video ✓
Success! This config consolidates nearly all single-animal scenes into 1 track, with minimal false merges.
Parameter Rationale
Let me unpack each choice:
1. track_thresh=0.70 (high)
- Creates new tracks ONLY from very confident detections
- Prevents spurious detections from spawning junk tracks
- Forces the system to be conservative about “new animals”
2. det_thresh=0.25 (low)
- Allows LOW-confidence detections to recover lost tracks
- Critical for maintaining continuity during occlusion or motion blur
- These detections never create new tracks, only extend existing ones
3. match_thresh=0.35 (ultra-permissive)
- The key to solving tail-wagging
- Accepts IoU as low as 0.35 for track-detection association
- Rationale: better to over-associate (merge what might be 2 animals) than under-associate (fragment 1 animal into many)
- Camera trap assumption: most videos have 0-1 animals, so false merges are rare
4. track_buffer_s=2.5 (long memory)
- Keeps “lost” tracks alive for 2.5 seconds
- Handles temporary occlusions (animal behind grass, tree)
- At ~15 effective FPS (frame_stride=2 @ 30fps), this is ~37 frames of buffer
5. min_track_len=8 (aggressive filtering)
- Spurious detections (wind-blown grass, shadows) rarely persist >8 frames
- Legitimate animals almost always appear for >8 frames
- This filter cleans up noise without losing real animals
6. nms_iou=0.85 (aggressive NMS)
- Per-frame Non-Maximum Suppression removes duplicate detections
- Prevents the tracker from seeing “2 overlapping detections” and creating 2 tracks
- Especially important for large animals where MegaDetector might output multiple boxes (a minimal sketch of this step follows the list)
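For reference, here is a minimal greedy version of that per-frame NMS step. It is a sketch rather than the pipeline's exact implementation, and it assumes detection objects with .bbox and .conf attributes plus the calculate_iou() helper shown later in this post.
Per-frame NMS sketch (Python)
def nms(detections, iou_thresh=0.85):
    """Keep the highest-confidence box; drop others that overlap a kept box beyond iou_thresh."""
    kept = []
    for det in sorted(detections, key=lambda d: d.conf, reverse=True):
        if all(calculate_iou(det.bbox, k.bbox) <= iou_thresh for k in kept):
            kept.append(det)
    return kept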
The ByteTrack Algorithm: A Deep Dive
Let me show you how ByteTrack’s two-pass algorithm works, adapted for wildlife.
Core Idea: HIGH vs LOW Detections
Tracking update loop (Python, simplified)
# From scripts/20_run_tracking.py (simplified)
def track_frame(detections, tracks, config):
    """
    Two-pass ByteTrack: HIGH creates tracks, LOW recovers them.
    """
    # Step 1: Split detections by confidence
    high_dets = [d for d in detections if d.conf > config['track_thresh']]
    low_dets = [d for d in detections
                if config['det_thresh'] < d.conf <= config['track_thresh']]
    # Step 2: HIGH pass - match to active tracks
    active_tracks = [t for t in tracks if t.state == 'active']
    matched_high, unmatched_high = hungarian_match(
        high_dets,
        active_tracks,
        iou_thresh=config['match_thresh']
    )
    # Update matched tracks
    for track, det in matched_high:
        track.update(det, source='HIGH')
    # Tracks with no HIGH match this frame go (or stay) lost and age by one frame
    matched_ids = {id(t) for t, _ in matched_high}
    for track in tracks:
        if track.state != 'finished' and id(track) not in matched_ids:
            track.mark_lost()
    # Create new tracks from unmatched HIGH
    for det in unmatched_high:
        tracks.append(Track(det))
    # Step 3: LOW pass - recover lost tracks ONLY
    lost_tracks = [t for t in tracks if t.state == 'lost']
    matched_low, _ = hungarian_match(
        low_dets,
        lost_tracks,
        iou_thresh=config['match_thresh']
    )
    # Recover lost tracks (no new track creation!)
    for track, det in matched_low:
        track.update(det, source='LOW')
    # Step 4: Age out dead tracks (track_buffer_s converted to frames upstream)
    for track in tracks:
        if track.frames_since_update > config['track_buffer_frames']:
            track.state = 'finished'
    return tracks
Hungarian Assignment
ByteTrack uses the Hungarian algorithm for optimal bipartite matching. Here’s the IoU cost matrix approach:
Hungarian matching helper (Python)
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(detections, tracks, iou_thresh):
    """
    Optimal assignment of detections to tracks via Hungarian algorithm.
    """
    if not detections or not tracks:
        return [], detections
    # Build cost matrix: 1 - IoU (lower is better)
    cost_matrix = np.zeros((len(detections), len(tracks)))
    for i, det in enumerate(detections):
        for j, track in enumerate(tracks):
            iou = calculate_iou(det.bbox, track.predict_bbox())
            cost_matrix[i, j] = 1 - iou  # Convert similarity to cost
    # Solve assignment problem
    det_indices, track_indices = linear_sum_assignment(cost_matrix)
    # Filter by IoU threshold
    matched = []
    unmatched_dets = set(range(len(detections)))
    for det_idx, track_idx in zip(det_indices, track_indices):
        iou = 1 - cost_matrix[det_idx, track_idx]
        if iou >= iou_thresh:
            matched.append((tracks[track_idx], detections[det_idx]))
            unmatched_dets.discard(det_idx)
    unmatched = [detections[i] for i in unmatched_dets]
    return matched, unmatched
Why Hungarian over greedy?
- Greedy matching can get stuck in local optima (match A→1, then B can't match optimally; see the toy example after this list)
- Hungarian guarantees globally optimal assignment in O(n³) time
- For camera traps (typically 1-3 animals), n is tiny, so it’s fast
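Here is a toy illustration with made-up IoU values (not from the dataset): if detection 0 greedily grabs track 0, detection 1 is left with an IoU of 0.10 and goes unmatched, while the Hungarian solution keeps both associations above a 0.35 threshold.
Greedy vs Hungarian toy example (Python)
import numpy as np
from scipy.optimize import linear_sum_assignment

# iou[i, j] = IoU between detection i and track j (illustrative values)
iou = np.array([[0.60, 0.55],
                [0.58, 0.10]])
rows, cols = linear_sum_assignment(1 - iou)   # minimize total cost (1 - IoU)
for det_idx, trk_idx in zip(rows, cols):
    print(f"detection {det_idx} -> track {trk_idx} (IoU {iou[det_idx, trk_idx]:.2f})")
# detection 0 -> track 1 (IoU 0.55)
# detection 1 -> track 0 (IoU 0.58)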
IoU Calculation
IoU helper (Python)
def calculate_iou(box1, box2):
    """
    Intersection over Union for two bounding boxes.
    box format: [x1, y1, x2, y2]
    """
    # Intersection coordinates
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])
    # Intersection area
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    # Union area
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection
    return intersection / union if union > 0 else 0
The Track Class: State Management
Each track maintains its own state, history, and metadata:
Track state container
import numpy as np

class Track:
    """
    Persistent track for a single animal across frames.
    """
    def __init__(self, detection):
        self.track_id = generate_id()
        self.frames = [detection.frame_num]
        self.boxes = [detection.bbox]
        self.confs = [detection.conf]
        self.sources = ['HIGH']  # Track detection source (HIGH/LOW)
        self.state = 'active'
        self.frames_since_update = 0

    def update(self, detection, source='HIGH'):
        """Update track with new detection."""
        self.frames.append(detection.frame_num)
        self.boxes.append(detection.bbox)
        self.confs.append(detection.conf)
        self.sources.append(source)
        self.state = 'active'
        self.frames_since_update = 0

    def predict_bbox(self):
        """
        Naive motion prediction: use last known bbox.
        (A Kalman filter would be better, but this works for static cameras)
        """
        return self.boxes[-1]

    def mark_lost(self):
        """Mark track as lost (but keep in buffer)."""
        self.state = 'lost'
        self.frames_since_update += 1

    def get_representative_frame(self):
        """Return frame index with highest confidence."""
        max_conf_idx = np.argmax(self.confs)
        return self.frames[max_conf_idx]
Key State Transitions:
active → lost (no match for 1 frame)
lost → active (LOW recovery succeeds)
lost → finished (exceeds track_buffer)
Lessons from the Tail-Wagging Paradox
This tracking challenge taught me several valuable lessons:
Domain Matters More Than Algorithm
ByteTrack is excellent for pedestrians in crowded urban scenes. But wildlife has fundamentally different motion patterns:
- Pedestrians: Rigid, predictable trajectories
- Wildlife: Biological motion (tails, wings, pose changes), stationary behavior (grazing, resting)
I couldn’t blindly use “state-of-the-art” tracking—I had to adapt it to the domain.
Tune for Your Failure Mode
In camera traps, the dominant failure mode is false splits (fragmenting 1 animal into many tracks), not false merges (combining 2 animals into 1 track). Why?
- Most videos have 0-1 animals (sparse scenes)
- False merges are rare when animals are sparse
- False splits destroy downstream data quality (duplicate the same animal)
So I tuned aggressively to avoid splits, even at the cost of occasional merges. This is the opposite of what you’d do in crowded pedestrian tracking!
Ultra-Permissive IoU Was Counterintuitive
My AI assistant (Claude Code) initially suggested match_thresh=0.4. I experimented and found 0.35 worked even better. This felt too permissive compared to literature (which uses 0.5-0.7), but the data validated it.
Lesson: Don’t be afraid to go outside published ranges if your domain justifies it.
LOW-Confidence Recovery Is Critical
21% of detections were LOW-confidence (between 0.25 and 0.70). Without the LOW pass, these would be discarded, causing track fragmentation whenever confidence dipped due to:
- Motion blur (fast head turn)
- Partial occlusion (behind grass)
- Lighting changes (shadow crossing the animal)
ByteTrack’s two-pass design is brilliant for wildlife precisely because of this recovery mechanism.
What About Kalman Filters?
Sharp readers will notice: my predict_bbox() function just returns the last known bbox. No motion model!
Why not use a Kalman filter? (a standard technique in tracking)
I chose to start simple. The config even has use_kalman: false. Here’s my reasoning:
- Camera traps are static: No camera motion to model (unlike vehicle-mounted cameras)
- Animals are non-linear: A grazing cow doesn’t follow linear motion—it wanders, stops, changes direction unpredictably
- Frame stride gaps: With stride=2, I’m skipping frames. Velocity estimates between sampled frames would be noisy
- “Start simple” philosophy: Get the baseline working first, add complexity only if needed
Kalman filtering assumes roughly linear motion between frames. Wildlife violates this assumption constantly. The naive “last known position” baseline turned out to be good enough—1.2 tracks/video without motion prediction.
Future work: If I see track fragmentation on fast-moving animals (birds in flight, fleeing deer), a Kalman filter or learned motion model might help. But for the current dataset (mostly stationary/slow animals), it wasn’t necessary.
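If fast movers ever force the issue, even a constant-velocity extrapolation would be a smaller step than a full Kalman filter. The sketch below is hypothetical and is not part of the current pipeline, which simply reuses the last known box.
Constant-velocity prediction sketch (Python)
def predict_bbox_constant_velocity(boxes):
    """Extrapolate the next box from the displacement between the last two boxes."""
    if len(boxes) < 2:
        return boxes[-1]
    prev, last = boxes[-2], boxes[-1]
    velocity = [l - p for l, p in zip(last, prev)]    # per-coordinate displacement
    return [l + v for l, v in zip(last, velocity)]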
Reproducibility: Experiment 002 Artifacts
All tracking experiments are archived in experiments/exp_002_tracking/:
- metrics/tracking_metrics.csv - Raw metrics for all 3 configs × 16 videos
- reports/evaluation_report.md - Comprehensive analysis with species breakdown
- Visualization plots - Track overlays, recovery effectiveness charts
You can reproduce this experiment:
# Run tracking with a specific config
python scripts/20_run_tracking.py \
--config config/pipeline.yaml \
--md-json data/md_json \
--video-root data/dataset-v1 \
--output data/tracking_json
# Evaluate tracking performance
python scripts/22_evaluate_tracking.py \
--config config/pipeline.yaml \
--tracks-json data/tracking_json \
--output experiments/exp_002_tracking
See scripts/22_evaluate_tracking.py for the evaluation script.
Series Navigation
This is Part 3 of 5 in the Wildlife Tracking Uruguay series.
- Part 1: Overview & Introduction
- Part 2: MegaDetector Calibration
- Part 3: The Tail-Wagging Paradox (You are here)
- Part 4: Weak Supervision & Training
- Part 5: Results & Impact
Previous: Part 2: MegaDetector Calibration
Next: Part 4: Weak Supervision & Training - Learn how I reduced labeling time from 76 hours to 3 hours using video-level weak supervision and quality guardrails to auto-label 912 crops.