Wildlife Tracking Uruguay: Weak Supervision & Classifier Training (Part 4)
Quick Recap
In Part 1, we introduced the 3-stage pipeline. In Part 2, we calibrated MegaDetector for 2x speedup. In Part 3, we solved the tail-wagging paradox with ultra-permissive IoU tracking.
Now we tackle Stage 3: The auto-labeling strategy and classifier training - how I reduced labeling time from 76 hours to 3 hours using weak supervision.
Weak Supervision: The 76-Hour Shortcut
Chapter Summary: Replace 76 hours of crop labeling with 45 minutes of video-level labeling by propagating species metadata from filenames—guardrails keep the weak supervision trustworthy.
Results Preview:
- Labeling effort: 76 hours → 45 minutes
- Auto-labeled crops: 912 across 11 species
- Guardrail accuracy: 98% auto-label precision after QC
With tracking solved, I now had clean animal trajectories—19 tracks across 16 test videos. The next step: extract crops from these tracks and label them with species.
Standard workflow: export crops, manually label them in CVAT or LabelStudio, then train a classifier. Let me do the math:
- 912 crops (from 87 production videos)
- ~5 minutes per crop for careful labeling
- Total: 76 hours of tedious clicking
I’m a graduate student with limited time. 76 hours of manual labeling felt like a non-starter.
Then I noticed something in a small subset of videos (~5% of the dataset): a few filenames already encoded the species.
```
margay_012.mp4
capybara_034.mp4
cow_0777.mp4
```
Someone had organized these by species when initially reviewing them. This gave me an idea: what if I adopted this naming convention for the entire dataset?
The key insight: Instead of labeling 912 individual crops, I could label at the video level.
- Watch a 15-second video → identify the species → rename the file
- One label decision per video, not per crop
- ~87 videos × 15 seconds each, plus time to rename = ~45 minutes of labeling
Once videos are named correctly, I can automatically extract species labels from filenames and propagate them to all crops from that video. This is weak supervision: I trade pixel-perfect crop labels for fast video-level labels, then use guardrails to ensure quality.
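Concretely, the propagation step is nothing more than a fan-out from video-level labels to crop-level labels. A toy illustration (the filenames and crop names here are made up):

```python
# One labeling decision per video (taken from the filename)
video_species = {
    "margay_012.mp4": "margay",
    "capybara_034.mp4": "capybara",
}

# Crops extracted from each video's tracks (illustrative names)
crops_by_video = {
    "margay_012.mp4": ["margay_012_t0_f10.jpg", "margay_012_t0_f55.jpg"],
    "capybara_034.mp4": ["capybara_034_t1_f20.jpg"],
}

# Every crop inherits its video's label
crop_labels = {
    crop: video_species[video]
    for video, crops in crops_by_video.items()
    for crop in crops
}
```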
The Auto-Labeling Strategy
The core idea is simple:
- Parse species from filename using regex patterns
- Extract crops from tracks (a few representative frames per track)
- Apply quality guardrails to prevent garbage labels
- Output labeled dataset ready for training
Let me walk through each piece.
Step 1: Species Mapping with Regex
I created config/species_map.yaml to map filename patterns to species labels:
species_map.yaml (excerpt)
```yaml
# config/species_map.yaml
species_map:
  margay:
    patterns:
      - "^margay_.*"
      - ".*gato_montes.*"   # Spanish name
  capybara:
    patterns:
      - "^capybara_.*"
      - ".*carpincho.*"     # Spanish name
  armadillo:
    patterns:
      - "^armadillo_.*"
      - ".*tatu.*"          # Spanish name
  # ... 11 species total
  no_animal:
    patterns:
      - ".*empty.*"
      - ".*noanimal.*"
  unknown_animal:
    patterns:
      - ".*unknown.*"
```
Design choices:
- Ambiguity is an error: If patterns from more than one species match, raise an error instead of silently picking one
- Case-insensitive: MARGAY_012.mp4 and margay_012.mp4 both work
- Bilingual support: Spanish and English names (Uruguay is Spanish-speaking)
- Explicit unknowns: no_animal and unknown_animal for edge cases
The helper library scripts/lib/species_map.py handles loading and validation:
species_map.py (simplified)
```python
# scripts/lib/species_map.py (simplified)
import re

import yaml


def load_species_map(yaml_path):
    """Load species regex patterns from YAML."""
    with open(yaml_path) as f:
        config = yaml.safe_load(f)

    species_map = {}
    for species, info in config['species_map'].items():
        species_map[species] = [
            re.compile(pattern, re.IGNORECASE)
            for pattern in info['patterns']
        ]
    return species_map


def extract_species_from_filename(filename, species_map):
    """
    Extract species from filename using regex patterns.

    Raises ValueError if multiple species match.
    """
    matches = []
    for species, patterns in species_map.items():
        for pattern in patterns:
            if pattern.search(filename):
                matches.append(species)
                break  # One match per species max

    if len(matches) == 0:
        return 'unknown_animal'  # Fallback
    elif len(matches) == 1:
        return matches[0]
    else:
        raise ValueError(
            f"Ambiguous filename '{filename}' matched multiple species: {matches}"
        )
```
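A quick sanity check of the two helpers together (the filenames are made up; the config path assumes the repo layout above):

```python
species_map = load_species_map("config/species_map.yaml")

print(extract_species_from_filename("margay_012.mp4", species_map))          # -> "margay"
print(extract_species_from_filename("trail_carpincho_03.mp4", species_map))  # -> "capybara" (Spanish name)
print(extract_species_from_filename("clip_0042.mp4", species_map))           # -> "unknown_animal" (fallback)
```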
Step 2: Crop Extraction with Quality Filters
Parsing species is only half the battle. I need to extract actual image crops from tracks. The script scripts/31_autolabel_from_filenames.py handles this:
Auto-label crop extraction (simplified)
```python
# From scripts/31_autolabel_from_filenames.py (simplified)
import cv2
from pathlib import Path


def extract_crops_from_track(track, video_path, species_label, config, output_dir):
    """
    Extract representative crops from a track with quality filters.
    """
    # Quality filter: skip tracks that are too short
    if len(track['frames']) < config['autolabel']['min_track_len']:
        return []  # Track too short, likely noise

    # Quality filter: skip low-confidence tracks
    avg_conf = sum(track['confs']) / len(track['confs'])
    if avg_conf < config['autolabel']['min_track_conf']:
        return []  # Low confidence, might be false positive

    # Quality filter: skip classes we don't want to train on
    if species_label in config['classification']['skip_classes']:
        return []  # e.g., 'no_animal', 'unknown_animal'

    # Hybrid sampling: select diverse + high-quality frames
    # (returns indices into the track's frame list; see the next snippet)
    selected_indices = select_representative_frames(
        track,
        frames_per_track=config['sampling']['frames_per_track'],
        strategy='hybrid'
    )

    # Extract crops from video
    cap = cv2.VideoCapture(str(video_path))
    crops = []
    for idx in selected_indices:
        frame_num = track['frames'][idx]   # absolute frame number in the video
        bbox = track['bboxes'][idx]        # per-frame box for this track

        # Seek to frame
        cap.set(cv2.CAP_PROP_POS_FRAMES, frame_num)
        ret, frame = cap.read()
        if not ret:
            continue

        # Extract crop with padding
        crop = extract_crop_with_padding(
            frame,
            bbox,
            padding=config['autolabel']['crop_padding']
        )

        # Save crop under output_dir/<species>/
        crop_filename = f"{video_path.stem}_t{track['track_id']}_f{frame_num}.jpg"
        crop_path = Path(output_dir) / species_label / crop_filename
        crop_path.parent.mkdir(parents=True, exist_ok=True)
        cv2.imwrite(str(crop_path), crop)

        crops.append({
            'crop_path': str(crop_path),
            'video': video_path.name,
            'track_id': track['track_id'],
            'frame': frame_num,
            'species': species_label,
            'bbox': bbox,
            'conf': track['confs'][idx]
        })

    cap.release()
    return crops
```
Hybrid Frame Sampling
The select_representative_frames() function balances two goals:
- High confidence: Pick frames where the animal is clearly visible
- Temporal diversity: Spread frames across the track to capture pose variation
Representative frame selection
```python
def select_representative_frames(track, frames_per_track=3, strategy='hybrid'):
    """
    Select representative frames from a track.

    Strategies:
    - 'max_conf': Pick N frames with highest confidence
    - 'temporal': Pick evenly spaced frames across track
    - 'hybrid':   Pick highest conf frame + temporally diverse neighbors
    """
    if strategy == 'max_conf':
        # Sort by confidence, take top N
        sorted_indices = sorted(
            range(len(track['confs'])),
            key=lambda i: track['confs'][i],
            reverse=True
        )
        return sorted_indices[:frames_per_track]

    elif strategy == 'temporal':
        # Evenly spaced across track duration
        step = len(track['frames']) // frames_per_track
        return [i * step for i in range(frames_per_track)]

    elif strategy == 'hybrid':
        # Best of both: highest conf + temporal spread
        max_conf_idx = track['confs'].index(max(track['confs']))
        selected = [max_conf_idx]

        # Add temporally distant frames
        track_len = len(track['frames'])
        for i in range(1, frames_per_track):
            offset = (track_len // frames_per_track) * i
            candidate = min(offset, track_len - 1)
            if candidate not in selected:
                selected.append(candidate)

        return sorted(selected)
```
I chose strategy='hybrid' for the production dataset. It gives me:
- At least one high-quality crop (max confidence frame)
- Pose diversity (temporally spread frames)
- Robustness to motion blur/occlusion (not all crops from same moment)
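To see what the hybrid strategy actually returns, here's a toy track (confidence values invented for illustration):

```python
# 10-frame toy track; confidence peaks at index 4
toy_track = {
    'frames': list(range(100, 110)),   # absolute frame numbers in the video
    'confs':  [0.55, 0.60, 0.70, 0.80, 0.92, 0.85, 0.75, 0.65, 0.60, 0.58],
}

print(select_representative_frames(toy_track, frames_per_track=3, strategy='hybrid'))
# -> [3, 4, 6]: the max-confidence frame (index 4) plus two temporally spread frames
```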
Step 3: Guardrails to Prevent Mislabeling
Weak supervision is powerful but risky. If filenames are wrong or ambiguous, I’d train on garbage labels. I built several guardrails:
Guardrail 1: Skip Short Tracks
Config snippet
```yaml
autolabel:
  min_track_len: 8   # Tracks with <8 frames are likely noise
```
Spurious detections (wind-blown grass, shadows) rarely persist >8 frames. Real animals almost always appear longer.
Guardrail 2: Skip Low-Confidence Tracks
Config snippet
```yaml
autolabel:
  min_track_conf: 0.6   # Average confidence across track
```
If MegaDetector stays below 0.6 confidence, it’s usually background noise—skip it.
Guardrail 3: Skip Unwanted Classes
Config snippet
```yaml
classification:
  skip_classes:
    - no_animal
    - unknown_animal
```
Skipping no_animal and unknown_animal keeps the classifier focused on real species.
Guardrail 4: Error on Ambiguous Filenames
Ambiguity check (Python)
```python
if len(matches) > 1:
    raise ValueError(f"Ambiguous: {filename} matches {matches}")
```
Ambiguous filenames halt the pipeline—manual review beats silent corruption.
Guardrail 5: Manual Validation
After auto-labeling, I manually reviewed the generated crops. I found:
- Good crop quality (sharp, well-framed animals)
- Pose diversity (animals in different positions/orientations)
- 2 videos with multi-species scenes → removed from dataset
This validation loop is critical. Weak supervision accelerates labeling, but human review ensures quality.
Dataset Statistics: Experiment 003
After running auto-labeling on 87 videos, I generated a summary report in experiments/exp_003_autolabel/:
Total Dataset:
- Videos: 87
- Tracks: 154
- Crops: 912 (roughly 6 per track on average)
Species Breakdown:
| Species | Videos | Tracks | Crops | Median Track Length (s) |
|---|---|---|---|---|
| armadillo | 9 | 9 | 72 | 10.69 |
| bird | 8 | 11 | 66 | 6.68 |
| capybara | 9 | 13 | 78 | 5.01 |
| cow | 7 | 30 | 120 | 1.88 |
| dusky_legged_guan | 9 | 27 | 108 | 10.29 |
| gray_brocket | 8 | 8 | 64 | 15.09 |
| hare | 8 | 8 | 64 | 11.77 |
| human | 7 | 19 | 114 | 4.27 |
| margay | 6 | 6 | 72 | 3.62 |
| skunk | 8 | 8 | 64 | 2.07 |
| wild_boar | 8 | 15 | 90 | 4.67 |
Observations:
- Balanced distribution: 64-120 crops per species (good for training)
- Cow has most tracks: 30 tracks from 7 videos (multi-animal scenes, herds)
- Gray brocket has longest tracks: 15 seconds median (stationary grazing)
- Skunk has shortest tracks: 2 seconds median (nocturnal, fast-moving)
This distribution is reasonably balanced: no species has fewer than 60 crops, and none dominates (cow at 120 is manageable). Perfect for training!
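For reference, the per-species summary can be regenerated from the crops manifest with a short pandas groupby (a sketch; the manifest column names here are assumptions):

```python
import pandas as pd

# Assumed columns in data/crops_manifest.csv:
#   crop_path, video, track_id, species, track_length_s
df = pd.read_csv("data/crops_manifest.csv")

summary = df.groupby("species").agg(
    videos=("video", "nunique"),
    tracks=("track_id", "nunique"),   # assumes track_id is unique across videos
    crops=("crop_path", "count"),
    median_track_len_s=("track_length_s", "median"),
)
print(summary.sort_index())
```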
Lessons from Weak Supervision
This auto-labeling strategy taught me several things:
Look for Hidden Supervision Signals
The species labels were hiding in filenames all along. I didn’t need to create them from scratch; I just needed to extract them. This is the core idea of weak supervision: find existing signals (filenames, metadata, heuristics) and leverage them.
Other potential signals I could have used:
- GPS coordinates (species distribution by region)
- Timestamps (nocturnal vs diurnal species)
- Camera location metadata (forest vs grassland)
Guardrails Are Essential
Weak supervision is “weak” for a reason: labels might be noisy. Guardrails (quality filters, ambiguity checks, manual validation) turn weak labels into trustworthy training data.
Start with High Precision, Scale to Higher Recall
My initial guardrails were strict (min_track_len=8, min_conf=0.6). This gave me high-precision labels (98% accuracy). Once I validated the approach, I could loosen guardrails to get more data (higher recall) if needed.
Human-in-the-Loop Is Not Cheating
I didn’t eliminate human involvement—I just shifted it from tedious pixel-level labeling to high-level quality checks. Reviewing 912 crops in 4 hours (vs labeling them in 76 hours) is a massive win.
Training the Classifier: From Crops to Production Model
Chapter Summary: Balanced crop-level splits plus a ResNet50 fine-tune delivered 95.7% accuracy—while documenting leakage risks and validation strategy.
Results Preview:
- Train / Val / Test crops: 634 / 139 / 139
- Best epoch: 7 with macro F1 98.4% (val)
- Final test macro F1: 95.3%
With 912 labeled crops in hand, I was ready to train a species classifier. But before jumping into model architecture, I faced a critical decision: how do I split the data?
The Split Strategy Dilemma
Most ML tutorials gloss over dataset splitting with a simple “70/15/15 train/val/test.” But with wildlife video data, it’s not that simple. I had two options:
Option 1: Video-Level Splits
Idea: Keep all crops from the same video in the same split (train, val, or test).
Pros:
- No data leakage (crops from same video never span splits)
- Realistic evaluation (tests generalization to new camera locations)
Cons:
- Severe class imbalance (some species get only 6-9 crops in val/test)
- Unstable validation metrics (too few samples per class)
Option 2: Crop-Level Splits
Idea: Treat each crop independently, shuffle and split randomly.
Pros:
- Perfect class balance (each species gets proportional representation)
- Stable validation (adequate samples per class, ~10-15 crops)
Cons:
- Risk of data leakage (crops from same video might be in train AND test)
- Optimistic evaluation (model may memorize video-specific features)
My Decision: Crop-Level + Awareness
I chose crop-level splits for model development, with the following rationale:
Training stability matters: With only 64-120 crops per species, I couldn’t afford severe class imbalance. Stable validation metrics are essential for hyperparameter tuning.
Data leakage risk is low in my use case: Most videos have 1 dominant species. Crops from the same video show the same animal in different poses/lighting—that’s actually what I want the model to learn (pose invariance, lighting robustness).
I’ll validate on held-out videos anyway: After training, I tested the model on completely new videos (not in the 87-video training set) to check real-world generalization.
The compromise: Use crop-level splits for development + training, but maintain awareness that test metrics may be slightly optimistic. Document this limitation clearly.
Heads-up: Crop-level splits speed up iteration, but I always re-check performance on fresh videos to guard against leakage.
Final split:
```
Train: 634 crops (70%)
Val:   139 crops (15%)
Test:  139 crops (15%)
```
All species represented with 10+ crops per split—sufficient for stable metrics.
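For reference, a crop-level stratified split like this is only a few lines with scikit-learn (a sketch; the manifest columns and the exact splits.json format are assumptions, not the pipeline's code):

```python
import json

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/crops_manifest.csv")  # assumed columns: crop_path, species

# 70/15/15 split, stratified by species so every class appears in each split
train_df, temp_df = train_test_split(
    df, test_size=0.30, stratify=df["species"], random_state=42
)
val_df, test_df = train_test_split(
    temp_df, test_size=0.50, stratify=temp_df["species"], random_state=42
)

splits = {
    "train": train_df["crop_path"].tolist(),
    "val": val_df["crop_path"].tolist(),
    "test": test_df["crop_path"].tolist(),
}
with open("experiments/exp_003_autolabel/splits.json", "w") as f:
    json.dump(splits, f, indent=2)
```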
Model Architecture: ResNet50
I chose ResNet50 for the classifier:
Why ResNet50?
- Proven architecture (won ImageNet 2015)
- Strong ImageNet pretraining (transfer learning boost)
- Not too deep (efficient training on my RTX 3060 Ti)
- Well-supported in PyTorch/torchvision
Alternatives I considered:
- MobileNetV3: Lighter, faster, but less accurate on small datasets
- EfficientNet: More parameter-efficient, but slower to converge
- Vision Transformer (ViT): Requires larger datasets (>10k samples)
For 900 crops, ResNet50 hits the sweet spot: strong enough to learn, not so large that it overfits.
Implementation:
```python
# From training/train_classifier.py
import torch
import torch.nn as nn
import torchvision.models as models


def build_model(num_classes=11, pretrained=True):
    """Build ResNet50 with custom classifier head."""
    model = models.resnet50(pretrained=pretrained)

    # Replace final FC layer (1000 ImageNet classes → 11 wildlife species)
    num_features = model.fc.in_features  # 2048
    model.fc = nn.Linear(num_features, num_classes)

    return model
```
Transfer learning: I kept all pretrained layers frozen initially, then fine-tuned the full model. But honestly, with 634 training crops, I found full end-to-end training worked better (no freezing). The ImageNet initialization was enough to prevent overfitting.
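For comparison, the frozen-backbone variant I tried first looks roughly like this (a sketch, not the script's exact code):

```python
def build_frozen_backbone_model(num_classes=11):
    """ResNet50 with the pretrained backbone frozen; only the new head trains."""
    model = models.resnet50(pretrained=True)
    for param in model.parameters():
        param.requires_grad = False   # freeze all pretrained weights

    # The replacement head is created after freezing, so it remains trainable
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model
```

Switching to full fine-tuning is then just a matter of skipping the requires_grad loop (or unfreezing after a few warm-up epochs).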
Training Configuration
```yaml
# From config/pipeline.yaml
classification:
  model_arch: resnet50
  pretrained: true
  num_epochs: 15
  batch_size: 32
  learning_rate: 0.0001
  weight_decay: 0.0001
  optimizer: adamw
  lr_scheduler: cosine
  image_size: 224
  balance_classes: true
```
Key hyperparameters:
- Optimizer: AdamW (Adam + weight decay, better generalization)
- Learning rate: 0.0001 (conservative, prevents overshooting)
- Scheduler: Cosine annealing (smooth decay from 0.0001 → 0)
- Weight decay: 0.0001 (L2 regularization to prevent overfitting)
- Class balancing: WeightedRandomSampler to handle slight imbalance
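Wired together, those choices look roughly like this (a sketch; model and train_dataset come from earlier steps, and the .targets attribute is an assumption):

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=15)  # 15 epochs

# WeightedRandomSampler: sample each crop inversely to its class frequency
labels = torch.tensor(train_dataset.targets)        # per-crop class indices (assumed attribute)
class_counts = torch.bincount(labels)
sample_weights = (1.0 / class_counts.float())[labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(sample_weights))

train_loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)
```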
Training Dynamics: What Actually Happened
I tracked metrics for all 15 epochs. Here’s what the learning curves looked like:
| Epoch | Train Loss | Train Acc | Val Loss | Val Acc | Val F1 |
|---|---|---|---|---|---|
| 1 | 2.18 | 44.3% | 1.82 | 67.6% | 63.7% |
| 2 | 1.42 | 76.7% | 0.96 | 83.5% | 81.1% |
| 3 | 0.75 | 88.0% | 0.55 | 89.2% | 88.3% |
| 4 | 0.32 | 95.4% | 0.26 | 95.0% | 95.0% |
| 5 | 0.19 | 96.7% | 0.19 | 96.4% | 96.1% |
| 6 | 0.11 | 98.4% | 0.17 | 96.4% | 95.9% |
| 7 | 0.08 | 97.9% | 0.13 | 98.6% | 98.4% ← Best |
| 8 | 0.05 | 99.5% | 0.11 | 98.6% | 98.4% |
| … | … | … | … | … | … |
| 15 | 0.04 | 99.4% | 0.10 | 97.8% | 97.6% |
Observations:
- Fast convergence: 95% accuracy by epoch 4 (transfer learning FTW!)
- Best model at epoch 7: Validation F1 peaked at 98.4%
- Slight overfitting: Train accuracy climbs to 99.4%, val stays at ~98%
- Early stopping candidate: Could have stopped at epoch 10 (plateau)
I saved the epoch 7 checkpoint as best_model.pt for final evaluation.
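Checkpoint selection is simply "save whenever validation macro F1 improves". A minimal sketch of that loop (train_one_epoch and evaluate_macro_f1 are hypothetical helper names, not the script's API):

```python
best_f1 = 0.0
for epoch in range(num_epochs):
    train_one_epoch(model, train_loader, optimizer)   # hypothetical helper
    val_f1 = evaluate_macro_f1(model, val_loader)     # hypothetical helper
    scheduler.step()

    if val_f1 > best_f1:
        best_f1 = val_f1
        torch.save(model.state_dict(), "experiments/exp_003_species/best_model.pt")
```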
Evaluation: Testing the Best Model
After training, I ran training/eval_classifier.py on the held-out test set:
```bash
python training/eval_classifier.py \
    --config config/pipeline.yaml \
    --manifest data/crops_manifest.csv \
    --splits experiments/exp_003_autolabel/splits.json \
    --checkpoint experiments/exp_003_species/best_model.pt \
    --output-dir experiments/exp_003_species \
    --split test
```
Test Results:
```json
{
  "test_accuracy": 0.957,
  "macro_f1": 0.953
}
```
Test set: accuracy 95.7%, macro F1 95.3%, loss 0.166.
95.7% accuracy, 95.3% F1—production-quality results!
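These headline numbers are easy to recompute from the exported per-crop predictions (a sketch; the column names in predictions_test.csv are assumptions):

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

preds = pd.read_csv("experiments/exp_003_species/predictions_test.csv")

# Column names assumed for illustration
y_true, y_pred = preds["true_species"], preds["pred_species"]
print("accuracy:", accuracy_score(y_true, y_pred))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
```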
Reproducibility: Training Artifacts
All training artifacts are in experiments/exp_003_species/:
- best_model.pt - Trained ResNet50 checkpoint (epoch 7, 283MB, Git LFS)
- metrics.csv - Per-epoch training logs
- metrics.json - Final test metrics + per-class F1
- predictions_test.csv - Crop-level test predictions
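To load the checkpoint for inference elsewhere (a sketch, assuming best_model.pt stores a plain state_dict as in the training sketch above):

```python
import torch

model = build_model(num_classes=11, pretrained=False)   # same architecture as training
state_dict = torch.load("experiments/exp_003_species/best_model.pt", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()
```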
Reproduce training:
```bash
python training/train_classifier.py \
    --config config/pipeline.yaml \
    --manifest data/crops_manifest.csv \
    --splits experiments/exp_003_autolabel/splits.json \
    --output-dir experiments/exp_003_species \
    --model resnet50
```
See the full training script at training/train_classifier.py.
Series Navigation
This is Part 4 of 5 in the Wildlife Tracking Uruguay series.
- Part 1: Overview & Introduction
- Part 2: MegaDetector Calibration
- Part 3: The Tail-Wagging Paradox
- Part 4: Weak Supervision & Training (You are here)
- Part 5: Results & Impact
Previous: Part 3: The Tail-Wagging Paradox
Next: Part 5: Results & Impact - Explore the final test set performance (95.7% accuracy), confusion matrix analysis, production deployment, and key lessons learned from building this end-to-end wildlife tracking pipeline.