
Wildlife Tracking Uruguay: Weak Supervision & Classifier Training (Part 4)

Quick Recap

In Part 1, we introduced the 3-stage pipeline. In Part 2, we calibrated MegaDetector for 2x speedup. In Part 3, we solved the tail-wagging paradox with ultra-permissive IoU tracking.

Now we tackle Stage 3: The auto-labeling strategy and classifier training - how I reduced labeling time from 76 hours to 3 hours using weak supervision.


Weak Supervision: The 76-Hour Shortcut

Chapter Summary: Replace 76 hours of crop labeling with 45 minutes of video-level labeling by propagating species metadata from filenames—guardrails keep the weak supervision trustworthy.

Results Preview:

  • Labeling effort: 76 hours → 45 minutes
  • Auto-labeled crops: 912 across 11 species
  • Guardrail accuracy: 98% auto-label precision after QC

With tracking solved, I now had clean animal trajectories—19 tracks across 16 test videos. The next step: extract crops from these tracks and label them with species.

Standard workflow: export crops, manually label them in CVAT or LabelStudio, then train a classifier. Let me do the math:

  • 912 crops (from 87 production videos)
  • ~5 minutes per crop for careful labeling
  • Total: 76 hours of tedious clicking

I’m a graduate student with limited time. 76 hours of manual labeling felt like a non-starter.

Then I noticed something in a small subset of videos (~5% of the dataset): a few filenames already encoded the species.

margay_012.mp4
capybara_034.mp4
cow_0777.mp4

Someone had organized these by species when initially reviewing them. This gave me an idea: what if I adopted this naming convention for the entire dataset?

The key insight: Instead of labeling 912 individual crops, I could label at the video level.

  • Watch a 15-second video → identify the species → rename the file
  • One label decision per video, not per crop
  • ~87 videos × ~30 seconds each (watch the 15-second clip, decide, rename) ≈ 45 minutes of labeling

Once videos are named correctly, I can automatically extract species labels from filenames and propagate them to all crops from that video. This is weak supervision: I trade pixel-perfect crop labels for fast video-level labels, then use guardrails to ensure quality.
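
Before diving into the implementation, here is the core move in miniature: a conceptual sketch with made-up filenames, not the actual pipeline code.

Label propagation sketch (Python)
# Conceptual sketch: one video-level label fans out to every crop from that video
video_labels = {
    "margay_012.mp4": "margay",        # one decision per video
    "capybara_034.mp4": "capybara",
}

crops = [  # crops extracted from tracks, keyed by their source video
    {"crop": "margay_012_t0_f14.jpg", "video": "margay_012.mp4"},
    {"crop": "margay_012_t0_f52.jpg", "video": "margay_012.mp4"},
    {"crop": "capybara_034_t1_f08.jpg", "video": "capybara_034.mp4"},
]

for c in crops:
    c["species"] = video_labels[c["video"]]  # every crop inherits its video's label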


The Auto-Labeling Strategy

The core idea is simple:

  1. Parse species from filename using regex patterns
  2. Extract crops from tracks (a few representative frames per track)
  3. Apply quality guardrails to prevent garbage labels
  4. Output labeled dataset ready for training

Let me walk through each piece.


Step 1: Species Mapping with Regex

I created config/species_map.yaml to map filename patterns to species labels:

species_map.yaml (excerpt)
# config/species_map.yaml
species_map:
  margay:
    patterns:
      - "^margay_.*"
      - ".*gato_montes.*"  # Spanish name

  capybara:
    patterns:
      - "^capybara_.*"
      - ".*carpincho.*"    # Spanish name

  armadillo:
    patterns:
      - "^armadillo_.*"
      - ".*tatu.*"         # Spanish name

  # ... 11 species total

  no_animal:
    patterns:
      - ".*empty.*"
      - ".*noanimal.*"

  unknown_animal:
    patterns:
      - ".*unknown.*"

Design choices:

  • Strict matching: If patterns from more than one species match a filename, raise an error (prevents silent ambiguity)
  • Case-insensitive: MARGAY_012.mp4 and margay_012.mp4 both work
  • Bilingual support: Spanish and English names (Uruguay is Spanish-speaking)
  • Explicit unknowns: no_animal and unknown_animal for edge cases

The helper library scripts/lib/species_map.py handles loading and validation:

species_map.py (simplified)
# scripts/lib/species_map.py (simplified)
import re
import yaml
from pathlib import Path

def load_species_map(yaml_path):
    """Load species regex patterns from YAML."""
    with open(yaml_path) as f:
        config = yaml.safe_load(f)

    species_map = {}
    for species, info in config['species_map'].items():
        species_map[species] = [
            re.compile(pattern, re.IGNORECASE)
            for pattern in info['patterns']
        ]

    return species_map

def extract_species_from_filename(filename, species_map):
    """
    Extract species from filename using regex patterns.
    Raises ValueError if multiple species match.
    """
    matches = []

    for species, patterns in species_map.items():
        for pattern in patterns:
            if pattern.search(filename):
                matches.append(species)
                break  # One match per species max

    if len(matches) == 0:
        return 'unknown_animal'  # Fallback
    elif len(matches) == 1:
        return matches[0]
    else:
        raise ValueError(
            f"Ambiguous filename '{filename}' matched multiple species: {matches}"
        )
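
A quick usage example against the filenames from earlier (illustrative calls; trailcam_007.mp4 is a hypothetical file that matches no pattern):

species_map usage (example)
species_map = load_species_map("config/species_map.yaml")

extract_species_from_filename("margay_012.mp4", species_map)     # -> 'margay'
extract_species_from_filename("CARPINCHO_034.mp4", species_map)  # -> 'capybara' (case-insensitive Spanish alias)
extract_species_from_filename("trailcam_007.mp4", species_map)   # -> 'unknown_animal' (fallback)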

Step 2: Crop Extraction with Quality Filters

Parsing species is only half the battle. I need to extract actual image crops from tracks. The script scripts/31_autolabel_from_filenames.py handles this:

Auto-label crop extraction (simplified)
# From scripts/31_autolabel_from_filenames.py (simplified)
import cv2
from pathlib import Path

def extract_crops_from_track(track, video_path, species_label, output_dir, config):
    """
    Extract representative crops from a track with quality filters.
    """
    # Quality filter: skip tracks that are too short
    if len(track['frames']) < config['autolabel']['min_track_len']:
        return []  # Track too short, likely noise

    # Quality filter: skip low-confidence tracks
    avg_conf = sum(track['confs']) / len(track['confs'])
    if avg_conf < config['autolabel']['min_track_conf']:
        return []  # Low confidence, might be false positive

    # Quality filter: skip classes we don't want to train on
    if species_label in config['classification']['skip_classes']:
        return []  # e.g., 'no_animal', 'unknown_animal'

    # Hybrid sampling: select diverse + high-quality frame indices
    selected_indices = select_representative_frames(
        track,
        frames_per_track=config['sampling']['frames_per_track'],
        strategy='hybrid'
    )

    # Extract crops from video
    cap = cv2.VideoCapture(str(video_path))
    crops = []

    for idx in selected_indices:
        # Track lists ('frames', 'boxes', 'confs') are parallel, one entry per detection
        frame_num = track['frames'][idx]
        bbox = track['boxes'][idx]
        # Seek to frame
        cap.set(cv2.CAP_PROP_POS_FRAMES, frame_num)
        ret, frame = cap.read()
        if not ret:
            continue

        # Extract crop with padding
        crop = extract_crop_with_padding(
            frame,
            bbox,
            padding=config['autolabel']['crop_padding']
        )

        # Save crop
        crop_filename = f"{video_path.stem}_t{track['track_id']}_f{frame_num}.jpg"
        crop_path = output_dir / species_label / crop_filename
        crop_path.parent.mkdir(parents=True, exist_ok=True)
        cv2.imwrite(str(crop_path), crop)

        crops.append({
            'crop_path': str(crop_path),
            'video': video_path.name,
            'track_id': track['track_id'],
            'frame': frame_num,
            'species': species_label,
            'bbox': bbox,
            'conf': track['confs'][idx]
        })

    cap.release()
    return crops
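
The script relies on an extract_crop_with_padding() helper that isn't shown in the excerpt. A minimal sketch of what such a helper might look like, assuming boxes are [x1, y1, x2, y2] pixel coordinates and padding is a fraction of the box size:

Padded crop helper (sketch)
def extract_crop_with_padding(frame, bbox, padding=0.1):
    """Crop bbox from frame with a margin, clamped to the image bounds.

    Assumes bbox = [x1, y1, x2, y2] in pixels; padding is a fraction of box size.
    """
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = bbox
    pad_x = int((x2 - x1) * padding)
    pad_y = int((y2 - y1) * padding)

    x1 = max(0, int(x1) - pad_x)
    y1 = max(0, int(y1) - pad_y)
    x2 = min(w, int(x2) + pad_x)
    y2 = min(h, int(y2) + pad_y)

    return frame[y1:y2, x1:x2]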

Hybrid Frame Sampling

The select_representative_frames() function balances two goals:

  1. High confidence: Pick frames where the animal is clearly visible
  2. Temporal diversity: Spread frames across the track to capture pose variation
Representative frame selection
def select_representative_frames(track, frames_per_track=3, strategy='hybrid'):
    """
    Select representative frames from a track.

    Strategies:
    - 'max_conf': Pick N frames with highest confidence
    - 'temporal': Pick evenly spaced frames across track
    - 'hybrid': Pick highest conf frame + temporally diverse neighbors
    """
    if strategy == 'max_conf':
        # Sort by confidence, take top N
        sorted_indices = sorted(
            range(len(track['confs'])),
            key=lambda i: track['confs'][i],
            reverse=True
        )
        return sorted_indices[:frames_per_track]

    elif strategy == 'temporal':
        # Evenly spaced across track duration
        step = len(track['frames']) // frames_per_track
        return [i * step for i in range(frames_per_track)]

    elif strategy == 'hybrid':
        # Best of both: highest conf + temporal spread
        max_conf_idx = track['confs'].index(max(track['confs']))
        selected = [max_conf_idx]

        # Add temporally distant frames
        track_len = len(track['frames'])
        for i in range(1, frames_per_track):
            offset = (track_len // frames_per_track) * i
            candidate = min(offset, track_len - 1)
            if candidate not in selected:
                selected.append(candidate)

        return sorted(selected)

I chose strategy='hybrid' for the production dataset. It gives me:

  • At least one high-quality crop (max confidence frame)
  • Pose diversity (temporally spread frames)
  • Robustness to motion blur/occlusion (not all crops from same moment)
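
To make the strategies concrete, here is a toy track (made-up confidences) run through all three:

Frame selection on a toy track (example)
track = {
    'frames': list(range(100, 112)),   # 12 consecutive video frame numbers
    'confs':  [0.55, 0.60, 0.88, 0.90, 0.87, 0.70,
               0.65, 0.72, 0.75, 0.68, 0.62, 0.58],
}

select_representative_frames(track, 3, strategy='max_conf')  # [3, 2, 4]  (clustered around the confidence peak)
select_representative_frames(track, 3, strategy='temporal')  # [0, 4, 8]  (evenly spaced)
select_representative_frames(track, 3, strategy='hybrid')    # [3, 4, 8]  (peak frame + temporal spread)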

Step 3: Guardrails to Prevent Mislabeling

Weak supervision is powerful but risky. If filenames are wrong or ambiguous, I’d train on garbage labels. I built several guardrails:

Guardrail 1: Skip Short Tracks

Config snippet
autolabel:
  min_track_len: 8  # Tracks with <8 frames are likely noise

Spurious detections (wind-blown grass, shadows) rarely persist >8 frames. Real animals almost always appear longer.

Guardrail 2: Skip Low-Confidence Tracks

Config snippet
autolabel:
  min_track_conf: 0.6  # Average confidence across track

If MegaDetector stays below 0.6 confidence, it’s usually background noise—skip it.

Guardrail 3: Skip Unwanted Classes

Config snippet
classification:
  skip_classes:
    - no_animal
    - unknown_animal

Skip no_animal/unknown_animal so the classifier focuses on real species.

Guardrail 4: Error on Ambiguous Filenames

Ambiguity check (Python)
if len(matches) > 1:
    raise ValueError(f"Ambiguous: {filename} matches {matches}")

Ambiguous filenames halt the pipeline—manual review beats silent corruption.

Guardrail 5: Manual Validation

After auto-labeling, I manually reviewed the crops and found:

  • Good crop quality (sharp, well-framed animals)
  • Pose diversity (animals in different positions/orientations)
  • 2 videos with multi-species scenes → removed from dataset

This validation loop is critical. Weak supervision accelerates labeling, but human review ensures quality.
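
Because crops are written into per-species folders (output_dir / species_label above), the review can be organized however you like. As one illustration, a sketch for pulling a random per-species sample into a single review folder (paths and sample size are hypothetical):

Review sampling sketch (Python)
import random
import shutil
from pathlib import Path

crops_root = Path("data/crops")            # assumed layout: crops/<species>/<crop>.jpg
review_dir = Path("data/review_sample")    # hypothetical destination for the spot check

for species_dir in sorted(crops_root.iterdir()):
    if not species_dir.is_dir():
        continue
    crops = list(species_dir.glob("*.jpg"))
    sample = random.sample(crops, k=min(20, len(crops)))  # up to 20 crops per species

    dest = review_dir / species_dir.name
    dest.mkdir(parents=True, exist_ok=True)
    for crop in sample:
        shutil.copy(crop, dest / crop.name)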


Dataset Statistics: Experiment 003

After running auto-labeling on 87 videos, I generated a summary report in experiments/exp_003_autolabel/:

Total Dataset:

  • Videos: 87
  • Tracks: 154
  • Crops: 912 (~6 per track on average)

Species Breakdown:

| Species | Videos | Tracks | Crops | Median Track Length (s) |
|---|---|---|---|---|
| armadillo | 9 | 9 | 72 | 10.69 |
| bird | 8 | 11 | 66 | 6.68 |
| capybara | 9 | 13 | 78 | 5.01 |
| cow | 7 | 30 | 120 | 1.88 |
| dusky_legged_guan | 9 | 27 | 108 | 10.29 |
| gray_brocket | 8 | 8 | 64 | 15.09 |
| hare | 8 | 8 | 64 | 11.77 |
| human | 7 | 19 | 114 | 4.27 |
| margay | 6 | 6 | 72 | 3.62 |
| skunk | 8 | 8 | 64 | 2.07 |
| wild_boar | 8 | 15 | 90 | 4.67 |

Observations:

  • Balanced distribution: 64-120 crops per species (good for training)
  • Cow has most tracks: 30 tracks from 7 videos (multi-animal scenes, herds)
  • Gray brocket has longest tracks: 15 seconds median (stationary grazing)
  • Skunk has shortest tracks: 2 seconds median (nocturnal, fast-moving)

This distribution is reasonably balanced: no species has fewer than 60 crops, and none dominates (cow at 120 is still reasonable). Perfect for training!
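
The summary table above can be regenerated straight from the crop manifest. A sketch assuming the manifest has video, track_id, and species columns (the actual schema of data/crops_manifest.csv may differ):

Dataset summary sketch (pandas)
import pandas as pd

df = pd.read_csv("data/crops_manifest.csv")

# track_id is only unique within a video, so build a global track key first
df["track_key"] = df["video"] + ":" + df["track_id"].astype(str)

summary = df.groupby("species").agg(
    videos=("video", "nunique"),
    tracks=("track_key", "nunique"),
    crops=("track_key", "size"),
)
print(summary.sort_index())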


Lessons from Weak Supervision

This auto-labeling strategy taught me several things:

Look for Hidden Supervision Signals

The species labels were hiding in filenames all along. I didn’t need to create them from scratch; I just needed to extract them. This is the core idea of weak supervision: find existing signals (filenames, metadata, heuristics) and leverage them.

Other potential signals I could have used:

  • GPS coordinates (species distribution by region)
  • Timestamps (nocturnal vs diurnal species)
  • Camera location metadata (forest vs grassland)

Guardrails Are Essential

Weak supervision is “weak” for a reason: labels might be noisy. Guardrails (quality filters, ambiguity checks, manual validation) turn weak labels into trustworthy training data.

Start with High Precision, Scale to Higher Recall

My initial guardrails were strict (min_track_len=8, min_conf=0.6). This gave me high-precision labels (98% accuracy). Once I validated the approach, I could loosen guardrails to get more data (higher recall) if needed.

Human-in-the-Loop Is Not Cheating

I didn’t eliminate human involvement—I just shifted it from tedious pixel-level labeling to high-level quality checks. Reviewing 912 crops in 4 hours (vs labeling them in 76 hours) is a massive win.


Training the Classifier: From Crops to Production Model

Chapter Summary: Balanced crop-level splits plus a ResNet50 fine-tune delivered 95.7% accuracy—while documenting leakage risks and validation strategy.

Results Preview:

  • Train / Val / Test crops: 634 / 139 / 139
  • Best epoch: 7 with macro F1 98.4% (val)
  • Final test macro F1: 95.3%

With 912 labeled crops in hand, I was ready to train a species classifier. But before jumping into model architecture, I faced a critical decision: how do I split the data?


The Split Strategy Dilemma

Most ML tutorials gloss over dataset splitting with a simple “70/15/15 train/val/test.” But with wildlife video data, it’s not that simple. I had two options:

Option 1: Video-Level Splits

Idea: Keep all crops from the same video in the same split (train, val, or test).

Pros:

  • No data leakage (crops from same video never span splits)
  • Realistic evaluation (tests generalization to new camera locations)

Cons:

  • Severe class imbalance (some species get only 6-9 crops in val/test)
  • Unstable validation metrics (too few samples per class)

Option 2: Crop-Level Splits

Idea: Treat each crop independently, shuffle and split randomly.

Pros:

  • Perfect class balance (each species gets proportional representation)
  • Stable validation (adequate samples per class, ~10-15 crops)

Cons:

  • Risk of data leakage (crops from same video might be in train AND test)
  • Optimistic evaluation (model may memorize video-specific features)

My Decision: Crop-Level + Awareness

I chose crop-level splits for model development, with the following rationale:

  1. Training stability matters: With only 64-120 crops per species, I couldn’t afford severe class imbalance. Stable validation metrics are essential for hyperparameter tuning.

  2. Data leakage risk is low in my use case: Most videos have 1 dominant species. Crops from the same video show the same animal in different poses/lighting—that’s actually what I want the model to learn (pose invariance, lighting robustness).

  3. I’ll validate on held-out videos anyway: After training, I tested the model on completely new videos (not in the 87-video training set) to check real-world generalization.

The compromise: Use crop-level splits for development + training, but maintain awareness that test metrics may be slightly optimistic. Document this limitation clearly.

Heads-up: Crop-level splits speed up iteration, but I always re-check performance on fresh videos to guard against leakage.

Final split:

Train: 634 crops (70%)
Val:   139 crops (15%)
Test:  139 crops (15%)

All species represented with 10+ crops per split—sufficient for stable metrics.
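
For reference, a stratified crop-level split like this is only a few lines with scikit-learn. A sketch, assuming the manifest has a species column (this is not the project’s actual split code; the real splits live in experiments/exp_003_autolabel/splits.json):

Stratified split sketch (scikit-learn)
import json
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/crops_manifest.csv")

# 70% train, then split the remaining 30% evenly into val/test, stratified by species
train_df, temp_df = train_test_split(
    df, test_size=0.30, stratify=df["species"], random_state=42
)
val_df, test_df = train_test_split(
    temp_df, test_size=0.50, stratify=temp_df["species"], random_state=42
)

splits = {
    "train": train_df.index.tolist(),
    "val": val_df.index.tolist(),
    "test": test_df.index.tolist(),
}
with open("splits.json", "w") as f:
    json.dump(splits, f, indent=2)

A video-level split would instead group by source video, e.g. with sklearn’s GroupShuffleSplit and groups=df["video"].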


Model Architecture: ResNet50

I chose ResNet50 for the classifier:

Why ResNet50?

  • Proven architecture (won ImageNet 2015)
  • Strong ImageNet pretraining (transfer learning boost)
  • Not too deep (efficient training on my RTX 3060 Ti)
  • Well-supported in PyTorch/torchvision

Alternatives I considered:

  • MobileNetV3: Lighter, faster, but less accurate on small datasets
  • EfficientNet: More parameter-efficient, but slower to converge
  • Vision Transformer (ViT): Requires larger datasets (>10k samples)

For 900 crops, ResNet50 hits the sweet spot: strong enough to learn, not so large that it overfits.

Implementation:

# From training/train_classifier.py
import torch
import torch.nn as nn
import torchvision.models as models

def build_model(num_classes=11, pretrained=True):
    """Build ResNet50 with custom classifier head."""
    model = models.resnet50(pretrained=pretrained)

    # Replace final FC layer (1000 ImageNet classes → 11 wildlife species)
    num_features = model.fc.in_features  # 2048
    model.fc = nn.Linear(num_features, num_classes)

    return model

Transfer learning: I initially froze the pretrained backbone and trained only the new head, planning to fine-tune afterwards. But honestly, with 634 training crops, full end-to-end fine-tuning worked better from the start (no freezing). The ImageNet initialization was enough to prevent overfitting.
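
For reference, the freeze-then-unfreeze variant looks roughly like this (a sketch, not the actual training script):

Freeze / unfreeze sketch (PyTorch)
model = build_model(num_classes=11, pretrained=True)

# Phase 1: freeze the backbone, train only the new classifier head
for param in model.parameters():
    param.requires_grad = False
for param in model.fc.parameters():
    param.requires_grad = True

# Phase 2: unfreeze everything for full fine-tuning (what ended up working best here)
for param in model.parameters():
    param.requires_grad = True

In practice you would also rebuild the optimizer over the trainable parameters when switching phases.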


Training Configuration

# From config/pipeline.yaml
classification:
  model_arch: resnet50
  pretrained: true
  num_epochs: 15
  batch_size: 32
  learning_rate: 0.0001
  weight_decay: 0.0001
  optimizer: adamw
  lr_scheduler: cosine
  image_size: 224
  balance_classes: true

Key hyperparameters:

  • Optimizer: AdamW (Adam + weight decay, better generalization)
  • Learning rate: 0.0001 (conservative, prevents overshooting)
  • Scheduler: Cosine annealing (smooth decay from 0.0001 → 0)
  • Weight decay: 0.0001 (L2 regularization to prevent overfitting)
  • Class balancing: WeightedRandomSampler to handle slight imbalance
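
A sketch of how those settings map onto PyTorch objects (illustrative wiring; train_dataset and train_labels are assumed names, not the actual script’s variables):

Training setup sketch (PyTorch)
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

model = build_model(num_classes=11, pretrained=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=15)

# balance_classes: sample crops with probability inversely proportional to class frequency
labels = torch.tensor(train_labels)                      # integer class id per training crop
class_counts = torch.bincount(labels, minlength=11).float()
sample_weights = 1.0 / class_counts[labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)

train_loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)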

Training Dynamics: What Actually Happened

I tracked metrics for all 15 epochs. Here’s what the learning curves looked like:

| Epoch | Train Loss | Train Acc | Val Loss | Val Acc | Val F1 |
|---|---|---|---|---|---|
| 1 | 2.18 | 44.3% | 1.82 | 67.6% | 63.7% |
| 2 | 1.42 | 76.7% | 0.96 | 83.5% | 81.1% |
| 3 | 0.75 | 88.0% | 0.55 | 89.2% | 88.3% |
| 4 | 0.32 | 95.4% | 0.26 | 95.0% | 95.0% |
| 5 | 0.19 | 96.7% | 0.19 | 96.4% | 96.1% |
| 6 | 0.11 | 98.4% | 0.17 | 96.4% | 95.9% |
| 7 | 0.08 | 97.9% | 0.13 | 98.6% | 98.4% (best) |
| 8 | 0.05 | 99.5% | 0.11 | 98.6% | 98.4% |
| 15 | 0.04 | 99.4% | 0.10 | 97.8% | 97.6% |

Observations:

  1. Fast convergence: 95% accuracy by epoch 4 (transfer learning FTW!)
  2. Best model at epoch 7: Validation F1 peaked at 98.4%
  3. Slight overfitting: Train accuracy climbs to 99.4%, val stays at ~98%
  4. Early stopping candidate: Could have stopped at epoch 10 (plateau)

I saved the epoch 7 checkpoint as best_model.pt for final evaluation.
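
The “best epoch” bookkeeping is simple: track validation macro F1 each epoch and save whenever it improves. A sketch continuing from the setup above (train_one_epoch, evaluate, and val_loader are assumed helpers/objects, not the script’s actual names):

Best-checkpoint sketch (PyTorch)
best_f1 = 0.0

for epoch in range(15):
    train_one_epoch(model, train_loader, optimizer)   # assumed helper
    scheduler.step()

    val_f1 = evaluate(model, val_loader)              # assumed helper returning macro F1
    if val_f1 > best_f1:
        best_f1 = val_f1
        torch.save(model.state_dict(), "best_model.pt")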


Evaluation: Testing the Best Model

After training, I ran training/eval_classifier.py on the held-out test set:

python training/eval_classifier.py \
  --config config/pipeline.yaml \
  --manifest data/crops_manifest.csv \
  --splits experiments/exp_003_autolabel/splits.json \
  --checkpoint experiments/exp_003_species/best_model.pt \
  --output-dir experiments/exp_003_species \
  --split test

Test Results:

{
  "test_accuracy": 0.957,
  "macro_f1": 0.953
}

Test set: 95.7% accuracy, 95.3% macro F1, loss 0.166. Production-quality results!
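
These numbers can be recomputed from the crop-level predictions the eval script exports (predictions_test.csv, listed under the artifacts below). A sketch, with assumed column names:

Metric recomputation sketch (scikit-learn)
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

# Column names 'true_species' and 'pred_species' are assumptions; check the actual CSV header
preds = pd.read_csv("experiments/exp_003_species/predictions_test.csv")

acc = accuracy_score(preds["true_species"], preds["pred_species"])
macro_f1 = f1_score(preds["true_species"], preds["pred_species"], average="macro")
print(f"accuracy={acc:.3f}  macro_f1={macro_f1:.3f}")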


Reproducibility: Training Artifacts

All training artifacts are in experiments/exp_003_species/:

  • best_model.pt - Trained ResNet50 checkpoint (epoch 7, 283MB, Git LFS)
  • metrics.csv - Per-epoch training logs
  • metrics.json - Final test metrics + per-class F1
  • predictions_test.csv - Crop-level test predictions

Reproduce training:

python training/train_classifier.py \
  --config config/pipeline.yaml \
  --manifest data/crops_manifest.csv \
  --splits experiments/exp_003_autolabel/splits.json \
  --output-dir experiments/exp_003_species \
  --model resnet50

See the full training script at training/train_classifier.py.


Series Navigation

This is Part 4 of 5 in the Wildlife Tracking Uruguay series.

Previous: Part 3: The Tail-Wagging Paradox

Next: Part 5: Results & Impact - Explore the final test set performance (95.7% accuracy), confusion matrix analysis, production deployment, and key lessons learned from building this end-to-end wildlife tracking pipeline.

This post is licensed under CC BY 4.0 by the author.