
Wildlife Tracking Uruguay: Building an AI Pipeline for Conservation (Part 1)


Introduction & Motivation

Over the past few years, wildlife conservation has increasingly leaned on computer vision to scale monitoring and reduce the manual burden of species identification. In Uruguay, camera traps are deployed in forests, reserves, and farmlands to collect hours of video footage of local fauna. But turning that footage into reliable species-level counts remains a serious bottleneck: manual annotation is slow, error-prone, and difficult to scale.

I first encountered this challenge through the Cupybara Kaggle competition, which offered ~2,000 real trap-camera videos of Uruguay’s wildlife. I was drawn by its raw potential: unlike image-based wildlife datasets, these videos present temporal structure, motion, and continuity across frames—a richer but more complex domain to tackle. I wanted to experiment with video-based pipelines rather than static image classification.

Due to university commitments and work, I couldn’t formally enter the competition, but the dataset stuck with me. I saw in it an opportunity: to build something more than just a competition submission—to build an end-to-end system that could serve as a reproducible and extensible wildlife-analytics pipeline for Uruguay.

Why Videos Matter: Unlike static image datasets, video provides temporal continuity that enables tracking individual animals across frames, dramatically reducing redundant classifications and improving accuracy.


My Journey with AI Assistance

At the start of the year, I had doubts about how much “AI assistance” I should use; I had been skeptical of overreliance on large language models and code assistants. But on this project, I treated AI not as a crutch but as a collaborator. Inspired by the notion of a cybernetic teammate, I consulted relevant literature (e.g. The Cybernetic Teammate paper) and experimented with GPT-5 for ideation and design discussion, Claude for scripting and shell tasks, and Codex for nontrivial coding routines.

This case study documents that collaboration. You’ll see throughout where AI generated scaffolding and where I refined based on domain expertise. The goal is transparency: show what AI can and cannot do in real ML system development.


Project Scope & Goals

Mission: Build an automated pipeline that takes raw Uruguayan camera trap videos and outputs species-level counts.

Target species: 11 Uruguayan fauna classes

armadillo, bird, capybara, cow, dusky_legged_guan, gray_brocket,
hare, human, margay, skunk, wild_boar

Dataset:

  • ~2000 videos total in the Cupybara dataset
  • 87 videos used for training (manually labeled at video-level)
  • 16 videos used for testing/tuning
  • 1080p @ 30fps, 10-60 seconds each

Success criteria:

  • Automated end-to-end pipeline (no manual crop labeling)
  • Production-quality accuracy (>90% on test set)
  • Reproducible experiments (all decisions version-controlled)
  • Real-world deployment (process new videos in minutes, not hours)

Why This Matters

Chapter Summary: Automating the pipeline unlocks faster wildlife monitoring, region-specific models, and reproducible research handoffs.

Beyond the technical challenge, this project addresses real conservation needs:

Scaling Wildlife Monitoring

Manual video review is the bottleneck in camera trap studies. Researchers spend hundreds of hours watching footage to generate species counts. An automated pipeline can:

  • Process videos 17x faster than manual review
  • Enable larger-scale deployments (1000+ cameras feasible)
  • Free researchers to focus on ecological analysis, not annotation

Region-Specific Models

Most wildlife ML tools are trained on North American or African species. Uruguayan fauna (margays, capybaras, dusky-legged guans) are underrepresented. This pipeline demonstrates how to build region-specific classifiers from scratch using weak supervision.

Reproducible ML for Conservation

Conservation projects often lack ML expertise. By open-sourcing this pipeline with comprehensive documentation, I hope other researchers can adapt it to their regions/species without starting from zero.


Technical Approach at a Glance

Chapter Summary: Three modular stages—detection, tracking, classification—turn raw camera trap footage into reliable species counts.

I designed a three-stage modular pipeline:

Pipeline Architecture Three-stage modular pipeline: Detection → Tracking → Classification → Species Counts

Why this architecture?

  • Separation of concerns: Detection (where?), Tracking (which?), Classification (what?)
  • Leverage pretrained models: MegaDetector for detection, ImageNet for classification
  • Exploit temporal structure: Tracking reduces redundant classification (one track = one animal)

Key Innovation - Ultra-Permissive IoU Tracking: Solved the “tail-wagging paradox” by recognizing that animals aren’t rigid objects. Traditional tracking parameters caused 4+ tracks per animal; optimized parameters achieved 1.2 tracks per video.

Other major innovations:

  • Video-level weak supervision - Labeled videos, not crops (73 hours saved)
  • Hybrid frame sampling - Balanced confidence + temporal diversity for crop quality (sketched below)
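
To make the hybrid sampling idea concrete, here is a minimal sketch of one way to balance detection confidence against temporal spread when picking crops from a track. The function name, scoring weights, and crop count are illustrative assumptions, not the pipeline’s exact logic (that is covered in Part 4).

def sample_track_frames(track, n_crops=3, conf_weight=0.5):
    """Pick frames from one track, balancing confidence and temporal spread.

    track: list of (frame_index, confidence) pairs for a single animal.
    """
    if len(track) <= n_crops:
        return [frame for frame, _ in track]

    first, last = track[0][0], track[-1][0]
    span = max(last - first, 1)
    # Evenly spaced target positions across the track's lifetime
    targets = [first + i * span / (n_crops - 1) for i in range(n_crops)]

    chosen = []
    for target in targets:
        # Score = detection confidence minus a penalty for distance from the target
        best_frame, _ = max(
            (f for f in track if f[0] not in chosen),
            key=lambda f: conf_weight * f[1] - (1 - conf_weight) * abs(f[0] - target) / span,
        )
        chosen.append(best_frame)
    return sorted(chosen)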

Development Timeline

Chapter Summary: Three focused weeks of experiments (detection, tracking, classification) brought the pipeline from idea to production.

This project took 3 weeks from start to production:

Week 1: MegaDetector Calibration

  • Tested 27 parameter configurations
  • Found optimal balance: 2x speedup + 15% recall gain
  • Experiment: exp_001_md_calibration

Week 2: ByteTrack Implementation

  • Discovered track fragmentation problem (4+ tracks per animal)
  • Tuned ultra-permissive matching parameters (1.2 tracks/video)
  • Experiment: exp_002_tracking

Week 3: Classification Pipeline

  • Auto-labeled 912 crops using filename metadata
  • Trained ResNet50 to 95.7% test accuracy
  • Built end-to-end inference wrapper
  • Experiments: exp_003_autolabel, exp_003_species, exp_004_counts

Total development time: ~60 hours (vs. estimated 120-160 hours without AI assistance)


Results Preview

Before diving into the technical details, here’s what we achieved:

Final Results:

  • Test Accuracy: 95.7%
  • Macro F1 Score: 95.3%
  • Perfect F1 (100%) on 3 species: cow, gray_brocket, wild_boar
  • Dataset: 912 labeled crops across 11 Uruguayan species

Efficiency Gains:

  • Labeling time: 76 hours → 3 hours (96% reduction)
  • Processing speed: 5 min/video → 18 sec/video (17x faster)
  • Full dataset processing: 14 hours → 1.5 hours

Artifacts Created:

  • 912-crop labeled dataset (11 species)
  • Trained ResNet50 checkpoint (283MB)
  • End-to-end inference pipeline (one-command deployment)
  • Open-source repository (MIT license)

How to Read This Series

This 5-part series is designed for different reading depths. Choose your path:

🚀 Quick Overview (Part 1 only)

Stay on this page! Part 1 gives you the complete story: motivation, architecture, key innovations, and results. Perfect if you want to understand what was built and why it matters without diving into implementation details.

🔬 Technical Deep-Dive (All 5 parts, Recommended!)

Follow the series chronologically to see the full development journey:

  • Part 2: MegaDetector calibration (parameter sweeps, Pareto analysis)
  • Part 3: ByteTrack adaptation (solving the tail-wagging paradox)
  • Part 4: Weak supervision strategy (auto-labeling 912 crops)
  • Part 5: Production deployment (error analysis, lessons learned)

Each part includes code snippets, experimental results, and reproducibility artifacts.

⚡ Reproducibility (Series + GitHub)

Read all parts + clone the repository to run experiments yourself. Each section links to actual scripts, configs, and data artifacts.

Choose your depth and dive in!


Reproducing This Work: Two Paths

Repository Available: All code, configs, and documentation are open-source at github.com/JuanMMaidana/Wildlife-Tracking-Uruguay

If you want to reproduce this work, there are two distinct paths depending on your goal:

🔬 Path 1: Full Training Pipeline (From Scratch) - Train your own classifier from raw videos

Use this if you want to:

  • Train your own classifier from raw videos
  • Reproduce the 95.7% accuracy model
  • Experiment with different species or datasets
  • Follow the complete development journey

What you need:

# 1. Setup environment
git clone https://github.com/JuanMMaidana/Wildlife-Tracking-Uruguay.git
conda env create -f environment.yml
conda activate megadetector-pipeline

# 2. Download MegaDetector weights
mkdir -p models/detectors
curl -L -o models/detectors/md_v5a.0.0.pt \
  https://github.com/microsoft/CameraTraps/releases/download/v5.0/md_v5a.0.0.pt

# 3. Add videos to data/dataset-v1/
# ⚠️ IMPORTANT: Videos must follow naming convention!
# Format: species_XXX.mp4 (e.g., margay_012.mp4, capybara_045.mp4)
# Get Cupybara dataset from Kaggle competition

# 4. Run full pipeline (Steps 1-7)
python scripts/10_run_md_batch.py --config config/pipeline.yaml --video-dir data/dataset-v1
python scripts/20_run_tracking.py --config config/pipeline.yaml --md-json data/md_json --video-root data/dataset-v1
python scripts/31_autolabel_from_filenames.py --config config/pipeline.yaml --tracks-json data/tracking_json --video-root data/dataset-v1 --out-dir data/crops --manifest data/crops_manifest.csv
python training/prepare_split.py --config config/pipeline.yaml --manifest data/crops_manifest.csv --out-dir experiments/exp_003_autolabel --strategy crop
python training/train_classifier.py --config config/pipeline.yaml --manifest data/crops_manifest.csv --splits experiments/exp_003_autolabel/splits.json --output-dir experiments/exp_003_species --model resnet50
python training/eval_classifier.py --config config/pipeline.yaml --manifest data/crops_manifest.csv --splits experiments/exp_003_autolabel/splits.json --checkpoint experiments/exp_003_species/best_model.pt --output-dir experiments/exp_003_species --split test
python scripts/40_counts_by_species.py --manifest data/crops_manifest.csv --predictions experiments/exp_003_species/predictions_test.csv --out-dir experiments/exp_004_counts

Directory: data/dataset-v1/ (requires species naming convention)

Time: Hours (depends on dataset size + GPU)


⚡ Path 2: Inference Only (Pretrained Model) - Quick demo with trained model

Use this if you want to:

  • Test the trained model on new videos
  • Run a quick demo without training
  • Deploy the pipeline in production

What you need:

# 1. Setup environment (same as above)
git clone https://github.com/JuanMMaidana/Wildlife-Tracking-Uruguay.git
conda env create -f environment.yml
conda activate megadetector-pipeline

# 2. Download MegaDetector weights (same as above)
mkdir -p models/detectors
curl -L -o models/detectors/md_v5a.0.0.pt \
  https://github.com/microsoft/CameraTraps/releases/download/v5.0/md_v5a.0.0.pt

# 3. Add ANY videos to data/video_inference/
# ✅ No naming convention required!
# Just drop your wildlife videos here

# 4. Run one-command inference
python scripts/run_inference_pipeline.py \
  --videos data/video_inference \
  --checkpoint experiments/exp_003_species/best_model.pt \
  --output experiments/exp_005_inference

Directory: data/video_inference/ (any video names work)

Time: Minutes (18 sec/video)


📊 Key Differences Between Paths
| Aspect | Path 1: Training | Path 2: Inference |
| --- | --- | --- |
| Video directory | data/dataset-v1/ | data/video_inference/ |
| Naming convention | Required (species_XXX.mp4) | Not required (any name) |
| Pretrained model | Optional (you’ll train your own) | Required (best_model.pt) |
| Time to run | Hours (depends on dataset size + GPU) | Minutes (18 sec/video) |
| Output | Trained classifier + metrics | Species predictions + counts |
| Use case | Research, experimentation, custom datasets | Demo, production, new videos |


Recommendation for First-Time Users: Start with Path 2 (Inference) to quickly see the pipeline in action. Then explore Path 1 (Training) if you want to understand how the model was built or adapt it to your own species.


System Architecture: A Three-Stage Pipeline

Chapter Summary: This section explains the modular pipeline design that transforms raw camera trap videos into species counts through detection, tracking, and classification stages.

When I first looked at the Cupybara video dataset, I faced a fundamental design question: how do you go from raw camera trap footage to species-level counts?

The naive approach might be to throw every video frame through a species classifier. But that’s wasteful—most frames are empty (no animals), and even when animals appear, you’d classify the same individual hundreds of times across consecutive frames. I needed something smarter: a pipeline that respects the temporal structure of videos.

The Three-Stage Design

I settled on a modular, three-stage architecture:

Stage 1: Detection (MegaDetector)
  ↓
Stage 2: Tracking (ByteTrack)
  ↓
Stage 3: Classification (ResNet50)
  ↓
Output: Species Counts

Design Principle: Separation of concerns allows each stage to be optimized independently and models to be swapped without rewriting the entire pipeline.

Let me walk through the rationale for each stage.


Stage 1: Detection - “Where is the animal?”

Goal: Find all animal detections across video frames, filtering out empty scenes.

Tool: MegaDetector v5a - a YOLOv5-based object detector pretrained on ~5 million camera trap images.

Why MegaDetector? I didn’t want to train a detector from scratch. MegaDetector is already excellent at the general task of “find animals in camera trap footage,” even if it doesn’t know Uruguayan species specifically. It gives me bounding boxes and confidence scores—that’s all I need for this stage.

Key Design Decision: Frame stride optimization. I don’t process every single frame (wasteful). Instead, I sample frames at a stride (e.g., every 2nd frame). This balances speed vs. recall—animals move slowly in camera traps, so consecutive frames are highly redundant.
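
As a rough sketch (not the exact implementation in scripts/10_run_md_batch.py), stride-based sampling with OpenCV looks like the snippet below; the function name and default stride are assumptions, and the real stride presumably comes from config/pipeline.yaml.

import cv2

def sample_frames(video_path, stride=2):
    """Yield (frame_index, frame) for every stride-th frame of a video."""
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            yield idx, frame  # BGR numpy array, ready to hand to the detector
        idx += 1
    cap.release()

With 30 fps footage, a stride of 2 halves the number of detector calls while still sampling 15 frames per second of video.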

Detection output format (JSON):
{
  "video": "capybara_012.mp4",
  "detections": [
    {
      "frame": 15,
      "detections": [
        {"bbox": [100, 50, 300, 250], "conf": 0.89, "class": 1}
      ]
    }
  ]
}

Code Reference: scripts/10_run_md_batch.py


Stage 2: Tracking - “Which detections belong to the same animal?”

Goal: Group per-frame detections into persistent tracks (one track = one animal’s trajectory through the video).

Tool: ByteTrack - a tracking algorithm originally designed for pedestrian tracking, but which I adapted for wildlife.

Why Tracking? This is where the video structure pays off. If I can identify that detections in frames 10, 15, 20, 25 all belong to the same capybara, I only need to classify it once (or a few times for confidence), not 100 times. Tracking reduces downstream classification load and gives me cleaner data—one track = one animal.

The Challenge: Wildlife tracking is harder than pedestrian tracking. Animals aren’t rigid objects. A cow wagging its tail changes its bounding box shape dramatically—traditional tracking algorithms interpret this as a “new object” and create fragmented tracks. I had to tune ByteTrack’s parameters aggressively to handle biological motion.

Critical Challenge - The Tail-Wagging Paradox: Default ByteTrack parameters caused 4+ tracks per animal due to shape deformation from biological motion. Solution: ultra-permissive IoU thresholds achieved 1.2 tracks/video.
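
To illustrate the direction of the fix (not ByteTrack’s exact internals), here is a toy IoU association step with a deliberately permissive threshold. The box format, threshold value, and greedy matching are assumptions for illustration; the tuned parameters are documented in exp_002_tracking and Part 3.

def iou(a, b):
    """Intersection-over-union of two boxes in [x1, y1, x2, y2] format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_detection(track_box, detections, iou_threshold=0.1):
    """Attach a new detection to an existing track.

    A low (permissive) threshold keeps a deforming animal, e.g. a cow
    wagging its tail, on the same track instead of spawning a new one.
    """
    best = max(detections, key=lambda d: iou(track_box, d), default=None)
    if best is not None and iou(track_box, best) >= iou_threshold:
        return best
    return None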

Tracking output format (JSON):
{
  "video": "capybara_012.mp4",
  "tracks": [
    {
      "track_id": 1,
      "frames": [15, 20, 25, 30],
      "boxes": [[100,50,300,250], [102,51,301,249], ...],
      "confs": [0.89, 0.91, 0.87, 0.90]
    }
  ]
}

Code Reference: scripts/20_run_tracking.py


Stage 3: Classification - “What species is this animal?”

Goal: Identify the species for each track.

Tool: ResNet50 classifier, fine-tuned on Uruguayan fauna crops extracted from tracks.

Why ResNet50? It’s a proven architecture with ImageNet pretraining. I didn’t need cutting-edge transformers—I needed something reliable, fast to train, and well-understood. Transfer learning from ImageNet gave me a strong starting point for 11 Uruguayan species.
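
For reference, the transfer-learning setup can be as small as the sketch below: torchvision’s ImageNet-pretrained ResNet50 with its 1000-class head swapped for an 11-way species head. The real training loop, augmentations, and hyperparameters live in training/train_classifier.py and may differ.

import torch.nn as nn
from torchvision import models

NUM_SPECIES = 11  # armadillo ... wild_boar

# Start from ImageNet weights, then replace the classification head
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, NUM_SPECIES)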

The Data Challenge: To train a classifier, I need labeled crops. Manually labeling 900+ crops would take ~76 hours. Instead, I exploited a clever shortcut: video filenames contain species metadata. Files like margay_012.mp4 or capybara_034.mp4 encode the ground truth. I built an auto-labeling pipeline that:

  1. Extracts crops from tracks
  2. Parses species from filenames (with regex validation)
  3. Applies quality filters (minimum track length, confidence thresholds)

Time-Saving Innovation: Video-level weak supervision (using filename metadata) reduced labeling time from 76 hours to 3 hours - a 96% reduction without sacrificing accuracy.
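
A minimal sketch of the filename-to-label step, assuming a strict species_XXX.mp4 pattern; the actual parsing and validation live in scripts/31_autolabel_from_filenames.py.

import re

KNOWN_SPECIES = {
    "armadillo", "bird", "capybara", "cow", "dusky_legged_guan",
    "gray_brocket", "hare", "human", "margay", "skunk", "wild_boar",
}

FILENAME_PATTERN = re.compile(r"^([a-z_]+)_(\d+)\.mp4$")

def species_from_filename(filename):
    """Return the species encoded in a video filename, or None if it does not validate."""
    match = FILENAME_PATTERN.match(filename)
    if not match:
        return None
    species = match.group(1)
    return species if species in KNOWN_SPECIES else None

For example, species_from_filename("margay_012.mp4") returns "margay"; anything that does not match the pattern or the known species list returns None.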

Code References: scripts/31_autolabel_from_filenames.py (auto-labeling), training/train_classifier.py (training), training/eval_classifier.py (evaluation)


Final Stage: Aggregation - “How many of each species per video?”

Goal: Roll up track-level predictions into video-level species counts.

Algorithm: Simple majority vote per track, then count unique tracks per species per video.
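
A compact sketch of that aggregation, assuming crop-level predictions keyed by (video, track_id); the actual implementation is scripts/40_counts_by_species.py.

from collections import Counter

def aggregate_counts(crop_predictions):
    """crop_predictions: iterable of (video, track_id, predicted_species), one row per crop."""
    # 1. Majority vote across the crops of each track
    votes = {}
    for video, track_id, species in crop_predictions:
        votes.setdefault((video, track_id), Counter())[species] += 1
    track_species = {key: counter.most_common(1)[0][0] for key, counter in votes.items()}

    # 2. Count unique tracks per species per video
    return Counter((video, species) for (video, _), species in track_species.items())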

Output Format: CSV of species counts

| Video | Species | Track Count |
| --- | --- | --- |
| capybara_001.mp4 | capybara | 2 |
| margay_034.mp4 | margay | 1 |

Code Reference: scripts/40_counts_by_species.py


Why This Architecture?

Three principles guided my design:

Separation of Concerns

Each stage solves one well-defined problem:

  • Detection: spatial localization (where?)
  • Tracking: temporal association (which?)
  • Classification: semantic labeling (what?)

This modularity means I can swap components independently. If a better detector comes out, I can drop it in without touching the classifier.

Leverage Pretrained Models

I didn’t train a detector from scratch. MegaDetector gave me “free” detection performance on camera traps. Transfer learning from ImageNet gave me a strong classifier baseline. I focused my limited training budget where it mattered: fine-tuning on Uruguayan species.

Exploit Video Structure

Unlike image datasets, videos have temporal continuity. Tracking exploits this: instead of treating frames independently, I group them into meaningful trajectories. This reduces redundant computation and gives me cleaner training data (one crop per track, not one per frame).


Repository & Code

GitHub: github.com/JuanMMaidana/Wildlife-Tracking-Uruguay

License: MIT (open for research and commercial use)

Quick start:
# Clone repository
git clone https://github.com/JuanMMaidana/Wildlife-Tracking-Uruguay
cd Wildlife-Tracking-Uruguay

# Setup environment
conda env create -f environment.yml
conda activate megadetector-pipeline

# Run end-to-end inference
python scripts/run_inference_pipeline.py \
  --videos data/video_inference \
  --checkpoint experiments/exp_003_species/best_model.pt \
  --output experiments/exp_005_inference

Acknowledgments

Tools & Models:

  • MegaDetector v5a (Microsoft CameraTraps) - animal detection
  • ByteTrack - multi-object tracking
  • ResNet50 (ImageNet pretraining, fine-tuned here) - species classification

AI Assistants:

  • Claude Code (Anthropic) - Primary coding collaborator
  • GPT-5 (OpenAI) - Design discussions and ideation
  • Codex (OpenAI) - Secondary coding collaborator

Data:

  • Cupybara Kaggle competition dataset (~2,000 camera trap videos of Uruguayan wildlife)

Series Navigation

This is Part 1 of 5 in the Wildlife Tracking Uruguay series.

Next: Part 2: MegaDetector Calibration - Learn how I systematically tested 27 parameter configurations to optimize MegaDetector for small Uruguayan animals, achieving 2x speedup + 15% recall gain over defaults.

This post is licensed under CC BY 4.0 by the author.