
Wildlife Tracking Uruguay: Production Deployment & Lessons Learned (Part 5)

Quick Recap: The Full Pipeline

In this 5-part series, we’ve built an end-to-end wildlife tracking pipeline:

  • Part 1: Introduced the 3-stage architecture (detection → tracking → classification)
  • Part 2: Calibrated MegaDetector for 2x speedup + 15% recall gain
  • Part 3: Solved the tail-wagging paradox with ultra-permissive IoU tracking
  • Part 4: Auto-labeled 912 crops using weak supervision, trained ResNet50 to 95.7% accuracy

Now in this final part, we examine production deployment, results analysis, and key lessons learned.


Results & Impact: Production Deployment

Chapter Summary: The final model reaches 95%+ accuracy and macro F1, delivers 9× faster review cycles, and outputs clean species counts ready for conservation workflows.

Results Preview:

  • Macro F1: 95.3% on held-out test set
  • Throughput: 18 s/video end-to-end (≈200 videos/hour)
  • Artifacts: Counts CSV + labeled videos for rapid stakeholder review

After 3 weeks of development, I had a complete pipeline: MegaDetector → ByteTrack → ResNet50 classifier. Time to see how it performs in the real world.


Test Set Performance: The Numbers

Running the best checkpoint (epoch 7) on the held-out test set:

Test Accuracy:  95.7%
Macro F1 Score: 95.3%
Test Loss:      0.166

Per-Class F1 Scores:

Species             F1 Score   Support (Test Crops)
cow                 100%       18
gray_brocket        100%       12
wild_boar           100%       14
human               97.0%      20
margay              95.7%      13
hare                95.2%      12
skunk               95.2%      12
bird                94.7%      11
dusky_legged_guan   94.1%      19
capybara            91.7%      15
armadillo           84.2%      13

Three species achieved perfect 100% F1! (cow, gray_brocket, wild_boar)

Lowest performer: Armadillo at 84.2% F1 (still respectable, but room for improvement).

Watchlist: Nighttime armadillo footage drags F1 to 84%—more labeled examples and better low-light augmentations are next.
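
For readers who want to reproduce these numbers, the accuracy, macro F1, and per-class scores all come straight out of scikit-learn once predictions are saved to disk. A minimal sketch (the predictions file and its column names are assumptions for illustration, not the repo's actual schema):

import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, classification_report

# Hypothetical dump of the test run: one row per test crop with true and predicted species
df = pd.read_csv("experiments/exp_003_species/test_predictions.csv")
y_true, y_pred = df["true_label"], df["pred_label"]

print(f"Test Accuracy:  {accuracy_score(y_true, y_pred):.1%}")
print(f"Macro F1 Score: {f1_score(y_true, y_pred, average='macro'):.1%}")

# Per-class F1 and support, matching the table above
print(classification_report(y_true, y_pred, digits=3))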


Confusion Matrix Analysis

The confusion matrix reveals where the model struggles:

Figure: Confusion matrix on the held-out test crops.
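
A heatmap like this takes only a few lines to regenerate with scikit-learn and matplotlib. A minimal sketch, reusing the assumed y_true / y_pred arrays from the evaluation snippet above:

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# y_true / y_pred: species labels for the held-out test crops (see previous sketch)
disp = ConfusionMatrixDisplay.from_predictions(
    y_true, y_pred, xticks_rotation=45, cmap="Blues", colorbar=False
)
disp.ax_.set_title("Confusion matrix - held-out test crops")
plt.tight_layout()
plt.savefig("confusion_matrix.png", dpi=150)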

Most confused pairs:

  1. Armadillo ↔ Skunk (2 misclassifications)
    • Both small, dark, nocturnal animals
    • Similar body shape when curled up
    • Often captured in low-light conditions
  2. Capybara ↔ Wild Boar (1 misclassification)
    • Similar size and brownish coloration
    • Both quadrupeds with stocky build
    • Confusion only in partial occlusion cases
  3. Bird species variations
    • High intra-class variance (size, color, posture)
    • Motion blur from flight
    • Still achieved 94.7% F1

Notably absent confusion: No margay ↔ human or cow ↔ bird errors. The model clearly learned meaningful species-specific features, not just superficial correlations.


Error Analysis: When Does It Fail?

I manually reviewed all misclassified test crops. Failure modes cluster around:

1. Extreme Occlusion (>50% hidden)

Example: Armadillo behind tall grass, only shell visible → classified as skunk

Mitigation: Could filter crops with low “visibility score” (future work).

2. Motion Blur

Example: Bird in mid-flight, wings blurred → low confidence across all bird species

Mitigation: Frame sampling already prioritizes high-conf frames, but fast-moving animals remain challenging.

3. Poor Lighting (Very Dark Nighttime)

Example: Nocturnal animal in near-complete darkness → weak features

Mitigation: Could add infrared-specific augmentation or train on more nighttime data.

4. Juvenile Animals

Example: Young capybara (smaller, different proportions) → brief confusion with hare

Mitigation: Need more juvenile examples in training set (data collection bias toward adults).

Key insight: Most errors are understandable even to humans. The model isn’t making nonsensical mistakes (e.g., confusing a cow with a bird). This suggests it’s learned robust visual features.
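
If you want to run the same kind of review, pulling out the misclassified crops is a few lines of pandas once predictions are on disk. A small sketch (paths and column names are hypothetical):

import pandas as pd

df = pd.read_csv("experiments/exp_003_species/test_predictions.csv")  # hypothetical dump
errors = df[df["true_label"] != df["pred_label"]]

# Which (true, predicted) confusions dominate?
summary = (errors.groupby(["true_label", "pred_label"])
                 .size()
                 .sort_values(ascending=False))
print(summary.head(10))

# Dump the offending crop paths for manual eyeballing
errors["crop_path"].to_csv("misclassified_crops.txt", index=False, header=False)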


Inference Speed: Production Performance

I measured end-to-end throughput on my RTX 3060 Ti (8GB VRAM):

Per-Stage Timing:

Stage 1 (MegaDetector):    ~8 FPS
Stage 2 (ByteTrack):       <1 second per video
Stage 3 (Classification):  64 crops/second

Total Pipeline (typical 30-second video):

Detection:      ~15 seconds
Tracking:       ~0.5 seconds
Crop extraction: ~2 seconds
Classification: ~1 second (assuming 50 crops)
Total:          ~18 seconds per video

Effective speedup: Manual review takes ~5 minutes/video. Automated pipeline takes ~18 seconds.

Processing rate: ~200 videos/hour (if I batch them efficiently)

Scaling to Production

For my 170-video test subset:

  • Automated: ~1.5 hours wall-clock (raw compute: 170 videos × 18 s ≈ 0.9 h)
  • Manual: ~14 hours (170 videos × 5 min ÷ 60)
  • Speedup: 9.3×

For the full 2,000-video Cupybara dataset:

  • Automated: ~10 hours (2,000 videos × 18s ÷ 3600)
  • Manual: ~167 hours (2,000 videos × 5 min ÷ 60)
  • Speedup: 16.7×

Production Impact: At full dataset scale (2,000 videos), the pipeline saves 157 hours of manual work - turning a 3-week labeling project into a half-day batch job.
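
These dataset-scale figures follow directly from the per-video timings; here's the back-of-the-envelope math as a tiny script:

SECONDS_PER_VIDEO_AUTO = 18       # measured end-to-end pipeline time
MINUTES_PER_VIDEO_MANUAL = 5      # typical manual review time
N_VIDEOS = 2000                   # full Cupybara dataset

videos_per_hour = 3600 / SECONDS_PER_VIDEO_AUTO           # ~200 videos/hour
auto_hours = N_VIDEOS * SECONDS_PER_VIDEO_AUTO / 3600      # ~10 hours
manual_hours = N_VIDEOS * MINUTES_PER_VIDEO_MANUAL / 60    # ~167 hours

print(f"Throughput: {videos_per_hour:.0f} videos/hour")
print(f"Automated: {auto_hours:.0f} h | Manual: {manual_hours:.0f} h | "
      f"Saved: {manual_hours - auto_hours:.0f} h ({manual_hours / auto_hours:.1f}x faster)")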


Real-World Deployment: End-to-End Inference

I wrapped the full pipeline into scripts/run_inference_pipeline.py for one-command deployment:

python scripts/run_inference_pipeline.py \
  --videos data/video_inference \
  --checkpoint experiments/exp_003_species/best_model.pt \
  --output experiments/exp_005_inference

This orchestrates:

  1. MegaDetector on all videos → md_json/
  2. ByteTrack on detections → tracking_json/
  3. Crop extraction from tracks
  4. ResNet50 classification → predictions.csv
  5. Species count aggregation → counts/results.csv
  6. (Optional) Labeled video rendering → videos_labeled/
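
Under the hood this is sequential orchestration of the stage scripts. As a rough illustration of the pattern (the per-stage script names and flags below are placeholders, not the repo's actual interfaces), it looks something like:

import subprocess
from pathlib import Path

def run_pipeline(videos: Path, checkpoint: Path, output: Path) -> None:
    """Chain the pipeline stages by shelling out to per-stage scripts (placeholder names)."""
    output.mkdir(parents=True, exist_ok=True)
    stages = [
        ["python", "scripts/run_megadetector.py", "--videos", str(videos),
         "--out", str(output / "md_json")],
        ["python", "scripts/run_tracking.py", "--detections", str(output / "md_json"),
         "--out", str(output / "tracking_json")],
        ["python", "scripts/extract_crops.py", "--tracks", str(output / "tracking_json"),
         "--out", str(output / "crops")],
        ["python", "scripts/classify_crops.py", "--crops", str(output / "crops"),
         "--checkpoint", str(checkpoint), "--out", str(output / "predictions.csv")],
    ]
    for cmd in stages:
        subprocess.run(cmd, check=True)  # fail fast if any stage errors out

run_pipeline(Path("data/video_inference"),
             Path("experiments/exp_003_species/best_model.pt"),
             Path("experiments/exp_005_inference"))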

Example output from experiments/exp_005_inference/counts/results.csv:

video,species,track_count
capybara_new_001.mp4,capybara,2
margay_new_034.mp4,margay,1
cow_new_0777.mp4,cow,4
bird_new_089.mp4,bird,1

Clean, structured data ready for ecological analysis.
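
Producing that counts file from track-level predictions is essentially one groupby. A sketch, assuming predictions.csv has video, track_id, and species columns (the actual schema may differ):

import pandas as pd
from pathlib import Path

preds = pd.read_csv("experiments/exp_005_inference/predictions.csv")

# One row per (video, species) with the number of distinct tracks assigned that label
counts = (preds.groupby(["video", "species"])["track_id"]
               .nunique()
               .reset_index(name="track_count"))

out_dir = Path("experiments/exp_005_inference/counts")
out_dir.mkdir(parents=True, exist_ok=True)
counts.to_csv(out_dir / "results.csv", index=False)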


Species Count Visualization

I generated a species distribution plot from the labeled video set:

Figure: Distribution of predicted species across the labeled videos.

Observations:

  • Balanced representation: No species dominates (cow slightly higher due to herd videos)
  • Diverse ecosystem: 11 species captured across 87 videos
  • Low-frequency species: Margay (elusive carnivore) has fewer sightings, as expected

Dataset Health: Every species clears 60+ crops—perfect for balanced fine-tuning.

This distribution makes ecological sense for Uruguayan camera trap deployments.
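
A plot like the one above is easy to regenerate from the counts CSV; a minimal sketch with pandas and matplotlib:

import pandas as pd
import matplotlib.pyplot as plt

counts = pd.read_csv("experiments/exp_005_inference/counts/results.csv")
totals = counts.groupby("species")["track_count"].sum().sort_values(ascending=False)

totals.plot(kind="bar", figsize=(8, 4), color="seagreen")
plt.ylabel("Predicted track count")
plt.title("Species distribution")
plt.tight_layout()
plt.savefig("species_distribution.png", dpi=150)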


Crop Quality Visualization

To show what the model is learning from, I generated a crop grid with 3 examples per species:

Figure: Sample crops showing the pose and lighting diversity the model sees.

Quality checks:

  • Sharp, well-framed animals (hybrid sampling worked!)
  • Diverse poses (animals standing, grazing, moving)
  • Varied lighting (day/night, sunny/cloudy)
  • Minimal occlusion (guardrails filtered out bad crops)

This visual proof validates that auto-labeling + quality filters produced training-worthy data.
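
For completeness, here's roughly how a crop grid like this can be assembled with matplotlib and Pillow (the per-species crop folder layout is an assumption):

import random
from pathlib import Path

import matplotlib.pyplot as plt
from PIL import Image

CROPS_DIR = Path("data/crops")     # assumed layout: data/crops/<species>/*.jpg
species_dirs = sorted(d for d in CROPS_DIR.iterdir() if d.is_dir())
n_examples = 3

fig, axes = plt.subplots(len(species_dirs), n_examples,
                         figsize=(2 * n_examples, 2 * len(species_dirs)))
for ax in axes.flat:
    ax.axis("off")                 # hide axes everywhere; we only show images

for row, sp_dir in zip(axes, species_dirs):
    crops = list(sp_dir.glob("*.jpg"))
    for ax, crop_path in zip(row, random.sample(crops, min(n_examples, len(crops)))):
        ax.imshow(Image.open(crop_path))
        ax.set_title(sp_dir.name, fontsize=8)

plt.tight_layout()
plt.savefig("crop_grid.png", dpi=150)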


Impact Metrics: What Did We Achieve?

Let me quantify the real-world impact:

Time Savings

Stage                Duration   Notes
Manual annotation    76 hours   Crop-level labeling
Automated pipeline   3 hours    Video-level labeling + validation
Time saved           73 hours   96% reduction

Processing Speedup

Approach                    Speed                  Impact
Manual review               5 minutes/video        Traditional workflow
Automated pipeline          18 seconds/video       AI-powered
Speedup                     ~17×                   per video
Full dataset (170 videos)   14 hours → 1.5 hours   9.3× faster

Model Performance

Metric               Value
Test accuracy        95.7%
Macro F1             95.3%
Perfect F1 species   3 (cow, gray_brocket, wild_boar)

Dataset Created

Metric                 Value
Total crops            912
Species coverage       11 Uruguayan fauna
Train/val/test split   634 / 139 / 139
Reproducible           Yes (all configs/splits version-controlled)

Artifacts Published

Artifact          Details
Trained model     283MB ResNet50 checkpoint (Git LFS)
Code repository   Open-source (MIT license)
Documentation     6 implementation guides + experiment reports
Reproducibility   Full pipeline runnable with one command

Limitations & Future Work

No project is perfect. Here’s what I’d improve:

Current Limitations

  1. Data leakage risk (crop-level splits)
    • Crops from same video may span train/test
    • Test metrics may be slightly optimistic
    • Mitigation: Tested on held-out videos (not in 87-video set)
  2. Single-species assumption
    • Auto-labeling assumes 1 dominant species per video
    • Multi-species scenes required manual removal (2 videos)
    • Scaling to larger datasets needs multi-label support

Dataset-Specific Insight: This 1-video = 1-species assumption worked for the Cupybara dataset after manual inspection, but it’s not universal. I spent hours auditing the videos to confirm this held true. In other camera trap scenarios (e.g., watering holes, feeding stations), you’ll frequently see mixed-species footage where this assumption completely breaks down. Always validate your dataset’s characteristics before designing your labeling strategy!

  3. Small dataset (912 crops)
    • Limited pose diversity for rare species (margay: 72 crops)
    • Vulnerable to distribution shift (new camera locations)
    • Would benefit from 10x more data
  4. Armadillo performance (84% F1)
    • Confused with skunk in low-light scenarios
    • Need more training examples or better night-vision augmentation

Short-Term Improvements (1-2 months)

  1. Expand dataset to 500+ videos
    • Collect more diverse conditions (weather, seasons)
    • Intentionally target rare species (margay, gray_brocket)
  2. Multi-species scene handling
    • Detect multiple species per video
    • Track-to-species association (not video-to-species)
  3. Confidence calibration (see the sketch after this list)
    • Temperature scaling for uncertainty quantification
    • Reject predictions below confidence threshold
  4. Active learning loop
    • Use model uncertainty to guide manual labeling
    • Prioritize low-confidence crops for human review
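
To make the calibration idea concrete, here's a minimal sketch of temperature scaling in PyTorch: fit a single scalar T on validation logits by minimizing NLL, then divide test logits by T before the softmax and reject anything below a confidence threshold. Variable names and the 0.7 threshold are assumptions, not values from this project:

import torch
import torch.nn.functional as F

def fit_temperature(val_logits: torch.Tensor, val_labels: torch.Tensor) -> float:
    """Learn a single temperature T > 0 on held-out validation logits."""
    log_t = torch.zeros(1, requires_grad=True)   # optimize log(T) so T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()

# Usage sketch: calibrated probabilities + rejection of low-confidence predictions
# T = fit_temperature(val_logits, val_labels)
# probs = F.softmax(test_logits / T, dim=1)
# keep = probs.max(dim=1).values >= 0.7        # example threshold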

Long-Term Vision (6-12 months)

  1. Edge deployment (Jetson Nano, Coral TPU)
    • On-camera processing for remote traps
    • Real-time species alerts (e.g., endangered species detected)
  2. Behavior classification beyond species
    • Feeding, mating, territorial behaviors
    • Temporal action recognition (3D CNNs or transformers)
  3. Collaboration with Uruguayan researchers
    • Deploy on real conservation projects
    • Contribute to ecological studies
  4. Generalization to other regions
    • Fine-tune on Argentinian or Brazilian fauna
    • Build region-agnostic “base model” + regional fine-tuning

Key Takeaways

This project represents 3 weeks of focused AI-assisted implementation, but the path here took months of research, experimentation, and failed attempts. I tested multiple architectures (I started with straight-up YOLO models, labeling raw data directly in CVAT, which somehow managed to corrupt my Docker setup and forced me to start all over again 😅), explored various labeling strategies, and iterated through several non-optimal designs before landing on this modular approach. The 3-week timeline reflects the final, successful build, not the full journey of learning what works and what doesn't.

With that context, here’s what I ultimately built:

Technical Achievements

  • 95.7% test accuracy on 11 Uruguayan species
  • 3-stage modular architecture (detection → tracking → classification)
  • Fully automated end-to-end inference (18 seconds/video)
  • Reproducible experiments (all configs, splits, checkpoints version-controlled)

Methodological Innovations

  • Ultra-permissive IoU tracking (solved tail-wagging paradox)
  • Video-level weak supervision (73 hours saved vs crop-level labeling)
  • Hybrid frame sampling (balanced confidence + temporal diversity)
  • Quality guardrails (prevented garbage labels, 98% auto-label accuracy)

Development Process Insights

  • AI-assisted coding (Claude Code generated ~80% of the initial implementation)
  • Experiment-driven tuning (27 MegaDetector configs, 3 tracking iterations)
  • Living documentation (implementation guides tracked real-time progress)
  • Rapid iteration (3 weeks from start to production, vs. estimated 6-8 weeks)

Beyond the technical side, I approached this repo as professionally as I possibly could. In some of my earlier projects, I had no attached code, or just a single Jupyter notebook. But for this one, I wanted to build something that felt production-grade, not just a demo.

So I focused on learning and applying real software engineering practices: clear structure, modular scripts, consistent logging, versioned configs, and detailed documentation. The repo includes commit histories from Claude Code, each one with precise explanations of what changed and why, along with issue tracking, example fixes, and carefully reviewed pull requests.

This process took real time and effort, but it’s one of the parts I’m most proud of. It taught me how to work with an AI assistant as a teammate, not just as a code generator, and how to build a complete, maintainable, end-to-end pipeline, the way a real ML engineering team would.


Open Source Release

The full pipeline is available at: github.com/JuanMMaidana/Wildlife-Tracking-Uruguay

Included:

  • Complete source code (detection, tracking, training, inference)
  • Trained ResNet50 checkpoint (283MB, Git LFS)
  • Sample dataset (10 videos across all species)
  • Comprehensive documentation (guides, experiment reports, API docs)
  • Unit tests + reproducibility scripts

License: MIT (open for research and commercial use)

Citation: If you use this pipeline, please cite this repository and the underlying tools (MegaDetector, ByteTrack).


Closing Thoughts

When I started this project, I was skeptical about AI-assisted development. Could Claude Code really accelerate my work, or would I spend more time debugging generated code than writing it myself?

The verdict: AI assistance was transformative.

Claude Code generated the scaffolding; I refined it based on real data. GPT-5 helped me think through design decisions. This hybrid approach let me focus on the creative, domain-specific challenges (tail-wagging paradox, weak supervision strategy) while automating the boilerplate (data loaders, training loops, evaluation metrics).

The 3-week timeline speaks for itself. Without AI assistance, this would have taken an estimated 6-8 weeks. The ~2.5x speedup wasn't just about typing faster—it was about iterating faster. I could test ideas quickly, fail fast, and converge on solutions backed by data.

But AI didn’t replace domain expertise. It accelerated implementation, but I still had to:

  • Understand wildlife behavior (biological motion challenges)
  • Design clever data strategies (video-level labeling)
  • Interpret results (error analysis, failure modes)
  • Make judgment calls (crop-level vs video-level splits)

This project represents a new way of building ML systems: human insight + AI execution. I’m excited to see where this paradigm leads.


Want to build your own wildlife classifier? Check out the full repository and feel free to reach out with questions!


Pipeline in Action: Real Inference Results

Before wrapping up, here are a few short clips showing the full wildlife detection & tracking pipeline in action. Each video runs through the complete stack: MegaDetector → ByteTrack → ResNet50 classification → Labeled output. All detections, tracks, and predictions are produced in real-time using the same configuration described throughout this series.

Watch how the pipeline handles diverse scenarios—different species, lighting conditions, occlusions, and motion patterns:

What You’re Seeing: Green bounding boxes show MegaDetector’s animal detections, track IDs persist across frames via ByteTrack, and species labels come from the ResNet50 classifier. The pipeline processes everything end-to-end without manual intervention.

Notice the flickering? The bounding boxes sometimes appear to "blink" because we're processing frames with stride=2 (analyzing every 2nd frame), so boxes are only drawn on the analyzed frames. Processing every frame (stride=1) would eliminate the flicker but roughly double detection time, while a larger stride would miss more detections. This is the speed/accuracy tradeoff we calibrated in Part 2!
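
As a reminder of what the stride means in practice, here's a tiny sketch of stride-based frame sampling with OpenCV (the video path is just an example):

import cv2

cap = cv2.VideoCapture("data/video_inference/capybara_new_001.mp4")   # example path
frame_stride = 2        # analyze every 2nd frame, as in the deployed configuration
frame_idx = 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % frame_stride == 0:
        pass            # run MegaDetector here; skipped frames carry no boxes, hence the flicker
    frame_idx += 1

cap.release()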


Series Navigation

This is Part 5 of 5 in the Wildlife Tracking Uruguay series.

Previous: Part 4: Weak Supervision & Training


Complete Series Index

This 5-part series covered the entire journey of building an automated wildlife tracking pipeline for Uruguayan camera traps:

  1. Part 1: Overview & Introduction - Project motivation, 3-stage architecture, results preview
  2. Part 2: MegaDetector Calibration - Parameter sweeps, Pareto frontiers, 2x speedup
  3. Part 3: The Tail-Wagging Paradox - ByteTrack adaptation, ultra-permissive IoU, 1.2 tracks/video
  4. Part 4: Weak Supervision & Training - Auto-labeling strategy, ResNet50 training, 95.7% accuracy
  5. Part 5: Results & Impact - Production deployment, error analysis, lessons learned

You’ve made it all the way here! Thank you for following the entire journey.

This post is licensed under CC BY 4.0 by the author.