Real World Ready  ·  Layer 1: Authentic Experience

Training the Eye: AI Image Classification in the Field

Technology  ·  Environmental Science  ·  Earth Science  |  Years 9–13  |  Portable framework  ·  No specialist equipment needed
Most encounters with AI in the classroom involve a chatbot: the student asks a question, the model generates text. This protocol is about a different kind of AI entirely. Computer vision models do not read or write. They look. They are trained on images, not language, and they learn to recognise categories of things in photographs by finding patterns humans cannot easily articulate. Students who have stood in a landscape, decided what matters, photographed examples, watched a model learn from their images, and then returned to test it against new territory understand something about AI that no chatbot demonstration can teach. The model is only as good as what they gave it. That is the lesson, and it can only be learned by building.
Prepare  ·  Collect training data  ·  Train and test  ·  Evaluate and extend
Protocol contributor: This protocol was developed in collaboration with Taylor Thomson, Kaimātai Taiao (Environmental Scientist) and Field-Based STEM Facilitator. Taylor's experience training AI-assisted computer vision models for environmental rehabilitation monitoring in professional practice is the foundation for the professional context presented in this protocol.
Health and safety: As with any activity outside the classroom, your school's EOTC requirements and health and safety procedures apply. This protocol can be completed at any accessible outdoor site, including school grounds. Ensure students are adequately supervised for the chosen location.
What the Model Is and How It Learns

Google Teachable Machine

The tool for this protocol is Teachable Machine, a free, browser-based image classification platform from Google. No download, account, or coding is required. Students open it on any device with a browser, define their categories, add training images, train the model, and test it, all in a single session. Access it at teachablemachine.withgoogle.com.

1. Categories come first

A computer vision model does not observe a landscape and decide what to notice. It is given categories by the people who build it. Before any photographs are taken, students decide what distinction they want the model to learn: erosion or no erosion, native species or exotic, dense canopy or sparse. The choice of categories is a scientific and design decision, not a technical one.

2. Training data is the curriculum

The model learns only from the images it is given. It generalises to new conditions only if its training data contains enough variety. A model trained only on sunny-day erosion photographs will struggle with the same erosion in overcast light. This is not a bug: it is the defining characteristic of the technology, and understanding it is the primary learning outcome.

3. Confidence scores, not certainty

When Teachable Machine classifies a new image, it returns a confidence score for each category: a probability, not an answer. A model that returns 54% erosion / 46% stable is not malfunctioning. It is encountering an image that resembles both categories. Students should understand what that output means before they trust or dismiss it.

4. The model does not understand anything

Teachable Machine finds visual patterns in pixel data. It does not know what erosion is, why it matters, or what causes it. The environmental understanding belongs entirely to the student. This distinction matters: the model is a pattern-recognition tool, not a monitoring authority.
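For teachers who want to show where a confidence score comes from, the following is a minimal Python sketch. It assumes the classifier's final step resembles a softmax, as is common in image classifiers; the raw scores, the describe helper, and the 0.2 margin are hypothetical values chosen for illustration, not output or settings from Teachable Machine.

```python
import math

def softmax(scores):
    """Convert raw model scores into confidence values that sum to 1."""
    # Subtract the maximum first so math.exp never overflows.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def describe(labels, scores, margin=0.2):
    """Rank categories by confidence and flag near-ties as ambiguous."""
    confidences = softmax(scores)
    ranked = sorted(zip(labels, confidences), key=lambda pair: pair[1], reverse=True)
    # Assumption: a gap smaller than `margin` counts as "resembles both".
    ambiguous = (ranked[0][1] - ranked[1][1]) < margin
    return ranked, ambiguous

# Hypothetical raw scores for one photograph (not real Teachable Machine output).
ranked, ambiguous = describe(["erosion", "stable"], [0.16, 0.0])
for label, confidence in ranked:
    print(f"{label}: {confidence:.0%}")
print("ambiguous" if ambiguous else "clear")
```

With these scores the sketch reports erosion at 54% and stable at 46% and marks the image as ambiguous, which mirrors the example in point 3: the model is not malfunctioning, it is looking at an image that resembles both categories.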

Tip: Running Teachable Machine in class before the field visit, using any two categories students can photograph in the room, removes the technical learning curve from the field session and lets students focus entirely on the data collection decisions.
Collecting Training Data in the Field
1. Choose your monitoring question

The question should be answerable by looking at a photograph. Good candidates for NZ landscapes include: erosion versus stable bank, native riparian vegetation versus exotic, healthy stream edge versus degraded, dense canopy versus open canopy, evidence of stock access versus fenced riparian zone. The question should be ecologically meaningful, not just visually convenient.

2. Photograph enough examples

A minimum of 30 images per category gives the model enough variation to learn from without overfitting to a small set. Students should vary their angle, distance, and lighting as they photograph. An image taken from one metre away and an image of the same feature from five metres are both useful training examples.

3. Photograph the difficult cases too

The images that sit on the boundary between categories are the most valuable training data. A bank that is beginning to erode but is not yet clearly in either category should be photographed and assigned. The student's decision about which category it belongs to is itself an act of scientific judgement.

4. Keep a set of images back for testing

Before leaving the field, set aside ten images per category, in addition to those collected for training, and do not use them to train the model. These become the test dataset: images the model has never seen, used to evaluate whether it has actually learned the distinction or simply memorised its training examples.

5. Return to the field after training

The most instructive moment in this protocol is testing the trained model against new field images taken after training is complete. Students photograph new examples of both categories and run them through the model. Where it succeeds and where it fails is the data for every subsequent discussion.
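The idea in step 4 of holding images back can be sketched in a few lines of Python. Teachable Machine does not require this: students can simply sort photographs into folders by hand. The file names, the hold_back helper, and the figure of 40 photographs per category are hypothetical, chosen so that 30 remain for training once ten are set aside.

```python
import random

def hold_back(images, test_count=10, seed=1):
    """Shuffle one category's images and set aside test_count for testing."""
    if len(images) <= test_count:
        raise ValueError("not enough images to hold any back for testing")
    shuffled = list(images)
    # A fixed seed makes the split repeatable between class sessions.
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_count:], shuffled[:test_count]  # (training, test)

# Hypothetical file names: 40 photographs in each of two categories.
photos = {
    "erosion": [f"erosion_{n:02d}.jpg" for n in range(40)],
    "stable": [f"stable_{n:02d}.jpg" for n in range(40)],
}

for category, images in photos.items():
    training, test = hold_back(images)
    # No image may appear in both sets, or the test no longer measures learning.
    assert not set(training) & set(test)
    print(category, len(training), "training,", len(test), "test")
```

Shuffling before splitting matters: if the last ten photographs were all taken at one spot late in the session, holding back only those would test the model on a single set of conditions.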

What phone photography can and cannot do: A phone camera is sufficient to train a functional classroom model. It cannot produce the resolution, coverage, or consistency of drone imagery or satellite data. That gap is not a limitation to apologise for: it is the starting point for the professional context discussion in the section What This Looks Like at Professional Scale.
What This Looks Like at Professional Scale

The conceptual workflow students complete with Teachable Machine is identical to what environmental and engineering professionals use at scale. The differences are in data volume, model sophistication, deployment infrastructure, and the consequences of getting the answer wrong.

Environmental rehabilitation monitoring

Companies such as Dendra Systems in Australia train computer vision models on drone imagery to identify erosion, invasive species, and vegetation recovery across mining rehabilitation sites. Quarterly drone surveys generate thousands of images. The model flags areas of concern for human review, enabling monitoring at a scale that would be impossible with field observers alone. Students in this protocol are completing the same conceptual workflow with phone cameras and a single site.

Riparian and catchment monitoring in NZ

Regional councils and DOC use aerial and satellite imagery to monitor vegetation change, erosion, and land use compliance across catchments. The shift toward AI-assisted image analysis is already underway. An environmental scientist or engineer who understands how these models are trained, what their failure modes are, and what human judgement they require is substantially better equipped for this work than one who has only used the outputs.

Post-disturbance land assessment

Following major weather events, AI-assisted image classification has been used to rapidly assess land damage across large areas before field teams can reach every site. The speed advantage of a trained model over manual image review is significant. The risk of systematic error when the model encounters conditions outside its training distribution is equally significant.

The gap that matters: Professional models are trained on thousands of images across varied conditions, validated against field-verified ground truth data, and evaluated against precision and recall thresholds before deployment. A student Teachable Machine model trained on 60 phone photographs is the same concept at a scale where the failure modes are visible and recoverable. That is exactly the right scale for learning what the concept actually means.
Drone access: Where schools have access to a drone and a trained operator, aerial imagery can replace phone photography for data collection. This closes the gap between the student workflow and the professional one considerably. It is not required for this protocol.

Back in the classroom: AI as thinking partner (Real World Ready Layer 2)

These prompts build on what students observed, decided, and discovered in the field and during model training. The gen AI chatbot is not the technology being studied here: it is a thinking partner for reflecting on a different kind of AI system that students have now built and tested themselves.

Years 9–10
What is a training dataset?

Ask a gen AI chatbot to explain what training data is for an image classification model. Then compare its explanation with your own experience in the field. What made your dataset strong? What limited it? Where does the AI explanation match what you found, and where does it miss the practical reality?

Why did it fail?

Identify one image your model classified incorrectly. Describe that image to a gen AI chatbot and ask: "Why might an image classifier trained on 30 examples per category struggle with this image?" Evaluate the response. Does it identify the same reasons you suspect, or different ones?

Where is this technology used?

Ask a gen AI chatbot to describe three real-world applications of computer vision image classification in environmental monitoring. For each application, ask: what categories would the model need to learn, and what would good training data look like? Then compare those requirements with the training data you collected.

Human versus model

Take five of your test images and classify them yourself, without the model. Then run them through the model. Where do you agree? Where do you disagree? Ask a gen AI chatbot: "In what conditions is a trained computer vision model more reliable than a human observer, and when is it less reliable?" Evaluate the answer against your own comparison.
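Students who want to record the human versus model comparison systematically could tabulate it as in the following sketch. The image names and labels are hypothetical, and the compare helper is an illustration rather than part of any tool used in this protocol.

```python
def compare(human, model):
    """Return the agreement rate and the images where the two labels differ."""
    if human.keys() != model.keys():
        raise ValueError("both sets of labels must cover the same images")
    disagreements = [name for name in human if human[name] != model[name]]
    agreement = 1 - len(disagreements) / len(human)
    return agreement, disagreements

# Hypothetical labels for five test images.
human = {"t1.jpg": "erosion", "t2.jpg": "erosion", "t3.jpg": "stable",
         "t4.jpg": "stable", "t5.jpg": "erosion"}
model = {"t1.jpg": "erosion", "t2.jpg": "stable", "t3.jpg": "stable",
         "t4.jpg": "stable", "t5.jpg": "erosion"}

agreement, disagreements = compare(human, model)
print(f"Agreement: {agreement:.0%}")
print("Look again at:", disagreements)
```

The list of disagreements is the useful output. Each image on it is a case where either the student or the model saw something the other did not, and deciding which is the discussion.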

Years 11–13
Measuring model quality

Ask a gen AI chatbot to explain precision, recall, and the confusion matrix as measures of classifier performance. Apply each measure to your own model's test results. Calculate your model's precision and recall for each category. Ask the AI: "What precision and recall thresholds would be required before a model like this was used for real environmental monitoring decisions?" Evaluate whether your model would meet those thresholds.
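As a check on hand calculations, precision and recall can be computed from a confusion matrix as in the sketch below. It uses the standard definitions: precision is the share of images the model labelled as a category that truly belong to it, and recall is the share of images truly in a category that the model found. The test results shown are hypothetical, not taken from any real model.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs across the test set."""
    return Counter(zip(actual, predicted))

def precision_recall(matrix, category):
    """Precision and recall for one category, or None when undefined."""
    true_pos = matrix[(category, category)]
    predicted_as = sum(n for (a, p), n in matrix.items() if p == category)
    actually_is = sum(n for (a, p), n in matrix.items() if a == category)
    precision = true_pos / predicted_as if predicted_as else None
    recall = true_pos / actually_is if actually_is else None
    return precision, recall

# Hypothetical results for 20 held-back test images (10 per category).
actual = ["erosion"] * 10 + ["stable"] * 10
predicted = ["erosion"] * 7 + ["stable"] * 3 + ["stable"] * 8 + ["erosion"] * 2

matrix = confusion_matrix(actual, predicted)
for category in ("erosion", "stable"):
    precision, recall = precision_recall(matrix, category)
    print(f"{category}: precision {precision:.2f}, recall {recall:.2f}")
```

In this made-up example the model finds 7 of the 10 erosion images (recall 0.70), and 7 of the 9 images it calls erosion really are (precision about 0.78). Which of the two numbers matters more depends on whether a missed erosion site or a false alarm is the costlier error.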

Training data bias

Review your field data collection process and identify at least two systematic biases in your training dataset: conditions you photographed consistently that may not represent the full range of the real world. Ask a gen AI chatbot to explain what "distribution shift" means in machine learning. Apply that concept to your model: what would distribution shift look like if your model were deployed at a different site or in a different season?
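One way to make a suspected bias visible is to record a condition for each test photograph and score the model separately for each condition. The sketch below assumes students noted the lighting when they took each image; the results listed are hypothetical and only illustrate the pattern a sunny-day training set might produce.

```python
from collections import defaultdict

def accuracy_by_condition(results):
    """Group test results by a recorded condition and score each group."""
    tallies = defaultdict(lambda: [0, 0])  # condition -> [correct, total]
    for condition, actual, predicted in results:
        tallies[condition][0] += actual == predicted
        tallies[condition][1] += 1
    return {c: correct / total for c, (correct, total) in tallies.items()}

# Hypothetical results: (lighting when photographed, true label, model label).
results = [
    ("sunny", "erosion", "erosion"), ("sunny", "stable", "stable"),
    ("sunny", "erosion", "erosion"), ("sunny", "stable", "stable"),
    ("overcast", "erosion", "stable"), ("overcast", "stable", "stable"),
    ("overcast", "erosion", "stable"), ("overcast", "erosion", "erosion"),
]

for condition, accuracy in accuracy_by_condition(results).items():
    print(f"{condition}: {accuracy:.0%} correct")
```

A single overall accuracy figure would hide the difference between the two groups. Splitting by condition is a small-scale version of checking for distribution shift: the same comparison could be made by site, season, or time of day.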

From classroom to professional deployment

Your model was trained on phone photographs from a single site in a single session. Ask a gen AI chatbot: "What steps would be required to develop a computer vision model suitable for professional environmental monitoring, starting from a student-built Teachable Machine classifier?" Evaluate each step against the gap you observed between your model's performance and the professional applications described in this protocol.

Decisions, accountability, and error

Ask a gen AI chatbot: "Who is accountable when an AI-assisted environmental monitoring system produces an incorrect classification that leads to a management decision being delayed or made incorrectly?" Research whether New Zealand has regulatory guidance on AI use in environmental decision-making. What does the absence or presence of that guidance tell you about where this technology sits in the professional and legal landscape?

EXPERIENCE TRACE SCALE  ·  TRAINING THE EYE: AI IMAGE CLASSIFICATION IN THE FIELD
Level 1
Years 9–10: Student can explain what a computer vision model is trained on, and describe the categories they chose and why. Understands that the model learns from examples, not from rules, and can only classify what it has been shown.
Years 11–13: Student can explain the conceptual difference between a language model (text-based, generative) and a computer vision classifier (image-based, discriminative), and describe the training workflow they completed, from category definition through to test evaluation.

Level 2
Years 9–10: Student connects the quality and variety of their training data to their model's performance on the test set. Can identify at least one specific case where the model failed and offer a reason grounded in the training data they collected.
Years 11–13: Student applies the concepts of precision, recall, and distribution shift to their own model's performance, identifying at least two systematic limitations in their training dataset and predicting how those limitations would affect deployment at a different site or under different conditions.

Level 3
Years 9–10: Student compares their Teachable Machine model with a professional environmental monitoring application, identifying at least two differences in scale, data quality, or deployment context, and explains what those differences mean for how much the model's output can be trusted.
Years 11–13: Student evaluates the gap between their student-built classifier and a professionally deployed system across multiple dimensions: training data volume and diversity, validation methodology, precision and recall thresholds, and the consequences of error in each context.

Level 4
Years 9–10: Student articulates what standing in the landscape and making data collection decisions added that could not be replicated by using a pre-built dataset or watching a demonstration: the experience of deciding what counts, photographing the difficult cases, and discovering where the model fails.
Years 11–13: Student reflects on the specific knowledge that comes from building and testing a model in the field: understanding of what a training dataset actually contains, what the model cannot see that a human observer can, and what environmental understanding is required to interpret the model's output responsibly.

Level 5
Years 9–10: Student identifies one specific improvement they would make to their training data collection if they repeated the protocol, grounded in the failure cases their model produced. Can describe what a more robust test would look like and what it would tell them.
Years 11–13: Student designs a hypothetical monitoring deployment for their site: specifies the training data requirements, the validation methodology, the precision and recall thresholds for deployment, the human review process, and the accountability structure for decisions made using the model's output. Identifies what field experience is irreplaceable in that design process.