The tool for this protocol is Teachable Machine, a free, browser-based image classification platform from Google. It requires no download, no account, and no coding. Students open it on any device with a browser, define their categories, add training images, train the model, and test it, all in a single session. Access it at teachablemachine.withgoogle.com.
A computer vision model does not observe a landscape and decide what to notice. It is given categories by the people who build it. Before any photographs are taken, students decide what distinction they want the model to learn: erosion or no erosion, native species or exotic, dense canopy or sparse. The choice of categories is a scientific and design decision, not a technical one.
The model learns only from the images it is given. It generalises to new conditions only as far as the variety in its training data allows. A model trained only on sunny-day erosion photographs will struggle with the same erosion in overcast light. This is not a bug: it is the defining characteristic of the technology, and understanding it is the primary learning outcome.
When Teachable Machine classifies a new image, it returns a confidence score for each category: a probability, not an answer. A model that returns 54% erosion / 46% stable is not malfunctioning. It is encountering an image that resembles both categories. Students should understand what that output means before they trust or dismiss it.
Teachable Machine finds visual patterns in pixel data. It does not know what erosion is, why it matters, or what causes it. The environmental understanding belongs entirely to the student. This distinction matters: the model is a pattern-recognition tool, not a monitoring authority.
The question should be answerable by looking at a photograph. Good candidates for NZ landscapes include: erosion versus stable bank, native riparian vegetation versus exotic, healthy stream edge versus degraded, dense canopy versus open canopy, evidence of stock access versus fenced riparian zone. The question should be ecologically meaningful, not just visually convenient.
Aim for a minimum of 30 images per category, which gives the model a reasonable amount of variation to learn from and reduces the risk of overfitting to a small set. Students should vary their angle, distance, and lighting as they photograph. An image taken from one metre away and an image of the same feature from five metres are both useful training examples.
The images that sit on the boundary between categories are the most valuable training data. A bank that is beginning to erode, and does not yet sit clearly in either category, should still be photographed and assigned to one. The student's decision about which category it belongs to is itself an act of scientific judgement.
Before leaving the field, set aside ten images per category that will not be used for training. These become the test dataset: images the model has never seen, used to evaluate whether it has actually learned the distinction or simply memorised its training examples.
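For classes that manage their photographs on a computer, the hold-out step can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `split_holdout`, the file names, and the figure of 40 photographs per category are all hypothetical. The only numbers taken from the protocol are ten test images and 30 training images per category.

```python
import random


def split_holdout(image_names, n_test=10, seed=0):
    """Split one category's image names into (train, test) lists.

    The test images are chosen at random so that the hold-out set is not
    just the last photographs taken or the easiest ones.
    """
    if len(image_names) <= n_test:
        raise ValueError("need more images than the test set size")
    shuffled = list(image_names)           # copy, so the input is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed gives a repeatable split
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical file names: 40 photographs of the "erosion" category.
photos = [f"erosion_{i:02d}.jpg" for i in range(40)]
train, test = split_holdout(photos)
print(len(train), len(test))  # 30 10
```

Choosing the test images at random, rather than by hand, avoids unconsciously holding back only the clearest examples.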
The most instructive moment in this protocol is testing the trained model against new field images taken after training is complete, in addition to the held-out test set. Students photograph new examples of both categories and run them through the model. Where the model succeeds and where it fails provide the data for every subsequent discussion.
The conceptual workflow students complete with Teachable Machine is identical to what environmental and engineering professionals use at scale. The differences are in data volume, model sophistication, deployment infrastructure, and the consequences of getting the answer wrong.
Companies such as Dendra Systems in Australia train computer vision models on drone imagery to identify erosion, invasive species, and vegetation recovery across mining rehabilitation sites. Quarterly drone surveys generate thousands of images. The model flags areas of concern for human review, enabling monitoring at a scale that would be impossible with field observers alone. Students in this protocol are completing the same conceptual workflow with phone cameras and a single site.
Regional councils and the Department of Conservation (DOC) use aerial and satellite imagery to monitor vegetation change, erosion, and land use compliance across catchments. The shift toward AI-assisted image analysis is already underway. An environmental scientist or engineer who understands how these models are trained, what their failure modes are, and what human judgement they require is substantially better equipped for this work than one who has only used the outputs.
Following major weather events, AI-assisted image classification has been used to rapidly assess land damage across large areas before field teams can reach every site. The speed advantage of a trained model over manual image review is significant. The risk of systematic error when the model encounters conditions outside its training distribution is equally significant.
These prompts build on what students observed, decided, and discovered in the field and during model training. The generative AI (gen AI) chatbot is not the technology being studied here: it is a thinking partner for reflecting on a different kind of AI system, one that students have now built and tested themselves.
Ask a gen AI chatbot to explain what training data is for an image classification model. Then compare its explanation with your own experience in the field. What made your dataset strong? What limited it? Where does the AI explanation match what you found, and where does it miss the practical reality?
Identify one image your model classified incorrectly. Describe that image to a gen AI chatbot and ask: "Why might an image classifier trained on 30 examples per category struggle with this image?" Evaluate the response. Does it identify the same reasons you suspect, or different ones?
Ask a gen AI chatbot to describe three real-world applications of computer vision image classification in environmental monitoring. For each application, ask: what categories would the model need to learn, and what would good training data look like? Then compare those requirements with the training data you collected.
Take five of your test images and classify them yourself, without the model. Then run them through the model. Where do you agree? Where do you disagree? Ask a gen AI chatbot: "In what conditions is a trained computer vision model more reliable than a human observer, and when is it less reliable?" Evaluate the answer against your own comparison.
Ask a gen AI chatbot to explain precision, recall, and the confusion matrix as measures of classifier performance. Apply each measure to your own model's test results. Calculate your model's precision and recall for each category. Ask the AI: "What precision and recall thresholds would be required before a model like this was used for real environmental monitoring decisions?" Evaluate whether your model would meet those thresholds.
Review your field data collection process and identify at least two systematic biases in your training dataset: conditions you photographed consistently that may not represent the full range of the real world. Ask a gen AI chatbot to explain what "distribution shift" means in machine learning. Apply that concept to your model: what would distribution shift look like if your model were deployed at a different site or in a different season?
Your model was trained on phone photographs from a single site in a single session. Ask a gen AI chatbot: "What steps would be required to develop a computer vision model suitable for professional environmental monitoring, starting from a student-built Teachable Machine classifier?" Evaluate each step against the gap you observed between your model's performance and the professional applications described in this protocol.
Ask a gen AI chatbot: "Who is accountable when an AI-assisted environmental monitoring system produces an incorrect classification that leads to a management decision being delayed or made incorrectly?" Research whether New Zealand has regulatory guidance on AI use in environmental decision-making. What does the absence or presence of that guidance tell you about where this technology sits in the professional and legal landscape?
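For the prompt above on precision, recall, and the confusion matrix, the arithmetic can be checked with a short script. The sketch below is one possible way to do it: the function names are hypothetical, and the test results are invented for illustration (ten test images per category, as in the hold-out step), not real field data. Precision for a category is the share of images the model labelled with that category that truly belong to it. Recall is the share of images truly in that category that the model found.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def precision_recall(counts, label):
    """Return (precision, recall) for one category, or None where undefined."""
    tp = counts[(label, label)]  # correctly labelled as this category
    fp = sum(n for (a, p), n in counts.items() if p == label and a != label)
    fn = sum(n for (a, p), n in counts.items() if a == label and p != label)
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall


# Invented results for illustration: 10 test images per category.
actual = ["erosion"] * 10 + ["stable"] * 10
predicted = (["erosion"] * 8 + ["stable"] * 2      # 2 erosion images missed
             + ["stable"] * 9 + ["erosion"] * 1)   # 1 stable image flagged

counts = confusion_counts(actual, predicted)
for label in ("erosion", "stable"):
    p, r = precision_recall(counts, label)
    print(label, round(p, 2), round(r, 2))
# erosion 0.89 0.8
# stable 0.82 0.9
```

With only ten test images per category, a single misclassified image moves recall by ten percentage points, which is worth keeping in mind when comparing a student model against any professional threshold.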
| Level | Years 9–10 | Years 11–13 |
|---|---|---|
| 1 | Student can explain what a computer vision model is trained on, and describe the categories they chose and why. Understands that the model learns from examples, not from rules, and can only classify what it has been shown. | Student can explain the conceptual difference between a language model (text-based, generative) and a computer vision classifier (image-based, discriminative), and describe the training workflow they completed, from category definition through to test evaluation. |
| 2 | Student connects the quality and variety of their training data to their model's performance on the test set. Can identify at least one specific case where the model failed and offer a reason grounded in the training data they collected. | Student applies the concepts of precision, recall, and distribution shift to their own model's performance, identifying at least two systematic limitations in their training dataset and predicting how those limitations would affect deployment at a different site or under different conditions. |
| 3 | Student compares their Teachable Machine model with a professional environmental monitoring application, identifying at least two differences in scale, data quality, or deployment context, and explains what those differences mean for how much the model's output can be trusted. | Student evaluates the gap between their student-built classifier and a professionally deployed system across multiple dimensions: training data volume and diversity, validation methodology, precision and recall thresholds, and the consequences of error in each context. |
| 4 | Student articulates what standing in the landscape and making data collection decisions added that could not be replicated by using a pre-built dataset or watching a demonstration: the experience of deciding what counts, photographing the difficult cases, and discovering where the model fails. | Student reflects on the specific knowledge that comes from building and testing a model in the field: understanding of what a training dataset actually contains, what the model cannot see that a human observer can, and what environmental understanding is required to interpret the model's output responsibly. |
| 5 | Student identifies one specific improvement they would make to their training data collection if they repeated the protocol, grounded in the failure cases their model produced. Can describe what a more robust test would look like and what it would tell them. | Student designs a hypothetical monitoring deployment for their site: specifies the training data requirements, the validation methodology, the precision and recall thresholds for deployment, the human review process, and the accountability structure for decisions made using the model's output. Identifies what field experience is irreplaceable in that design process. |