PROCESSES, PREDICTION, AND THE POWER OF BIG DATA: HOW MACHINE LEARNING IS IMPROVING STORMWATER MODELING
In this discussion, Geosyntec's Christian Nilsen and Saiful Rahat delve into how cities are applying machine learning in their stormwater modeling. They explore how modeling can help communities improve water quality and adapt to climate change.
About the Speakers
Christian Nilsen, P.E., is a Seattle-based project manager who specializes in decision-support tools for analyzing and prioritizing green stormwater infrastructure. His work includes The Nature Conservancy’s machine-learning-powered Stormwater Heatmap and watershed planning tools for King County and the City of Tacoma.
Saiful Rahat, Ph.D., is a water systems modeler and researcher focused on assessing climate risks and understanding how machine-learning algorithms and remotely sensed satellite data can improve water systems management. His work includes projects for USACE, NOAA, the National Science Foundation, and the World Bank. His most recent paper, “Remote Sensing-Enabled Machine Learning for River Water Quality Modeling Under Multidimensional Uncertainty,” was published earlier this year in the journal Science of the Total Environment.
What is machine learning?
Machine learning is a subset of artificial intelligence, or AI. When people hear AI, most think of ChatGPT, which is actually part of another branch of AI called generative AI.
Machine learning uses algorithms and models to look for patterns. We use remote sensing, where we have satellites taking thousands of pictures a day all over the earth. There's no way one human can look at that and get insights. But machine learning lets us look for patterns and quickly classify those images.
Is that how you understand machine learning, Rahat?
I think Christian described it beautifully. Let's say you are an engineer trying to model a river watershed. Traditionally, we leaned on one-size-fits-all mathematical equations or models that were limited to a fixed set of variables.
But here’s where machine learning steps in. Instead of relying on one fixed equation, machine-learning algorithms can analyze tons of data and you can use as many variables as you like. It’s like giving your model a crash course in the specific river behavior that you are interested in and having it come up with its very own equation using all sorts of location-specific parameters. And what excites me is having the ability to apply this approach to watershed management or any kind of hydrological modeling.
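As a rough sketch of what “coming up with its very own equation” can look like, here is a minimal example. Everything here is invented for illustration: the predictor names, the synthetic data, and the hidden relationship the model recovers.

```python
import numpy as np

# Synthetic example (all names and numbers invented): learn how
# streamflow responds to several location-specific predictors at once.
rng = np.random.default_rng(42)
n = 200
rainfall = rng.uniform(0, 50, n)        # mm/day
temperature = rng.uniform(5, 30, n)     # deg C
impervious = rng.uniform(0, 1, n)       # impervious surface fraction

# A hidden "true" relationship the model has to discover from data
flow = 2.0 * rainfall - 0.5 * temperature + 10.0 * impervious + rng.normal(0, 1, n)

# Fit coefficients by least squares: the model builds its own
# equation from the data instead of starting from a fixed one.
X = np.column_stack([rainfall, temperature, impervious, np.ones(n)])
coef, *_ = np.linalg.lstsq(X, flow, rcond=None)
print(np.round(coef, 2))  # close to the true weights 2.0, -0.5, 10.0
```

Real machine-learning algorithms are far more flexible than this least-squares fit, but the workflow is the same: supply many candidate variables and let the model learn the weights from data.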
How can machine learning be used in watershed management or hydrological modeling?
Before I address that, let me say that we are not going to replace physical process-based models with machine learning anytime soon. They remain the classic and most valuable tools for watershed modeling. However, machine learning can be a powerful complement to improve predictions.
Imagine you're trying to understand how a river reacts to long-term climate patterns. Conventional models typically focus on fundamental variables like rainfall and temperature. But a river can be affected by so many other factors. Things like agricultural practices, land-use patterns, soil types, or even seasonal changes in farmer behavior, for example.
Machine learning gives us the ability to consider all these variables comprehensively. It's a powerful tool that gathers a wide array of information from various angles, providing us with a better understanding of the situation.
I think we need to be cautious when we talk about prediction with machine learning, especially in terms of engineering and hydrology in our work.
When we're using models to predict, that's based on process-based models and data we can measure. And while machine learning is good at finding patterns, we don't want to rely on it to make predictions that impact safety or property.
We know really well through experimental data how river stage is going to change with flow, for example. That's something that can be measured. We don't want to replace data like that, data we have confidence in, with machine-learning outputs we might not have a lot of insight into. We can review patterns and results and see if they make sense to us, but we don’t want to rely on it for making predictions.
Where I see a lot of potential for machine learning is for physical processes that we don't really have great process-based models for. Like soil infiltration or evaporation dynamics or wind dynamics, where it can look into the patterns of the data and find relationships we might not necessarily be able to measure.
Can’t you see how machine-learning models prioritize each predictor variable?
You can, but you can't really point to a relationship or an equation for how it's doing the calculations and making its analysis. If you're explaining the results, for example, you can look at the outputs and see how the variables have been weighted, but there's always going to be something unknowable about how it's making those decisions.
And there are many flavors of different machine-learning algorithms. Some are more opaque than others, but the point I wanted to make is that when we're making decisions for public financing or capital projects, we want to be able to show the path to the answers and analysis, and the equation that led us to them.
And so, my bigger point is that I see it helping, but it's never going to replace the expertise of an engineer. I see the importance of machine learning as helping to inform the person doing the modeling on processes that we might not otherwise have a lot of insight into.
Oh, that's for sure. I agree!
Let's pick up on that a little bit more. What steps can one take to make sure the machine-learning models are trained on relevant and representative data so that the predictions are accurate?
OK—so interpretability for machine learning is definitely a valid concern and a lot of the time these models can be a black box, as Christian said. But there is a way around it.
I think one way to better understand a model’s prediction is to identify the key relationships it has learned from the data it was trained on. We can look at the most influential inputs driving the prediction based on the weight factors of each variable. We can also run a sensitivity analysis to understand how different factors are affecting this prediction.
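As a toy illustration of that kind of one-at-a-time sensitivity check, with a made-up stand-in for a trained model and invented weights and inputs:

```python
import numpy as np

# Toy sensitivity analysis: perturb one input at a time and see how
# much the prediction moves. Bigger swings suggest more influential
# inputs. The "model" here is just fixed invented weights.
WEIGHTS = np.array([2.0, -0.5, 10.0])   # stand-in for a trained model

def model(x):
    return x @ WEIGHTS

baseline = np.array([25.0, 15.0, 0.5])  # rainfall, temperature, impervious
y0 = model(baseline)

sensitivity = {}
for i, name in enumerate(["rainfall", "temperature", "impervious"]):
    x = baseline.copy()
    x[i] *= 1.10                        # +10% perturbation
    sensitivity[name] = abs(model(x) - y0)

# Rank inputs by how strongly they drive the prediction
ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)
print(ranked)
```

With an opaque model you would call it the same way, treating `model` as a black box; only the ranking of input influence comes out, not an equation.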
It is important to use your judgment. A real-life example would be ChatGPT. If you ask ChatGPT to do complex mathematical problems, most of the time it will give you inaccurate results. Why? Because it's a language model, and when you ask it to do a calculation, it gives you textual answers based on probabilities rather than exact solutions.
But that doesn’t mean it is a bad model. It’s wonderful at organizing text. So the key takeaway is understanding what your model’s limitations are and applying each model to the specific purpose it was created for. Machine-learning models are great, but you have to be cautious about the type of prediction they give you.
And like any other modeling, the results you get are directly correlated with the quality of the inputs that you give it. Garbage in, garbage out.
The most important thing is to make sure you have high-quality, accurate data. That’s why machine learning should not take the place of measuring things you can physically measure: the more things you can measure, the less error there will be in the results.
What kind of data was used for the Stormwater Heatmap?
We used remote-sensing data, satellite imagery, and aerial imagery to classify land cover. What we did was use a data set provided by the Washington Department of Fish and Wildlife (WDFW) where they hand-classified aerial photography with land-cover classes. Things like rooftops, roadways, trees, and grass. Then we trained a model to look at the spectral imagery in that aerial imagery.
What’s spectral imagery?
By spectral imagery I'm talking about the red band, green band, near infrared band, other things that are actually measurements. The model looks at the relationships between all those different bands and then comes up with a way to correlate them with WDFW’s hand-classified data. Then for areas where we don’t have measured data, we can use the model to predict the type of land cover.
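A heavily simplified sketch of the idea, with invented band values and a nearest-centroid rule standing in for the real classifier and the WDFW training data:

```python
import numpy as np

# Sketch of spectral land-cover classification. Band reflectances and
# class signatures are invented; the real Stormwater Heatmap pipeline
# is far more involved.
rng = np.random.default_rng(0)

# Synthetic "hand-classified" pixels: (red, green, NIR) per class
classes = {
    "rooftop": np.array([0.30, 0.30, 0.20]),
    "grass":   np.array([0.10, 0.40, 0.60]),
    "tree":    np.array([0.05, 0.25, 0.50]),
}
train_X, train_y = [], []
for label, center in classes.items():
    train_X.append(center + rng.normal(0, 0.02, size=(50, 3)))
    train_y += [label] * 50
train_X = np.vstack(train_X)
train_y = np.array(train_y)

# "Train": learn the average spectral signature of each class
centroids = {label: train_X[train_y == label].mean(axis=0)
             for label in classes}

def classify(pixel):
    # Predict land cover for a pixel with no hand-classified label
    return min(centroids, key=lambda c: np.linalg.norm(pixel - centroids[c]))

print(classify(np.array([0.09, 0.41, 0.58])))  # a grass-like spectrum
```

The principle is the same as in the Heatmap work: learn the relationship between band measurements and labeled land cover, then apply it where no labels exist.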
We also used other data and did a traditional regression analysis without machine learning to predict pollutant loads.
You mentioned the availability of quality data. In cases where data is limited, how can machine-learning techniques still be integrated effectively into riverine and watershed modeling?
Limited data is a common challenge, but there are strategies to address it. You can use techniques like transfer learning, where models are first trained on other similar watershed data. We can also use expert knowledge by defining the expected relationship between the inputs and outputs of a certain model.
For instance, Christian, I think you said that some aspects of your work included linear regression, right?
So, in our recent research with the US Army Corps of Engineers, not every step involved machine learning. Some of the steps actually employed straightforward linear regression methods, which, interestingly, fall under the umbrella of machine learning too. Point being, you need to use a variety of techniques to address data limitations.
Yeah. And when we do hydrology and hydraulic models, these models have calibration parameters to adjust the equations for modeling these physical processes. Often, there's not a lot of insight into how these should be adjusted to match the data. One classic example is a stream channel: if you have flow and depth, you calibrate the roughness of the channel so the results match the measurements. In more complex watershed models, there might be 20 or 30 different parameters associated with soil moisture retention, evaporation, and things like that.
I have seen some good examples of using, not just machine learning, but also genetic algorithms, a branch of AI that's slightly different from machine learning. Here, you'll start with a calibration run and then it's almost like survival of the fittest, an evolution model where the best model that fits the data will be iterated on and then iterated on. And then you get an answer that dials in those calibration parameters.
That’s another example of where you might have limited data, but you have a good understanding of the fundamental equations that govern the processes. Still, you might not have a lot of insight into how your particular case is going to behave in terms of those really specific calibration parameters.
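A toy version of that genetic-algorithm calibration, with an invented rating equation and a single roughness parameter standing in for the dozens a real watershed model would have:

```python
import random

# Toy genetic-algorithm calibration: evolve a channel roughness value
# so a simple (invented) stage-flow relation matches "observed" data.
random.seed(7)

TRUE_ROUGHNESS = 0.035
depths = [0.5, 1.0, 1.5, 2.0]

def simulate(roughness, depth):
    # Manning-style toy relation: flow grows with depth, shrinks with roughness
    return depth ** (5 / 3) / roughness

observed = [simulate(TRUE_ROUGHNESS, d) for d in depths]

def error(roughness):
    # Fitness = how badly a candidate misses the observations
    return sum((simulate(roughness, d) - o) ** 2
               for d, o in zip(depths, observed))

# Evolve: keep the fittest candidates, mutate them, and repeat
population = [random.uniform(0.01, 0.10) for _ in range(20)]
for _ in range(40):
    population.sort(key=error)
    survivors = population[:5]                    # survival of the fittest
    population = survivors + [
        max(0.001, s + random.gauss(0, 0.005))    # mutated offspring
        for s in survivors for _ in range(3)
    ]

best = min(population, key=error)
print(round(best, 3))  # converges toward TRUE_ROUGHNESS
```

Each generation keeps the candidate parameter sets that best reproduce the data and iterates on them, which is the “survival of the fittest” loop described above.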
How do you explain these predictions or outcomes, or whatever you want to call them, from machine-learning-based models to stakeholders and decision-makers? Are people skeptical of the data or the outcomes that come out of it?
I think it's in the same class as our typical models, where we don't want to get bogged down in the details, and we don't want to claim absolute trust in our models, but we do want to explain the steps.
So we explain: why are we using a model? Why are we applying it to this situation? What can it show? Then we talk about the results and our interpretation of those results, making sure we include all the caveats we would with any traditional modeling: that it's based on the data we do have, and there are always going to be errors in our results. But if you can bring out what's useful for decision-making, that's the most important thing any model can do.
I would follow the same approach as well.
At the end of the day, we are trying to come up with solutions. And not just solutions, but possible solutions, solutions that can help prepare for unforeseen challenges. Regardless of whether we use machine learning or a physical process-based model, we need to come up with strategies that make the most sense.
Yeah, people don't trust models anyway. (laughs) It doesn't matter if it's a computer or a person, people don't necessarily trust models.
And models should be flexible like that. You cannot prepare for only one future. You have to prepare for multiple futures. That's why we use model results as a range from the worst- to best-case scenarios.
Speaking of looking to the future. Five to ten years from now, what will this pivot into machine learning help us do better than we would otherwise?
Yeah, it's an exciting time. It's hard to really predict where the advances are going to go. In general, what I'm excited about is the explosion in computing power and cloud computing power to look at a lot of data.
In our work for public agencies, a lot of the best data might be in the filing cabinet somewhere or in a PDF or written down by hand. By using text recognition or semantic models of language, there's a lot that can be pulled out of the data that's locked away.
For example, there are great groundwater data collected by well drillers on handwritten well logs. It's very tedious for a human to go through and read those and extract the data, but we can give that information to a computer and it can pull out that data, which can then be used for all kinds of things.
For me, it's more like merging the best of both worlds. We’re not going to replace the physical process-based model anytime soon and we should not. In practice, we have seen some studies where deep learning techniques were incorporated for flood prediction, and the results were dramatically improved.
And for the future, yes, machine-learning models have their strength in data-driven prediction and pattern recognition, but physical process-based models give us scientific consistency.
It is exciting and it brings me a lot of joy that we have the ability to use both techniques for the well-being of our environment.
As long as we apply them together wisely and with good judgment. That’s the key here. I believe we can make some remarkable progress. The possibilities are endless, I’d say.
Learn more about The Nature Conservancy’s Stormwater Heatmap: https://www.washingtonnature.org/fieldnotes/creating-the-stormwater-heatmap-an-open-source-tool-to-track-pollution
Listen to Christian explain the Stormwater Heatmap’s origins: https://www.youtube.com/watch?v=eao39Ba6g0o
Check out Dr. Rahat’s latest paper “Remote Sensing-Enabled Machine Learning for River Water Quality Modeling Under Multidimensional Uncertainty” from Science of the Total Environment: https://www.sciencedirect.com/science/article/pii/S004896972304127X?dgcid=author