Development of a machine learning-based method for predicting the concentrations and identifying the sources of heavy metals in river water

Denghui Zhu; Tomohisa Shimada; Jiajie Wang; Noriyoshi Tsuchiya

5:15 PM - 6:45 PM

[MGI28-P07] Development of a machine learning-based method for predicting the concentrations and identifying the sources of heavy metals in river water

*Denghui Zhu¹, Tomohisa Shimada¹, Jiajie Wang¹, Noriyoshi Tsuchiya¹ (1.Graduate School of Environmental Studies, Tohoku University)

Keywords: heavy metal pollution, machine learning, prediction, source identification

Conventional assessment of heavy metal pollution in river systems requires long-term sampling of river water followed by laboratory analysis, which is time-consuming, laborious and costly. In addition, previous methods commonly use principal component analysis (PCA) to identify the sources of heavy metals. However, some studies have shown that PCA makes variables less interpretable and results in information loss. Given the development of machine learning techniques and their advantages for prediction, this research aims to develop an efficient method that applies recent interpretable machine learning techniques to easily obtained source data for heavy metals (mine, industrial and domestic wastewater, geological background, soil features, land use type, vegetation, elevation, water discharge, precipitation, pH, temperature) in order to: 1. predict the concentrations of heavy metals in river water; 2. quantitatively identify the pollutant load contribution of each pollution source.

We collected 160 river water samples from the Yoneshiro River and the Kosaka River, located in Akita Prefecture, Japan, and measured the concentrations of Pb, Zn, Cu and Cd in the samples. The source data described above were used as input variables and the measured Pb, Zn, Cu and Cd concentrations as outputs to train a random forest (RF) model for predicting Pb, Zn, Cu and Cd concentrations. SHapley Additive exPlanations (SHAP) was used to identify the sources of the heavy metals. The coefficient of determination (R²) and the mean squared error (MSE) were used to evaluate the performance of the trained model. The results showed that the concentrations of Pb, Zn, Cu and Cd were well predicted by the model, with R² of 0.97, 0.93, 0.95 and 0.99 and MSE of 10.25, 0.46, 0.011 and 0.002, respectively. With SHAP, quantitative source identification information for Zn, Cu, Pb and Cd was obtained not only for the whole study area but also for each sampling point. The SHAP results were verified by comparison with the results of PCA.
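The evaluation metrics and the attribution idea behind SHAP can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' implementation: the abstract does not say which software was used, and the toy model, its three inputs (`mine`, `ph`, `discharge`), and all numbers below are hypothetical. The `shapley_values` function computes exact Shapley values against a single baseline point by enumerating all feature orderings. This is feasible only for a handful of features and is simpler than the tree-specific algorithms typically used to explain random forests.

```python
from itertools import permutations
from math import factorial


def mse(y_true, y_pred):
    """Mean squared error between measured and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    A feature that has not yet been "added" keeps its baseline value.
    Each feature's marginal contribution is averaged over all orderings,
    so the cost grows factorially with the number of features.
    """
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        z = list(baseline)
        prev = model(z)
        for i in order:
            z[i] = x[i]          # switch feature i from baseline to its real value
            cur = model(z)
            phi[i] += cur - prev  # marginal contribution of feature i in this ordering
            prev = cur
    return [p / factorial(n) for p in phi]


# Hypothetical toy "concentration" model (an assumption, not a fitted RF):
# a mine term, a mine x discharge interaction, and a pH term.
def toy_model(v):
    mine, ph, discharge = v
    return 2.0 * mine + 0.5 * mine * discharge - 1.0 * ph


x = [3.0, 6.0, 2.0]          # one made-up sampling point
baseline = [1.0, 7.0, 1.0]   # made-up reference point
phi = shapley_values(toy_model, x, baseline)

print("contributions (mine, pH, discharge):", phi)
print("sum of contributions:", sum(phi))
print("prediction minus baseline:", toy_model(x) - toy_model(baseline))
```

The per-feature contributions sum to the difference between the prediction at the sampling point and the prediction at the baseline. This additivity is what allows an attribution to be read per sampling point and, after aggregating over samples, for a whole study area.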

Presentation information

[M-GI28] Data-driven geosciences
