11:00 AM - 1:00 PM
[HTT19-P04] Effect of different data pre-processing methods on the performance of LULC classification using Machine Learning and Landsat-8 OLI imagery
Keywords: LULC classification, Machine Learning, Data pre-processing
Land Use / Land Cover (LULC) classification is one of the most widely used applications of remote sensing. Map products derived from LULC classification have served as a fundamental source of information for a variety of studies on environmental issues, offering an opportunity to enhance our understanding of the complex dynamics of LULC change in response to evolving biophysical and socio-economic conditions. LULC map products have also served as scientific data for decision-making and consensus-building, helping stakeholders reach an adequate compromise in the management and planning of land resources. Hence, the remote sensing community has been committed to improving the performance of LULC classification by considering a variety of combinations of remote sensing data and classification algorithms.
In recent years, huge amounts of remote sensing data from Earth observation satellites such as Landsat-8 and Sentinel-2 have become available free of charge. In addition, there has been remarkable development of Machine Learning (ML) algorithms, which show excellent performance in classifying large multi-dimensional data sets and are increasingly being applied to LULC classification. The ready availability of remote sensing data and ML has enabled cost-effective LULC classification at spatial scales from regional to global. However, the performance of LULC classification depends not only on the combination of remote sensing data and ML algorithms but also on data pre-processing procedures.
This study aims to compare different combinations of commonly applied data pre-processing methods and to evaluate their effects on the performance of LULC classification for mapping bamboo grove distribution at a regional scale. Although bamboo is important to the culture and tradition of Japan, the rapid increase of unmanaged bamboo groves has caused various local problems. To address these problems, systematic forest management and planning are needed, based on an understanding of bamboo grove distribution at a regional scale. A previous study compared the classification performance of several ML algorithms for mapping bamboo grove distribution. However, the details of data pre-processing and ML parameter settings were not fully reported, and more detailed studies are needed to find an optimal method for mapping bamboo grove distribution.
The remote sensing data used in this study were multi-temporal Landsat-8 OLI images acquired between 2013/04/26 and 2019/12/23. Predictor variables for ML were derived from these data by pre-processing with different combinations of the following six methods: (1) imputation of missing data; (2) removal of non-informative near-zero variance data; (3) removal of highly correlated data; (4) PCA transformation; (5) Box-Cox transformation; and (6) Z-score standardization. The performance of LULC classification for each combination of pre-processing methods was evaluated by repeated 10-fold cross-validation with Cohen’s kappa, using six representative ML algorithms: Artificial Neural Network (ANN), Support Vector Machine (SVM), k-Nearest Neighbor (kNN), Random Forest (RF), C5.0, and eXtreme Gradient Boosting (XGB).
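As a rough illustration of the evaluation design described above, the sketch below implements two of the six pre-processing steps (near-zero variance removal and Z-score standardization), Cohen's kappa, and the index splitting for repeated 10-fold cross-validation in plain Python. The abstract does not name the software or settings used, so every function name, the variance threshold, the number of repeats, and the random seed are assumptions made for illustration only. In particular, a simple variance cut-off is used here as a stand-in for whatever near-zero variance criterion the study applied.

```python
import random
import statistics
from collections import Counter


def drop_near_zero_variance(columns, threshold=1e-8):
    """Keep only predictors whose population variance exceeds the threshold.

    `columns` maps a predictor name to its list of values. The threshold
    is an assumed value, not one taken from the study.
    """
    return {name: vals for name, vals in columns.items()
            if statistics.pvariance(vals) > threshold}


def z_score(values):
    """Rescale values to zero mean and unit population standard deviation.

    Assumes the values are not all identical (non-zero standard deviation).
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e).

    p_o is the observed agreement and p_e the agreement expected by chance
    from the marginal class frequencies. Undefined when p_e equals 1.
    """
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)


def repeated_kfold(n_samples, k=10, repeats=5, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated k-fold cross-validation.

    Each repeat reshuffles the sample indices and splits them into k folds.
    The number of repeats and the seed are assumed values.
    """
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        for i, test in enumerate(folds):
            train = [j for f_i, fold in enumerate(folds) if f_i != i
                     for j in fold]
            yield train, test


if __name__ == "__main__":
    bands = {"b4": [0.1, 0.3, 0.2, 0.4], "flat": [0.5, 0.5, 0.5, 0.5]}
    kept = drop_near_zero_variance(bands)
    print(sorted(kept))                       # names of the retained predictors
    print(z_score(kept["b4"]))                # standardized band values
    print(cohen_kappa(list("aabb"), list("aabb")))
```

In a full comparison, the pre-processing statistics (means, standard deviations, variance filters) would normally be estimated on each training fold only and then applied to the matching test fold, so that no information leaks from the held-out samples into the model.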
The results indicate that four of the pre-processing methods (imputation, removal of non-informative data, Box-Cox transformation, and Z-score standardization) increase the classification performance of RF, XGB, C5.0, and SVM. The results also suggest that removal of highly correlated data and PCA transformation might adversely affect classification performance for all the ML algorithms except ANN. In conclusion, the present results provide recommendations on commonly applied data pre-processing methods and might help to build more practical classification models using ML and Landsat-8 OLI imagery.