CIGR VI 2019

Presentation information

Poster Session

Postharvest Machinery

[5-1130-P] Postharvest Machinery (5th)

Thu. Sep 5, 2019 11:30 AM - 12:30 PM Poster Place (Entrance Hall)

[5-1130-P-14] Detection of Outliers in Pre-processing of Datasets for Recognition of Classifiers Using Partial Least Squares Discriminant Analysis

*Miki Fujii (1), Ryozo Noguchi (2), Tofael Ahamed (2), Takuma Genkawa (3) (1. Graduate School of Life and Environmental Sciences, University of Tsukuba (Japan), 2. Faculty of Life and Environmental Sciences, University of Tsukuba (Japan), 3. Food Research Institute, NARO (Japan))

Keywords: Pre-Processing, Dataset for Recognition of Classifiers, Machine Learning, Multivariate Analysis

In recent years, smart agriculture has received increasing attention in Japan. Image recognition is used to monitor the growth of vegetables and to determine the proper harvest timing. In machine learning, the choice of images in the dataset affects the classification accuracy. Collected datasets are generally pre-processed by analysts according to their own experience and knowledge, yet they can still contain images that act as outliers and lower the accuracy rate. In this study, datasets were pre-processed using an objective indicator obtained from partial least squares discriminant analysis (PLS-DA), a multivariate analysis method. The dataset consisted of 300 lemon images and 300 strawberry images, each 75x75 pixels in size. In the first test, the dataset was classified with a Support Vector Machine (SVM). The data were randomly split into 75% training data and 25% test data. The accuracy rate is defined as the proportion of images that are correctly classified. The images were also resized to resolutions ranging from 2x2 pixels to 64x64 pixels, and the same verification was performed. Verification was repeated 100 times at each resolution. In the second test, outliers were detected by PLS-DA before classification by SVM. The objective variable was set to 1 for lemon images and to 0 for strawberry images, and the threshold value was set to 0.5. In the PLS-DA model, lemon images whose predicted values fell below 0.5 and strawberry images whose predicted values rose above 0.5 were detected as outliers. Images detected as outliers were removed from the dataset, and image recognition was then performed following the same procedure as in the first test. In the first test, SVM achieved accuracy rates of 91.6% to 96.5% across the image resolutions, which means that the images were classified with high accuracy.
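
The outlier rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the predicted values have already been obtained from a fitted PLS-DA model, and it assumes that an outlier is an image whose predicted value falls on the wrong side of the 0.5 threshold for its class (a lemon image predicted below 0.5, or a strawberry image predicted above it). The function name `flag_outliers`, the treatment of a value exactly equal to 0.5, and the sample values are all hypothetical.

```python
def flag_outliers(labels, predicted, threshold=0.5):
    """Return indices of images whose predicted value is on the wrong side of the threshold.

    labels:    1 for lemon, 0 for strawberry (the objective variable)
    predicted: predicted values from a PLS-DA model (assumed to be given)
    """
    outliers = []
    for i, (y, y_hat) in enumerate(zip(labels, predicted)):
        if y == 1 and y_hat < threshold:
            # Lemon image predicted closer to the strawberry class
            outliers.append(i)
        elif y == 0 and y_hat >= threshold:
            # Strawberry image predicted closer to the lemon class
            # (treating exactly 0.5 as an outlier here is an assumption)
            outliers.append(i)
    return outliers


# Hypothetical predicted values for three lemon and three strawberry images
labels = [1, 1, 1, 0, 0, 0]
predicted = [0.92, 0.41, 0.77, 0.08, 0.63, 0.35]

outlier_idx = flag_outliers(labels, predicted)
kept_idx = [i for i in range(len(labels)) if i not in outlier_idx]

print(outlier_idx)  # [1, 4]
print(kept_idx)     # [0, 2, 3, 5]
```

Only the images in `kept_idx` would then be passed on to the training and test split for the SVM.
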
As the number of pixels increased, the accuracy rate continued to improve up to 8x8 pixel images and remained at about 96% thereafter. For 2x2 pixel images, the standard deviation was 7.6% (maximum accuracy rate: 98.0%, minimum accuracy rate: 51.7%) and the coefficient of variation was 0.083. In contrast, images of 4x4 pixels or larger showed standard deviations of 1.4% to 1.8% and coefficients of variation of less than 0.009. Thus, the accuracy rate varied widely between runs when 2x2 pixel images were used. In the second test, SVM applied after PLS-DA pre-processing achieved an accuracy of more than 99% regardless of the number of pixels. Fewer than 6% of the images (4 to 17 images) were detected as outliers at each resolution. A t-test between the first test and the second test showed that the accuracy rate improved significantly at all resolutions. The coefficient of variation was less than 0.009 at every resolution, and it decreased markedly for the 2x2 pixel images in particular. This indicates that removing outliers suppresses the variation in the accuracy rate. In summary, detecting outliers with PLS-DA and removing them from the dataset significantly improved the classification accuracy from 96% to 99% and suppressed the variation in the accuracy rate. The developed outlier detection method can be implemented in machine-learning training and testing workflows to increase the accuracy of validation.
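
The coefficient of variation quoted above is the standard deviation of the accuracy rate divided by its mean. The short Python sketch below shows the calculation and uses the reported 2x2 pixel figures to back out the mean accuracy they imply. It is an illustration only: the function name is hypothetical, and the use of the sample standard deviation is an assumption, since the abstract does not state which estimator was used.

```python
import statistics


def coefficient_of_variation(accuracies):
    """Standard deviation divided by mean (sample standard deviation assumed)."""
    return statistics.stdev(accuracies) / statistics.mean(accuracies)


# Reported for the 2x2 pixel images: standard deviation 7.6%, coefficient of variation 0.083.
# Mean accuracy implied by these two figures:
implied_mean = 7.6 / 0.083
print(round(implied_mean, 1))  # 91.6
```

The implied mean of about 91.6% matches the lower end of the 91.6% to 96.5% accuracy range reported for the first test, which suggests that the lowest accuracy was obtained with the 2x2 pixel images.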