Using a cost-sensitive learning approach to predict imbalanced soil classes

Document Type : Complete scientific research article

Authors

1 PhD student, Department of Soil Science, Faculty of Agriculture, University of Zanjan, Iran

2 Associate Professor, Department of Soil Science, Faculty of Agriculture, University of Zanjan, Iran

3 Assistant Professor, Soil and Water Research Institute, Agricultural Research, Education and Extension Organization, Karaj, Iran

Abstract

Background and objectives: Optimal soil management and sustainable agricultural development depend on accurate, reliable information about soil condition and classification, so accurate prediction of soil classes and their locations is of great importance. Machine learning methods, and the cost-sensitive learning approach in particular, can improve the accuracy and efficiency of soil class prediction by accounting for the imbalanced distribution of soil classes, and can thereby provide valuable information for soil management and agriculture. To this end, this study was conducted in part of the southwestern lands of Zanjan province.
Materials and methods: A total of 148 soil profiles were excavated on a regular grid with an average spacing of 500 meters (up to 700 meters in some locations, based on expert recommendations), then described and, following laboratory analysis, classified to the family level. Candidate covariates were derived from geomorphological and geological maps, a digital elevation model (DEM), and Landsat 8 satellite images. Using principal component analysis (PCA) and expert knowledge, the following covariates were selected as the most effective predictors of soil classes and used as model inputs: geomorphological maps, geological information, analytical hillshading, sunrise, valley depth, LS factor, channel network distance, topographic wetness index, and multi-resolution ridge top flatness. The soil-landscape relationship was modeled with the random forest (RF) algorithm and an ensemble model (after data balancing) in RStudio.
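One common way to make a classifier cost-sensitive is to weight each class inversely to its frequency, so that errors on rare classes cost more. The sketch below illustrates that idea in plain Python. It is not the authors' RStudio workflow, and the abstract does not state which weighting scheme they used. The class labels and counts are hypothetical, and the helper names `balanced_class_weights` and `weighted_error` are introduced here for illustration only. The weight formula is the same heuristic that scikit-learn documents for `class_weight="balanced"`.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * k) for cls, k in counts.items()}


def weighted_error(y_true, y_pred, weights):
    """Share of the total class weight carried by misclassified samples."""
    total = sum(weights[t] for t in y_true)
    wrong = sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)
    return wrong / total


# Hypothetical imbalanced labels, NOT the class counts of the study.
labels = ["A"] * 60 + ["B"] * 30 + ["C"] * 10
weights = balanced_class_weights(labels)

# A naive model that always predicts the majority class.
majority_only = ["A"] * len(labels)
plain_error = sum(t != p for t, p in zip(labels, majority_only)) / len(labels)
cost_error = weighted_error(labels, majority_only, weights)

print(weights)
print(plain_error, cost_error)
```

With these hypothetical counts, the rare class "C" receives the largest weight, and the majority-only predictor is penalized more heavily under the weighted error than under the plain error rate. That is the mechanism by which cost-sensitive training discourages a model from ignoring minority classes.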
Results: At the subgroup level, the soils of the region fell into five classes with an imbalanced distribution: Typic Calcixerepts, Typic Haploxerepts, Gypsic Haploxerepts, Typic Xerorthents, and Lithic Xerorthents. For the soil map produced by the random forest model, the overall accuracy and Kappa coefficient were 65% and 0.32 before data balancing, and 86% and 0.77 after balancing with the cost-sensitive learning approach. At the subgroup level, after balancing with the cost-sensitive learning approach, all soil classes were predicted with very high accuracy, especially the two minority classes, Gypsic Haploxerepts and Lithic Xerorthents, with user's accuracies of 100% and 100% and producer's accuracies of 91% and 85%, respectively. The sensitivity values for the two minority classes, Gypsic Haploxerepts (zero) and Lithic Xerorthents (zero), show that no correct predictions were made for these two classes. The specificity values for Gypsic Haploxerepts and Lithic Xerorthents were 1 and 0.97, respectively, indicating a very high ability of the model to separate these two classes from the other classes. The balanced accuracy values of 0.50 and 0.49 for Gypsic Haploxerepts and Lithic Xerorthents show that the model has more difficulty differentiating these minority classes than the other classes, although it can predict the classes relatively well.
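The accuracy measures named above can be computed from a confusion matrix. The sketch below assumes the usual definitions: balanced accuracy as the mean of sensitivity and specificity, Cohen's kappa from observed and chance agreement, user's accuracy per predicted class, and producer's accuracy per reference class. The function names and the small confusion matrix are hypothetical. Only the sensitivity and specificity values are taken from the abstract.

```python
def balanced_accuracy(sensitivity, specificity):
    """Mean of sensitivity and specificity for a single class."""
    return (sensitivity + specificity) / 2.0


def map_accuracy(matrix):
    """Overall accuracy, kappa, user's and producer's accuracy.

    Assumes rows are predicted (map) classes and columns are reference classes.
    """
    n = sum(sum(row) for row in matrix)
    k = len(matrix)
    diagonal = [matrix[i][i] for i in range(k)]
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]

    overall = sum(diagonal) / n
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    kappa = (overall - expected) / (1.0 - expected)
    users = [d / r if r else float("nan") for d, r in zip(diagonal, row_totals)]
    producers = [d / c if c else float("nan") for d, c in zip(diagonal, col_totals)]
    return overall, kappa, users, producers


# Sensitivity and specificity reported in the abstract for the minority classes.
reported = {
    "Gypsic Haploxerepts": (0.0, 1.0),
    "Lithic Xerorthents": (0.0, 0.97),
}
balanced = {name: balanced_accuracy(se, sp) for name, (se, sp) in reported.items()}
print(balanced)

# Hypothetical 3-class confusion matrix, NOT data from the study.
toy_matrix = [
    [8, 2, 0],
    [1, 6, 1],
    [0, 1, 1],
]
overall, kappa, users, producers = map_accuracy(toy_matrix)
print(overall, kappa, users, producers)
```

Under this definition, the reported sensitivity and specificity give balanced accuracies of 0.50 and 0.485, in line with the reported 0.50 and 0.49. A sensitivity of zero combined with a specificity near 1 is the pattern produced when a model almost never assigns any profile to that class.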
Conclusion: The results confirm that correcting class imbalance with a cost-sensitive learning approach increases the accuracy of soil class prediction and of the maps produced. Cost-sensitive learning focuses the model on the classes with few observations (the minority classes), which reduces prediction error and increases model accuracy. The results showed that the random forest algorithm combined with the cost-sensitive learning approach can substantially improve the discrimination of soil classes, especially minority classes.
