Plant Leaf Disease Detection Using Efficient Image Processing and Machine Learning Algorithms

Abstract: India is often described as a country of villages, where a majority of the population depends on agriculture for its livelihood. India's agricultural land covers approximately 159.7 million hectares. Agriculture plays a pivotal role in India's Gross Domestic Product (GDP), accounting for about 18% of the nation's economic output. Diseases and pests can have detrimental effects on crops, leading to reduced yields. These challenges include the spread of plant diseases, infestations by insects or other pests, and the overall degradation of crop health. Early detection of crop diseases is crucial because it allows for prompt intervention, such as applying appropriate pesticides or taking preventive measures. The main aim of this study is to develop a highly effective method for plant leaf disease detection using computer vision techniques. Here, leaf disease detection comprises histogram equalization, denoising, and image color threshold masking, followed by feature descriptors such as Haralick textures, Hu moments, and color histograms to extract the salient features of leaf images. These features are then used to classify the images by training Logistic Regression, Linear Discriminant Analysis, K-Nearest Neighbor, Decision Tree, Random Forest, and Support Vector Machine algorithms using K-fold validation. K-fold validation separates the validation samples from the training samples, and K indicates the number of times this is repeated to assess generalization. The training and validation processes are performed in two approaches. The first approach uses default hyperparameters with segmented and non-segmented images. In the second approach, the hyperparameters of the models are optimized to train on segmented datasets. Segmentation improved the classification accuracy by 2.19%, and hyperparameter tuning improved it by a further 0.48%. The highest average classification accuracy of 97.92% is achieved using the Random Forest classifier to classify 40 classes of 10 different plant species. Accurate detection of plant disease supports the sustained growth of plants throughout their growing span.


I. INTRODUCTION
Agriculture serves as the primary source of income for over 58% of India's population [1]. As of April 2023, India is home to more than 96 million farmers. The agriculture sector contributes over 18% of India's GDP [2]. Notable commercial crops in India include potatoes, tomatoes, mangoes, apples, grapes, peppers, soybeans, cotton, jute, tobacco, coffee, tea, and mustard [3].
Potatoes and tomatoes are prominent crops grown globally. India contributes approximately 11% of the world's tomato production [4] and exports a significant portion of its tomatoes to countries such as Pakistan, Bangladesh, the Maldives, the United Arab Emirates (UAE), and the United States [5].
The yield of tomatoes or any other crop depends on numerous factors, including soil fertility, environmental conditions, pests, and diseases. Diseases are significant contributors to crop losses, making early detection crucial. Detecting diseases through visual inspection can be challenging, especially when cultivating a variety of crops, even for experienced pathologists [6], [7]. In rural areas, open-eye inspection remains a common method for disease classification [8]. However, the reliance on visual methods can lead to delays in disease identification due to a shortage of experts in rural areas [9]. Disease detection using automated systems allows for early intervention, helping farmers to implement timely and targeted control measures. Early detection enables the implementation of effective strategies to contain or eradicate the disease before it spreads extensively.
Advancements in technology can transform the lives of farmers, providing them with a range of automated systems. Farmers can easily capture images of plant parts using standard digital cameras and upload them to disease detection systems, which provide information about treatment options and recommended pesticides [10]. Bacteria and fungi often cause plant diseases that can affect various plant parts, including leaves, stems, and roots [11], [12]. Since many disease symptoms manifest in the leaves, numerous researchers have focused on leaf disease detection using image processing and computer vision techniques.
Image processing and computer vision techniques are used to extract shape and texture features [13]-[21]. Among these methods, the combination of machine learning algorithms with image texture features is widely applied in plant disease detection. Notable machine learning algorithms include the Random Forest Classifier (RFC), Logistic Regression Classifier (LRC), Support Vector Machine (SVM), Decision Tree Classifier (DTC), Linear Discriminant Analysis (LDA), and K-Nearest Neighbor (K-NN) [22]. Deep Convolutional Neural Networks (CNNs) also play a pivotal role in extracting complicated patterns to identify plant diseases.
In natural environmental conditions, plant disease detection faces numerous challenges, including noise and low contrast in lesion images, as well as small differences between the background and the lesion area [23]. To address these challenges, a novel technique is proposed that utilizes efficient image processing and machine learning classification techniques. In the proposed methodology, histogram equalization is used to enhance image quality, and a color denoising technique is used to eliminate noise. Subsequently, the leaf area is separated from the background using threshold masking [24]. Texture and color features of the image are extracted, including Hu moments, Haralick textures, and color histograms [25]. These features are then employed for classification through the application of machine learning algorithms.
The proposed work mainly highlights:
1. The importance of image pre-processing in plant disease detection to improve the classification accuracy.
2. The importance of choosing the optimum hyperparameters for machine learning algorithms for accurate disease classification.
Organization of the manuscript: Section II discusses related work on plant leaf disease detection carried out by researchers globally. Section III explains the proposed methodology and the datasets used in the research, with emphasis on the novelty of the proposed methodology. In Section IV, the experimental results are discussed qualitatively and quantitatively using the evaluation metrics. Finally, the conclusion and future work are given in Section V.

II. RELATED WORK
Historically, considerable research has focused on plant leaf disease detection using image processing techniques. The most recent techniques for plant leaf disease detection are reviewed in [26]-[30]. In recent years, there has been a growing emphasis on the use of machine learning for leaf disease detection. M. R. Raigonda et al. [31] implemented a preprocessing and image segmentation approach to accurately identify leaf diseases in potato plants. Images were first sharpened through contrast enhancement and then denoised using median and Gaussian filters. To highlight the region of interest, they employed k-means clustering as an image segmentation method. Color, shape, and texture features were subsequently extracted and fed into the classifier, enabling accurate disease detection. Md. R. Mia et al. [32] employed an Artificial Neural Network (ANN) for mango leaf disease detection. They converted the original RGB leaf images to the LAB color space and used k-means clustering for segmentation. The cluster representing the disease-affected area was used to extract 13 features, including contrast, energy, correlation, mean, moment, and standard deviation. These features were then used to train the machine learning system to recognize leaf disease.
In the study by S. S. Harakannanavar et al. [33], images were resized, and their quality was improved through histogram equalization. Lesion areas were partitioned using k-means clustering, and image boundaries were extracted using contour tracing. Informative features from image samples were extracted using Principal Component Analysis (PCA), Discrete Wavelet Transforms (DWT), and the Grey Level Co-occurrence Matrix (GLCM). These features were employed to classify images using machine learning techniques such as KNN, SVM, and CNN. M. Badiger et al. [34] developed a leaf disease classifier using SVM. The authors standardized the image sizes and applied k-means clustering for image segmentation. The SVM classified diseases using GLCM features. A. S. Deshapande et al. [35] implemented a machine learning algorithm for disease classification in maize leaves. The authors utilized eighteen histogram features and eight Haar wavelet features with SVM and KNN classifiers. These classifiers achieved an accuracy of 85% for KNN and 88% for SVM. In another study [36], researchers focused on classifying diseased tobacco leaves with 120 leaf images. They implemented a CNN model and compared it with existing models, demonstrating an accuracy of 85.1% for their proposed model. A. K. Singh et al. [37] introduced two methods for classifying plant leaf diseases using the PlantVillage dataset. In the first method, they employed a CNN for image feature extraction, followed by classification using a Bayesian-optimized support vector machine. In the second method, features including the histogram of oriented gradients, color moments, and GLCM were extracted. Feature selection was performed using a binary particle swarm optimizer, and the selected features were used for image classification with a random forest classifier. P. Shetty et al. [38] focused on classifying diseases in tomato plant leaves using image processing and machine learning classifiers. They aimed to classify four diseases (Leaf mold, Late blight, Bacterial spot, and Early blight) using Linear Discriminant Analysis, Logistic Regression, KNN, Decision Tree, SVM, Naïve Bayes, and Random Forest. Experimental results revealed that the Random Forest classifier outperformed the other classifiers in terms of classification accuracy.
Bijaya Hatuwal et al. [39] classified multiple plant leaf diseases using SVM, random forest, k-nearest neighbor, and CNN models. For the CNN, images were used directly for training and classification, while the other three models utilized image features. Features such as entropy, inverse difference moments, contrast, and correlation were extracted using Haralick textures. Among the models used, RFC, SVM, and KNN achieved classification accuracies of 87.43%, 78.61%, and 76.96%, respectively, while the CNN achieved an accuracy of 97.89%.
Transfer learning with pre-trained deep convolutional neural networks was applied in [8].
Existing methods for plant leaf disease detection face challenges in providing accurate output. Leaf overlap, poor lighting, and random air flow are the major issues when capturing images in real-time environmental conditions, as these conditions can obscure the lesion area, necessitating proper image pre-processing techniques. Additionally, existing algorithms tend to consume substantial time due to their complexity. To address these issues, the proposed work uses efficient image pre-processing techniques such as image denoising, image enhancement, and segmentation along with machine learning algorithms to produce accurate results with reduced processing time and complexity.

III. MATERIALS AND METHODS
The block diagram of the proposed leaf disease detection model is shown in Fig. 1. The proposed model is developed using digital image processing and machine learning approaches.

Fig. 1. The proposed leaf disease detection model
The experimentation is performed on the publicly available PlantVillage [41] and MangoLeafDB [42] datasets. Pre-processing techniques such as image resizing, histogram equalization, Gaussian denoising, and segmentation are applied to all the images in the database to make the feature extraction and classification process accurate. Image texture and color features are extracted from the pre-processed images and used to classify each image as healthy or as a specific disease type using machine learning classification algorithms.

A. Dataset
The PlantVillage dataset comprises 38 classes of leaves from different plants, while the MangoLeafDB dataset consists of 7 unhealthy classes and 1 healthy class of mango leaves. A total of 41,546 images across 40 classes are chosen from the combined dataset, as shown in Table I. The choice of data split is predominantly influenced by the dataset size, and since the number of images is deemed sufficient for generalization purposes, a ratio of 80:20 has been selected to balance training and testing without encountering overfitting concerns.

B. Image Pre-processing
Image pre-processing is a crucial step in computer vision-based image processing systems, as its primary purpose is to enhance the accuracy of image classification [43].
Image Resize: In this study, all the images in the datasets are of size 256×256 pixels. This consistent sizing ensures that the results can be directly compared with existing models. In cases where images deviate from this specified size, they are resized to 256×256 to maintain uniformity and enable fair comparisons.
Image Enhancement: Subsequently, the adaptive histogram equalization (AHE) technique is applied to enhance image contrast [44]. AHE improves the visibility of details in both bright and dark regions by dividing the image into smaller regions and applying histogram equalization independently to each of these regions. This adaptability allows AHE to handle varying illumination conditions within an image.
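The per-region equalization described above can be sketched in NumPy as follows. This is a simplified illustration, not the implementation used in the study: the function names and the tile size are assumptions, and practical AHE implementations additionally interpolate between neighboring tiles (and often clip the histogram) to avoid visible block edges.

```python
import numpy as np

def equalize_hist(gray: np.ndarray) -> np.ndarray:
    """Histogram-equalize one 8-bit grayscale region."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]            # CDF at the darkest occupied level
    total = gray.size
    if total == cdf_min:                 # constant region: nothing to stretch
        return gray.copy()
    # Map each gray level through the normalized cumulative distribution.
    lut = np.round((cdf - cdf_min) / (total - cdf_min) * 255.0)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[gray]

def tiled_equalize(gray: np.ndarray, tile: int = 8) -> np.ndarray:
    """Apply equalization independently to each tile (assumed tile size)."""
    out = np.empty_like(gray)
    for r in range(0, gray.shape[0], tile):
        for c in range(0, gray.shape[1], tile):
            block = gray[r:r + tile, c:c + tile]
            out[r:r + tile, c:c + tile] = equalize_hist(block)
    return out
```

Because each tile is stretched using only its own histogram, a dark corner and a bright center of the same leaf image are both spread over the full intensity range.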
Image Denoising: In the next step, a color image denoising technique is used to reduce image noise. In this process, RGB images are converted to the CIE LAB color space, and the L and AB components are denoised separately using a Gaussian filter before being converted back to the RGB color space [45].
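A minimal sketch of the Gaussian filtering step on a single channel is shown below. It assumes the channel (for example L, A, or B) has already been separated; the color-space conversion is omitted, and the kernel size and sigma are illustrative values rather than the settings used in the study.

```python
import numpy as np

def gaussian_kernel(size: int = 5, sigma: float = 1.0) -> np.ndarray:
    """Normalized 1-D Gaussian kernel (size is assumed to be odd)."""
    ax = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(channel: np.ndarray, size: int = 5, sigma: float = 1.0) -> np.ndarray:
    """Separable Gaussian blur of one image channel with reflected borders."""
    k = gaussian_kernel(size, sigma)
    pad = size // 2
    padded = np.pad(channel.astype(np.float64), pad, mode="reflect")
    # Horizontal pass, then vertical pass; "valid" restores the input shape.
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)
```

Filtering the lightness and chroma channels separately lets the amount of smoothing differ between them, which is the motivation for working in LAB rather than RGB.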
Image Segmentation: Segmentation is employed to extract the leaf from the image by suppressing the background pixels. In this study, a threshold-based segmentation method [46] is utilized, in which green and brown masks are individually created with their respective lower and upper threshold values. These threshold values are set based on the background in the image. The final mask is generated by combining the green and brown masks. A logical 'AND' operation is then applied between the pre-processed input image and the final mask to remove the background from the leaf. The output of the pre-processing steps is shown in Fig. 2(a-d).
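The masking logic can be sketched as below. Two points are assumptions made for illustration: the masks are computed on an HSV version of the image, and the green and brown masks are combined with a logical OR. The threshold values passed in are hypothetical; the study sets them per dataset based on the background.

```python
import numpy as np

def in_range(img: np.ndarray, lower, upper) -> np.ndarray:
    """Boolean mask of pixels whose channels all lie within [lower, upper]."""
    lower, upper = np.asarray(lower), np.asarray(upper)
    return np.all((img >= lower) & (img <= upper), axis=-1)

def segment_leaf(rgb: np.ndarray, hsv: np.ndarray, green_bounds, brown_bounds):
    """Keep pixels that fall inside the green or the brown color range."""
    green = in_range(hsv, *green_bounds)
    brown = in_range(hsv, *brown_bounds)
    mask = green | brown                      # combined final mask (assumed OR)
    # Applying the mask zeroes out every background pixel.
    segmented = rgb * mask[..., None].astype(rgb.dtype)
    return segmented, mask
```

Including the brown range matters because lesions are often brown; a green-only mask would remove the diseased area together with the background.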

C. Feature Extraction
Features are extracted from pre-processed images using Hu moments, Haralick textures, and color histogram feature descriptors. Hu moments provide an array of shape descriptors, calculated over a single channel of an image, to describe the leaf boundary. Haralick textures are used to differentiate texture features in leaf images. These texture features at pixel positions (i, j) are based on the frequency of pixel i occurring next to pixel j. Common texture features used in image classification problems include energy, entropy, homogeneity, autocorrelation, cross-correlation, dissimilarity, average, sum of squares, and variance. In leaf disease classification, these features describe the shape and texture of the disease-affected area. To compute Hu moments and Haralick features, the segmented RGB images are first converted to grayscale.
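To make the co-occurrence idea concrete, the sketch below builds a normalized gray-level co-occurrence matrix for one pixel offset and derives four of the texture measures from it. It is an illustrative NumPy version with assumed function names; it covers only a subset of the Haralick features and is not the library routine used in the study.

```python
import numpy as np

def glcm(gray: np.ndarray, levels: int, dx: int = 1, dy: int = 0) -> np.ndarray:
    """Symmetric, normalized co-occurrence matrix for the offset (dx, dy)."""
    m = np.zeros((levels, levels), dtype=np.float64)
    h, w = gray.shape
    for r in range(h - dy):
        for c in range(w - dx):
            i, j = gray[r, c], gray[r + dy, c + dx]
            m[i, j] += 1
            m[j, i] += 1          # count the pair in both directions
    return m / m.sum()

def texture_features(p: np.ndarray) -> dict:
    """Contrast, energy, homogeneity, and entropy of a normalized GLCM."""
    i, j = np.indices(p.shape)
    nonzero = p[p > 0]
    return {
        "contrast": float(np.sum(p * (i - j) ** 2)),
        "energy": float(np.sum(p ** 2)),
        "homogeneity": float(np.sum(p / (1.0 + (i - j) ** 2))),
        "entropy": float(-np.sum(nonzero * np.log2(nonzero))),
    }
```

A uniform region gives zero contrast and maximal energy, while a region with alternating gray levels gives high contrast, which is why such measures help separate smooth healthy tissue from mottled lesions.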
The detailed representation of colors in the image is obtained by calculating the color histogram. These color histograms help to differentiate the color changes in the disease-affected region across different disease classes. Since the HSV model closely aligns with the human eye's ability to perceive colors [47], input RGB images are converted to the HSV color space, and then the histogram is calculated over the HSV color space. This histogram provides information about the number of pixels that represent a given color range. All the features, including Hu moments, Haralick textures, and the color histogram, are combined into a feature vector. This feature vector serves as input to the classifiers for recognizing the image class.
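The histogram and the concatenation step might look like the following sketch. The bin count of 8 per channel, the function names, and the use of the standard-library `colorsys` conversion are assumptions for illustration; the Hu and Haralick vectors are taken as already computed.

```python
import colorsys
import numpy as np

def hsv_histogram(rgb: np.ndarray, bins=(8, 8, 8)) -> np.ndarray:
    """Normalized, flattened 3-D HSV histogram of an 8-bit RGB image."""
    pixels = rgb.reshape(-1, 3) / 255.0
    hsv = np.array([colorsys.rgb_to_hsv(*p) for p in pixels])
    hist, _ = np.histogramdd(hsv, bins=bins, range=[(0, 1)] * 3)
    hist = hist.ravel()
    return hist / hist.sum()

def build_feature_vector(hu: np.ndarray, haralick: np.ndarray,
                         color_hist: np.ndarray) -> np.ndarray:
    """Concatenate shape, texture, and color descriptors into one vector."""
    return np.concatenate([hu, haralick, color_hist])
```

With 8 bins per channel the histogram contributes 512 values, so it dominates the length of the feature vector; this is one reason the features are normalized before classification.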

D. Classification
The extracted features are normalized and then used for training the classifier. Training is performed using machine learning algorithms such as logistic regression, linear discriminant analysis (LDA), K-nearest neighbor (K-NN), Decision Tree Classifier (DTC), Random Forest Classifier (RFC), and support vector machine (SVM).
The logistic regression model converts the continuous output of a linear function of the features into a class probability by applying a sigmoid function. The sigmoid function maps any real-valued input to a value between 0 and 1 [48]. Additionally, extensions such as one-vs-rest enable logistic regression to handle multiclass classification problems. The model produces a coefficient for each feature, indicating the strength and direction of its influence on the predicted outcome.
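In standard notation, the mapping described above can be written as follows, where the weight vector w, the bias b, and the class index c are symbols introduced here for illustration:

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = \mathbf{w}^{\top}\mathbf{x} + b, \qquad
P(y = 1 \mid \mathbf{x}) = \sigma(z)

% One-vs-rest: one binary model per class c, predict the most confident one
\hat{y} = \arg\max_{c} \; \sigma\!\left(\mathbf{w}_c^{\top}\mathbf{x} + b_c\right)
```

The sign of each entry of w gives the direction of a feature's influence, and its magnitude gives the strength, provided the features are on comparable scales.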
LDA operates by reducing the dimensionality of the data while enhancing class separation. This is achieved by identifying a set of linear discriminants that maximize the ratio of between-class variance to within-class variance. In simpler terms, LDA identifies the optimal directions in the feature space to effectively distinguish between various data classes [49].
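The variance ratio mentioned above is commonly expressed as the Fisher criterion. The scatter-matrix symbols below follow the usual textbook convention and are not taken from the paper:

```latex
J(\mathbf{w}) = \frac{\mathbf{w}^{\top} S_B \, \mathbf{w}}{\mathbf{w}^{\top} S_W \, \mathbf{w}}, \qquad
S_B = \sum_{c} N_c (\boldsymbol{\mu}_c - \boldsymbol{\mu})(\boldsymbol{\mu}_c - \boldsymbol{\mu})^{\top}, \qquad
S_W = \sum_{c} \sum_{\mathbf{x} \in c} (\mathbf{x} - \boldsymbol{\mu}_c)(\mathbf{x} - \boldsymbol{\mu}_c)^{\top}
```

Here N_c and mu_c are the sample count and mean of class c, and mu is the overall mean. The directions w that maximize J are the linear discriminants.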
KNN is a supervised learning algorithm that assumes samples of the same class are similar in the feature space. To identify the class of a sample, the algorithm considers the k closest neighbors of the sample and then applies simple rules for classification [50].
The Decision Tree algorithm involves predefined target variables and constructs a tree-like structure consisting of multiple branches and leaf nodes. Each leaf node represents a specific decision, while each branch node signifies a choice among various alternatives. The decision tree outputs a Yes/No decision based on an input object that describes a set of properties [51]. Decision Trees can model non-linear relationships between features and the target variable. This flexibility allows them to capture complex patterns in image data, which might be challenging for linear models.
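As an illustration of the nearest-neighbor rule described in this section, the sketch below classifies a query by majority vote among its k closest training samples. Euclidean distance and majority voting are assumed here as the "simple rules"; the feature values and labels in the usage note are made up.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Return the most common label among the k nearest training samples."""
    # Pair every training sample's distance to the query with its label.
    dists = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]
```

For example, with two "healthy" samples near (1, 1) and three "blight" samples near (5, 5), a query at (1.1, 1.0) with k=3 picks up both healthy neighbors and one blight neighbor, so the vote goes to "healthy".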
The Random Forest Classifier consists of a number of decision trees. The final output of this classifier depends on the outcomes of the individual decision trees. This algorithm is used for both regression and classification. It outputs the mean prediction in regression problems and the majority class in classification [52]. By combining the predictions of multiple trees, the model tends to generalize well to unseen data and reduces the risk of overfitting.
Support Vector Machine is one of the most widely used machine learning algorithms, especially for classification tasks. SVM separates classes in a multi-dimensional feature space by constructing decision boundaries called hyperplanes between the classes [53]. This means that the samples on one side of a hyperplane belong to one class, while those on the other side belong to another class. SVM has the capability to fit complex datasets and exhibits good generalization properties [54].
The proposed approach aims to improve leaf image analysis through the implementation of efficient noise-removal and background-removal techniques. The primary objective is to ensure image clarity by eliminating noise and removing the background without affecting the lesion area. The methodology underscores the utilization of uncomplicated machine learning algorithms to maintain minimal model complexity while still achieving high classification accuracy.

IV. EXPERIMENTAL RESULTS
This section presents the simulation outcomes of the proposed model. During the experimentation process, image pre-processing, feature extraction, and image classification were performed using Jupyter Notebook with Python 3.11, along with libraries such as OpenCV, Keras, the os module, the glob module, and GridSearchCV.
The hardware setup consisted of an Intel(R) Core(TM) i5-4200U CPU running at 1.60 GHz with a maximum turbo frequency of 2.30 GHz and 4 GB of RAM. This configuration was utilized for training the classifiers and evaluating the performance of the proposed model.
The classifiers were trained using features obtained from each image. Six machine learning classifiers were trained and validated using the K-fold cross-validation technique. K-fold validation is a widely used technique for validating machine learning algorithms. In this technique, the available data is split into K sample planes (folds), and K iterations are performed for validation. In each iteration, one sample plane is used for validation and the other K − 1 sample planes are used for training. This process is repeated K times so that every sample plane is used for validation exactly once. The final accuracy is calculated as the average accuracy over all K iterations.
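The splitting procedure described above can be sketched in plain Python as follows. The function names and the `fit_and_score` callback are hypothetical; in practice the samples would usually be shuffled (and often stratified by class) before splitting.

```python
def k_fold_indices(n_samples: int, k: int):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

def cross_val_accuracy(fit_and_score, n_samples: int, k: int) -> float:
    """Average the validation accuracy returned for each fold."""
    scores = [fit_and_score(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)
```

Here `fit_and_score(train, val)` stands for any routine that trains a classifier on the training indices and returns its accuracy on the validation indices.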
The experimentation is conducted in two approaches using nine classes of tomato leaf images. In the first approach, the classifiers' performance on segmented and non-segmented images is evaluated. In the second approach, the classifiers' performance is evaluated after choosing optimal hyperparameters. Later, the optimized model is used to classify all the other images in the dataset.
The model is evaluated using performance metrics such as accuracy, precision, recall, and F1-score. Accuracy is a measure of overall correctness in a model, representing the ratio of correctly predicted instances to the total instances. Precision gauges the accuracy of positive predictions by calculating the ratio of true positives to the sum of true positives and false positives. Recall, or sensitivity, measures a model's ability to identify all relevant instances by calculating the ratio of true positives to the sum of true positives and false negatives. The F1-score, the harmonic mean of precision and recall, offers a balanced evaluation of a model's performance.
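These definitions translate directly into code. The sketch below computes the metrics for one class treated as positive; the function names are illustrative, and for a multiclass problem such as this one the per-class values would typically be averaged across classes.

```python
def accuracy(y_true, y_pred) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1-score for the class named `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

The zero-division guards return 0.0 when a class is never predicted or never present, which is a convention chosen here rather than part of the definitions.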

A. Performance of Classifiers on Segmented and Non-Segmented Images
Initially, the features of tomato leaf images are extracted directly from denoised images without segmentation. These features are then used to train classifiers employing the K-fold validation technique. The classification accuracy is observed to vary with the choice of the parameter K. Subsequently, in the image pre-processing stage, image segmentation is carried out to eliminate the background from the leaf images. Features are then extracted from these segmented images. Table III displays the classifier performance on these segmented images for various values of K, enabling a comparison of their effectiveness in this segmented context.

The comparison between Table II and Table III reveals that the introduction of image segmentation has a beneficial impact on image classification performance. Furthermore, it is noteworthy that the Random Forest classifier consistently achieves the highest accuracy in both approaches across all values of K. The accuracy tends to increase initially with an increase in the number of cross-validation folds (K) and reaches its peak at K = 30. Specifically, the Random Forest classifier achieves a maximum accuracy of 94.43% without image segmentation and an improved accuracy of 96.62% with image segmentation, both occurring at K = 30. Fig. 3 visually depicts the comparative accuracy of all the classifiers, showcasing the advantage of image segmentation in enhancing classification results, especially at K = 30. In the segmentation process, the leaf background is removed, thereby eliminating unwanted information in the image, which leads to improved accuracy.

B. Performance of Classifiers on Tuning Hyperparameters
The classification accuracy of a classifier is influenced by a variety of training parameters, including the number of trees, kernel size, penalty parameter, and class weights.
To investigate the impact of the number of trees in the Random Forest classifier, we conducted an ablation study. The Random Forest classifier consists of multiple decision trees built on different subsets of the dataset. It aggregates predictions from each tree and makes a final prediction based on majority votes. Fig. 4 illustrates the performance of the Random Forest classifier on the tomato leaf dataset. This visualization allows us to assess how the number of trees affects the classifier accuracy.
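The majority-vote aggregation can be illustrated with a few lines of Python. The "trees" below are hypothetical stand-in functions that each return a class label; training the individual trees on data subsets is outside the scope of this sketch.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]

def forest_predict(trees, sample):
    """Collect one prediction per tree and aggregate them by majority vote."""
    return majority_vote([tree(sample) for tree in trees])
```

With more trees, a single tree's mistake is more likely to be outvoted, which is consistent with the initial rise in accuracy as trees are added.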
In Fig. 4, the classification accuracy initially increases with the number of trees. The maximum test accuracy of 97.14% is recorded for 300 trees. Increasing the number of trees also increases the computational complexity and therefore the training time. In addition, with a very large number of trees the model may overfit the dataset it was trained on, reducing model accuracy, as observed beyond 300 trees.
Every machine learning algorithm is a mathematical model defined by a number of parameters that are learned from data. Settings that cannot be learned directly by the regular training process are called "hyperparameters". Each combination of model and dataset needs a different set of hyperparameters. One way to determine suitable hyperparameter values is through multiple experiments, each of which picks a set of hyperparameters and trains the model; this is called hyperparameter tuning. After multiple experiments, the best set of hyperparameters is chosen by evaluating the accuracy or loss. Several automated methods are available for this process: Bayesian optimization, grid search, and random search. Grid search trains and evaluates the model for every combination in a predefined grid of hyperparameter values, whereas random search and Bayesian optimization evaluate only a subset of the possible combinations. From these evaluations the best set of hyperparameters is chosen; this process is called hyperparameter optimization.
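The exhaustive search performed by grid search can be sketched with the standard library as below. This is a simplified stand-in for illustration, not the GridSearchCV call used in the study; the parameter names, grid values, and scoring function in the usage note are hypothetical.

```python
from itertools import product

def grid_search(param_grid: dict, score_fn):
    """Evaluate every combination in the grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)          # e.g. mean K-fold accuracy
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

For example, a grid such as `{"n_trees": [100, 200, 300, 400], "max_depth": [5, 10, 20]}` results in 12 evaluations of `score_fn`. The cost grows multiplicatively with each added hyperparameter, which is the main drawback of exhaustive search.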
The performance metrics for all six classifiers on the tomato leaf dataset using hyperparameter optimization are shown in Table IV.
The use of machine learning algorithms for plant leaf disease detection is a promising approach for detecting diseases early, before they spread over the farm. From Table III and Table IV, it is clear that image background removal using the segmentation technique improves the accuracy of image classification and that the random forest classifier performed best among the machine learning algorithms. The performance of the algorithms varied with the value of K in K-fold validation. Each classification model has its own training parameters that must be tuned for accurate training and classification. The results tabulated in Table V show that the best classification results are obtained with optimized hyperparameters.
The use of pre-processing techniques such as image enhancement, denoising, and threshold-based segmentation helped to identify the diseased parts easily, and this led to improved classification accuracy in the proposed model compared to other state-of-the-art methods, as given in Table VI.
The proposed model can identify plant disease with less computational complexity, and its classification accuracy is higher even than that of CNN models, which are computationally very heavy. The proposed model can be deployed as a mobile application to which farmers upload images of leaves. It can also be extended to provide recommendations on the health of the leaf and on possible pesticides to eradicate the disease, thereby increasing the crop yield.
The proposed algorithm presents various benefits, especially in terms of its size and resource demands. The algorithm is less computationally intricate, enhancing its suitability for a broader array of hardware. Nevertheless, it is crucial to acknowledge certain limitations in this study. The images in the database adhere to a standardized size and are captured under controlled conditions. Consequently, the effectiveness of the proposed algorithm has not been evaluated on open-field datasets or real-time images. To further enhance the model's performance in future work, potential avenues for improvement include the incorporation of fusion techniques for image feature extraction and the inclusion of diverse plant leaf datasets to increase the model's robustness and generalizability.

V. CONCLUSION
The objective of the proposed work is to classify leaf diseases in agricultural crops using efficient image processing and machine learning algorithms. The use of computer vision techniques helps to detect diseases in their early stages with minimal time and to avoid the spread of disease over the fields, which leads to improved crop yields. To achieve this, the proposed model employs image processing techniques that encompass image resizing, enhancement, denoising, and threshold-based segmentation. Moreover, the machine learning algorithm utilizes multiple feature descriptors, including Haralick textures, Hu moments, and color histograms, to capture both texture and color characteristics from leaf images for disease classification.
The use of image segmentation and hyperparameter optimization enhances the classification accuracy by 3.82% with a random forest classifier (RFC). It is observed that the random forest classifier stands out as a particularly suitable choice for leaf disease classification. Superior classification accuracy is achieved by RFC compared to the other classifiers. RFC combines the predictions of multiple trees, tends to generalize well to unseen data, and reduces the risk of overfitting. Notably, the proposed model with RFC classified the tomato leaf dataset with 98.02% accuracy, while the nearest competitor [40] obtained 97.52%. Compared to that model, the proposed model achieved an improvement of approximately 0.5% in accuracy, and it also outperformed other state-of-the-art methods. The database used in this research was captured under controlled conditions, and the effectiveness of the algorithm has not been evaluated on an open-field dataset. In the future, the use of image pre-processing techniques with deep CNN models may improve the disease classification accuracy.
The proposed model can be implemented as a standalone application to effectively classify diseased leaf images in the early stages. This aids farmers in proper crop management. Detection of diseases in the early stages prevents the disease from spreading over the crops and improves the crop yield, which contributes to a global improvement in food production.

Fig. 2. (a-d) The output of the pre-processing steps for a tomato leaf from the PlantVillage dataset

Fig. 4. Performance of RFC with the number of trees
Journal of Robotics and Control (JRC), ISSN: 2715-5072. Kiran S M, Plant Leaf Disease Detection using Efficient Image Processing and Machine Learning Algorithms
The authors of [40] classified 38 classes of images from the PlantVillage database using transfer learning with AlexNet, InceptionV3, MobileNet, and a simple sequential model. Their experiments revealed a maximum classification accuracy of 97.52% for the MobileNet model.

TABLE I. DATASET SPECIFICATIONS
K-NN does not assume linear relationships between features. It can capture complex decision boundaries, making it suitable for tasks where classes are not easily separable by linear boundaries.
Table II presents the classification results achieved without image segmentation.It is evident from the table that the Random Forest classifier outperforms other classifiers in terms of accuracy.

TABLE II. CLASSIFIERS' PERFORMANCE ON NON-SEGMENTED IMAGES

TABLE III. CLASSIFIERS' PERFORMANCE ON SEGMENTED IMAGES
Comparing Table III and Table IV, the classification results are better with hyperparameter optimization, as shown in Table IV. During hyperparameter optimization, GridSearchCV evaluates the model performance with multiple combinations of parameters and automatically chooses the optimum parameters for better classification accuracy. Table V shows the classification results for all six classifiers on all 40 classes of images. It is evident from Table V that the random forest classifier performed best in classifying all ten types of plants, with an average accuracy of 97.92%, followed by the SVM classifier, which has an average classification accuracy of 96.91%.

TABLE IV. CLASSIFIERS' PERFORMANCE ON TOMATO LEAF DATASET WITH HYPERPARAMETER OPTIMIZATION

TABLE VI. COMPARISON OF THE PROPOSED METHOD WITH STATE-OF-THE-ART METHODS