Disease Detection of Solanaceous Crops Using Deep Learning for Robot Vision

Traditionally, farmers manage crops from the early growth stage until harvest by manually identifying and monitoring plant diseases and nutrient deficiencies and by controlling irrigation, fertilizers, and pesticides. Even farmers have difficulty detecting crop diseases with the naked eye because several diseases look similar. Identifying the correct disease is crucial since it can improve the quality and quantity of crop production. With the advent of Artificial Intelligence (AI) technology, crop-management tasks can be automated using a robot that mimics a farmer's ability. However, designing a robot with human capability, especially for detecting crop diseases in real time, remains a challenge. Other research works focus on improving the mean Average Precision (mAP), and the best result reported so far is 93% mAP with YOLOv5. This paper focuses on Convolutional Neural Network (CNN)-based object detection to detect diseases of solanaceous crops for robot vision. The contribution of this study is to report the development details and suggest solutions for issues that arose during the study. In addition, the output of this study is expected to become the algorithm of the robot's vision. This study uses images of four crops (tomato, potato, eggplant, and pepper), covering 23 classes of healthy and diseased leaves and fruits. The dataset combines a public dataset (PlantVillage) and self-collected samples. The total dataset of all 23 classes is 16580 images divided into three parts: a training set (88%, 15000 images), a validation set (8%, 1400 images), and a testing set (4%, 699 images). YOLOv5 was the more robust model, reaching 94.2% mAP, and was slightly faster than Scaled-YOLOv4. This object detection-based approach has proven to be a promising solution for efficiently detecting crop disease in real time.


INTRODUCTION
Agriculture has long been a vital economic and social sector. It is difficult for human workers to detect crop diseases accurately at an early stage, which is needed to improve the quality and quantity of crop production. Crop diseases arise from many factors, such as shifting weather, lack of nutrition, and pest attacks. In general, crop disease detection is carried out manually using visual inspection or microscope techniques, which are time-intensive and prone to inaccuracy because observers differ in what they see and report [1]. Mistakes are usually unavoidable when relying on manual labor, especially when classifying the type of plant disease, because human eyes are prone to error and diagnosis is time-consuming. Consequently, disease and pest control remain a challenge for some local farmers [1], and disease detection requires regular crop monitoring throughout the growing period. One practical approach to resolving these issues is the development of an automated agricultural robot capable of detecting disease and monitoring field conditions while moving around the field. Developing the robot's vision is a hurdle, because the robot's vision should mimic the sight of human eyes [2]. The robot is expected to improve operational accuracy in the farming industry [3]. On top of that, robot motion planning in real-time applications is also one of the key areas of research in computer science and computational geometry [4].
Artificial Intelligence (AI) is a suitable basis for robot vision if we aim for an intelligent system. Since AI focuses on developing computer software that makes computing tasks smarter, the ultimate focus of AI research applications is to develop computational approaches for intelligent behaviour [5]. The demand for intelligent systems with real-time control in manufacturing processes and production is increasing rapidly [6]. The increasing power of computers and embedded computing has further contributed to AI advancement [7]. The most common application of AI in computer vision is face recognition [8], which is heavily deployed on smartphones. AI technology is widely used worldwide and positively impacts manufacturing, healthcare, and agriculture [9,10]. Agriculture is a major industry, contributing 30.7% to economic progress [11]. Agriculture is also a dynamic sector in which it is impossible to generalize situations to propose a standard solution [12]. In terms of accuracy and robustness, AI is well suited to supporting agricultural systems. An efficient AI system can help farmers monitor their crops, including detecting any crop disease [13].
Deep learning is an AI technique that simulates how humans acquire knowledge. It is a critical component of data science, covering statistics and prediction [14,15]. As one of the most powerful machine learning techniques, deep learning has been applied in various fields, including robotics and agriculture [16]. Today, deep  [10]. Deep learning models perform exceptionally well in prediction and classification because of their large learning capacity and highly hierarchical structure [11,12]. They are also flexible and adaptable to a wide range of challenges that are highly complex from a data analysis perspective [17].
CNN is one of the most prominent deep learning approaches [18,19]. A CNN is a deep learning algorithm that takes images as input and automatically extracts significant features in order to learn and ultimately classify the input images into their appropriate output classes [20,21]. Object detection has benefited from advances in network architecture, training techniques, and optimization functions [22,23]. Object detection has several models, such as YOLO [24]. Faster R-CNN used to be the main model for object detection. However, the inference speed of Faster R-CNN still does not match that of YOLO [22]. YOLO is an object detection method that acts as a real-time object detector [25]. Joseph Redmon created the original YOLO (You Only Look Once) model in a custom-built framework called Darknet. Darknet is a highly adaptable research framework written in low-level languages, and it has produced some of the most significant real-time object detectors, including YOLO, YOLOv2, YOLOv3, and YOLOv4; the most recent version is YOLOv5 [26,27].
In a previous study, Roy et al. (2021) reported that YOLOv3 reached 78% mAP, compared with 86% for YOLOv4, in detecting various plant disease classes [28]. Wu et al. (2021) applied YOLOv3 and YOLOv4 to an augmented dataset of 2670 images and achieved an accuracy above 90% for each model [29]. Thuan et al. (2021) showed that YOLOv5 is fast and reached a high accuracy of 93% when trained on 3422 images for 100 epochs, with one epoch taking only around 20 seconds to complete [30]. Among other related research works on robot vision, [31] implemented a U-Net architecture to detect bean leaves in images captured in uncontrolled environmental conditions, achieving an accuracy of 91.02%.
The contribution of this research is to present the detailed process of developing object detection algorithms (YOLOv5 and Scaled-YOLOv4) to detect diseases on the leaf and fruit of solanaceous crops, together with suggested solutions for the problems that arose during development. The output of this study is expected to become the algorithm of the robot's vision in real time. The performance of these models is compared in terms of precision, recall, mean average precision (mAP), and training time. This paper covers the related work and theory, the methods used for the whole simulation, the results and performance of the YOLOv5 model compared with the earlier Scaled-YOLOv4, and the conclusion of the overall work.

A. Deep Learning
Deep Learning is a type of machine learning that extends traditional machine learning by adding more "depth" (complexity) to the model and modifying the data using several features that allow data to be represented in a hierarchical form through multiple levels of abstraction [32]. If large datasets describing the problem exist, these complex models used in deep learning can reduce the errors, especially in regression problems, and improve classification accuracy [33].
A deep learning model is built from several types of layers, such as convolution, fully connected, and pooling layers. The main feature of deep learning is that the features in these layers are learned from data through a learning procedure instead of being designed by engineers [34]. Different arrangements of the layers create different network architectures, such as Convolutional Neural Networks, Recursive Neural Networks, Unsupervised Pretrained Networks, and Recurrent Neural Networks [35].

B. Convolutional Neural Network
Unlike other deep learning architectures, such as Recurrent Neural Networks or Long Short-Term Memory, the CNN architecture is preferred in image and video applications because its design exploits the spatial correlation of pixel intensities, which makes it more efficient for images [36]. The CNN model provides an essential visual feature extractor for crop diseases. It consists of three types of operation layers: convolutional layers, max-pooling layers, and fully connected layers, which act as automatic feature extractors in one single module during training [37]. It employs 2D convolutional layers, making this architecture better suited to interpreting 2D data, such as images, than other machine learning (ML) techniques [38]. CNN overcomes the limitation of the manual feature extraction process carried out by traditional ML techniques and can handle vast amounts of data [39], since the model extracts features directly from images. A CNN architecture consists of numerous layers that perform image processing operations: an input layer, multiple hidden layers, and an output layer. The hidden layers typically comprise several convolutional layers, pooling layers, and a set of fully connected layers to perform the classification task [40].
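To make the convolution and max-pooling operations described above concrete, the following is a minimal pure-Python sketch on a toy 5×5 image. The image, the edge kernel, and the function names are illustrative assumptions and are not taken from the models used in this study; a real CNN learns its kernels during training.

```python
def conv2d(image, kernel):
    """Valid 2D cross-correlation (stride 1) of a 2D list with a 2D kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """Element-wise rectified linear unit."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; trailing rows/columns that do not fit are dropped."""
    return [
        [
            max(feature_map[i + a][j + b]
                for a in range(size) for b in range(size))
            for j in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for i in range(0, len(feature_map) - size + 1, size)
    ]


# Toy image: dark on the left, bright on the right (a vertical edge).
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-picked kernel that responds to a left-to-right brightness increase.
edge_kernel = [[-1, 1], [-1, 1]]

features = max_pool2d(relu(conv2d(image, edge_kernel)))
print(features)
```

The pooled feature map is non-zero only where the edge lies, which is the kind of spatial feature a convolutional layer passes on to later layers.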

C. Object Detection
Traditional object detection algorithms use handcrafted designs and simple trainable architectures [41]. Their performance easily stagnates, even when complicated ensembles are built that mix several low-level picture features with high-level information from object detectors and image classifiers. A traditional object detection architecture consists of region candidate generation, feature extraction, and classification tasks. The detection result from the classifier is fed into the Non-Maximum Suppression (NMS) algorithm to optimize the results by resolving multiple overlapping bounding boxes [42]. With the rapid advancement of deep learning, more powerful tools that can learn semantic, high-level, and deeper features are being offered to address the issues of older systems [43]. Object detection is a computer vision technique for identifying and locating objects in images and videos [44], and it has been an active research area in computer vision for decades [45]. It deals with the visual detection of instances of a specific class, such as humans, vehicles, or animals. Object detection can count multiple objects in a scene, identify and trace their precise locations, and accurately label them [46]. In short, an object detection algorithm allows us to locate and predict the specific location of the desired object using bounding boxes [47]. Object detection models can be divided into two categories: two-stage target detection frameworks based on region proposal and one-stage target detection frameworks based on regression [48]. The variants of the YOLO family are examples of one-stage object detection methods. YOLO has been widely applied across various industries due to its suitability for embedded controller systems through transfer learning, plus its ability to act as a self-adaptive algorithm [49].
An adaptive neural network is usually applied when there is minimal prior knowledge of the environment [50]. The most recent version of YOLO is YOLOv5. Since its introduction, many researchers and industry players have deployed the YOLOv5 model for tasks such as crop recognition, yield estimation, and many more [47], [51]-[55].
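The Non-Maximum Suppression step mentioned in this subsection can be sketched as a greedy procedure: keep the highest-scoring box, discard every remaining box that overlaps it too much, and repeat. The sketch below is a generic illustration, not the implementation used by any particular YOLO release; the box format `(x1, y1, x2, y2)` and the threshold of 0.5 are assumptions.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of the boxes that survive suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest remaining confidence
        keep.append(best)
        # Drop boxes that overlap the kept box more than the threshold.
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # the second box overlaps the first and is suppressed
```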

D. YOLOv5
Glenn Jocher, the founder of Ultralytics, released an open-source implementation of the YOLOv5 model in June 2020 [56]. It is the first in the YOLO family to be released without a paper and is still under continuing development in its repository. YOLOv5 switched from Darknet to PyTorch, achieving 140 frames per second on a Tesla P100, compared to 50 frames per second for YOLOv4. YOLOv5 is suitable for real-time object detection and has many advantages over traditional object detectors [58]. YOLOv5 offers the same benefits as YOLOv4 and has a nearly identical architecture. Compared to YOLOv4, YOLOv5 is easier to train and to use for detecting objects [57].
The backbone, head, and detection are the three fundamental components of YOLOv5. A CNN serves as the backbone, gathering and shaping visual features at various levels of granularity. YOLOv5 uses the Cross Stage Partial (CSP) bottleneck to create image features. The head comprises layers that aggregate image features before they are passed to the prediction stage; PANet is implemented in YOLOv5 for this feature aggregation. The detection stage uses the features from the head to localize bounding boxes and predict class labels in the image [59]. Fig. 1 shows the architecture of YOLOv5.

A. Dataset
Adequate dataset samples are a prerequisite for any deep learning method to generalize well. The images for the dataset were captured with a mobile phone camera, downloaded from the PlantVillage dataset on Kaggle and GitHub, and found by searching Google Images for the names of solanaceous crop diseases. After the images were collected, they were separated according to class. The classes were named as either healthy or after the disease that infected the leaf or fruit. Healthy images were also collected so that the system can differentiate whether the solanaceous crops are healthy or infected. The initial number of samples is around 300 images for each class, and there are 23 classes in total. These classes are shown in TABLE 1. Roboflow is a free online platform for labeling and annotation that removes the need to install other software locally. It keeps the dataset secure and enables access from several devices, such as tablets or smartphones. Roboflow also provides dataset labeling, image annotation, preprocessing, augmentation, and other useful functions for handling the dataset, some of which were applied in this project.
At the early stage, after being uploaded to Roboflow, the images were split into a training set (70%), validation set (20%), and testing set (10%). Then, the images were labeled with their class names and annotated by drawing bounding boxes around the diseased and healthy areas of the leaves and fruits. Fig. 2 shows the annotation and labelling process on Roboflow.
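The 70/20/10 split described above was performed by Roboflow in this work. As a rough illustration of what such a split does, the following sketch shuffles a list of file names and cuts it by percentage. The function name, the fixed seed, and the file names are hypothetical.

```python
import random


def split_dataset(items, train_pct=70, valid_pct=20, seed=0):
    """Shuffle items reproducibly and split them into train/valid/test lists.

    Percentages are integers so the cut points are exact; whatever is left
    after the training and validation portions becomes the test set.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * train_pct // 100
    n_valid = len(items) * valid_pct // 100
    return (items[:n_train],
            items[n_train:n_train + n_valid],
            items[n_train + n_valid:])


# Hypothetical file names for one class of about 300 images.
files = [f"img_{i:04d}.jpg" for i in range(300)]
train_set, valid_set, test_set = split_dataset(files)
print(len(train_set), len(valid_set), len(test_set))
```

Splitting before augmentation, as done here, keeps augmented copies of a validation or test image out of the training set.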

C. Preprocessing
The raw data must be transformed into a format suitable for YOLOv5 through preprocessing. This procedure eliminates data discrepancies or duplication, which could otherwise degrade the accuracy of a model. Data preprocessing also helps ensure that no inaccurate or missing values exist because of human mistakes or bugs. Applying the same image transformations to all the images in this dataset saves training time and improves performance. The auto-orient option removes EXIF rotations and standardizes pixel order. The images are resized to 416×416 to standardize their size, and the smaller file size helps speed up training. Automatic contrast adjustment can help the model detect edges around the object. Roboflow also has a function for correcting samples that were labeled mistakenly. Fig. 3 shows an example of preprocessing in Roboflow.
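The effect of the 416×416 resize on the bounding-box annotations can be sketched as follows. YOLO-format labels store each box as a normalised centre and size, so a resize that stretches the whole image leaves the label values unchanged. The sketch assumes a stretch resize (no padding or cropping); the function names and the example box are hypothetical.

```python
def to_yolo(box, img_w, img_h):
    """Pixel box (x_min, y_min, x_max, y_max) -> normalised (xc, yc, w, h)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2 / img_w,
            (y_min + y_max) / 2 / img_h,
            (x_max - x_min) / img_w,
            (y_max - y_min) / img_h)


def stretch_box(box, img_w, img_h, target=416):
    """Scale a pixel box the same way a stretch resize scales the image."""
    sx, sy = target / img_w, target / img_h
    x_min, y_min, x_max, y_max = box
    return (x_min * sx, y_min * sy, x_max * sx, y_max * sy)


# Hypothetical annotation on an 800x600 photo.
original_box = (100, 50, 300, 250)
label_before = to_yolo(original_box, 800, 600)
label_after = to_yolo(stretch_box(original_box, 800, 600), 416, 416)
print(label_before)
print(label_after)  # same values up to floating-point rounding
```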

D. Data augmentation
In deep learning, a key way to improve model accuracy is to increase the number of samples so that the system is trained effectively. At the early stage of this process, the dataset contains only 300 images for each class. This is considered a small dataset and may lead to low accuracy at the end of the training process. An augmentation process was applied to increase the number of samples and overcome this problem.

Fig. 4. Augmentation Process Selected in Roboflow
The augmentation process was also carried out on the Roboflow.ai website, which provides a function for auto-generating augmented images. Fig. 4 shows the augmentation techniques selected to increase the number of samples. Random rotation helps the model detect the object even if the images are not precisely aligned, and the same goes for bounding box rotation. Combining the two helps the model stay robust to camera roll in real-time usage. Grayscale, saturation, and exposure adjustments increase the color variety of the images so that, in real-time testing, the model has learned to detect the object under different lighting. Adding noise to the images can help prevent overfitting and guard against adversarial attacks. Fig. 5 shows the augmented images auto-generated after selecting these techniques.
After the preprocessing and augmentation methods are applied, the dataset is auto-generated, and random images are selected for the training process. The dataset expands to three times the initial number of images. At the end of this process, the dataset increases from 6900 images to 16580 images. The dataset is separated into 88% (15000 images) for training, 8% (1400 images) for validation, and 4% (699 images) for testing. These training, validation, and test sets are the final subsets used to evaluate this project's performance.
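The colour and noise augmentations listed above were generated by Roboflow in this work. To illustrate what they do to an image, the sketch below applies grayscale conversion, an exposure change, and salt-and-pepper noise to a tiny image stored as nested lists of RGB tuples. The functions, the BT.601 luma weights, and the noise model are illustrative assumptions and do not reproduce Roboflow's own implementation.

```python
import random


def to_grayscale(pixels):
    """Replace each RGB pixel by its luma (ITU-R BT.601 weights)."""
    return [[(round(0.299 * r + 0.587 * g + 0.114 * b),) * 3
             for r, g, b in row] for row in pixels]


def adjust_exposure(pixels, factor):
    """Scale brightness by `factor`, clamping each channel to 0..255."""
    return [[tuple(min(255, max(0, round(c * factor))) for c in px)
             for px in row] for row in pixels]


def add_noise(pixels, fraction, rng):
    """Salt-and-pepper noise: turn a fraction of pixels black or white."""
    out = [list(row) for row in pixels]
    for row in out:
        for j in range(len(row)):
            if rng.random() < fraction:
                row[j] = (0, 0, 0) if rng.random() < 0.5 else (255, 255, 255)
    return out


# Hypothetical 2x2 image with a single colour.
image = [[(100, 150, 200)] * 2 for _ in range(2)]
rng = random.Random(0)  # fixed seed so the sketch is reproducible

print(to_grayscale(image)[0][0])
print(adjust_exposure(image, 1.5)[0][0])
print(add_noise(image, 0.05, rng))
```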

1) Google Colab notebook
Google Colab provides free access to a Graphics Processing Unit (GPU). Users select the hardware accelerator for the runtime: a Tensor Processing Unit (TPU), a GPU, or None. The upgraded version, Google Colab Pro, assigns a GPU at random, either a Tesla T4 or a Tesla P100.
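Because the assigned GPU varies between sessions, it can be useful to check which accelerator the runtime actually received before training. The following is a small sketch using PyTorch's CUDA utilities; the helper name is hypothetical, and the fallback messages are only there so the snippet also runs where PyTorch or a GPU is absent.

```python
def describe_accelerator():
    """Return a short description of the accelerator visible to PyTorch."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if torch.cuda.is_available():
        # Device 0 is the GPU assigned to the Colab runtime.
        return f"GPU: {torch.cuda.get_device_name(0)}"
    return "No CUDA GPU visible; running on CPU"


print(describe_accelerator())
```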

2) Training YOLOv5 model
These are the steps involved in training the YOLOv5 model:

a) Installing the YOLOv5 environment
The YOLOv5 repository provided by Ultralytics on GitHub, which includes pre-trained models, was used to train on the dataset and supplies the list of library dependencies. The required libraries must be installed before the model can be trained with PyTorch. This step involves cloning the YOLOv5 repository and then installing the library dependencies. Fig. 7 shows an example of the cloning and installation process.
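The cloning and installation step can be sketched as two commands: cloning the Ultralytics repository and installing its `requirements.txt`. The helper below only builds and, if called, runs those commands; the helper names and the target directory are assumptions, and in a Colab notebook the same commands are typically entered directly in a cell.

```python
import subprocess
import sys

YOLOV5_REPO = "https://github.com/ultralytics/yolov5"


def setup_commands(target_dir="yolov5"):
    """Return the two commands that clone YOLOv5 and install its dependencies."""
    return [
        ["git", "clone", YOLOV5_REPO, target_dir],
        [sys.executable, "-m", "pip", "install", "-r",
         f"{target_dir}/requirements.txt"],
    ]


def run_setup(target_dir="yolov5"):
    """Execute the setup commands; needs git and network access."""
    for command in setup_commands(target_dir):
        subprocess.run(command, check=True)


# Print the commands instead of running them, so the sketch has no side effects.
for command in setup_commands():
    print(" ".join(command))
```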

b) Download custom object detection in YOLOv5 format from Roboflow
After the augmented samples were generated on Roboflow, a download link for the 'YOLOv5 PyTorch' export format was copied and used to import the augmented dataset into the Google Colab notebook before training. It should be noted that the Ultralytics implementation uses a YAML file that specifies the location of the training and test data.
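As an illustration of the dataset YAML file mentioned above, the sketch below writes the four keys that the Ultralytics YOLOv5 data configuration uses: `train`, `val`, `nc`, and `names`. The directory paths and the placeholder class names are assumptions; the real file lists the 23 class names from TABLE 1.

```python
def build_data_yaml(train_dir, val_dir, class_names):
    """Return the text of a YOLOv5 dataset YAML file."""
    names = ", ".join(f"'{name}'" for name in class_names)
    return (
        f"train: {train_dir}\n"
        f"val: {val_dir}\n"
        f"nc: {len(class_names)}\n"
        f"names: [{names}]\n"
    )


# Placeholder names only; the actual class names are those in TABLE 1.
class_names = [f"class_{i:02d}" for i in range(23)]
yaml_text = build_data_yaml("../train/images", "../valid/images", class_names)
print(yaml_text)
```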

c) Define YOLOv5 Model Configuration and Architecture
Then, the YOLOv5 model configuration was defined by creating a YAML script that specifies the parameters for the YOLOv5 model, such as the number of classes, anchors, and backbone layers.
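A common way to create such a model configuration is to copy the stock YOLOv5s YAML and change only the number of classes. The sketch below does this on a short excerpt; the excerpt reproduces the first lines of the stock file as an assumption, with the anchors and layer definitions omitted, and the helper name is hypothetical.

```python
import re

# Excerpt in the style of the stock yolov5s.yaml (anchors/backbone/head omitted).
SAMPLE_MODEL_YAML = """\
nc: 80  # number of classes
depth_multiple: 0.33  # model depth multiple
width_multiple: 0.50  # layer channel multiple
"""


def set_num_classes(yaml_text, num_classes):
    """Rewrite the top-level `nc:` entry, leaving the rest of the file untouched."""
    new_text, count = re.subn(r"(?m)^nc:\s*\d+", f"nc: {num_classes}", yaml_text)
    if count != 1:
        raise ValueError("expected exactly one top-level 'nc:' entry")
    return new_text


custom_yaml = set_num_classes(SAMPLE_MODEL_YAML, 23)
print(custom_yaml)
```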

d) Training Custom YOLOv5 Detector
The training process starts once all the previous steps have been completed. This work trains the YOLOv5 model for 100 epochs with a batch size of 16 and an input image size of 416. The training process takes around two to three hours to complete.
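With the settings above, a training run with the repository's `train.py` script could be launched roughly as follows. The sketch only assembles the command; the file names `data.yaml` and `yolov5s.pt` are assumptions, and the command is meant to be run from inside the cloned repository.

```python
def train_command(data_yaml, weights="yolov5s.pt",
                  img_size=416, batch_size=16, epochs=100):
    """Build the argument list for YOLOv5's train.py with the paper's settings."""
    return [
        "python", "train.py",
        "--img", str(img_size),          # input image size
        "--batch-size", str(batch_size),
        "--epochs", str(epochs),
        "--data", data_yaml,             # dataset YAML (paths, nc, names)
        "--weights", weights,            # pretrained checkpoint to start from
    ]


print(" ".join(train_command("data.yaml")))
```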

3) Evaluate the performance
Once the training process has been completed, the trained model's performance is evaluated on the test images to check whether it reaches 90% or above. If not, the training process is repeated after tuning the number of epochs and the hyperparameters. The test images and videos used in this process were never seen during training. The training performance is evaluated through plotted graphs of precision, recall, and mean average precision (mAP), together with the time taken to finish the training process. The test images and video are used to verify the model's performance in detecting the diseases of solanaceous crops. The trained model can be used for real-time detection by exporting the trained weights of the network. The weights file can be kept in Google Drive for future use and deployed to real-world devices such as webcams, Raspberry Pi, Jetson Nano, mobile phones, and other supported devices.
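The evaluation metrics named above can be illustrated with a short sketch. Precision and recall follow from counts of true positives, false positives, and false negatives; average precision (AP) for one class is the area under the precision-recall curve obtained by sweeping the confidence threshold, and mAP is the mean of the per-class AP values. The sketch uses a generic all-point interpolation and toy numbers; it is not the exact routine used by the YOLOv5 code.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(detections, num_ground_truth):
    """AP for one class from (confidence, is_true_positive) pairs."""
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in detections:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Precision envelope: make the curve non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """mAP is the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# Toy example: three detections for a class with two ground-truth objects.
toy = [(0.9, True), (0.8, False), (0.7, True)]
print(precision_recall(tp=8, fp=2, fn=2))
print(average_precision(toy, num_ground_truth=2))
```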

IV. RESULT AND ANALYSIS

1) Model Comparison
The training was performed on 16580 images using 100 epochs and a batch size of 16 for both YOLOv5 and Scaled-YOLOv4, and the performance of YOLOv5 was compared with that of Scaled-YOLOv4. TABLE II compares the pretrained models used in this project. Ultralytics supports numerous YOLOv5 architectures, known as P5 models, which differ primarily in size: YOLOv5n (nano), YOLOv5s (small), YOLOv5m (medium), YOLOv5l (large), and YOLOv5x (extra-large). This project uses the YOLOv5s pretrained model for object detection training because it is among the fastest of the P5 models [60]. Scaled-YOLOv4 was chosen for comparison because, within the YOLOv4 family, it obtained record-breaking performance on the COCO benchmark [61]. YOLOv5s was compared to Scaled-YOLOv4 to demonstrate the proposed model's effectiveness and robustness in detecting diseases of solanaceous crops. The backbone of YOLOv5s is CSPDarknet, and the neck uses PANet. Scaled-YOLOv4 uses CSPDarknet53 as its backbone and PAN+SPP for the neck. YOLOv5s has 283 layers and 7.2 million parameters, while Scaled-YOLOv4 has 334 layers and 53 million parameters. Both models were implemented using the PyTorch framework.

2) Performance Evaluation
The plotted graphs illustrate the time each YOLO model took to complete the training process and the metrics of each method for comparison. TABLE III displays each algorithm's training time, precision, recall, and mAP@0.5. Fig. 8 and Fig. 9 show the combined graphs of precision, recall, mAP@0.5, and mAP@0.5:0.95 for the YOLOv5 model and the Scaled-YOLOv4 model. The mean average precision (mAP) acts as an accuracy measure. It computes a score by comparing the detected bounding box to the ground-truth bounding box; the greater the value, the more accurate the model's detections. The average precision of each category is calculated over all images, and then all categories are averaged. Intersection over Union (IoU) measures how much the estimated boundary overlaps with the actual boundary. mAP@0.5 indicates that the IoU threshold is set to 0.5, and mAP@0.5:0.95 is the average mAP over IoU thresholds from 0.5 to 0.95 in steps of 0.05.
Based on TABLE III, the YOLOv5 model performs better than Scaled-YOLOv4 in terms of accuracy, execution time, and model size. Fig. 10, Fig. 11, Fig. 12, and Fig. 13 show detailed comparison graphs of precision, recall, mAP@0.5, and mAP@0.5:0.95 collected from the training process of YOLOv5 and Scaled-YOLOv4. These graphs show that YOLOv5 has a slightly better result and is much faster than Scaled-YOLOv4, as shown in TABLE III. TABLE IV shows the performance achieved by other research works utilizing crop datasets for robot vision. As shown in the table, the proposed approach achieved a slightly better mAP than the other approaches. Although the table is not a fair benchmark because different datasets and hardware were used, the result shows that the proposed approach is promising.
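The IoU and mAP quantities used in this subsection can be written compactly as follows. The notation is an assumption introduced here for illustration: $B_p$ is the predicted box, $B_{gt}$ the ground-truth box, $\mathrm{AP}_c(t)$ the average precision of class $c$ at IoU threshold $t$, and $N$ the number of classes (23 in this study).

```latex
\begin{align}
\mathrm{IoU}(B_p, B_{gt}) &= \frac{\lvert B_p \cap B_{gt} \rvert}{\lvert B_p \cup B_{gt} \rvert} \\
\mathrm{mAP}@t &= \frac{1}{N} \sum_{c=1}^{N} \mathrm{AP}_c(t) \\
\mathrm{mAP}@0.5{:}0.95 &= \frac{1}{10} \sum_{t \in \{0.50,\, 0.55,\, \ldots,\, 0.95\}} \mathrm{mAP}@t
\end{align}
```

The factor of 1/10 reflects the ten thresholds obtained by stepping from 0.50 to 0.95 in increments of 0.05.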

3) Video Testing
Video testing was performed to evaluate the detection accuracy of the two YOLO models. In Fig. 14 and Fig. 15, the video has been paused to show the comparison between the two models. The YOLOv5 model localized every common potato scab in the frame with a bounding box, while the Scaled-YOLOv4 model could not detect all of them. Therefore, YOLOv5 detects more accurately and faster than Scaled-YOLOv4, which makes it more suitable for real-time object detection. Fig. 16 shows some of the test images from the healthy and diseased classes of solanaceous crops detected by the YOLOv5 model used in this research.

5) Results Discussion
From the results shown previously, the trained model achieved a detection accuracy of around 94.2%. However, some bounding boxes are too big for the disease area. In addition, the full label and prediction text does not always appear in the image because the class names were set too long. Therefore, annotation and labeling must be done carefully, using shorter class names.
This study presents the process details and suggested solutions for the problems that arose during the development of crop disease detection for robot vision. An efficient object detector is required to ensure that the robot's vision mimics the ability of human sight. For that purpose, Scaled-YOLOv4 and YOLOv5 were tested in this study. The simulation work was carried out in a Google Colab notebook using the PyTorch framework. The Roboflow.ai website aided in creating the custom dataset by providing annotation, labeling, preprocessing, and data augmentation functions. It also helps in exporting the dataset in the format required for the training process. Performance was evaluated by training on the dataset of 16580 images for 100 epochs with a batch size of 16, and the results show that the mean average precision of the YOLOv5 model is 94.2%, which is better than that of Scaled-YOLOv4. YOLOv5 also showed better performance in training time and video testing. The outcomes demonstrate the potential of YOLOv5 as a robot vision algorithm.
In the future, the designed model can be deployed on a real-world device by converting the trained weights of the model's network into a format suitable for an embedded device, such as a mobile phone. After deployment, this model can assist modern farmers with automatic crop disease detection at any time and place. Future work should concentrate on detecting diseases in various crop parts, tracking disease progression, and suggesting information to prevent the diseases.