Oil Palm USB (Unstripped Bunch) Detector Trained on Synthetic Images Generated by PGGAN

— Identifying Unstripped Bunches (USB) is a pivotal challenge in palm oil production, contributing to reduced mill efficiency. Existing manual detection methods are time-consuming and prone to inaccuracies. Therefore, we propose a solution harnessing computer vision technology. Specifically, we leverage Faster R-CNN (Region-based Convolutional Neural Network), a robust object detection algorithm, and complement it with a Progressive Growing Generative Adversarial Network (PGGAN) for synthetic image generation. A scarcity of authentic USB images may hinder the application of Faster R-CNN; herein, PGGAN is assumed to be pivotal in generating synthetic images of Empty Fruit Bunches (EFB) and USB. Our approach pairs synthetic images with authentic ones to train the Faster R-CNN, with the VGG16 feature extractor serving as the architectural backbone. According to our experimental results, USB detectors trained solely with authentic images achieved an accuracy of 77.1%, which highlights the potential of this methodology. Employing solely synthetic images led to a slightly reduced accuracy of 75.3%. Strikingly, fusing authentic and synthetic images in a balanced 1:1 ratio raised accuracy to 87.9%, a 10.1% improvement. This result underscores the potential of synthetic data augmentation in refining detection systems. By combining authentic and synthetic data, we achieve a level of accuracy in USB detection that was previously unattainable. This contribution holds significant implications for the industry and motivates further exploration of advanced data synthesis techniques and refined detection models.


INTRODUCTION
USB (Unstripped Bunch) is a cause of losses in palm oil production [1]. The average loss through USB in Malaysia is reported to be 0.05% [2]. In Indonesia, another significant palm oil producer, the loss due to USB is 2.2% [3]. If USB is not appropriately managed, losses may reach up to 40% in certain circumstances [4]. Unfortunately, USB monitoring is still performed manually today, a practice that needs to change [2]. Faster R-CNN, a reasonably accurate object detector, may be used to address this issue; however, the limited availability of USB images restricts its use.

II. USB (UNSTRIPPED BUNCH) AND EFB (EMPTY FRUIT BUNCH)
Unstripped Bunch (USB) and Empty Fruit Bunch (EFB) are two terms used to describe processed oil palm fruit bunches that exit the thresher in a palm oil mill, where a USB still contains oil palm fruitlets. Fig. 1 shows a USB image, while Fig. 2 depicts an EFB. The threshing process produces both USB and EFB in a palm oil mill. Adzmi et al. (2012) classified USBs as empty bunches with more than 20 fruitlets attached [5]. According to other research, a USB is an oil palm fruit bunch that retains at least 30% of its fruitlets [6].

III. SYNTHETIC DATASET AND PGGAN
A. Dataset for Specific Applications
Deep learning (DL) and other advanced machine learning models have been demonstrated to require large-scale datasets [7]. As a general rule of thumb, thousands of images per category are required to train DL models to attain human-level performance [8]. Given the benefits of data size, the computer vision community has developed several extensive image datasets with millions of labeled images, including ImageNet [9] and Microsoft Common Objects in Context (COCO). In specialized applications such as the agricultural product processing industry, including palm oil mills, there are currently relatively few publicly accessible image datasets containing hundreds or even thousands of images per category, comparable to the computer vision datasets discussed above. Leveraging modern AI capabilities, particularly DL techniques, in agriculture is severely inhibited by the lack of large-scale image datasets related to agricultural or processing activities. Data/image augmentation, which increases the size and variety of datasets, is a popular strategy to overcome the limitations of physically acquired data [13].
Occasionally, models may be unsuitable for a particular job because of the characteristics of the data they were trained on. Training a new model is then a challenging endeavor, and an adequate dataset is necessary for the task. If the required dataset is not accessible, it becomes imperative to generate one that aligns with the issue or domain of interest. The ability of GANs to produce synthetic data is an opportunity to overcome this problem, so many researchers use GANs to build synthetic datasets for training object detectors [53], [54], [55], [56], [57], [58], [59], [60], [61].
The implementation of GANs to build synthetic datasets for agricultural product detection has been carried out by several researchers. GANs have been used to generate realistic images to train deep learning models, improving fruit recognition performance [62], [63], [64], [65], [66], [67]. The GAN architectures used include CycleGAN [68], [69], [70], Boundary Equilibrium GAN [71], and Deep Convolutional GAN [72]. The GAN models used are not limited to those mentioned but continue to develop, which shows the potential of GANs to improve object detection performance.
Several researchers have also reported improvements in object detectors trained with synthetic datasets. Fei et al. (2021) reported that a fruit detection model developed with YOLOv3 and enhanced with GAN-synthesized images for the day domain yielded 37.2 mAP (mean Average Precision) instead of 37.0 mAP [64]. GAN-generated synthetic data has also been reported to improve Faster R-CNN: with GAN-augmented data, a VGG16-based Faster R-CNN yielded a best average accuracy of 90%, which is 28% higher than the accuracy of VGG16 without data augmentation [73].
In their study, Yuwana et al. (2020) investigated GANs using a multilayer perceptron architecture as both the generator and the discriminator. Their research aimed to facilitate disease detection by synthesizing tea leaf images belonging to four distinct classes: healthy leaves and three different types of diseased leaves. The classification accuracy with GAN and DCGAN (Deep Convolutional GAN) augmentation was 88.84% and 88.86%, respectively, when using 1000 synthetic images per class. This indicates an enhancement of around 2.5% compared to the baseline model without image augmentation [38]. Gomaa and El-Latif (2021) achieved a recognition accuracy of 97.9% by using DCGAN to generate synthetic images for the early identification of tomato plants infected with the tomato mosaic virus. This accuracy was obtained using augmented data, an improvement of around 1% compared to the accuracy achieved without augmentation [39].
GAN training has a high probability of experiencing mode collapse. Mode collapse occurs when the generator is unable to capture and represent the entire spectrum of potential outputs, producing a limited and repetitive set of outputs that deviates from the variety seen in the training data. When the discriminator becomes too proficient, it rejects nearly all synthetic data from the generator, leaving the generator with little useful feedback to learn from. As a result, the generator struggles to produce varied outputs and tends to get stuck in repeating patterns, limiting the variance of the resulting samples. This duplicate-image problem can be addressed by modifying the GAN training method.
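As a toy illustration of mode collapse (not taken from the paper), the sketch below counts how many of a training set's modes a generator's outputs actually cover; a collapsed generator covers only a small fraction. The `mode_coverage` helper and the sample data are hypothetical.

```python
def mode_coverage(generated_samples, training_modes):
    """Fraction of training modes that appear at least once among the
    generated samples. A value well below 1.0 suggests mode collapse."""
    seen = set(generated_samples)
    covered = sum(1 for m in training_modes if m in seen)
    return covered / len(training_modes)

# Toy example: the training data spans 8 modes, but a collapsed
# generator keeps emitting only 2 of them.
training_modes = list(range(8))
collapsed_outputs = [0, 3, 0, 3, 0, 3, 0, 3]
healthy_outputs = [0, 1, 2, 3, 4, 5, 6, 7]

print(mode_coverage(collapsed_outputs, training_modes))  # 0.25
print(mode_coverage(healthy_outputs, training_modes))    # 1.0
```

In a real GAN, this idea appears in diversity metrics computed over generated image batches rather than over discrete mode labels.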

C. PGGAN
As a modified GAN, Progressively Growing GAN (PGGAN) [74] can solve the problem of duplicate samples [75] using a multi-stage training method. PGGAN utilizes the concept of a progressive neural network first proposed by Rusu et al. in 2016 [76]. PGGAN grows the generator and discriminator networks synchronously, training the model gradually from low-resolution (4×4 pixels) to high-resolution images by adding network layers. This strategy substantially increases training speed (two to six times faster) and improves image stability at large resolutions. As a result of the incremental expansion of the convolutional layers, the generator and discriminator can effectively learn coarse-scale details first and then fine-scale details as training progresses.
The PGGAN is a GAN trained at multiple resolution stages; at each stage, the generator and discriminator gain a new convolutional layer at a higher resolution. A fundamental GAN comprises two neural network models: a generative model G (generator) that learns the distribution of the training data and a discriminative model D (discriminator) that learns to classify whether samples come from the training data distribution. The generator takes a random noise vector z as input and outputs synthetic data G(z); the discriminator D takes x or G(z) as input and outputs a probability D(x) or D(G(z)) indicating whether the input is synthetic or drawn from the authentic data distribution, as shown in Fig. 3. The generator and the discriminator are trained simultaneously using stochastic gradient descent (SGD), and their training can be viewed as a two-player minimax game with the objective function in equation (1).

min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]   (1)
The discriminator attempts to maximize V(D, G) (the probability D(x)), whereas the generator attempts to minimize it. In other words, the discriminator distinguishes between images x drawn from p_data(x) (the distribution of authentic data) and samples drawn from the noise distribution p_z(z). Conversely, the generator produces samples intended to deceive the discriminator.
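The minimax value function above can be estimated over finite samples. The sketch below is an illustrative pure-Python implementation (the function names are ours, not from the paper):

```python
import math

def value_fn(d, real_xs, g, zs):
    """Estimate the GAN value function V(D, G) over finite samples:
    mean(log D(x)) over real data plus mean(log(1 - D(G(z)))) over noise."""
    term_real = sum(math.log(d(x)) for x in real_xs) / len(real_xs)
    term_fake = sum(math.log(1.0 - d(g(z))) for z in zs) / len(zs)
    return term_real + term_fake

# A discriminator that outputs 0.5 everywhere (maximally uncertain)
# yields V = log(0.5) + log(0.5) regardless of the generator.
d = lambda x: 0.5
g = lambda z: z
v = value_fn(d, [1.0, 2.0, 3.0], g, [0.1, 0.2])
print(abs(v - 2 * math.log(0.5)) < 1e-12)  # True
```

At this equilibrium point, neither term rewards the discriminator, which is the intuition behind the minimax game converging when D cannot tell real from synthetic samples.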
In PGGAN's generator and discriminator, the network is predominantly composed of upsample and downsample blocks. The backbone structure of the upsample block in the generator network is UpSampling2D-Conv2D-LeakyReLU-Conv2D-LeakyReLU, as shown in Fig. 4.
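A minimal sketch of how tensor shapes evolve through such an upsample block, assuming 'same'-padded 3×3 convolutions (the channel counts below are illustrative assumptions; actual PGGAN channel widths vary per stage):

```python
def upsample_block_shape(h, w, c_out):
    """Shape of an (H, W, C) feature map after the generator upsample block:
    UpSampling2D -> Conv2D(3x3, 'same') -> LeakyReLU -> Conv2D(3x3, 'same') -> LeakyReLU."""
    h, w = 2 * h, 2 * w   # UpSampling2D doubles the spatial dimensions
    # Both 3x3 'same'-padded convolutions preserve (h, w); the first sets
    # the channel count to c_out, and LeakyReLU is shape-preserving.
    return (h, w, c_out)

# Growing a 4x4 feature map toward 512x512 by stacking upsample blocks:
shape = (4, 4, 512)
while shape[0] < 512:
    shape = upsample_block_shape(shape[0], shape[1], shape[2])
print(shape)  # (512, 512, 512)
```

The downsample block in the discriminator mirrors this, halving the spatial dimensions per stage instead of doubling them.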

D. Faster R-CNN
Faster R-CNN, introduced in 2015 by Shaoqing Ren et al., is one of the most well-known object detection architectures employing convolutional neural networks, alongside YOLO (You Only Look Once) [77] and SSD (Single Shot Detector) [78]. Apart from YOLO, Faster R-CNN is one of the best-performing object detectors in terms of accuracy [79], [80]. Its strong performance has led to growing application of Faster R-CNN to the detection of specialized objects, such as agricultural products [81], [82], [83], [84], [85].
Faster R-CNN is a two-stage object detector: the first stage is a Region Proposal Network (RPN), and the second is the classifier.
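The RPN decides which anchors are treated as objects or background based on their overlap with ground-truth boxes. A minimal sketch of the standard Intersection-over-Union computation follows (the thresholds in the comment are the common Faster R-CNN defaults, not values stated in this paper):

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2).
    An RPN typically labels anchors by thresholding IoU against
    ground-truth boxes (commonly >0.7 positive, <0.3 negative)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (clamped to zero if the boxes do not overlap)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))  # 1.0
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 50/150 = 0.333...
```

The second-stage classifier then refines and labels the proposals the RPN keeps.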

IV. METHODS
A comprehensive experimental approach was designed to evaluate the potential of synthetic image generation using PGGAN, especially for improving the performance of Unstripped Bunch (USB) detection. The initial experiment was carried out by training PGGAN with authentic images. Authentic images were taken from a surveillance camera installed on a USB conveyor inside a palm oil mill. The total number of authentic images for the PGGAN training dataset was 800, consisting of 400 USB and 400 EFB images. Images used for training were selected randomly.
The trained PGGAN was then used to produce 1,000 synthetic images, consisting of 500 EFB and 500 USB images. These synthetic images were used to train USB detectors with VGG16 as the feature extractor. VGG16 was selected because its small convolution filters reduce the tendency of the network to overfit. Moreover, VGG16 is the smallest model that still allows the spatial characteristics of an image to be understood [86].

A. PGGAN Training
Progressive Growing GAN is an extension of the GAN training procedure that involves training the GAN to produce very small images, such as 4×4 pixels, and then progressively increasing the resolution scale of the resulting images to 8×8, 16×16, or larger, as desired.This allows progressive GAN to generate synthetic images with a 1024-by-1024-pixel resolution.
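The doubling schedule described above can be sketched as follows (a hypothetical helper, mirroring the 4×4 start used by progressive growing and the 512×512 cap applied in this research):

```python
def progressive_schedule(start=4, target=512):
    """Resolutions visited by progressive growing: begin at start x start
    and double each stage until target x target is reached."""
    res, stages = start, []
    while res <= target:
        stages.append((res, res))
        res *= 2
    return stages

print(progressive_schedule())
# [(4, 4), (8, 8), (16, 16), (32, 32), (64, 64), (128, 128), (256, 256), (512, 512)]
```

Each tuple corresponds to one training stage in which a new, higher-resolution layer pair is faded into the generator and discriminator.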
However, in this research, the resolution of the images used for USB detector training was limited to 512×512 pixels. This limit was chosen because the largest side of input images in Faster R-CNN was set at 600 pixels.
As seen in Fig. 7, PGGAN was trained in stages. In the first stage, it was trained with 4×4 images, and the weights generated at this size were used to train at a larger resolution in the subsequent stage. Since our study limited the resolution to 512×512, the process stopped when a 512×512 image resolution was achieved. The training parameters are shown in Table I. The Adam optimizer was chosen because it is relatively insensitive to training hyperparameters such as learning rate and momentum, which made the resulting PGGAN model more stable. The He normal initializer was selected as the kernel initializer to prevent the gradient from vanishing or exploding and to improve network stability and convergence during training [87]. A linear activation function was used for the output layer to produce a continuous output. A learning rate of 0.001 was chosen to speed up the training process. The total number of images used to train PGGAN was 1000, comprising two datasets: the USB and EFB datasets. Each dataset contained 500 related authentic images. Table II shows samples of the EFB and USB images used to train PGGAN.
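Adam maintains running estimates of the gradient's first and second moments. A minimal single-parameter sketch of one Adam update, using the paper's learning rate of 0.001 (the helper is illustrative, not the actual training code):

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter. m and v are the
    running first/second moment estimates; t is the step index
    (starting at 1). Returns the updated (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)      # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# First step with gradient 2.0: bias-corrected m_hat = 2.0 and v_hat = 4.0,
# so the parameter moves by almost exactly the learning rate.
theta, m, v = adam_step(0.0, 2.0, 0.0, 0.0, t=1, lr=0.001)
print(round(-theta, 6))  # 0.001
```

The near-constant step size regardless of gradient magnitude is why Adam is relatively insensitive to hyperparameter choice, as noted above.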

B. USB Detector Training
The USB detector was trained using images generated by PGGAN (synthetic images), authentic images, and a merged dataset of both. The training for each dataset was carried out separately, and the performance of the resulting USB detectors was compared. The USB detector training parameters were set to the same values for each dataset, as listed in Table III.

In this research, a similarity test was also carried out between the synthetic and authentic images. The parameters used for the image similarity test were PSNR, SSIM, and VIF. The PSNR and VIF evaluate whether two images are the same, while the SSIM tests whether they are structurally similar. In the test, the PSNR value was expected to be below 20 dB [44] and the VIF score to approach zero, meaning the two images were different [45]. In contrast, the SSIM value was expected to be above 50% (the recognition confidence threshold for SSIM is 0.580 [46]), meaning that both images shared the same structure: the USB and EFB images produced by PGGAN would still represent an oil palm bunch while maintaining the distinctive characteristics that differentiate USB from EFB.
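Of the three similarity metrics, PSNR is the simplest to compute. A minimal sketch for grayscale images given as flat pixel lists (illustrative, not the evaluation code used in this study):

```python
import math

def psnr(img_a, img_b, max_val=255.0):
    """Peak Signal-to-Noise Ratio between two equal-sized grayscale
    images (flat lists of pixel values). Lower PSNR means the images
    differ more; values below 20 dB are treated here as 'different'."""
    n = len(img_a)
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / n
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

a = [10, 20, 30, 40]
b = [110, 120, 130, 140]   # every pixel off by 100 -> MSE = 10000
print(psnr(a, b))  # about 8.13 dB, well below the 20 dB threshold
```

SSIM and VIF are considerably more involved (local statistics and information-theoretic models, respectively) and are usually taken from an image-processing library rather than implemented by hand.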

V. RESULTS AND DISCUSSION
A. Training Performance of the USB Detectors
The trained PGGAN model was used to generate 1000 images. Table IV depicts sample USB and EFB PGGAN-generated images. PGGAN was assumed to be capable of producing synthetic images that differ from the authentic images; therefore, an image similarity test was carried out to verify this assumption. The synthetic images were compared with the authentic images used to train PGGAN. Several test parameters, including SSIM (Structural Similarity Index Measure), PSNR (Peak Signal-to-Noise Ratio), and VIF (Visual Information Fidelity), were used to evaluate the similarity of these images.
Table V presents similarity test results for the sample images. The average PSNR was 13.5377 dB (under 20 dB), while the average VIF was 0.0325 (close to zero). These results suggest noticeable differences between the generated USB and EFB images and the authentic images. The average SSIM was 0.7137 (71.37%, above 50%), indicating that the images exhibited structural similarity despite their inherent disparities. Despite their distinctiveness, the generated images presented the defining features of a USB, namely attached oil palm fruitlets, and of an EFB, namely the absence of fruitlets. Consequently, the synthetic images portrayed oil palm bunches effectively.

As seen in Fig. 8 to Fig. 13, all the datasets used to train the USB detector produced convergent models. The USB detector trained on the synthetic dataset displayed better training performance, with a total loss of 0.337 compared with 0.492 for training on the authentic dataset. The same trend was demonstrated by the USB detector trained with the merged dataset, whose results were also noticeably better than those obtained with the authentic dataset. A summary of the training performance parameters of USB detectors trained on the different datasets is presented in Table VI.
Table VI also reveals that the use of synthetic images could increase RPN layer accuracy (92.6% for training with the synthetic dataset and 92.1% with the merged dataset) compared to using only authentic images (89.9%).
Regarding RPN classification loss, model training using the synthetic and merged datasets gave smaller values (0.092 and 0.078, respectively) than using the authentic dataset (0.115). The trend was similar for RPN regression loss: the regression losses of the synthetic and merged datasets were smaller (0.019 and 0.023, respectively) than that of the authentic dataset (0.249).
As with the losses in the RPN network, the losses during classifier network training showed a similar trend. The classification and regression losses were smaller when the USB detector was trained using synthetic images than when trained with only authentic images. As a result, the total loss for the proposed method was smaller (0.336 and 0.348 for the synthetic and merged datasets, respectively) than for conventional training with only authentic images (0.492).

B. Validation Test
Validation tests were performed to verify the functionality of the USB detector. Fig. 14 illustrates the detection results of each USB detector trained using the synthetic dataset (Fig. 14(a)), the merged dataset (Fig. 14(b)), and the authentic dataset (Fig. 14(c)). The image in Fig. 14 actually depicts an EFB. Although the USB detector trained using the synthetic dataset failed to recognize the EFB (Fig. 14(a)), combining synthetic and authentic images in a 1:1 ratio to form the merged dataset improved the performance of the USB detector so that it could detect the EFB (Fig. 14(b)). In fact, it also corrected an erroneous detection by the USB detector trained with authentic images, as shown in Fig. 14(c). Fig. 15 shows the results of the USB detector validation test. The validation test results revealed that the performance of USB detectors trained on the authentic and synthetic datasets was nearly identical, whereas the USB detector trained with the merged dataset performed better.

VI. CONCLUSION
This paper proposes using synthetic images to train a VGG16-based Faster R-CNN to detect USB and EFB. The new data samples were obtained using PGGAN with a highest resolution of 512×512. The synthetic image-generating system using PGGAN could produce new images while maintaining the semantics of the authentic images. Increasing the number and variety of sample images affects the accuracy of USB and EFB detection. Combining authentic and synthetic datasets yielded the best detection results, increasing detection ability (mAP) by 10.1% compared to using only the authentic dataset.
Even though this research demonstrated an increase in the performance of object detectors trained with additional synthetic images, training only with synthetic images produced by PGGAN did not improve object detection performance; one reason is that PGGAN was trained on a relatively small dataset. Using a larger dataset would enable PGGAN to extract USB and EFB features more comprehensively.
This research shows promising results for further study in real applications of USB monitoring in palm oil mills, where the data obtained are videos rather than still images. Further research on generating synthetic data to train object detectors might utilize other GAN architectures, such as a super-resolution GAN, since it can generate higher-resolution images.

Fig. 5. Downsample block

Fig. 6 depicts the architecture of Faster R-CNN in general. Various bounding boxes are assigned as region proposals through the Region Proposal Network and the deep, fully connected convolutional neural networks; the results are then normalized through the ROI (Region of Interest) Pooling layer. The fully connected layers extract image features for object-class classification and perform bounding-box regression.

Fig. 6. Faster R-CNN architecture

The synthetic images produced by PGGAN were split into 800 images for training and 200 for validation. The USB detector was also trained with the dataset of authentic images, using the same number of images and training parameters as the detector trained with synthetic images. A performance comparison of USB/EFB detectors trained with the synthetic image dataset and those trained with the authentic image dataset was carried out to assess whether the use of synthetic images generated by PGGAN could improve the performance of the USB detectors.

Journal of Robotics and Control (JRC), ISSN: 2715-5072. Wahyu Sapto Aji, Oil Palm USB (Unstripped Bunch) Detector Trained on Synthetic Images Generated by PGGAN
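The 1:1 merge and 80/20 train/validation split described above can be sketched as follows (the file names and the helper are hypothetical, not the paper's actual pipeline code):

```python
import random

def build_merged_split(authentic, synthetic, train_frac=0.8, seed=42):
    """Merge authentic and synthetic image lists in a 1:1 ratio, shuffle,
    and split into training and validation sets."""
    assert len(authentic) == len(synthetic), "1:1 ratio requires equal counts"
    merged = list(authentic) + list(synthetic)
    rng = random.Random(seed)   # fixed seed keeps the split reproducible
    rng.shuffle(merged)
    cut = int(len(merged) * train_frac)
    return merged[:cut], merged[cut:]

# 500 authentic + 500 synthetic -> 800 training / 200 validation images
authentic = [f"auth_{i}.jpg" for i in range(500)]
synthetic = [f"syn_{i}.jpg" for i in range(500)]
train, val = build_merged_split(authentic, synthetic)
print(len(train), len(val))  # 800 200
```

Shuffling before the split ensures both subsets contain a mix of authentic and synthetic images rather than blocks of one type.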

Fig. 8 to Fig. 13 show the training performance of the USB detector on different training datasets.The blue lines represent the training performance of a USB detector that was trained using a synthetic dataset (which contains only PGGAN-generated images); the red lines represent the training performance of the USB detector trained using a merged dataset (which contains authentic and synthetic images with a 1:1 ratio).The yellow lines represent the training performance of the USB detector trained on an authentic dataset (which contains only authentic images).

Fig. 14. Object detection accuracy per image during the validation test

TABLE I. PGGAN FINE-TUNING TRAINING PARAMETERS

TABLE III. FINE-TUNING TRAINING PARAMETERS OF THE USB DETECTOR

TABLE IV. SAMPLES OF USB AND EFB SYNTHETIC IMAGES PRODUCED BY PGGAN AND THE REPRESENTATIVE AUTHENTIC IMAGES

TABLE V. SIMILARITY TEST RESULTS OF THE SAMPLE IMAGES

TABLE VI. TRAINING PERFORMANCE OF USB DETECTORS TRAINED WITH AUTHENTIC, SYNTHETIC, AND MERGED DATASETS

Table VII lists the validation test results of USB detectors trained using the different datasets. According to the results, the merged dataset gave the best-performing USB detector. A USB detector trained with the dataset of synthetic images gave a slightly lower mAP (75.3%) than one trained with authentic (real) images (77.1%). The accuracy was slightly lower because only some of the detailed distinctive features of USB and EFB could be captured by PGGAN. The validation test of the detector trained with the merged dataset in a 1:1 ratio gave a better mAP of 87.9%, an increase of approximately 10% over the mAP obtained when using only authentic images as training data.
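The mAP figures reported here are means of per-class average precision. One common (non-interpolated) way to compute a single class's AP from ranked detections, shown as an illustrative sketch (not the paper's evaluation code):

```python
def average_precision(detections, num_gt):
    """Average precision from (score, is_true_positive) pairs and the
    number of ground-truth objects: sum, over true positives in score
    order, of the precision at that detection, divided by num_gt."""
    detections = sorted(detections, key=lambda d: -d[0])  # high score first
    tp = fp = 0
    ap = 0.0
    for score, is_tp in detections:
        if is_tp:
            tp += 1
            ap += tp / (tp + fp)   # precision at this recall step
        else:
            fp += 1
    return ap / num_gt

# 3 ground-truth objects; detections ranked by confidence:
dets = [(0.9, True), (0.8, False), (0.7, True), (0.6, True)]
print(round(average_precision(dets, 3), 4))  # 0.8056
```

mAP is then the mean of this quantity over the object classes (here, USB and EFB); variants differ in how the precision-recall curve is interpolated.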

TABLE VII. VALIDATION TEST RESULTS OF USB DETECTORS TRAINED WITH SYNTHETIC, MERGED, AND AUTHENTIC DATASETS