Key Factors that Negatively Affect Performance of Imitation Learning for Autonomous Driving

Conditional imitation learning (CIL) has proven superior to other autonomous driving (AD) algorithms. However, its performance evaluation through physical implementations is still limited. This work contributes a systematic evaluation to identify key factors that affect its performance. It modified convolutional neural network parameter values, such as reducing the number of filter channels and neuron units, and implemented the model in a vision-based autonomous vehicle (AV). The AV has front-wheel steering with an Ackermann mechanism since it is commonly used by passenger cars. Using the inertial measurement unit (IMU), we measured the vehicle's location and yaw angle along the experimental route. The AV had to move autonomously through new road sectors in the morning, afternoon, and night. First, an overall performance evaluation was carried out. The results showed a 99% success rate from 648 evaluation experiments under different conditions, in which the 1% failure rate occurred at new intersections. Then, a turning performance evaluation was conducted to identify key factors leading to failure at new intersections. They include fast speed, dazzling light reflection, a late navigation command change instant, and an untrained turning driving pattern. The AV never failed while driving on the trained routes. It had a 100% success rate when driving slower, even under various lighting conditions and with various driving patterns, including untrained intersections. Although this study is limited to identifying key factors at three constant speeds, the results provide a foundation for future research to improve CIL performance for AD, including by incorporating multimodal fusion and multi-route networks.


I. INTRODUCTION
Significant research progress has been achieved in autonomous driving (AD) of ground vehicles [1]-[14]. Chen et al. categorized two significant paradigms for vision-based AD: mediated perception approaches, which parse an entire scene to make a driving decision, and behavior reflex approaches, which directly map an input image to a driving action by a regressor [15]. For behavior reflex approaches, an artificial neural network was designed to control an autonomous navigation test vehicle for road following [16]-[19].
A learning system that takes raw color images from forward-pointing cameras and maps them to a set of steering angles through a single trained function was termed end-to-end learning by the authors in [20], [21]. They developed a 6-layer convolutional neural network (CNN) for vision-based obstacle avoidance of an off-road 1/10 scale electric truck. In [22], a CNN framework was adopted to develop an end-to-end controller that manages a full-scale car following the lane on local roads based on image input. The authors developed a method for determining which elements in the image most influence steering decisions [23]. The framework was also used in [24] to build a low-cost modular automated guided vehicle (AGV) capable of autonomously following the lane in a specific fixed route. Furthermore, several methods have been proposed to improve the effectiveness of the end-to-end AD approach. For instance, a CNN was combined with a feedforward network with a fully connected hidden layer for lane following control of a 1/5-scale car, as presented in [25].
Most early research studies on imitation learning (IL) for AD focused on lane following and obstacle avoidance problems. Later research pushed toward urban driving with nontrivial road layouts and traffic [26]-[30]. Codevilla et al. proposed an IL method that maps camera images and incorporates high-level navigation input to control an autonomous vehicle (AV) navigating intersections, retrospectively known as conditional imitation learning (CIL) [31]. This method used a deep learning architecture for the image processing module, which consists of an 8-layer CNN and a 2-layer fully connected network (FCN). It was successfully implemented using a 1/5 scale truck in a field experiment. Sauer et al. proposed a direct perception approach, called conditional affordance learning (CAL), that maps video input to an intermediate representation and combines it with high-level directional inputs using specialized task networks to produce affordances [32]. The authors demonstrated that CAL outperformed CIL when tested on the CARLA simulator [33]. However, the work has not been proven in field experiments.
Chen et al. proposed a two-stage learning method involving a privileged agent and a purely vision-based sensorimotor agent [34]. The authors followed the prior work by Codevilla et al., in which the network branched into four heads, each producing a K-channel heatmap. It outperformed CIL and CAL when tested using the CARLA simulator. However, it also still needs to be proven in field experiments. Recently, CIL has been adopted to process multimodal inputs: RGB and depth modalities [35]. The experiments in
The authors reviewed end-to-end driving [38], and they described the standard learning methods in end-to-end driving as IL [59] and reinforcement learning (RL) [60]-[63]. In [37], the authors reviewed 17 articles published from 2017 to 2021 regarding end-to-end AD in urban environments. They compared the performances of IL and RL approaches using two benchmarks in the CARLA simulator. It was found that the most effective approach was an IL-based architecture for the CoRL2017 benchmark and an RL-based architecture for the NoCrash benchmark. Nevertheless, which approach leads remains inconclusive. Other authors in [64] surveyed IL techniques for end-to-end AD and compared 34 articles published from 2016 to 2022. Only two articles implemented the controllers in real-world experiments in urban driving: [31], [65]. Both articles rely on CIL. Field experiments are necessary since simulation cannot capture real-world complexities.
Although CIL has been implemented in field experiments and has proven superior to other algorithms, its performance evaluation is still limited [31], [65]. For example, in the work presented in [31], the authors did not distinguish clearly between training and evaluation routes in their physical implementation, nor did they systematically evaluate the effect of lighting. Moreover, they did not consider the effects of vehicle speed and navigation command change instant [26]. The authors collected data for training from human demonstrations over 30 driving hours in a densely populated urban environment where the drivers were free to choose random routes [30]. The training data is imbalanced, with the majority driving straight and a significant portion stationary. They selected two routes for testing. The testing environment contains uncontrollable time-dependent factors, e.g., weather, lighting, and road users. This randomness and these uncontrollable conditions made the evaluation less systematic and made it challenging to identify key factors deterministically.
This work provides research contributions as follows:
1) It modified the CIL algorithm by reducing the number of filter channels and neuron units. It developed an AD controller for a 1/10 scale AV that mechanically resembles a full-scale car with Ackermann front-wheel steering. The camera model is valuable for scaling up to a full-scale car.
2 3) It explores the possibilities to improve the performance of AD in urban environments by considering perception accuracy and robustness, vehicle dynamics, and real-time implementation.
The remainder of this paper is organized as follows. Section 2 presents CIL. The experimental platform is described in Section 3, including the AV and camera model. Section 4 describes the training, validation, and performance evaluation experiment scenarios. Section 5 reports the results and discussions. Finally, Section 6 presents our conclusion.

II. METHOD
This work was carried out sequentially, as illustrated in the flowchart in Fig. 1. First, we briefly overviewed the CIL. We modified it to reduce computing load while maintaining performance. Second, we developed an experimental platform that allows efficient experiments but still inherits the characteristics of a real car. Third, we established a systematic experimental performance evaluation method. Fourth, we presented experimental results and discussion. The results include evaluating overall performance and turning performance at new intersections. We investigated the effects of critical factors: road pattern, lighting, speed, and navigation command change instant. Finally, we conclude. In the following, these aspects of methodology are explained in comprehensive detail.

A. Conditional Imitation Learning
This section briefly provides an overview of imitation and conditional imitation learning. We then describe our model of conditional imitation learning used in this paper.
Imitation learning requires the assumption that a function $E$ exists that maps observations to the expert's actions: $a_i = E(o_i)$. However, when an autonomous vehicle approaches an intersection, the driver's subsequent action is explained not only by the observations but also by the driver's intention. The driver's intention is exposed to the controller by introducing an additional command input $c$ [31]. During training, the expert provided the commands, which carry information about the expert's decision-making. At test time, a driver or a navigation system can provide commands to affect the controller's behavior. The training dataset becomes $D = \{\langle o_i, c_i, a_i \rangle\}_{i=1}^{N}$. The objective of conditional imitation learning (CIL) is given by (2).
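Written out in the notation of Codevilla et al. [31], and assuming a controller $F$ with parameters $\theta$ and a per-sample loss $\ell$ (symbols that are assumptions here), the standard and command-conditional objectives usually take the following form. This is a sketch of the common formulation and may differ in detail from the authors' equations.

```latex
\begin{align}
\min_{\theta} \; & \sum_{i=1}^{N} \ell\big(F(o_i; \theta),\, a_i\big)
  && \text{(imitation learning)} \\
\min_{\theta} \; & \sum_{i=1}^{N} \ell\big(F(o_i, c_i; \theta),\, a_i\big)
  && \text{(conditional imitation learning)}
\end{align}
```

The only difference between the two lines is the extra argument $c_i$, which is what lets the same image produce different steering actions at an intersection.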
A deep artificial neural network expresses the controller $F(o_i, c_i; \theta)$. The network takes images as the input. By adopting the branched network architecture of command-conditional imitation learning from Codevilla et al., we assume a discrete set of commands, $C = \{c^0, \ldots, c^K\}$, and introduce a particular branch $A^i$ for each command $c^i$. Fig. 2 illustrates the network architecture used in this paper for command-conditional imitation learning. Each branch $A^i$ learns the sub-policy that corresponds to its navigational command. We set three modules for decision-making at an intersection that enable going straight or following the lane, turning left, and turning right. The image module is implemented as a combination of CNN and FCN, whereas the command module is an FCN.
We adapted the architecture of CIL originally proposed in [31] with modifications to the network details. The input image of our CIL model has dimensions of 100×220 pixels (height × width). We constructed the base model using eight CNN layers and two FCN layers. Batch normalization accompanies each CNN layer. The first CNN layer has a kernel size of five, and the remaining layers have a kernel size of three. The first, third, fifth, and seventh CNN layers have a stride of two, whereas the remaining layers have a stride of one. The two FCN layers each contain 128 neuron units. Then, the model branches, with each branch specializing in one navigation command. Each of the left, right, and straight branches consists of two fully connected hidden layers with 256 and 512 neuron units, respectively. We applied a rectified linear unit activation function after each hidden layer.
Some differences exist between our model and that used in [31]. First, the seventh CNN layer of their model has a stride of one. Second, they applied a dropout layer after each convolutional layer. However, applying dropout after the convolutional layers decreased model performance in our case. Therefore, we did not apply dropout to the convolutional layers. Finally, the numbers of filter channels in the convolutional layers and neuron units in the fully connected layers are much smaller in our model than in theirs.
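The kernel sizes and strides listed above determine how quickly the 100×220 input shrinks through the eight convolutional layers. The following minimal sketch traces the feature-map size layer by layer. It assumes unpadded ("valid") convolutions, which the text does not state; the function names are illustrative only.

```python
def conv_output_size(size: int, kernel: int, stride: int) -> int:
    """Output length of an unpadded ("valid") convolution along one axis."""
    return (size - kernel) // stride + 1


# (kernel, stride) of the eight convolutional layers described in the text:
# kernel 5 for the first layer, 3 afterwards; stride 2 on layers 1, 3, 5, 7.
LAYERS = [(5, 2), (3, 1), (3, 2), (3, 1), (3, 2), (3, 1), (3, 2), (3, 1)]


def feature_map_sizes(height: int = 100, width: int = 220):
    """Return the (height, width) of the input and of each layer's output."""
    sizes = [(height, width)]
    for kernel, stride in LAYERS:
        height = conv_output_size(height, kernel, stride)
        width = conv_output_size(width, kernel, stride)
        sizes.append((height, width))
    return sizes


if __name__ == "__main__":
    for index, (h, w) in enumerate(feature_map_sizes()):
        label = "input" if index == 0 else f"conv {index}"
        print(f"{label}: {h} x {w}")
```

Under the no-padding assumption, the last convolutional layer produces a 1×8 map per channel, so the flattened input to the first fully connected layer is eight times the number of output channels. With "same" padding the sizes would be larger.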
Our model assumes a constant vehicle speed, so the controller's action is only the steering angle. This assumption makes the vehicle dynamics from the steering angle to the yaw rate a linear time-invariant system [66]. Therefore, the effect of speed on performance can be analyzed systematically.

B. Experimental Platform
We built an AV by retrofitting a 1/10 scale radio-controlled car, model HG P408 US Military Vehicle (Fig. 3). It weighs 5.7 kg, has a width of 0.225 m, and has a turning radius of 1.3 m.

Fig. 3. Autonomous driving vehicle
We mounted a forward-pointing RGB camera on the front top of the vehicle. The camera has a frame resolution of 1920×1080 pixels, a frame rate of 30 fps, a field of view (FoV) (H×V) of 69° × 42°, and a sensor resolution of 2 MP. The central processing computer is an Nvidia Jetson TX2, which handles data acquisition and processing and sends control commands to the steering servo and electronic speed controller (ESC). The receiver unit receives the navigational command from the remote controller and sends it to the microcontroller via a digital input/output channel. The Jetson TX2 receives the navigational command, the camera's image, and measurement data from the inertial measurement unit (IMU) via a USB cable (Fig. 4). It sends the steering and throttle values calculated by the autonomous controller through a PCA9685 servo driver.
We installed the Jetson software package on the Jetson TX2, including Linux for Tegra, TensorRT, CUDA, cuDNN, and several computer vision libraries. TensorRT, CUDA, and cuDNN are libraries published by NVIDIA. We also installed other libraries: Intel librealsense and pyrealsense to access the Intel® RealSense™ camera, OpenCV for real-time computer vision, and TensorFlow. We use Python scripts for data collection, neural network training, neural network testing, and other purposes.
We developed a camera model to obtain parameter values related to the images, as illustrated in Fig. 5. The x-axis extends in the vehicle's forward direction, the y-axis points to the vehicle's left, and the z-axis faces upward, perpendicular to the ground. The camera's optical axis points to a certain point on the road, and the camera captures the entire area inside its FoV. $FoV_V$ denotes the camera's FoV in the vertical direction, and $h_c$ denotes the camera's height above the road surface. The pitch angle θ is the angle between the camera's optical axis and the horizontal line. We define some essential parameters as follows:
1) Longitudinal distance $d_l$ is the distance between the camera and the camera's focus point on the road, measured along the x-axis.
2) Short longitudinal distance $d_s$ is the distance between the camera and the lowest point viewed by the camera, measured along the x-axis.
3) Horizon distance $d_h$ is the distance between the camera and the object the camera views on the horizontal line.
4) Horizon height $h_h$ is the height of the highest point the camera views at the horizon distance, measured from the horizontal line along the z-axis.
The longitudinal distance $d_l$ is given by (3). The short longitudinal distance $d_s$ and horizon height $h_h$ are obtained by incorporating the FoV angle along the horizontal axis, as given in (4) and (5), where $\tan\left(\theta + \frac{FoV_V}{2}\right)$ represents the horizon height ratio.
When the vehicle runs at a speed of $v$ and the sampling time of the computer is $T_s$, we can calculate the longitudinal displacement $\Delta x$ using the following relationship:
$\Delta x = v \, T_s$ (6)
We must determine the appropriate sampling time to get an acceptable longitudinal displacement value by considering the vehicle velocity. The camera continuously records the image at the focus point from time step $k$ up to time step $k+n$, where $n$ is the image repetition number given by (7).
Table I lists parameter values of the camera model. The mounted camera's pitch angle and height were determined based on practical considerations. We found the appropriate nominal speed of 0.4 m/s by trial and error, considering the maximum computer vision processing rate of 30 fps. These parameter values become a foundation for scaling up the experiment to a full-scale car in the future.
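The camera-model quantities can be illustrated with flat-road trigonometry. The sketch below is a hedged reconstruction, not the authors' equations: it assumes a flat road, a pitch angle measured downward from the horizontal, and hypothetical mounting values (`H_C`, `PITCH_DEG`) because Table I is not reproduced here. Only the 42° vertical FoV, the 0.4 m/s nominal speed, and the 30 fps rate come from the text. The frame-count estimate at the end is one possible reading of the image repetition number and is labeled as such.

```python
import math


def ground_distance(camera_height_m: float, depression_deg: float) -> float:
    """Distance along a flat road to where a ray hits the ground.

    depression_deg is the ray's angle below the horizontal line.
    """
    return camera_height_m / math.tan(math.radians(depression_deg))


def camera_footprint(camera_height_m: float, pitch_deg: float, fov_v_deg: float):
    """Return (longitudinal distance, short longitudinal distance) in meters."""
    d_l = ground_distance(camera_height_m, pitch_deg)                    # optical axis
    d_s = ground_distance(camera_height_m, pitch_deg + fov_v_deg / 2.0)  # lowest ray
    return d_l, d_s


def displacement_per_sample(speed_mps: float, sampling_time_s: float) -> float:
    """Longitudinal displacement between two consecutive samples."""
    return speed_mps * sampling_time_s


if __name__ == "__main__":
    H_C = 0.20        # hypothetical camera height in meters
    PITCH_DEG = 20.0  # hypothetical downward pitch angle in degrees
    FOV_V_DEG = 42.0  # vertical FoV from the camera specification

    d_l, d_s = camera_footprint(H_C, PITCH_DEG, FOV_V_DEG)
    dx = displacement_per_sample(0.4, 1.0 / 30.0)
    # Assumed reading: samples during which the focus point stays in view.
    samples_in_view = (d_l - d_s) / dx
    print(f"d_l = {d_l:.3f} m, d_s = {d_s:.3f} m")
    print(f"displacement per sample = {dx * 1000:.1f} mm")
    print(f"samples in view (assumed definition) = {samples_in_view:.1f}")
```

At the nominal speed the vehicle advances about 13 mm per frame, so the same patch of road appears in many consecutive images before it leaves the lower edge of the view.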

C. Experimental Performance Evaluation
This section describes experimental scenarios, data acquisition and preprocessing, training and validation, and the performance evaluation process.
First, a systematic experimental scenario is explained, including the experimental route, driving patterns, training road sectors, testing road sectors, and lighting condition settings. Fig. 6 illustrates a top view of the route used in our experimental scenario. The plotted virtual numbers (1 to 16) represent the locations along the route that define the road sectors (RSs) and the trajectory along which the AV navigated. Table II summarizes the experimental scenarios denoted by the road sector numbers, with the trajectory fraction from one corresponding location to another and its driving pattern. The objective is to evaluate the AD performance for various driving patterns under different conditions.
The experimental scenarios consist of the following:
3) Performance evaluation road sectors: the overall performance evaluation is carried out through the same road sectors as the validation sectors. The turning performance evaluation is conducted through the road sectors shown as purple and green lines in Fig. 6(b).
We selected evaluation road sectors that reflected seven driving patterns. The CIL model has already been trained to experience DP1 but has never been trained for DP9. Moreover, the evaluation road sectors differed from the training road sectors for DP3, DP4, DP6, and DP7.
To evaluate the CIL's efficacy under various lighting conditions, we conducted experiments with multiple lighting configurations. As illustrated in Fig. 6, we fixed light bulbs (L1 to L8) at a height of 4 m in the experiment area. Between L7 and L8, at a distance of 4.5 m, we placed a side field bulb, L9. The bulbs L1 to L8 were turned on or off to represent various illumination conditions, whereas L9 was always on. Thus, the illumination becomes a controllable and independent variable. Between bulbs L1 and L6, there are glass windows, and the other sides of the experiment field are enclosed by walls. The effect of sunlight on the glass windows introduces stochastic characteristics, an uncontrollable independent variable. To determine the combined effects of lighting, we conducted experiments in the morning (07:00-10:00), afternoon (12:00-15:00), and night (18:00-23:00).
To evaluate the effect of vehicle speed on performance, we trained the model using a particular nominal speed value and then evaluated the trained model at three different speed values: a slower speed at 15%, the nominal speed at 25%, and a faster speed at 35% of the throttle. Furthermore, we evaluated the efficacy of turning under three different navigational command change instants: too early, normal, and too late. The normal instant is when the command is given as the vehicle approaches an intersection at a distance of approximately 60 cm. The instant is considered too late if the distance is approximately 40 cm or less and too early if the distance is approximately 120 cm or more.
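The speed levels and command-timing thresholds above can be captured in a small lookup, sketched below. The labels S1 to S3 and C1 to C3 follow the condition codes used in the results. Treating every distance strictly between 40 cm and 120 cm as "normal" is an assumption, since the text only gives approximate anchor values.

```python
# Throttle fraction for each evaluated speed level (values from the text).
THROTTLE_BY_SPEED = {"S1": 0.15, "S2": 0.25, "S3": 0.35}

TOO_LATE_CM = 40.0    # approximately 40 cm or less is considered too late
TOO_EARLY_CM = 120.0  # approximately 120 cm or more is considered too early


def classify_command_instant(distance_cm: float) -> str:
    """Map the distance to the intersection at command change to C1/C2/C3."""
    if distance_cm < 0:
        raise ValueError("distance must be non-negative")
    if distance_cm <= TOO_LATE_CM:
        return "C1"  # too late
    if distance_cm >= TOO_EARLY_CM:
        return "C3"  # too early
    return "C2"      # normal band (assumed to cover everything in between)


if __name__ == "__main__":
    for distance in (30.0, 60.0, 150.0):
        print(distance, classify_command_instant(distance))
```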
In contrast to our experimental scenarios, in the work by Codevilla et al., the authors did not distinguish between training and evaluation routes in their physical implementation. They collected most training data in sunny weather and evaluated their model in overcast weather conditions. They did not evaluate the effects of vehicle speed and navigation command change instant [31]. Our experimental scenario enables us to evaluate the hypothesis that the driving pattern, illumination, vehicle speed, and navigation command are independent variables. We expect these variables to be critical factors affecting the vehicle's position and yaw angle when running on the route.
Second, we present data acquisition and preprocessing. During training data collection, an expert manually operated the vehicle and directly observed the lanes while providing appropriate navigation commands following the experimental scenario via a remote controller. The command values and images were recorded synchronously. The raw image was recorded with dimensions of 640×360 pixels. The final dataset for training contains 10,652 observations. Table III to Table V summarize the statistics of the training dataset concerning the driving pattern, time, light condition, initial lateral position, and navigation command. The data was acquired in the morning, afternoon, and night with the setup lights on or off. We collected training data for three initial vehicle positions: the middle of the road, the right side, and the left side. Meanwhile, the validation data was collected only with the initial position in the middle. In terms of observations per driving pattern, DP1 accounts for 65.74%, and each of the other patterns accounts for between 4.04% and 7.85%. We performed image augmentation to provide perception robustness against variations in lighting conditions. The numbers of original (Ori.) and augmented (Aug.) images are listed in Table IV. The number of augmented images is around 9.7% of the number of corresponding original images.
The original image dataset was preprocessed by cropping the region of interest, resizing, converting the color system, and augmenting it. Fig. 7 depicts three samples captured at different road sectors at night with the bulbs switched on. Fig. 8 shows the preprocessing results of the raw image in Fig. 7(a). After cropping the upper section and resizing, the image's resolution became 220×100 pixels (Fig. 8(a)). The RGB image was then converted into a YUV image (Fig. 8(b)). The image in the YUV color system is more efficiently processed by a digital computer [67]. The image was then randomly augmented by adding Gaussian noise (Fig. 8(c)).
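The preprocessing chain (crop, resize, RGB to YUV, Gaussian noise) can be sketched with NumPy alone. This is an illustrative sketch, not the authors' pipeline, which used OpenCV: the number of cropped rows, the noise standard deviation, the nearest-neighbor resize, and the BT.601 conversion matrix are all assumptions chosen for the example.

```python
import numpy as np

# BT.601 RGB -> YUV matrix; rows produce Y, U, and V (assumed convention).
RGB_TO_YUV = np.array(
    [
        [0.299, 0.587, 0.114],
        [-0.14713, -0.28886, 0.436],
        [0.615, -0.51499, -0.10001],
    ],
    dtype=np.float32,
)


def crop_top(image: np.ndarray, rows_to_drop: int) -> np.ndarray:
    """Remove the upper rows, keeping the road region of interest."""
    return image[rows_to_drop:, :, :]


def resize_nearest(image: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Nearest-neighbor resize using integer index maps."""
    in_h, in_w = image.shape[:2]
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return image[rows[:, None], cols[None, :], :]


def rgb_to_yuv(image: np.ndarray) -> np.ndarray:
    """Apply the conversion matrix to every pixel."""
    return image.astype(np.float32) @ RGB_TO_YUV.T


def add_gaussian_noise(image: np.ndarray, sigma: float, rng) -> np.ndarray:
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    noise = rng.normal(0.0, sigma, size=image.shape).astype(np.float32)
    return image + noise


def preprocess(raw_rgb: np.ndarray, rows_to_drop: int = 120,
               sigma: float = 5.0, rng=None) -> np.ndarray:
    """Crop, resize to 100 x 220 (height x width), convert, and add noise."""
    if rng is None:
        rng = np.random.default_rng(0)
    cropped = crop_top(raw_rgb, rows_to_drop)  # rows_to_drop is hypothetical
    small = resize_nearest(cropped, 100, 220)
    yuv = rgb_to_yuv(small)
    return add_gaussian_noise(yuv, sigma, rng)  # sigma is hypothetical


if __name__ == "__main__":
    raw = np.zeros((360, 640, 3), dtype=np.uint8)  # stand-in for a raw frame
    print(preprocess(raw).shape)
```

In the paper only a fraction of the images received noise augmentation, so a training pipeline would apply `add_gaussian_noise` to a random subset of images instead of to every image as this sketch does.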
Next, we describe the training and validation processes. We acquired training and validation data using the nominal speed and the normal navigation command change instant. To accommodate a systematic analysis of critical factors, we associated the training dataset with three different models based on navigation command type as follows: model 1, which refers to the dataset recorded when the vehicle is moving along road sectors 1, 3, 4, 5, 6, 7, 8, 10, 11, and 12; model 2, which refers to the dataset recorded when the vehicle is moving along RS2; and model 3, which refers to RS9. We set the mini-batch size to 64, the learning rate decay to 0.001, and the number of epochs to 65. The IL model was trained using the Adam optimizer [68]. We used the mean absolute error as the loss function. Given the mini-batch size $N_b$ and the predicted and ground truth steering angles $\hat{a}_i$ and $a_i$, we define the loss function $L(\hat{a}, a)$ per mini-batch in (8).
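The mean absolute error over a mini-batch is the average of the absolute differences between predicted and ground truth steering angles. A minimal pure-Python sketch follows; the function name and the sample values are illustrative and do not come from the paper.

```python
def mean_absolute_error(predicted, ground_truth):
    """Average absolute difference between two equal-length sequences."""
    if len(predicted) != len(ground_truth):
        raise ValueError("sequences must have the same length")
    if not predicted:
        raise ValueError("mini-batch must not be empty")
    total = sum(abs(p - t) for p, t in zip(predicted, ground_truth))
    return total / len(predicted)


if __name__ == "__main__":
    # Hypothetical normalized steering angles for a mini-batch of three.
    predicted = [0.10, -0.20, 0.00]
    ground_truth = [0.00, 0.00, 0.30]
    print(mean_absolute_error(predicted, ground_truth))
```

Compared with a squared-error loss, the absolute error penalizes large steering deviations linearly, which makes training less sensitive to a few outlier frames.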
We conducted a validation by running the vehicle through the validation road sectors six times. For each run, the number of observations was 225, 45, and 44 for model 1, model 2, and model 3, respectively.
Fig. 9
The final process is performance evaluation, which includes overall performance and turning performance at intersections. Here, performance metrics are formulated. The autonomous controller performance was evaluated by comparing the vehicle position with the road sector coordinates. We defined reward and penalty values to evaluate the overall performance throughout all the road sectors.
1) When the vehicle stays inside the road, it is said to be successful and is given a reward of 1.
2) When one of the front tires slightly gets out of the lane marking yet manages to get back to the track and continue moving autonomously, it is said to return safely.We assign zero points for this case.
3) It fails when the vehicle exits lane markings and does not return to the lane.We assign -1 point for this case.
We conducted the experiments three times for each evaluation road sector under specific controllable conditions, including speed, navigation command instant, and light on/off. The experiments were also subject to the effect of random light from the outside environment. The reward or penalty value $r_i$ is summed to obtain the overall performance indicator $P_1$ throughout the evaluation road sectors, as given by (9), where $N_1$ denotes the total number of observations.
Moreover, we introduced a more detailed measurement to evaluate the turning performance of the vehicle in terms of the location and yaw angle at a specific intersection. Before starting an evaluation experiment, we placed the x-axis of the vehicle parallel to the road lanes. We initially positioned the vehicle in the middle of the road and measured its position as it moved through the intersection.
The turning performance indicator in terms of location is given by (10), where $(x_r, y_r)$ and $(x_v, y_v)$ represent the reference and vehicle locations at each time step $k$, respectively, and $N_2$ denotes the total number of observations.
We first set the vehicle's initial yaw angle to $\psi_v(0)$. Then, the yaw angle $\psi_v(k)$ was measured relative to the initial yaw angle. We evaluated the turning performance indicator in terms of the yaw angle at the intersections by observing the plot of the yaw angle and the qualitative description.
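One plausible form of a location-based turning indicator is the mean Euclidean distance between the reference and measured positions over all time steps. The sketch below uses that form as an assumption; the exact definition in (10) may differ, for example by using a root-mean-square instead of a mean. The sample coordinates are hypothetical.

```python
import math


def mean_location_error(reference, measured):
    """Mean Euclidean distance between paired (x, y) points."""
    if len(reference) != len(measured):
        raise ValueError("trajectories must have the same number of samples")
    if not reference:
        raise ValueError("trajectories must not be empty")
    total = 0.0
    for (x_r, y_r), (x_v, y_v) in zip(reference, measured):
        total += math.hypot(x_r - x_v, y_r - y_v)
    return total / len(reference)


if __name__ == "__main__":
    # Hypothetical reference (lane center) and measured positions in meters.
    reference = [(0.0, 0.0), (1.0, 0.0)]
    measured = [(0.0, 0.0), (1.0, 1.0)]
    print(mean_location_error(reference, measured))
```

A smaller value means the vehicle stayed closer to the middle line of the turning curve throughout the maneuver.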

III. RESULT AND DISCUSSION
First, we discuss the results of the overall performance evaluation experiments. Table VI summarizes the overall performance of the CIL implementation in the experimental scenario. Effects of critical factors on performance are investigated. They include driving patterns (DPs), lighting conditions, vehicle speed, and navigation command change instant. M, A, and N represent morning, afternoon, and night. L1 and L0 denote the field bulbs on and off. S1, S2, and S3 refer to the vehicle speed slower than the nominal speed, the nominal speed, and faster than the nominal speed. C1, C2, and C3 refer to the navigation command change instant that is too late, timely, and too early compared to the normal instant. They apply only to RS15 and RS17. For other road sectors, we use C0.
Recall that statistics of the training dataset are displayed in Table III, Table IV, and Table V. Statistics of the evaluation dataset are explained in the paragraph below (8), and Table VI shows that each evaluation experiment was conducted three times.
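The condition codes combine into a full set of evaluation runs. The sketch below enumerates them under the assumption of a full factorial design (every time of day, bulb state, and speed for every sector, three command instants at RS15 and RS17 only, three repetitions each). Table VI is not reproduced here, so the design is inferred, but it reproduces the total of 648 experiments and 162 runs per turning sector.

```python
from itertools import product

TIMES = ["M", "A", "N"]             # morning, afternoon, night
LIGHTS = ["L1", "L0"]               # field bulbs on, off
SPEEDS = ["S1", "S2", "S3"]         # slower, nominal, faster
TURNING_SECTORS = [15, 17]          # evaluated with C1, C2, C3
OTHER_SECTORS = [8, 12, 13, 14, 16, 18]  # evaluated with C0 only
REPEATS = 3


def evaluation_runs():
    """Return a list of (road_sector, condition_code, repetition) tuples."""
    runs = []
    for time, light, speed in product(TIMES, LIGHTS, SPEEDS):
        cases = [(sector, "C0") for sector in OTHER_SECTORS]
        cases += list(product(TURNING_SECTORS, ["C1", "C2", "C3"]))
        for sector, command in cases:
            code = f"{time}{light}-{speed}{command}"  # e.g. NL1-S3C1
            for repetition in range(1, REPEATS + 1):
                runs.append((sector, code, repetition))
    return runs


if __name__ == "__main__":
    runs = evaluation_runs()
    print("total runs:", len(runs))
    print("runs at RS17:", sum(1 for sector, _, _ in runs if sector == 17))
```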
The evaluation results demonstrate that the vehicle could autonomously drive successfully through road sectors 8, 12, 13, 14, 16, and 18 in all experiments. However, it left the lane five times when traversing the new intersections at road sectors 15 and 17: four times it returned safely, and once it failed. Except for road sector 8, all evaluation sectors are new road sectors for the vehicle, as they were never traveled during training. Nevertheless, the vehicle experienced the same driving patterns at other locations during the training session, except for road sector 17 (RS17) with DP9.
Even though the vehicle had never been trained to pass through DP9 at RS17, during the evaluation it deviated slightly from the route and returned only three times out of 162 experiments. These cases occurred under the following scenarios: ML1-S3C1, AL0-S3C2, and NL1-S3C1.
During training, the vehicle had never traveled through RS15. However, it had been trained on the same driving pattern in RS2, although under different lighting conditions, since RS15 is close to the glass windows. During the evaluation, out of 162 trials, the vehicle left the lane once but returned under AL1-S3C1 and failed once under NL1-S3C1.
From the overall performance evaluation, the AV failed once when it turned left at a new intersection (RS15) near the glass windows. The failure occurred under the following specific conditions (NL1-S3C1): at night, with the field's bulbs switched on, at a faster speed, and with a navigation command change that occurred too late. Under the specific conditions (L1-S3C1), it left the lane slightly and returned twice when turning right at a new intersection with an untrained driving pattern (RS17) and once when turning left at a new intersection (RS15). It also left the lane slightly and returned once when turning right at the new intersection with an untrained driving pattern (RS17) under the specific conditions (AL0-S3C2): in the afternoon with the field's bulbs off, at a faster speed, and with a normal navigation command change instant.
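The reward scheme (+1 for success, 0 for a safe return, -1 for failure) can be applied to the counts reported above: 643 successes, four safe returns, and one failure in 648 runs. The sketch below assumes the overall indicator is the reward sum divided by the number of runs, which is one natural reading of (9); the function and dictionary names are illustrative.

```python
REWARD = {"success": 1, "returned": 0, "failed": -1}


def overall_performance(outcome_counts):
    """Sum of rewards divided by the total number of runs (assumed form)."""
    total_runs = sum(outcome_counts.values())
    if total_runs == 0:
        raise ValueError("no runs recorded")
    score = sum(REWARD[outcome] * count
                for outcome, count in outcome_counts.items())
    return score / total_runs


if __name__ == "__main__":
    counts = {"success": 643, "returned": 4, "failed": 1}
    indicator = overall_performance(counts)
    print(f"overall indicator: {100 * indicator:.1f}%")
```

Under this reading the indicator is 642/648, or about 99.1%, slightly below the share of fully successful runs (643/648) because the single failure is penalized.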
It is worth noting that the evaluations in the morning and at night with the bulbs off each yielded 108 successful autonomous drives out of 108 under the different conditions. It can be concluded that the CIL model produced a success rate of 99.1% from 648 experiments under different conditions of driving patterns, lighting, vehicle speeds, and navigation command change instants.
Second, since the unsuccessful autonomous driving during the evaluation session happened at new intersections, we discuss in more detail the results of turning performance at new intersections. We measured the turning performance when the AV turned right at the intersection along RS19 and turned left at the intersection along RS15. The AV positions, yaw angles, yaw rates, and steering angles are depicted in Fig. 10 and Fig. 11. The reference trajectory was the middle line of the turning curve. It can be seen from Fig. 11 that the vehicle locations remained close to the references. However, the vehicle experienced overshoot and undershoot in the location and yaw angle responses when it was faster than the nominal speed. The steering angle rapidly increased from the straight-moving steering angle and decreased back to the straight-moving steering angle to keep the AV inside the lane. The steering angle moved to the opposite angle to compensate for the overshoot before returning to the straight-moving steering angle.
During the turning performance evaluation experiments, the AV never escaped from the road; in other words, it achieved a 100% success rate. To deepen our evaluation, we calculated quantitative turning performance indicators regarding location (Table VII). During the right turn at RS19, the yaw angles gradually changed from approximately 0° to -90° between 3.5 s and 4.5 s. When turning left at RS15 at nominal and slower speeds, the vehicle did not overshoot or undershoot. The yaw angles changed from approximately 180° to 270° within 4 s at the nominal speed and 9 s at the slower speed.
The main findings of our study are summarized as follows:
1) The AV could not maintain itself inside the road five times out of 648 experiments when it turned at new intersections.
2) The AV could not maintain itself inside the road at new intersections because of the adverse effects of dazzling light reflection, faster speed, and too-late command change instant.
3) The AV could not maintain itself inside the road at new intersections because of the untrained driving pattern combined with faster speed.
We use 73 references relevant to this work on autonomous driving of ground vehicles from several databases, including Science Direct (three articles and one book), IEEE Xplore (30 articles), Springer (12 articles), MDPI (seven articles), Wiley (two articles), Frontiers (one article), Scopus (11 articles), and others (six articles). Fifty-six articles were published in journals, ten articles in proceedings, and one book was published by Elsevier. This limited number of references indicates that the research topic of autonomous driving based on the end-to-end approach is still in its infancy. Only two articles on end-to-end autonomous driving reported physical experimental results in urban driving [31], [65]. In [31], the authors did not distinguish between training and evaluation routes, nor did they systematically evaluate the effect of lighting. Moreover, they did not consider the effects of vehicle speed and navigation command change instant. In [65], the testing environment contained uncontrollable time-dependent factors, e.g., weather, lighting, and road users. Also, the drivers selected the experiment routes randomly and controlled both steering angle and vehicle speed. These uncontrollable conditions, route randomness, and time-varying speed made identifying key factors more difficult. The time-constant speed in our experiments may be a limitation of this study, but it is also a strength, since it enables a systematic analysis of speed effects on performance by comparing three different time-constant speed values: low speed, nominal speed, and fast speed.
This study's second and third main findings contribute to AD by suggesting avenues for enhancing AD technology. For example, the second main finding motivated us to employ an if-then logic to avoid a turning failure: if the controller identifies dazzling light reflections before turning, then drive slower and do not change the navigation command too late. The third main finding stimulated us to employ a second if-then logic: if the autonomous controller identifies a new intersection with an untrained driving pattern, then do not drive at a faster speed. However, these two logics require more accurate and robust object identification capabilities. Some researchers proposed multimodality fusion and training data augmentation to enhance perception capability, target recognition and tracking, and semantic segmentation [35], [69]-[73].
The original CIL model was modified using multimodality fusion in the CARLA simulator, including color image (RGB)-stereo depth fusion [35] and RGB-LiDAR point cloud fusion [69]. A multimodality fusion of camera and radar data was developed to process a real-world dataset for target tracking based on a switchable dual-level long short-term memory (LSTM) network [70]. The authors validated the method in three illumination conditions: day, dusk, and night modes. However, they did not report running time.
A homography augmentation using the DeepLabv3+ network from stereo images was developed and shown to outperform six state-of-the-art deep CNNs regarding accuracy, precision, recall, and runtime [71]. This method has the potential to be developed further for collision-free space detection algorithms in autonomous driving.
CIL model implementation in real-world urban driving also necessitates efficient running time. Besides multimodal fusion and training data augmentation, exploring more powerful preprocessing methods combined with multi-route networks is also a direction worth pursuing to address this challenge.

IV. CONCLUSION
Two failure conditions decreased the success rate to 99% out of 648 experiments. One is turning at a new intersection under a combination of three factors: dazzling light reflection, faster speed, and a too-late navigation command change instant. The other is turning at a new intersection with an untrained driving pattern coupled with faster speed.
Under controllable conditions, we need to ensure a success rate of 100%. Based on the knowledge obtained in this study, we can embed the following two if-then logics into the autonomous controller to avoid any turning failure: if the controller identifies dazzling light reflections before turning, then drive slower and do not change the navigation command too late; if the autonomous controller identifies a new intersection with an untrained driving pattern, then do not drive at a faster speed.
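The two if-then logics can be sketched as a small rule layer placed in front of the speed command. This is only an illustrative sketch under stated assumptions: the `Perception` flags, the `Speed` levels, and the function `apply_turning_rules` are hypothetical names, not part of the controller used in this study. Interpreting "drive slower" as selecting the low speed and "not faster" as capping at nominal speed is also an assumption. The sketch presumes that an upstream perception module can already detect dazzling reflections and untrained driving patterns.

```python
from dataclasses import dataclass
from enum import Enum


class Speed(Enum):
    """Three constant speed levels (labels are illustrative)."""
    SLOW = 1
    NOMINAL = 2
    FAST = 3


@dataclass
class Perception:
    # Hypothetical flags assumed to be provided by an upstream perception module.
    dazzling_reflection: bool
    new_intersection_untrained_pattern: bool


@dataclass
class Decision:
    speed: Speed
    forbid_late_command_change: bool


def apply_turning_rules(p: Perception, requested: Speed) -> Decision:
    """Apply the two if-then logics before a turn."""
    speed = requested
    forbid_late = False
    # Logic 1: dazzling reflection before turning -> drive slower and
    # do not change the navigation command too late.
    if p.dazzling_reflection:
        speed = Speed.SLOW  # assumption: "slower" means the low speed level
        forbid_late = True
    # Logic 2: new intersection with an untrained driving pattern ->
    # do not drive at the faster speed.
    if p.new_intersection_untrained_pattern and speed is Speed.FAST:
        speed = Speed.NOMINAL  # assumption: cap at nominal speed
    return Decision(speed, forbid_late)
```

For example, `apply_turning_rules(Perception(True, False), Speed.FAST)` yields the slow speed with late command changes forbidden, while a request with neither flag set passes through unchanged.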
The two logics require more accurate and robust object identification capabilities. Implementing the autonomous controller in real time also requires efficient running time. For future work, we intend to develop a model incorporating multimodal fusion, training data augmentation, powerful preprocessing, and multi-route networks.
One limitation of this study is that the dazzling light reflection happened accidentally during the experiment. To develop a CIL model that is robust against dazzling light reflections, we need to be able to reproduce several such reflections and systematically evaluate the model's performance against them. The other limitation is that no obstacles exist in the road sectors. A natural extension of this study is to develop a CIL model that can avoid obstacles.

It performed field experiments in a more deterministic manner by evaluating the effects of critical factors: (a) experimental lighting conditions in the morning, afternoon, and night; (b) vehicle speed; (c) navigation command change instant at intersections; (d) the clear separation between training and evaluation road sectors.

Fig. 5. Camera model of the camera fixed on the top of the AV.
the scenario. We set nine unique driving patterns (DPs) to drive the AV along the route. The patterns include lane following on a straight road (DP1), moving straight by passing an intersection on the left-hand side (DP2) or right-hand side (DP3), turning left (DP4) or right (DP5) at a curve, turning left (DP6) or right (DP7) at an intersection, and turning left (DP8) or right (DP9) facing a T junction.
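For bookkeeping during data labeling and evaluation, the nine driving patterns could be encoded as an enumeration. The sketch below is a hypothetical illustration, not the study's code: the class name `DrivingPattern`, the set `TURNING_PATTERNS`, and the helper `is_turn` are assumed names, and grouping DP4 to DP9 as "turning" patterns simply follows the descriptions in the text.

```python
from enum import Enum


class DrivingPattern(Enum):
    """The nine driving patterns (DPs) described in the text."""
    DP1 = "lane following on a straight road"
    DP2 = "straight, passing an intersection on the left-hand side"
    DP3 = "straight, passing an intersection on the right-hand side"
    DP4 = "left turn at a curve"
    DP5 = "right turn at a curve"
    DP6 = "left turn at an intersection"
    DP7 = "right turn at an intersection"
    DP8 = "left turn facing a T junction"
    DP9 = "right turn facing a T junction"


# Patterns whose description involves turning (assumed grouping).
TURNING_PATTERNS = {
    DrivingPattern.DP4, DrivingPattern.DP5, DrivingPattern.DP6,
    DrivingPattern.DP7, DrivingPattern.DP8, DrivingPattern.DP9,
}


def is_turn(dp: DrivingPattern) -> bool:
    """Return True if the pattern is one of the turning patterns."""
    return dp in TURNING_PATTERNS
```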

Fig. 6. Experiment route: (a) training, validation, and overall performance evaluation; (b) evaluation of turning performance.
plots the model loss of each epoch obtained from the training and validation phases. The model loss of the validation phase fluctuates to some extent, particularly for the right-turn navigation command. This is likely because the data in the validation dataset were never seen during training.

Fig. 10 depicts the experimental results obtained from three different navigation command change instants under the same conditions: nominal speed, at night, and with the experiment field bulbs switched on. The solid red line in Fig. 10(a) denotes the reference trajectory. The dotted, dotted-dashed, and dashed lines represent the trajectories when the navigation command change instant is too late, normal, and too early, respectively. The corresponding yaw angles, yaw rates, and steering angles are plotted in Fig. 10(b), Fig. 10(c), and Fig. 10(d), respectively. These conditions correspond to NL1-S2C1, NL1-S2C2, and NL1-S2C3 in Table VI. We observe similar dynamical patterns among the experimental results: the AV locations remained close to the references without undergoing drastic change.

Fig. 11 plots the experimental results obtained from three different vehicle speeds under the same conditions: normal navigation command change instant, at night, and with the experiment field bulbs switched on. The solid red line in Fig. 11(a) denotes the reference trajectory. The dotted, dotted-dashed, and dashed lines represent the trajectories when the vehicle speed is faster, nominal, and slower, respectively. In Fig. 11(b), Fig. 11(c), and Fig. 11(d), the dotted, dotted-dashed, and dashed lines represent the yaw angles, yaw rates, and steering angles of the corresponding conditions. These conditions correspond

Fig. 10. (a) Locations, (b) yaw angle, (c) yaw rate, and (d) steering angle along RS19 under three navigation command change instants.
$D = \{\langle o_i, a_i \rangle\}_{i=1}^{N}$ generated by the expert. In each step $i$, the expert receives an observation $o_i$ and takes an action $a_i$. The dataset is composed of $N$ pairs of observations and actions. The objective is to find the parameter values $\theta$ of a model approximator $F(o_i; \theta)$ that fits the mapping of observations to actions as expressed in (1).
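To make the fitting objective concrete, the sketch below treats imitation learning as supervised regression on expert observation-action pairs. It is a minimal illustration under stated assumptions, not the CNN used in this study: the approximator is linear, the loss is mean squared error, the expert data are synthetic, and the names `predict`, `fit_policy`, and `true_theta` are hypothetical.

```python
import random


def predict(theta, obs):
    """Linear approximator: action = sum_j theta[j] * obs[j] (an assumption)."""
    return sum(t * x for t, x in zip(theta, obs))


def fit_policy(dataset, dim, lr=0.05, epochs=200):
    """Fit theta by batch gradient descent on the mean squared error
    between predicted and expert actions."""
    theta = [0.0] * dim
    n = len(dataset)
    for _ in range(epochs):
        grad = [0.0] * dim
        for obs, act in dataset:
            err = predict(theta, obs) - act
            for j in range(dim):
                grad[j] += 2.0 * err * obs[j] / n
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


# Synthetic "expert": actions come from a known linear rule (illustrative only).
random.seed(0)
true_theta = [0.5, -1.0, 2.0]
dataset = []
for _ in range(200):
    obs = [random.gauss(0.0, 1.0) for _ in range(3)]
    dataset.append((obs, predict(true_theta, obs)))

theta_hat = fit_policy(dataset, dim=3)
```

Because the synthetic expert is noise-free and linear, the fitted parameters approach `true_theta`; with real driving data, the approximator would be a deep network and the observations would be camera images.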

TABLE I. PARAMETER VALUES OF THE CAMERA MODEL

TABLE III. TRAINING DATASET: ROAD SECTOR AND DRIVING PATTERN

TABLE VI
Estiko Rijanto, Key Factors that Negatively Affect Performance of Imitation Learning for Autonomous Driving

TABLE VII. TURNING PERFORMANCE INDICATOR IN TERMS OF LOCATION