An optimized K-Nearest Neighbor based breast cancer detection

In this research, a grid search is employed to find the optimal hyper-parameter, and an optimized K-Nearest Neighbor (KNN) based breast cancer detection model is proposed. The grid search finds the value of K that yields the best breast cancer detection accuracy. This study also explores the effect of hyper-parameter tuning on the performance of KNN for breast cancer detection. The effect is tested experimentally on the Wisconsin breast cancer dataset collected from the Kaggle data repository, and the findings reveal that hyper-parameter tuning has a significant effect on the performance of the KNN model. Finally, we compare the performance of KNN with the tuned hyper-parameter against KNN with the default hyper-parameter. On the testing set, the proposed optimized model achieves an accuracy of 94.35%, whereas KNN with the default hyper-parameter achieves 90.10%.
Keywords: breast cancer detection, KNN, optimized KNN, breast cancer, machine learning, hyper-parameter tuning


I. INTRODUCTION
Breast cancer is one of the most common types of cancer in the world and is a cause of death. Machine learning plays an important role in the breast cancer identification process and can therefore help to reduce the mortality rate caused by breast cancer. Machine learning algorithms are applied to develop intelligent systems that identify breast cancer at as early a stage as possible, in order to reduce complications and increase the survival rate of patients.
A review of the literature shows that a number of works [1]-[15] have addressed the breast cancer identification problem using machine learning algorithms. The challenge with machine learning models, however, is choosing the hyper-parameter that produces the best prediction performance. In this study, hyper-parameter tuning with the help of grid search is employed to develop an optimized K-Nearest Neighbor based model for breast cancer detection with the maximum possible performance. Furthermore, this research investigates the following research questions: 1) How can the KNN algorithm be optimized for breast cancer detection? 2) What is the best hyper-parameter, or K value, that produces the maximum possible performance on breast cancer detection with the KNN algorithm? 3) What is the effect of hyper-parameter tuning on the performance of the KNN algorithm for breast cancer detection? 4) How should the K value be chosen when training K-Nearest Neighbor for breast cancer detection?
In [7], a breast cancer detection model is proposed by employing three machine learning algorithms, namely Naïve Bayes, random forest and K-Nearest Neighbor (KNN). The authors applied these algorithms to the Wisconsin breast cancer data repository and compared their performance. The experimental results show that KNN performs better than the random forest and Naïve Bayes algorithms.
In another study on the breast cancer identification problem [8], a breast cancer prediction model is proposed using the decision tree algorithm. The proposed model has an acceptable level of performance for breast cancer detection, although its performance can still be improved.
In [9], a decision tree-based breast cancer classification model is proposed. The proposed model has acceptable performance for breast cancer detection. The experimental analysis on the test set shows that the decision tree algorithm performs well for breast cancer detection, with an accuracy of 80.5%.
Journal of Robotics and Control, ISSN: 2715-5072. Tsehay Admassu Assegie, An optimized K-Nearest Neighbor based breast cancer detection
In [10], an artificial neural network (ANN) based breast cancer diagnosis model with a single hidden layer is proposed. The model is developed and tested using the Wisconsin breast cancer data repository.
In [11], K-means clustering is applied to the University of California, Irvine (UCI) breast cancer dataset and a learning model is proposed for breast cancer detection. The analysis of the K-means based clustering model shows that it has an accuracy of 73.70% on breast cancer detection.
In [12], the performance of the decision tree, Naïve Bayes and logistic regression algorithms is compared for breast cancer detection using the UCI breast cancer repository. The comparison shows that the decision tree algorithm performs better than the other two algorithms.
In [13], a neural network-based breast cancer detection model is proposed. The neural network is trained using the University of California, Irvine (UCI) breast cancer data repository. The authors compared the proposed model with K-Nearest Neighbor and Naïve Bayes, and the comparison shows that the neural network has better classification performance than both algorithms.
In another study on the breast cancer identification problem [14], a weighted decision tree-based breast cancer identification model is proposed. The authors evaluated the proposed model on the test set, and the result reveals that the performance of the model is 94.03%.
In [15], deep learning is applied to the Wisconsin breast cancer dataset and a model for breast cancer classification is proposed. The proposed model is effective in breast cancer classification, although the authors did not mention how the hyper-parameters were selected in the training phase. The performance of the deep learning model is 90%.
In another study [16], a support vector machine (SVM) is employed to automate breast cancer identification. The evaluation shows that the performance of the proposed model is 87.12%. The authors did not mention how the hyper-parameters were selected for training the support vector machine.

III. RESEARCH METHOD
In this section, the method for data collection, the hyper-parameter selection approach employed to find the best K value for the K-Nearest Neighbor, and the metrics used in the performance evaluation of the proposed model are discussed.

A. Dataset description
The breast cancer dataset used in this research has six attributes, as shown in Figure 1. The dataset features are listed in Table 1. The diagnosis feature is used as the class label, indicating the class to which a particular observation in the dataset belongs, and the other features are used together with the diagnosis label when the model is trained on the training set. In this research, we used 70% of the dataset for training and 30% of the dataset for testing the proposed model.
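The 70/30 partition described above can be sketched as follows. This is a minimal illustration only: the function name `split_70_30`, the random seed and the placeholder records are assumptions made for the example and do not come from the study, which does not state how the split was implemented. In practice, a library routine such as scikit-learn's `train_test_split` would typically be used instead.

```python
import random


def split_70_30(rows, seed=42):
    """Shuffle a copy of `rows` and return (train, test) in a 70/30 ratio.

    `seed` is an illustrative choice that makes the shuffle repeatable.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))  # index separating train from test
    return shuffled[:cut], shuffled[cut:]


# Placeholder records of the form (feature_vector, diagnosis).
# Real rows would be read from the breast cancer dataset file.
records = [([float(i), float(i % 7)], i % 2) for i in range(100)]

train, test = split_70_30(records)
print(len(train), len(test))
```

Shuffling before cutting avoids a split that simply follows the order in which the observations are stored in the file.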

IV. RESULTS AND DISCUSSIONS
A grid search approach is used to search for the best K value. The grid search helps us to plot the accuracy of the KNN against the misclassification error. The performance of the proposed model is evaluated using accuracy as the performance measure on the breast cancer test set. The model is evaluated for different values of K and the result is shown in Figure 2. As shown in Figure 5, the misclassification error is highest at K values of 8 and 39. The misclassification error is lowest when K is 9, where the proposed model achieves its highest accuracy of 94.35% with a misclassification error of 0.065, as shown in Figure 2.
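The search over K can be sketched in a few lines of plain Python. This is a hedged illustration of the general technique and not the code used in the study: the names `knn_predict`, `accuracy`, `grid_search_k` and `make_toy_data` are hypothetical, the data are synthetic two-dimensional points rather than the Wisconsin dataset, and the candidate range of K from 1 to 39 is an assumption suggested by the K values mentioned in the text. The misclassification error is taken here as one minus the accuracy.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, query, k):
    """Label `query` by majority vote among its k nearest training points."""
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def accuracy(train_X, train_y, test_X, test_y, k):
    """Fraction of held-out points classified correctly for a given k."""
    hits = sum(knn_predict(train_X, train_y, x, k) == y
               for x, y in zip(test_X, test_y))
    return hits / len(test_y)


def grid_search_k(train_X, train_y, test_X, test_y, k_values):
    """Evaluate every candidate k and return (best_k, scores, errors)."""
    scores = {k: accuracy(train_X, train_y, test_X, test_y, k)
              for k in k_values}
    errors = {k: 1.0 - s for k, s in scores.items()}  # misclassification error
    best_k = max(scores, key=scores.get)  # first k reaching the top accuracy
    return best_k, scores, errors


def make_toy_data(n, seed):
    """Synthetic two-class data (assumption: stands in for the real dataset)."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 0.0 if label == 0 else 2.0
        X.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
        y.append(label)
    return X, y


X, y = make_toy_data(200, seed=0)
train_X, train_y = X[:140], y[:140]  # 70% for training
test_X, test_y = X[140:], y[140:]    # 30% held out

best_k, scores, errors = grid_search_k(train_X, train_y, test_X, test_y,
                                       range(1, 40))
print(best_k, scores[best_k], errors[best_k])
```

Plotting `errors` against K would give a curve of the same kind as the one the text describes, with the best K at its minimum.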

V. CONCLUSION
In this research, an optimized KNN model is proposed for breast cancer prediction, using a grid search approach to find the best hyper-parameter. A comparison between the default and the tuned hyper-parameter shows that the performance improves significantly when the best hyper-parameter, or K value, is used for training the KNN. The performance of the KNN with default parameters is 90.10%, whereas the highest performance achieved when the K value is chosen using a grid search is 94.35%.
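The size of the improvement follows directly from the two reported figures. The short calculation below is only an arithmetic check on those numbers; the variable names are illustrative and no new results are introduced.

```python
# Accuracies in percent, as reported in the text.
default_accuracy = 90.10  # KNN with the default hyper-parameter
tuned_accuracy = 94.35    # KNN with K chosen by grid search

# Absolute gain in percentage points and gain relative to the default model.
absolute_gain = tuned_accuracy - default_accuracy
relative_gain = 100.0 * absolute_gain / default_accuracy

print(f"absolute gain: {absolute_gain:.2f} percentage points")
print(f"relative gain: {relative_gain:.2f}%")
```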