Enhancing Voice Authentication with a Hybrid Deep Learning and Active Learning Approach for Deepfake Detection

Ali Saadoon Ahmed; Arshad M. Khaleel

doi:10.18196/jrc.v5i6.23502

Authors

Ali Saadoon Ahmed, University of Al Maarif, https://orcid.org/0000-0003-4120-9073
Arshad M. Khaleel, Iraq Ministry of Education 2nd International Smart Card Al-Anbar, https://orcid.org/0000-0001-7448-8019

Keywords:

Active Learning, Machine Learning, Automatic Speaker Verification, ASVspoof 2019, Random Forest, MLP, Spoofing Detection.

Abstract

This paper explores active learning as a means of improving machine learning classifiers for spoofing detection in automatic speaker verification (ASV) systems. Using the ASVspoof 2019 database, we integrate an active learning framework into a conventional machine learning workflow built around Random Forest (RF) and Multilayer Perceptron (MLP) classifiers. Each model is first trained on a small subset of the data; the samples about which it is most uncertain are then selected iteratively and used for further training, allowing the classifier to refine its predictions. Experimental results show that the MLP initially outperformed RF, with an accuracy of 95.83% compared to 91%, and that active learning raised RF's accuracy to 94%, narrowing the gap between the two models. After active learning was applied, both classifiers showed higher precision, recall, and F1-scores, with improvements ranging from 3% to 5%. These results indicate that active learning can make machine learning models more efficient in dynamic spoofing scenarios for ASV systems. Future research should design more advanced active learning techniques and explore their integration with other machine learning paradigms to further strengthen ASV security.
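
The iterative procedure summarized in the abstract (train on a small labeled subset, then repeatedly query the most uncertain samples) can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. A tiny logistic-regression model stands in for the RF and MLP classifiers, synthetic two-dimensional points stand in for ASVspoof 2019 features, and least-confidence scoring is assumed as the uncertainty measure. The names `make_blobs`, `train_logistic`, `uncertainty`, and `active_learning`, along with all parameter values, are hypothetical.

```python
import math
import random


def make_blobs(n_per_class=100, seed=0):
    """Synthetic stand-in data: label 0 (bona fide) near (-2, -2), label 1 (spoof) near (2, 2)."""
    rng = random.Random(seed)
    X, y = [], []
    for label, center in ((0, -2.0), (1, 2.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
            y.append(label)
    return X, y


def predict_proba(w, b, x):
    """Probability of class 1 under a logistic model with weights w and bias b."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, epochs=200, lr=0.1):
    """Fit the stand-in classifier with plain stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


def uncertainty(p):
    """Least-confidence score for a binary problem; largest when p is 0.5."""
    return 1.0 - max(p, 1.0 - p)


def active_learning(X_pool, y_pool, n_initial=10, batch_size=5, n_rounds=5, seed=0):
    """Pool-based uncertainty sampling; returns the final model and the labeled indices."""
    rng = random.Random(seed)
    indices = list(range(len(X_pool)))
    rng.shuffle(indices)
    labeled, unlabeled = indices[:n_initial], indices[n_initial:]
    for _ in range(n_rounds):
        w, b = train_logistic([X_pool[i] for i in labeled], [y_pool[i] for i in labeled])
        # Rank the unlabeled pool by uncertainty and query the top batch.
        ranked = sorted(unlabeled,
                        key=lambda i: uncertainty(predict_proba(w, b, X_pool[i])),
                        reverse=True)
        query = set(ranked[:batch_size])
        labeled = labeled + [i for i in ranked if i in query]
        unlabeled = [i for i in unlabeled if i not in query]
    # Retrain once more so the returned model has seen every queried sample.
    model = train_logistic([X_pool[i] for i in labeled], [y_pool[i] for i in labeled])
    return model, labeled


def accuracy(model, X, y):
    """Fraction of samples whose thresholded prediction matches the label."""
    w, b = model
    hits = sum(int(predict_proba(w, b, xi) >= 0.5) == yi for xi, yi in zip(X, y))
    return hits / len(y)


if __name__ == "__main__":
    X, y = make_blobs()
    model, labeled = active_learning(X, y)
    print(f"labeled samples used: {len(labeled)}")
    print(f"accuracy on the full pool: {accuracy(model, X, y):.3f}")
```

In this sketch the labels of queried samples are simply read from `y_pool`, which simulates an annotator. Swapping in a different classifier only requires replacing `train_logistic` and `predict_proba` with functions that return a fitted model and a class-1 probability.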
