Performance Evaluation of Apache Spark MLlib Algorithms on an Intrusion Detection Dataset

Document Type: Research Article


Department of Computer Engineering and Information Technology, Razi University, Iran.


The growing use of the Internet and web services, the advent of fifth-generation cellular network technology (5G), and ever-growing Internet of Things (IoT) data traffic will increase global Internet usage. To secure future networks, machine learning-based intrusion detection and prevention systems (IDPS) must be deployed to detect new attacks, and big data parallel processing tools can handle the very large collections of training data these systems require. In this paper, Apache Spark, a fast, general-purpose cluster computing platform, is used to process a large volume of network traffic feature data and to train models on it. The most important features of the CSE-CIC-IDS2018 dataset are used to construct the machine learning models, and popular machine learning approaches, namely Logistic Regression, Support Vector Machine (SVM), three different Decision Tree classifiers, and Naive Bayes, are trained using up to eight worker nodes. Our Spark cluster consists of eight machines: seven act as worker nodes only, and one is configured as both master and worker. We use the CSE-CIC-IDS2018 dataset to evaluate the overall performance of these algorithms on Botnet attacks, and distributed hyperparameter tuning is used to find the best parameters for a single decision tree. In our experiments, we achieved up to 100% accuracy using the features selected by the learning method.
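The distributed hyperparameter tuning of a single decision tree mentioned above could look roughly like the following minimal sketch. It assumes the DataFrame-based pyspark.ml API, which may differ from the API the authors used. The search-space values, the column names "label" and "features", the fold count, and the helper names grid_size and build_tuner are all hypothetical, since the abstract does not state them.

```python
from itertools import product

# Hypothetical search space; the abstract does not list the tuned values.
MAX_DEPTHS = [3, 5, 10]
MAX_BINS = [16, 32]
IMPURITIES = ["gini", "entropy"]


def grid_size():
    """Number of parameter combinations the tuner will evaluate."""
    return len(list(product(MAX_DEPTHS, MAX_BINS, IMPURITIES)))


def build_tuner(label_col="label", features_col="features",
                num_folds=3, parallelism=8):
    """Build a CrossValidator that searches the grid for one decision tree.

    pyspark is imported lazily so this module loads without Spark installed.
    Column names and default values are assumptions for illustration only.
    """
    from pyspark.ml.classification import DecisionTreeClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    tree = DecisionTreeClassifier(labelCol=label_col, featuresCol=features_col)

    # One ParamMap per combination of depth, bin count, and impurity.
    grid = (ParamGridBuilder()
            .addGrid(tree.maxDepth, MAX_DEPTHS)
            .addGrid(tree.maxBins, MAX_BINS)
            .addGrid(tree.impurity, IMPURITIES)
            .build())

    # Accuracy is the metric reported in the abstract.
    evaluator = MulticlassClassificationEvaluator(
        labelCol=label_col, predictionCol="prediction", metricName="accuracy")

    # parallelism sets how many candidate models are fitted concurrently;
    # each individual fit is itself distributed across the cluster.
    return CrossValidator(estimator=tree, estimatorParamMaps=grid,
                          evaluator=evaluator, numFolds=num_folds,
                          parallelism=parallelism)
```

With a running Spark session and a training DataFrame that already has assembled feature vectors, calling build_tuner().fit(train_df) would return a model whose bestModel attribute holds the best decision tree found. Here train_df is a placeholder for whatever DataFrame holds the selected features.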


Volume 9, Issue 1 - Serial Number 1
January 2022
Pages 57-69
  • Receive Date: 09 November 2021
  • Revise Date: 26 May 2022
  • Accept Date: 31 May 2022
  • First Publish Date: 31 May 2022