Publication Type : Book Chapter
Publisher : Springer International Publishing, Cham
Source : Cybersecurity and Secure Information Systems: Challenges and Solutions in Smart Environments, Springer International Publishing, Cham, p.43–65 (2019)
Url : https://doi.org/10.1007/978-3-030-16837-7_4
ISBN : 9783030168377
Campus : Coimbatore
School : School of Engineering
Center : Computational Engineering and Networking
Department : Electronics and Communication
Year : 2019
Abstract : Malicious uniform resource locators (URLs) host unsolicited content, pose a serious threat, and are used to commit cybercrime. Malicious URLs are responsible for various cyber attacks such as spamming, identity theft, and financial fraud. The growth of the internet has also led to an increase in fraudulent activity on the web. Classical methods such as blacklisting are ineffective at detecting newly generated malicious URLs, so there is a need for an effective algorithm to detect and classify them. At the same time, recent advances in machine learning have shown promising results in areas such as image processing and natural language processing (NLP). This motivates us to pursue machine learning based techniques for detecting and classifying URLs. However, significant challenges in detecting malicious URLs remain to be addressed. First and foremost, any available data used for detecting malicious URLs is outdated, which makes a model difficult to deploy in a real-time scenario. Second, the inability to capture semantic and sequential information affects generalization to the test data. To overcome these shortcomings, we introduce the concept of time split and random split on the training data. A random split divides the data randomly for training and testing, whereas a time split divides the data based on the time information of the URLs. This is followed by different representations of the data, which are passed to classical machine learning and deep learning techniques to evaluate performance. The analysis of the data set from the Sophos Machine Learning building blocks tutorial shows better performance for time-split-based grouping of the data with a decision tree classifier, with an accuracy of 88.5%. Additionally, a highly scalable framework is designed to collect data passively from various data sources inside an Ethernet LAN. The proposed framework can collect data in real time and process it in a distributed way to provide situational awareness. It can be easily extended to handle a very large number of cyber events by adding resources to the existing system.
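The difference between the two splits described in the abstract can be illustrated with a minimal sketch. It is not the chapter's implementation: the record layout (url, first_seen, label), the sample URLs and dates, the function names, and the choice to train on the oldest records and test on the newest are all assumptions made for illustration.

```python
import random
from datetime import date

# Hypothetical records: (url, first_seen, label). Made-up data, not from the chapter.
records = [
    ("http://a.example/login", date(2018, 1, 5), 0),
    ("http://b.example/free-gift", date(2018, 2, 11), 1),
    ("http://c.example/docs", date(2018, 3, 2), 0),
    ("http://d.example/verify-account", date(2018, 4, 19), 1),
    ("http://e.example/news", date(2018, 5, 7), 0),
    ("http://f.example/prize", date(2018, 6, 23), 1),
    ("http://g.example/shop", date(2018, 7, 30), 0),
    ("http://h.example/update-bank", date(2018, 8, 14), 1),
]


def random_split(records, test_fraction=0.25, seed=0):
    """Shuffle the records and hold out a random fraction for testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def time_split(records, test_fraction=0.25):
    """Sort by first_seen; train on the oldest records, test on the newest.

    Assumption: this ordering mimics deployment, where a model trained on
    past URLs must classify URLs that appear later.
    """
    ordered = sorted(records, key=lambda r: r[1])
    cut = len(ordered) - int(len(ordered) * test_fraction)
    return ordered[:cut], ordered[cut:]


rand_train, rand_test = random_split(records)
time_train, time_test = time_split(records)
```

With a time split, every test URL is newer than every training URL, so the evaluation reflects how a model would face previously unseen URLs; a random split mixes old and new URLs in both sets.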
Cite this Research Publication : N. B. Harikrishnan, Vinayakumar, R., Dr. Soman K. P., and Prabaharan Poornachandran, “Time Split Based Pre-processing with a Data-Driven Approach for Malicious URL Detection”, in Cybersecurity and Secure Information Systems: Challenges and Solutions in Smart Environments, A. Ella Hassanien and Elhoseny, M., Eds. Cham: Springer International Publishing, 2019, pp. 43–65.