Design approach for a malicious domain name detection system

Author: China Education Network  Time: 2022.06.30

Campus networks serve a very large number of users and carry many kinds of network applications. Poor user habits and a steady stream of software and hardware vulnerabilities leave the campus network exposed to a variety of latent security threats, including viruses, Trojans, phishing websites, spam, and other malware. Once a user device is compromised, the infection can spread to other devices on the same network segment. The attacker's ultimate goal is to take control of the devices.

Monitoring shows that compromised devices connect to a command-and-control (C&C) server to receive instructions for further attacks such as ransomware, DDoS, spam, and cryptocurrency mining. Originally, the domain name or IP address of the C&C server was hard-coded in the malware, and the victim device contacted that address to establish a control channel. Security staff could monitor the address and add it to a firewall blacklist to block it. Today, attackers usually use a domain generation algorithm (DGA) to generate malicious domain names in bulk, of which only a small fraction actually resolve. The victim device obtains the C&C server address by querying these domain names [1].

DGAs are often combined with Fast-Flux, which frequently changes the mapping between a domain name and its IP addresses, so that the domain name resolves to many different addresses within a short period [2]. As shown in Figure 1, when a client accesses a domain name deployed with Fast-Flux, the returned time to live (TTL) is 0, which forces the client to query the domain's authoritative DNS server on every access, so that repeated resolutions of the same domain name return different IP addresses.

Fast-Flux obscures the address of the C&C server, so security staff cannot block every possible address, and it has become a common technique for malware to evade tracking. As a result, accessing such malicious domain names is a characteristic sign of malware infection. In a campus network, compromised devices can be located by detecting access to malicious domain names, which helps stop malware from spreading.

Figure 1 Client access process in a Fast-Flux network

Status of research

A DGA uses various encryption algorithms to generate a series of pseudo-random strings from a random seed. Table 1 lists some malicious domain families and example domain names. Different families use different DGAs, but the domain names within one family follow certain regularities in length and character distribution.

Table 1 Malicious domain families and domain name examples

At present, there are three main approaches to malicious domain name detection:

1. Detection based on domain name characteristics

Normal and malicious domain names differ significantly in character distribution. Normal domain names are generally readable and short, whereas malicious domain names are generally meaningless strings. Researchers use word segmentation algorithms to decompose domain names and take the unigram and bigram distributions as features [3] [4]. Other features used to detect malicious domain names include the string length, the information entropy of the domain name, the vowel-to-consonant ratio, the longest meaningful substring, and the K-L distance [5] [6].
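A few of the character features listed above can be sketched in Python. This is only an illustration: the function names are hypothetical, the features are computed on the leftmost label of the domain name (an assumption, since the cited papers differ in which part of the name they use), and the K-L distance and longest meaningful substring are omitted because they need a reference corpus.

```python
import math
from collections import Counter

VOWELS = set("aeiou")


def shannon_entropy(s):
    """Shannon entropy of the character distribution of s, in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def bigrams(s):
    """All overlapping two-character substrings of s, in order."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def char_features(domain):
    """Character features of the leftmost label (assumption: leftmost label only)."""
    label = domain.lower().split(".")[0]
    letters = [ch for ch in label if ch.isalpha()]
    vowels = sum(ch in VOWELS for ch in letters)
    return {
        "length": len(label),
        "entropy": shannon_entropy(label),
        # Share of vowels among the letters; 0.0 if the label has no letters.
        "vowel_ratio": vowels / len(letters) if letters else 0.0,
        "bigrams": Counter(bigrams(label)),
    }
```

A readable name such as "baidu.com" tends to have a higher vowel ratio and lower entropy than a long pseudo-random DGA label, which is the kind of difference these features are meant to expose to a classifier.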

2. Detection based on domain name traffic information

The traffic of normal and malicious domain names also differs greatly. The TTL of a malicious domain name is very short, and the number of queries for it surges and drops sharply over time. Researchers extract features such as access-time patterns and change points, the many-to-many relationship between domain names and IP addresses, domain WHOIS information, TTL information, and the ratio of NXDOMAIN responses [7] [8]. Research in this area focuses on effectively detecting temporal patterns and change points.

3. Detection using deep learning

In recent years, deep learning methods have been widely studied and applied, and recurrent neural networks, long short-term memory (LSTM) networks, and convolutional neural networks have been used for malicious domain name detection [9] [10] [11]. Deep learning does not rely on manually extracted features and can learn deep features that are hard to identify by hand from the input. Studies have shown that classification models based on LSTM and convolutional neural networks achieve better overall detection results than traditional machine learning methods [12].

System design and implementation

Domain name traffic information is hard to extract and difficult to apply in a real network; domain name character features are easy to obtain, but the misclassification rate is higher [13]. This system combines the advantages of both: it uses a long short-term memory (LSTM) network to build a classification model on domain name character features, and then verifies the classification results against domain name access behavior. The system runs daily to detect malicious domain names on the campus network of Huazhong University of Science and Technology (hereinafter referred to as the campus network). Its architecture is shown in Figure 2 and comprises four modules: model training, data collection, domain name detection, and data display.

Figure 2 Architecture of the campus network malicious domain name detection system

Model training

1. Sample collection

To train the classification model, the system collects a total of 1 million labeled samples per day as training data. The malicious domain name data comes from the Netlab DGA Project [14], which collects more than 10,000 domain names from more than 50 DGA families and is updated daily. Each record includes the domain name, the family name, and related domain name information. Because some families update only a small number of domain names each day, the system merges new data with old data daily to keep expanding the sample pool, and randomly selects 500,000 domain names as positive samples. Normal domain name data comes from Alexa's daily list of the top 1 million domain names by global traffic [15]; the top 500,000 are selected each day as negative samples. Because a high-traffic domain name is very unlikely to be malicious, researchers commonly use this list as a source of normal samples.
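The daily sample assembly described above can be sketched as follows. The function name, its parameters, and the in-memory representation are hypothetical; the article does not say how the feeds are parsed or stored, so the inputs are assumed to be already-loaded collections of domain names.

```python
import random


def build_training_set(dga_pool, new_dga, alexa_ranked, n_per_class, seed=None):
    """Merge today's DGA names into the pool and draw a balanced labeled sample.

    dga_pool:     malicious names accumulated on previous days
    new_dga:      malicious names downloaded today
    alexa_ranked: normal names ordered by traffic rank, best first
    Returns (updated_pool, samples), where samples is a shuffled list of
    (domain, label) pairs with label 1 = malicious and 0 = normal.
    """
    pool = set(dga_pool) | set(new_dga)       # merge new with old, de-duplicated
    rng = random.Random(seed)
    k = min(n_per_class, len(pool))           # the pool may still be small early on
    positives = rng.sample(sorted(pool), k)   # random malicious (positive) samples
    negatives = list(alexa_ranked[:n_per_class])  # top-ranked normal (negative) samples
    samples = [(d, 1) for d in positives] + [(d, 0) for d in negatives]
    rng.shuffle(samples)
    return pool, samples
```

With the figures from the article, `n_per_class` would be 500,000, giving 1 million labeled samples per day.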

2. Training model

The system uses a classification model based on a long short-term memory (LSTM) network. Its structure, shown in Figure 3, consists of an embedding layer, an LSTM layer, a Dropout layer, and a fully connected layer. Each domain name is encoded by converting its characters (digits 0-9, letters A-Z, hyphen, and underscore) into a sequence of integers that is fed into the model. The embedding layer maps the input to vectors of length 128; the LSTM layer contains 128 units and extracts deep features; the Dropout layer prevents overfitting; and the fully connected layer outputs the classification result. The training data is split into training, validation, and test sets. Each day the model is trained on the training set for multiple epochs. After each epoch, classification accuracy is measured on the validation set, and training stops when this accuracy no longer improves, at which point the model is generated and saved.
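The encoding step and the stopping rule can be sketched in plain Python, independent of any deep learning framework. Several details here are assumptions rather than the authors' choices: names are lower-cased (domain names are case-insensitive), index 0 is reserved for padding, characters outside the listed alphabet (including dots) are dropped, sequences are left-padded to a hypothetical `MAX_LEN` of 64, and the two callbacks stand in for the real training and validation routines.

```python
import string

# Characters named in the article: digits, letters, hyphen, underscore.
ALPHABET = string.digits + string.ascii_lowercase + "-_"
# Index 0 is reserved for padding (assumption), so real characters start at 1.
CHAR_TO_ID = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
MAX_LEN = 64  # hypothetical fixed input length


def encode_domain(domain, max_len=MAX_LEN):
    """Convert a domain name into a fixed-length list of integer character ids."""
    ids = [CHAR_TO_ID[ch] for ch in domain.lower() if ch in CHAR_TO_ID]
    ids = ids[:max_len]                       # truncate overly long names
    return [0] * (max_len - len(ids)) + ids   # left-pad with zeros


def train_with_early_stopping(train_one_epoch, validation_accuracy, max_epochs=50):
    """Train epoch by epoch and stop once validation accuracy stops improving.

    Both arguments are caller-supplied functions (hypothetical stand-ins).
    Returns (best_accuracy, epochs_run).
    """
    best = float("-inf")
    epochs_run = 0
    for _ in range(max_epochs):
        train_one_epoch()
        epochs_run += 1
        acc = validation_accuracy()
        if acc <= best:   # no improvement over the best epoch so far: stop
            break
        best = acc
    return best, epochs_run
```

Under these assumptions the embedding layer would need a vocabulary size of 39 (38 characters plus the padding index). A real training loop would usually also checkpoint the weights of the best epoch before stopping.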

Figure 3 LSTM-based classification model

To test the model's classification performance, the model generated from the training data of 2022-01-11 was used to classify 2,766,875 domain names from the malicious domain name samples that did not take part in training. Table 2 gives the sample count and classification accuracy for each of the 21 families involved. Twenty families have a classification accuracy above 80%, 16 are above 90%, and 10 exceed 99%. The families with better detection results are shown in Table 1; their character features are distinctive and easy to detect. The overall detection accuracy across the 2,766,875 malicious domain name samples was 96.57%.

Table 2 Classification results for malicious domain name samples not used in training

Data collection

1. Data collection

Each day, the system extracts from the campus network's DNS logs the list of domain names queried by all campus users on the previous day, which contains about 1.5 million to 2 million distinct domain names. Because malicious domain names are queried only rarely, and to reduce detection overhead, only domain names with fewer than 2,000 queries per day are classified.
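The query-volume filter amounts to counting queries per name and keeping the rarely queried ones. A minimal sketch, assuming the previous day's log has already been reduced to an iterable of queried names (the function name and input format are hypothetical):

```python
from collections import Counter


def low_volume_domains(queried_names, max_daily_queries=2000):
    """Return the distinct names queried fewer than max_daily_queries times.

    Names are lower-cased and a trailing root dot is stripped, so that
    "B.com." and "b.com" are counted as the same name (an assumption).
    """
    counts = Counter(name.lower().rstrip(".") for name in queried_names)
    return {name for name, n in counts.items() if n < max_daily_queries}
```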

2. Data preprocessing

User queries contain many invalid domain names, such as names prefixed with "http://" or otherwise not conforming to domain name structure. These are filtered out with regular expressions, and names that match the normal domain name samples collected from Alexa are removed to save computation time (only the second-level domain is compared, such as "baidu.com" in "www.baidu.com"; all of its subdomains are treated as normal domain names). After preprocessing, about 700,000 to 1 million domain names remain to be classified.
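The two preprocessing filters might look like the sketch below. The regular expression is a simplified stand-in (the article does not give the actual pattern), and `second_level` naively takes the last two labels, which ignores multi-part public suffixes such as "com.cn"; both are assumptions made for illustration.

```python
import re

# One label: letters, digits, hyphen or underscore, not starting or ending
# with a hyphen, at most 63 characters (simplified).
LABEL = r"[a-z0-9_](?:[a-z0-9_-]{0,61}[a-z0-9_])?"
# A name must consist of at least two dot-separated labels.
DOMAIN_RE = re.compile(rf"^(?:{LABEL}\.)+{LABEL}$")


def second_level(domain):
    """Last two labels of a name, e.g. "www.baidu.com" -> "baidu.com" (naive)."""
    return ".".join(domain.split(".")[-2:])


def preprocess(names, alexa_domains):
    """Drop malformed names and names under a whitelisted second-level domain."""
    whitelist = {second_level(d.lower()) for d in alexa_domains}
    kept = set()
    for name in names:
        name = name.lower().rstrip(".")
        if len(name) > 253 or not DOMAIN_RE.match(name):
            continue  # not a well-formed domain name
        if second_level(name) in whitelist:
            continue  # subdomain of a known normal domain
        kept.add(name)
    return kept
```

For example, with "baidu.com" in the whitelist, "www.baidu.com" is skipped, "http://evil.com" is rejected as malformed, and an unknown name such as "xkqzvwpl.net" is kept for classification.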

Domain name detection

1. Data classification

The filtered domain names are encoded and fed into the classification model, which outputs the classification results.

2. Analysis results

The classification model works on domain name character features only, so the names it classifies as "malicious domain names" may include misclassifications. To reduce errors caused by character features, the system further analyzes user access behavior.

Most genuine malicious domain names do not map to any IP address (they resolve to NXDOMAIN). The system extracts from the DNS response logs the list of domain names that returned NXDOMAIN and uses it to exclude successfully resolved domain names from the detection results.

Accessing malicious domain names is automated behavior by malware, and a victim device generally queries a batch of domain names from the same family each day. The system groups the malicious domain names in the detection results by query source IP, which makes batch access behavior directly visible. A device whose number of accessed malicious domain names exceeds a threshold is flagged as exhibiting high-risk access behavior. Based on empirical data, the threshold is set to 3.
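The two behavior checks above (keep only names that returned NXDOMAIN, then group by source IP and apply the threshold) can be sketched as follows. The function name and the input shapes are hypothetical, and "exceeds" is read as strictly greater than the threshold.

```python
from collections import defaultdict


def high_risk_devices(queries, flagged, nxdomain, threshold=3):
    """Summarize confirmed malicious-domain accesses per source IP.

    queries:  iterable of (source_ip, domain) pairs from the DNS query log
    flagged:  names the classifier labeled malicious
    nxdomain: names whose DNS responses were NXDOMAIN
    Returns a list of (ip, sorted_domains, is_high_risk) tuples, ordered by
    descending number of distinct malicious names, then by IP string.
    """
    # Exclude flagged names that actually resolved.
    confirmed = set(flagged) & set(nxdomain)
    per_ip = defaultdict(set)
    for ip, domain in queries:
        if domain in confirmed:
            per_ip[ip].add(domain)
    ranked = sorted(per_ip.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(ip, sorted(doms), len(doms) > threshold) for ip, doms in ranked]
```

The descending order matches the per-device list that the system sends to the administrator each day.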

Data display

1. Administrator notice

The system sends the analysis results and run logs to the administrator daily, including a list of devices sorted in descending order by the number of malicious domain names accessed. Each record contains the device IP, the number of malicious domain names accessed, and the list of those domain names. For devices with high-risk access behavior, a time-series chart of the device's malicious domain name accesses is also included.

2. Console display and user notification

Devices with high-risk access behavior may be infected and may spread the infection to other devices. The system exposes the high-risk devices and their access time series through an API, and a console front end was developed to display the detected malicious domain names and the IPs of the devices accessing them.

The system is connected to the campus network's unified communication platform. After manual verification by the administrator, the console can notify the relevant users by SMS or email that they need to act immediately, for example by running an antivirus scan, reinstalling the system, or shutting down the device.

Application effect and analysis

The system went live in November 2020 and is deployed on the high-performance computing public service platform of Huazhong University of Science and Technology. The environment configuration is shown in Table 3.

Table 3 Operating environment configuration

During operation, a total of 9,900 malicious domain names were found, involving 73 victim devices. For each detected victim device, the relevant user was notified the same day and given remediation measures. Victim devices confirmed by the administrator were immediately disconnected from the network, and no large-scale spread of high-risk access behavior occurred.

Analysis of the detected malicious domain names and the related victim devices reveals three problems in the security situation of the campus network:

1. Malicious domain family distribution is extremely uneven

Table 4 shows the family distribution of the 9,900 malicious domain names detected. Conficker domain names account for 9,623 of them, or 97.20%, which shows that in real network traffic the distribution of malicious domain families is extremely uneven. Conficker accounts for the vast majority and should be the focus of prevention. Conficker is a worm that targets the Windows operating system and spreads by exploiting system vulnerabilities. To avoid introducing the virus, users should not download unsafe application software; to avoid being infected over the network, user devices need to keep system updates installed.

Table 4 Family distribution of malicious domain names

2. Wireless devices are more vulnerable to network attacks

Of the 73 devices found accessing malicious domain names, 59 were wireless devices, about 80%, which is close to the share of wireless users among all campus network users. On the one hand, wireless devices are convenient and widely used, so they make up a larger share of network access. On the other hand, because of their mobility, a compromised wireless device can carry malware into new subnets as it moves and connects at new locations, which makes it more harmful than a wired device.

3. Public equipment network security lacks maintenance

The usage of the 73 devices was traced: 52 of them were public devices, about 70%. As shown in Table 5, public devices include public computers, classroom computers in teaching buildings, self-service query terminals, and servers of information systems that users built themselves in offices or laboratories. Many public devices, once set up, are left unmaintained for long periods, and viruses get in because system updates and antivirus software are neglected. This lack of effective security maintenance creates security risks for the campus network. At present, when network accounts are assigned, long-running online devices are required to use IP addresses from dedicated network segments, and IP addresses within such a segment cannot communicate with each other, which limits the automatic spread of malicious programs.

Table 5 Selected infected devices and their attributes

To address the spread of malware revealed by malicious domain name access on the campus network, this article proposes a deep-learning-based malicious domain name detection method: build a classification model on domain name character features with a long short-term memory (LSTM) network, and verify the classification results against domain name access behavior.

Based on this method, Huazhong University of Science and Technology deployed a corresponding detection system on its campus network. It checks the domain names accessed on campus every day, locates devices that access malicious domain names, and notifies the affected users. Over its long-term operation, the system has found 9,900 malicious domain names involving 73 victim devices. The relevant users were notified promptly and assisted with remediation, which prevented the malware from spreading further.

Analysis of the detected malicious domain names and the related victim devices shows that the distribution of malicious domain families on the campus network is extremely uneven, and that wireless devices and public devices are more vulnerable to network attacks. Therefore, while continuing to build users' security awareness and good network habits, targeted protection should continue to be implemented through security equipment and detection methods at every level, so that monitoring, locating, notification, and blocking form a closed loop.

References

[1] Yu B, Pan J, Hu J, et al. Character Level Based Detection of DGA Domain Names [C] // International Joint Conference on Neural Networks (IJCNN). IEEE, 2018: 1-8.

[2] Zhauniarovich Y, Khalil I, Yu T, et al. A Survey on Malicious Domains Detection Through Data Analysis [J]. ACM Computing Surveys, 2018, 51 (4): 67.

[3] Davuth N, Kim S R. Classification of malicious domain names using support vector machine and bi-gram method [J]. International Journal of Security and Its Applications, 2013, 7.

[4] Yadav E, Reddy A K K, Reddy A L N, et al. Detecting algorithmically generated malicious domain names [C] // ACM SIGCOMM Conference on Internet Measurement. DBLP, 2010.

[5] Mowbray M, Hagen J. Finding domain-generation algorithms by looking at length distribution [C] // IEEE International Symposium on Software Reliability Engineering Workshops. USA: IEEE, 2014: 395-400.

[6] Agyepong E, Buchanan W J, Jones K. Detection of Algorithmically Generated Malicious Domain [C] // International Conference of Advanced Computer Science & Information Technology.

[7] Bilge L, Kirda E, Kruegel C, et al. EXPOSURE: Finding Malicious Domains Using Passive DNS Analysis [C] // Proceedings of the Network and Distributed System Security Symposium, NDSS 2011, 2011.

[8] Antonakakis M, Perdisci R. From Throw-Away Traffic to Bots: Detecting the Rise of DGA-Based Malware [C] // USENIX Conference on Security Symposium. 2012.

[9] Wu police, Lu Tianliang, Du Yanhui. Malicious domain name training data generation technology based on Char-RNN improved model [J]. Information network security, 2020, 20 (9): 6-11.

[10] Woodbridge J, Anderson H S, Ahuja A, et al. Predicting domain generation algorithms with long short-term memory networks. arXiv: 1611.00791, 2016.

[11] Yang Luhui, Liu Guangjie, Zhai Jiangtao. An improved convolutional neural network malicious domain name detection algorithm [J]. Journal of Xi'an University of Electronic Science and Technology, 2020, 47 (1): 37-43.

[12] Yu B, Gray D L, Pan J, et al. Inline DGA detection with deep networks [C] // IEEE International Conference on Data Mining Workshops. IEEE, 2017.

[13] Song Jinwei, Yang Jin, Li Tao. Research on a Domain Flux botnet domain name detection method based on weighted support vector machines [J]. Information network security, 2018, 12: 66-71.

[14] Netlab DGA project [il]. http://data.netlab.360.com/dga/.

[15] Alexa's top-ranked websites [il]. http://s3.amazonaws.com/alexa-static/top-1m.csv.zip

*Fund Project: Experimental Technology Project of Huazhong University of Science and Technology "Research and Realization of DNS-based Data-based Domestic Name Excavation System (2021)"

Author: strict knowledge, Zhou Lijuan, Hong Jianke, Liu Lian (Network and Computing Center of Huazhong University of Science and Technology)

Editor-in-chief: Gaoming
