CN102411687A

CN102411687A - Deep learning detection method of unknown malicious codes

Info

Publication number: CN102411687A
Application number: CN2011103735580A
Authority: CN
Inventors: 李元诚; 樊庆君
Original assignee: North China Electric Power University
Current assignee: North China Electric Power University
Priority date: 2011-11-22
Filing date: 2011-11-22
Publication date: 2012-04-11
Anticipated expiration: 2031-11-22
Also published as: CN102411687B

Abstract

The invention discloses a deep learning detection method for unknown malicious code, belonging to the technical field of information security. The method comprises the following steps: first, extracting feature vectors of the files in a training set using byte-level n-grams; second, constructing an HTM (Hierarchical Temporal Memory) network structure and determining the input data length of each bottom-layer node of the HTM structure; third, using the feature vectors as input, carrying out sequence pattern learning and training and classification inference with the HTM algorithm; fourth, extracting feature vectors of the files in a test set using byte-level n-grams; fifth, inputting these feature vectors into the trained HTM network for sequence recognition, so as to determine whether the files in the test set contain malicious code. The invention has relatively high noise resistance and fault tolerance and strong adaptability. It also improves the recognition ability and recognition rate of malicious code detection and enables accurate detection of newly emerging malicious code.

Description

Deep learning detection method for unknown malicious code

Technical field

The invention belongs to the field of information security technology and relates in particular to a deep learning detection method for unknown malicious code.

Background technology

With the continuous development of computer and network technology, the computer has become an indispensable tool in daily life. To obtain economic or political gain or to carry out personal revenge, organizations and individuals use malicious code of many kinds to conduct illegal activities on a large scale. As a result, new malicious code emerges in an endless stream, the techniques it employs grow more advanced, and its ability to propagate, cause harm, and hide keeps increasing. Although detection techniques for malicious code are also developing continuously, detection technology and detection capability still lag behind the development of malicious code; the detection of unknown malicious code in particular poses a great challenge to malicious code detection technology.

At present there are two main techniques for detecting malicious code on computers: pattern matching based on signatures, and detection based on malicious code behavior rules.

Signature-based pattern matching compares the feature code of the file under inspection with the malicious code signature strings in a signature database. A successful match indicates that the file contains malicious code; otherwise the file is considered free of malicious code. This technique requires technicians to discover and obtain a malicious code sample at the earliest opportunity and to extract a signature that uniquely identifies it. The signature must also be added to the malicious code signature database promptly, so that the malicious code can be detected before it spreads widely and breaks out. This technique is not suitable for detecting malicious code that uses polymorphic or metamorphic techniques, nor malicious code that spreads rapidly, is highly destructive, and is highly targeted.

Detection based on malicious code behavior rules detects malicious code according to common behavior rules predefined by experts. Its main principle is that the execution of malicious code is usually accompanied by behaviors such as changes to user privileges, registry modifications, the opening of network ports, and abnormal network communication, or by a particular sequence of system operations. This technique suffers from serious lag: especially as computers run ever faster, by the time the malicious behavior is detected, irreparable damage has often already been done to the system.

Both techniques detect after the fact: they can detect only known malicious code, or can detect malicious code only after it has been executed, by which time the damage has already been caused.

Summary of the invention

To address the above defects, the present invention discloses a deep learning detection method for unknown malicious code, which comprises the following steps:

1) extract the feature vectors of the training set files using byte-level n-grams;

2) construct the HTM network structure, and determine the input data length of each bottom-layer node of the HTM structure;

3) using the feature vectors as input, carry out sequence pattern learning and training and classification inference with the HTM algorithm;

4) extract the feature vectors of the files in the test set using byte-level n-grams;

5) input the feature vectors into the trained HTM network for sequence recognition, to determine whether the files in the test set contain malicious code.

Step 2) specifically comprises the following steps:

21) select an HTM network model with F layers, and determine that every node except those in the bottom layer has M child nodes;

22) truncate the file feature vector into segments of length l = L/M^(F-1), where L is the length of the file feature vector, and use the segments in turn as the input samples of the bottom-layer nodes of the HTM network.

Step 3) specifically comprises the following steps:

31) with the truncated file feature vectors as the input of the HTM network, the bottom-layer nodes enter the learning phase, which lasts until the spatial pools of all bottom-layer nodes have finished learning spatial patterns and the temporal pools have all finished learning temporal groups;

32) after the learning phase of step 31) ends, the bottom-layer nodes enter the inference phase; a new sample input to a bottom-layer node is, after inference, output to that node's parent node in the next layer of the HTM network; the outputs of the lower-layer nodes that share the same parent are concatenated and become the input of the learning phase of the next layer's nodes, which enter the learning phase and repeat the node learning process of step 31);

33) repeat the process of step 32) until the nodes of all layers of the HTM network have completed sequence pattern learning and training.

Step 31) specifically comprises the following steps:

311) the binary sequences input to a node are fed into its spatial pool, which learns clusters of these sequences using a maximum distance parameter D; using the maximum distance D, the spatial pool stores a subset of the input patterns, called cluster centres; over time, the number of new sequence patterns the spatial pool produces per unit time decreases, and when the number of new cluster centres in a time period falls below a configured threshold, the clustering process stops;

312) the spatial pool outputs the sequence patterns it has learned to the temporal pool, which groups the sequence patterns according to their temporal adjacency; once all sequence patterns have been grouped, the grouping computation ends.

Step 32) specifically comprises the following steps:

321) using the formula

P (e^{-} | c_{i}) = γ Π_{k = 1}^{M} input (m_{k}^{i})

compute the probability distribution of the input sequence e^{-} over the spatial patterns c_{i} of the node's spatial pool, and use the result, after normalization, as the output of the spatial pool, where c_{i} denotes a non-zero spatial pattern, M is the number of child nodes of the node, and e^{-} is the input sequence to be recognized, coming from the bottom layer;

322) based on the output y of the spatial pool, using the formula

λ (i) = Σ_{j = 1}^{N_{c}} P (c_{j} | g_{i}) y (j)

compute the output λ of the temporal pool, where N_{c} is both the length of the vector y and the number of spatial patterns in the spatial pool, and the length of λ is N_{g}.

The beneficial effects of the invention are as follows. It introduces the HTM algorithm, a new artificial intelligence deep learning algorithm that imitates the structure and working principle of the human neocortex. The algorithm adopts a hierarchical tree network structure and applies the Bayesian network principles of continuous information sharing between nodes and belief propagation, converting complex problems into pattern matching and prediction. No complex preprocessing of the input data is needed, and the method has strong noise resistance, fault tolerance, and adaptability. In addition, new input patterns can be learned while the existing model performs inference, which improves the recognition ability and recognition rate of malicious code detection and achieves the goal of accurately detecting newly emerging malicious code.

Description of drawings

Fig. 1 is a schematic diagram of the process of the detection method for unknown malicious code;

Fig. 2a is a schematic diagram of the HTM hierarchical tree structure model;

Fig. 2b is a schematic diagram of the spatial pool and temporal pool of node k;

Fig. 3 is a schematic diagram of the learning process of one node during HTM training;

Fig. 4 is a schematic diagram of the belief propagation computation at node k during HTM inference.

Embodiment

The preferred embodiment is described in detail below with reference to the accompanying drawings. It should be emphasized that the following description is merely exemplary and is not intended to limit the scope of the invention or its application.

The idea by which the invention solves the problem is as follows. A set of files containing malicious code is taken as the training samples; byte-level n-grams are used to perform feature selection on the training set files, so that each file corresponds to one feature vector; and the feature vectors are used as the input of the HTM algorithm to train the HTM network. Finally, feature selection is performed on an unknown file to produce the corresponding feature vector, which is input to the trained HTM network for pattern recognition, thereby determining whether the file contains malicious code.

As shown in Fig. 1, the intelligent detection method for unknown malicious code comprises the following steps:

1) Extract the feature vectors of the training set files using byte-level n-grams.

A standard data set intended specifically for malicious code detection can be downloaded from the Internet, and files are selected from it according to specific rules to construct the training set, for example by selecting files according to malicious code type.

A byte-level n-gram takes words from a binary byte stream or text using a sliding window n bytes in size, so each word is n bytes long. For example, if the content of a text is "abcdef", its 2-gram sequence is ab, bc, cd, de, ef, and its 3-gram sequence is abc, bcd, cde, def.

Taking a file whose content is "abcd" as an example, the 2-gram sequence extracted from it is ab, bc, cd. The file is then said to have three attributes, and it can be represented by the vector formed from them: {ab, bc, cd}.

Quantizing each attribute yields the feature vector of the file. Taking the vector {ab, bc, cd} above as an example, let a have position 1 in the alphabet, b position 2, c position 3, and d position 4, and quantize each attribute as the sum of its letters' positions. The quantized value of ab is then 3, that of bc is 5, and that of cd is 7, so the vector {3, 5, 7} is the feature vector of the file.
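The extraction and quantization just described can be sketched in Python as follows. This is only an illustration of the "abcd" example: the function names are hypothetical, and the position-sum rule is defined here only for lowercase ASCII letters, since the text does not say how arbitrary byte values would be quantized.

```python
def byte_ngrams(data: bytes, n: int) -> list[bytes]:
    """Slide a window of n bytes over the data, one byte at a time."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def quantize_by_letter_position(gram: bytes) -> int:
    """Sum of alphabet positions (a=1, b=2, ...); assumes lowercase ASCII letters."""
    return sum(byte - ord("a") + 1 for byte in gram)


def feature_vector(data: bytes, n: int) -> list[int]:
    """Quantize every n-gram of the data into one integer attribute."""
    return [quantize_by_letter_position(gram) for gram in byte_ngrams(data, n)]


print(byte_ngrams(b"abcdef", 3))   # the 3-gram sequence of "abcdef"
print(feature_vector(b"abcd", 2))  # the feature vector of the "abcd" file
```

For real executables a different quantization (for example, the raw integer value of each n-gram) would be needed; that choice is outside what the description specifies.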

2) Construct the HTM network structure and determine the input data length of each bottom-layer node. Fig. 2a shows a three-layer tree structure model in which every node except the bottom-layer nodes has two child nodes. Fig. 2b shows the internal structure of a single node k in the HTM structure, which consists of a spatial pool and a temporal pool. Step 2) specifically comprises the following steps:

21) Select an HTM network structure with F layers, and determine that every node except those in the bottom layer has M child nodes.

As shown in Fig. 2a, with F = 3 and M = 2, layers L3, L2, and L1 of the HTM network have 1, 2, and 4 nodes respectively.

22) Truncate the file feature vector into segments of length l = L/M^(F-1), and use the segments in turn as the input samples of the bottom-layer nodes of the HTM network, where L is the length of the file feature vector extracted with the byte-level n-gram method.

Suppose the file feature vector is {1, 2, 3, 4, 5, 6, 7, 8}, so that its length L is 8; then l = 8/2^(3-1) = 2, and the input samples of the bottom-layer nodes of the HTM network are {1, 2}, {3, 4}, {5, 6}, and {7, 8} respectively.
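A minimal Python sketch of step 22) follows. The function name is hypothetical, and the use of integer division (dropping any elements left over when L is not a multiple of M^(F-1)) is an assumption, since the description only covers the evenly divisible case.

```python
def bottom_node_inputs(features, num_layers, children_per_node):
    """Split a feature vector into one segment per bottom-layer node.

    The bottom layer of an F-layer tree with M children per node has
    M**(F-1) nodes, so each segment has length l = L // M**(F-1).
    Assumption: leftover elements are dropped when L is not divisible.
    """
    n_bottom = children_per_node ** (num_layers - 1)
    seg_len = len(features) // n_bottom
    return [features[i * seg_len:(i + 1) * seg_len] for i in range(n_bottom)]


segments = bottom_node_inputs([1, 2, 3, 4, 5, 6, 7, 8], num_layers=3, children_per_node=2)
print(segments)  # one segment of length 2 for each of the 4 bottom-layer nodes
```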

3) Using the feature vectors as input, carry out binary sequence pattern learning and training and classification inference with the HTM algorithm.

Learning and training in this phase proceed layer by layer: after the bottom layer has finished learning, its nodes enter the inference phase when new input arrives, and the inference output serves as the input of the learning phase of the next layer's nodes. Within a single node, likewise, the temporal pool begins temporal grouping only after the node's spatial pool has finished sequence pattern training.

Step 3) specifically comprises the following steps:

31) As shown in Fig. 3, with the truncated file feature vectors as the input of the HTM network, the bottom-layer nodes enter the learning phase, which lasts until the spatial pools of all bottom-layer nodes have finished learning spatial patterns and the temporal pools have all finished learning temporal groups.

Specifically, step 31) in turn comprises the following steps:

311) The binary sequence patterns of the truncated file feature vectors are input to the spatial pools of the bottom-layer nodes, and each spatial pool learns clusters of these sequences using the maximum distance D. Using the maximum distance D, the spatial pool stores a subset of the input binary sequence patterns, called cluster centres. Over time, the number of new sequence patterns the spatial pool produces per unit time decreases, and when the number of new cluster centres within one time period T falls below a configured threshold, the clustering process stops. The time period T may be any non-zero value, and the threshold is a non-zero integer. To improve learning efficiency, T and the threshold are generally given small values (for example, T = 5 s and a threshold of 1).

D is the minimum Euclidean distance at which a binary sequence pattern is considered different from the existing cluster centres. For each input binary sequence pattern, the spatial pool checks whether a cluster centre already exists within Euclidean distance D. If one exists, nothing changes; if none exists, the new binary sequence pattern is added to the list of cluster centres. The Euclidean distance is computed as follows: for x, y ∈ R^{N}, the Euclidean distance between x and y is:

d (x, y) = (Σ_{i = 1}^{N} (x_{i} - y_{i})^{2})^{\frac{1}{2}}
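The clustering rule of step 311) might be sketched in Python as below. The names are hypothetical, and one simplification is assumed: the time period T is measured as a number of input patterns (`period`) rather than in seconds, so that the stopping rule can be shown without a clock.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def learn_cluster_centres(patterns, max_distance, period=5, threshold=1):
    """Store a pattern as a new cluster centre unless one lies within max_distance.

    Assumption: the period is counted in input patterns, not seconds.
    Learning stops once fewer than `threshold` new centres appear in a period.
    """
    centres = []
    new_in_period = 0
    for count, pattern in enumerate(patterns, start=1):
        if all(euclidean(pattern, c) > max_distance for c in centres):
            centres.append(tuple(pattern))
            new_in_period += 1
        if count % period == 0:
            if new_in_period < threshold:
                break  # too few new centres in this period: stop clustering
            new_in_period = 0
    return centres


stream = [(0, 0), (0, 1), (5, 5), (5, 6), (0, 0),
          (5, 5), (0, 1), (5, 6), (0, 0), (5, 5), (9, 9)]
print(learn_cluster_centres(stream, max_distance=1.5))
```

In this stream the second period of five patterns produces no new centre, so learning stops before the final pattern (9, 9) is examined.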

312) The spatial pool outputs the binary sequence patterns it has learned to the temporal pool, which groups them according to their temporal adjacency; once all sequence patterns have been grouped, the grouping computation ends.

Step 312) specifically comprises the following steps:

3121) When the temporal pool receives input from the spatial pool, it builds a time adjacency matrix from the temporal association of the binary sequence patterns; once the time adjacency matrix has been formed, the patterns must be partitioned into groups. HTM uses a greedy algorithm to perform this temporal grouping.

3122) Find the most strongly connected cluster point that is not yet in any group. The most strongly connected cluster point is simply the one whose row in the time adjacency matrix has the largest sum.

3123) Select the N_{top} cluster points (N_{top} is a specified parameter) most strongly connected to the cluster point found in step 3122); the temporal pool adds these selected cluster points to the current group.

3124) For each cluster point X newly added to the group, repeat step 3123). When the N_{top} nearest neighbours of every X already belong to the group, the grouping process stops automatically. If the group size approaches a set value (the maximum group size) and the grouping process has still not stopped automatically, it is terminated.

3125) The resulting set of cluster points is added to the temporal pool as a new group. Then return to step 3122) until all cluster points have been grouped.

32) After the learning phase of step 31) ends, the bottom-layer nodes enter the inference phase. A new sample (binary sequence pattern) input to a bottom-layer node (the child node) is, after inference, output to that node's parent node in the next layer of the HTM network. The outputs of the child nodes that share the same parent are concatenated and become the input of the parent node's learning phase; the parent node then enters the learning phase and repeats the node learning process of step 31).

Fig. 4 is a schematic diagram of the inference phase that a node carries out on an input binary sequence pattern. Step 32) specifically comprises the following steps:

321) Compute the probability distribution P (e^{-} | c_{i}) of the input binary sequence pattern over the spatial patterns of the spatial pool, and use the result, after normalization, as the output vector y of the spatial pool.

The spatial pattern generated in the spatial pool from the input binary sequence patterns during the learning phase is the i-th cluster centre c_{i}. In the inference phase, the probability distribution P (e^{-} | c_{i}) of the bottom-layer input sequence e^{-} given the i-th cluster centre is a variable and can be computed by the following formula:

P (e^{-} | c_{i}) = γ Π_{k = 1}^{M} input (m_{k}^{i})    (1)

In formula (1), γ is a proportionality constant; the i-th cluster centre c_{i} represents a non-zero spatial pattern with components m_{k}^{i} (k = 1, ..., M); M is the number of child nodes of the node; and e^{-} is the input sequence to be recognized, coming from the bottom layer. input (m_{k}^{i}) denotes the input: if the node is a bottom-layer node, it is the node's input binary sequence pattern; if the node is not a bottom-layer node, it is the temporal pool output probability distribution from the node's child nodes, that is, P (e^{-} | g_{i}) (for its computation see formula (4)).

The probability distribution for every cluster centre c_{i} can be computed as P (e^{-} | c_{i}), and P (e^{-} | c_{i}) is then normalized into y (i), so that y (i) is proportional to P (e^{-} | c_{i}), written y (i) ∝ P (e^{-} | c_{i}). All the y (i) together form the output vector y of the node's spatial pool, written y = [y (1), y (2), ..., y (N_{c})] (N_{c} is the number of spatial patterns in the spatial pool), and all the P (e^{-} | c_{i}) together form P (e^{-} | C), written

P (e^{-} | C) = [P (e^{-} | c_{1}), P (e^{-} | c_{2}), ..., P (e^{-} | c_{N_{c}})],

Therefore y is proportional to P (e^{-} | C), written y ∝ P (e^{-} | C).

322) Compute the output of the temporal pool from the output vector y of the spatial pool.

The temporal pool performs inference by applying the belief propagation principle. As shown in Fig. 4, the output of the spatial pool is the vector y, whose length is N_{c} (the number of spatial patterns in the spatial pool); the i-th element of the vector corresponds to the i-th cluster centre c_{i}.

Each of these cluster centres is a vector of length M, in which r denotes the group index of the cluster centre in a child node. The i-th element of y is computed as:

y (i) = α_{1} Π_{j = 1}^{M} λ^{m_{j}} (r_{m_{j}})    (2)

In formula (2), α_{1} is an arbitrary scaling constant that is usually set to a fixed value to avoid numerical underflow; M is the number of child nodes; λ^{m_{j}} denotes the binary sequence pattern output from child node m_{j}; and r_{m_{j}} denotes the group index that the i-th cluster centre takes from child node m_{j}.

According to formula (1) and the description of step 321), at node k, y^{k} is proportional to P (e^{-} | C^{k}), that is, y^{k} ∝ P (e^{-} | C^{k}), where y^{k} and P (e^{-} | C^{k}) are the instances of y and P (e^{-} | C) at node k.

The temporal pool computes its output from the output of the spatial pool. Its output is λ, of length N_{g} (the number of temporal groups in the temporal pool), λ = [λ (1), λ (2), ..., λ (N_{g})], whose i-th element is computed as follows:

λ (i) = Σ_{j = 1}^{N_{c}} P (c_{j} | g_{i}) y (j)    (3)

In formula (3), P (c_{j} | g_{i}) is the conditional probability of spatial pattern c_{j} given group g_{i} of the temporal pool, y (j) is the j-th element of the spatial pool output y, and j ranges from 1 to N_{c}. Because y (j) ∝ P (e^{-} | c_{j}), and

P (e^{-} | g_{i}) = Σ_{j = 1}^{N_{c}} P (c_{j} | g_{i}) P (e^{-} | c_{j})    (4)

where P (e^{-} | g_{i}) is the probability distribution of the bottom-layer input sequence e^{-} given temporal group g_{i}, and P (e^{-} | G^{k}) is the vector formed by all the P (e^{-} | g_{i}) at node k,

it follows that λ (i) ∝ P (e^{-} | g_{i}) holds for all i; at node k, λ^{k} is proportional to P (e^{-} | G^{k}), that is, λ^{k} ∝ P (e^{-} | G^{k}). The output of the temporal pool is the output of the node.
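Formulas (2) and (3) can be traced with a small numeric sketch in Python. The function names and all numbers are hypothetical, and each cluster centre is represented, by assumption, as a tuple holding one temporal group index per child node.

```python
def spatial_pool_output(child_outputs, centres, alpha=1.0):
    """Formula (2): y(i) = alpha * product over children j of lambda^{m_j}(r_{m_j}).

    child_outputs[j] is the output vector lambda of child j;
    centres[i][j] is the group index that centre i takes in child j.
    """
    y = []
    for centre in centres:
        value = alpha
        for child_output, group_index in zip(child_outputs, centre):
            value *= child_output[group_index]
        y.append(value)
    return y


def temporal_pool_output(p_c_given_g, y):
    """Formula (3): lambda(i) = sum over j of P(c_j | g_i) * y(j)."""
    return [sum(p * y_j for p, y_j in zip(row, y)) for row in p_c_given_g]


# Two child nodes, each reporting a distribution over its two temporal groups.
child_outputs = [[0.9, 0.1], [0.2, 0.8]]
# Two learned cluster centres of the parent, as (group in child 0, group in child 1).
centres = [(0, 1), (1, 0)]
# Row i holds P(c_j | g_i) for temporal group g_i of the parent.
p_c_given_g = [[0.5, 0.5], [0.0, 1.0]]

y = spatial_pool_output(child_outputs, centres)
print(y)
print(temporal_pool_output(p_c_given_g, y))
```

The first centre combines the likely group of each child, so it dominates y; the first temporal group, which gives that centre a weight of 0.5, therefore receives most of the output.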

33) Repeat the process of step 32) until the nodes of all layers of the HTM network have completed binary sequence pattern learning and training.

As in step 32), after a layer finishes training, its nodes switch to the inference phase, and the nodes of the next layer (the parent nodes) use the output of the previous layer's nodes (the child nodes) as input for sequence pattern learning.

4) Extract the feature vectors of the files in the test set using byte-level n-grams.

As in step 1), byte-level n-grams are used to extract the feature vectors of the test set files. The test set can be chosen from malicious code test data sets available on the Internet.

As in the inference of step 322), all nodes in every layer of the HTM structure are now in the inference phase. The feature vectors extracted in step 4) undergo pattern inference through spatial pool sequence pattern inference and temporal pool group inference. The output vector λ of the top node is the output pattern vector of the whole HTM network, and the output probability P (e^{-} | G^{k}) of the top node's temporal pool is the malicious code matching rate. When this output probability is large enough, for example greater than a set value of 85%, the input file is considered to contain malicious code; otherwise it is considered to contain none.
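The final decision rule can be written as a one-line check. The sketch below assumes the matching rate has already been reduced to a single number between 0 and 1; how the top node's output vector is turned into that number is not spelled out in the description, so the function takes it as given. The function name is hypothetical.

```python
def contains_malicious_code(match_rate, threshold=0.85):
    """Flag a file when its matching rate exceeds the threshold (85% in the example).

    Assumption: match_rate is a single probability in [0, 1].
    """
    if not 0.0 <= match_rate <= 1.0:
        raise ValueError("match_rate must lie in [0, 1]")
    return match_rate > threshold


for name, rate in [("sample_a.bin", 0.91), ("sample_b.bin", 0.40)]:
    verdict = "malicious" if contains_malicious_code(rate) else "clean"
    print(name, verdict)
```

The file names and rates are made-up inputs; the threshold is a tunable parameter, and 85% is only the example value given in the description.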

The invention takes a sample set of malicious files as the training set, trains the HTM network with the HTM pattern recognition algorithm, and then uses the HTM network to perform pattern recognition and classification inference on unknown files to determine whether they are malicious. For feature extraction, the byte-level n-gram algorithm is used to extract a large number of file feature attributes. For pattern recognition and classification learning, the HTM algorithm is introduced. This algorithm is a new artificial intelligence algorithm that imitates the structure and working principle of the human neocortex; it applies the Bayesian network principles of continuous information sharing between nodes and belief propagation to convert complex problems into pattern matching and prediction. Training extracts the spatial sequence patterns and temporal pattern groups of the samples, and the belief propagation method aggregates and classifies the local pattern groups of each layer to obtain the overall pattern groups. In the recognition phase, malicious code samples are identified by matching against the sequence patterns learned at each layer. Owing to its good noise resistance, fault tolerance, adaptability, and self-learning ability, the HTM algorithm can effectively improve the recognition rate.

The above is merely a preferred embodiment of the invention, but the scope of protection of the invention is not limited to it. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. The scope of protection of the invention shall therefore be determined by the scope of protection of the claims.

Claims

1. A deep learning detection method for unknown malicious code, characterized in that it comprises the following steps:

2. The deep learning detection method for unknown malicious code according to claim 1, characterized in that step 2) specifically comprises the following steps:

3. The deep learning detection method for unknown malicious code according to claim 1, characterized in that step 3) specifically comprises the following steps:

4. The deep learning detection method for unknown malicious code according to claim 3, characterized in that step 31) specifically comprises the following steps:

311) the binary sequences input to a node are fed into its spatial pool, which learns clusters of these sequences using the maximum distance D; using the maximum distance D, the spatial pool stores a subset of the input patterns, called cluster centres; over time, the number of new sequence patterns the spatial pool produces per unit time decreases, and when the number of new cluster centres in a time period falls below a configured threshold, the clustering process stops;

5. The deep learning detection method for unknown malicious code according to claim 3, characterized in that step 32) specifically comprises the following steps:

321) using the formula