CN103001848B - Rubbish mail filtering method and device - Google Patents

Publication number
CN103001848B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201110264365.1A
Other languages
Chinese (zh)
Other versions
CN103001848A (en)
Inventor
郭涛
于洪涌
薛立宏
丘凌
张国威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Application filed by China Telecom Corp Ltd
Priority to CN201110264365.1A
Publication of CN103001848A
Application granted
Publication of CN103001848B

Abstract

The present invention relates to a spam filtering method, comprising: when an email is received, scanning its content for any fuzzy word that hits an entry in a preset fuzzy-word and context recognition library; if there is a hit entry, performing scenario analysis on the email and adjusting according to the situation corresponding to the email to obtain the mail value vector of the email; calculating a spam probability from the mail value vector, comparing the spam probability with a preset threshold to judge whether the email is suspicious spam, and intercepting any email determined to be suspicious spam. The invention also relates to a spam filtering device. Based on fuzzy-word recognition and scenario analysis, the invention intercepts spam containing fuzzy words, greatly widening the range of spam intercepted while preserving filtering accuracy, and further supplements and optimizes existing keyword-based spam interception.

Description

Spam filtering method and device
Technical field
The present invention relates to anti-spam technology, and in particular to a spam filtering method and device.
Background art
Spam refers to any email forced into a user's mailbox without the user's permission. Email is one of the basic applications of today's Internet users, and spam is sent mainly through email mailboxes. Monitoring data from December 2010 show that about 50 billion spam emails were sent worldwide each day. Spam content includes promotional advertising, adult advertising, and money-making schemes, as well as destructive email carrying computer viruses. It causes email users considerable trouble, so every major mail provider treats improving the effectiveness of its anti-spam system as a key part of improving the mailbox user experience.
Conventional anti-spam systems filter with predefined keywords: a keyword list is defined in advance, the content of passing mail is captured and compared against the list, and a hit triggers the corresponding spam interception action. Although plain keyword-list matching is simple to implement, spammers can easily evade it by inserting interference characters, using homophones, or using visually similar characters, which defeats the spam filtering system.
In addition, plain keyword filtering is weak at recognizing normal email and may intercept some normal email by mistake, interfering with users' normal use of email.
Summary of the invention
The object of the present invention is to provide a spam filtering method and device that widen the range of spam intercepted while preserving the accuracy of spam filtering.
To achieve the above object, the present invention provides a spam filtering method, comprising:
when an email is received, scanning the content of the email for any fuzzy word that hits an entry in a preset fuzzy-word and context recognition library;
if there is a hit entry, performing scenario analysis on the email, and adjusting according to the situation corresponding to the email to obtain the mail value vector of the email;
calculating a spam probability from the adjusted mail value vector of the email, comparing the spam probability with a preset threshold to judge whether the email is suspicious spam, and intercepting any email determined to be suspicious spam.
To achieve the above object, the present invention also provides a spam filtering device, comprising:
an email receiving unit, for receiving email;
a fuzzy-word scanning unit, for scanning the content of the email for any fuzzy word that hits an entry in a preset fuzzy-word and context recognition library;
a scenario analysis unit, for performing scenario analysis on the email when there is a hit entry;
a vector adjustment unit, for adjusting according to the situation corresponding to the email to obtain the mail value vector of the email;
a spam probability calculation unit, for calculating a spam probability from the adjusted mail value vector of the email;
a threshold comparison unit, for comparing the spam probability with a preset threshold to judge whether the email is suspicious spam;
an email processing unit, for intercepting any email determined to be suspicious spam.
Based on the above technical solution, the present invention uses fuzzy-word recognition and scenario analysis to intercept spam containing fuzzy words, greatly widening the range of spam intercepted while preserving filtering accuracy, and further supplements and optimizes existing keyword-based spam interception.
Brief description of the drawings
The drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the invention and their descriptions serve to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flow diagram of one embodiment of the spam filtering method of the present invention.
Fig. 2 is a flow diagram of building the fuzzy-word and context recognition library in another embodiment of the spam filtering method of the present invention.
Fig. 3 is a flow diagram of a further embodiment of the spam filtering method of the present invention.
Fig. 4 is a structural diagram of one embodiment of the spam filtering device of the present invention.
Fig. 5 is a structural diagram of the units that implement library building in another embodiment of the spam filtering device of the present invention.
Fig. 6 is a structural diagram of a further embodiment of the spam filtering device of the present invention.
Embodiments
The technical solution of the present invention is described in further detail below with reference to the drawings and embodiments.
On top of the keyword interception already present in an anti-spam system, the present invention adds a spam sorting method that recognizes fuzzy words (including homophones, visually similar words, split words, and the like), in order to handle spam that has been disguised with fuzzy words. During recognition, the invention applies fuzzy-word and situation processing to the mail, taking into account the interference symbols present, the fuzzy-word hits, and supporting analysis of the situation corresponding to the mail. Mail is sorted on the basis of vector operations and probability, and the system is optimized according to the results.
Fig. 1 is a flow diagram of one embodiment of the spam filtering method of the present invention. In this embodiment, the spam filtering method comprises:
Step 101, when an email is received, scanning the content of the email for any fuzzy word that hits an entry in a preset fuzzy-word and context recognition library;
Step 102, if there is a hit entry, performing scenario analysis on the email, and adjusting according to the situation corresponding to the email to obtain the mail value vector of the email;
Step 103, calculating a spam probability from the adjusted mail value vector of the email;
Step 104, comparing the spam probability with a preset threshold to judge whether the email is suspicious spam;
Step 105, intercepting any email determined to be suspicious spam.
In this embodiment, entries in the library are queried first when an email is received. Each entry in the fuzzy-word and context recognition library contains the correspondence between a fuzzy word and an existing spam keyword together with the corresponding reference mail value vector, and also contains the influence probabilities that multiple situations have on that correspondence. Querying the library mainly means checking whether the content of the email contains a fuzzy word that also appears in the library, which locates the hit entries.
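The entry structure and lookup described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `LibraryEntry`, the function `find_hits`, the sample fuzzy words, and all numeric values are assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class LibraryEntry:
    """One entry of a hypothetical fuzzy-word and context recognition library."""
    fuzzy_word: str          # disguised form that may appear in a mail
    keyword: str             # existing spam keyword it stands for
    reference_vector: tuple  # (Key0, eXchange0, Content0, DisturbMark0)
    influence: dict = field(default_factory=dict)  # situation name -> influence probability


def find_hits(content, library):
    """Return every entry whose fuzzy word occurs in the mail content."""
    return [entry for entry in library if entry.fuzzy_word in content]


# Illustrative data only; the words and scores are made up.
library = [
    LibraryEntry("v1agra", "viagra", (0.8, 0.9, 0.0, 0.05), {"night_send": 0.1}),
    LibraryEntry("fr33", "free", (0.4, 0.7, 0.0, 0.05)),
]
hits = find_hits("buy v1agra now", library)
```

A plain substring test is used here for brevity; a production system would likely use a multi-pattern matcher, which the source does not specify.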
After an entry has been hit, scenario analysis is performed on the email. This analysis may specifically comprise: analyzing the email to obtain its situation elements, matching those situation elements against the situation elements of each situation in the hit entry, and determining the situation corresponding to the email.
Situation elements may include the sending time of the mail, certain words contained in the mail content, the sender's mailbox domain name, and so on, but are not limited to these examples. Combinations of situation elements express different situations, and for each situation the probability that an email containing a given fuzzy word is spam rises or falls accordingly. For example, suppose analysis shows that an email was sent around the Mid-Autumn Festival, the word "recycling" appears in the mail, and the fuzzy word of the hit entry is "moon also". These situation elements essentially identify a scene of moon cake recycling around the Mid-Autumn Festival, which is regarded as a category of abnormal mail, so the probability that the mail is spam increases.
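One way to picture the matching of situation elements is as set containment: a situation applies when all of its elements are found in the mail. The sketch below assumes this rule, and the element labels (`time:mid_autumn`, `word:recycling`), the date window, and both function names are hypothetical; the source does not prescribe a matching rule.

```python
from datetime import date


def extract_elements(sent_on, body, sender):
    """Collect situation elements from a mail (labels are illustrative)."""
    elements = set()
    # Hypothetical seasonal window standing in for "around the Mid-Autumn Festival".
    if date(sent_on.year, 9, 1) <= sent_on <= date(sent_on.year, 10, 10):
        elements.add("time:mid_autumn")
    for word in ("recycling",):
        if word in body.lower():
            elements.add("word:" + word)
    elements.add("domain:" + sender.rsplit("@", 1)[-1].lower())
    return elements


def match_situation(mail_elements, situations):
    """Return the situation whose elements are all present, preferring the most specific."""
    best = None
    for name, required in situations.items():
        if required <= mail_elements:
            if best is None or len(required) > len(situations[best]):
                best = name
    return best


situations = {"mooncake_recycling": {"time:mid_autumn", "word:recycling"}}
elems = extract_elements(date(2011, 9, 10), "Moon also recycling, best price", "a@example.com")
```

Here `match_situation(elems, situations)` selects `mooncake_recycling`, mirroring the Mid-Autumn example in the text.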
Once the situation corresponding to the email has been determined, the influence probability of that situation in the hit entry is used for adjustment. Specifically, the influence probability corresponding to the situation determined by scenario analysis is looked up, and the reference mail value vector of the hit entry is adjusted by that influence probability to obtain the mail value vector of the email, which comprises a keyword score, a replacement score, a situation score, and an interference-symbol score. The adjustment is applied to the reference mail value vector of the hit entry, and the adjusted vector serves as the mail value vector of the email.
After the mail value vector of the email has been obtained, the spam probability is calculated from it. The calculation mainly adds the product of the keyword score and the replacement score in the mail value vector to the situation score and the interference-symbol score, giving the spam probability of the email. A person skilled in the art may adjust the variables in the formula, or use a new formula, according to the calculation results and actual spam conditions; this formula is given only as an illustration and does not limit the scope of protection.
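Written out, the calculation described above is Key × eXchange + Content + DisturbMark. A minimal sketch follows; the `MailValue` type and the sample numbers are assumptions for illustration, and no clamping to the range 0 to 1 is applied because the source does not mention one.

```python
from typing import NamedTuple


class MailValue(NamedTuple):
    """Mail value vector of one email (field names are illustrative)."""
    key: float           # keyword score
    exchange: float      # replacement score
    content: float       # situation score, may be negative
    disturb_mark: float  # interference-symbol score


def spam_probability(v):
    """Product of keyword and replacement scores, plus situation and interference scores."""
    return v.key * v.exchange + v.content + v.disturb_mark


p = spam_probability(MailValue(key=0.8, exchange=0.9, content=0.1, disturb_mark=0.05))
```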
After the spam probability has been calculated, it is compared with the preset threshold to judge whether the email is suspicious spam. For example, when the calculated spam probability is greater than the preset threshold, the email is determined to be suspicious spam; if it is less than or equal to the threshold, the email is cleared of suspicion and can be delivered normally. Alternatively, the email may be determined to be suspicious spam when the calculated spam probability is greater than or equal to the preset threshold, and cleared of suspicion and delivered normally when it is less than the threshold.
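The two comparison variants differ only in how a probability exactly equal to the threshold is treated. A small sketch, with the function name and the `inclusive` flag as assumptions:

```python
def is_suspicious(probability, threshold, inclusive=False):
    """Judge a mail as suspicious spam.

    inclusive=False: suspicious only when probability > threshold.
    inclusive=True:  suspicious when probability >= threshold.
    """
    if inclusive:
        return probability >= threshold
    return probability > threshold
```

Whichever variant is chosen should be applied consistently, since mail scoring exactly at the threshold is classified differently by the two.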
Fig. 2 is a flow diagram of building the fuzzy-word and context recognition library in another embodiment of the spam filtering method of the present invention. Compared with the previous embodiment, this embodiment further includes, before step 101, the operation of building the fuzzy-word and context recognition library, which specifically comprises:
Step 201, establishing the fuzzy-word and context recognition library;
Step 202, adding entries to the library according to existing correspondences between spam keywords and fuzzy words;
Step 203, calculating, from historical data in the anti-spam system, the reference mail value vector for each correspondence between an existing spam keyword and a fuzzy word, the reference mail value vector comprising a reference keyword score, a reference replacement score, a reference situation score, and a reference interference-symbol score;
Step 204, adding to the entry the influence probabilities that multiple situations have on the correspondence between the existing spam keyword and the fuzzy word, each situation comprising at least one situation element.
In this embodiment, each entry in the fuzzy-word and context recognition library contains the correspondence between a fuzzy word and an existing spam keyword, the reference mail value vector for that correspondence, and the influence probabilities that multiple situations have on that correspondence. The reference mail value vector is calculated from the historical data of the anti-spam system and reflects how the fuzzy word and the existing spam keyword correspond in the data available so far.
The reference mail value vector eMailValue0 comprises a reference keyword score Key0, a reference replacement score eXchange0, a reference situation score Content0, and a reference interference-symbol score DisturbMark0. Key0 represents the average probability that a mail is spam when it contains the existing spam keyword. eXchange0 represents the average probability that the fuzzy word, when it appears in a mail, is in fact replacing the existing spam keyword. Content0 represents the average influence of the mail's situation on the spam probability when a fuzzy word replaces a keyword in the mail; this value may be positive or negative. DisturbMark0 represents the average increase in the probability that a mail is spam when a fuzzy word replaces a keyword in the mail and the mail also contains interference symbols.
When the reference mail value vector is adjusted to obtain the mail value vector of an email, the reference keyword score Key0 and the reference replacement score eXchange0 are determined by historical data and need no adjustment; in other words, Key0 and eXchange0 are used directly as the keyword score Key and the replacement score eXchange in the email's mail value vector. The situation score, by contrast, requires adjusting the reference situation score Content0 by the influence probability Pc of the situation identified by the analysis; the result is used as the situation score Content in the email's mail value vector.
As mentioned above, the influence probability Pc may be positive or negative. If a positive number expresses that, in the given situation, a mail in which a fuzzy word replaces a keyword is more likely to be spam, and a negative number expresses that it is less likely, then Content0 can be adjusted with simple arithmetic, for example by adding Pc, or a multiple of Pc, directly to Content0 to obtain the adjusted Content.
Another convention is equally feasible: a larger value expresses that, in the given situation, a mail in which a fuzzy word replaces a keyword is more likely to be spam, and a smaller value expresses that it is less likely. A suitable calculation must then be chosen to adjust Content0, for example adding the difference between Pc and some constant to Content0 to obtain the adjusted Content. The adjustment examples given here are only intended to aid understanding and do not limit the adjustment to any particular form.
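The two conventions for the influence probability Pc can be compared side by side. In this sketch the function names, the default multiple of 1.0, and the neutral constant 0.5 are assumptions; the source leaves the multiple and the constant open.

```python
def adjust_signed(content0, pc, multiple=1.0):
    """Signed convention: Pc > 0 raises the situation score, Pc < 0 lowers it."""
    return content0 + multiple * pc


def adjust_offset(content0, pc, neutral=0.5):
    """Unsigned convention: Pc above the neutral constant raises the score, below lowers it."""
    return content0 + (pc - neutral)


raised = adjust_signed(0.1, 0.2)      # situation makes spam more likely
lowered = adjust_signed(0.1, -0.05)   # situation makes spam less likely
shifted = adjust_offset(0.1, 0.7)     # 0.7 is above the assumed neutral point of 0.5
```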
In the above embodiments, interference symbols in an email affect matching against entries in the fuzzy-word and context recognition library. Interference symbols here generally means typesetting characters and non-language symbols inserted into words, such as line breaks, "|", "/", "#", and "!", as well as spaces, which make the words hard for a computer to recognize. To improve the recognition rate, after the email is received and before its content is scanned for fuzzy words that hit entries in the preset library, interference-symbol removal is applied to the non-language parts of the email: deleting these non-language symbols makes the words in the mail more continuous and thereby improves the recognition rate.
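Interference-symbol removal can be sketched with a single regular expression. The symbol set below covers only the examples named in the text (whitespace including line breaks, "|", "/", "#", "!") and is an assumption about what a real deployment would strip; the function name is hypothetical.

```python
import re

# Whitespace (spaces, line breaks) plus the example symbols from the text.
INTERFERENCE = re.compile(r"[\s|/#!]+")


def strip_interference(text):
    """Delete interference symbols so split-up words become contiguous again."""
    return INTERFERENCE.sub("", text)


cleaned = strip_interference("v|i/a#g!r a\n")
```

Removing every space suits text such as Chinese, where spaces do not separate words; for space-delimited languages this sketch would also merge adjacent words, so a different rule would be needed there.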
In addition, if interference symbols appear at many places in a mail, the mail is more likely to be spam. Therefore, when adjusting the mail value vector, the reference interference-symbol score DisturbMark0 can be adjusted according to factors such as whether interference symbols are present, how frequently they occur, and how many times they occur, so that the final spam probability reflects their influence.
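The source does not give a formula for this adjustment. One possible scheme, shown purely as an assumption, scales DisturbMark0 by a factor that grows with the number of interference symbols and is capped; the step size, the cap, the symbol set, and the function name are all hypothetical.

```python
import re

INTERFERENCE = re.compile(r"[|/#!]")


def disturb_score(text, disturb_mark0, step=0.5, cap=3.0):
    """Scale the reference interference score by how many symbols the mail contains."""
    count = len(INTERFERENCE.findall(text))
    if count == 0:
        return 0.0  # assumed: no interference symbols, no contribution
    factor = min(1.0 + step * (count - 1), cap)
    return disturb_mark0 * factor


none = disturb_score("hello", 0.05)
one = disturb_score("a|b", 0.05)
three = disturb_score("a|b/c#d", 0.05)
```

The score must be computed on the original text, before interference symbols are removed for matching.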
After an email determined to be suspicious spam has been intercepted, the reference mail value vector and the situation influence probabilities for the correspondence between the existing spam keyword and the fuzzy word are recalculated from this judgment result combined with historical data, and the corresponding entry in the fuzzy-word and context recognition library is updated. Entries are thus updated continuously as spam is filtered, so that subsequent filtering better matches actual conditions and gives more accurate results.
Fig. 3 is a flow diagram of a further embodiment of the spam filtering method of the present invention. This embodiment gives a more specific spam filtering example, comprising:
Step 301, receiving an email;
Step 302, applying interference-symbol removal to the non-language parts of the email;
Step 303, scanning the content of the email for any fuzzy word that hits an entry in the preset fuzzy-word and context recognition library; if there is a hit, performing step 304, otherwise performing step 310;
Step 304, analyzing the email to obtain its situation elements;
Step 305, matching the situation elements obtained from the email against the situation elements of each situation in the hit entry, and determining the situation corresponding to the email;
Step 306, looking up the influence probability corresponding to the situation determined by scenario analysis;
Step 307, adjusting the reference mail value vector of the hit entry by the influence probability to obtain the mail value vector of the email;
Step 308, adding the product of the keyword score and the replacement score in the mail value vector of the email to the situation score and the interference-symbol score to obtain the spam probability of the email;
Step 309, comparing the spam probability with the preset threshold to judge whether the email is suspicious spam; if so, performing step 310, otherwise performing step 311;
Step 310, intercepting the email determined to be suspicious spam, and performing step 312;
Step 311, delivering the email as normal mail;
Step 312, optimizing and adjusting the entries in the fuzzy-word and context recognition library and the spam probability threshold according to the result for this email.
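The flow of steps 302 to 311 can be sketched end to end for the single-hit case. Everything named here (`Entry`, `filter_mail`, the element labels, the scores) is hypothetical. The sketch further assumes a signed influence probability that is added to Content0, and it assumes that a mail with no hit is delivered normally; the feedback of step 312 is omitted.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Entry:
    fuzzy_word: str
    reference: tuple                                # (Key0, eXchange0, Content0, DisturbMark0)
    situations: dict = field(default_factory=dict)  # name -> (required elements, Pc)


def filter_mail(body, mail_elements, library, threshold):
    """Return 'intercept' or 'deliver' for one mail."""
    cleaned = re.sub(r"[\s|/#!]+", "", body)                 # step 302
    hits = [e for e in library if e.fuzzy_word in cleaned]   # step 303
    if not hits:
        return "deliver"                                     # assumed handling of a miss
    key, exchange, content, disturb = hits[0].reference      # first hit only
    for required, pc in hits[0].situations.values():         # steps 304-306
        if required <= mail_elements:
            content += pc                                    # step 307, signed Pc assumed
            break
    probability = key * exchange + content + disturb         # step 308
    return "intercept" if probability > threshold else "deliver"  # steps 309-311


library = [Entry("v1agra", (0.6, 0.8, 0.0, 0.05), {"night": ({"time:night"}, 0.3)})]
night_result = filter_mail("buy v|1 a#gra", {"time:night"}, library, 0.7)
day_result = filter_mail("buy v|1 a#gra", {"time:day"}, library, 0.7)
```

With these made-up scores the same disguised word is intercepted only when the matching situation raises the situation score, which is the effect scenario analysis is meant to have.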
In the above embodiments, if there are multiple hit entries, fuzzy-word processing, spam probability calculation, and threshold comparison can be carried out for each hit entry separately, and the conclusions obtained for the individual hit entries are then combined to judge whether the email is suspicious spam.
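The source does not say how the per-entry conclusions are combined. Two plausible policies, given here only as assumptions, are "suspicious if any entry exceeds the threshold" and "suspicious if a majority of entries do"; the function name and policy labels are hypothetical.

```python
def combine_verdicts(probabilities, threshold, policy="any"):
    """Combine per-entry spam probabilities into one verdict for the mail."""
    verdicts = [p > threshold for p in probabilities]
    if policy == "any":
        return any(verdicts)
    if policy == "majority":
        return sum(verdicts) * 2 > len(verdicts)
    raise ValueError("unknown policy: " + policy)


strict = combine_verdicts([0.9, 0.2, 0.3], 0.8)                     # one entry is enough
lenient = combine_verdicts([0.9, 0.2, 0.3], 0.8, policy="majority")  # needs more than half
```

The "any" policy intercepts more mail and risks more false positives; "majority" is more conservative.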
Embodiments of the present invention can supplement keyword interception. For example, they can judge in parallel with keyword interception, with the conclusions then combined into a final judgment, or they can continue with recognition of spam containing fuzzy words when keyword interception has not identified the mail as spam, thereby widening the interception range of the anti-spam system.
With the present invention, mail that tries to evade filtering by inserting interference characters, using homophones, or using visually similar characters can no longer slip through easily, which widens the interception range of the anti-spam system. At the same time, scenario analysis adds a situation dimension to spam discrimination, improving sorting precision and helping avoid misjudgment. In some embodiments, feedback after mail processing is also used for adjustment, so that the interception system continuously optimizes itself and does not become too rigid to adapt to new changes in spam.
A person of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as ROM, RAM, a magnetic disk, or an optical disc.
Fig. 4 is a structural diagram of one embodiment of the spam filtering device of the present invention. In this embodiment, the spam filtering device comprises: an email receiving unit 11, a fuzzy-word scanning unit 12, a scenario analysis unit 13, a vector adjustment unit 14, a probability calculation unit 15, a threshold comparison unit 16, and an email processing unit 17.
The email receiving unit 11 receives email. The fuzzy-word scanning unit 12 scans the content of the email for any fuzzy word that hits an entry in the preset fuzzy-word and context recognition library. The scenario analysis unit 13 performs scenario analysis on the email when there is a hit entry. The vector adjustment unit 14 adjusts according to the situation corresponding to the email to obtain the mail value vector of the email.
The probability calculation unit 15 calculates the spam probability from the adjusted mail value vector of the email. The threshold comparison unit 16 compares the spam probability with the preset threshold to judge whether the email is suspicious spam. The email processing unit 17 intercepts any email determined to be suspicious spam.
Fig. 5 is a structural diagram of the units that implement library building in another embodiment of the spam filtering device of the present invention. Compared with the previous embodiment, this embodiment adds the units that implement library building: a library building unit 21, an entry adding unit 22, a reference vector calculation unit 23, and a situation probability adding unit 24. The library building unit 21 establishes the fuzzy-word and context recognition library. The entry adding unit 22 adds entries to the library according to existing correspondences between spam keywords and fuzzy words. The reference vector calculation unit 23 calculates, from historical data in the anti-spam system, the reference mail value vector for each correspondence between an existing spam keyword and a fuzzy word; the reference mail value vector comprises a reference keyword score, a reference replacement score, a reference situation score, and a reference interference-symbol score. The situation probability adding unit 24 adds to the entry the influence probabilities that multiple situations have on the correspondence between the existing spam keyword and the fuzzy word, each situation comprising at least one situation element.
Fig. 6 is a structural diagram of a further embodiment of the spam filtering device of the present invention. Compared with the preceding device embodiments, this embodiment may additionally include an interference-symbol removal unit 17, connected to the email receiving unit 11 and the fuzzy-word scanning unit 12. Before the content of the email is scanned for fuzzy words that hit entries in the preset fuzzy-word and context recognition library, this unit applies interference-symbol removal to the non-language parts of the email.
The scenario analysis unit 13 may further comprise a situation element analysis component 13a, a situation element matching component 13b, and a situation determination component 13c. The situation element analysis component 13a analyzes the email to obtain its situation elements. The situation element matching component 13b matches the situation elements obtained from the email against the situation elements of each situation in the hit entry. The situation determination component 13c determines the situation corresponding to the email from the matching result.
The vector adjustment unit 14 may further comprise an influence probability query component 14a and a vector adjustment component 14b. The influence probability query component 14a looks up the influence probability corresponding to the situation determined by scenario analysis. The vector adjustment component 14b adjusts the reference mail value vector of the hit entry by the influence probability to obtain the mail value vector of the email, which comprises a keyword score, a replacement score, a situation score, and an interference-symbol score.
In another embodiment, a library updating unit 18 may also be added to the spam filtering device. This unit is connected to the email processing unit 17. After the email has been judged to be suspicious spam and intercepted, it recalculates, from this judgment result combined with historical data, the reference mail value vector and the situation influence probabilities for the correspondence between the existing spam keyword and the fuzzy word, and updates the corresponding entry in the fuzzy-word and context recognition library.
If there are multiple hit entries, each hit entry is processed separately by the scenario analysis unit, the vector adjustment unit, the spam probability calculation unit, and the threshold comparison unit, and the conclusions obtained for the individual hit entries are combined to judge whether the email is suspicious spam.
With the present invention, mail that tries to evade filtering by inserting interference characters, using homophones, or using visually similar characters can no longer slip through easily, which widens the interception range of the anti-spam system. At the same time, scenario analysis adds a situation dimension to spam discrimination, improving sorting precision and helping avoid misjudgment. In some embodiments, feedback after mail processing is also used for adjustment, so that the interception system continuously optimizes itself and does not become too rigid to adapt to new changes in spam.
Finally, it should be noted that the above embodiments serve only to illustrate the technical solution of the present invention and not to limit it. Although the invention has been described in detail with reference to preferred embodiments, a person of ordinary skill in the art will understand that the specific embodiments may still be modified, or some technical features replaced with equivalents, and that any such modification or replacement that does not depart from the spirit of the technical solution falls within the scope of protection claimed by the present invention.

Claims (15)

1. A spam filtering method, comprising:
When an e-mail is received, scanning the content of the e-mail to determine whether it contains a fuzzy word that hits an entry in a preset fuzzy word and context recognition library;
If there is a hit entry, performing scenario analysis on the e-mail and adjusting according to the situation corresponding to the e-mail to obtain a mail value vector of the e-mail, the mail value vector comprising a keyword score, a replacement score, a situation score, and an interference symbol score;
Calculating a spam probability from the adjusted mail value vector of the e-mail, comparing the spam probability with a preset threshold to determine whether the e-mail is suspicious spam, and intercepting any e-mail determined to be suspicious spam.
2. The spam filtering method according to claim 1, further comprising, before the e-mail is received, an operation of establishing the fuzzy word and context recognition library, which specifically comprises:
Establishing the fuzzy word and context recognition library;
Adding entries to the fuzzy word and context recognition library according to the correspondences between existing spam keywords and fuzzy words;
Calculating, from historical data in the anti-spam system, the reference mail value vector for each correspondence between an existing spam keyword and a fuzzy word, the reference mail value vector comprising a reference keyword score, a reference replacement score, a reference situation score, and a reference interference symbol score;
Adding to each entry the influence probabilities that multiple situations have on the correspondence between the existing spam keyword and the fuzzy word, each situation comprising at least one situation element.
3. The spam filtering method according to claim 2, wherein the operation of performing scenario analysis on the e-mail specifically comprises:
Analyzing the e-mail to obtain its situation elements;
Matching the obtained situation elements of the e-mail against the situation elements of each situation in the hit entry to determine the situation corresponding to the e-mail.
4. The spam filtering method according to claim 3, wherein the operation of adjusting according to the situation corresponding to the e-mail to obtain the mail value vector of the e-mail specifically comprises:
Looking up the influence probability that corresponds to the situation determined for the e-mail by the scenario analysis;
Adjusting the reference mail value vector of the hit entry by the influence probability to obtain the mail value vector of the e-mail.
5. The spam filtering method according to claim 4, wherein the operation of calculating the spam probability from the adjusted mail value vector of the e-mail is specifically:
Adding the product of the keyword score and the replacement score in the mail value vector of the e-mail to the situation score and the interference symbol score to obtain the spam probability of the e-mail.
6. The spam filtering method according to any one of claims 2 to 5, further comprising, before scanning the content of the e-mail for a fuzzy word that hits an entry in the preset fuzzy word and context recognition library:
Removing interference symbols from the non-language part of the e-mail.
7. The spam filtering method according to claim 6, wherein, after the e-mail has been determined to be suspicious spam and intercepted, the determination is combined with historical data to recalculate the reference mail value vector and the situation influence probabilities for the correspondence between the existing spam keyword and the fuzzy word, and the corresponding entry in the fuzzy word and context recognition library is updated.
8. The spam filtering method according to claim 1, wherein, if there are multiple hit entries, fuzzy word processing, spam probability calculation, and threshold comparison are carried out separately for each hit entry, and the conclusions obtained for the individual hit entries are combined to determine whether the e-mail is suspicious spam.
9. A spam filtering device, comprising:
An e-mail receiving unit, for receiving an e-mail;
A fuzzy word scanning unit, for scanning the content of the e-mail to determine whether it contains a fuzzy word that hits an entry in a preset fuzzy word and context recognition library;
A scenario analysis unit, for performing scenario analysis on the e-mail when there is a hit entry;
A vector adjustment unit, for adjusting according to the situation corresponding to the e-mail to obtain a mail value vector of the e-mail, the mail value vector comprising a keyword score, a replacement score, a situation score, and an interference symbol score;
A probability calculation unit, for calculating a spam probability from the adjusted mail value vector of the e-mail;
A threshold comparing unit, for comparing the spam probability with a preset threshold to determine whether the e-mail is suspicious spam;
An e-mail processing unit, for intercepting any e-mail determined to be suspicious spam.
10. The spam filtering device according to claim 9, further comprising:
A library building unit, for establishing the fuzzy word and context recognition library;
An entry adding unit, for adding entries to the fuzzy word and context recognition library according to the correspondences between existing spam keywords and fuzzy words;
A reference vector calculation unit, for calculating, from historical data in the anti-spam system, the reference mail value vector for each correspondence between an existing spam keyword and a fuzzy word, the reference mail value vector comprising a reference keyword score, a reference replacement score, a reference situation score, and a reference interference symbol score;
A situation probability adding unit, for adding to each entry the influence probabilities that multiple situations have on the correspondence between the existing spam keyword and the fuzzy word, each situation comprising at least one situation element.
11. The spam filtering device according to claim 10, wherein the scenario analysis unit further comprises:
A situation element analysis component, for analyzing the e-mail to obtain its situation elements;
A situation element matching component, for matching the obtained situation elements of the e-mail against the situation elements of each situation in the hit entry;
A situation determination component, for determining the situation corresponding to the e-mail according to the matching result.
12. The spam filtering device according to claim 11, wherein the vector adjustment unit further comprises:
An influence probability query component, for looking up the influence probability that corresponds to the situation determined for the e-mail by the scenario analysis;
A vector adjustment component, for adjusting the reference mail value vector of the hit entry by the influence probability to obtain the mail value vector of the e-mail.
13. The spam filtering device according to any one of claims 10 to 12, further comprising:
An interference symbol removal unit, connected to the e-mail receiving unit and the fuzzy word scanning unit, for removing interference symbols from the non-language part of the e-mail before the content of the e-mail is scanned for a fuzzy word that hits an entry in the preset fuzzy word and context recognition library.
14. The spam filtering device according to claim 13, further comprising:
A library updating unit, connected to the e-mail processing unit, for combining, after the e-mail has been determined to be suspicious spam and intercepted, this determination with historical data to recalculate the reference mail value vector and the situation influence probabilities for the correspondence between the existing spam keyword and the fuzzy word, and for updating the corresponding entry in the fuzzy word and context recognition library.
15. The spam filtering device according to claim 9, wherein, if there are multiple hit entries, each hit entry is processed in turn by the scenario analysis unit, the vector adjustment unit, the probability calculation unit, and the threshold comparing unit, and the conclusions obtained for the individual hit entries are combined to determine whether the e-mail is suspicious spam.
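Claims 3 to 5 (and the matching device claims 11 and 12) describe the core computation: match the situation, adjust the reference vector, and combine the four scores. The sketch below is one possible reading, not the claimed implementation. The names `ValueVector`, `match_situation`, `adjust`, and `spam_probability` are hypothetical, as are the largest-overlap matching rule and the choice to scale only the situation score by the influence probability. Only the final formula (keyword score times replacement score, plus situation score, plus interference symbol score) follows the wording of claim 5.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass(frozen=True)
class ValueVector:
    keyword: float       # keyword score
    replacement: float   # replacement score
    situation: float     # situation score
    interference: float  # interference symbol score


def match_situation(mail_elements: Set[str],
                    situations: Dict[str, Set[str]]) -> Optional[str]:
    """Pick the situation that shares the most elements with the e-mail.

    Largest-overlap matching is an assumption; returns None if nothing overlaps.
    """
    best, best_overlap = None, 0
    for name, elements in situations.items():
        overlap = len(mail_elements & elements)
        if overlap > best_overlap:
            best, best_overlap = name, overlap
    return best


def adjust(reference: ValueVector, influence: float) -> ValueVector:
    """Adjust the reference vector by the influence probability.

    Scaling only the situation score is an assumption.
    """
    return ValueVector(reference.keyword, reference.replacement,
                       reference.situation * influence, reference.interference)


def spam_probability(v: ValueVector) -> float:
    """Keyword score x replacement score + situation score + interference score."""
    return v.keyword * v.replacement + v.situation + v.interference
```

For example, with a reference vector (0.5, 0.5, 0.5, 0.125) and an influence probability of 0.5, the sketch yields 0.5 × 0.5 + 0.25 + 0.125 = 0.625, which would then be compared with the preset threshold. The sum is not bounded by 1 on its own, so the scores would need to be scaled accordingly.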
CN201110264365.1A 2011-09-08 2011-09-08 Rubbish mail filtering method and device Active CN103001848B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110264365.1A CN103001848B (en) 2011-09-08 2011-09-08 Rubbish mail filtering method and device

Publications (2)

Publication Number Publication Date
CN103001848A CN103001848A (en) 2013-03-27
CN103001848B true CN103001848B (en) 2015-10-21

Family

ID=47930004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110264365.1A Active CN103001848B (en) 2011-09-08 2011-09-08 Rubbish mail filtering method and device

Country Status (1)

Country Link
CN (1) CN103001848B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103716335A (en) * 2014-01-12 2014-04-09 绵阳师范学院 Detecting and filtering method of spam mail based on counterfeit sender
CN103944810B (en) * 2014-05-06 2017-02-15 厦门大学 Spam e-mail intention recognition system
CN111563721B (en) * 2020-04-21 2023-07-11 上海爱数信息技术股份有限公司 Mail classification method suitable for different label distribution occasions

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW200516484A (en) * 2003-10-27 2005-05-16 Softnext Technologies Co Ltd Filtering method for SPAM
CN101137087A (en) * 2007-08-01 2008-03-05 浙江大学 Short message monitoring center and monitoring method
CN101588558A (en) * 2009-03-30 2009-11-25 网易(杭州)网络有限公司 Spam filtering method and system
US7680890B1 (en) * 2004-06-22 2010-03-16 Wei Lin Fuzzy logic voting method and system for classifying e-mail using inputs from multiple spam classifiers

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7644127B2 (en) * 2004-03-09 2010-01-05 Gozoom.Com, Inc. Email analysis using fuzzy matching of text
US7500265B2 (en) * 2004-08-27 2009-03-03 International Business Machines Corporation Apparatus and method to identify SPAM emails
CN101257671B (en) * 2007-07-06 2010-12-08 浙江大学 Method for real time filtering large scale rubbish SMS based on content
CN101106539A (en) * 2007-08-03 2008-01-16 浙江大学 Filtering method for spam based on supporting vector machine
CN101304589A (en) * 2008-04-14 2008-11-12 中国联合通信有限公司 Method and system for monitoring and filtering garbage short message transmitted by short message gateway

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
垃圾邮件过滤系统模型的研究与设计;胡锡衡;《鞍山师范学院学报》;20090430;第11卷(第2期);全文 *
罗倩等.反垃圾邮件技术综述.《渤海大学学报(自然科学版)》.2008,第29卷(第4期), *

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant