WO2009023583A3 - Domain name statistical classification using character-based n-grams - Google Patents
- Publication number
- WO2009023583A3 (application PCT/US2008/072668)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- character
- domain name
- grams
- classification
- domain
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Character Discrimination (AREA)
Abstract
Systems and methods of classifying domain names are disclosed herein. Character-based n-grams are derived from a domain name in order to classify that domain name into one or more pre-established categories. In one aspect, a geometrical approach is used. Domain name character-based n-grams are mapped to vector points in a multidimensional space. In addition, vector points for various other domain names, which belong to a domain name classification, can be mapped to the multidimensional space. The number of dimensions in the multidimensional space is the number of different n-grams that can exist for an n-character combination. The relationship between the domain name vector point and the vector points of the various other domain names is used as an indicator of the classification of the domain name vector point. In another aspect, the classification system can be configured to utilize statistical methods. Relative frequencies of one or more character-based n-grams in various classifications are used as indicators. For example, a dictionary set of character-based n-grams can be derived from one or more domain names. The character-based n-grams in the dictionary set can be associated with a probability indicative of the likelihood that the character-based n-gram is found in a domain name of a given classification. Such a probability can serve as an estimator of the classification of a new domain name having that character-based n-gram.
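The two aspects described in the abstract can be sketched roughly as follows. This is an illustrative Python sketch, not the method claimed in the patent: the training names, the category labels, the use of cosine similarity as the "relationship" between vector points, the nearest-neighbour decision rule, and the add-one smoothing in the statistical scorer are all assumptions made for the example. All function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical labeled domain names (assumption, for illustration only).
TRAINING = [
    ("cheapflights.com", "travel"),
    ("flightdeals.net", "travel"),
    ("hotelbooking.com", "travel"),
    ("pokerchips.com", "gambling"),
    ("onlinepoker.net", "gambling"),
    ("casinobets.com", "gambling"),
]

def char_ngrams(name, n=3):
    """Overlapping character n-grams of the first label of a domain name."""
    label = name.lower().split(".")[0]  # assumption: ignore the TLD
    return [label[i:i + n] for i in range(len(label) - n + 1)]

def to_vector(name, n=3):
    """Sparse vector point: one dimension per n-gram, value = occurrence count."""
    return Counter(char_ngrams(name, n))

def cosine(a, b):
    """Cosine similarity between two sparse vectors (assumed similarity measure)."""
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def classify_geometric(name, labeled, n=3):
    """Geometric aspect: return the class of the most similar labeled name."""
    vec = to_vector(name, n)
    _, best_class = max(labeled, key=lambda item: cosine(vec, to_vector(item[0], n)))
    return best_class

def train_statistical(labeled, n=3):
    """Statistical aspect: per-class n-gram counts (the 'dictionary set')."""
    counts = {}
    for name, cls in labeled:
        counts.setdefault(cls, Counter()).update(char_ngrams(name, n))
    return counts

def classify_statistical(name, counts, n=3):
    """Pick the class whose n-gram frequencies best explain the name."""
    vocab_size = len(set().union(*counts.values())) + 1  # +1 for unseen n-grams

    def log_score(cls):
        total = sum(counts[cls].values())
        # Add-one smoothed relative frequency of each n-gram in the class.
        return sum(math.log((counts[cls][g] + 1) / (total + vocab_size))
                   for g in char_ngrams(name, n))

    return max(counts, key=log_score)

if __name__ == "__main__":
    model = train_statistical(TRAINING)
    for candidate in ("bestflights.org", "pokerstars.com"):
        print(candidate,
              classify_geometric(candidate, TRAINING),
              classify_statistical(candidate, model))
```

The sparse `Counter` stands in for the full multidimensional vector: dimensions whose n-gram never occurs in a name are simply left out rather than stored as zeros.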
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/837,476 (US8005782B2, en) | 2007-08-10 | 2007-08-10 | Domain name statistical classification using character-based N-grams |
US11/837,476 | 2007-08-10 | | |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2009023583A2 (en) | 2009-02-19 |
WO2009023583A3 (en) | 2009-04-02 |
Family
ID=40347433
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2008/072668 (WO2009023583A2, en) | 2007-08-10 | 2008-08-08 | Domain name statistical classification using character-based n-grams |
Country Status (2)
Country | Link |
---|---|
US (1) | US8005782B2 (en) |
WO (1) | WO2009023583A2 (en) |
Families Citing this family (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080162506A1 (en) * | 2007-01-03 | 2008-07-03 | Stephenson Janette W | Device and method for world wide web organization |
US8041662B2 (en) * | 2007-08-10 | 2011-10-18 | Microsoft Corporation | Domain name geometrical classification using character-based n-grams |
US11201848B2 (en) | 2011-07-06 | 2021-12-14 | Akamai Technologies, Inc. | DNS-based ranking of domain names |
US10742591B2 (en) | 2011-07-06 | 2020-08-11 | Akamai Technologies Inc. | System for domain reputation scoring |
US9843601B2 (en) | 2011-07-06 | 2017-12-12 | Nominum, Inc. | Analyzing DNS requests for anomaly detection |
US8768935B2 (en) | 2011-10-10 | 2014-07-01 | Verisign, Inc. | Bigram suggestions |
US9477756B1 (en) * | 2012-01-16 | 2016-10-25 | Amazon Technologies, Inc. | Classifying structured documents |
US9020911B2 (en) | 2012-01-18 | 2015-04-28 | International Business Machines Corporation | Name search using multiple bitmap distributions |
US9218335B2 (en) * | 2012-10-10 | 2015-12-22 | Verisign, Inc. | Automated language detection for domain names |
IN2013MU02217A (en) * | 2013-07-01 | 2015-06-12 | Tata Consultancy Services Ltd | |
US10171415B2 (en) * | 2013-10-11 | 2019-01-01 | Verisign, Inc. | Characterization of domain names based on changes of authoritative name servers |
CN104598452B (en) * | 2013-10-30 | 2018-09-11 | 秒针信息技术有限公司 | User's gender analysis method and apparatus |
US10140282B2 (en) * | 2014-04-01 | 2018-11-27 | Verisign, Inc. | Input string matching for domain names |
US9569522B2 (en) | 2014-06-04 | 2017-02-14 | International Business Machines Corporation | Classifying uniform resource locators |
US10467276B2 (en) | 2016-01-28 | 2019-11-05 | Ceeq It Corporation | Systems and methods for merging electronic data collections |
CN106844687B (en) * | 2017-01-23 | 2021-01-01 | 炫彩互动网络科技有限公司 | Method and system for determining gender of user based on game log |
KR102608683B1 (en) * | 2017-07-16 | 2023-11-30 | 쥐에스아이 테크놀로지 인코포레이티드 | Natural language processing with knn |
US10785188B2 (en) | 2018-05-22 | 2020-09-22 | Proofpoint, Inc. | Domain name processing systems and methods |
US11115338B2 (en) * | 2019-12-10 | 2021-09-07 | Hughes Network Systems, Llc | Intelligent conversion of internet domain names to vector embeddings |
US11727089B2 (en) | 2020-09-08 | 2023-08-15 | Nasdaq, Inc. | Modular machine learning systems and methods |
CN117157712A (en) * | 2021-03-30 | 2023-12-01 | Gsi 科技公司 | N-tuple based classification with associated processing units |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020035611A1 (en) * | 2000-01-14 | 2002-03-21 | Dooley Thomas P. | System and method for providing an information network on the internet |
US6560596B1 (en) * | 1998-08-31 | 2003-05-06 | Multilingual Domains Llc | Multiscript database system and method |
US20060059337A1 (en) * | 2004-09-16 | 2006-03-16 | Nokia Corporation | Systems and methods for secured domain name system use based on pre-existing trust |
US20060095404A1 (en) * | 2004-10-29 | 2006-05-04 | The Go Daddy Group, Inc | Presenting search engine results based on domain name related reputation |
Family Cites Families (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2501771B2 (en) * | 1993-01-19 | 1996-05-29 | インターナショナル・ビジネス・マシーンズ・コーポレイション | Method and apparatus for obtaining multiple valid signatures of an unwanted software entity |
JPH08221447A (en) | 1995-02-10 | 1996-08-30 | Canon Inc | Automatic document sorting device |
WO1996041281A1 (en) * | 1995-06-07 | 1996-12-19 | International Language Engineering Corporation | Machine assisted translation tools |
US6266664B1 (en) | 1997-10-01 | 2001-07-24 | Rulespace, Inc. | Method for scanning, analyzing and rating digital information content |
JP2000231559A (en) | 1999-02-12 | 2000-08-22 | Matsushita Electric Ind Co Ltd | Information processor |
EP1049030A1 (en) * | 1999-04-28 | 2000-11-02 | SER Systeme AG Produkte und Anwendungen der Datenverarbeitung | Classification method and apparatus |
US6684201B1 (en) * | 2000-03-31 | 2004-01-27 | Microsoft Corporation | Linguistic disambiguation system and method using string-based pattern training to learn to resolve ambiguity sites |
US8396859B2 (en) | 2000-06-26 | 2013-03-12 | Oracle International Corporation | Subject matter context search engine |
US6578032B1 (en) * | 2000-06-28 | 2003-06-10 | Microsoft Corporation | Method and system for performing phrase/word clustering and cluster merging |
KR20020011671A (en) | 2000-08-03 | 2002-02-09 | 김후길, 이명성, 오경묵 | A n-dimension information search method and system thereof |
AUPR033800A0 (en) * | 2000-09-25 | 2000-10-19 | Telstra R & D Management Pty Ltd | A document categorisation system |
US6826576B2 (en) | 2001-05-07 | 2004-11-30 | Microsoft Corporation | Very-large-scale automatic categorizer for web content |
US7133860B2 (en) * | 2002-01-23 | 2006-11-07 | Matsushita Electric Industrial Co., Ltd. | Device and method for automatically classifying documents using vector analysis |
US20030233232A1 (en) * | 2002-06-12 | 2003-12-18 | Lucent Technologies Inc. | System and method for measuring domain independence of semantic classes |
US20040162895A1 (en) | 2003-02-19 | 2004-08-19 | B2B Booster, Inc. | Web site management with electronic storefront and page categorization |
US20040167982A1 (en) | 2003-02-26 | 2004-08-26 | Cohen Michael A. | Multiple registrars |
US7260568B2 (en) * | 2004-04-15 | 2007-08-21 | Microsoft Corporation | Verifying relevance between keywords and web site contents |
JP4713870B2 (en) * | 2004-10-13 | 2011-06-29 | ヒューレット−パッカード デベロップメント カンパニー エル.ピー. | Document classification apparatus, method, and program |
US20070094500A1 (en) | 2005-10-20 | 2007-04-26 | Marvin Shannon | System and Method for Investigating Phishing Web Sites |
US8683031B2 (en) | 2004-10-29 | 2014-03-25 | Trustwave Holdings, Inc. | Methods and systems for scanning and monitoring content on a network |
US20060149710A1 (en) * | 2004-12-30 | 2006-07-06 | Ross Koningstein | Associating features with entities, such as categories of web page documents, and/or weighting such features |
US7707203B2 (en) * | 2005-03-11 | 2010-04-27 | Yahoo! Inc. | Job seeking system and method for managing job listings |
US20060212142A1 (en) * | 2005-03-16 | 2006-09-21 | Omid Madani | System and method for providing interactive feature selection for training a document classification system |
US7519588B2 (en) * | 2005-06-20 | 2009-04-14 | Efficient Frontier | Keyword characterization and application |
US7971147B2 (en) | 2005-07-25 | 2011-06-28 | Billeo, Inc. | Methods and systems for automatically creating a site menu |
US8041662B2 (en) | 2007-08-10 | 2011-10-18 | Microsoft Corporation | Domain name geometrical classification using character-based n-grams |
- 2007-08-10: US application US11/837,476, patent US8005782B2 (en); status: not active, Expired - Fee Related
- 2008-08-08: WO application PCT/US2008/072668, publication WO2009023583A2 (en); status: active, Application Filing
Also Published As
Publication number | Publication date |
---|---|
WO2009023583A2 (en) | 2009-02-19 |
US20090043720A1 (en) | 2009-02-12 |
US8005782B2 (en) | 2011-08-23 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
WO2009023583A3 (en) | Domain name statistical classification using character-based n-grams | |
Chowdhury et al. | Tweet4act: Using incident-specific profiles for classifying crisis-related messages. | |
WO2007143223A3 (en) | System and method for entity based information categorization | |
WO2009029905A3 (en) | Identification of semantic relationships within reported speech | |
WO2012177794A3 (en) | Identifying information related to a particular entity from electronic sources, using dimensional reduction and quantum clustering | |
CN103176962B (en) | The statistical method of text similarity and system | |
WO2007106403A3 (en) | Methods and systems to generate rules to identify data items | |
WO2014008247A3 (en) | Systems and methods for detecting tax refund fraud | |
WO2007149341A3 (en) | System to associate a demographic to a user of an electronic system | |
WO2008144964A8 (en) | Detecting name entities and new words | |
EP1703416A3 (en) | Method and Computer-Readable Medium for Providing Spreadsheet-Driven Key Performance Indicators | |
WO2012068238A3 (en) | Shipping system and method with taxonomic tariff harmonization | |
WO2012148950A3 (en) | Representing information from documents | |
EP1840766A3 (en) | Systems and methods for a distributed in-memory database and distributed cache | |
WO2012050887A3 (en) | Systems and methods for navigating electronic texts | |
WO2010096193A3 (en) | Identifying a document by performing spectral analysis on the contents of the document | |
WO2012129149A3 (en) | Aggregating search results based on associating data instances with knowledge base entities | |
WO2013112062A8 (en) | Systems and methods for spam detection using character histograms | |
WO2008046063A3 (en) | Methods and apparatuses for searching and categorizing messages within a network system | |
CA2879417A1 (en) | Structured search queries based on social-graph information | |
CN102624703A (en) | Method and device for filtering uniform resource locators (URLs) | |
WO2009036392A3 (en) | Multi-modal relevancy matching | |
EP2234048A3 (en) | Suggesting potential custodians for cases in an enterprise-wide electronic discovery system | |
Pitel et al. | Count-min-log sketch: Approximately counting with approximate counters | |
RU2013156261A (en) | METHOD OF CONSTRUCTION AND DETECTION OF THE THEMATIC STRUCTURE OF THE HOUSING |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 08797522; Country of ref document: EP; Kind code of ref document: A2 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 08797522; Country of ref document: EP; Kind code of ref document: A2 |