WO2009023583A3 - Domain name statistical classification using character-based n-grams - Google Patents

Domain name statistical classification using character-based n-grams

Info

Publication number
WO2009023583A3
WO2009023583A3 PCT/US2008/072668 US2008072668W WO2009023583A3 WO 2009023583 A3 WO2009023583 A3 WO 2009023583A3 US 2008072668 W US2008072668 W US 2008072668W WO 2009023583 A3 WO2009023583 A3 WO 2009023583A3
Authority
WO
WIPO (PCT)
Prior art keywords
character
domain name
grams
classification
domain
Prior art date
Application number
PCT/US2008/072668
Other languages
French (fr)
Other versions
WO2009023583A2 (en)
Inventor
Ilia Reznik
Roger N Simonson
Original Assignee
Microsoft Corp
Priority date
Filing date
Publication date
Application filed by Microsoft Corp
Publication of WO2009023583A2
Publication of WO2009023583A3

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/353: Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Character Discrimination (AREA)

Abstract

Systems and methods of classifying domain names are disclosed herein. Character-based n-grams are derived from a domain name in order to classify the domain name into one or more pre-established categories. In one aspect, a geometrical approach is used. The character-based n-grams of the domain name are mapped to a vector point in a multidimensional space. In addition, vector points for various other domain names, which belong to a domain name classification, can be mapped to the multidimensional space. The number of dimensions in the multidimensional space is the number of different n-grams that can exist for an n-character combination. The relationship between the domain name vector point and the vector points of the various other domain names is used as an indicator of the classification of the domain name vector point. In another aspect, the classification system can be configured to utilize statistical methods. Relative frequencies of one or more character-based n-grams in various classifications are used as indicators. For example, a dictionary set of character-based n-grams can be derived from one or more domain names. Each character-based n-gram in the dictionary set can be associated with a probability indicative of the likelihood that the character-based n-gram is found in a domain name of a given classification. Such a probability can serve as an estimator of the classification of a new domain name having that character-based n-gram.
PCT/US2008/072668 2007-08-10 2008-08-08 Domain name statistical classification using character-based n-grams WO2009023583A2 (en)
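
The two approaches named in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the toy training set, the use of trigrams on the first label only, cosine similarity as the geometric "relationship", and add-one smoothing for the probabilities are all assumptions made for illustration. The vectors are stored sparsely; with an assumed 37-character alphabet (a-z, 0-9, hyphen) the full trigram space would have 37^3 = 50653 dimensions.

```python
import math
from collections import Counter

# Hypothetical toy training data: classification -> known domain names.
TRAINING = {
    "travel": ["cheapflights.com", "hotelbooking.com", "flightdeals.net"],
    "finance": ["bankloans.com", "mortgagerates.com", "creditloans.net"],
}


def char_ngrams(name, n=3):
    """Return the character n-grams of the first label of a domain name."""
    label = name.lower().split(".")[0]  # assumption: ignore the TLD
    return [label[i:i + n] for i in range(len(label) - n + 1)]


def to_vector(name, n=3):
    """Sparse vector point: one dimension per distinct n-gram, value = count."""
    return Counter(char_ngrams(name, n))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[g] * v[g] for g in u if g in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify_geometric(name, training, n=3):
    """Pick the class whose summed n-gram vector is closest to the query."""
    query = to_vector(name, n)
    best_label, best_score = None, -1.0
    for label, names in training.items():
        class_vector = Counter()
        for known in names:
            class_vector.update(to_vector(known, n))
        score = cosine(query, class_vector)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def classify_statistical(name, training, n=3, alpha=1.0):
    """Pick the class with the highest smoothed n-gram log-likelihood."""
    counts = {
        label: Counter(g for known in names for g in char_ngrams(known, n))
        for label, names in training.items()
    }
    vocab = set().union(*counts.values())  # the "dictionary set" of n-grams
    scores = {}
    for label, class_counts in counts.items():
        total = sum(class_counts.values())
        denom = total + alpha * (len(vocab) + 1)  # +1 slot for unseen n-grams
        scores[label] = sum(
            math.log((class_counts[g] + alpha) / denom)
            for g in char_ngrams(name, n)
        )
    return max(scores, key=scores.get)
```

Calling `classify_geometric("lastminuteflights.com", TRAINING)` or `classify_statistical("lastminuteflights.com", TRAINING)` returns the label of the class whose training names share the most trigrams with the query. A real system would need far more training data and a principled choice of n and smoothing.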

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US11/837,476 US8005782B2 (en) 2007-08-10 2007-08-10 Domain name statistical classification using character-based N-grams

Publications (2)

Publication Number Publication Date
WO2009023583A2 (en) 2009-02-19
WO2009023583A3 (en) 2009-04-02

Family

ID=40347433

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2008/072668 WO2009023583A2 (en) 2007-08-10 2008-08-08 Domain name statistical classification using character-based n-grams

Country Status (2)

Country Link
US (1) US8005782B2 (en)
WO (1) WO2009023583A2 (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080162506A1 (en) * 2007-01-03 2008-07-03 Stephenson Janette W Device and method for world wide web organization
US8041662B2 (en) * 2007-08-10 2011-10-18 Microsoft Corporation Domain name geometrical classification using character-based n-grams
US11201848B2 (en) 2011-07-06 2021-12-14 Akamai Technologies, Inc. DNS-based ranking of domain names
US10742591B2 (en) 2011-07-06 2020-08-11 Akamai Technologies Inc. System for domain reputation scoring
US9843601B2 (en) 2011-07-06 2017-12-12 Nominum, Inc. Analyzing DNS requests for anomaly detection
US8768935B2 (en) 2011-10-10 2014-07-01 Verisign, Inc. Bigram suggestions
US9477756B1 (en) * 2012-01-16 2016-10-25 Amazon Technologies, Inc. Classifying structured documents
US9020911B2 (en) 2012-01-18 2015-04-28 International Business Machines Corporation Name search using multiple bitmap distributions
US9218335B2 (en) * 2012-10-10 2015-12-22 Verisign, Inc. Automated language detection for domain names
IN2013MU02217A (en) * 2013-07-01 2015-06-12 Tata Consultancy Services Ltd
US10171415B2 (en) * 2013-10-11 2019-01-01 Verisign, Inc. Characterization of domain names based on changes of authoritative name servers
CN104598452B (en) * 2013-10-30 2018-09-11 秒针信息技术有限公司 User's gender analysis method and apparatus
US10140282B2 (en) * 2014-04-01 2018-11-27 Verisign, Inc. Input string matching for domain names
US9569522B2 (en) 2014-06-04 2017-02-14 International Business Machines Corporation Classifying uniform resource locators
US10467276B2 (en) 2016-01-28 2019-11-05 Ceeq It Corporation Systems and methods for merging electronic data collections
CN106844687B (en) * 2017-01-23 2021-01-01 炫彩互动网络科技有限公司 Method and system for determining gender of user based on game log
KR102608683B1 (en) * 2017-07-16 2023-11-30 GSI Technology Inc. Natural language processing with KNN
US10785188B2 (en) 2018-05-22 2020-09-22 Proofpoint, Inc. Domain name processing systems and methods
US11115338B2 (en) * 2019-12-10 2021-09-07 Hughes Network Systems, Llc Intelligent conversion of internet domain names to vector embeddings
US11727089B2 (en) 2020-09-08 2023-08-15 Nasdaq, Inc. Modular machine learning systems and methods
CN117157712A (en) * 2021-03-30 2023-12-01 GSI Technology Inc. N-tuple based classification with associated processing units

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020035611A1 (en) * 2000-01-14 2002-03-21 Dooley Thomas P. System and method for providing an information network on the internet
US6560596B1 (en) * 1998-08-31 2003-05-06 Multilingual Domains Llc Multiscript database system and method
US20060059337A1 (en) * 2004-09-16 2006-03-16 Nokia Corporation Systems and methods for secured domain name system use based on pre-existing trust
US20060095404A1 (en) * 2004-10-29 2006-05-04 The Go Daddy Group, Inc Presenting search engine results based on domain name related reputation

Family Cites Families (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2501771B2 (en) * 1993-01-19 1996-05-29 International Business Machines Corporation Method and apparatus for obtaining multiple valid signatures of an unwanted software entity
JPH08221447A (en) 1995-02-10 1996-08-30 Canon Inc Automatic document sorting device
WO1996041281A1 (en) * 1995-06-07 1996-12-19 International Language Engineering Corporation Machine assisted translation tools
US6266664B1 (en) 1997-10-01 2001-07-24 Rulespace, Inc. Method for scanning, analyzing and rating digital information content
JP2000231559A (en) 1999-02-12 2000-08-22 Matsushita Electric Ind Co Ltd Information processor
EP1049030A1 (en) * 1999-04-28 2000-11-02 SER Systeme AG Produkte und Anwendungen der Datenverarbeitung Classification method and apparatus
US6684201B1 (en) * 2000-03-31 2004-01-27 Microsoft Corporation Linguistic disambiguation system and method using string-based pattern training to learn to resolve ambiguity sites
US8396859B2 (en) 2000-06-26 2013-03-12 Oracle International Corporation Subject matter context search engine
US6578032B1 (en) * 2000-06-28 2003-06-10 Microsoft Corporation Method and system for performing phrase/word clustering and cluster merging
KR20020011671A (en) 2000-08-03 2002-02-09 김후길, 이명성, 오경묵 A n-dimension information search method and system thereof
AUPR033800A0 (en) * 2000-09-25 2000-10-19 Telstra R & D Management Pty Ltd A document categorisation system
US6826576B2 (en) 2001-05-07 2004-11-30 Microsoft Corporation Very-large-scale automatic categorizer for web content
US7133860B2 (en) * 2002-01-23 2006-11-07 Matsushita Electric Industrial Co., Ltd. Device and method for automatically classifying documents using vector analysis
US20030233232A1 (en) * 2002-06-12 2003-12-18 Lucent Technologies Inc. System and method for measuring domain independence of semantic classes
US20040162895A1 (en) 2003-02-19 2004-08-19 B2B Booster, Inc. Web site management with electronic storefront and page categorization
US20040167982A1 (en) 2003-02-26 2004-08-26 Cohen Michael A. Multiple registrars
US7260568B2 (en) * 2004-04-15 2007-08-21 Microsoft Corporation Verifying relevance between keywords and web site contents
JP4713870B2 (en) * 2004-10-13 2011-06-29 Hewlett-Packard Development Company, L.P. Document classification apparatus, method, and program
US20070094500A1 (en) 2005-10-20 2007-04-26 Marvin Shannon System and Method for Investigating Phishing Web Sites
US8683031B2 (en) 2004-10-29 2014-03-25 Trustwave Holdings, Inc. Methods and systems for scanning and monitoring content on a network
US20060149710A1 (en) * 2004-12-30 2006-07-06 Ross Koningstein Associating features with entities, such as categories of web page documents, and/or weighting such features
US7707203B2 (en) * 2005-03-11 2010-04-27 Yahoo! Inc. Job seeking system and method for managing job listings
US20060212142A1 (en) * 2005-03-16 2006-09-21 Omid Madani System and method for providing interactive feature selection for training a document classification system
US7519588B2 (en) * 2005-06-20 2009-04-14 Efficient Frontier Keyword characterization and application
US7971147B2 (en) 2005-07-25 2011-06-28 Billeo, Inc. Methods and systems for automatically creating a site menu
US8041662B2 (en) 2007-08-10 2011-10-18 Microsoft Corporation Domain name geometrical classification using character-based n-grams

Also Published As

Publication number Publication date
WO2009023583A2 (en) 2009-02-19
US20090043720A1 (en) 2009-02-12
US8005782B2 (en) 2011-08-23

Similar Documents

Publication Publication Date Title
WO2009023583A3 (en) Domain name statistical classification using character-based n-grams
Chowdhury et al. Tweet4act: Using incident-specific profiles for classifying crisis-related messages.
WO2007143223A3 (en) System and method for entity based information categorization
WO2009029905A3 (en) Identification of semantic relationships within reported speech
WO2012177794A3 (en) Identifying information related to a particular entity from electronic sources, using dimensional reduction and quantum clustering
CN103176962B (en) The statistical method of text similarity and system
WO2007106403A3 (en) Methods and systems to generate rules to identify data items
WO2014008247A3 (en) Systems and methods for detecting tax refund fraud
WO2007149341A3 (en) System to associate a demographic to a user of an electronic system
WO2008144964A8 (en) Detecting name entities and new words
EP1703416A3 (en) Method and Computer-Readable Medium for Providing Spreadsheet-Driven Key Performance Indicators
WO2012068238A3 (en) Shipping system and method with taxonomic tariff harmonization
WO2012148950A3 (en) Representing information from documents
EP1840766A3 (en) Systems and methods for a distributed in-memory database and distributed cache
WO2012050887A3 (en) Systems and methods for navigating electronic texts
WO2010096193A3 (en) Identifying a document by performing spectral analysis on the contents of the document
WO2012129149A3 (en) Aggregating search results based on associating data instances with knowledge base entities
WO2013112062A8 (en) Systems and methods for spam detection using character histograms
WO2008046063A3 (en) Methods and apparatuses for searching and categorizing messages within a network system
CA2879417A1 (en) Structured search queries based on social-graph information
CN102624703A (en) Method and device for filtering uniform resource locators (URLs)
WO2009036392A3 (en) Multi-modal relevancy matching
EP2234048A3 (en) Suggesting potential custodians for cases in an enterprise-wide electronic discovery system
Pitel et al. Count-min-log sketch: Approximately counting with approximate counters
RU2013156261A (en) METHOD OF CONSTRUCTION AND DETECTION OF THE THEMATIC STRUCTURE OF THE HOUSING

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 08797522; Country of ref document: EP; Kind code of ref document: A2

NENP Non-entry into the national phase
Ref country code: DE

122 EP: PCT application non-entry in European phase
Ref document number: 08797522; Country of ref document: EP; Kind code of ref document: A2