CA2593378A1 - Local item extraction - Google Patents

Local item extraction

Info

Publication number
CA2593378A1
CA2593378A1 CA002593378A CA2593378A CA2593378A1 CA 2593378 A1 CA2593378 A1 CA 2593378A1 CA 002593378 A CA002593378 A CA 002593378A CA 2593378 A CA2593378 A CA 2593378A CA 2593378 A1 CA2593378 A1 CA 2593378A1
Authority
CA
Canada
Prior art keywords
address
terms
title
business information
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CA002593378A
Other languages
French (fr)
Other versions
CA2593378C (en)
Inventor
Michael Dennis Riley
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google Inc.
Michael Dennis Riley
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Google Inc., Michael Dennis Riley
Publication of CA2593378A1
Application granted
Publication of CA2593378C
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions

Abstract

A system identifies a document that includes an address and locates business information in the document. The system assigns a confidence score to the business information, where the confidence score relates to a probability that the business information is associated with the address. The system determines whether to associate the business information with the address based on the assigned confidence score.
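The scoring-and-decision flow described in the abstract can be sketched as follows for the telephone-number case. This is only an illustrative sketch, not the patented implementation: the class `PhoneCandidate`, the functions `confidence` and `associate`, the logistic form, the hand-picked `WEIGHTS`, and the 0.5 threshold are all assumptions made for the example. The features themselves (distance from the address, a telephone or facsimile term before the candidate, another candidate in between) are taken from the claims.

```python
import math
from dataclasses import dataclass


@dataclass
class PhoneCandidate:
    number: str
    distance: int               # terms between the candidate and the address
    phone_label_before: bool    # e.g. "Tel" or "Phone" precedes the candidate
    fax_label_before: bool      # e.g. "Fax" precedes the candidate
    other_number_between: bool  # another candidate sits between it and the address


# Hypothetical weights, chosen by hand for illustration. A real system would
# learn them from documents with known addresses and telephone numbers.
WEIGHTS = {
    "bias": 1.0,
    "distance": -0.05,
    "phone_label": 1.5,
    "fax_label": -3.0,
    "other_between": -1.0,
}


def confidence(c: PhoneCandidate) -> float:
    """Map the candidate's features to a score in (0, 1) with a logistic function."""
    z = (
        WEIGHTS["bias"]
        + WEIGHTS["distance"] * c.distance
        + WEIGHTS["phone_label"] * c.phone_label_before
        + WEIGHTS["fax_label"] * c.fax_label_before
        + WEIGHTS["other_between"] * c.other_number_between
    )
    return 1.0 / (1.0 + math.exp(-z))


def associate(candidates, threshold=0.5):
    """Return the best-scoring candidate, or None if none reaches the threshold."""
    if not candidates:
        return None
    best = max(candidates, key=confidence)
    return best if confidence(best) >= threshold else None


phone = PhoneCandidate("555-0100", 3, True, False, False)
fax = PhoneCandidate("555-0199", 5, False, True, False)
chosen = associate([phone, fax])
```

With these assumed weights the candidate preceded by a telephone term outscores the one preceded by a facsimile term, and a facsimile-only page yields no association at all.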

Claims (42)

WHAT IS CLAIMED IS:
1. A method, comprising:
identifying a document that includes an address;
locating business information in the document;
assigning a confidence score to the business information, the confidence score relating to a probability that the business information is associated with the address; and
determining whether to associate the business information with the address based on the assigned confidence score.
2. The method of claim 1, wherein the business information corresponds to a title; and wherein locating business information in the document includes:
analyzing a plurality of terms that precede the address in the document, determining a probability that each of the terms is part of a title associated with the address, and identifying a candidate title based on one or more of the terms that have a high probability of being part of a title associated with the address.
3. The method of claim 2, wherein the plurality of terms includes a first term that immediately precedes the address in the document and one or more second terms that precede the first term in the document.
4. The method of claim 2, wherein determining a probability that each of the terms is included in a title includes:
predicting whether one of the terms is part of the title, and predicting whether another one of the terms is part of the title based on the prediction regarding the one of the terms.
5. The method of claim 2, wherein the probability that one of the terms is included in a title is based on a window of terms around the term.
6. The method of claim 2, wherein the probability that one of the terms is included in a title is based on the probability associated with another one of the terms.
7. The method of claim 2, wherein the probability that one of the terms is included in a title is based on a set of features associated with the term.
8. The method of claim 7, wherein the set of features includes at least one of a distance of the term from the address, characteristics of the term, boundary information between the term and a preceding or following term, or punctuation information between the term and a preceding or following term.
9. The method of claim 2, wherein the probability that one of the terms is included in a title is determined from a statistical model generated by analyzing features associated with a plurality of documents with known addresses and associated titles.
10. The method of claim 1, wherein the business information corresponds to a title; and wherein locating business information in the document includes:
analyzing a plurality of terms that precede the address in the document, determining a probability that each of the terms is part of a title associated with the address, and identifying a plurality of candidate titles based on one or more groups of the terms that have a high probability of being part of a title associated with the address.
11. The method of claim 1, wherein the business information corresponds to a telephone number; and wherein locating business information in the document includes:
identifying a set of candidate telephone numbers in the document, and determining a probability that each candidate telephone number in the set of candidate telephone numbers is associated with the address.
12. The method of claim 11, wherein the probability that one of the candidate telephone numbers is associated with the address is based on a set of features associated with the candidate telephone number.
13. The method of claim 12, wherein the set of features includes at least one of a distance of the candidate telephone number from the address, boundary information between the candidate telephone number and the address, whether a common telephone number term appears before the candidate telephone number, whether a common facsimile number term appears before the candidate telephone number, or whether another candidate telephone number exists between the candidate telephone number and the address.
14. The method of claim 12, wherein the probability that one of the candidate telephone numbers is associated with the address is determined from a statistical model generated by analyzing features associated with a plurality of documents with known addresses and associated telephone numbers.
15. The method of claim 1, wherein the probability that the business information is associated with the address is determined from a statistical model generated by analyzing features associated with a plurality of documents with known addresses and associated business information.
16. The method of claim 1, wherein the business information includes at least one of a title, a telephone number, business hours, or a link to a web site or map associated with the address.
17. The method of claim 1, wherein determining whether to associate the business information with the address includes:
analyzing strings of terms in the document, and determining one of the strings that maximizes a probability that the terms of the string include the business information.
18. The method of claim 1, further comprising:
creating or supplementing a business listing based on the business information and the address when the business information is associated with the address.
19. A system, comprising:
means for identifying a document that includes an address;
means for locating one or more business information candidates in the document;
means for assigning a confidence score to each of the one or more business information candidates, the confidence score associated with one of the business information candidates relating to a probability that the business information candidate is associated with the address; and
means for determining whether to associate one of the one or more business information candidates with the address based on the assigned confidence score.
20. A system, comprising:
a memory to store a statistical model; and
a processor, connected to the memory, to:
identify a document that includes an address, identify business information in the document, predict whether the business information is associated with the address based on the statistical model, and determine whether to associate the business information with the address based on the prediction.
21. The system of claim 20, wherein the business information corresponds to a title; and wherein when identifying business information in the document, the processor is configured to:
analyze a plurality of terms that precede the address in the document, determine a probability that each of the terms is part of a title associated with the address based on the statistical model, and identify a candidate title based on one or more of the terms that have a high probability of being part of a title associated with the address.
22. The system of claim 21, wherein the plurality of terms includes a first term that immediately precedes the address in the document and one or more second terms that precede the first term in the document.
23. The system of claim 21, wherein when determining a probability that each of the terms is included in a title, the processor is configured to:
predict whether one of the terms is part of the title, and predict whether another one of the terms is part of the title based on the prediction regarding the one of the terms.
24. The system of claim 21, wherein the probability that one of the terms is included in a title is based on a window of terms around the term.
25. The system of claim 21, wherein the probability that one of the terms is included in a title is based on the probability associated with another one of the terms.
26. The system of claim 21, wherein the probability that one of the terms is included in a title is based on a set of features associated with the term.
27. The system of claim 26, wherein the set of features includes at least one of a distance of the term from the address, characteristics of the term, boundary information between the term and a preceding or following term, or punctuation information between the term and a preceding or following term.
28. The system of claim 20, wherein the statistical model is generated by analyzing features associated with a plurality of documents with known addresses and associated titles.
29. The system of claim 20, wherein the business information corresponds to a title; and wherein when identifying business information in the document, the processor is configured to:
analyze a plurality of terms that precede the address in the document, determine a probability that each of the terms is part of a title associated with the address, and identify a plurality of candidate titles based on one or more groups of the terms that have a high probability of being part of a title associated with the address.
30. The system of claim 20, wherein the business information corresponds to a telephone number; and wherein when identifying business information in the document, the processor is configured to:
identify a set of candidate telephone numbers in the document, and determine a probability that each candidate telephone number in the set of candidate telephone numbers is associated with the address based on the statistical model.
31. The system of claim 30, wherein the probability that one of the candidate telephone numbers is associated with the address is based on a set of features associated with the candidate telephone number.
32. The system of claim 31, wherein the set of features includes at least one of a distance of the candidate telephone number from the address, boundary information between the candidate telephone number and the address, whether a common telephone number term appears before the candidate telephone number, whether a common facsimile number term appears before the candidate telephone number, or whether another candidate telephone number exists between the candidate telephone number and the address.
33. The system of claim 31, wherein the statistical model is generated by analyzing features associated with a plurality of documents with known addresses and associated telephone numbers.
34. The system of claim 20, wherein the statistical model is generated by analyzing features associated with a plurality of documents with known addresses and associated business information.
35. The system of claim 20, wherein the business information includes at least one of a title, a telephone number, business hours, or a link to a web site or map associated with the address.
36. The system of claim 20, wherein when determining whether to associate the business information with the address, the processor is configured to:
analyze strings of terms in the document, and determine one of the strings that maximizes a probability that the terms of the string include the business information.
37. The system of claim 20, wherein the processor is further configured to create or supplement a business listing based on the business information and the address when the business information is associated with the address.
38. A method, comprising:
identifying a document that includes an address;
identifying a plurality of terms that precede the address in the document;
determining a probability that each of the terms is part of a title associated with the address;
identifying a candidate title based on one or more of the terms that have a high probability of being part of a title associated with the address;
assigning a confidence score to the candidate title; and
determining whether to associate the candidate title with the address based on the assigned confidence score.
39. A method, comprising:
identifying a document that includes an address;
identifying a set of candidate telephone numbers in the document;
determining a probability that each candidate telephone number in the set of candidate telephone numbers is associated with the address; and
determining whether to associate one of the candidate telephone numbers with the address based on the determined probability.
40. A method, comprising:
identifying a web page that includes a landmark;
identifying an attribute in the web page;
assigning a confidence score to the attribute, the confidence score relating to a probability that the attribute is associated with the landmark; and
determining whether to associate the attribute with the landmark based on the assigned confidence score.
41. The method of claim 40, wherein the landmark corresponds to a postal address and the attribute corresponds to information relating to one of a title, a telephone number, business hours, or a link to a web site or map associated with the postal address.
42. The method of claim 40, wherein the landmark corresponds to a product and the attribute corresponds to one of a price or a product identification number.
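The title-related claims describe scanning the terms that precede an address, estimating for each term the probability that it belongs to the title, and letting each estimate depend on the prediction for a neighboring term. The following is a minimal sketch of that idea under stated assumptions: the function names `term_probability` and `candidate_title`, the three features used (capitalization, distance from the address, the previous prediction), the numeric weights, and the 0.5 threshold are all hypothetical and stand in for a statistical model trained on documents with known addresses and titles.

```python
import math


def term_probability(term: str, distance: int, prev_in_title: bool) -> float:
    """Hypothetical per-term score: probability that `term` is part of the title."""
    score = 0.0
    # Characteristic of the term: capitalized words are more title-like.
    score += 1.5 if term[:1].isupper() else -1.5
    # Distance from the address: farther terms are less likely to belong.
    score -= 0.3 * distance
    # Dependence on the prediction for the neighboring (closer) term.
    score += 1.0 if prev_in_title else -2.0
    return 1.0 / (1.0 + math.exp(-score))


def candidate_title(terms_before_address, threshold=0.5) -> str:
    """Walk backward from the address and keep terms predicted to be in the title."""
    title = []
    prev_in_title = True  # the address itself anchors the first prediction
    for distance, term in enumerate(reversed(terms_before_address)):
        p = term_probability(term, distance, prev_in_title)
        prev_in_title = p >= threshold
        if prev_in_title:
            title.append(term)
    return " ".join(reversed(title))


# Terms that precede a (hypothetical) address such as "123 Main St".
title = candidate_title(["Welcome", "to", "Joe's", "Pizza"])
```

With these assumed weights, once a term is predicted to fall outside the title, every term farther from the address is rejected as well, so the candidate is always a contiguous run ending at the address.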

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US11/024,765 2004-12-30
US11/024,765 US7831438B2 (en) 2004-12-30 2004-12-30 Local item extraction
PCT/US2005/047391 WO2006074052A1 (en) 2004-12-30 2005-12-30 Local item extraction

Publications (2)

Publication Number Publication Date
CA2593378A1 (en) 2006-07-13
CA2593378C (en) 2012-06-05

Family

ID=36218348

Family Applications (1)

Application Number Title Priority Date Filing Date
CA2593378A Expired - Fee Related CA2593378C (en) 2004-12-30 2005-12-30 Local item extraction

Country Status (8)

Country Link
US (2) US7831438B2 (en)
EP (2) EP1839211A1 (en)
JP (2) JP2008527502A (en)
KR (1) KR100974905B1 (en)
CN (1) CN101128819B (en)
AU (1) AU2005322850C1 (en)
CA (1) CA2593378C (en)
WO (1) WO2006074052A1 (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7831438B2 (en) * 2004-12-30 2010-11-09 Google Inc. Local item extraction
US8731954B2 (en) 2006-03-27 2014-05-20 A-Life Medical, Llc Auditing the coding and abstracting of documents
US8682823B2 (en) 2007-04-13 2014-03-25 A-Life Medical, Llc Multi-magnitudinal vectors with resolution based on source vector features
US7908552B2 (en) 2007-04-13 2011-03-15 A-Life Medical Inc. Mere-parsing with boundary and semantic driven scoping
WO2008129339A1 (en) * 2007-04-18 2008-10-30 Mitsco - Seekport Fz-Llc Method for location identification in web pages and location-based ranking of internet search results
US9946846B2 (en) 2007-08-03 2018-04-17 A-Life Medical, Llc Visualizing the documentation and coding of surgical procedures
US20090182759A1 (en) * 2008-01-11 2009-07-16 Yahoo! Inc. Extracting entities from a web page
US8812362B2 (en) * 2009-02-20 2014-08-19 Yahoo! Inc. Method and system for quantifying user interactions with web advertisements
US8468144B2 (en) * 2010-03-19 2013-06-18 Honeywell International Inc. Methods and apparatus for analyzing information to identify entities of significance
US10541053B2 (en) 2013-09-05 2020-01-21 Optum360, LLC Automated clinical indicator recognition with natural language processing
US10133727B2 (en) 2013-10-01 2018-11-20 A-Life Medical, Llc Ontologically driven procedure coding
US9317873B2 (en) 2014-03-28 2016-04-19 Google Inc. Automatic verification of advertiser identifier in advertisements
US20150287099A1 (en) 2014-04-07 2015-10-08 Google Inc. Method to compute the prominence score to phone numbers on web pages and automatically annotate/attach it to ads
US11115529B2 (en) 2014-04-07 2021-09-07 Google Llc System and method for providing and managing third party content with call functionality
US10469424B2 (en) 2016-10-07 2019-11-05 Google Llc Network based data traffic latency reduction
CN109933785B (en) * 2019-02-03 2023-06-20 北京百度网讯科技有限公司 Method, apparatus, device and medium for entity association

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6701307B2 (en) 1998-10-28 2004-03-02 Microsoft Corporation Method and apparatus of expanding web searching capabilities
US6374241B1 (en) * 1999-03-31 2002-04-16 Verizon Laboratories Inc. Data merging techniques
EP1269357A4 (en) 2000-02-22 2005-10-12 Metacarta Inc Spatially coding and displaying information
US20020156779A1 (en) 2001-09-28 2002-10-24 Elliott Margaret E. Internet search engine
US6965900B2 (en) 2001-12-19 2005-11-15 X-Labs Holdings, Llc Method and apparatus for electronically extracting application specific multidimensional information from documents selected from a set of documents electronically extracted from a library of electronically searchable documents
JP4005477B2 (en) 2002-05-15 2007-11-07 日本電信電話株式会社 Named entity extraction apparatus and method, and numbered entity extraction program
EP1540524A2 (en) 2002-08-05 2005-06-15 Metacarta, Inc. Desktop client interaction with a geographic text search system
CA2519236A1 (en) 2003-03-18 2004-09-30 Metacarta, Inc. Corpus clustering, confidence refinement, and ranking for geographic text search and information retrieval
CN1536483A (en) * 2003-04-04 2004-10-13 陈文中 Method for extracting and processing network information and its system
US8346770B2 (en) * 2003-09-22 2013-01-01 Google Inc. Systems and methods for clustering search results
US7349901B2 (en) * 2004-05-21 2008-03-25 Microsoft Corporation Search engine spam detection using external data
US7831438B2 (en) * 2004-12-30 2010-11-09 Google Inc. Local item extraction
CA2702439C (en) * 2006-12-20 2017-01-31 Victor David Uy Method and apparatus for scoring electronic documents
US7877385B2 (en) * 2007-09-21 2011-01-25 Microsoft Corporation Information retrieval using query-document pair information
WO2010141799A2 (en) * 2009-06-05 2010-12-09 West Services Inc. Feature engineering and user behavior analysis

Also Published As

Publication number Publication date
US7831438B2 (en) 2010-11-09
AU2005322850A1 (en) 2006-07-13
US20110047151A1 (en) 2011-02-24
EP2372584A1 (en) 2011-10-05
CA2593378C (en) 2012-06-05
CN101128819B (en) 2011-06-22
US8433704B2 (en) 2013-04-30
AU2005322850B2 (en) 2010-02-11
KR100974905B1 (en) 2010-08-09
JP2011129154A (en) 2011-06-30
AU2005322850C1 (en) 2010-07-15
US20060149565A1 (en) 2006-07-06
CN101128819A (en) 2008-02-20
JP5226095B2 (en) 2013-07-03
EP1839211A1 (en) 2007-10-03
WO2006074052A1 (en) 2006-07-13
KR20070092755A (en) 2007-09-13
JP2008527502A (en) 2008-07-24

Similar Documents

Publication Publication Date Title
CA2593378A1 (en) Local item extraction
US7251644B2 (en) Processing an electronic document for information extraction
KR101219366B1 (en) Classification of ambiguous geographic references
JP3639126B2 (en) Address recognition device and address recognition method
US20080109403A1 (en) Method of describing the structure of graphical objects.
JP2009524852A5 (en)
CN103544186B (en) The method and apparatus excavating the subject key words in picture
JPH08171614A (en) Character string reader
CN110489578A (en) Image processing method, device and computer equipment
CN104102639A (en) Text classification based promotion triggering method and device
CN106407450A (en) File searching method and apparatus
CN112214737B (en) Method, system, device and medium for identifying picture-based fraudulent webpage
CN115438340A (en) Mining behavior identification method and system based on morpheme characteristics
CN107908724A (en) A kind of data model matching process, device, equipment and storage medium
CN111723296A (en) Search processing method and device and computer equipment
CN105677827A (en) Method and device for obtaining form
US20090252415A1 (en) Method for retrieving text blocks in documents
KR20120019706A (en) System for recognizing adress of mailings
KR100774547B1 (en) Method and system for providing search information useing search-result caching
JP2009146158A (en) Excessive structure elimination method of document classification device
KR20230092048A (en) System and method for collecting business information and computer program for the same
JP2000029877A (en) Method and device for analyzing document structure and storage medium storing document structure analyzing program
JPH07116606A (en) Device and method for recognizing mail address
JPH02151984A (en) Image recognizing system
JP3800165B2 (en) Reading estimation system, reading estimation method, and reading estimation program

Legal Events

Date Code Title Description
EEER Examination request
MKLA Lapsed, effective date: 20161230