CA2593378A1 - Local item extraction

Info

Publication number: CA2593378A1
Application number: CA002593378A
Authority: CA
Inventors: Michael Dennis Riley
Original assignee: Google Inc.; Michael Dennis Riley
Current assignee: Google LLC
Priority date: 2004-12-30
Filing date: 2005-12-30
Publication date: 2006-07-13
Anticipated expiration: 2025-12-30
Also published as: US7831438B2; AU2005322850A1; US20110047151A1; EP2372584A1; CA2593378C; CN101128819B; US8433704B2; AU2005322850B2; KR100974905B1; JP2011129154A; AU2005322850C1; US20060149565A1; CN101128819A; JP5226095B2; EP1839211A1; WO2006074052A1; KR20070092755A; JP2008527502A

Abstract

A system identifies a document that includes an address and locates business information in the document. The system assigns a confidence score to the business information, where the confidence score relates to a probability that the business information is associated with the address. The system determines whether to associate the business information with the address based on the assigned confidence score.

Claims

WHAT IS CLAIMED IS:

1. A method, comprising:
identifying a document that includes an address;
locating business information in the document;
assigning a confidence score to the business information, the confidence score relating to a probability that the business information is associated with the address; and
determining whether to associate the business information with the address based on the assigned confidence score.
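
By way of illustration only, the final two steps of claim 1 might be sketched as a threshold test on the confidence score. The names `Candidate` and `associate` and the threshold value of 0.5 are assumptions made for this sketch; the claim does not specify how the score is computed or what decision rule is applied.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Candidate:
    text: str          # business information located in the document
    confidence: float  # assumed estimate of P(text is associated with the address)


def associate(address: str, candidate: Candidate,
              threshold: float = 0.5) -> Optional[Tuple[str, str]]:
    """Return (address, business information) if the score clears the
    threshold, otherwise None. The threshold is a hypothetical parameter."""
    if candidate.confidence >= threshold:
        return (address, candidate.text)
    return None


print(associate("12 Elm St", Candidate("Joe's Pizza", 0.92)))
print(associate("12 Elm St", Candidate("Site map", 0.08)))
```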

2. The method of claim 1, wherein the business information corresponds to a title; and wherein locating business information in the document includes:
analyzing a plurality of terms that precede the address in the document, determining a probability that each of the terms is part of a title associated with the address, and identifying a candidate title based on one or more of the terms that have a high probability of being part of a title associated with the address.

3. The method of claim 2, wherein the plurality of terms include a first term that immediately precedes the address in the document and one or more second terms that precede the first term in the document.

4. The method of claim 2, wherein determining a probability that each of the terms is included in a title includes:
predicting whether one of the terms is part of the title, and predicting whether another one of the terms is part of the title based on the prediction regarding the one of the terms.
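
By way of illustration only, the chained prediction of claim 4 might be sketched as a walk outward from the address, in which the prediction for each term is conditioned on the prediction for its neighbour. The discount factor `carry`, the threshold, and the per-term base scores are assumptions for this sketch, not values taken from the claims.

```python
def label_title_terms(base_probs, carry=0.5, threshold=0.5):
    """base_probs: assumed per-term scores, ordered from the term nearest
    the address outward. Returns (probability, in_title) pairs in that order."""
    results = []
    prev_in_title = True  # assumption: the address itself anchors the chain
    for p in base_probs:
        # Assumed rule: once a neighbouring term is rejected, later terms
        # are discounted, so the title tends to be a contiguous run.
        prob = p if prev_in_title else p * carry
        in_title = prob >= threshold
        results.append((prob, in_title))
        prev_in_title = in_title
    return results


for prob, in_title in label_title_terms([0.9, 0.8, 0.3, 0.7]):
    print(f"{prob:.2f} {in_title}")
```

In this sketch the fourth term has a base score of 0.7 but is rejected, because the prediction for the third term was negative.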

5. The method of claim 2, wherein the probability that one of the terms is included in a title is based on a window of terms around the term.

6. The method of claim 2, wherein the probability that one of the terms is included in a title is based on the probability associated with another one of the terms.

7. The method of claim 2, wherein the probability that one of the terms is included in a title is based on a set of features associated with the term.

8. The method of claim 7, wherein the set of features includes at least one of a distance of the term from the address, characteristics of the term, boundary information between the term and a preceding or following term, or punctuation information between the term and a preceding or following term.
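
By way of illustration only, the features listed in claim 8 might be computed per term as follows. The feature names and the specific tests (capitalization, digits, trailing punctuation on the preceding term) are assumptions chosen for this sketch; the claim names only the feature categories.

```python
import string


def term_features(terms, index):
    """Features for terms[index], where terms is the list of tokens that
    precede the address, in document order."""
    term = terms[index]
    prev_term = terms[index - 1] if index > 0 else ""
    return {
        # distance of the term from the address (1 = immediately precedes it)
        "distance_from_address": len(terms) - index,
        # characteristics of the term
        "is_capitalized": term[:1].isupper(),
        "is_numeric": term.isdigit(),
        # punctuation between the term and the preceding term
        "prev_ends_with_punct": bool(prev_term) and prev_term[-1] in string.punctuation,
        # boundary information: nothing precedes the first term
        "is_first_term": index == 0,
    }


print(term_features(["Welcome", "to:", "Joe's", "Pizza"], 2))
```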

9. The method of claim 2, wherein the probability that one of the terms is included in a title is determined from a statistical model generated by analyzing features associated with a plurality of documents with known addresses and associated titles.

10. The method of claim 1, wherein the business information corresponds to a title; and wherein locating business information in the document includes:
analyzing a plurality of terms that precede the address in the document, determining a probability that each of the terms is part of a title associated with the address, and identifying a plurality of candidate titles based on one or more groups of the terms that have a high probability of being part of a title associated with the address.

11. The method of claim 1, wherein the business information corresponds to a telephone number; and wherein locating business information in the document includes:
identifying a set of candidate telephone numbers in the document, and determining a probability that each candidate telephone number in the set of candidate telephone numbers is associated with the address.

12. The method of claim 11, wherein the probability that one of the candidate telephone numbers is associated with the address is based on a set of features associated with the candidate telephone number.

13. The method of claim 12, wherein the set of features includes at least one of a distance of the candidate telephone number from the address, boundary information between the candidate telephone number and the address, whether a common telephone number term appears before the candidate telephone number, whether a common facsimile number term appears before the candidate telephone number, or whether another candidate telephone number exists between the candidate telephone number and the address.
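
By way of illustration only, the telephone-number features of claim 13 might be extracted as follows. The regular expression (a simple North American pattern), the 12-character look-behind window, and the cue words are assumptions for this sketch; the sketch also assumes the address string occurs in the text.

```python
import re

# Assumed pattern: 3-3-4 digit groups separated by '-', '.', or a space.
PHONE_RE = re.compile(r"\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}")


def phone_features(text, address):
    """Return one feature dict per candidate telephone number in text."""
    addr_pos = text.find(address)
    matches = list(PHONE_RE.finditer(text))
    feats = []
    for m in matches:
        # Text just before the candidate, searched for cue words.
        before = text[max(0, m.start() - 12):m.start()].lower()
        lo, hi = sorted((m.start(), addr_pos))
        feats.append({
            "number": m.group(),
            "distance": abs(m.start() - addr_pos),
            "phone_word_before": any(w in before for w in ("tel", "phone", "call")),
            "fax_word_before": "fax" in before,
            "other_number_between": any(
                lo < o.start() < hi for o in matches if o is not m),
        })
    return feats


page = "Joe's Pizza, 12 Elm St. Phone: 555-123-4567 Fax: 555-123-9999"
for f in phone_features(page, "12 Elm St"):
    print(f)
```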

14. The method of claim 12, wherein the probability that one of the candidate telephone numbers is associated with the address is determined from a statistical model generated by analyzing features associated with a plurality of documents with known addresses and associated telephone numbers.

15. The method of claim 1, wherein the probability that the business information is associated with the address is determined from a statistical model generated by analyzing features associated with a plurality of documents with known addresses and associated business information.

16. The method of claim 1, wherein the business information includes at least one of a title, a telephone number, business hours, or a link to a web site or map associated with the address.

17. The method of claim 1, wherein determining whether to associate the business information with the address includes:
analyzing strings of terms in the document, and determining one of the strings that maximizes a probability that the terms of the string include the business information.
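
By way of illustration only, the selection in claim 17 might be sketched by scoring each string of terms that ends immediately before the address and keeping the highest-scoring one. Restricting the candidates to such strings, and scoring them as a product of independent per-term probabilities, are assumptions made for this sketch.

```python
def best_title(terms, probs):
    """terms precede the address, in document order; probs[i] is an assumed
    probability that terms[i] is part of the title. Returns the candidate
    string with the highest score, together with that score."""
    best, best_score = None, 0.0
    for start in range(len(terms)):
        score = 1.0
        for p in probs[start:]:
            score *= p  # every term inside the string is in the title
        if start > 0:
            score *= 1.0 - probs[start - 1]  # the term just outside is not
        if score > best_score:
            best, best_score = " ".join(terms[start:]), score
    return best, best_score


print(best_title(["Visit", "Joe's", "Pizza"], [0.1, 0.9, 0.8]))
```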

18. The method of claim 1, further comprising:
creating or supplementing a business listing based on the business information and the address when the business information is associated with the address.
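
By way of illustration only, creating or supplementing a business listing as in claim 18 might be sketched with a dictionary keyed by address. The function name `upsert_listing` and the rule that existing fields are never overwritten are assumptions for this sketch.

```python
def upsert_listing(listings, address, **info):
    """listings maps an address to a dict of business information."""
    # Create the listing if none exists for this address.
    listing = listings.setdefault(address, {"address": address})
    # Supplement: fill in only the fields that are still missing.
    for key, value in info.items():
        listing.setdefault(key, value)
    return listing


listings = {}
upsert_listing(listings, "12 Elm St", title="Joe's Pizza")
upsert_listing(listings, "12 Elm St", phone="555-123-4567")
print(listings)
```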

19. A system, comprising:
means for identifying a document that includes an address;
means for locating one or more business information candidates in the document;
means for assigning a confidence score to each of the one or more business information candidates, the confidence score associated with one of the business information candidates relating to a probability that the business information candidate is associated with the address; and
means for determining whether to associate one of the one or more business information candidates with the address based on the assigned confidence score.

20. A system, comprising:
a memory to store a statistical model; and
a processor, connected to the memory, to:
identify a document that includes an address, identify business information in the document, predict whether the business information is associated with the address based on the statistical model, and determine whether to associate the business information with the address based on the prediction.

21. The system of claim 20, wherein the business information corresponds to a title; and wherein when identifying business information in the document, the processor is configured to:
analyze a plurality of terms that precede the address in the document, determine a probability that each of the terms is part of a title associated with the address based on the statistical model, and identify a candidate title based on one or more of the terms that have a high probability of being part of a title associated with the address.

22. The system of claim 21, wherein the plurality of terms includes a first term that immediately precedes the address in the document and one or more second terms that precede the first term in the document.

23. The system of claim 21, wherein when determining a probability that each of the terms is included in a title, the processor is configured to:
predict whether one of the terms is part of the title, and predict whether another one of the terms is part of the title based on the prediction regarding the one of the terms.

24. The system of claim 21, wherein the probability that one of the terms is included in a title is based on a window of terms around the term.

25. The system of claim 21, wherein the probability that one of the terms is included in a title is based on the probability associated with another one of the terms.

26. The system of claim 21, wherein the probability that one of the terms is included in a title is based on a set of features associated with the term.

27. The system of claim 26, wherein the set of features includes at least one of a distance of the term from the address, characteristics of the term, boundary information between the term and a preceding or following term, or punctuation information between the term and a preceding or following term.

28. The system of claim 20, wherein the statistical model is generated by analyzing features associated with a plurality of documents with known addresses and associated titles.

29. The system of claim 20, wherein the business information corresponds to a title; and wherein when identifying business information in the document, the processor is configured to:
analyze a plurality of terms that precede the address in the document, determine a probability that each of the terms is part of a title associated with the address, and identify a plurality of candidate titles based on one or more groups of the terms that have a high probability of being part of a title associated with the address.

30. The system of claim 20, wherein the business information corresponds to a telephone number; and wherein when identifying business information in the document, the processor is configured to:
identify a set of candidate telephone numbers in the document, and determine a probability that each candidate telephone number in the set of candidate telephone numbers is associated with the address based on the statistical model.

31. The system of claim 30, wherein the probability that one of the candidate telephone numbers is associated with the address is based on a set of features associated with the candidate telephone number.

32. The system of claim 31, wherein the set of features includes at least one of a distance of the candidate telephone number from the address, boundary information between the candidate telephone number and the address, whether a common telephone number term appears before the candidate telephone number, whether a common facsimile number term appears before the candidate telephone number, or whether another candidate telephone number exists between the candidate telephone number and the address.

33. The system of claim 31, wherein the statistical model is generated by analyzing features associated with a plurality of documents with known addresses and associated telephone numbers.

34. The system of claim 20, wherein the statistical model is generated by analyzing features associated with a plurality of documents with known addresses and associated business information.

35. The system of claim 20, wherein the business information includes at least one of a title, a telephone number, business hours, or a link to a web site or map associated with the address.

36. The system of claim 20, wherein when determining whether to associate the business information with the address, the processor is configured to:
analyze strings of terms in the document, and determine one of the strings that maximizes a probability that the terms of the string include the business information.

37. The system of claim 20, wherein the processor is further configured to create or supplement a business listing based on the business information and the address when the business information is associated with the address.

38. A method, comprising:
identifying a document that includes an address;
identifying a plurality of terms that precede the address in the document;
determining a probability that each of the terms is part of a title associated with the address;
identifying a candidate title based on one or more of the terms that have a high probability of being part of a title associated with the address;
assigning a confidence score to the candidate title; and
determining whether to associate the candidate title with the address based on the assigned confidence score.

39. A method, comprising:
identifying a document that includes an address;
identifying a set of candidate telephone numbers in the document;
determining a probability that each candidate telephone number in the set of candidate telephone numbers is associated with the address; and
determining whether to associate one of the candidate telephone numbers with the address based on the determined probability.

40. A method, comprising:
identifying a web page that includes a landmark;
identifying an attribute in the web page;
assigning a confidence score to the attribute, the confidence score relating to a probability that the attribute is associated with the landmark; and
determining whether to associate the attribute with the landmark based on the assigned confidence score.

41. The method of claim 40, wherein the landmark corresponds to a postal address and the attribute corresponds to information relating to one of a title, a telephone number, business hours, or a link to a web site or map associated with the postal address.

42. The method of claim 40, wherein the landmark corresponds to a product and the attribute corresponds to one of a price or a product identification number.