US20080077588A1 - Identifying and measuring related queries - Google Patents

Identifying and measuring related queries

Publication number
US20080077588A1
Authority
US
United States
Prior art keywords
query
queries
chinese
search
keywords
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US11/948,374
Inventor
Wei Zhang
Rosie Jones
Benjamin Rey
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yahoo Inc
Original Assignee
Yahoo Inc until 2017
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yahoo Inc until 2017
Priority to US11/948,374
Assigned to YAHOO! INC. reassignment YAHOO! INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: REY, BENJAMIN, JONES, ROSIE, ZHANG, WEI VIVIAN
Publication of US20080077588A1
Assigned to YAHOO HOLDINGS, INC. reassignment YAHOO HOLDINGS, INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: YAHOO! INC.
Assigned to OATH INC. reassignment OATH INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: YAHOO HOLDINGS, INC.

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00: Commerce
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/3332: Query translation
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10: TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S: TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00: Data processing: database and file management or data structures
    • Y10S707/99931: Database or file accessing
    • Y10S707/99933: Query processing, i.e. searching
    • Y10S707/99934: Query formulation, input preparation, or translation

Definitions

  • Online advertising may be an important source of revenue for enterprises engaged in electronic commerce.
  • a number of different kinds of web page based online advertisements are currently in use, along with various associated distribution requirements, advertising metrics, and pricing mechanisms.
  • Processes associated with technologies such as Hypertext Markup Language (HTML) and Hypertext Transfer Protocol (HTTP) enable a web page to be configured to contain a location for inclusion of an advertisement.
  • a page may not only be a web page, but any other electronically created page or document.
  • An advertisement can be selected for display each time the page is requested, for example, by a browser or server application.
  • Online advertising may be linked to online searching.
  • Online searching is a common way for consumers to locate information, goods, or services on the Internet.
  • a consumer may use an online search engine to type in a query to search for other pages or web sites with information related to that query.
  • the search may be referred to as a sponsored search.
  • Sponsored searching may require advertisers to bid for search keywords.
  • the search keywords are associated with the search query for displaying advertisements with the search results. It may be difficult to identify which keyword(s) a search query is related to. In particular, users may enter search queries that are misspelled or that are in a different language.
  • FIG. 1 is a block diagram of an exemplary network system
  • FIG. 2 is a block diagram of a language analyzer
  • FIG. 3 is a block diagram of exemplary conversion forms
  • FIG. 4 is a block diagram of exemplary comparisons of queries
  • FIG. 5 is a flow diagram for identifying related queries.
  • FIG. 6 is a block diagram of a general computer system for use with the disclosed embodiments.
  • the embodiments described below include a system and method for identifying and measuring related queries.
  • the embodiments relate to identifying similar Chinese queries.
  • a user query may be compared with known search keywords or other search queries.
  • the search keywords may be used by advertisers for sponsored searching.
  • the user query may be a non-native language query, such as a Chinese related query in an English language website or a query in a Chinese website.
  • the user query is converted into a different form before comparing with other converted queries or the search keywords.
  • the embodiments are described in terms of a Chinese related query, but other languages or query platforms may be used.
  • a similarity score based on various features may be used for comparing the queries. Based on the similarity score or other comparison features, the original user query may be substituted by other queries or be associated with one or more search keywords.
  • the associated search keywords may be used for selecting the advertisements that are displayed with the search results for that search query.
  • related queries may be identified from a reformulation of the original query.
  • the reformulation may be based on stored query logs and used to compare the original query with stored queries.
  • various features, including language-specific features, may be used to measure query similarity.
  • the original query may be replaced by a stored query or search keyword for identifying the relevant advertisements to display.
  • a user's query may be misspelled and the system may identify a related query that is correctly spelled that replaces the initial user query.
  • Chinese related queries may be identified and measured due to an increased interest in Chinese search and advertising markets.
  • FIG. 1 provides a simplified view of a network system 100 in which the present embodiments may be implemented. Not all of the depicted components may be required, however, and some embodiments of the invention may include additional components not shown in the figure. Variations in the arrangement and type of the components may be made without departing from the spirit or scope of the claims as set forth herein. Additional, different or fewer components may be provided.
  • FIG. 1 is a block diagram illustrating an embodiment of an exemplary network system 100 for language analysis and comparison.
  • system 100 includes a language analyzer 104 that may receive and convert a user's search query for comparison with other queries or search keywords.
  • a user device 106 is coupled with a search engine 102 through the network 109 .
  • the search engine 102 is coupled with a search log database 112 , and both are coupled with the language analyzer 104 .
  • the search log database 112 is coupled with a data source 113 and a unit dictionary 116 .
  • An ad server 103 may be coupled with the search engine 102 and/or coupled with the language analyzer 104 .
  • the phrase “coupled with” is defined to mean directly connected to or indirectly connected through one or more intermediate components. Such intermediate components may include both hardware and software based components. Variations in the arrangement and type of the components may be made without departing from the spirit or scope of the claims as set forth herein.
  • the user device 106 may be a computing device for a user to connect to a network 109 , such as the Internet. Examples of a user device include but are not limited to a personal computer, personal digital assistant (“PDA”), cellular phone, or other electronic device.
  • the user device 106 may be configured to access other data/information in addition to web pages over the network 109 with a web browser, such as INTERNET EXPLORER® (sold by Microsoft Corp., Redmond, Wash.).
  • the user device 106 may enable a user to view pages over the network 109 , such as the Internet.
  • the user device 106 may be the user device described below with respect to FIG. 6 .
  • the user device 106 may be configured to allow a user to interact with the search engine 102 , the ad server 103 , or other components of the system 100 .
  • the user device 106 may receive and display a site or page provided by the search engine 102 .
  • the user device 106 may include a keyboard, keypad or a cursor control device, such as a mouse, or a joystick, touch screen display, remote control or any other device operative to allow a user to interact with the page(s) provided by the search engine 102 and/or the ad server 103 .
  • the search engine 102 is coupled with the user device 106 through the network 109 , as well as being coupled with the language analyzer 104 , the ad server 103 and/or the search log database 112 .
  • the search engine 102 is a web server.
  • the search engine 102 may provide a site or a page over a network, such as the network 109 or the Internet.
  • a site or page may refer to a web page or a series of related web pages which may be received or viewed over a network.
  • the site or page is not limited to a web page, and may include any information accessible over a network that may be displayed at the user device 106 .
  • a site may refer to a series of pages which are linked by a site map.
  • the web site of www.yahoo.com may include thousands of pages, which are included at yahoo.com.
  • a page will be described as a web page, a web site, or any other site/page accessible over a network.
  • a user of the user device 106 may access a page provided by the search engine 102 over the network 109 .
  • the page provided by the search engine 102 may be a search page that receives a search query from the user device 106 and provides search results that are based on the received search query.
  • the search engine 102 may include an interface, such as a web page, e.g., the web page which may be accessed on the World Wide Web at yahoo.com, which is used to search for pages which are accessible via the network 109 .
  • the user device 106 , autonomously or at the direction of the user, may input a search query (also referred to as a user query, original query, search term or a search keyword) for the search engine 102 .
  • a single search query may include multiple words or phrases.
  • the search engine 102 may perform a search for the search query and display the results of the search on the user device 106 .
  • the results of a search may include a listing of related pages or sites that is provided by the search engine 102 in response to receiving the search query.
  • the ad server 103 is coupled with the search engine 102 and/or the language analyzer 104 .
  • the ad server 103 may be configured to provide advertisements to the search engine 102 .
  • the search engine 102 and the ad server 103 may be a common component and/or the search engine 102 may select and provide advertisements.
  • the ad server 103 may include or be coupled with an advertisement database that includes advertisements that are available to be displayed by the search engine 102 for sponsored searching.
  • the advertisements may be associated with one or more search keywords.
  • the search keywords may be purchased or bid on by advertisers. Accordingly, when that search keyword is searched for, the advertiser who purchased or placed the highest bid is selected and their advertisement is displayed.
  • the ad server 103 may include or be coupled with a database, such as an advertisement database, that stores search keywords and the respective price or bid for each keyword from advertisers that is referenced for each search query.
  • a search query is received and compared with known search keywords or other search queries when the ad server 103 selects and provides the advertisement to the search engine 102 .
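  • The keyword bidding described above can be illustrated with a minimal sketch. The bid table, advertiser names, and the function `select_advertiser` are hypothetical and serve only to show the idea that the highest bid for a keyword wins; they are not taken from the disclosure.

```python
from typing import Optional

# Hypothetical bid table: keyword -> list of (advertiser, bid in dollars).
BIDS = {
    "art museums": [("advertiser_a", 0.40), ("advertiser_b", 0.55)],
    "law enforcement": [("advertiser_c", 0.25)],
}


def select_advertiser(keyword: str, bids=BIDS) -> Optional[str]:
    """Return the advertiser with the highest bid for the keyword, or None."""
    offers = bids.get(keyword.lower())
    if not offers:
        return None
    # Pick the offer whose bid (second tuple field) is largest.
    advertiser, _ = max(offers, key=lambda offer: offer[1])
    return advertiser
```

  • A query that matches no stored keyword returns None here, which is the case the language analyzer is meant to reduce by associating the query with a related keyword.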
  • the search log database 112 includes records or logs of at least a subset of the search queries entered in the search engine 102 over a period of time and may also be referred to as a search query log, search term database, keyword database or query database.
  • the search log database 112 may store the search keywords that are used by the ad server 103 in selecting an advertisement for a particular search query.
  • the search log database 112 may include search queries from any number of users over any period of time.
  • the search log database 112 may include records or logs of a subset of the queries or requests for data entered at the search engine 102 over a period of time.
  • the search log database 112 may also store associations between search queries from the search engine 102 . For example, a search query may be associated with a search keyword or other search queries after a conversion and comparison by the language analyzer 104 as discussed below.
  • the search log database 112 may also be coupled with a data source 113 .
  • the data source 113 may be an internal source of data, an external source of search data, or a combination of the two.
  • An external data source may include search results from other search engines or other sources.
  • a search engine other than search engine 102 may be an external data source and provide search logs to the search log database 112 .
  • An internal data source may include search data or other data from the search engine 102 .
  • Other data may include other searching or web browsing tendencies identified by the search engine 102 .
  • the search log database 112 may also be coupled with a unit dictionary 116 .
  • the unit dictionary 116 may be a database of user queries or search keywords that are coupled with one another as units. Units may also be referred to as concepts or topics and are sequences of one or more words that appear in search queries.
  • the search query “New York City law enforcement” may include two units, e.g. “New York City” may be one unit and “law enforcement” may be another unit.
  • a unit is a phrase of common words that identify a single concept.
  • the search query “Chicago art museums” may include two units, e.g. “Chicago” and “art museums.” The “Chicago” unit is a single word, and “art museums” is a two-word unit.
  • Units identify common groups of keywords to maximize the efficiency and relevance of search results.
  • the unit dictionary 116 may include Chinese related queries, as well as Chinese related units that include Chinese characters. Categorization of search queries into units is discussed in commonly owned U.S. Pat. No. 7,051,023 issued May 23, 2006, entitled “SYSTEMS AND METHODS FOR GENERATING CONCEPT UNITS FROM SEARCH QUERIES,” which is hereby incorporated by reference.
  • the unit dictionary 116 and the categorization of search queries into units may be used to compare and analyze search queries received by the search engine 102 .
  • a search query may be broken into units that are compared with units from other queries or search keywords.
  • past search queries and search keywords are stored in the search log database 112 as units that may be used in an analysis by the language analyzer 104 .
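  • As an illustration of breaking a query into units, the following sketch uses greedy longest-match lookup against a small dictionary. The dictionary contents and the greedy strategy are assumptions made for this example; the incorporated U.S. Pat. No. 7,051,023 describes how units are actually generated.

```python
# Hypothetical unit dictionary; a real one would be derived from query logs.
UNIT_DICTIONARY = {"new york city", "law enforcement", "chicago", "art museums"}


def split_into_units(query, units=UNIT_DICTIONARY):
    """Greedy longest-match segmentation of a query into known units."""
    words = query.lower().split()
    result = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, shrinking until one matches.
        for j in range(len(words), i, -1):
            candidate = " ".join(words[i:j])
            if candidate in units:
                result.append(candidate)
                i = j
                break
        else:
            # No known unit starts here; keep the single word as its own unit.
            result.append(words[i])
            i += 1
    return result
```

  • With this dictionary, "New York City law enforcement" yields the two units "new york city" and "law enforcement", matching the example above.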
  • the ad server 103 , the search engine 102 and/or the search log database 112 may be coupled with the language analyzer 104 .
  • the language analyzer 104 receives a user query from the user device 106 and matches or identifies other queries or search keywords.
  • the user query may be converted to a different form for comparing various features of the user query with search keywords as discussed with respect to FIG. 2 .
  • the language analyzer 104 may be a computing device as described below with respect to FIG. 6 .
  • the language analyzer 104 includes a processor 105 , memory 107 , software 108 and an interface 110 .
  • the language analyzer 104 may be a separate component from the search engine 102 and the ad server 103 .
  • any of the language analyzer 104 , search engine 102 , and the ad server 103 may be combined as a single component.
  • the interface 110 may communicate with any of the search engine 102 , search log database 112 , and ad server 103 .
  • the interface 110 may include a user interface configured to allow a user to interact with any of the components of the language analyzer 104 . For example, a user may be able to modify the conversion form or comparison features that are used by the language analyzer 104 .
  • the processor 105 in the language analyzer 104 may include a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP) or other type of processing device.
  • the processor 105 may be a component in any one of a variety of systems.
  • the processor 105 may be part of a standard personal computer or a workstation.
  • the processor 105 may be one or more general processors, digital signal processors, application specific integrated circuits, field programmable gate arrays, servers, networks, digital circuits, analog circuits, combinations thereof, or other now known or later developed devices for analyzing and processing data.
  • the processor 105 may operate in conjunction with a software program, such as code generated manually (i.e., programmed).
  • the processor 105 may be coupled with a memory 107 , or the memory 107 may be a separate component.
  • the interface 110 and/or the software 108 may be stored in the memory 107 .
  • the memory 107 may include, but is not limited to, computer readable storage media such as various types of volatile and non-volatile storage media, including random access memory, read-only memory, programmable read-only memory, electrically programmable read-only memory, electrically erasable read-only memory, flash memory, magnetic tape or disk, optical media and the like.
  • the memory 107 includes a random access memory for the processor 105 .
  • the memory 107 is separate from the processor 105 , such as a cache memory of a processor, the system memory, or other memory.
  • the memory 107 may be an external storage device or database for storing recorded image data. Examples include a hard drive, compact disc (“CD”), digital video disc (“DVD”), memory card, memory stick, floppy disc, universal serial bus (“USB”) memory device, or any other device operative to store image data.
  • the memory 107 is operable to store instructions executable by the processor 105 .
  • the functions, acts or tasks illustrated in the figures or described herein may be performed by the programmed processor executing the instructions stored in the memory 107 .
  • the functions, acts or tasks are independent of the particular type of instruction set, storage media, processor or processing strategy and may be performed by software, hardware, integrated circuits, firm-ware, micro-code and the like, operating alone or in combination.
  • processing strategies may include multiprocessing, multitasking, parallel processing and the like.
  • the processor 105 is configured to execute the software 108 .
  • the software 108 may include instructions for analyzing and converting search queries and comparing features with other queries or search keywords.
  • the interface 110 may be a user input device or a display.
  • the interface 110 may include a keyboard, keypad or a cursor control device, such as a mouse, or a joystick, touch screen display, remote control or any other device operative to interact with the language analyzer 104 .
  • the interface 110 may include a display coupled with the processor 105 and configured to display an output from the processor 105 .
  • the display may be a liquid crystal display (LCD), an organic light emitting diode (OLED), a flat panel display, a solid state display, a cathode ray tube (CRT), a projector, a printer or other now known or later developed display device for outputting determined information.
  • the display may act as an interface for the user to see the functioning of the processor 105 , or as an interface with the software 108 for providing input parameters.
  • the interface 110 may allow a user to interact with the language analyzer 104 to establish a conversion of a user query and the features that are compared in matching a query with a search keyword.
  • any of the components in system 100 may be coupled with one another through a network.
  • the language analyzer 104 may be coupled with the search engine 102 , search log database 112 , or ad server 103 via a network.
  • Any of the components in system 100 may include communication ports configured to connect with a network.
  • the present disclosure contemplates a computer-readable medium that includes instructions or receives and executes instructions responsive to a propagated signal, so that a device connected to a network can communicate voice, video, audio, images or any other data over a network.
  • the instructions may be transmitted or received over the network via a communication port or may be a separate component.
  • the communication port may be created in software or may be a physical connection in hardware.
  • the communication port may be configured to connect with a network, external media, display, or any other components in system 100 , or combinations thereof.
  • the connection with the network may be a physical connection, such as a wired Ethernet connection or may be established wirelessly as discussed below.
  • the connections with other components of the system 100 may be physical connections or may be established wirelessly.
  • the network or networks that may connect any of the components in the system 100 to enable communication of data between the devices may include wired networks, wireless networks, or combinations thereof.
  • the wireless network may be a cellular telephone network, a network operating according to a standardized protocol such as IEEE 802.11, 802.16, 802.20, published by the Institute of Electrical and Electronics Engineers, Inc., or a WiMax network.
  • the network(s) may be a public network, such as the Internet, a private network, such as an intranet, or combinations thereof, and may utilize a variety of networking protocols now available or later developed including, but not limited to TCP/IP based networking protocols.
  • the network(s) may include one or more of a local area network (LAN), a wide area network (WAN), a direct connection such as through a Universal Serial Bus (USB) port, and the like, and may include the set of interconnected networks that make up the Internet.
  • the network(s) may include any communication method or employ any form of machine-readable media for communicating information from one device to another.
  • the ad server 103 or the search engine 102 may provide pages to the user device 106 over a network, such as the network 109 .
  • the network or networks described above, including the network 109 may be the network discussed below with respect to FIG. 6 .
  • the ad server 103 , the search engine 102 , the search log database 112 , the language analyzer 104 , the unit dictionary 116 and/or the user device 106 may represent computing devices of various kinds, such as the components described with respect to FIG. 6 .
  • Such computing devices may generally include any device that is configured to perform computation and that is capable of sending and receiving data communications by way of one or more wired and/or wireless communication interfaces.
  • Such devices may be configured to communicate in accordance with any of a variety of network protocols, as discussed above.
  • the user device 106 may be configured to execute a browser application that employs HTTP to request information, such as a web page, from the search engine 102 or ad server 103 .
  • the present disclosure contemplates a computer-readable medium that includes instructions or receives and executes instructions responsive to a propagated signal, so that any device connected to a network can communicate voice, video, audio, images or any other data over a network.
  • FIG. 2 illustrates an embodiment of a language analyzer.
  • the language analyzer 104 may convert a search query into a different form for comparing its features with other queries or search keywords that are used for selecting matching advertisements to be displayed on a search results page.
  • the language analyzer 104 may include a receiver 202 , a converter 204 , a comparator 206 , and a calculator 208 .
  • the language analyzer 104 or any of its components may represent computing devices of various kinds, such as the components described with respect to FIG. 6 .
  • the receiver 202 may receive a user query from the search engine 102 , which may receive the user query from the user device 106 .
  • the receiver 202 may also receive search keywords from the ad server 103 .
  • the search keywords may be matched with advertisements, such that when a user inputs the search keyword in a search engine, the search results page includes the matched advertisement.
  • the language analyzer 104 may match user queries with search keywords for selecting advertisements to be displayed on the search query results page.
  • the converter 204 is coupled with the receiver 202 .
  • the converter 204 receives the user query or other search keywords and converts them into a different form for comparison.
  • the user query may be a Chinese related query and the converter 204 may convert the Chinese related query into a different form to aid comparison.
  • a Chinese related query may include any Chinese characters, including Roman characters that represent a Chinese character or phrase.
  • Chinese related queries may also include queries that originate from or are received by a Chinese search engine and may be simplified Chinese and/or traditional Chinese.
  • FIG. 3 illustrates exemplary conversion forms.
  • the converter 204 may utilize any of the conversion forms 302 to convert a Chinese related query.
  • the converter 204 may convert a search query into any of the conversion forms 302 to compare the query with other converted queries or converted search terms.
  • the conversion may include a transformation of the query by adding, deleting, and/or substituting characters or words in the queries.
  • the conversion or transformation may result in a common format or common form that may be used for comparing the queries.
  • the conversion forms 302 shown in FIG. 3 are merely exemplary. In alternate embodiments, there may be additional conversion forms 302 that are not illustrated or described.
  • the conversion may receive a Chinese related query and convert each element or selected elements of the query into an array that represents the converted form of the Chinese related query.
  • a first conversion form is a conversion into Chinese soundex 304 .
  • the Chinese characters are converted into pinyin without tone, while the roman letters remain.
  • the query is then converted into a Chinese soundex-like representation. First, the first letter of the string is retained. Second, all occurrences of a, e, h, i, o, and u are removed, except where one of them is the first letter.
  • characters may be replaced, for example by replacing “zh” with “z,” “ch” with “c,” “sh” with “s,” “ng” with “n,” “rd” with “d,” “rl” with “l,” “rn” with “n,” “rs” with “s,” and/or “rt” with “t.”
  • Fifth, if two or more letters are adjacent, then the first letter remains and the others are omitted. Sixth, the spaces are removed. Seventh, all remaining characters are returned.
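  • The soundex-like steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the input is already toneless pinyin mixed with roman letters, the whole query is treated as one string, the replacements are applied after the vowel removal, and "adjacent letters" is read as runs of the same letter (the usual Soundex convention). The function name `chinese_soundex` is hypothetical.

```python
import re

# Replacement pairs listed in the text.
REPLACEMENTS = [("zh", "z"), ("ch", "c"), ("sh", "s"), ("ng", "n"),
                ("rd", "d"), ("rl", "l"), ("rn", "n"), ("rs", "s"), ("rt", "t")]


def chinese_soundex(pinyin_query: str) -> str:
    """Soundex-like key for a query already written in toneless pinyin."""
    text = pinyin_query.lower().strip()
    if not text:
        return ""
    # Keep the first letter; drop a, e, h, i, o, u everywhere else.
    text = text[0] + re.sub(r"[aehiou]", "", text[1:])
    # Apply the listed replacements.
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    # Assumption: collapse runs of the same letter to a single letter.
    text = re.sub(r"([a-z])\1+", r"\1", text)
    # Remove spaces and return what remains.
    return text.replace(" ", "")
```

  • Under these assumptions, "bei jing" and the misspelling "beijin" both map to the key "bjn", and "zhang" and "zang" both map to "zn", which is the kind of tolerance to spelling variation the conversion is meant to provide.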
  • a second conversion form is converting the Chinese characters into the keyboard input form zhuyin (Bopomofo) 306 .
  • Each element in the array is either all zhuyin characters for one corresponding Chinese character or a roman character originally in the query without transformation.
  • a third conversion form is a similar zhuyin (Bopomofo) conversion 308 , except each element in the array is either one zhuyin character or a roman character originally in the query without transformation.
  • a fourth conversion form is converting Chinese characters into radicals 310 .
  • Each element in the array is either the radical for a Chinese character or the roman character originally in the query without transformation.
  • a radical 310 may be the semantic root (i.e., portion bearing the meaning) of a Chinese character.
  • a radical may be part of a Chinese character and/or the semantic component of this Chinese character. For example, in the character pronounced as jie with a meaning of “sister”, the left part (pronounced nǚ in Mandarin Chinese) is the semantic component.
  • a Chinese character may have one or more radicals.
  • the radicals may be used for Chinese Hanzi.
  • a dictionary may be used to match a Chinese character with its radical(s). When a Chinese character has multiple radicals, the most meaningful radical (which may be identified in a dictionary) may be considered for comparison.
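  • A radical conversion can be sketched as a per-character dictionary lookup. The four-entry table below is a hypothetical stand-in for a full character dictionary, and `to_radicals` is an illustrative name only.

```python
# Tiny hypothetical character-to-radical table; a real system would load a
# full dictionary and pick the most meaningful radical for each character.
RADICALS = {"姐": "女", "妈": "女", "河": "氵", "海": "氵"}


def to_radicals(query, table=RADICALS):
    """Map each known Chinese character to its radical; keep other characters."""
    return [table.get(ch, ch) for ch in query]
```

  • Two different characters that share a semantic component, such as the two entries with the radical 女, become identical elements in this form, so they count as a match when the arrays are compared.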
  • a fifth conversion form is converting Chinese characters into pinyin without tone 412 .
  • Each element in the array is either the complete pinyin without tone for one corresponding Chinese character or a roman character originally in the query without any transformation.
  • a sixth conversion form is converting Chinese characters into pinyin without tone 414 in which each element in the array is either one pinyin character or a roman character originally in the query without transformation.
  • Pinyin may be a Standard Mandarin Romanization system. In pinyin, the pin refers to a “spelling” and the yin refers to a “sound.” There may be a pinyin corresponding to each Chinese Character. One pinyin may include more than two roman characters.
  • each pinyin may be a unit for similarity comparison.
  • each character within pinyin may be a unit for comparison.
  • a seventh conversion form is converting Chinese characters into pinyin with tone 416 . Each element in the array is either the complete pinyin and its tone for one corresponding Chinese character or a roman character originally in the query without transformation.
  • An eighth conversion form is converting Chinese characters into pinyin with tone 418 in which each element in the array is either one pinyin character, its tone, or a roman character originally in the query without transformation.
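  • The four pinyin forms differ only in element granularity and in whether tone is kept. The sketch below assumes a tiny hypothetical lookup table and writes tone as a trailing digit, a common numbered-tone notation that the disclosure does not prescribe.

```python
# Tiny hypothetical table: character -> (toneless pinyin, tone number).
PINYIN = {"北": ("bei", 3), "京": ("jing", 1)}


def pinyin_syllables(query, with_tone=False):
    """One element per Chinese character (whole syllable) or roman character."""
    out = []
    for ch in query:
        if ch in PINYIN:
            syllable, tone = PINYIN[ch]
            out.append(f"{syllable}{tone}" if with_tone else syllable)
        else:
            out.append(ch)
    return out


def pinyin_letters(query, with_tone=False):
    """One element per pinyin letter (plus tone digit, if kept) or roman character."""
    out = []
    for element in pinyin_syllables(query, with_tone):
        out.extend(element)  # a string extends the list one character at a time
    return out
```

  • For the two-character input in the table, the syllable form gives two elements while the letter form gives seven, so the letter form is more forgiving of a single mistyped letter inside a syllable.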
  • a ninth conversion form is converting queries into two character-based arrays 420 . In particular, if a character is Chinese, three bytes in Chinese (utf8) is an element in the array. In other words, each Chinese character is represented in three bytes. If a character is roman, then the roman character itself is an element.
  • a tenth conversion form is the removal of Chinese characters 422 .
  • the roman characters are left in the query and the Chinese characters are removed.
  • an eleventh conversion form removes the roman characters 424 , and keeps the Chinese characters in the query.
  • a twelve conversion form includes leaving the query as inputted 426 . In other words, the twelve conversion is no conversion 426 .
  • the receiver 202 receives two queries that are to be compared to determine the similarity between those queries.
  • the queries are converted into at least one of the conversion forms by the converter 204 .
  • both queries are converted into the twelve exemplary conversion forms 302 and the queries are compared in all twelve converted forms.
  • certain conversion forms are selected for converting the queries and the queries are compared for each of those converted forms.
  • the queries may be compared by the comparator 206 .
  • the comparator 206 may be configured to perform comparison of a user's search query with other queries or with search keywords that are used by the ad server 103 for displaying relevant advertisements that are linked to particular search keywords.
  • the comparator 206 determines the similarity between two queries.
  • The queries are first converted into a similar form or similar forms by the converter 204, and each of those forms is compared by the comparator 206.
  • the queries are converted into the twelve forms illustrated in FIG. 3 and the comparator 206 makes twelve comparisons between the queries for each of the twelve conversions of each query. In alternative embodiments, there may be more or fewer conversion forms that are compared by the comparator 206 .
  • a user query may be compared with a candidate set of queries to determine which of the candidate set is most similar to the user query.
  • the candidate set may be made up of search keywords which are compared with the user query to determine which search keyword is most similar.
  • the candidate set of queries or keywords for comparison may be chosen based on an initial analysis of the user query compared with the search log database 112 .
  • when the user query is received the candidate set is identified and each member of the candidate set is compared with the user query to determine which is most similar.
  • a similarity score may be calculated for each member of the candidate set that represents a similarity with the user query.
  • the member of the candidate set with the closest similarity score may be most similar to the user query.
  • the candidate set may include one query or include all queries, such as those stored in the search log database 112 .
  • FIG. 4 illustrates exemplary comparisons of queries.
  • the comparator 206 may utilize comparison features 402 when comparing queries.
  • the comparison features 402 shown in FIG. 4 are merely exemplary. In alternate embodiments, there may be additional comparison features 402 that are not illustrated or described.
  • the comparison may involve comparing various forms of converted Chinese related queries.
  • the comparator 206 may compare an array of elements that is generated by the converter 204 as a converted form of a Chinese related query.
  • The comparator 206 may compare queries as described in the commonly owned U.S. application entitled, “SYSTEM AND METHOD FOR IDENTIFYING RELATED QUERIES FOR LANGUAGES WITH MULTIPLE WRITING SYSTEMS,” U.S. Pat. Pub. No. 2007/0203894, filed Feb. 28, 2006, the disclosure of which is hereby incorporated by reference.
  • a first comparison feature may be an edit distance 404 between two queries.
  • the edit distance may be a measure of the difference between two character strings, such as queries.
  • the edit distance may be a minimum number of edit operations required to transform the first query into the second query.
  • The edit operation may include inserting a character into a string, deleting a character from a string, or replacing a character by another character.
  • weights may be assigned for different edit operations. For example, a higher weight may be placed on replacing the character s by the character p, than on replacing it by the character a.
  • the edit distance may be the Levenshtein distance or the Damerau-Levenshtein distance when a transposition of characters counts as a single edit operation.
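  • As a minimal sketch of the edit distance described above, the following computes a Levenshtein distance over two sequences, which may be strings or arrays of elements such as pinyin units. The function name and the optional `sub_cost` parameter (used to weight replacements) are assumptions for illustration; transpositions are not treated as a single operation here, so this is not the Damerau-Levenshtein variant.

```python
def edit_distance(a, b, sub_cost=lambda x, y: 1):
    """Levenshtein distance between two sequences with weighted replacement."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else sub_cost(x, y)
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # replace x by y
        prev = curr
    return prev[-1]


# Hypothetical weighting: replacing "s" by "a" costs less than other replacements.
def keyboard_cost(x, y):
    return 0.5 if {x, y} == {"s", "a"} else 1


d_chars = edit_distance("map", " map")                   # one inserted space
d_units = edit_distance(["bei", "jing"], ["bei", "jin"])  # one replaced unit
```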
  • a second comparison feature may be an edit distance without a domain 406 .
  • the domain may be a web domain, such as “.com” that is removed. The removal of the domain may be helpful because a user querying “yahoo.com” and “yahoo.net” is likely making the same query.
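  • The domain removal described above may be sketched as follows. The suffix list and the function name are assumptions chosen for illustration; an actual embodiment may recognize a different set of domains.

```python
import re

# Assumption: a short, illustrative list of domain suffixes to strip.
DOMAIN_SUFFIX = re.compile(r"\.(com|net|org|cn|tw|hk)$", re.IGNORECASE)


def strip_domain(query):
    """Remove one trailing web domain, such as ".com", from a query."""
    return DOMAIN_SUFFIX.sub("", query.strip())


same = strip_domain("yahoo.com") == strip_domain("yahoo.net")
```

  • After stripping, the two example queries are identical, so any edit distance computed between them is zero.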
  • a third comparison feature may be a character level prefix overlap 408 .
  • the character level prefix overlap 408 may be a measure of the characters/words that are the same at the beginning of the queries. For example, “auto cleaners” and “auto cleaning” have a prefix overlap of “auto clean.” The prefix overlap may indicate increased similarity.
  • a fourth comparison feature may be a character level suffix overlap 410 .
  • The character level suffix overlap 410 measures the similarity between queries at the end of the query. For example, “auto insurance agent” and “home insurance agent” share a suffix overlap of “insurance agent.” Similar to the prefix overlap, the suffix overlap may indicate increased similarity.
  • a fifth comparison feature may be a minimum edit distance 412 over all the conversion forms.
  • a sixth comparison feature may be a maximum edit distance 414 over all the conversion forms. Given twelve conversion forms and twelve edit distances for each conversion, the minimum edit distance 412 and the maximum edit distance 414 may be identified. In one embodiment, the minimum and maximum may be removed as outliers. Alternatively, the minimum or maximum may be weighted higher when computing a similarity score.
  • a seventh comparison feature may be a minimum edit distance without a domain 416 and an eighth comparison feature may be a maximum edit distance without a domain 418 . As discussed above, the domains in a query may not be valuable in terms of determining what the user is searching for, so the domains are removed before comparison.
  • Additional comparison features may be a word level edit distance 420 , a word level prefix overlap 422 , or a word level suffix overlap 424 .
  • the word level comparisons are similar to the character level comparisons, except entire words are compared rather than individual characters.
  • a length difference 426 between two queries may also be used for comparing.
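  • The prefix overlap, suffix overlap, and length difference features described above may be sketched as follows. The function names are assumptions for illustration. Because the functions operate on any sequence, passing a string gives the character level comparison and passing a list of words gives the word level comparison.

```python
def prefix_overlap(a, b):
    """Longest common prefix of two sequences (strings or lists of words)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def suffix_overlap(a, b):
    """Longest common suffix, computed as the prefix of the reversed sequences."""
    return prefix_overlap(a[::-1], b[::-1])[::-1]


def length_difference(a, b):
    """Absolute difference in length between two sequences."""
    return abs(len(a) - len(b))


char_prefix = prefix_overlap("auto cleaners", "auto cleaning")
word_suffix = suffix_overlap("auto insurance agent".split(),
                             "home insurance agent".split())
```

  • At the character level the suffix of the second example also includes the space that precedes “insurance”, whereas at the word level the shared suffix is the two words “insurance” and “agent”.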
  • the comparator 206 may be coupled with a calculator 208 that may calculate a similarity score.
  • the similarity score may be a measure of the similarity between the queries.
  • the similarity score may be calculated based on individual comparisons of different conversion forms of two queries with each individual comparison being assigned a weighted value.
  • the multiple conversion forms described with respect to FIG. 3 may each result in a separate comparison between two queries. Accordingly, using the twelve conversion forms 302 , there may be twelve different edit distances or similarity scores, one for each conversion. Those twelve converted forms may be compared and multiplied by a weight for each form to get an overall similarity score between the queries. Alternatively, a subset of the twelve conversion forms or additional conversion forms not described may be utilized to convert Chinese related queries into different forms for comparison.
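  • One way the per-form comparisons might be combined is sketched below, assuming each conversion form has already produced one edit distance. The form labels, distances, and weights are hypothetical values chosen for illustration and are not taken from Table A.

```python
def combine(distances, weights):
    """Weighted sum of per-form distances, plus the minimum and maximum."""
    weighted = sum(weights[form] * d for form, d in distances.items())
    return {
        "weighted": weighted,
        "min": min(distances.values()),
        "max": max(distances.values()),
    }


# Hypothetical distances for three conversion forms of one query pair.
distances = {"pinyin_no_tone": 1, "roman_only": 0, "as_input": 2}
weights = {"pinyin_no_tone": 0.5, "roman_only": 0.25, "as_input": 0.25}
summary = combine(distances, weights)
```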
  • the equation presented in Table A may be used to calculate a similarity score indicating the strength of similarity between a query pair.
  • the query pair may include a given query q and a comparison query MODS(q), either of which may be written according to one or more Chinese writing systems.
  • MODS(q) may represent a converted query.
  • both q and MODS(q) may be converted to the same form for comparison, or MODS(q) is converted into a form for comparison with q.
  • MODS(q) may represent a related query that is identified as a potential substitute for the user query q.
  • When MODS(q) has a good similarity score with the user's original query q, MODS(q) may be used as a search keyword for fetching advertisements.
  • MODS(q) may also be referred to as a rewritten query. Both the user's original query q and MODS(q) may be converted to the same form for comparison.
  • the equation in Table A makes use of a subset of the conversion forms 302 and the comparison features 402 that are discussed above. In alternative embodiments, different conversion forms or comparison features may be utilized to generate a similarity score.
  • the equation illustrated in Table A is merely exemplary and may be modified so as to provide for the calculation of a similarity score for multiple writing systems.
  • a formula may be optimized based on the source of the query, because queries from Taiwan may be different from queries from Hong Kong. Accordingly, the conversions, comparisons, and weights may be modified for different types of queries.
  • q represents a given query written according to one or more Chinese writing systems and MODS(q) represents a query selected from a candidate set of potential queries related to query q.
  • Query q may be referred to as query q1 and MODS(q) may be referred to as query q2 or q′.
  • the initial number before each feature is a weight that may be used to emphasize or deemphasize features.
  • Pq12min may be a function for calculating the query substitution probability of query q1 following query q2 in a log of user query sessions, such as from the search log database 112.
  • The search log database 112 may identify the order of the one or more queries submitted by the user, for example, to provide an indication of how the user refined a query, how the user rewrote a query, how the user utilized one or more alternate writing systems of a language with multiple writing systems to express a query q, etc.
  • When queries q1 and q2 follow one another in the search log database 112, it may be an indication that they are similar because q2 may be a refinement of q1.
  • The pq12min function calculates a query substitution probability of a given query q1 following a given query q2, and may also be used to calculate a unit substitution of a unit u following a given unit u′.
  • pq12min = prob(U_i -> U_i′
  • pq12min may be the normalized probability of q2 as q1's substitution.
  • A normalized probability is computed for each unit in q1 being substituted by the corresponding unit in q2, and the minimum of these probabilities is taken as pq12min.
  • Levroman is a comparison using the roman characters of a query, such as with conversion form 322, which removes Chinese characters. For each query all non-roman characters may be removed, but spaces are left in the query. The roman character parts are changed into arrays. Each roman character is an element in the array, including any spaces. The Levenshtein distance is measured between the two arrays. In the case that neither q1 nor q2 has roman characters, levroman is set to 0. In the case that one of q1 or q2 has roman characters but the other does not have roman characters, levroman is set to 1.
  • the first query does not include a space before map, but the second query includes a space before map.
  • The queries are converted into arrays, with q1 represented as one array and q2 represented as another.
  • Agreechar may relate to character agreement without removing a space regardless of the order of characters. Agreechar may be similar to wordr discussed below, except it is for the character level rather than the word level.
  • q1 and q2 have 7 unique characters in total, including “m”, “a”, “p” and a space.
  • Queries q1 and q2 share 5 unique characters, including “m”, “a” and “p”. Therefore, agreechar is 0.714 (calculated by 5/7) for this query pair.
  • Wordr is similar to agreechar except it matches words rather than characters.
  • the queries are separated into words, segments, or units as described above.
  • the percentage of unique words not in common is determined for wordr.
  • Dlevpynchar utilizes the complete pinyin without tone 312 conversion form.
  • The first query q1 and second query q2 first have a common domain removed and each roman character (including spaces) is kept, while each Chinese character is converted into pinyin without tone.
  • the queries are then transformed into arrays. Each roman character is an element in the array and each Chinese character's pinyin without tone is an element in the array.
  • The Levenshtein distance is then measured. In the example described above, query q1 does not include a space before “map,” but query q2 does.
  • the first query q1 is converted into an array:
  • the second query q2 is converted into an array:
  • Q1bidded is 1 if q1 is bidded and 0 if q1 is not bidded.
  • When q1 is a user query and q1 is bidded, it may mean that an advertiser has chosen q1 as a keyword for the advertisements they want to show. This bidding process may also identify a cost they would like to pay if web searchers click the ads fetched by the keyword.
  • a query identifying system may identify a related query (e.g. MODS(q)) to substitute for the user query.
  • Q2hasroman is 1 if q2 contains any roman characters, not counting spaces.
  • Q2hasroman is 0 if q2 does not contain any roman characters.
  • The queries that are analyzed may be from a Chinese search engine or from a search engine that receives Chinese related queries.
  • a Chinese search engine may receive queries with roman characters due to the usage of roman characters in Chinese and the popularity of roman character based languages such as English.
  • The Chinese characters and roman characters may be processed differently. For example, a Chinese character may be converted into Pinyin for a similarity comparison, while Roman characters are not converted into Pinyin. Accordingly, a similarity score computation may be adjusted based on the presence of Roman characters.
  • Pq21max may be a function for calculating the query substitution probability of query q1 following query q2 in a log of user query sessions, such as from the search log database 112.
  • pq21max = prob(U_i -> U_i′
  • The normalized probability may be calculated according to the above equation for each unit pair in the query pair, and the maximum is used as pq21max.
  • Lengthdiffn is the length difference in characters between q1 and q2, which is normalized by their maximum length in characters.
  • Entropy21min is an uncertainty that may be associated with a similarity between q1 and q2.
  • $$\mathrm{entropy21min} = \sum_{i} \frac{\mathrm{freq}(q_1 \rightarrow q_2^{i})}{\mathrm{freq}(q_2^{i})} \log\left(\frac{\mathrm{freq}(q_1 \rightarrow q_2^{i})}{\mathrm{freq}(q_2^{i})}\right),$$ where i is the number of possible q1 query substitutions with q2.
  • $$\mathrm{entropy21min} = \min_{j} \sum_{i} \frac{\mathrm{freq}(q_{1j} \rightarrow q_{2j}^{i})}{\mathrm{freq}(q_{2j}^{i})} \log\left(\frac{\mathrm{freq}(q_{1j} \rightarrow q_{2j}^{i})}{\mathrm{freq}(q_{2j}^{i})}\right),$$ where j is the number of unit substitutions between q1 and q2, and i is the number of possible q1j's unit substitutions.
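  • A minimal sketch of the entropy21min expressions above follows. It assumes that each term is a ratio of two frequencies taken from a search log, that the logarithm is the natural logarithm, and that the sum carries no leading minus sign, as the expressions appear to be written. The frequency counts and function names are hypothetical.

```python
import math


def entropy_sum(pairs):
    """Sum of p * log(p), where p = freq(q1 -> q2_i) / freq(q2_i).

    `pairs` is a list of (substitution frequency, candidate frequency) tuples.
    """
    total = 0.0
    for sub_freq, cand_freq in pairs:
        p = sub_freq / cand_freq
        if p > 0:  # treat 0 * log(0) as 0
            total += p * math.log(p)
    return total


def entropy_min_over_units(pairs_per_unit):
    """Unit level variant: the minimum of the per-unit sums."""
    return min(entropy_sum(pairs) for pairs in pairs_per_unit)


# Hypothetical counts: two candidates, each substituted half of the time.
query_level = entropy_sum([(1, 2), (1, 2)])
unit_level = entropy_min_over_units([[(1, 1)], [(1, 2), (1, 2)]])
```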
  • LengthsubstminGT3 utilizes a substitution of characters.
  • lengthsubstminGT3 is 1 if the minimum length of q1 and q2 is less than 3 in characters. Otherwise, lengthsubstminGT3 is 0.
  • lengthsubstminGT3 is 1 if the minimum length of any of the substitution units in characters is greater than 3. Otherwise, lengthsubstminGT3 is 0.
  • Query suggestion may refer to a generation of related queries based on an original user query. The user query may be broken into units as described above. A related unit may be found for each unit and combined to form a related query.
  • For example, when a user enters a query for “New York hotel,” it may be split into two units, “New York” and “hotel.” “New York” may be rewritten to a related query “Manhattan” and “hotel” may be rewritten to “motel.” Accordingly, “Manhattan motel” may be a related candidate query for an original user query of “New York hotel.”
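  • The query suggestion step may be sketched as follows, assuming the query has already been split into units and that a table of related units is available. The table contents and the function name are hypothetical.

```python
from itertools import product

# Hypothetical table of related units; a real system would derive these from logs.
RELATED_UNITS = {"New York": ["Manhattan"], "hotel": ["motel"]}


def suggest_queries(units, related=RELATED_UNITS):
    """Replace each unit by its related units and join every combination."""
    options = [related.get(unit, [unit]) for unit in units]
    return [" ".join(combo) for combo in product(*options)]


candidates = suggest_queries(["New York", "hotel"])
```

  • A unit with no entry in the table is kept unchanged, so a partially rewritten query is still produced.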
  • the equation in Table A and the corresponding features that are used to calculate a similarity score in the calculator 208 are exemplary.
  • a different equation, different weights and different features may be utilized to compute a similarity score.
  • The edit distance may be computed for each of the conversion forms 302 and averaged to become the similarity score.
  • weights may be added to each converted form, or additional comparison features 402 may be used.
  • the equation that is used to determine the similarity score is analyzed by comparing with a human or editorial control set.
  • the editorial control set may include a human review of the similarity scores for pairs of queries to determine an accuracy of the equation used for calculating the similarity score.
  • the human review may be used to optimize the equation that calculates the similarity score.
  • Human editors may label query pairs with a relevance score.
  • the relevance score may be used as a training label for the similarity score calculation, such as for the weights used in the equation in Table A.
  • the editorial score may be a response variable and/or a dependent variable.
  • the model may be fitted using linear regression.
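  • As a sketch of how such a fit might be carried out, the following solves an ordinary least squares problem through the normal equations, with the editorial score as the response variable and the comparison features as predictors. The feature rows, the editorial scores, and the function name are hypothetical, and an actual embodiment may use a different fitting procedure.

```python
def fit_weights(X, y):
    """Least squares weights w solving (X^T X) w = X^T y by Gauss-Jordan elimination."""
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)]
         + [sum(row[i] * t for row, t in zip(X, y))]
         for i in range(n)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


# Hypothetical training data: a constant term plus one feature per query pair,
# with editorial relevance labels as the response.
features = [[1, 0], [1, 1], [1, 2], [1, 3]]
editorial_scores = [1, 3, 5, 7]
weights = fit_weights(features, editorial_scores)
```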
  • FIG. 5 is an illustration for identifying related queries.
  • a user query is received.
  • the user query may be Chinese-related and include at least one Chinese character.
  • the user query may be received by a search engine 102 .
  • the user query may be compared with a selected candidate set of queries or search keywords in block 504 .
  • The candidate set may be selected from the search log database 112.
  • the candidate set may be chosen based on an initial comparison of similarity with the user query.
  • the user query and/or the candidate set of queries may be converted into a different form or format for comparison, such as the conversion forms 302 .
  • the user query and a member of the candidate set are compared in block 508 .
  • a similarity score is calculated to measure a similarity between the user query and the member of the candidate set.
  • the similarity score may be based on utilizing any of the comparison features 402 for comparing a converted form of the user query with a converted form of the member.
  • another comparison at block 508 occurs for another member from the candidate set and continues until all members of the candidate set have been compared and have a similarity score.
  • The similarity scores for the members of the candidate set may be reviewed to identify the member of the candidate set with the closest similarity score to the user query.
  • the identification of a similar member such as a similar search keyword, may be used to identify which advertisements to display for sponsored searching.
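  • The overall flow of comparing a user query with each member of a candidate set and selecting the most similar member may be sketched as follows. The scoring function here is a stand-in that uses the `difflib.SequenceMatcher` ratio from the Python standard library; it is not the equation of Table A, and the candidate keywords are hypothetical.

```python
from difflib import SequenceMatcher


def stand_in_similarity(q1, q2):
    """Stand-in score in [0, 1]; higher means more similar."""
    return SequenceMatcher(None, q1, q2).ratio()


def most_similar(user_query, candidate_set, score=stand_in_similarity):
    """Score every member of the candidate set and return the best one."""
    scored = [(score(user_query, member), member) for member in candidate_set]
    best_score, best_member = max(scored)
    return best_member, best_score


keywords = ["yahoo mail", "yahoo maps", "google"]
best, best_score = most_similar("yahoo map", keywords)
```

  • Any other scoring function that takes two queries and returns a number may be passed as `score`, so the selection loop is independent of how the similarity score is calculated.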
  • an illustrative embodiment of a general computer system is shown and is designated 600 .
  • the user device 106 , ad server 103 , the search engine 102 , the search log database 112 , the data source 113 , the unit dictionary 116 , and/or the language analyzer 104 may be a computer or computing devices, such as the computer system 600 or any of its components.
  • the computer system 600 can include a set of instructions that can be executed to cause the computer system 600 to perform any one or more of the methods or computer based functions disclosed herein.
  • the computer system 600 may operate as a standalone device or may be connected, e.g., using a network, to other computer systems or peripheral devices.
  • the computer system may operate in the capacity of a server or as a client user computer in a server-client user network environment, or as a peer computer system in a peer-to-peer (or distributed) network environment.
  • the computer system 600 can also be implemented as or incorporated into various devices, such as a personal computer (PC), a tablet PC, a set-top box (STB), a personal digital assistant (PDA), a mobile device, a palmtop computer, a laptop computer, a desktop computer, a communications device, a wireless telephone, a land-line telephone, a control system, a camera, a scanner, a facsimile machine, a printer, a pager, a personal trusted device, a web appliance, a network router, switch or bridge, or any other machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine.
  • the computer system 600 can be implemented using electronic devices that provide voice, video or data communication.
  • the term “system” shall also be taken to include any collection of systems or sub-systems that individually or jointly execute a set, or multiple sets, of instructions to perform one or more computer functions.
  • the computer system 600 may include a processor 602 , e.g., a central processing unit (CPU), a graphics processing unit (GPU), or both.
  • the processor 602 may be a component in a variety of systems.
  • the processor 602 may be part of a standard personal computer or a workstation.
  • the processor 602 may be one or more general processors, digital signal processors, application specific integrated circuits, field programmable gate arrays, servers, networks, digital circuits, analog circuits, combinations thereof, or other now known or later developed devices for analyzing and processing data.
  • the processor 602 may implement a software program, such as code generated manually (i.e., programmed).
  • the computer system 600 may include a memory 604 that can communicate via a bus 608 .
  • the memory 604 may be a main memory, a static memory, or a dynamic memory.
  • the memory 604 may include, but is not limited to computer readable storage media such as various types of volatile and non-volatile storage media, including but not limited to random access memory, read-only memory, programmable read-only memory, electrically programmable read-only memory, electrically erasable read-only memory, flash memory, magnetic tape or disk, optical media and the like.
  • the memory 604 includes a cache or random access memory for the processor 602 .
  • the memory 604 is separate from the processor 602 , such as a cache memory of a processor, the system memory, or other memory.
  • the memory 604 may be an external storage device or database for storing data. Examples include a hard drive, compact disc (“CD”), digital video disc (“DVD”), memory card, memory stick, floppy disc, universal serial bus (“USB”) memory device, or any other device operative to store data.
  • the memory 604 is operable to store instructions executable by the processor 602 .
  • the functions, acts or tasks illustrated in the figures or described herein may be performed by the programmed processor 602 executing the instructions stored in the memory 604 .
  • processing strategies may include multiprocessing, multitasking, parallel processing and the like.
  • the computer system 600 may further include a display unit 614 , such as a liquid crystal display (LCD), an organic light emitting diode (OLED), a flat panel display, a solid state display, a cathode ray tube (CRT), a projector, a printer or other now known or later developed display device for outputting determined information.
  • the display 614 may act as an interface for the user to see the functioning of the processor 602 , or specifically as an interface with the software stored in the memory 604 or in the drive unit 606 .
  • the computer system 600 may include an input device 616 configured to allow a user to interact with any of the components of system 600 .
  • the input device 616 may be a number pad, a keyboard, or a cursor control device, such as a mouse, or a joystick, touch screen display, remote control or any other device operative to interact with the system 600 .
  • the computer system 600 may also include a disk or optical drive unit 606 .
  • the disk drive unit 606 may include a computer-readable medium 610 in which one or more sets of instructions 612 , e.g. software, can be embedded. Further, the instructions 612 may embody one or more of the methods or logic as described herein. In a particular embodiment, the instructions 612 may reside completely, or at least partially, within the memory 604 and/or within the processor 602 during execution by the computer system 600 .
  • the memory 604 and the processor 602 also may include computer-readable media as discussed above.
  • the present disclosure contemplates a computer-readable medium that includes instructions 612 or receives and executes instructions 612 responsive to a propagated signal, so that a device connected to a network 620 can communicate voice, video, audio, images or any other data over the network 620 .
  • the instructions 612 may be transmitted or received over the network 620 via a communication port 618 .
  • the communication port 618 may be a part of the processor 602 or may be a separate component.
  • the communication port 618 may be created in software or may be a physical connection in hardware.
  • the communication port 618 is configured to connect with a network 620 , external media, the display 614 , or any other components in system 600 , or combinations thereof.
  • the connection with the network 620 may be a physical connection, such as a wired Ethernet connection or may be established wirelessly as discussed below.
  • the additional connections with other components of the system 600 may be physical connections or may be established wirelessly.
  • the network 620 may include wired networks, wireless networks, or combinations thereof.
  • the wireless network may be a cellular telephone network, an 802.11, 802.16, 802.20, or WiMax network.
  • the network 620 may be a public network, such as the Internet, a private network, such as an intranet, or combinations thereof, and may utilize a variety of networking protocols now available or later developed including, but not limited to TCP/IP based networking protocols.
  • While the computer-readable medium is shown to be a single medium, the term “computer-readable medium” includes a single medium or multiple media, such as a centralized or distributed database, and/or associated caches and servers that store one or more sets of instructions.
  • the term “computer-readable medium” shall also include any medium that is capable of storing, encoding or carrying a set of instructions for execution by a processor or that cause a computer system to perform any one or more of the methods or operations disclosed herein.
  • the computer-readable medium can include a solid-state memory such as a memory card or other package that houses one or more non-volatile read-only memories. Further, the computer-readable medium can be a random access memory or other volatile re-writable memory. Additionally, the computer-readable medium can include a magneto-optical or optical medium, such as a disk or tapes or other storage device to capture carrier wave signals such as a signal communicated over a transmission medium. A digital file attachment to an e-mail or other self-contained information archive or set of archives may be considered a distribution medium that is a tangible storage medium. Accordingly, the disclosure is considered to include any one or more of a computer-readable medium or a distribution medium and other equivalents and successor media, in which data or instructions may be stored.
  • dedicated hardware implementations such as application specific integrated circuits, programmable logic arrays and other hardware devices, can be constructed to implement one or more of the methods described herein.
  • Applications that may include the apparatus and systems of various embodiments can broadly include a variety of electronic and computer systems.
  • One or more embodiments described herein may implement functions using two or more specific interconnected hardware modules or devices with related control and data signals that can be communicated between and through the modules, or as portions of an application-specific integrated circuit. Accordingly, the present system encompasses software, firmware, and hardware implementations.
  • the methods described herein may be implemented by software programs executable by a computer system.
  • implementations can include distributed processing, component/object distributed processing, and parallel processing.
  • virtual computer system processing can be constructed to implement one or more of the methods or functionality as described herein.
  • One or more embodiments of the disclosure may be referred to herein, individually and/or collectively, by the term “invention” merely for convenience and without intending to voluntarily limit the scope of this application to any particular invention or inventive concept.
  • Although specific embodiments have been illustrated and described herein, it should be appreciated that any subsequent arrangement designed to achieve the same or similar purpose may be substituted for the specific embodiments shown.
  • This disclosure is intended to cover any and all subsequent adaptations or variations of various embodiments. Combinations of the above embodiments, and other embodiments not specifically described herein, will be apparent to those of skill in the art upon reviewing the description.

Abstract

A system and method are disclosed for identifying similar queries. A user query may be compared with known search keywords. The user query may be a Chinese related query, which is converted into a different form before comparing with other converted queries or keywords. A similarity score based on different features may be used for comparing the queries.

Description

  • This application is a continuation-in-part application to U.S. patent application Ser. No. 11/363,315 (U.S. Pat. Pub. No. 2007/0203894), entitled “SYSTEM AND METHOD FOR IDENTIFYING RELATED QUERIES FOR LANGUAGES WITH MULTIPLE WRITING SYSTEMS,” filed Feb. 28, 2006, the disclosure of which is hereby incorporated by reference.
  • BACKGROUND
  • Online advertising may be an important source of revenue for enterprises engaged in electronic commerce. A number of different kinds of web page based online advertisements are currently in use, along with various associated distribution requirements, advertising metrics, and pricing mechanisms. Processes associated with technologies such as Hypertext Markup Language (HTML) and Hypertext Transfer Protocol (HTTP) enable a web page to be configured to contain a location for inclusion of an advertisement. A page may not only be a web page, but any other electronically created page or document. An advertisement can be selected for display each time the page is requested, for example, by a browser or server application.
  • Online advertising may be linked to online searching. Online searching is a common way for consumers to locate information, goods, or services on the Internet. A consumer may use an online search engine to type in a query to search for other pages or web sites with information related to that query. When the advertising that is shown on the search engine page is related to the query, the search may be referred to as a sponsored search. Sponsored searching may require advertisers to bid for search keywords. The search keywords are associated with the search query for displaying advertisements with the search results. It may be difficult to identify which keyword(s) a search query is related to. In particular, users may enter search queries that are misspelled or that are in a different language.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The system and method may be better understood with reference to the following drawings and description. Non-limiting and non-exhaustive embodiments are described with reference to the following drawings. The components in the drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention. In the drawings, like referenced numerals designate corresponding parts throughout the different views.
  • FIG. 1 is a block diagram of an exemplary network system;
  • FIG. 2 is a block diagram of a language analyzer;
  • FIG. 3 is a block diagram of exemplary conversion forms;
  • FIG. 4 is a block diagram of exemplary comparisons of queries;
  • FIG. 5 is a flow diagram for identifying related queries; and
  • FIG. 6 is a block diagram of a general computer system for use with the disclosed embodiments.
  • DETAILED DESCRIPTION
  • By way of introduction, the embodiments described below include a system and method for identifying and measuring related queries. The embodiments relate to identifying similar Chinese queries. A user query may be compared with known search keywords or other search queries. The search keywords may be used by advertisers for sponsored searching. The user query may be a non-native language query, such as a Chinese related query in an English language website or a query in a Chinese website. The user query is converted into a different form before comparing with other converted queries or the search keywords. For explanation purposes, the embodiments are described in terms of a Chinese related query, but other languages or query platforms may be used. A similarity score based on various features may be used for comparing the queries. Based on the similarity score or other comparison features, the original user query may be substituted by other queries or be associated with one or more search keywords. The associated search keywords may be used for selecting the advertisements that are displayed with the search results for that search query.
  • Alternatively, related queries may be identified from a reformulation of the original query. The reformulation may be based on stored query logs and used to compare the original query with stored queries. As part of the comparison, various features, including language specific features, may be used to measure query similarity. Based on the query similarity, the original query may be replaced by a stored query or search keyword for identifying the relevant advertisements to display. For example, a user's query may be misspelled, and the system may identify a correctly spelled related query that replaces the initial user query. Chinese related queries may be identified and measured due to an increased interest in Chinese search and advertising markets.
  • Other systems, methods, features and advantages will be, or will become, apparent to one with skill in the art upon examination of the following figures and detailed description. It is intended that all such additional systems, methods, features and advantages be included within this description, be within the scope of the invention, and be protected by the following claims. Nothing in this section should be taken as a limitation on those claims. Further aspects and advantages are discussed below in conjunction with the embodiments.
  • FIG. 1 provides a simplified view of a network system 100 in which the present embodiments may be implemented. Not all of the depicted components may be required, however, and some embodiments of the invention may include additional components not shown in the figure. Variations in the arrangement and type of the components may be made without departing from the spirit or scope of the claims as set forth herein. Additional, different or fewer components may be provided.
  • FIG. 1 is a block diagram illustrating an embodiment of an exemplary network system 100 for language analysis and comparison. In particular, system 100 includes a language analyzer 104 that may receive and convert a user's search query for comparison with other queries or search keywords. A user device 106 is coupled with a search engine 102 through the network 109. The search engine 102 is coupled with a search log database 112, and both are coupled with the language analyzer 104. The search log database 112 is coupled with a data source 113 and a unit dictionary 116. An ad server 103 may be coupled with the search engine 102 and/or coupled with the language analyzer 104. Herein, the phrase “coupled with” is defined to mean directly connected to or indirectly connected through one or more intermediate components. Such intermediate components may include both hardware and software based components. Variations in the arrangement and type of the components may be made without departing from the spirit or scope of the claims as set forth herein.
  • The user device 106 may be a computing device for a user to connect to a network 109, such as the Internet. Examples of a user device include but are not limited to a personal computer, personal digital assistant (“PDA”), cellular phone, or other electronic device. The user device 106 may be configured to access other data/information in addition to web pages over the network 109 with a web browser, such as INTERNET EXPLORER® (sold by Microsoft Corp., Redmond, Wash.). The user device 106 may enable a user to view pages over the network 109, such as the Internet. The user device 106 may be the user device described below with respect to FIG. 6.
  • The user device 106 may be configured to allow a user to interact with the search engine 102, the ad server 103, or other components of the system 100. In one embodiment, the user device 106 may receive and display a site or page provided by the search engine 102. The user device 106 may include a keyboard, keypad or a cursor control device, such as a mouse, or a joystick, touch screen display, remote control or any other device operative to allow a user to interact with the page(s) provided by the search engine 102 and/or the ad server 103.
  • The search engine 102 is coupled with the user device 106 through the network 109, as well as being coupled with the language analyzer 104, the ad server 103 and/or the search log database 112. In one embodiment, the search engine 102 is a web server. The search engine 102 may provide a site or a page over a network, such as the network 109 or the Internet. A site or page may refer to a web page or a series of related web pages which may be received or viewed over a network. The site or page is not limited to a web page, and may include any information accessible over a network that may be displayed at the user device 106. In one embodiment, a site may refer to a series of pages which are linked by a site map. For example, the web site of www.yahoo.com (operated by Yahoo! Inc., in Sunnyvale, Calif.) may include thousands of pages, which are included at yahoo.com. Hereinafter, a page will be described as a web page, a web site, or any other site/page accessible over a network. A user of the user device 106 may access a page provided by the search engine 102 over the network 109. As described below, the page provided by the search engine 102 may be a search page that receives a search query from the user device 106 and provides search results that are based on the received search query.
  • The search engine 102 may include an interface, such as a web page, e.g., the web page which may be accessed on the World Wide Web at yahoo.com, which is used to search for pages which are accessible via the network 109. The user device 106, autonomously or at the direction of the user, may input a search query (also referred to as a user query, original query, search term or a search keyword) for the search engine 102. A single search query may include multiple words or phrases. The search engine 102 may perform a search for the search query and display the results of the search on the user device 106. The results of a search may include a listing of related pages or sites that is provided by the search engine 102 in response to receiving the search query.
  • The ad server 103 is coupled with the search engine 102 and/or the language analyzer 104. The ad server 103 may be configured to provide advertisements to the search engine 102. In an alternate embodiment, the search engine 102 and the ad server 103 may be a common component and/or the search engine 102 may select and provide advertisements. The ad server 103 may include or be coupled with an advertisement database that includes advertisements that are available to be displayed by the search engine 102 for sponsored searching. In addition, the advertisements may be associated with one or more search keywords. The search keywords may be purchased or bid on by advertisers. Accordingly, when that search keyword is searched for, the advertiser who purchased or placed the highest bid is selected and their advertisement is displayed. The ad server 103 may include or be coupled with a database, such as an advertisement database, that stores search keywords and the respective price or bid for each keyword from advertisers that is referenced for each search query. In one embodiment, a search query is received and compared with known search keywords or other search queries when the ad server 103 selects and provides the advertisement to the search engine 102.
  • The search log database 112 includes records or logs of at least a subset of the search queries entered in the search engine 102 over a period of time and may also be referred to as a search query log, search term database, keyword database or query database. In one embodiment, the search log database 112 may store the search keywords that are used by the ad server 103 in selecting an advertisement for a particular search query. The search log database 112 may include search queries from any number of users over any period of time. Alternatively, the search log database 112 may include records or logs of a subset of the queries or requests for data entered at the search engine 102 over a period of time. The search log database 112 may also store associations between search queries from the search engine 102. For example, a search query may be associated with a search keyword or other search queries after a conversion and comparison by the language analyzer 104 as discussed below.
  • The search log database 112 may also be coupled with a data source 113. The data source 113 may be an internal source of data, an external source of search data, or a combination of the two. An external data source may include search results from other search engines or other sources. For example, a search engine other than search engine 102 may be an external data source and provide search logs to the search log database 112. An internal data source may include search data or other data from the search engine 102. Other data may include other searching or web browsing tendencies identified by the search engine 102.
  • The search log database 112 may also be coupled with a unit dictionary 116. The unit dictionary 116 may be a database of user queries or search keywords that are coupled with one another as units. Units may also be referred to as concepts or topics and are sequences of one or more words that appear in search queries. For example, the search query “New York City law enforcement” may include two units, e.g. “New York City” may be one unit and “law enforcement” may be another unit. A unit is a phrase of common words that identify a single concept. As another example, the search query “Chicago art museums” may include two units, e.g. “Chicago” and “art museums.” The “Chicago” unit is a single word, and “art museums” is a two-word unit. Units identify common groups of keywords to maximize the efficiency and relevance of search results. The unit dictionary 116 may include Chinese related queries, as well as Chinese related units that include Chinese characters. Categorization of search queries into units is discussed in commonly owned U.S. Pat. No. 7,051,023 issued May 23, 2006, entitled “SYSTEMS AND METHODS FOR GENERATING CONCEPT UNITS FROM SEARCH QUERIES,” which is hereby incorporated by reference.
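  • As a rough illustration of how a query might be split into units, the following sketch applies a greedy longest-match against a small in-memory set. The `UNITS` set, the function name `segment_into_units`, and the greedy strategy itself are assumptions made for illustration only; the unit generation actually contemplated is the one described in the incorporated U.S. Pat. No. 7,051,023.

```python
# Hypothetical unit dictionary; a real one would be derived from query logs.
UNITS = {"new york city", "new york", "law enforcement", "chicago", "art museums"}


def segment_into_units(query, units=UNITS):
    """Split a query into units using greedy longest-match (an assumed strategy)."""
    words = query.lower().split()
    result = []
    i = 0
    while i < len(words):
        # Try the longest remaining span first, shrinking one word at a time.
        for j in range(len(words), i, -1):
            candidate = " ".join(words[i:j])
            # A single unknown word is kept as its own unit (fallback).
            if candidate in units or j == i + 1:
                result.append(candidate)
                i = j
                break
    return result


print(segment_into_units("New York City law enforcement"))
print(segment_into_units("Chicago art museums"))
```

  • Under these assumptions, the two example queries above may each be split into the two units named in the text.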
  • The unit dictionary 116 and the categorization of search queries into units may be used to compare and analyze search queries received by the search engine 102. A search query may be broken into units that are compared with units from other queries or search keywords. In one embodiment, past search queries and search keywords are stored in the search log database 112 as units that may be used in an analysis by the language analyzer 104.
  • In one embodiment, the ad server 103, the search engine 102 and/or the search log database 112 may be coupled with the language analyzer 104. The language analyzer 104 receives a user query from the user device 106 and matches or identifies other queries or search keywords. The user query may be converted to a different form for comparing various features of the user query with search keywords as discussed with respect to FIG. 2.
  • The language analyzer 104 may be a computing device as described below with respect to FIG. 6. In one embodiment, the language analyzer 104 includes a processor 105, memory 107, software 108 and an interface 110. The language analyzer 104 may be a separate component from the search engine 102 and the ad server 103. In an alternative embodiment, any of the language analyzer 104, search engine 102, and the ad server 103 may be combined as a single component. The interface 110 may communicate with any of the search engine 102, search log database 112, and ad server 103. In one embodiment, the interface 110 may include a user interface configured to allow a user to interact with any of the components of the language analyzer 104. For example, a user may be able to modify the conversion form or comparison features that are used by the language analyzer 104.
  • The processor 105 in the language analyzer 104 may include a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP) or other type of processing device. The processor 105 may be a component in any one of a variety of systems. For example, the processor 105 may be part of a standard personal computer or a workstation. The processor 105 may be one or more general processors, digital signal processors, application specific integrated circuits, field programmable gate arrays, servers, networks, digital circuits, analog circuits, combinations thereof, or other now known or later developed devices for analyzing and processing data. The processor 105 may operate in conjunction with a software program, such as code generated manually (i.e., programmed).
  • The processor 105 may be coupled with a memory 107, or the memory 107 may be a separate component. The interface 110 and/or the software 108 may be stored in the memory 107. The memory 107 may include, but is not limited to, computer readable storage media such as various types of volatile and non-volatile storage media, including random access memory, read-only memory, programmable read-only memory, electrically programmable read-only memory, electrically erasable read-only memory, flash memory, magnetic tape or disk, optical media and the like. In one embodiment, the memory 107 includes a random access memory for the processor 105. In alternative embodiments, the memory 107 is separate from the processor 105, such as a cache memory of a processor, the system memory, or other memory. The memory 107 may be an external storage device or database for storing recorded image data. Examples include a hard drive, compact disc (“CD”), digital video disc (“DVD”), memory card, memory stick, floppy disc, universal serial bus (“USB”) memory device, or any other device operative to store image data. The memory 107 is operable to store instructions executable by the processor 105.
  • The functions, acts or tasks illustrated in the figures or described herein may be performed by the programmed processor executing the instructions stored in the memory 107. The functions, acts or tasks are independent of the particular type of instruction set, storage media, processor or processing strategy and may be performed by software, hardware, integrated circuits, firmware, micro-code and the like, operating alone or in combination. Likewise, processing strategies may include multiprocessing, multitasking, parallel processing and the like. The processor 105 is configured to execute the software 108. The software 108 may include instructions for analyzing and converting search queries and comparing features with other queries or search keywords.
  • The interface 110 may be a user input device or a display. The interface 110 may include a keyboard, keypad or a cursor control device, such as a mouse, or a joystick, touch screen display, remote control or any other device operative to interact with the language analyzer 104. The interface 110 may include a display coupled with the processor 105 and configured to display an output from the processor 105. The display may be a liquid crystal display (LCD), an organic light emitting diode (OLED), a flat panel display, a solid state display, a cathode ray tube (CRT), a projector, a printer or other now known or later developed display device for outputting determined information. The display may act as an interface for the user to see the functioning of the processor 105, or as an interface with the software 108 for providing input parameters. In particular, the interface 110 may allow a user to interact with the language analyzer 104 to establish a conversion of a user query and the features that are compared in matching a query with a search keyword.
  • Any of the components in system 100 may be coupled with one another through a network. For example, the language analyzer 104 may be coupled with the search engine 102, search log database 112, or ad server 103 via a network. Any of the components in system 100 may include communication ports configured to connect with a network. The present disclosure contemplates a computer-readable medium that includes instructions or receives and executes instructions responsive to a propagated signal, so that a device connected to a network can communicate voice, video, audio, images or any other data over a network. The instructions may be transmitted or received over the network via a communication port or may be a separate component. The communication port may be created in software or may be a physical connection in hardware. The communication port may be configured to connect with a network, external media, display, or any other components in system 100, or combinations thereof. The connection with the network may be a physical connection, such as a wired Ethernet connection or may be established wirelessly as discussed below. Likewise, the connections with other components of the system 100 may be physical connections or may be established wirelessly.
  • The network or networks that may connect any of the components in the system 100 to enable communication of data between the devices may include wired networks, wireless networks, or combinations thereof. The wireless network may be a cellular telephone network, a network operating according to a standardized protocol such as IEEE 802.11, 802.16, 802.20, published by the Institute of Electrical and Electronics Engineers, Inc., or a WiMax network. Further, the network(s) may be a public network, such as the Internet, a private network, such as an intranet, or combinations thereof, and may utilize a variety of networking protocols now available or later developed including, but not limited to TCP/IP based networking protocols. The network(s) may include one or more of a local area network (LAN), a wide area network (WAN), a direct connection such as through a Universal Serial Bus (USB) port, and the like, and may include the set of interconnected networks that make up the Internet. The network(s) may include any communication method or employ any form of machine-readable media for communicating information from one device to another. For example, the ad server 103 or the search engine 102 may provide pages to the user device 106 over a network, such as the network 109. The network or networks described above, including the network 109, may be the network discussed below with respect to FIG. 6.
  • The ad server 103, the search engine 102, the search log database 112, the language analyzer 104, the unit dictionary 116 and/or the user device 106 may represent computing devices of various kinds, such as the components described with respect to FIG. 6. Such computing devices may generally include any device that is configured to perform computation and that is capable of sending and receiving data communications by way of one or more wired and/or wireless communication interfaces. Such devices may be configured to communicate in accordance with any of a variety of network protocols, as discussed above. For example, the user device 106 may be configured to execute a browser application that employs HTTP to request information, such as a web page, from the search engine 102 or ad server 103. The present disclosure contemplates a computer-readable medium that includes instructions or receives and executes instructions responsive to a propagated signal, so that any device connected to a network can communicate voice, video, audio, images or any other data over a network.
  • FIG. 2 illustrates an embodiment of a language analyzer. As described with respect to FIG. 1, the language analyzer 104 may convert a search query into a different form for comparing its features with other queries or search keywords that are used for selecting matching advertisements to be displayed on a search results page. The language analyzer 104 may include a receiver 202, a converter 204, a comparator 206, and a calculator 208. As shown, the language analyzer 104 or any of its components may represent computing devices of various kinds, such as the components described with respect to FIG. 6.
  • The receiver 202 may receive a user query from the search engine 102, which may receive the user query from the user device 106. The receiver 202 may also receive search keywords from the ad server 103. The search keywords may be matched with advertisements, such that when a user inputs the search keyword in a search engine, the search results page includes the matched advertisement. Accordingly, the language analyzer 104 may match user queries with search keywords for selecting advertisements to be displayed on the search query results page.
  • The converter 204 is coupled with the receiver 202. The converter 204 receives the user query or other search keywords and converts them into a different form for comparison. As described, the user query may be a Chinese related query and the converter 204 may convert the Chinese related query into a different form to aid comparison. A Chinese related query may include any Chinese characters, including Roman characters that represent a Chinese character or phrase. Chinese related queries may also include queries that originate from or are received by a Chinese search engine and may be simplified Chinese and/or traditional Chinese.
  • FIG. 3 illustrates exemplary conversion forms. In particular, the converter 204 may utilize any of the conversion forms 302 to convert a Chinese related query. The converter 204 may convert a search query into any of the conversion forms 302 to compare the query with other converted queries or converted search terms. As described below, the conversion may include a transformation of the query by adding, deleting, and/or substituting characters or words in the queries. The conversion or transformation may result in a common format or common form that may be used for comparing the queries. The conversion forms 302 shown in FIG. 3 are merely exemplary. In alternate embodiments, there may be additional conversion forms 302 that are not illustrated or described. The conversion may receive a Chinese related query and convert each element or selected elements of the query into an array that represents the converted form of the Chinese related query.
  • A first conversion form is a conversion into Chinese soundex 304. The Chinese characters are converted into pinyin without tone, while the roman letters remain. The query is then converted into a Chinese soundex-like representation as follows. First, the first letter of the string is retained. Second, all occurrences of a, e, h, i, o, and u are removed, except where one of them is the first letter. Third, certain letter sequences may be replaced, for example, replacing “zh” with “z,” “ch” with “c,” “sh” with “s,” “ng” with “n,” “rd” with “d,” “rl” with “l,” “rn” with “n,” “rs” with “s,” and/or “rt” with “t.” Fourth, the remaining letters after the first letter are assigned a number, such as (m, n, l)=1, (b, p)=2, (f, v, w, h)=3, (d, t)=4, (j, z, s, x, q, c, g, k)=5, (r)=6, (y)=7, and (a)=8. Fifth, if two or more letters are adjacent, then the first letter remains and the others are omitted. Sixth, the spaces are removed. Seventh, all characters remaining are returned.
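  • The seven steps above may be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the input is assumed to have already been converted into pinyin without tone, the steps are applied in the literal order listed, and the fifth step is interpreted as collapsing runs of identical adjacent symbols. The function name `chinese_soundex` is hypothetical.

```python
from itertools import groupby

# Letter-sequence replacements listed in the text, applied in the order given.
REPLACEMENTS = [("zh", "z"), ("ch", "c"), ("sh", "s"), ("ng", "n"),
                ("rd", "d"), ("rl", "l"), ("rn", "n"), ("rs", "s"), ("rt", "t")]

# Letter-to-digit groups listed in the text.
GROUPS = {"mnl": "1", "bp": "2", "fvwh": "3", "dt": "4",
          "jzsxqcgk": "5", "r": "6", "y": "7", "a": "8"}
CODES = {letter: digit for letters, digit in GROUPS.items() for letter in letters}


def chinese_soundex(pinyin):
    """Return a soundex-like code for a toneless pinyin string."""
    s = pinyin.strip().lower()
    if not s:
        return ""
    # Steps 1-2: keep the first letter, drop a/e/h/i/o/u everywhere else.
    s = s[0] + "".join(ch for ch in s[1:] if ch not in "aehiou")
    # Step 3: apply the listed replacements.
    for old, new in REPLACEMENTS:
        s = s.replace(old, new)
    # Step 4: map every letter after the first to its digit (others unchanged).
    coded = "".join(CODES.get(ch, ch) for ch in s[1:])
    # Step 5 (assumed interpretation): collapse runs of identical symbols.
    coded = "".join(symbol for symbol, _ in groupby(coded))
    # Steps 6-7: remove spaces and return what remains.
    return (s[0] + coded).replace(" ", "")


print(chinese_soundex("zhong guo"), chinese_soundex("zong guo"))
```

  • Under this interpretation, the spellings “zhong guo” and “zong guo” map to the same code, which is the kind of tolerance to spelling variation that the soundex form may provide.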
  • A second conversion form is converting the Chinese characters into the keyboard input form zhuyin (Bopomofo) 306. Each element in the array is either all zhuyin characters for one corresponding Chinese character or a roman character originally in the query without transformation. A third conversion form is a similar zhuyin (Bopomofo) conversion 308, except each element in the array is either one zhuyin character or a roman character originally in the query without transformation.
  • A fourth conversion form is converting Chinese characters into radicals 310. Each element in the array is either the radical for a Chinese character or the roman character originally in the query without transformation. A radical 310 may be the semantic root (i.e., portion bearing the meaning) of a Chinese character. A radical may be part of a Chinese character and/or the semantic component of this Chinese character. For example, in the character 姐, pronounced as jie with a meaning of “sister”, the left part 女 (pronounced nǚ in Mandarin Chinese) is the semantic component. Chinese characters may have at least one or two radicals. The radicals may be used for Chinese Hanzi. A dictionary may be used to match a Chinese character with its radical(s). When a Chinese character has multiple radicals, the most meaningful radical (which may be identified in a dictionary) may be considered for comparison.
  • A fifth conversion form is converting Chinese characters into pinyin without tone 412. Each element in the array is either the complete pinyin without tone for one corresponding Chinese character or a roman character originally in the query without any transformation. A sixth conversion form is converting Chinese characters into pinyin without tone 414 in which each element in the array is either one pinyin character or a roman character originally in the query without transformation. Pinyin may be a Standard Mandarin Romanization system. In pinyin, the pin refers to a “spelling” and the yin refers to a “sound.” There may be a pinyin corresponding to each Chinese Character. One pinyin may include more than two roman characters. In the fifth conversion form, each pinyin may be a unit for similarity comparison. In the sixth conversion form, each character within pinyin may be a unit for comparison.
  • A seventh conversion form is converting Chinese characters into pinyin with tone 416. Each element in the array is either the complete pinyin and its tone for one corresponding Chinese character or a roman character originally in the query without transformation. An eighth conversion form is converting Chinese characters into pinyin with tone 418 in which each element in the array is either one pinyin character, its tone, or a roman character originally in the query without transformation. A ninth conversion form is converting queries into two character-based arrays 420. In particular, if a character is Chinese, its three-byte UTF-8 encoding is an element in the array. In other words, each Chinese character is represented in three bytes. If a character is roman, then the roman character itself is an element.
  • A tenth conversion form is the removal of Chinese characters 422. The roman characters are left in the query and the Chinese characters are removed. Likewise, an eleventh conversion form removes the roman characters 424, and keeps the Chinese characters in the query. A twelfth conversion form includes leaving the query as inputted 426. In other words, the twelfth conversion form is no conversion 426.
  • In one embodiment, the receiver 202 receives two queries that are to be compared to determine the similarity between those queries. The queries are converted into at least one of the conversion forms by the converter 204. In one embodiment, both queries are converted into the twelve exemplary conversion forms 302 and the queries are compared in all twelve converted forms. Alternatively, certain conversion forms are selected for converting the queries and the queries are compared for each of those converted forms.
  • After being converted, the queries may be compared by the comparator 206. The comparator 206 may be configured to perform comparison of a user's search query with other queries or with search keywords that are used by the ad server 103 for displaying relevant advertisements that are linked to particular search keywords. In one embodiment, the comparator 206 determines the similarity between two queries. The queries are first converted into a similar form or similar forms by the converter 204 and each of those forms is compared by the comparator 206. In one embodiment, the queries are converted into the twelve forms illustrated in FIG. 3 and the comparator 206 makes twelve comparisons between the queries for each of the twelve conversions of each query. In alternative embodiments, there may be more or fewer conversion forms that are compared by the comparator 206.
  • In one embodiment, a user query may be compared with a candidate set of queries to determine which of the candidate set is most similar to the user query. The candidate set may be made up of search keywords which are compared with the user query to determine which search keyword is most similar. The candidate set of queries or keywords for comparison may be chosen based on an initial analysis of the user query compared with the search log database 112. In one embodiment, when the user query is received the candidate set is identified and each member of the candidate set is compared with the user query to determine which is most similar. As described below, a similarity score may be calculated for each member of the candidate set that represents a similarity with the user query. The member of the candidate set with the closest similarity score may be most similar to the user query. In an alternative embodiment, the candidate set may include one query or include all queries, such as those stored in the search log database 112.
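  • Selecting the most similar member of a candidate set may be sketched as follows. The scoring function here is a stand-in: the ratio from Python's standard `difflib.SequenceMatcher` is used purely as a placeholder for the similarity score discussed in this description, and a higher score is assumed to mean greater similarity. The function names `placeholder_score` and `most_similar` are hypothetical.

```python
from difflib import SequenceMatcher


def placeholder_score(query, candidate):
    # Stand-in similarity in [0, 1]; not the similarity score of the disclosure.
    return SequenceMatcher(None, query, candidate).ratio()


def most_similar(user_query, candidates, score=placeholder_score):
    """Return the candidate with the highest score against the user query."""
    return max(candidates, key=lambda candidate: score(user_query, candidate))


candidates = ["yahoo mail", "google maps", "auto insurance"]
print(most_similar("yaho mail", candidates))
```

  • Passing the scoring function as a parameter keeps the selection step independent of which comparison features contribute to the score.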
  • FIG. 4 illustrates exemplary comparisons of queries. In particular, the comparator 206 may utilize comparison features 402 when comparing queries. The comparison features 402 shown in FIG. 4 are merely exemplary. In alternate embodiments, there may be additional comparison features 402 that are not illustrated or described. The comparison may involve comparing various forms of converted Chinese related queries. In particular, the comparator 206 may compare an array of elements that is generated by the converter 204 as a converted form of a Chinese related query. In one embodiment, the comparator 206 may compare queries as described in the commonly owned U.S. application entitled, “SYSTEM AND METHOD FOR IDENTIFYING RELATED QUERIES FOR LANGUAGES WITH MULTIPLE WRITING SYSTEMS,” U.S. Pat. Pub. No. 2007/0203894, filed Feb. 28, 2006, the disclosure of which is hereby incorporated by reference.
  • A first comparison feature may be an edit distance 404 between two queries. The edit distance may be a measure of the difference between two character strings, such as queries. In one embodiment, the edit distance may be a minimum number of edit operations required to transform the first query into the second query. The edit operation may include inserting or deleting a character into a string or replacing a character by another character. In an alternative embodiment, weights may be assigned for different edit operations. For example, a higher weight may be placed on replacing the character s by the character p, than on replacing it by the character a. The edit distance may be the Levenshtein distance or the Damerau-Levenshtein distance when a transposition of characters counts as a single edit operation. In alternative embodiments, there may be other algorithms that are used for determining the edit distance between queries or there may be more or fewer edit operations that are used in determining an edit distance between queries.
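    The unweighted edit distance described above may be sketched as follows. This is a minimal illustration that assumes a cost of one for each insertion, deletion, or replacement (the Levenshtein distance); the function name levenshtein is hypothetical and is not part of the disclosure. The inputs may be strings or arrays of elements.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and replacements turning a into b."""
    # prev[j] is the distance between the already-processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # replace x by y, or keep it
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))
```

    A weighted variant may replace the constant costs with per-operation or per-character weights.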
  • A second comparison feature may be an edit distance without a domain 406. In particular, two queries may have their domains removed before computing the edit distance. The domain may be a web domain, such as “.com” that is removed. The removal of the domain may be helpful because a user querying “yahoo.com” and “yahoo.net” is likely making the same query. A third comparison feature may be a character level prefix overlap 408. The character level prefix overlap 408 may be a measure of the characters/words that are the same at the beginning of the queries. For example, “auto cleaners” and “auto cleaning” have a prefix overlap of “auto clean.” The prefix overlap may indicate increased similarity. A fourth comparison feature may be a character level suffix overlap 410. The character level suffix overlap 410 measures the similarity between queries at the end of the query. For example, “auto insurance agent” and “home insurance agent” share a suffix overlap of “insurance agent.” Similar to the prefix overlap, the suffix overlap may indicate increased similarity.
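    The domain removal and the character level overlaps may be sketched as follows. The function names and the short list of domains in the pattern are assumptions made for illustration only; the disclosure does not specify which domains are removed.

```python
import re

# Hypothetical, non-exhaustive list of web domains to strip from a query.
_DOMAIN_RE = re.compile(r"\.(com|net|org)\b", re.IGNORECASE)


def strip_domain(query):
    """Remove a web domain such as ".com" from the query."""
    return _DOMAIN_RE.sub("", query)


def prefix_overlap(q1, q2):
    """Longest run of characters shared at the beginning of both queries."""
    n = 0
    for x, y in zip(q1, q2):
        if x != y:
            break
        n += 1
    return q1[:n]


def suffix_overlap(q1, q2):
    """Longest run of characters shared at the end of both queries."""
    return prefix_overlap(q1[::-1], q2[::-1])[::-1]


print(strip_domain("yahoo.com") == strip_domain("yahoo.net"))
print(prefix_overlap("auto cleaners", "auto cleaning"))
```

    At the character level, the suffix overlap of “auto insurance agent” and “home insurance agent” also includes the space that precedes “insurance,” so a caller may strip surrounding whitespace before reporting it.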
  • A fifth comparison feature may be a minimum edit distance 412 over all the conversion forms. Likewise, a sixth comparison feature may be a maximum edit distance 414 over all the conversion forms. Given twelve conversion forms and twelve edit distances for each conversion, the minimum edit distance 412 and the maximum edit distance 414 may be identified. In one embodiment, the minimum and maximum may be removed as outliers. Alternatively, the minimum or maximum may be weighted higher when computing a similarity score. A seventh comparison feature may be a minimum edit distance without a domain 416 and an eighth comparison feature may be a maximum edit distance without a domain 418. As discussed above, the domains in a query may not be valuable in terms of determining what the user is searching for, so the domains are removed before comparison.
  • Additional comparison features may be a word level edit distance 420, a word level prefix overlap 422, or a word level suffix overlap 424. The word level comparisons are similar to the character level comparisons, except entire words are compared rather than individual characters. A length difference 426 between two queries may also be used for comparing.
  • The comparator 206 may be coupled with a calculator 208 that may calculate a similarity score. The similarity score may be a measure of the similarity between the queries. The similarity score may be calculated based on individual comparisons of different conversion forms of two queries with each individual comparison being assigned a weighted value. The multiple conversion forms described with respect to FIG. 3 may each result in a separate comparison between two queries. Accordingly, using the twelve conversion forms 302, there may be twelve different edit distances or similarity scores, one for each conversion. Those twelve converted forms may be compared and multiplied by a weight for each form to get an overall similarity score between the queries. Alternatively, a subset of the twelve conversion forms or additional conversion forms not described may be utilized to convert Chinese related queries into different forms for comparison.
  • In one embodiment, the equation presented in Table A may be used to calculate a similarity score indicating the strength of similarity between a query pair. The query pair may include a given query q and a comparison query MODS(q), either of which may be written according to one or more Chinese writing systems. MODS(q) may represent a converted query. In alternative embodiments, both q and MODS(q) may be converted to the same form for comparison, or MODS(q) may be converted into a form for comparison with q. MODS(q) may represent a related query that is identified as a potential substitute for the user query q. When MODS(q) has a good similarity score with the user's original query q, MODS(q) may be used as a search keyword for fetching advertisements. MODS(q) may also be referred to as a rewritten query. The equation in Table A makes use of a subset of the conversion forms 302 and the comparison features 402 that are discussed above. In alternative embodiments, different conversion forms or comparison features may be utilized to generate a similarity score. Those of skill in the art recognize that the equation illustrated in Table A is merely exemplary and may be modified so as to provide for the calculation of a similarity score for multiple writing systems. A formula may be optimized based on the source of the query, because queries from Taiwan may be different from queries from Hong Kong. Accordingly, the conversions, comparisons, and weights may be modified for different types of queries.
    TABLE A
    $$
    \begin{aligned}
    \mathrm{LM1Score}(q, \mathrm{MODS}(q)) ={} & 2.542 \\
    & - 0.1778 \times \mathrm{pq12min}(q, \mathrm{MODS}(q)) \\
    & - 0.3316 \times \mathrm{levroman}(q, \mathrm{MODS}(q)) \\
    & - 1.064 \times \mathrm{agreechar}(q, \mathrm{MODS}(q)) \\
    & + 1.098 \times \mathrm{dlevpynchar}(q, \mathrm{MODS}(q)) \\
    & - 0.2432 \times \mathrm{q1bidded}(q, \mathrm{MODS}(q)) \\
    & + 0.3486 \times \mathrm{wordr}(q, \mathrm{MODS}(q)) \\
    & + 0.2487 \times \mathrm{q2hasroman}(q, \mathrm{MODS}(q)) \\
    & - 0.1284 \times \mathrm{pq12max}(q, \mathrm{MODS}(q)) \\
    & - 0.4667 \times \mathrm{levtaiwanchar}(q, \mathrm{MODS}(q)) \\
    & - 0.2875 \times \mathrm{lengthdiffn}(q, \mathrm{MODS}(q)) \\
    & - 0.0006 \times \mathrm{entropy21min}(q, \mathrm{MODS}(q)) \\
    & - 0.2875 \times \mathrm{lengthsubtminGT3}(q, \mathrm{MODS}(q))
    \end{aligned}
    $$
  • According to the equation presented in Table A, q represents a given query written according to one or more Chinese writing systems and MODS(q) represents a query selected from a candidate set of potential queries related to query q. Alternatively, query q may be referred to as query q1 and MODS(q) may be referred to as query q2 or q′. The initial number before each feature is a weight that may be used to emphasize or deemphasize features. The exemplary features utilized in the equation presented in Table A are described below.
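    The equation in Table A is a constant plus a weighted sum of feature values, which may be sketched as follows. The constant and weights are copied from Table A and the feature names follow the table; the function name lm1_score, the dictionary layout, and the treatment of a missing feature as zero are assumptions made for illustration.

```python
# Constant term and per-feature weights as listed in Table A.
INTERCEPT = 2.542
WEIGHTS = {
    "pq12min": -0.1778,
    "levroman": -0.3316,
    "agreechar": -1.064,
    "dlevpynchar": 1.098,
    "q1bidded": -0.2432,
    "wordr": 0.3486,
    "q2hasroman": 0.2487,
    "pq12max": -0.1284,
    "levtaiwanchar": -0.4667,
    "lengthdiffn": -0.2875,
    "entropy21min": -0.0006,
    "lengthsubtminGT3": -0.2875,
}


def lm1_score(features):
    """Weighted sum of feature values for one query pair.

    Assumption: a feature missing from the mapping contributes zero.
    """
    return INTERCEPT + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())


print(lm1_score({"agreechar": 5 / 7, "levroman": 0.25, "wordr": 2 / 3}))
```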
  • Pq12min may be a function for calculating the query substitution probability of query q1 following query q2 in a log of user query sessions, such as from the search log database 112. The search log database 112 may identify the order of the one or more queries submitted by the user, for example, to provide an indication of how the user refined a query, how the user rewrote a query, how the user utilized one or more alternate writing systems of a language with multiple writing systems to express a query q, etc. When queries q1 and q2 follow one another in a search log database 112, it may be an indication that they are similar because q2 may be a refinement of q1. According to one embodiment, the pq12min function calculates a query substitution probability of a given query q1 following a given query q2, and may also be used to calculate a unit substitution of a unit u following a given unit u′. In one embodiment, pq12min=prob(U_i−>U_i′|U_i)/max_j prob(U_i−>U_j|U_i), where U_i is q1 or its units, U_i′ is possible U_i substitutions, and U_j is q2 or its units. For query suggestions, pq12min may be the normalized probability of q2 as q1's substitution. In one embodiment, a normalized probability is computed for the units in q1 substituted by corresponding units in q2, and the minimum is taken as pq12min.
  • Levroman is a comparison using the roman characters of a query, such as with conversion form 322, which removes Chinese characters. For each query all non-roman characters may be removed, but spaces are left in the query. The roman character parts are changed into arrays. Each roman character is an element in the array, including any spaces. The Levenshtein distance is measured between the two arrays. In the case that neither q1 nor q2 has roman characters, levroman is set to 0. In the case that one of q1 or q2 has roman characters but the other does not have roman characters, levroman is set to 1. As an example, consider a first query q1=
    Figure US20080077588A1-20080327-P00001
    map” and a second query q2=
    Figure US20080077588A1-20080327-P00002
    map.” The first query does not include a space before map, but the second query includes a space before map. After the Chinese characters are removed, the queries are converted into arrays, in which q1 is represented as the array:
    Figure US20080077588A1-20080327-C00001

    and q2 is represented as the array:
    Figure US20080077588A1-20080327-C00002

    The Levenshtein distance between the two arrays is one because of the space in the first element of q2. Accordingly, because the longer array has four elements, the Levenshtein distance may be represented as ¼=0.25 and levroman is 0.25 for this query pair.
  • Agreechar may relate to character agreement without removing a space regardless of the order of characters. Agreechar may be similar to wordr discussed below, except it is for the character level rather than the word level. In one embodiment, agreechar is the proportion of unique characters in common between a query pair, such as: $\mathrm{agreechar} = \frac{|C_{q1} \cap C_{q2}|}{|C_{q1} \cup C_{q2}|}$,
    in which Cq1 is the set of unique characters (including space) in q1, and Cq2 is the set of unique characters (including space) in q2. In the levroman example, q1 and q2 have 7 unique characters in total, which are
    Figure US20080077588A1-20080327-P00005
    Figure US20080077588A1-20080327-P00003
    “m”, “a”, “p” and a space. Query q1 and q2 share 5 unique characters, which are
    Figure US20080077588A1-20080327-P00004
    “m”, “a” and “p”. Therefore, agreechar is 0.714 (calculated by 5/7) for this query pair.
  • Wordr is similar to agreechar except it matches words rather than characters. The queries are separated into words, segments, or units as described above. The percentage of unique words not in common is determined for wordr. In other words, wordr=1−proportion of unique words in common, such as $\mathrm{wordr} = 1 - \frac{|w_{q1} \cap w_{q2}|}{|w_{q1} \cup w_{q2}|}$,
    in which wq1 is the set of unique words in q1, and wq2 is the set of unique words (including space) in q2. In the previous example of levroman,
    Figure US20080077588A1-20080327-P00006
    Figure US20080077588A1-20080327-P00009
    map” is segmented into two words
    Figure US20080077588A1-20080327-P00008
    and “map” and
    Figure US20080077588A1-20080327-P00007
    map” is segmented into two words
    Figure US20080077588A1-20080327-P00010
    and “map”. There are three unique words and one of them is common between q1 and q2, so wordr is 1−⅓=0.666.
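    The wordr computation may be sketched as follows. Segmentation is assumed to have been performed already, so the function takes lists of words; the segmented words shown are hypothetical stand-ins for the example pair, and the value returned for two empty word lists is an assumption.

```python
def wordr(words1, words2):
    """One minus the proportion of unique words shared by two segmented queries."""
    w1, w2 = set(words1), set(words2)
    union = w1 | w2
    if not union:
        return 0.0  # assumption: two empty queries are scored as 0
    return 1 - len(w1 & w2) / len(union)


# Hypothetical segmentation: three unique words in total, one in common ("map").
print(wordr(["北京市", "map"], ["北京", "map"]))
```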
  • Dlevpynchar utilizes the complete pinyin without tone 312 conversion form. The first query q1 and second query q2 first have a common domain removed and each roman character (including spaces) is kept, while each Chinese character is converted into pinyin without tone. The queries are then transformed into arrays. Each roman character is an element in the array and each Chinese character's pinyin without tone is an element in the array. The Levenshtein distance is then measured. In the example described above, when query q1
    Figure US20080077588A1-20080327-P00011
    map” and query q2
    Figure US20080077588A1-20080327-P00012
    map” where there is no space in query q1, but there is a space in query q2. The first query q1 is converted into an array:
    Figure US20080077588A1-20080327-C00003

    The second query q2 is converted into an array:
    Figure US20080077588A1-20080327-C00004

    The Levenshtein distance is computed between the two arrays to be ⅙=0.167, which may also be the dlevpynchar value for this query pair.
  • Q1bidded is 1 if q1 is bidded and q1bidded is 0 if q1 is not bidded. When q1 is a user query and q1 is bidded, it may mean that an advertiser chooses q1 as a keyword for the advertisements they want to show. This bidding process may also identify a cost they would like to pay if web searchers click the ads fetched by the keyword. When q1 is not bidded that may mean there are no matched keywords in the advertisement database. Therefore, a query identifying system may identify a related query (e.g. MODS(q)) to substitute for the user query.
  • Q2hasroman is 1 if q2 contains any roman characters, but not including any spaces. Q2hasroman is 0 if q2 does not contain any roman characters. The queries that are analyzed may be from a Chinese search engine or from a search engine that receives Chinese related queries. A Chinese search engine may receive queries with roman characters due to the usage of roman characters in Chinese and the popularity of roman character based languages such as English. The Chinese characters and roman characters may be processed differently. For example, a Chinese character may be converted into Pinyin for a similarity comparison, while Roman characters are not converted into Pinyin. Accordingly, a similarity score computation may be adjusted based on the presence of Roman characters.
  • Pq21max may be a function for calculating the query substitution probability of query q1 following query q2 in a log of user query sessions, such as from the search log database 112. In one embodiment, pq21max=prob(U_i−>U_i′|U_i′)/max_j prob(U_i−>U_j|U_j), where U_i is q1 or its units, U_i′ is possible U_i substitutions, and U_j is q2 or its units. The normalized probability may be calculated according to the above equation for each unit pair in the query pair, and the maximum is used as pq21max.
  • Levtaiwanchar utilizes the removal of roman characters 324 conversion. In particular, all non-Chinese characters are removed and the remaining Chinese character parts are put into an array where each Chinese character is an element in the array. The Levenshtein distance is measured between the two arrays. When neither query q1 nor query q2 includes Chinese characters, levtaiwanchar is 0. When only one of q1 or q2 has Chinese characters levtaiwanchar is 1. In the example described above, when query q1=
    Figure US20080077588A1-20080327-P00013
    map” and query q2=
    Figure US20080077588A1-20080327-P00014
    where there is no space in query q1, but there is a space in query q2. The first query q1 is converted into an array:
    Figure US20080077588A1-20080327-P00017
  • Query q2 becomes the array:
    Figure US20080077588A1-20080327-P00018
  • Accordingly, the Levenshtein distance is computed between the two arrays, which is ⅓=0.333 and levtaiwanchar is 0.333 for this query pair.
  • Lengthdiffn is the length difference in characters between q1 and q2, which is normalized by their maximum length in characters. In one embodiment, lengthdiffn is: $\mathrm{lengthdiffn} = \frac{\mathrm{abs}(|q_1| - |q_2|)}{\max(|q_1|, |q_2|)}$.
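    The normalized length difference may be sketched as follows. The value returned when both queries are empty is an assumption, since the expression above is undefined in that case.

```python
def lengthdiffn(q1, q2):
    """Absolute length difference in characters, divided by the longer length."""
    longest = max(len(q1), len(q2))
    if longest == 0:
        return 0.0  # assumption: two empty queries have no length difference
    return abs(len(q1) - len(q2)) / longest


print(lengthdiffn("map", "maps"))
```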
  • Entropy21min is an uncertainty that may be associated with a similarity between q1 and q2. For a whole query substitution, $\mathrm{entropy21min} = \sum_i \frac{\mathrm{freq}(q_1 \to q_{2i})}{\mathrm{freq}(q_{2i})} \times \log\left(\frac{\mathrm{freq}(q_1 \to q_{2i})}{\mathrm{freq}(q_{2i})}\right)$,
    where i is the number of possible q1 query substitutions with q2. For unit substitution, $\mathrm{entropy21min} = \min_j \sum_i \frac{\mathrm{freq}(q_{1j} \to q_{2ji})}{\mathrm{freq}(q_{2ji})} \times \log\left(\frac{\mathrm{freq}(q_{1j} \to q_{2ji})}{\mathrm{freq}(q_{2ji})}\right)$,
    where j is the number of unit substitution between q1 and q2, and i is the number of possible q1j's unit substitutions.
  • LengthsubstminGT3 utilizes a substitution of characters. For query suggestions, lengthsubstminGT3 is 1 if the minimum length of q1 and q2 is less than 3 in characters. Otherwise, lengthsubstminGT3 is 0. For unit suggestions, lengthsubstminGT3 is 1 if the minimum length of any of the substitution units in characters is greater than 3. Otherwise, lengthsubstminGT3 is 0. Query suggestion may refer to a generation of related queries based on an original user query. The user query may be broken into units as described above. A related unit may be found for each unit and combined to form a related query. For example, when a user enters a query for “New York hotel,” it may be split into two units “New York” and “hotel.” “New York” may be rewritten to a related query “Manhattan” and “hotel” may be rewritten to “motel.” Accordingly, “Manhattan motel” may be a related candidate query for an original user query of “New York hotel.”
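    The unit-based generation of related queries in the “New York hotel” example may be sketched as follows. The table of related units, the function name, and the choice to keep a unit unchanged when no related unit is known are assumptions made for illustration; the user query is assumed to have been split into units already.

```python
from itertools import product

# Hypothetical table of related units; a real system might derive it from search logs.
RELATED_UNITS = {
    "New York": ["Manhattan"],
    "hotel": ["motel"],
}


def candidate_queries(units, related=RELATED_UNITS):
    """Combine one related unit per original unit into candidate related queries."""
    # Assumption: a unit with no known related unit is kept unchanged.
    options = [related.get(unit, [unit]) for unit in units]
    return [" ".join(choice) for choice in product(*options)]


print(candidate_queries(["New York", "hotel"]))
```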
  • As described, the equation in Table A and the corresponding features that are used to calculate a similarity score in the calculator 208 are exemplary. Alternatively, a different equation, different weights and different features may be utilized to compute a similarity score. For example, the edit distance may be computed for each of the comparison forms 302 and averaged to become the similarity score. Alternatively, weights may be added to each converted form, or additional comparison features 402 may be used.
  • In one embodiment, the equation that is used to determine the similarity score, such as the equation in Table A, is analyzed by comparing with a human or editorial control set. The editorial control set may include a human review of the similarity scores for pairs of queries to determine an accuracy of the equation used for calculating the similarity score. In one embodiment, the human review may be used to optimize the equation that calculates the similarity score. Human editors may label query pairs with a relevance score. The relevance score may be used as a training label for the similarity score calculation, such as for the weights used in the equation in Table A. The editorial score may be a response variable and/or a dependent variable. The model may be fitted using linear regression.
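    Fitting weights to editorial labels by linear regression may be sketched as follows. This is a deliberately simplified illustration with a single feature and an intercept, whereas the equation in Table A has many features; the data values are made up for the example and do not come from any editorial set.

```python
def fit_line(xs, ys):
    """Ordinary least squares for the model y = intercept + weight * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    weight = sxy / sxx
    intercept = mean_y - weight * mean_x
    return intercept, weight


# Made-up toy data: one feature value and one editorial label per query pair.
feature_values = [0.0, 0.25, 0.5, 1.0]
editorial_labels = [1.0, 1.5, 2.0, 3.0]

intercept, weight = fit_line(feature_values, editorial_labels)
print(intercept, weight)
```

    With several features, the same least-squares criterion may be solved with a standard multiple regression routine, with the editorial score as the response variable.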
  • FIG. 5 is an illustration for identifying related queries. In block 502, a user query is received. The user query may be Chinese-related and include at least one Chinese character. The user query may be received by a search engine 102. The user query may be compared with a selected candidate set of queries or search keywords in block 504. The candidate set may be selected from the search log database 112. In one embodiment, the candidate set may be chosen based on an initial comparison of similarity with the user query. The user query and/or the candidate set of queries may be converted into a different form or format for comparison, such as the conversion forms 302. The user query and a member of the candidate set are compared in block 508. In block 510, a similarity score is calculated to measure a similarity between the user query and the member of the candidate set. The similarity score may be based on utilizing any of the comparison features 402 for comparing a converted form of the user query with a converted form of the member. In block 512, another comparison at block 508 occurs for another member from the candidate set and continues until all members of the candidate set have been compared and have a similarity score. In block 514, the similarity scores for the candidate set may be reviewed to identify the member of the candidate set with the closest similarity score to the user query. The identification of a similar member, such as a similar search keyword, may be used to identify which advertisements to display for sponsored searching.
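    The loop of blocks 508 through 514 may be sketched as follows. The scoring function shown is a toy stand-in based on shared characters, not the equation in Table A, and whether a higher or a lower score indicates the closest match is left as a parameter because it depends on how the score is defined; all names are hypothetical.

```python
def char_jaccard(q1, q2):
    """Toy scorer (assumption): proportion of unique characters in common."""
    c1, c2 = set(q1), set(q2)
    union = c1 | c2
    return len(c1 & c2) / len(union) if union else 0.0


def most_similar(user_query, candidates, score=char_jaccard, higher_is_better=True):
    """Score every candidate against the user query and return (best_candidate, best_score)."""
    if not candidates:
        return None
    scored = [(candidate, score(user_query, candidate)) for candidate in candidates]
    pick = max if higher_is_better else min
    return pick(scored, key=lambda pair: pair[1])


best, best_score = most_similar(
    "auto cleaners", ["auto cleaning", "home insurance", "car wash"]
)
print(best, round(best_score, 3))
```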
  • Referring to FIG. 6, an illustrative embodiment of a general computer system is shown and is designated 600. The user device 106, ad server 103, the search engine 102, the search log database 112, the data source 113, the unit dictionary 116, and/or the language analyzer 104 may be a computer or computing devices, such as the computer system 600 or any of its components. The computer system 600 can include a set of instructions that can be executed to cause the computer system 600 to perform any one or more of the methods or computer based functions disclosed herein. The computer system 600 may operate as a standalone device or may be connected, e.g., using a network, to other computer systems or peripheral devices.
  • In a networked deployment, the computer system may operate in the capacity of a server or as a client user computer in a server-client user network environment, or as a peer computer system in a peer-to-peer (or distributed) network environment. The computer system 600 can also be implemented as or incorporated into various devices, such as a personal computer (PC), a tablet PC, a set-top box (STB), a personal digital assistant (PDA), a mobile device, a palmtop computer, a laptop computer, a desktop computer, a communications device, a wireless telephone, a land-line telephone, a control system, a camera, a scanner, a facsimile machine, a printer, a pager, a personal trusted device, a web appliance, a network router, switch or bridge, or any other machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine. In a particular embodiment, the computer system 600 can be implemented using electronic devices that provide voice, video or data communication. Further, while a single computer system 600 is illustrated, the term “system” shall also be taken to include any collection of systems or sub-systems that individually or jointly execute a set, or multiple sets, of instructions to perform one or more computer functions.
  • As illustrated in FIG. 6, the computer system 600 may include a processor 602, e.g., a central processing unit (CPU), a graphics processing unit (GPU), or both. The processor 602 may be a component in a variety of systems. For example, the processor 602 may be part of a standard personal computer or a workstation. The processor 602 may be one or more general processors, digital signal processors, application specific integrated circuits, field programmable gate arrays, servers, networks, digital circuits, analog circuits, combinations thereof, or other now known or later developed devices for analyzing and processing data. The processor 602 may implement a software program, such as code generated manually (i.e., programmed).
  • The computer system 600 may include a memory 604 that can communicate via a bus 608. The memory 604 may be a main memory, a static memory, or a dynamic memory. The memory 604 may include, but is not limited to computer readable storage media such as various types of volatile and non-volatile storage media, including but not limited to random access memory, read-only memory, programmable read-only memory, electrically programmable read-only memory, electrically erasable read-only memory, flash memory, magnetic tape or disk, optical media and the like. In one embodiment, the memory 604 includes a cache or random access memory for the processor 602. In alternative embodiments, the memory 604 is separate from the processor 602, such as a cache memory of a processor, the system memory, or other memory. The memory 604 may be an external storage device or database for storing data. Examples include a hard drive, compact disc (“CD”), digital video disc (“DVD”), memory card, memory stick, floppy disc, universal serial bus (“USB”) memory device, or any other device operative to store data. The memory 604 is operable to store instructions executable by the processor 602. The functions, acts or tasks illustrated in the figures or described herein may be performed by the programmed processor 602 executing the instructions stored in the memory 604. The functions, acts or tasks are independent of the particular type of instruction set, storage media, processor or processing strategy and may be performed by software, hardware, integrated circuits, firmware, micro-code and the like, operating alone or in combination. Likewise, processing strategies may include multiprocessing, multitasking, parallel processing and the like.
  • As shown, the computer system 600 may further include a display unit 614, such as a liquid crystal display (LCD), an organic light emitting diode (OLED), a flat panel display, a solid state display, a cathode ray tube (CRT), a projector, a printer or other now known or later developed display device for outputting determined information. The display 614 may act as an interface for the user to see the functioning of the processor 602, or specifically as an interface with the software stored in the memory 604 or in the drive unit 606.
  • Additionally, the computer system 600 may include an input device 616 configured to allow a user to interact with any of the components of system 600. The input device 616 may be a number pad, a keyboard, or a cursor control device, such as a mouse, or a joystick, touch screen display, remote control or any other device operative to interact with the system 600.
  • In a particular embodiment, as depicted in FIG. 6, the computer system 600 may also include a disk or optical drive unit 606. The disk drive unit 606 may include a computer-readable medium 610 in which one or more sets of instructions 612, e.g. software, can be embedded. Further, the instructions 612 may embody one or more of the methods or logic as described herein. In a particular embodiment, the instructions 612 may reside completely, or at least partially, within the memory 604 and/or within the processor 602 during execution by the computer system 600. The memory 604 and the processor 602 also may include computer-readable media as discussed above.
  • The present disclosure contemplates a computer-readable medium that includes instructions 612 or receives and executes instructions 612 responsive to a propagated signal, so that a device connected to a network 620 can communicate voice, video, audio, images or any other data over the network 620. Further, the instructions 612 may be transmitted or received over the network 620 via a communication port 618. The communication port 618 may be a part of the processor 602 or may be a separate component. The communication port 618 may be created in software or may be a physical connection in hardware. The communication port 618 is configured to connect with a network 620, external media, the display 614, or any other components in system 600, or combinations thereof. The connection with the network 620 may be a physical connection, such as a wired Ethernet connection or may be established wirelessly as discussed below. Likewise, the additional connections with other components of the system 600 may be physical connections or may be established wirelessly.
  • The network 620 may include wired networks, wireless networks, or combinations thereof. The wireless network may be a cellular telephone network, an 802.11, 802.16, 802.20, or WiMax network. Further, the network 620 may be a public network, such as the Internet, a private network, such as an intranet, or combinations thereof, and may utilize a variety of networking protocols now available or later developed including, but not limited to TCP/IP based networking protocols.
  • While the computer-readable medium is shown to be a single medium, the term “computer-readable medium” includes a single medium or multiple media, such as a centralized or distributed database, and/or associated caches and servers that store one or more sets of instructions. The term “computer-readable medium” shall also include any medium that is capable of storing, encoding or carrying a set of instructions for execution by a processor or that cause a computer system to perform any one or more of the methods or operations disclosed herein.
  • In a particular non-limiting, exemplary embodiment, the computer-readable medium can include a solid-state memory such as a memory card or other package that houses one or more non-volatile read-only memories. Further, the computer-readable medium can be a random access memory or other volatile re-writable memory. Additionally, the computer-readable medium can include a magneto-optical or optical medium, such as a disk or tapes or other storage device to capture carrier wave signals such as a signal communicated over a transmission medium. A digital file attachment to an e-mail or other self-contained information archive or set of archives may be considered a distribution medium that is a tangible storage medium. Accordingly, the disclosure is considered to include any one or more of a computer-readable medium or a distribution medium and other equivalents and successor media, in which data or instructions may be stored.
  • In an alternative embodiment, dedicated hardware implementations, such as application specific integrated circuits, programmable logic arrays and other hardware devices, can be constructed to implement one or more of the methods described herein. Applications that may include the apparatus and systems of various embodiments can broadly include a variety of electronic and computer systems. One or more embodiments described herein may implement functions using two or more specific interconnected hardware modules or devices with related control and data signals that can be communicated between and through the modules, or as portions of an application-specific integrated circuit. Accordingly, the present system encompasses software, firmware, and hardware implementations.
  • In accordance with various embodiments of the present disclosure, the methods described herein may be implemented by software programs executable by a computer system. Further, in an exemplary, non-limiting embodiment, implementations can include distributed processing, component/object distributed processing, and parallel processing. Alternatively, virtual computer system processing can be constructed to implement one or more of the methods or functionality as described herein.
  • Although the present specification describes components and functions that may be implemented in particular embodiments with reference to particular standards and protocols, the invention is not limited to such standards and protocols. For example, standards for Internet and other packet switched network transmission (e.g., TCP/IP, UDP/IP, HTML, HTTP) represent examples of the state of the art. Such standards are periodically superseded by faster or more efficient equivalents having essentially the same functions. Accordingly, replacement standards and protocols having the same or similar functions as those disclosed herein are considered equivalents thereof.
  • The illustrations of the embodiments described herein are intended to provide a general understanding of the structure of the various embodiments. The illustrations are not intended to serve as a complete description of all of the elements and features of apparatus and systems that utilize the structures or methods described herein. Many other embodiments may be apparent to those of skill in the art upon reviewing the disclosure. Other embodiments may be utilized and derived from the disclosure, such that structural and logical substitutions and changes may be made without departing from the scope of the disclosure. Additionally, the illustrations are merely representational and may not be drawn to scale. Certain proportions within the illustrations may be exaggerated, while other proportions may be minimized. Accordingly, the disclosure and the figures are to be regarded as illustrative rather than restrictive.
  • One or more embodiments of the disclosure may be referred to herein, individually and/or collectively, by the term “invention” merely for convenience and without intending to voluntarily limit the scope of this application to any particular invention or inventive concept. Moreover, although specific embodiments have been illustrated and described herein, it should be appreciated that any subsequent arrangement designed to achieve the same or similar purpose may be substituted for the specific embodiments shown. This disclosure is intended to cover any and all subsequent adaptations or variations of various embodiments. Combinations of the above embodiments, and other embodiments not specifically described herein, will be apparent to those of skill in the art upon reviewing the description.
  • The Abstract of the Disclosure is provided to comply with 37 C.F.R. § 1.72(b) and is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims. In addition, in the foregoing Detailed Description, various features may be grouped together or described in a single embodiment for the purpose of streamlining the disclosure. This disclosure is not to be interpreted as reflecting an intention that the claimed embodiments require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter may be directed to less than all of the features of any of the disclosed embodiments. Thus, the following claims are incorporated into the Detailed Description, with each claim standing on its own as defining separately claimed subject matter.
  • The above disclosed subject matter is to be considered illustrative, and not restrictive, and the appended claims are intended to cover all such modifications, enhancements, and other embodiments, which fall within the true spirit and scope of the present invention. Thus, to the maximum extent allowed by law, the scope of the present invention is to be determined by the broadest permissible interpretation of the following claims and their equivalents, and shall not be restricted or limited by the foregoing detailed description. While various embodiments of the invention have been described, it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible within the scope of the invention. Accordingly, the invention is not to be restricted except in light of the attached claims and their equivalents.

Claims (23)

1. A method for matching queries with keywords comprising:
receiving a non-native language user query;
gathering a candidate set of the keywords to be compared with the user query;
converting the user query to a form for comparison with the keywords, wherein the keywords are converted to the form for comparison;
comparing the converted user query with each of the keywords, wherein a similarity score is established for each keyword to determine similarity with the user query; and
matching at least one keyword from the keywords with the user query based on the similarity score.
2. The method according to claim 1 wherein the non-native language user query comprises a Chinese related user query, wherein the Chinese related user query comprises at least one Chinese character.
3. The method according to claim 1 wherein each of the keywords are associated with at least one advertisement.
4. The method according to claim 3 further comprising:
providing the at least one advertisement that is associated with the matched at least one keyword.
5. The method according to claim 1 wherein the converting of the user query comprises at least one of adding, removing, or substituting at least one character from the user query.
6. The method according to claim 1 wherein the similarity score for each of the keywords is based on an edit distance with the converted user query.
7. In a computer readable storage medium having stored therein data representing instructions executable by a programmed processor for comparing a Chinese query with keywords, the storage medium comprising instructions operative for:
receiving the Chinese query;
selecting a set of the keywords for comparing with the Chinese query;
converting the Chinese query into at least one different form;
converting the set of keywords into the at least one different form;
determining at least one comparison between the Chinese query and the set of keywords, wherein the at least one comparison comprises a similarity score between the Chinese query and the set of keywords; and
identifying one of the set of keywords based on the similarity score.
8. The storage medium according to claim 7 wherein the at least one comparison comprises calculating an edit distance between the converted Chinese query and each of the converted set of keywords.
9. The storage medium according to claim 8 wherein the identified one of the set of keywords has a closest edit distance with the converted Chinese query.
10. The storage medium according to claim 7 wherein the at least one different form comprises a conversion of at least one character to at least one of a Chinese soundex form, a zhuyin form, a radicals form, a pinyin without tone form, a pinyin with tone form, or a Chinese utf8 form.
11. A method for determining similarity between queries comprising:
selecting at least two queries from a set of queries according to one language system;
converting each of the at least two queries into a different format, wherein the conversion comprises a transformation of certain characters in the at least two queries;
determining at least one comparison feature for each of the at least two queries; and
comparing the at least two queries based on the at least one comparison feature to determine a similarity between the at least two queries based on each of the at least one comparison feature.
12. The method according to claim 11 wherein the language system comprises Chinese, and the set of queries are Chinese related queries.
13. The method according to claim 12 wherein the transformation of certain characters in the at least two queries comprises changing at least one character into at least one of a Chinese soundex form, a zhuyin form, a radicals form, a pinyin without tone form, a pinyin with tone form, or a Chinese utf8 form.
14. The method according to claim 11 wherein the at least one comparison feature comprises at least one of comparing an edit distance, comparing a character level prefix overlap, or comparing a character level suffix overlap.
15. The method according to claim 14 wherein the comparing the edit distance further comprises comparing an edit distance by characters or an edit distance by words.
16. The method according to claim 11 wherein the at least two queries are converted into a plurality of different formats, wherein the comparing further comprises comparing the at least two queries in each of the plurality of different formats.
17. A method for comparing queries comprising:
receiving at least two queries, wherein each of the at least two queries comprise at least one Chinese representation;
converting the at least two queries into at least one common format;
calculating an edit distance between the converted at least two queries for each of the at least one common format; and
recording the edit distances between each of the converted at least two queries.
18. The method according to claim 17 wherein the at least one common format comprises an addition, subtraction, or substitution of at least one character.
19. The method according to claim 17 wherein the at least one common format comprises a conversion of at least one character into at least one of a Chinese soundex form, a zhuyin form, a radicals form, a pinyin without tone form, a pinyin with tone form, or a Chinese utf8 form.
20. A system for measuring related queries comprising:
a search engine operative to receive a user search query;
an ad server coupled with the search engine and operative to provide an advertisement for display in response to the received user search query, wherein the ad server includes a plurality of search keywords, each of which are associated with at least one advertisement;
a search log database coupled with the search engine and operative to store search queries including the plurality of search keywords; and
a language analyzer coupled with the search engine that comprises:
a receiver operative to receive the user search query;
a converter coupled with the receiver and operative to convert the user search query into a different form;
a comparator coupled with the converter and operative to compare the converted search query with a candidate set of the plurality of search keywords; and
a calculator coupled with the comparator and operative to calculate a similarity score for each member of the candidate set based on the comparison with the converted search query;
wherein the associated at least one advertisement that is associated with the member of the candidate set with a closest similarity score is provided for display in response to the received search query.
21. The system according to claim 20 wherein the user search query is Chinese related and the converter is operative to change the Chinese related user search query into the different form by at least one of adding, deleting or substituting at least one of the characters of the Chinese related user search query.
22. The system according to claim 20 wherein the calculation of the similarity score comprises a computation of an edit distance between the converted user search query and the member of the candidate set.
23. The system according to claim 20 wherein the converter is operative to convert the candidate set of the plurality of search keywords into the different form for comparison with the converted user search query.
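
Illustrative sketch (not part of the claims): the steps recited in claims 1, 6, 14 and 15 (convert the query and the keywords to a common form, compute an edit distance and prefix/suffix overlaps, and select the keyword with the closest score) might be sketched in Python as follows. The conversion table `TOY_PINYIN`, the function names, and the choice of plain Levenshtein distance as the similarity score are assumptions made for illustration only; the claims do not specify a particular table, scoring formula, or tie-breaking rule.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy table mapping a few characters to pinyin without tone.
# A real system would need a full pronunciation dictionary; these entries
# exist only to make the example runnable.
TOY_PINYIN: Dict[str, str] = {
    "北": "bei", "京": "jing", "背": "bei", "景": "jing", "南": "nan",
}


def to_pinyin_no_tone(text: str) -> str:
    """Convert each known character to toy pinyin; leave others unchanged."""
    return "".join(TOY_PINYIN.get(ch, ch) for ch in text)


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance over any two sequences.

    Passing strings gives an edit distance by characters; passing lists
    of words gives an edit distance by words.
    """
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def prefix_overlap(a: str, b: str) -> int:
    """Number of leading characters the two strings share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def suffix_overlap(a: str, b: str) -> int:
    """Number of trailing characters the two strings share."""
    return prefix_overlap(a[::-1], b[::-1])


def match_keyword(query: str, keywords: List[str]) -> Tuple[str, int]:
    """Return the keyword whose converted form is closest to the query.

    Assumption: the similarity score is the raw edit distance, and the
    first keyword with the smallest distance wins.
    """
    converted_query = to_pinyin_no_tone(query)
    scored = [(k, edit_distance(converted_query, to_pinyin_no_tone(k)))
              for k in keywords]
    return min(scored, key=lambda pair: pair[1])
```

Under this toy table, the query 背景 and the keyword 北京 both convert to "beijing", so `match_keyword("背景", ["南京", "北京"])` selects 北京 at distance 0, even though the two strings share no characters in their original form.
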
US11/948,374 2006-02-28 2007-11-30 Identifying and measuring related queries Abandoned US20080077588A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US11/948,374 US20080077588A1 (en) 2006-02-28 2007-11-30 Identifying and measuring related queries

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US11/365,315 US7689554B2 (en) 2006-02-28 2006-02-28 System and method for identifying related queries for languages with multiple writing systems
US11/948,374 US20080077588A1 (en) 2006-02-28 2007-11-30 Identifying and measuring related queries

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US11/365,315 Continuation-In-Part US7689554B2 (en) 2006-02-28 2006-02-28 System and method for identifying related queries for languages with multiple writing systems

Publications (1)

Publication Number Publication Date
US20080077588A1 (en) 2008-03-27

Family

ID=38445252

Family Applications (2)

Application Number Title Priority Date Filing Date
US11/365,315 Expired - Fee Related US7689554B2 (en) 2006-02-28 2006-02-28 System and method for identifying related queries for languages with multiple writing systems
US11/948,374 Abandoned US20080077588A1 (en) 2006-02-28 2007-11-30 Identifying and measuring related queries

Country Status (7)

Country Link
US (2) US7689554B2 (en)
EP (2) EP3301591A1 (en)
JP (1) JP2009528636A (en)
KR (1) KR101098703B1 (en)
CN (2) CN102750323B (en)
HK (2) HK1130912A1 (en)
WO (1) WO2007101194A2 (en)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080312911A1 (en) * 2007-06-14 2008-12-18 Po Zhang Dictionary word and phrase determination
US20090248634A1 (en) * 2008-03-31 2009-10-01 International Business Machines Corporation Method and system for a metadata driven query
US20090299974A1 (en) * 2008-05-29 2009-12-03 Fujitsu Limited Character sequence map generating apparatus, information searching apparatus, character sequence map generating method, information searching method, and computer product
US20100106704A1 (en) * 2008-10-29 2010-04-29 Yahoo! Inc. Cross-lingual query classification
US20100217781A1 (en) * 2008-12-30 2010-08-26 Thales Optimized method and system for managing proper names to optimize the management and interrogation of databases
US20110295897A1 (en) * 2010-06-01 2011-12-01 Microsoft Corporation Query correction probability based on query-correction pairs
US20120047151A1 (en) * 2010-08-19 2012-02-23 Yahoo! Inc. Method and system for providing contents based on past queries
WO2012074704A2 (en) * 2010-11-29 2012-06-07 Microsoft Corporation Display of search ads in local language
WO2012148427A1 (en) * 2011-04-29 2012-11-01 Hewlett-Packard Development Company, L.P. Systems and methods for in-memory processing of events
US20130054225A1 (en) * 2010-06-23 2013-02-28 Business Objects Software Limited Searching and matching of data
US8417718B1 (en) * 2011-07-11 2013-04-09 Google Inc. Generating word completions based on shared suffix analysis
US20130090916A1 (en) * 2011-10-05 2013-04-11 Daniel M. Wang System and Method for Detecting and Correcting Mismatched Chinese Character
US20150213142A1 (en) * 2007-12-03 2015-07-30 Ebay Inc. Live search chat room
US20190295012A1 (en) * 2018-03-23 2019-09-26 International Business Machines Corporation Predicting employee performance metrics
US20210240751A1 (en) * 2018-12-26 2021-08-05 Paypal, Inc. Machine learning approach to cross-language translation and search
US11170183B2 (en) * 2018-09-17 2021-11-09 International Business Machines Corporation Language entity identification
US20230185857A1 (en) * 2015-12-08 2023-06-15 Yahoo Assets Llc Method and system for providing context based query suggestions
RU2813239C1 (en) * 2022-12-21 2024-02-08 Акционерное общество "Лаборатория Касперского" Method for filtering events for transmission to remote device

Families Citing this family (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7821503B2 (en) 2003-04-09 2010-10-26 Tegic Communications, Inc. Touch screen and graphical user interface
US7750891B2 (en) 2003-04-09 2010-07-06 Tegic Communications, Inc. Selective input system based on tracking of motion parameters of an input device
US7286115B2 (en) 2000-05-26 2007-10-23 Tegic Communications, Inc. Directional input system with automatic correction
US7030863B2 (en) * 2000-05-26 2006-04-18 America Online, Incorporated Virtual keyboard system with automatic correction
US7689554B2 (en) * 2006-02-28 2010-03-30 Yahoo! Inc. System and method for identifying related queries for languages with multiple writing systems
US8762358B2 (en) * 2006-04-19 2014-06-24 Google Inc. Query language determination using query terms and interface language
US8442965B2 (en) 2006-04-19 2013-05-14 Google Inc. Query language identification
US7689548B2 (en) * 2006-09-22 2010-03-30 Microsoft Corporation Recommending keywords based on bidding patterns
US7925498B1 (en) 2006-12-29 2011-04-12 Google Inc. Identifying a synonym with N-gram agreement for a query phrase
US8201087B2 (en) * 2007-02-01 2012-06-12 Tegic Communications, Inc. Spell-check for a keyboard system with automatic correction
US8225203B2 (en) 2007-02-01 2012-07-17 Nuance Communications, Inc. Spell-check for a keyboard system with automatic correction
US20080250008A1 (en) * 2007-04-04 2008-10-09 Microsoft Corporation Query Specialization
US8290921B2 (en) * 2007-06-28 2012-10-16 Microsoft Corporation Identification of similar queries based on overall and partial similarity of time series
US8090709B2 (en) * 2007-06-28 2012-01-03 Microsoft Corporation Representing queries and determining similarity based on an ARIMA model
US7831588B2 (en) * 2008-02-05 2010-11-09 Yahoo! Inc. Context-sensitive query expansion
US8171021B2 (en) * 2008-06-23 2012-05-01 Google Inc. Query identification and association
US8745051B2 (en) * 2008-07-03 2014-06-03 Google Inc. Resource locator suggestions from input character sequence
US9053197B2 (en) * 2008-11-26 2015-06-09 Red Hat, Inc. Suggesting websites
CN101464897A (en) 2009-01-12 2009-06-24 阿里巴巴集团控股有限公司 Word matching and information query method and device
EP2328366A1 (en) * 2009-11-20 2011-06-01 Alcatel Lucent Method and system for conducting surveys
US20110153423A1 (en) * 2010-06-21 2011-06-23 Jon Elvekrog Method and system for creating user based summaries for content distribution
US20110153414A1 (en) * 2009-12-23 2011-06-23 Jon Elvekrog Method and system for dynamic advertising based on user actions
US8751305B2 (en) * 2010-05-24 2014-06-10 140 Proof, Inc. Targeting users based on persona data
CN102567408B (en) 2010-12-31 2014-06-04 阿里巴巴集团控股有限公司 Method and device for recommending search keyword
KR101461062B1 (en) * 2011-10-24 2014-11-17 네이버 주식회사 System and method for recommendding japanese language automatically using tranformatiom of romaji
US8756241B1 (en) * 2012-08-06 2014-06-17 Google Inc. Determining rewrite similarity scores
US9971837B2 (en) * 2013-12-16 2018-05-15 Excalibur Ip, Llc Contextual based search suggestion
US9690860B2 (en) 2014-06-30 2017-06-27 Yahoo! Inc. Recommended query formulation
CN104572836A (en) * 2014-12-10 2015-04-29 百度在线网络技术(北京)有限公司 Method and device for confirming comprehensive relevancy of candidate inquiry sequence
US10169414B2 (en) 2016-04-26 2019-01-01 International Business Machines Corporation Character matching in text processing
CN110162593B (en) * 2018-11-29 2023-03-21 腾讯科技(深圳)有限公司 Search result processing and similarity model training method and device
US11194850B2 (en) * 2018-12-14 2021-12-07 Business Objects Software Ltd. Natural language query system
CN110008237B (en) * 2019-01-14 2023-05-02 创新先进技术有限公司 Similar query recognition method and device
CN111629020A (en) * 2019-12-03 2020-09-04 蘑菇车联信息科技有限公司 Remote input method, device, PC (personal computer) terminal, android device and system

Citations (40)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020052871A1 (en) * 2000-11-02 2002-05-02 Simpleact Incorporated Chinese natural language query system and method
US20020165717A1 (en) * 2001-04-06 2002-11-07 Solmer Robert P. Efficient method for information extraction
US20030069880A1 (en) * 2001-09-24 2003-04-10 Ask Jeeves, Inc. Natural language query processing
US20030144994A1 (en) * 2001-10-12 2003-07-31 Ji-Rong Wen Clustering web queries
US20040249801A1 (en) * 2003-04-04 2004-12-09 Yahoo! Universal search interface systems and methods
US20050038802A1 (en) * 2000-12-21 2005-02-17 Eric White Method and system for platform-independent file system interaction
US6876997B1 (en) * 2000-05-22 2005-04-05 Overture Services, Inc. Method and apparatus for indentifying related searches in a database search system
US20050080795A1 (en) * 2003-10-09 2005-04-14 Yahoo! Inc. Systems and methods for search processing using superunits
US20050102259A1 (en) * 2003-11-12 2005-05-12 Yahoo! Inc. Systems and methods for search query processing using trend analysis
US20050228780A1 (en) * 2003-04-04 2005-10-13 Yahoo! Inc. Search system using search subdomain and hints to subdomains in search query statements and sponsored results on a subdomain-by-subdomain basis
US6999932B1 (en) * 2000-10-10 2006-02-14 Intel Corporation Language independent voice-based search system
US7051119B2 (en) * 2001-07-12 2006-05-23 Yahoo! Inc. Method and system for enabling a script on a first computer to communicate and exchange data with a script on a second computer over a network
US7051023B2 (en) * 2003-04-04 2006-05-23 Yahoo! Inc. Systems and methods for generating concept units from search queries
US7051014B2 (en) * 2003-06-18 2006-05-23 Microsoft Corporation Utilizing information redundancy to improve text searches
US20060122994A1 (en) * 2004-12-06 2006-06-08 Yahoo! Inc. Automatic generation of taxonomies for categorizing queries and search query processing using taxonomies
US20060122979A1 (en) * 2004-12-06 2006-06-08 Shyam Kapur Search processing with automatic categorization of queries
US20060161520A1 (en) * 2005-01-14 2006-07-20 Microsoft Corporation System and method for generating alternative search terms
US20060167896A1 (en) * 2004-12-06 2006-07-27 Shyam Kapur Systems and methods for managing and using multiple concept networks for assisted search processing
US20060206474A1 (en) * 2005-03-10 2006-09-14 Yahoo!, Inc. System for modifying queries before presentation to a sponsored search generator or other matching system where modifications improve coverage without a corresponding reduction in relevance
US20060206476A1 (en) * 2005-03-10 2006-09-14 Yahoo!, Inc. Reranking and increasing the relevance of the results of Internet searches
US20070020705A1 (en) * 2003-10-21 2007-01-25 Shigehiko Mizutani Method for prognostic evaluation of carcinoma using anti-p-lap antibody
US20070038621A1 (en) * 2005-08-10 2007-02-15 Tina Weyand System and method for determining alternate search queries
US20070038602A1 (en) * 2005-08-10 2007-02-15 Tina Weyand Alternative search query processing in a term bidding system
US20070203894A1 (en) * 2006-02-28 2007-08-30 Rosie Jones System and method for identifying related queries for languages with multiple writing systems
US20070208709A1 (en) * 2001-10-03 2007-09-06 Malibu Engineering & Software Ltd. Method and query application tool for searching hierarchical databases
US20070208697A1 (en) * 2001-06-18 2007-09-06 Pavitra Subramaniam System and method to enable searching across multiple databases and files using a single search
US20070208701A1 (en) * 2006-03-01 2007-09-06 Microsoft Corporation Comparative web search
US20070208713A1 (en) * 2006-03-01 2007-09-06 Oracle International Corporation Auto Generation of Suggested Links in a Search System
US20070208719A1 (en) * 2004-03-18 2007-09-06 Bao Tran Systems and methods for analyzing semantic documents over a network
US20070208704A1 (en) * 2006-03-06 2007-09-06 Stephen Ives Packaged mobile search results
US20070208698A1 (en) * 2002-06-07 2007-09-06 Dougal Brindley Avoiding duplicate service requests
US20070208703A1 (en) * 2006-03-03 2007-09-06 Microsoft Corporation Web forum crawler
US20070208714A1 (en) * 2006-03-01 2007-09-06 Oracle International Corporation Method for Suggesting Web Links and Alternate Terms for Matching Search Queries
US20070208699A1 (en) * 2004-09-07 2007-09-06 Shigeki Uetabira Information search provision apparatus and information search provision system
US20070208720A1 (en) * 2000-12-12 2007-09-06 Home Box Office, Inc. Digital asset data type definitions
US20070208711A1 (en) * 2005-12-21 2007-09-06 Rhoads Geoffrey B Rules Driven Pan ID Metadata Routing System and Network
US20070208706A1 (en) * 2006-03-06 2007-09-06 Anand Madhavan Vertical search expansion, disambiguation, and optimization of search queries
US20070208702A1 (en) * 2006-03-02 2007-09-06 Morris Robert P Method and system for delivering published information associated with a tuple using a pub/sub protocol
US20070208700A1 (en) * 2005-01-19 2007-09-06 Konica Minolta Holdings, Inc. Update detecting apparatus
US20070214118A1 (en) * 2005-09-27 2007-09-13 Schoen Michael A Delivery of internet ads

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4833610A (en) * 1986-12-16 1989-05-23 International Business Machines Corporation Morphological/phonetic method for ranking word similarities
JP2809341B2 (en) * 1994-11-18 1998-10-08 松下電器産業株式会社 Information summarizing method, information summarizing device, weighting method, and teletext receiving device.
US6006221A (en) * 1995-08-16 1999-12-21 Syracuse University Multilingual document retrieval system and method using semantic vector matching
US5778361A (en) * 1995-09-29 1998-07-07 Microsoft Corporation Method and system for fast indexing and searching of text in compound-word languages
ATE243869T1 (en) * 1998-03-03 2003-07-15 Amazon Com Inc IDENTIFICATION OF THE MOST RELEVANT ANSWERS TO A CURRENT SEARCH QUERY BASED ON ANSWERS ALREADY SELECTED FOR SIMILAR QUERIES
US6493709B1 (en) * 1998-07-31 2002-12-10 The Regents Of The University Of California Method and apparatus for digitally shredding similar documents within large document sets in a data processing environment
JP2001337980A (en) * 2000-05-29 2001-12-07 Sony Corp Electronic program guide retrieving method and electronic program guide retrieving device
US8706747B2 (en) * 2000-07-06 2014-04-22 Google Inc. Systems and methods for searching using queries written in a different character-set and/or language from the target pages
JP2003296443A (en) * 2002-03-29 2003-10-17 Konica Corp Medical image pick-up device, display control method, and program
JP2004280259A (en) * 2003-03-13 2004-10-07 National Institute Of Information & Communication Technology Search device
US6947930B2 (en) * 2003-03-21 2005-09-20 Overture Services, Inc. Systems and methods for interactive search query refinement
US20040260681A1 (en) * 2003-06-19 2004-12-23 Dvorak Joseph L. Method and system for selectively retrieving text strings
EP1692626A4 (en) * 2003-09-17 2008-11-19 Ibm Identifying related names
WO2005124599A2 (en) * 2004-06-12 2005-12-29 Getty Images, Inc. Content search in complex language, such as japanese
JP4936650B2 (en) * 2004-07-26 2012-05-23 ヤフー株式会社 Similar word search device, method thereof, program thereof, and information search device
US20060106769A1 (en) * 2004-11-12 2006-05-18 Gibbs Kevin A Method and system for autocompletion for languages having ideographs and phonetic characters

Patent Citations (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6876997B1 (en) * 2000-05-22 2005-04-05 Overture Services, Inc. Method and apparatus for indentifying related searches in a database search system
US6999932B1 (en) * 2000-10-10 2006-02-14 Intel Corporation Language independent voice-based search system
US20020052871A1 (en) * 2000-11-02 2002-05-02 Simpleact Incorporated Chinese natural language query system and method
US20070208720A1 (en) * 2000-12-12 2007-09-06 Home Box Office, Inc. Digital asset data type definitions
US20050038802A1 (en) * 2000-12-21 2005-02-17 Eric White Method and system for platform-independent file system interaction
US20020165717A1 (en) * 2001-04-06 2002-11-07 Solmer Robert P. Efficient method for information extraction
US20070208697A1 (en) * 2001-06-18 2007-09-06 Pavitra Subramaniam System and method to enable searching across multiple databases and files using a single search
US7051119B2 (en) * 2001-07-12 2006-05-23 Yahoo! Inc. Method and system for enabling a script on a first computer to communicate and exchange data with a script on a second computer over a network
US20030069880A1 (en) * 2001-09-24 2003-04-10 Ask Jeeves, Inc. Natural language query processing
US20070208709A1 (en) * 2001-10-03 2007-09-06 Malibu Engineering & Software Ltd. Method and query application tool for searching hierarchical databases
US20030144994A1 (en) * 2001-10-12 2003-07-31 Ji-Rong Wen Clustering web queries
US20070208698A1 (en) * 2002-06-07 2007-09-06 Dougal Brindley Avoiding duplicate service requests
US20040249801A1 (en) * 2003-04-04 2004-12-09 Yahoo! Universal search interface systems and methods
US7051023B2 (en) * 2003-04-04 2006-05-23 Yahoo! Inc. Systems and methods for generating concept units from search queries
US20050228780A1 (en) * 2003-04-04 2005-10-13 Yahoo! Inc. Search system using search subdomain and hints to subdomains in search query statements and sponsored results on a subdomain-by-subdomain basis
US7051014B2 (en) * 2003-06-18 2006-05-23 Microsoft Corporation Utilizing information redundancy to improve text searches
US20050080795A1 (en) * 2003-10-09 2005-04-14 Yahoo! Inc. Systems and methods for search processing using superunits
US20070020705A1 (en) * 2003-10-21 2007-01-25 Shigehiko Mizutani Method for prognostic evaluation of carcinoma using anti-p-lap antibody
US20050102259A1 (en) * 2003-11-12 2005-05-12 Yahoo! Inc. Systems and methods for search query processing using trend analysis
US7240049B2 (en) * 2003-11-12 2007-07-03 Yahoo! Inc. Systems and methods for search query processing using trend analysis
US20070208719A1 (en) * 2004-03-18 2007-09-06 Bao Tran Systems and methods for analyzing semantic documents over a network
US20070208699A1 (en) * 2004-09-07 2007-09-06 Shigeki Uetabira Information search provision apparatus and information search provision system
US20060122979A1 (en) * 2004-12-06 2006-06-08 Shyam Kapur Search processing with automatic categorization of queries
US20060167896A1 (en) * 2004-12-06 2006-07-27 Shyam Kapur Systems and methods for managing and using multiple concept networks for assisted search processing
US20060122994A1 (en) * 2004-12-06 2006-06-08 Yahoo! Inc. Automatic generation of taxonomies for categorizing queries and search query processing using taxonomies
US20060161520A1 (en) * 2005-01-14 2006-07-20 Microsoft Corporation System and method for generating alternative search terms
US20070208700A1 (en) * 2005-01-19 2007-09-06 Konica Minolta Holdings, Inc. Update detecting apparatus
US20060206474A1 (en) * 2005-03-10 2006-09-14 Yahoo!, Inc. System for modifying queries before presentation to a sponsored search generator or other matching system where modifications improve coverage without a corresponding reduction in relevance
US20060206476A1 (en) * 2005-03-10 2006-09-14 Yahoo!, Inc. Reranking and increasing the relevance of the results of Internet searches
US20070038602A1 (en) * 2005-08-10 2007-02-15 Tina Weyand Alternative search query processing in a term bidding system
US20070038621A1 (en) * 2005-08-10 2007-02-15 Tina Weyand System and method for determining alternate search queries
US20070214118A1 (en) * 2005-09-27 2007-09-13 Schoen Michael A Delivery of internet ads
US20070208711A1 (en) * 2005-12-21 2007-09-06 Rhoads Geoffrey B Rules Driven Pan ID Metadata Routing System and Network
US20070203894A1 (en) * 2006-02-28 2007-08-30 Rosie Jones System and method for identifying related queries for languages with multiple writing systems
US20070208713A1 (en) * 2006-03-01 2007-09-06 Oracle International Corporation Auto Generation of Suggested Links in a Search System
US20070208714A1 (en) * 2006-03-01 2007-09-06 Oracle International Corporation Method for Suggesting Web Links and Alternate Terms for Matching Search Queries
US20070208701A1 (en) * 2006-03-01 2007-09-06 Microsoft Corporation Comparative web search
US20070208702A1 (en) * 2006-03-02 2007-09-06 Morris Robert P Method and system for delivering published information associated with a tuple using a pub/sub protocol
US20070208703A1 (en) * 2006-03-03 2007-09-06 Microsoft Corporation Web forum crawler
US20070208706A1 (en) * 2006-03-06 2007-09-06 Anand Madhavan Vertical search expansion, disambiguation, and optimization of search queries
US20070208704A1 (en) * 2006-03-06 2007-09-06 Stephen Ives Packaged mobile search results

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110282903A1 (en) * 2007-06-14 2011-11-17 Google Inc. Dictionary Word and Phrase Determination
US8412517B2 (en) * 2007-06-14 2013-04-02 Google Inc. Dictionary word and phrase determination
US20080312911A1 (en) * 2007-06-14 2008-12-18 Po Zhang Dictionary word and phrase determination
US20150213142A1 (en) * 2007-12-03 2015-07-30 Ebay Inc. Live search chat room
US8150838B2 (en) * 2008-03-31 2012-04-03 International Business Machines Corporation Method and system for a metadata driven query
US20090248634A1 (en) * 2008-03-31 2009-10-01 International Business Machines Corporation Method and system for a metadata driven query
US20090299974A1 (en) * 2008-05-29 2009-12-03 Fujitsu Limited Character sequence map generating apparatus, information searching apparatus, character sequence map generating method, information searching method, and computer product
US20100106704A1 (en) * 2008-10-29 2010-04-29 Yahoo! Inc. Cross-lingual query classification
US8117237B2 (en) * 2008-12-30 2012-02-14 Thales Optimized method and system for managing proper names to optimize the management and interrogation of databases
US20100217781A1 (en) * 2008-12-30 2010-08-26 Thales Optimized method and system for managing proper names to optimize the management and interrogation of databases
US20110295897A1 (en) * 2010-06-01 2011-12-01 Microsoft Corporation Query correction probability based on query-correction pairs
US8745077B2 (en) * 2010-06-23 2014-06-03 Business Objects Software Limited Searching and matching of data
US20130054225A1 (en) * 2010-06-23 2013-02-28 Business Objects Software Limited Searching and matching of data
US8442987B2 (en) * 2010-08-19 2013-05-14 Yahoo! Inc. Method and system for providing contents based on past queries
US20120047151A1 (en) * 2010-08-19 2012-02-23 Yahoo! Inc. Method and system for providing contents based on past queries
WO2012074704A2 (en) * 2010-11-29 2012-06-07 Microsoft Corporation Display of search ads in local language
WO2012074704A3 (en) * 2010-11-29 2012-07-19 Microsoft Corporation Display of search ads in local language
CN103502990A (en) * 2011-04-29 2014-01-08 惠普发展公司,有限责任合伙企业 Systems and methods for in-memory processing of events
US9355148B2 (en) 2011-04-29 2016-05-31 Hewlett Packard Enterprise Development Lp Systems and methods for in-memory processing of events
WO2012148427A1 (en) * 2011-04-29 2012-11-01 Hewlett-Packard Development Company, L.P. Systems and methods for in-memory processing of events
US8417718B1 (en) * 2011-07-11 2013-04-09 Google Inc. Generating word completions based on shared suffix analysis
US8886662B1 (en) 2011-07-11 2014-11-11 Google Inc. Generating word completions based on shared suffix analysis
US8725497B2 (en) * 2011-10-05 2014-05-13 Daniel M. Wang System and method for detecting and correcting mismatched Chinese character
US20130090916A1 (en) * 2011-10-05 2013-04-11 Daniel M. Wang System and Method for Detecting and Correcting Mismatched Chinese Character
US20230185857A1 (en) * 2015-12-08 2023-06-15 Yahoo Assets Llc Method and system for providing context based query suggestions
US20190295012A1 (en) * 2018-03-23 2019-09-26 International Business Machines Corporation Predicting employee performance metrics
US10891578B2 (en) * 2018-03-23 2021-01-12 International Business Machines Corporation Predicting employee performance metrics
US11170183B2 (en) * 2018-09-17 2021-11-09 International Business Machines Corporation Language entity identification
US20210240751A1 (en) * 2018-12-26 2021-08-05 Paypal, Inc. Machine learning approach to cross-language translation and search
US11914626B2 (en) * 2018-12-26 2024-02-27 Paypal, Inc. Machine learning approach to cross-language translation and search
RU2813239C1 (en) * 2022-12-21 2024-02-08 Акционерное общество "Лаборатория Касперского" Method for filtering events for transmission to remote device

Also Published As

Publication number Publication date
CN101390097B (en) 2012-07-04
CN102750323B (en) 2016-05-11
JP2009528636A (en) 2009-08-06
HK1176711A1 (en) 2013-08-02
WO2007101194A2 (en) 2007-09-07
EP3301591A1 (en) 2018-04-04
US7689554B2 (en) 2010-03-30
WO2007101194A3 (en) 2008-03-13
US20070203894A1 (en) 2007-08-30
CN101390097A (en) 2009-03-18
CN102750323A (en) 2012-10-24
KR101098703B1 (en) 2011-12-23
EP1929415A2 (en) 2008-06-11
HK1130912A1 (en) 2010-01-08
EP1929415A4 (en) 2011-06-15
KR20080114764A (en) 2008-12-31

Similar Documents

Publication Publication Date Title
US20080077588A1 (en) Identifying and measuring related queries
US10325033B2 (en) Determination of content score
TWI471737B (en) System and method for trail identification with search results
US8676827B2 (en) Rare query expansion by web feature matching
US8346754B2 (en) Generating succinct titles for web URLs
JP5281405B2 (en) Selecting high-quality reviews for display
US9542476B1 (en) Refining search queries
JP5990178B2 (en) System and method for keyword extraction
US8620745B2 (en) Selecting advertisements for placement on related web pages
US7877389B2 (en) Segmentation of search topics in query logs
US20120323968A1 (en) Learning Discriminative Projections for Text Similarity Measures
US9251206B2 (en) Generalized edit distance for queries
AU2019366858B2 (en) Method and system for decoding user intent from natural language queries
US9798820B1 (en) Classification of keywords
JP5507469B2 (en) Providing content using stored query information
US20090216710A1 (en) Optimizing query rewrites for keyword-based advertising
Chang et al. Integrating a semantic-based retrieval agent into case-based reasoning systems: A case study of an online bookstore
AU2018250372B2 (en) Method to construct content based on a content repository
US20110131093A1 (en) System and method for optimizing selection of online advertisements
US20090327877A1 (en) System and method for disambiguating text labeling content objects
JP5427694B2 (en) Related content presentation apparatus and program
US20090248627A1 (en) System and method for query substitution for sponsored search
CN108319586B (en) Information extraction rule generation and semantic analysis method and device
JP4883644B2 (en) RECOMMENDATION DEVICE, RECOMMENDATION SYSTEM, RECOMMENDATION DEVICE CONTROL METHOD, AND RECOMMENDATION SYSTEM CONTROL METHOD
WO2014169857A1 (en) Data processing device, data processing method and electronic equipment

Legal Events

Date Code Title Description
AS Assignment

Owner name: YAHOO! INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:ZHANG, WEI VIVIAN;JONES, ROSIE;REY, BENJAMIN;REEL/FRAME:020257/0539;SIGNING DATES FROM 20071127 TO 20071129

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION

AS Assignment

Owner name: YAHOO HOLDINGS, INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:YAHOO! INC.;REEL/FRAME:042963/0211

Effective date: 20170613

AS Assignment

Owner name: OATH INC., NEW YORK

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:YAHOO HOLDINGS, INC.;REEL/FRAME:045240/0310

Effective date: 20171231