METHOD AND APPARATUS FOR ACCESSING WEB PAGES
TECHNICAL FIELD
The present invention relates in general to a system and method for providing user-friendly interfaces for accessing information sources over a communication network and, in particular, to a system and method for accessing web pages over the Internet.
BACKGROUND
Public data communication networks, such as the Internet, have revolutionized access to information by providing a powerful new system for disseminating information, such as news, product information, advertisements, images, samples of music, and video clips, to distantly located persons or entities.
In general, the Internet includes a number of interconnected computers, usually called server computers or servers, which store such information. Servers receive requests for such information from distantly located computers, typically operated by users, which are usually called client computers or clients. The servers respond by transmitting the requested information to the client computer via the Internet. The most popular current method for these request-response interactions between a client computer and a server is a protocol called the Hyper Text Transfer Protocol (HTTP). This protocol is typically executed over a transport layer protocol such as the Transmission Control Protocol/Internet Protocol (TCP/IP), which establishes and maintains a connection between the client computer and the server, which, in turn, are interconnected with numerous other servers. To understand the present invention, it is helpful to understand the way the client interacts with the server. The client includes a computer program called a web browser, or simply a "browser," that provides an interactive display of the information retrieved from the resources of the Internet. A user operating a client launches the browser on the client. Thereafter, the user typically specifies a network address from which desired information is to be retrieved, e.g., a web page on the World Wide Web. This address is generally expressed in the form of a Uniform Resource Locator (URL), which contains a domain name and subdirectory information for a server from which the information is retrieved.
When the user identifies a desired web page, the browser transmits a request for that page to an appropriate server via the Internet. In doing so, the client establishes a TCP/IP
connection with its Internet Service Provider (ISP) and, through that provider, requests a piece of information from the appropriate server in the form of an HTTP message packet. In response, the client receives a response message, which is typically a packet of data in the Hyper Text Markup Language (HTML); this packet is also referred to as a web page. The browser then displays the page for the user on a display screen attached to the client. Persons of ordinary skill in the art will understand the several variations on the methods of identifying desired web pages, such as a text entry area, a dialog box, or others.
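The request-response exchange described above can be sketched as follows (a minimal illustration using Python's standard library; the commented-out URL is an example, not part of the original description):

```python
import urllib.request

# The browser resolves the URL, opens a TCP/IP connection to the server,
# and sends an HTTP GET request; the server answers with an HTML web page.
def fetch_page(url: str) -> str:
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

# html = fetch_page("http://www.example.com/")  # the page's HTML text
```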
Each web page is located at a network address that can be expressed in the URL format. URLs are typically expressed in Roman characters such as those used in the English language. For most web pages, and in particular most "home" pages for web sites, a domain name is chosen so that it is easy for users to remember. For example, companies and organizations with widely recognized company names such as "Coca-Cola®", "Motorola®", and "IBM®" typically use their trademarked corporate names as the domain names for their home pages, e.g., www.motorola.com and www.ibm.com. Although this addressing system is convenient for users familiar with a language founded on a Romanized alphabet system, it is not convenient for a large portion of the global population whose native languages do not employ such an alphabet.
For example, languages such as Persian, Hebrew, Japanese, Korean, and Chinese do not use the Romanized alphabet. The Chinese language, for instance, uses characters or symbols which may require many strokes, on the order of 17, for each symbol. In countries where such languages are spoken, the names of established companies are typically written with characters from their native languages, and therefore cannot be expressed using characters of the Roman alphabet. Accordingly, since URLs must be formed from Roman characters, these company names cannot be used as Internet domain names. If such a company selects a domain name formed of Roman characters, the selected name will have little or no meaning to its customers because the characters and domain names are unfamiliar to them. The same problem arises when companies established in the United States or in other countries (which use domain names in a Romanized alphabet) try to market their products and services in certain foreign countries. Since customers in these areas are not familiar with the Romanized alphabet, and are most likely more familiar with the translated names of these corporations in Korean, Chinese, or whatever local language is at issue, the domain names for such companies (even if well known in the United States) often have little meaning to these foreign customers and are therefore not the most effective in such markets.
Further, even if the domain name system allowed the use of other types of characters, such as Chinese, Korean, Persian, Hebrew, or Japanese characters, it would be difficult for users to enter such characters into a computer using a conventional keyboard, such as a personal computer (PC) keyboard, which is the standard input device for computers installed worldwide, without regard to the native language. For example, there are at least five different, and complex, methods for inputting Chinese into a personal computer through the standard keyboard. One method, called Cang Jie, breaks down Chinese characters into 26 building blocks, or radicals, on a PC keyboard. A method used in Taiwan, Zhuyin, uses a set of 37 phonetic symbols on the PC keyboard for input purposes. Another system used in China, Pinyin, uses standard Romanized letters. In China, there is a five-stroke method called Wu Bi, while Hong Kong-based Ziran has come up with a ten-stroke system.
Numerous methods have been invented for entering Chinese textual characters using a keyboard. U.S. Pat. No. 6,009,444 to Chen discloses a method of entering text for the Zhuyin phonetic Chinese language, in which a character is representable as a first symbol selected from a first subset of symbols and a second symbol selected from a second subset of symbols, where the first and second subsets are mutually exclusive. A first key on which is displayed a first subset of symbols is activated (e.g., any one of keys 1-6). A candidate first symbol is displayed in response to the step of activating the first key. A second key is activated on which is displayed a second subset of symbols (e.g., any one of keys 7-0). The candidate first symbol is fixed and a candidate second symbol is displayed in response to activating the second key. A third key can be activated (e.g., any one of keys 7-0), on which is displayed a further subset of symbols, whereupon the candidate second symbol is fixed.
Even with these techniques, such characters are difficult to type into a computer, and any domain names formed from such characters are difficult for English-speaking customers to understand. These examples illustrate the numerous obstacles to implementing a multi-lingual URL system, even if the complications of inputting non-Roman characters could be overcome.
SUMMARY OF THE INVENTION
The invention relates generally to an improved system and method for retrieving web pages from a plurality of server computers connected to a public communication network such as the Internet. The system includes a plurality of voice activated client computers and at least one server computer, which shall be designated as a voice bridge server.
Each voice activated client computer includes a presentation device, such as a computer display, for presenting the content of web pages to a user. It also includes a recording mechanism having a microphone for recording audio pointers, which represent the names, to be inputted through speech, assigned to web pages which the user desires to access.
Each voice activated client computer also has a communication mechanism for connecting to said public communication network to obtain desired web pages. When a user speaks an assigned name, which shall be called an audio pointer, of a desired web page, the communication mechanism first transmits the recorded audio pointer via said public communication network to a remote voice bridge server connected to the Internet.
The voice bridge server includes an audio pointer database for storing, for each of a plurality of Internet web pages, a corresponding audio representation of a voice name assigned to the web page (the "audio pointer"). It also stores a network address for each such web page. When the voice bridge server receives from a remotely located voice activated client computer a recording of a spoken audio pointer of a desired web page, it compares the received recording to data in the audio pointer database to determine the network address for the desired web page. It then transmits the network address for said desired web page to said remotely located voice activated client computer. When the voice activated client computer receives from the voice bridge server a network address for the web page corresponding to the audio pointer, it transmits a request for the desired web page to a remotely located web page server identified by the network address, and receives said web page from said remotely located web page server for presentation by a browser program. In a preferred embodiment, the system includes a plurality of voice bridge servers, each for handling audio pointers in one specific language. In this embodiment, each voice activated client computer includes a language selection mechanism for allowing the user to select a desired language for the audio pointers. The voice activated client computer also includes a voice bridge selection mechanism which, based upon the selected language, selects a corresponding remote voice bridge server. The communication mechanism of the voice activated client computer then transmits each recorded audio pointer to the corresponding voice bridge server for that language.
In other embodiments, voice bridge servers can be deployed based on geographic regions to service clients in those regions. Each such server may contain multiple databases, each for handling a different language commonly used in said region.
BRIEF DESCRIPTION OF THE DRAWINGS
These and other objects, features and advantages of the invention will be more readily apparent from the following detailed description of presently preferred embodiments and the appended claims with reference to the accompanying drawings, where like numerals represent like parts, and in which:
FIG. 1 is a block diagram of a plurality of conventional client computers, voice activated client computers, conventional web servers and voice bridge server computers coupled to a communication network.
FIG. 2 is a block diagram of an illustrative voice activated client computer. FIG. 3 is a block diagram of an exemplary voice bridge server computer.
FIG. 4 is a flow-chart depicting the flow of execution in a preferred voice activated client computer configuration.
FIG. 5 is a flow-chart depicting the operation of a voice bridge server computer. FIG. 6 is a flow-chart of procedures performed by a voice activated client computer once the voice bridge server returns the search results based upon the user's spoken audio pointers.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
System Overview
FIG. 1 depicts a plurality of conventional client computers 100, conventional web servers 102, voice activated client computers 104, and voice bridge server computers 106, all connected to a communication network such as the Internet 108.
The web servers 102 are conventional computer servers which are configured to store and disseminate web pages. Conventional client computers 100 transmit requests for these web pages directly to web servers 102, as is well known in the art. In response to such requests, the web servers transmit web pages over the Internet 108 for display at the requesting client computer 100.
Voice activated client computer 104
Referring now to FIG. 2, in a preferred embodiment, each voice activated client computer 104 includes a processor 200 (which includes a microprocessor such as the Pentium® III and memory such as semiconductor memory), a display 202 such as a CRT or a flat-panel display, a storage device 204 such as a fixed disk drive, and input devices 208 such as a keyboard or a pointing device such as a mouse. It also includes a speech input device such as a microphone 206 and an audio output device 210, both of which are connected to a DSP 224 that handles coding, decoding, and data I/O to and from the audio hardware and the software that utilizes it.
Further, each client computer 104 is also equipped with a communication mechanism 212 that enables communication with other devices connected to networks, such as the Internet 108 via communication protocols such as TCP/IP. In general, the connection to the Internet 108 can be established via an Internet Service Provider (ISP) such as America Online, Inc., or via an office Local Area Network. In a preferred embodiment, a transmitter and receiver are both included in a network card.
Each voice activated client computer 104 includes a browser program 220 for accessing web pages from web servers 102 in the same manner as conventional client computers 100. Each voice activated client computer 104 is enhanced according to the principles of the present invention to include a mechanism that allows a user to request web pages by speaking a predetermined name or phrase. Accordingly, as shown in FIG. 2, each voice activated client computer 104 includes a Client Side Software (CSS) program 222. Though a conventional computer such as a personal computer is illustratively depicted as the voice activated client computer 104, in alternative embodiments, the present inventive principles can be practiced by enhancing devices such as the Web-TV™ device marketed by Microsoft Corporation of Redmond, Washington; a hand-held computer such as the Palm Pilot™ marketed by 3-COM Corporation of Santa Clara, California; the AOL- TV™ device marketed by America Online, Inc. of Dulles, Virginia; or the device used in conjunction with the Wireless Web™ service from Sprint Corporation of Westwood, Kansas.
Voice Bridge Server Computer 106
Referring to FIG. 3, each voice bridge server computer 106 is preferably implemented by installing voice bridge application software onto a general purpose computer that includes a processor 300 (including a microprocessor such as an Alpha™ microprocessor and associated semiconductor memory), a storage device 302 such as a fixed disk drive, and a data communications device 304 such as a 3-COM™ network card to connect to the Internet 108 using a protocol such as TCP/IP. In general, the connection to the Internet 108 can be established via an ISP such as America Online, Inc., or via a direct connection such as a T-1 or a Digital Subscriber Line connection. Further, in a preferred embodiment, the voice bridge server 106 includes one or more databases 314 such as the Oracle Relational Database Management System.
The voice bridge server computer 106 is additionally configured to execute an operating system such as Windows-NT™ or Linux, and web server software such as the one
marketed by Netscape® Corporation of Sunnyvale, California or obtainable from sources such as Apache. The web server software is configured to interface with a communication device to receive packets of messages from computers connected to the Internet 108, decipher the information in the packets, and act according to instructions provided in the packets.
The processor 300 executes a Server Side Software (SSS) program 310 and a Voice Bridge Application (VBA) program 312. Preferably, the VBA 312 includes a voice recognition module (not shown) and a database 314 containing samples of the audio pointers that sponsors prefer to use to designate their web sites. The voice recognition module can be purchased or licensed from companies that specialize in the analysis of spoken waveforms. The voice recognition module is preferably tuned to a lower accuracy setting to accommodate a larger number of users with distinct speaking styles or accents. In preferred embodiments, the SSS 310 and the VBA 312 are written in standard programming languages such as Java or C++, or implemented in part as middleware components that interface with the database 314.
Operation of the Invention
In a preferred embodiment, the present invention is implemented in part on the voice activated client computer 104 and in part on the voice bridge server computer 106.
Operation of the Voice Activated Client Computer 104
Referring to FIG. 4, which depicts the steps performed by the voice activated client computer 104, as a first step, a user installs the CSS 222 on a standard client computer 104 (step 400) equipped with a DSP and a microphone. This installation can be accomplished by any of the standard methods known to persons skilled in the art, including installation via a portable medium such as a CD-ROM or a floppy disk, or downloading via the Internet 108. Such a system is referred to as a voice activated client computer. When the CSS 222 is installed on the voice activated client computer 104, the computer displays an icon on the desktop of the voice activated client computer's display screen 202. The user then activates or "launches" the CSS 222 by using the mouse and clicking once or twice on the icon (step 402). When launched, the CSS 222 preferably opens a window on the voice activated client computer 104 and prompts the user for a preferred language selection by displaying a language selection list (step 404). In a preferred embodiment, this selection list includes a menu of options selectable by the user, radio buttons, and/or check boxes and the like.
Alternatively, if a user so desires, a preferred language can be selected while installing the CSS 222. In this case, any language choice that is so selected is stored in the client computer 104 as a "default" language. When the user thereafter launches the CSS 222, a default language is assumed to be selected and therefore no further selection is necessary. However, if after the CSS 222 is launched, the user wishes to change the language selected, he can override the default language by selecting a different language from a selection box as explained above.
After the user selects a language, the CSS 222 assigns and stores internally a user identifier for the user (step 406). This user identifier is preferably a unique identifier, which could be established by generating a random number and appending the random number to a number unique to the client computer 104. This unique number is obtained from the user's computer license key, a network card identifier, a transformation of the user's name, or another such method. This step is only applicable when the CSS runs for the first time.
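A minimal sketch of such identifier generation follows (illustrative only; here the MAC address obtained via `uuid.getnode()` stands in for a license key or network card identifier, and the formatting is an assumption):

```python
import secrets
import uuid

# Build a per-user identifier by appending a random number to a number
# unique to the client machine, as the text describes.
def make_user_identifier() -> str:
    machine_id = uuid.getnode()             # unique to this computer
    random_part = secrets.randbelow(10**9)  # random number appended to it
    return f"{machine_id:x}-{random_part:09d}"
```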
The CSS 222 then prompts the user for an audio pointer in the selected language which identifies a web page the user wishes to view (step 408). This prompt is preferably an audio prompt or a visual prompt. In the case of an audio prompt, the user is beeped to indicate when to start speaking. In the case of a visual prompt, the user is shown a window with a pre-designated area comprising a button to press before speaking the phrase. The CSS 222 also instructs or activates the DSP 224 to start listening to the speaker's voice as received by the microphone 206 (step 410). Preferably, the DSP 224 is configured to wait and listen for a predetermined (expiration) period of time, for example, 2-3 seconds, before timing out. If the user speaks the audio pointer during this time period, the CSS 222 instructs the DSP 224 to record the audio pointer in a file on the fixed disk drive 204 (steps 412, 416). In one embodiment, the stored audio pointer is a wave format (".wav" extension) file or an audio format (".au" extension) file encoded according to a method known to persons skilled in the art. If, on the other hand, the user does not speak a phrase during the expiration period, the CSS 222 expires (times out) and prompts the user again (steps 412, 414).
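The prompt-record-timeout loop of steps 408-416 might look like the following sketch (the `dsp_record` callable is hypothetical, standing in for the DSP 224 interface; a real implementation would capture and encode microphone samples):

```python
RECORD_TIMEOUT = 3.0  # seconds to wait for the user to speak (2-3 s in the text)

def capture_audio_pointer(dsp_record, out_path: str, timeout: float = RECORD_TIMEOUT):
    """Prompt-and-record loop: retry until the user speaks within the window.

    dsp_record(timeout) is a hypothetical stand-in for the DSP 224 interface;
    it returns recorded bytes, or None if nothing was spoken before timing out.
    """
    while True:
        print("Please speak the audio pointer now...")   # audio/visual prompt
        samples = dsp_record(timeout)
        if samples is not None:
            with open(out_path, "wb") as f:              # e.g. a .wav or .au file
                f.write(samples)
            return out_path
        print("No speech detected; prompting again.")    # step 414: re-prompt
```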
The CSS 222 then verifies whether a network connection is established for the voice activated client computer 104 (step 418). Preferably, this verification is done by sending a "ping" signal to a remote computer such as the voice bridge server 106 over the network 108. If a connection is not present, the CSS 222 instructs the voice activated client computer to establish a network connection (step 420) via a standard method.
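The connectivity check of step 418 can be approximated with a TCP connection attempt (a sketch; the port default is an assumption, and a real "ping" might use ICMP instead):

```python
import socket

def network_available(host: str, port: int = 80, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the remote host succeeds,
    analogous to "pinging" a remote computer such as the voice bridge server."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```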
Based on the language selected by the user, the CSS 222 identifies a particular voice bridge server computer 106 (step 422) from among a plurality of voice bridge servers 106,
each of which is designated to serve a different language. It should be noted, however, that there may be only one voice bridge server 106 designated to service a plurality of languages, or a particular language group comprising a plurality of dialects, in which case the CSS will request data from one of several databases on a voice bridge server. Other embodiments, which are easily understood by persons skilled in the art, include a plurality of server computers that collectively comprise a single voice bridge server 106.
The CSS 222 transmits the user's unique identifier and the recorded audio pointer to the voice bridge server 106 (step 424). This is accomplished by using a method such as the file transfer protocol (FTP), HTTP, or others known to persons of ordinary skill in the art. The CSS 222 then removes the file from the client computer.
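Step 424 might be sketched as an HTTP POST carrying the identifier and the recorded file in one multipart body (a minimal illustration; the field names and server URL layout are assumptions, not specified in the text):

```python
import urllib.request

def build_upload_request(server_url: str, user_id: str, audio_bytes: bytes):
    """Package the user identifier and recorded audio pointer into a single
    HTTP POST request (field names and URL are illustrative)."""
    boundary = "voicebridge-boundary"
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="user_id"\r\n\r\n{user_id}\r\n'
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="audio"; filename="pointer.wav"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n"
    ).encode() + audio_bytes + f"\r\n--{boundary}--\r\n".encode()
    return urllib.request.Request(
        server_url,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```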
Operation of the Voice Bridge Server Computer 106
Referring now to FIG. 5, during an initialization step 550, a voice bridge service provider (VBSP), which is an entity that operates the voice bridge server 106, invites other entities (organizations, individuals, or companies, hereinafter collectively referred to as "sponsors") to do business with the VBSP. To do business with the VBSP, a sponsor first opens an account with the VBSP by filling out an online or a paper form, which requests the sponsor's name, address, billing information, and the like. Once a sponsor fills out the form and (optionally) pays a prescribed fee, a sponsor account is established for that sponsor and is recorded in the database 314.
After registration, the sponsor may provide a plurality of network addresses (e.g., URLs) for registration with the VBSP. For each network address that the sponsor wishes to register, the sponsor preferably provides the address (either in a symbolic form or in dotted decimal notation) and a corresponding word or phrase in a particular language of interest, which the system will use as an "audio pointer" to the network address. In some cases, the sponsor may provide at least one audio pointer for each of a plurality of languages of interest. The audio pointer can be arbitrarily chosen and need not have any relationship to the contents of the network address; it is only a symbolic name, such as a nickname or a memorable phrase, which the sponsor wishes to designate as the name for that network address. Preferably, the audio pointer is recorded by a native speaker of the selected language(s) of interest in a particular vernacular or dialect.
For each network address registered by a sponsor, the SSS 310 stores the address together with the corresponding designated audio pointer in the audio pointer database 314 (step 552). A single sponsor may also register a plurality of audio pointers for a given web page, or, alternatively, a single audio pointer may be associated with a list of addresses.
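The many-to-many pointer/address registration of step 552 could be sketched with a relational table (sqlite3 here stands in for the Oracle database 314; the schema, column names, and example data are illustrative):

```python
import sqlite3

# One row per (audio pointer, URL) pair: a sponsor may register several
# pointers for one page, or one pointer mapping to several addresses.
def open_pointer_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute(
        """CREATE TABLE audio_pointers (
               sponsor   TEXT NOT NULL,
               language  TEXT NOT NULL,
               phrase    TEXT NOT NULL,   -- stand-in for the recorded waveform
               url       TEXT NOT NULL
           )"""
    )
    return db

def register(db, sponsor, language, phrase, url):
    db.execute("INSERT INTO audio_pointers VALUES (?, ?, ?, ?)",
               (sponsor, language, phrase, url))

def lookup(db, language, phrase):
    rows = db.execute(
        "SELECT url FROM audio_pointers "
        "WHERE language = ? AND phrase = ? ORDER BY rowid",
        (language, phrase))
    return [r[0] for r in rows]
```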
Since an aspect of the present invention is to match sounds in an entire phrase with similar stored sound samples, rather than recognizing the exact words spoken by a user, confusing or like-sounding words or phrases — such as "Apple Computer" and "April Computer," wherein the latter phrase is the former phrase mis-pronounced — are either removed from or not entered into the database 314. By thus avoiding confusing phrases, the accuracy of the matching process at a later stage is enhanced (step 554).
The SSS 310 "listens" for any incoming requests from voice activated client computers 104 (step 556). When, as described in step 424, the client computer 104 transmits the audio pointer and the user's identifier to the voice bridge server 106, the voice bridge server 106 receives them and stores them in memory (step 558).
In one embodiment, the SSS 310 searches database 314 to determine if the VBSP has a prior usage record for the user identifier (step 559). If no prior usage record exists, the SSS 310 instructs the VBA 312 to store the user's identifier in the user access records maintained on disk 302 (step 560). If the user has an existing user access record, it is retrieved from disk 302 to assist in interpreting the new request.
After receiving the user's audio pointer as in step 558, the SSS 310 transfers control of execution to the VBA 312 (step 562). The VBA 312 preferably removes from the received audio pointer any leading or trailing silence, and compares the trimmed pointer to the audio pointers stored in the database 314 to locate the stored audio pointers which correspond to the received audio pointer (step 566). In other embodiments, intra-phrase silence is also removed.
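Silence removal over raw samples can be sketched as follows (a simplified amplitude threshold over a list of sample values; a real system would operate on decoded waveform frames, and the threshold is an assumption):

```python
def trim_silence(samples, threshold=100):
    """Strip leading and trailing samples whose magnitude falls below the
    threshold, leaving only the spoken portion of the audio pointer."""
    start = 0
    end = len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]
```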
If the VBA 312 finds a single audio pointer that matches the received audio pointer, the VBA 312 retrieves from the database 314 the corresponding network address (URL) or addresses and forwards it to the SSS 310 (step 566).
After the matching step, the SSS 310 transmits the corresponding network address or addresses to the voice activated client computer 104 (step 568) to permit the client computer to request the web page from the appropriate web server 102. The VBA 312 may find a plurality of registered audio pointers which seem to match the received audio pointer, for example, if two registered audio pointers sound similar (e.g., "April Computers" and "Apple Computers"). In this case, the SSS 310 gathers all the matched URLs and compares them to this user's user access record to determine the priority of the matched URLs. The SSS 310 then forwards all of this information to the CSS so that the user can pick the desired URL.
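One way to order multiple matches against the user access record is shown below (a sketch; ranking by how often the user has accessed each URL before is an assumed heuristic, as the text does not specify the priority rule):

```python
from collections import Counter

def prioritize_matches(matched_urls, user_access_record):
    """Order candidate URLs so that pages the user has visited most often
    appear first; unseen URLs keep their original order at the end, since
    Python's sort is stable. (The visit-count heuristic is illustrative.)"""
    visits = Counter(user_access_record)  # past accesses, one URL per entry
    return sorted(matched_urls, key=lambda url: -visits[url])
```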
It should be noted that these multiple matches could be the result either of a confusing or similar sounding name, or of the sponsor having included a plurality of matching URLs to be returned whenever a user speaks a particular word or phrase. The former is the case when a user mis-pronounces a word which matches two close audio pointers, so that the match results in two URLs. The latter could be the case, for example, when a Portuguese company registers two URLs, one for use in Portugal and another for use in Brazil. A user uttering the same audio pointer or phrase in the Portuguese language will therefore match both URLs, as intended by the sponsor, and these are returned to the user for a further selection at the voice activated client computer 104.
Advantageously, the SSS 310 maintains user access records of every access by a user of the voice bridge server 106, for use in assisting the voice recognition module in interpreting subsequent requests. The user access records typically include information identifying the web pages which the user has accessed in the past. However, they may also include information regarding the speech characteristics of the user and other information useful to the voice recognition module. In alternative preferred embodiments, every input and output between the SSS 310 and the VBA 312, and between the client computer 104 and the voice bridge server 106, is also logged (step 570).
Operation of the Voice Activated Client Computer 104
Referring to FIG. 6, the CSS 222 receives from the voice bridge server 106 the network addresses which matched the audio pointer sent by the client computer to the voice bridge server (step 680). If there are a plurality of network addresses matched for the audio pointer, the CSS 222 displays these in a selection list on the display device, arranged by priority, allowing the user to make a selection from the list (step 682).
The CSS 222 then instructs the browser 220 to send a request for the desired web page to the selected network address supplied by the voice bridge server 106 (step 684). The selected page is received from the remote web server 102 and displayed on the display 202. The foregoing discloses a method for providing web access based on a user-selectable language, which language could be one that does not use an alphabet system such as the Roman alphabet system. Persons skilled in the art may make several modifications and rearrangements without departing from the spirit and scope of the invention or without undue experimentation. For example, some of the steps are optional, and the order of the steps described herein can be changed. The functions performed by the VBA 312 and the
SSS 310 could be performed by hardware or software according to any other method as designed by a person skilled in the art. The web browser can be any content viewer. The web page can be any content available on the Internet. All such deviations, departures, modifications and rearrangements should be construed to be within the spirit and scope of the appended claims.