WO2011004000A2 - Information distributing system with feedback mechanism - Google Patents

Information distributing system with feedback mechanism

Info

Publication number
WO2011004000A2
WO2011004000A2 (PCT/EP2010/059874)
Authority
WO
WIPO (PCT)
Prior art keywords
voice
data
server
representation
browser
Prior art date
Application number
PCT/EP2010/059874
Other languages
French (fr)
Other versions
WO2011004000A3 (en)
Inventor
Frits Kievits
Vladimir Bronin
Anthonius Jacobus Brouwer
Original Assignee
Dialogs Unlimited B.V.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dialogs Unlimited B.V.
Publication of WO2011004000A2
Publication of WO2011004000A3

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/42 Systems providing special services or facilities to subscribers
    • H04M3/487 Arrangements for providing information services, e.g. recorded voice services or time announcements
    • H04M3/493 Interactive information services, e.g. directory enquiries; Arrangements therefor, e.g. interactive voice response [IVR] systems or voice portals
    • H04M3/4938 Interactive information services, e.g. directory enquiries; Arrangements therefor, e.g. interactive voice response [IVR] systems or voice portals comprising a voice browser which renders and interprets, e.g. VoiceXML
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/174 Form filling; Merging
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M1/00 Substation equipment, e.g. for use by subscribers
    • H04M1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72403 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • H04M1/72445 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality for supporting Internet browser applications
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M2203/00 Aspects of automatic or semi-automatic exchanges
    • H04M2203/25 Aspects of automatic or semi-automatic exchanges related to user interface aspects of the telephonic communication service
    • H04M2203/251 Aspects of automatic or semi-automatic exchanges related to user interface aspects of the telephonic communication service where a voice mode or a visual mode can be used interchangeably
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M2207/00 Type of exchange or network, i.e. telephonic medium, in which the telephonic communication takes place
    • H04M2207/40 Type of exchange or network, i.e. telephonic medium, in which the telephonic communication takes place terminals with audio html browser
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M2250/00 Details of telephonic subscriber devices
    • H04M2250/74 Details of telephonic subscriber devices with voice recognition means

Definitions

  • the invention relates to an information distributing system comprising a voice browser and a voice server, the voice browser comprising a first sending device for sending voice data over a first voice data connection to the voice server, the voice server comprising a first voice data receiving device for receiving the voice data, a speech recognition device for parsing the voice data, a representation data creation device for creating representation data representing the parsed voice data, a second sending device for sending the representation data over a second data connection to the voice browser.
  • the invention also relates to a voice browser and a voice server for use in the information distributing system.
  • the invention also relates to a method of distributing information and corresponding computer program product.
  • a computer generates the appropriate phrases using a speech synthesis application, for example, by way of a speech synthesis library.
  • the user may respond to the computer generated phrases in any suitable manner, but preferably with voice responses.
  • the user's voice responses are interpreted by the computer using, e.g., speech recognition.
  • VXML VoiceXML
  • W3C World Wide Web Consortium
  • the voice browser may interpret and execute a voice script, such as a VoiceXML script.
  • VXML applications have been deployed. These applications include: order inquiry, package tracking, driving directions, emergency notification, wake-up, flight tracking, voice access to email, customer relationship management, prescription refilling, audio newsmagazines, voice dialing, real-estate information and national directory assistance applications, etc.
  • the computer may generate phrases which indicate products which are for sale.
  • the user can direct the generated phrases towards products of his interest by giving responses.
  • the given responses are interpreted by the computer upon which it, following its scripts, generates new phrases.
  • the side of the interactive voice dialogue which is generated by a computer typically uses speech synthesis.
  • the user side may respond, e.g., using the numerical keyboard of his telephone, but preferably, he responds by voice as well.
  • the computer also employs a speech recognition application. Voice browsing is also possible without speech synthesis by using prerecorded speech samples.
  • VUI Voice User Interface
  • a user connects to the system via a telephone connection.
  • the system establishes a connection to a Web site and loads the first Web page.
  • the HTML is parsed, separating text from other media types, isolating URL from HTML anchors and isolating the associated anchor titles.
  • a user is allowed to say any keyword phrase from the title to activate the associated link.
  • the Web document is described to the user.
  • Each pop-up menu item or radio button choice is spoken to the user who can repeat that phrase or a phrase subset to activate that choice.
  • When the voice application is deployed, the following problem may occur.
  • When the interactive voice dialogue uses voice for both sides, it may happen that the user's voice is not recognized correctly. For example, a user may reply that he is interested in 'books', but the computer mistakenly identifies this as 'foods'. Consequently, the computer may continue to present products which belong to a category in which the user is not interested.
  • the service circuits include a plurality of conventional Text-To-Speech (TTS) processor circuits which translate text into voice signals, a plurality of conventional voice circuits which record, digitize and play back speech and other audio frequency sounds, a plurality of conventional Automatic Speech Recognition (ASR) circuits which interpret speech signals received from a caller, and a plurality of conventional ISDN Basic Rate Interface circuits which provide an interface between the system platform and the Public Switched Telephone Network (PSTN).
  • TTS Text-To-Speech
  • ASR Automatic Speech Recognition
  • When the service receives a caller's web page update via a voice telephone call, the update is converted to a text update by an ASR. Next, the service uses a TTS to convert the text update to speech and read it back to the caller.
  • a disadvantage of this method is that it takes a relatively long time for the user to verify his responses. Approximately the same amount of time is used for verification as for receiving the user's input itself. As a result, this type of verification makes voice applications sluggish.
  • An information distributing system comprises a voice browser and a voice server.
  • the voice browser comprises a first sending device for sending voice data over a first voice data connection to the voice server.
  • the voice server comprises a first voice data receiving device for receiving the voice data, a speech recognition device for parsing the voice data, a representation data creation device for creating representation data representing the parsed voice data and a second sending device for sending the representation data over a second data connection to the voice browser.
  • the representation data comprises visual data representing the parsed voice data.
  • the voice browser comprises a second visual data receiving device for receiving the visual data and a display for displaying the visual data.
  • a user can use a voice browser to control browsing through a voice server using his voice to guide the browsing.
  • the server that receives this voice data must often first interpret the voice data before it can act on it.
  • the voice data may be parsed into data which is more readily readable by a machine.
  • this parsing process is prone to error.
  • if the voice data comprises commands, or comprises data upon which actions are to be based, this can have undesirable consequences, ranging from wrong entries in databases to misinterpreted orders, etc.
  • the user of the voice browser may verify the parsed data for errors.
  • the representation data creation device may create data which represents the voice data, at least partially and at least in so far as is relevant to the current application.
  • a user can review this representation, and take corrective actions, e.g., cancel an order, etc., if the representation does not match the instruction he gave using the voice data.
  • since the representation data comprises visual data which represents the parsed voice data, it can be displayed on a display device.
  • a visual representation has the advantage over other types of data representation, e.g., audio representations, that it can be reviewed very quickly. For less complicated data the voice browser's user can see at a glance if something is amiss.
  • a further advantage is that visual information can be reviewed more carefully, since each element of the representation can be carefully examined.
  • the time a user needs for the review is easily selectable by the user. Some users may be comfortable in quickly reviewing the representations; other users may want to review at a slower pace.
  • a user may even take more time for some parts and less time for other parts of the representation.
  • the speed with which a user can review visual data need not be constant but can depend on, e.g., the complexity of the part of the representation which is currently under review. In a sense, the representation data allows the user to give feedback on the parsing performed by the voice server.
  • voice applications may be used on mobile phones which have the option of using a display, albeit a small one, and a keyboard, albeit a cumbersome one.
  • the voice server may, instead of sending a speech synthesized version of how the computer understood the voice input, send a visual (e.g. textual) representation to the mobile phone, for display on the mobile phone's display.
  • the feedback may also be other visual feedback, such as a selected radio button.
  • the visual feedback may be compiled from multiple user responses.
  • a voice browser is a device capable of using voice to browse via a speech application.
  • the voice browser can be any device which is capable of receiving and sending voice data to a voice server. Examples include a phone, in particular a mobile phone, a computer configured for VOIP connections, etc.
  • the voice server may comprise a computer server.
  • the first sending device for sending voice data may include a telephone connection, an antenna configured for e.g. GSM, etc.
  • the first voice data connection may be analog or digital. It may be a telephone connection or a data network connection.
  • the representation data representing the parsed voice data may comprise HTML, text, etc.
  • the representation data may also be graphical. For example, an HTML page may be shown in which certain radio buttons are provided with an indication that that radio button is pressed.
  • the second data connection may be any connection type suitable for transmission of data. Examples include SMS, Internet, etc.
  • the display for displaying the visual data may be a computer monitor, a television, a mobile phone display etc.
  • creating representation data comprises entering the parsed data in an input field of a form.
  • Voice applications may be used to enable users to enter data into data structures using a voice browser. Filling in forms is one way to do this.
  • a form may be an HTML page with a data entry element, e.g., an <input> field. Giving the user at least part of the form that he is filling in using the voice application puts his parsed voice data into a context. This makes it easier for the user to recognize if these entries make sense, and are correct. Data entry review may go faster. The review will also be more likely to be correct, as a user is less likely to confuse answers and questions. For example, if he can see the form in the visual representation, he will be able to see whether his name was entered in a 'place of birth' field by accident.
  • the form may be in many suitable representations, e.g. VXML, HTML, etc.
  • the voice server is connectable to a web server for receiving a representation of the form, and for sending a representation of the parsed data to the web server.
  • Sending the filled-in form to a web server allows the voice server to act as a conduit between a web server and the voice browser.
  • the effect is that a user can fill-in forms on a web server without using a regular text oriented web browser, but using his voice browser instead.
  • the user can review his answers better. Review of his answers may be done before the filled-in form is sent to the web server; this has the advantage that the web server is less likely to receive incorrect forms. It may also be done after the form has been sent to the web server; this has the advantage of speeding up the process.
  • the voice server may be connected to the web server through another device which acts as an intermediary.
  • the voice server is connectable to a transcoder device for converting a webpage to a voice application.
  • the voice server may be connected to the web server through the transcoder device. That is, the voice server may be connected to the transcoder device, and the transcoder device may be connected to the web server.
  • although VXML is a convenient language in which to develop voice applications, the fact remains that a voice application has to be built from scratch. Especially for large applications this is a significant investment.
  • moreover, often there already exists an HTML version of the application, for example, the vending application mentioned in the background.
  • the vendor typically first invests in an internet presence, that is, in a web-based application in which customers can interactively order his products.
  • the problem may be solved by using a transcoder which transcodes, i.e., translates a web based application to a voice based application.
  • the transcoder is on the one hand connected to a web server, for receiving web pages, and on the other hand to a voice server for outputting voice scripts.
  • the voice server uses the voice scripts to synthesize phrases and to receive and recognise speech coming from a user.
  • the transcoder converts the web page to a voice application by means of a script conversion device, which converts a page written in a scripting language intended for visual display, e.g., HTML to a scripting language which is intended for use by a speech synthesizer and/or speech recognizer, e.g., VXML.
  • the script conversion device uses 'HTML patterns', which are a type of search string used to recognize a predetermined pattern of HTML code.
  • the HTML pattern may be written using regular expressions. It is preferred if the patterns are XML based.
  • To each HTML pattern a VXML pattern is associated which indicates how this particular HTML pattern is to be converted to VXML.
  • Another way of processing HTML documents for use in a voice application is described in Brown 1, section 'PhoneBrowser Processing' and the section 'Document processing'.
  • the transcoder fills-in the web page using the recognised, i.e. parsed, responses from the user. If the web page requires multiple responses the transcoder may be configured for collecting all user responses for that webpage.
  • the transcoder submits the webpage to a web server. In this way, it seems to the web server as if it handles just one more ordinary interaction. In other words, it may be transparent to the web server that a user interacts with it through voice.
  • the speech synthesis and speech recognition systems need not run on a user's device. This has the advantage that the user's voice browser needs less powerful computing resources. Moreover, having the speech synthesis and/or speech recognition at a voice server has the advantage that the quality of the speech synthesis and/or speech recognition can be higher, as a result of larger available resources, e.g. more computing power, larger memories, etc. Speech synthesis and speech recognition are known to be resource intensive applications; moving these applications away from devices with relatively few computing resources is therefore an advantage.
  • Converting an existing HTML application to a VXML application has the advantage that the VXML application does not need to be built from scratch. This significantly reduces the investment that is necessary before the voice application is ready for use. Note that not only are individual HTML pages converted to VXML, but also the process which is embedded in the web application (e.g. which page comes after which page, etc.) is converted.
  • the first connection is different from the second connection.
  • the first voice data connection and the second data connection may each be adapted to their various uses.
  • the first voice data connection may be a regular telephone connection, connection to a mobile phone, a fixed phone, etc.
  • the second data connection may use a connection adapted for data transmission, e.g., short message service (SMS), Ethernet, Internet etc.
  • SMS short message service
  • both connections may also be the same.
  • both connections may be an internet connection, the voice data coming in, e.g., over VOIP, the representation data using the HTTP protocol.
  • the voice browser comprises an input device for selecting in the displayed visual data an incorrectly parsed data element of the parsed voice data.
  • Identifying which elements are incorrect has the advantage that the voice server may take corrective action. Since the voice server knows which elements are incorrect the corrective action can depend thereupon. For example, the voice server may ask to repeat only those data elements which are incorrect. The process of correcting parsing errors in the form is faster if only a subset of the data elements need correcting.
  • the voice browser's user may use various input devices, different from his microphone, to select a response that was received incorrectly.
  • input devices include a keyboard, a mouse pad, a touch sensitive screen etc.
  • a webpage which comprises multiple responses can also be sent to the mobile phone in pieces, so that the pieces may fit easily on the mobile phone's display.
  • identifying an incorrectly parsed data element may also be done using voice commands.
  • the user may say a command which tells the voice server that the first element is incorrect.
  • the visual data comprises textual data.
  • a particularly efficient way to represent and communicate parsed voice data is to represent them as text, e.g., using ASCII symbols.
  • the voice browser is comprised in a mobile phone.
  • a mobile phone is a convenient interface which allows one to browse on a voice server without the need to be at a fixed place.
  • a further aspect of the invention concerns a voice browser for use in any information distributing system according to the invention.
  • a further aspect of the invention concerns a voice server for use in any information distributing system according to the invention.
  • a further aspect of the invention concerns an information distributing method.
  • the method comprises sending voice data over a first voice data connection to the voice server, receiving the voice data, parsing the voice data, creating representation data representing the parsed voice data, sending the representation data over a second data connection to the voice browser.
  • the representation data comprises visual data representing the parsed voice data.
  • the method further comprises receiving the visual data and displaying the visual data.
  • a further aspect of the invention concerns a method for receiving information on a voice browser comprising sending voice data over a first voice data connection to the voice server, receiving visual data and displaying the visual data, the visual data representing a parsing of the voice data.
  • a further aspect of the invention concerns a method for confirming information on a voice server comprising receiving voice data from a voice browser, parsing the voice data, creating representation data, sending the representation data over a data connection to the voice browser, the representation data comprising visual data representing the parsed voice data.
  • a method according to the invention may be implemented on a computer as a computer implemented method, or in dedicated hardware, or in a combination of both.
  • Executable code for a method according to the invention may be stored on a computer program product. Examples of computer program products include memory devices, optical storage devices, integrated circuits, servers, online software, etc.
  • the computer program comprises computer program code means adapted to perform all the steps of a method according to the invention when the computer program is run on a computer.
  • the computer program is embodied on a computer readable medium.
  • Figure 1 is a block diagram illustrating an information distributing system
  • Figure 2 is a schematic front view of a voice browser
  • Figure 3 is a schematic view of a visual representation of parsed data
  • Figure 4 is a flow chart representation of a method for information distribution.
  • Figure 5 is a block diagram illustrating a further information distributing system. Throughout the Figures, similar or corresponding features are indicated by the same reference numerals.
  • Figure 1 shows an information distributing system 100.
  • Information distributing system 100 comprises a web server 130 which is currently connected through data network 120 to a web browser 110.
  • Web server 130 is configured to serve a web site, e.g., some web application to browser 110.
  • Web browser 110 may be any known web browser, e.g., Internet Explorer, Firefox, etc.
  • the website comprises an information page (not shown) made in HTML. The information page may be sent by web server 130 to web browser 110 over network 120.
  • Web server 130 is also connected to back office system 160.
  • the information page comprises at least one input field, for example, a radio button, an input box etc.
  • the user may enter a response using web browser 110.
  • the response is received by web server 130 and the response is eventually entered into back office system 160.
  • back office 160 comprises a business rules engine (BRE) of some sort and a database.
  • BRE business rules engine
  • Information distributing system 100 further comprises a voice server 135 which is currently connected via voice network 125, i.e. a first voice data connection, to a voice browser 115.
  • Voice server 135 is configured to generate speech using speech synthesis, and to receive responses using speech recognition software. Note that, voice server 135 may also recognize Dual-tone multi-frequency (DTMF) responses, such as may be generated by pressing numbers on a telephone's keyboard.
  • Voice network 125 may be a telephone network.
  • Voice network 125 may also be a data network over which speech may be transferred in a suitable encoding.
  • Web server 130 and voice server 135 are both connected to transcoder 140.
  • FIG. 2 shows a front view of a voice browser, in this case a mobile phone 200.
  • Mobile phone 200 is provided with a display 210 and an input device 220.
  • the input device 220 may be a keyboard, a touchpad etc.
  • the keyboard may be numeric, alphanumeric etc.
  • Input device 220 may be integrated with display 210, e.g. if the display is touch sensitive.
  • the information distributing system 100 may omit certain features.
  • the invention may be used without web browser 110, data network 120, transcoder 140, back office system 160, or web server 130.
  • web server 130 may comprise a voice application written directly in VXML, and representation data is sent from web server 130 to voice browser 115.
  • Information pages are also possible without input fields. Even for an information page without an input field voice data may be received, e.g., to receive navigation commands, e.g., to request a next page.
  • transcoder 140 receives from web server 130 an information page comprising an input field.
  • the information page is to be filled out by the user. That is, the HTML page acts as a form.
  • Transcoder 140 converts the information page to a VXML voice script, and forwards the voice script to voice server 135.
  • Transcoder 140 comprises a script conversion device (not shown).
  • the script conversion device is configured to search inside the information pages for at least one pre-determined search string, possibly containing appropriate wild cards.
  • the pattern selects part of the information page which contains the input field.
  • a conversion pattern converts the selected part to a script page which may be used by voice server 135.
  • the HTML page may comprise the following HTML code:
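    (a minimal illustrative sketch; the action URL and element names, including 'B2' for the reset button, are assumptions chosen to match the phrases discussed below)

        <!-- hypothetical order form; field and button names are illustrative -->
        <form action="/order" method="post">
          <p>Name: <input type="text" name="name" /></p>
          <p>
            <input type="submit" name="B1" value="Submit" />
            <input type="reset" name="B2" value="Reset" />
          </p>
        </form>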
  • This snippet of HTML code may be recognized by an HTML pattern.
  • the HTML pattern may for example be configured to look for the words 'form' and 'input'. When a pattern matches against the HTML code it can extract elements which may be speech synthesized. In this HTML example, the pattern may extract the phrases 'Name', 'Submit' and 'Reset'.
  • Voice server 135 uses a speech synthesis application to generate voice, which is then sent over network 125 to a voice browser 115.
  • the voice browser may be a mobile telephone.
  • the speech synthesis application may generate sound files corresponding to the phrases identified above. Note that additional phrases may be added as required. For example, the application may add a phrase which instructs a user of the voice browser to speak an appropriate response to the phrase 'Name'.
  • a user of voice browser 115 can enter a voice response.
  • the voice data is sent back to voice server 135 over first connection 125.
  • a speech recognition application converts the voice response into a textual representation thereof. For example, the speech recognition application may translate a spoken name into a textual representation.
  • the speech recognition application may also convert a command to code representing the same command. For example, referring to the HTML example above, a user may say 'reset' to indicate that he wishes to annul his previous commands. The speech recognition application may translate this into the code 'B2'. Other parts of system 100 can react in an appropriate manner to receiving a code, in this case code B2.
  • a form in which the parsed voice data is filled-in may be created as a visual representation of the parsed data.
  • Figure 3 illustrates a possible visual representation
  • field 315 may contain the speech recognition's attempt at recognizing the name of the user.
  • radio buttons 320, 325 are shown, which may be labeled with 'Submit' and 'Reset' respectively.
  • This filled-in form may be sent to voice browser 115 as a graphical picture, e.g., a JPG picture, but it is more efficient to use a textual representation. It is also possible to combine text and graphics in a single representation; for example, the representation may use HTML.
  • the textual representation is sent via a connection 170, i.e., a second data connection, from transcoder 140 to the voice browser 115 for display on a display device.
  • Connection 170 is typically a data connection such as the internet.
  • the representation data may be sent from voice server 135 instead of from transcoder 140.
  • transcoder 140 may be integrated in voice server 135.
  • the radio buttons 320 and 325 may be labeled to indicate if they have been selected or if they can still be selected. For example, button 320 may be colored, e.g., red, to indicate that the submit button was selected. The buttons may be flashing to indicate that no selection has been made yet.
  • Mobile phone 200 can forward this information to voice server 135, possibly reusing data connection 170.
  • Voice server 135 may use this information to take corrective action. For example, it may request the voice data corresponding to the incorrect field again. It may also initiate a process in which the offending field is entered again, but using a different input medium. For example, the user may use input 220, e.g., a keyboard, to enter the problematic field again manually. In this way a benefit is still achieved: most of the fields can be entered using voice, and only the occasional field with errors must be typed. In this way typing is avoided for most of the input fields.
  • the displayed visual representation of the parsed voice data may comprise a plurality of displayed data elements.
  • the input device may be used for selecting in the displayed visual data an incorrectly parsed data element of the parsed voice data out of the plurality of displayed data elements of the parsed voice data.
  • FIG. 4 shows a schematic flow chart of a method for information distribution 400. Three parts of the method are represented in the flow chart indicated with reference numerals, 410, 420 and 430.
  • Step 410 comprises sending voice data over a first voice data connection to the voice server
  • Step 420 comprises receiving the voice data, parsing the voice data, creating representation data representing the parsed voice data, sending the representation data over a second data connection to the voice browser.
  • the representation data comprises visual data representing the parsed voice data.
  • Step 430 comprises receiving the visual data and displaying the visual data.
  • a method according to the invention may be executed using software, which comprises instructions for causing a processor system to perform method 400.
  • Software may include steps taken by a voice server, or those taken by an on-demand voice browser, or both.
  • the software may be stored in a suitable storage medium, such as a hard disk, a floppy, a memory etc.
  • the software may be sent as a signal along a wire, or wireless, or using a data network, e.g., the Internet.
  • the software may be made available for download and/or for remote usage on a server.
  • the following example gives a possible workflow.
  • Using a transcoding method a webpage is converted into one or more VoiceXML scripts using HTML patterns. After conversion the VoiceXML scripts may be sent to a voice browser.
  • a conversion project may be defined which defines a translation from a web application including its page-to-page flow into a speech application.
  • An HTML pattern defines how a particular sequence of possible HTML keywords may be translated into a voice script, e.g. VoiceXML.
  • the conversion project may be used as in the following example.
  • An incoming call is detected, e.g., coming from a voice browser. Based on the calling number a voice server retrieves a URL and a conversion project. First a welcome interaction may take place with the user.
  • An HTML page is retrieved from the URL by the voice server using a web browser.
  • the page is analyzed by a voice transcoder and converted, e.g. partitioned, into a collection of HTML patterns.
  • the structure of the collection of HTML patterns is compared to predetermined project documents. If a matching document exists, the project document is used to further translate the HTML page. If a predetermined project document is not found, then a new project document is created.
  • the HTML patterns are converted to VoiceXML by a dialog manager.
  • the VoiceXML may then be executed. Possibly, the VoiceXML is executed on the voice server, and voice data is sent to the calling voice browser. Possibly, the VoiceXML itself is sent to, and interpreted and executed by, the voice browser.
  • the part of the HTML page corresponding to a pattern that is executed as VoiceXML is also shown on a display using contrasting means to make it stand out visually, e.g., by using a contrasting color.
  • This color may appear on a display of the voice browser.
  • the color indicates to the user when a spoken reply should be given.
  • a voice interaction takes place.
  • the speech may be recognized or further text may be played using speech synthesis.
  • Based on a recognized reply of the user, the dialog manager sends the result to the voice transcoder.
  • the transcoder translates the recognized reply into a reply appropriate for the web page. For example, the reply is translated into filling in an input field, clicking a button, etc.
  • FIG. 5 is a block diagram illustrating a further information distributing system 500.
  • Information distributing system 500 comprises a web browser 110 connected via a data network 120 to a web server 130.
  • the data network 120 may use the TCP/IP protocol.
  • the data network 120 is preferably the Internet.
  • Web browser 110 is configured to obtain a web page (not shown), e.g. an HTML page from web server 130, and display it on a displaying device to a user of web browser 110.
  • Web browser 110 preferably runs on a personal computer, although any other device configured for displaying web pages and connecting to a web server is also possible.
  • the web page comprises a 'talk to me' button.
  • the web page is configured to upload a connecting program to web browser 110 for connecting to voice server 135.
  • the connecting program may be an applet, for example, a SIP User Agent applet.
  • the connecting program is configured to connect web browser 110 to a voice server 135.
  • the voice server 135 may be reached using the same data network 120.
  • the voice server 135 may be integrated with web server 130, but it may also be operated as a separate server.
  • a speech application is installed which can send an emulation, e.g., a transcoding, of the web page to web browser 110, possibly through the connecting program.
  • the exact appearance of the emulation may depend on choices made by a web designer of the web page. For example, if the emulation is made using HTML patterns, the web designer can select which parts of HTML must be translated in parts of the emulated page.
  • Voice server 135 may be connected to web server 130 to receive the web page. In this situation the emulation page may be created on demand.
  • a user of web browser 110 may press the button.
  • Upon pressing the button, the web browser 110, preferably under control of the web page, loads the connecting program.
  • the connecting program connects to voice server 135.
  • On voice server 135 the speech application is started, or may already be running.
  • Voice server 135 sends the emulation to web browser 110.
  • the emulation page may be described using, preferably synthesized, speech in a very complete manner. In this way the need to look at the web page is reduced to a minimum, if the need arises at all. This option is especially useful for users who, for whatever reason, are restricted in using visual input.
  • the user may be operating machinery, driving a car etc.
  • a user who is prompted for input can use his voice to give this input. It is preferred if the connecting program records the voice of the user and sends the resulting voice data to voice server 135.
  • Voice server 135 receives the input and recognizes the speech. Voice server 135 may send the parsed speech directly to web server 130, using a connection between them. However, it is preferred for voice server 135 to send the parsed voice data to web browser 110. Again there are several possibilities. For example, voice server 135 may create a version of the web page wherein input fields are filled in based upon the voice data. On the other hand, voice server 135 may also send the parsed voice data to web browser 110 where a filled-in version of the web page may be created, e.g., by the connection program.
  • the web browser 110 may show the parsed data in a visual representation, for example as text, e.g., in the form of a filled in web page form.
  • the web browser 110 may also read the parsed data back to the user.
  • the speech synthesis may preferably be done at voice server 135, and sent in the form of audio data to web browser 110, e.g., using the MP3 audio file format.
  • the speech synthesis may also be done at web browser 110, using a local speech synthesiser.
  • connection with voice server 135 is preferably initiated from web browser 110.
  • the invention also extends to computer programs, particularly computer programs on or in a carrier, adapted for putting the invention into practice.
  • the program may be in the form of source code, object code, a code intermediate source and object code such as partially compiled form, or in any other form suitable for use in the implementation of the method according to the invention.
  • a program may have many different architectural designs.
  • a program code implementing the functionality of the method or system according to the invention may be subdivided into one or more subroutines. Many different ways to distribute the functionality among these subroutines will be apparent to the skilled person.
  • the subroutines may be stored together in one executable file to form a self-contained program.
  • Such an executable file may comprise computer executable instructions, for example, processor instructions and/or interpreter instructions (e.g. Java interpreter instructions).
  • one or more or all of the subroutines may be stored in at least one external library file and linked with a main program either statically or dynamically, e.g. at run-time.
  • the main program contains at least one call to at least one of the subroutines.
  • the subroutines may comprise function calls to each other.
  • An embodiment relating to a computer program product comprises computer executable instructions corresponding to each of the processing steps of at least one of the methods set forth. These instructions may be subdivided into subroutines and/or be stored in one or more files that may be linked statically or dynamically.
  • Another embodiment relating to a computer program product comprises computer executable instructions corresponding to each of the means of at least one of the systems and/or products set forth. These instructions may be subdivided into subroutines and/or be stored in one or more files that may be linked statically or dynamically.
  • the carrier of a computer program may be any entity or device capable of carrying the program.
  • the carrier may include a storage medium, such as a ROM, for example a CD ROM or a semiconductor ROM, or a magnetic recording medium, for example a floppy disc or hard disk.
  • the carrier may be a transmissible carrier such as an electrical or optical signal, which may be conveyed via electrical or optical cable or by radio or other means.
  • the carrier may be constituted by such cable or other device or means.
  • the carrier may be an integrated circuit in which the program is embedded, the integrated circuit being adapted for performing, or for use in the performance of, the relevant method.

Abstract

An information distributing system (100) comprising a voice browser (115) and a voice server (135), the voice browser comprising a first sending device for sending voice data over a first voice data connection (125) to the voice server, the voice server comprising a first voice data receiving device for receiving the voice data, a speech recognition device for parsing the voice data, a representation data creation device for creating representation data representing the parsed voice data, and a second sending device for sending the representation data over a second data connection to the voice browser, wherein the representation data comprises visual data representing the parsed voice data, and the voice browser comprises a second visual data receiving device for receiving the visual data and a display for displaying the visual data.

Description

INFORMATION DISTRIBUTING SYSTEM WITH FEEDBACK MECHANISM FIELD OF THE INVENTION The invention relates to an information distributing system comprising a voice browser and a voice server, the voice browser comprising a first sending device for sending voice data over a first voice data connection to the voice server, the voice server comprising a first voice data receiving device for receiving the voice data, a speech recognition device for parsing the voice data, a representation data creation device for creating representation data representing the parsed voice data, a second sending device for sending the representation data over a second data connection to the voice browser.
The invention also relates to a voice browser and a voice server for use in the information distributing system.
The invention also relates to a method of distributing information and corresponding computer program product.
BACKGROUND OF THE INVENTION
There are many situations in which interaction with a computer is desired, but wherein this is unfeasible. For example, no keyboard and/or display may be available, or, using a computer is considered cumbersome. It is known that in such circumstances it is advantageous to use interactive voice dialogues between a human and a computer. Such a dialogue may be performed over a regular telephone or handheld. A computer generates the appropriate phrases using a speech synthesis application, for example, by way of a speech synthesis library. The user may respond to the computer generated phrases in any suitable manner, but preferably with voice responses. The user's voice responses are interpreted by the computer using, e.g., speech recognition.
In the art, interactive voice dialogues between a human and a computer are created using dedicated scripting languages. Currently, VoiceXML, also known as VXML, is a popular choice. VXML is a W3C standard XML format in which the voice application can be developed. It allows voice applications to be developed and deployed in a way analogous to the way HTML may be used to develop visual applications. VXML documents are interpreted by a voice browser. A common architecture has a voice browser attached to a telephone network so that users can use a telephone to interact with the voice application. One version of VoiceXML has been standardized by the World Wide Web Consortium (W3C). The voice browser may interpret and execute a voice script, such as a VoiceXML script.
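As a hedged illustration only (not taken from the application itself), a minimal VoiceXML document of the kind such a voice browser interprets could prompt for one field and submit the recognized value; the URL and grammar file named here are assumptions:

    <?xml version="1.0" encoding="UTF-8"?>
    <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
      <form id="order">
        <field name="category">
          <prompt>Which product category are you interested in?</prompt>
          <!-- categories.grxml is a hypothetical SRGS grammar listing the allowed categories -->
          <grammar type="application/srgs+xml" src="categories.grxml"/>
          <filled>
            <prompt>You said <value expr="category"/>.</prompt>
            <!-- www.example.com is a placeholder for the application's own server -->
            <submit next="http://www.example.com/order" namelist="category"/>
          </filled>
        </field>
      </form>
    </vxml>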
Many VXML applications have been deployed. These applications include: order inquiry, package tracking, driving directions, emergency notification, wake-up, flight tracking, voice access to email, customer relationship management, prescription refilling, audio newsmagazines, voice dialing, real-estate information and national directory assistance applications, etc.
For example, in a vending application, the computer may generate phrases which indicate products which are for sale. The user can direct the generated phrases towards products of his interest by giving responses. The given responses are interpreted by the computer upon which it, following its scripts, generates new phrases. As noted, the side of the interactive voice dialogue which is generated by a computer typically uses speech synthesis. The user side may respond, e.g., using the numerical keyboard of his telephone, but preferably, he responds by voice as well. In the latter case, the computer also employs a speech recognition application. Voice browsing is also possible without speech synthesis by using prerecorded speech samples.
In document "Web Page Analysis for Voice Browsing" by authors Michael K. Brown, Stephen C. Glinski, and Brian C. Schmult a Voice User Interface (VUI) for voice controlled browsing of Web based content is disclosed. This document is included herein by reference, and referred to hereinafter as "Brown 1 ". HTML content can be automatically converted to produce a voice dialog interface to content not originally intended for voice access.
A user connects to the system via a telephone connection. The system establishes a connection to a Web site and loads the first Web page. The HTML is parsed, separating text from other media types, isolating URL from HTML anchors and isolating the associated anchor titles. A user is allowed to say any keyword phrase from the title to activate the associated link. At the same time, the Web document is described to the user.
Each pop-up menu item or radio button choice is spoken to the user who can repeat that phrase or a phrase subset to activate that choice.
When the voice application is deployed, the following problem may occur. When the interactive voice dialogue uses voice for both sides, it may happen that the user's voice is not recognized correctly. For example, a user may reply that he is interested in 'books', but the computer mistakenly identifies this as 'foods'. Consequently, the computer may continue to present products which belong to a category in which the user is not interested.
It is a problem if a user does not have a convenient opportunity to verify whether the computer has understood him correctly.
A different type of voice browsing application is disclosed in US patent 6,335,928, included herein by reference. It discloses a facility for interfacing the Internet with a telecommunications network and vice versa so that a user who does not have access to the Internet may, nevertheless, provide a Web page and update the Web page via the telecommunications network.
The service circuits include a plurality of conventional Text-To-Speech (TTS) processor circuits which translate text into voice signals, a plurality of conventional voice circuits which record, digitize and play back speech and other audio frequency sounds, a plurality of conventional Automatic Speech Recognition (ASR) circuits which interpret speech signals received from a caller, and a plurality of conventional ISDN Basic Rate Interface circuits which provide an interface between the system platform and the Public Switched Telephone Network (PSTN).
When the service receives a caller's web page update via a voice telephone call, the update is converted to a text update by an ASR. Next, the service uses a TTS to convert the text update to speech and read it back to the caller. A disadvantage of this method is that it takes a relatively long time for the user to verify his responses. Approximately the same amount of time is used for verification as for receiving the user's input itself. As a result, this type of verification makes voice applications sluggish.
It is a problem of the prior art that user verification of speech-to-text conversions is slow.
SUMMARY OF THE INVENTION
It is an object of the invention to avoid or mitigate the disadvantages set out above.
This and other objects are achieved by the information distributing system according to the invention. An information distributing system comprises a voice browser and a voice server. The voice browser comprises a first sending device for sending voice data over a first voice data connection to the voice server. The voice server comprises a first voice data receiving device for receiving the voice data, a speech recognition device for parsing the voice data, a representation data creation device for creating representation data representing the parsed voice data and a second sending device for sending the representation data over a second data connection to the voice browser. The representation data comprises visual data representing the parsed voice data. The voice browser comprises a second visual data receiving device for receiving the visual data and a display for displaying the visual data.
A user can use a voice browser to control browsing through a voice server using his voice to guide the browsing. The server that receives this voice data must often first interpret the voice data before it can act on it. Using a speech recognition device the voice data may be parsed into data which is more readily readable by a machine. However, this parsing process is prone to error. Especially if the voice data comprises commands, or comprises data upon which actions are to be based, this can have undesirable consequences, ranging from wrong entries in databases to misinterpreted orders, etc. To avoid this, the user of the voice browser may verify the parsed data for errors. The representation data creation device may create data which represents the voice data, at least partially and at least in so far as is relevant to the current application. A user can review this representation, and take corrective actions, e.g., cancel an order, etc., if the representation does not match the instruction he gave using the voice data. Since the representation data comprises visual data which represents the parsed voice data, it can be displayed on a display device. A visual representation has the advantage over other types of data representation, e.g., audio representations, that it can be reviewed very quickly. For less complicated data the voice browser's user can see at a glance if something is amiss. A further advantage is that visual information can be reviewed more carefully, since each element of the representation can be carefully examined. Yet a further advantage is that the time a user needs for the review is easily selectable by the user. Some users may be comfortable in quickly reviewing the representations; other users may want to review at a slower pace. A user may even take more time for some parts and less time for other parts of the representation. The speed with which a user can review visual data need not be constant but can depend on, e.g., the complexity of the part of the representation which is currently under review. In a sense, the representation data allows the user to give feedback on the parsing performed by the voice server.
Furthermore, it is noted that voice applications may be used on mobile phones which have the option of using a display, albeit a small one, and a keyboard, albeit a cumbersome one. To speed up the verification by the user of his voice input, the voice server may, instead of sending a speech synthesized version of how the computer understood the voice input, send a visual (e.g. textual) representation to the mobile phone, for display on the mobile phone's display. The feedback may also be other visual feedback, such as a selected radio button. Moreover, the visual feedback may be compiled from multiple user responses.
A voice browser is a device capable of using voice to browse via a speech application. The voice browser can be any device which is capable of receiving and sending voice data to a voice server. Examples include a phone, in particular a mobile phone, a computer configured for VOIP connections, etc. The voice server may comprise a computer server. The first sending device for sending voice data may include a telephone connection, an antenna configured for e.g. GSM, etc. The first voice data connection may be analog or digital. It may be a telephone connection or a data network connection. The representation data representing the parsed voice data may comprise HTML, text, etc. The representation data may also be graphical. For example, an HTML page may be shown in which certain radio buttons are provided with an indication that that radio button is pressed. The second data connection may be any connection type suitable for transmission of data. Examples include SMS, Internet, etc. The display for displaying the visual data may be a computer monitor, a television, a mobile phone display, etc. In an embodiment, creating representation data comprises entering the parsed data in an input field of a form.
Voice applications may be used to enable users to enter data into data structures using a voice browser. Filling in forms is one way to do this. A form may be an HTML page with a data entry element, e.g., an <input> field. Giving the user at least part of the form that he is filling in using the voice application puts his parsed voice data into a context. This makes it easier for the user to recognize if these entries make sense, and are correct. Data entry review may go faster. The review will also be more likely to be correct, as a user is less likely to confuse answers and questions. For example, if he can see the form in the visual representation, he will be able to see whether his name was entered in a 'place of birth' field by accident.
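To make this concrete, a sketch (assuming a simple one-field name form; the recognized value and element names are purely illustrative): the representation data could be the HTML form itself, returned with the recognizer's result entered in the input field.

    <!-- filled-in form sent back to the voice browser as visual feedback;
         the value attribute holds the speech recognizer's output -->
    <form action="/order" method="post">
      <p>Name: <input type="text" name="name" value="John Smith" /></p>
      <p>
        <input type="submit" name="B1" value="Submit" />
        <input type="reset" name="B2" value="Reset" />
      </p>
    </form>

The user can then check at a glance whether the name shown in the field matches what he said.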
The form may be in many suitable representations, e.g. VXML, HTML, etc.
In an embodiment, the voice server is connectable to a web server for receiving a representation of the form, and for sending a representation of the parsed data to the web server.
Sending the filled-in form to a web server allows the voice server to act as a conduit between a web server and the voice browser. The effect is that a user can fill-in forms on a web server without using a regular text oriented web browser, but using his voice browser instead. At the same time the user can review his answers better. Review of his answers may be done before the filled-in form is sent to the web server; this has the advantage that the web server is less likely to receive incorrect forms. It may also be done after the form has been sent to the web server; this has the advantage of speeding up the process. The latter method may be preferred especially if the data is of lesser importance and the user values quick responses from the voice server/web server system. The voice server may be connected to the web server through another device which acts as an intermediary.
In an embodiment, the voice server is connectable to a transcoder device for converting a webpage to a voice application. The voice server may be connected to the web server through the transcoder device. That is, the voice server may be connected to the transcoder device, and the transcoder device may be connected to the web server. Although VXML is a convenient language in which to develop voice applications, the fact remains that a voice application has to be built from scratch. Especially for large applications this is a significant investment. Moreover, often there already exists an HTML version of the application; consider, for example, the vending application mentioned in the background. The vendor typically first invests in an internet presence, that is, in a web-based application in which customers can interactively order his products. Only at some later point will the vendor decide to offer an automated voice based option for his customers. In this situation, the vendor needs to make a similar investment to develop his voice application, i.e., writing the appropriate VXML code, as he did to develop his website, i.e. writing the appropriate HTML code. Moreover, if later changes are necessary, he needs to update both his HTML and his VXML application. There always is the risk that the two applications are out of sync. It is considered a problem to bear development costs twice when both an HTML and a VXML application are needed.
The problem may be solved by using a transcoder which transcodes, i.e., translates, a web-based application into a voice-based application. In the following we focus on the transcoding of web pages which comprise an input field, that is, web pages which require some kind of interactive response from the user.
The transcoder is connected on the one hand to a web server, for receiving web pages, and on the other hand to a voice server, for outputting voice scripts. The voice server uses the voice scripts to synthesize phrases and to receive and recognize speech coming from a user.
The transcoder converts the web page to a voice application by means of a script conversion device, which converts a page written in a scripting language intended for visual display, e.g., HTML, to a scripting language intended for use by a speech synthesizer and/or speech recognizer, e.g., VXML. The script conversion device uses 'HTML patterns', which are a type of search string used to recognize a predetermined pattern of HTML code. For example, an HTML pattern may be written using regular expressions. It is preferred that the patterns are XML based. Each HTML pattern is associated with a VXML pattern which indicates how that particular HTML pattern is to be converted to VXML. Another way of processing HTML documents for use in a voice application is described in Brown 1, section 'PhoneBrowser Processing' and the section 'Document processing'.
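By way of illustration only, the following Python sketch shows one possible form of such a pattern store, pairing a regular-expression HTML pattern with an associated VXML conversion pattern. The concrete pattern, the VXML template and the function name convert_to_vxml are assumptions made for this example, not the patterns used by the actual script conversion device.

import re

HTML_PATTERNS = [
    # A labelled text input, e.g. <p>Name <input type="text" name="T1" ...>
    (re.compile(r'<p>\s*(?P<label>\w+)\s*<input[^>]*type="text"[^>]*name="(?P<name>\w+)"[^>]*>',
                re.IGNORECASE),
     '<field name="{name}">\n  <prompt>Please say your {label}</prompt>\n</field>'),
]

def convert_to_vxml(html: str) -> str:
    """Apply every HTML pattern and emit the associated VXML for each match."""
    fields = []
    for pattern, template in HTML_PATTERNS:
        for match in pattern.finditer(html):
            fields.append(template.format(**match.groupdict()))
    return '<vxml version="2.1">\n<form>\n' + '\n'.join(fields) + '\n</form>\n</vxml>'

print(convert_to_vxml('<p>Name <input type="text" name="T1" size="20"></p>'))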
The transcoder fills in the web page using the recognized, i.e. parsed, responses from the user. If the web page requires multiple responses, the transcoder may be configured to collect all user responses for that webpage.
Finally, the transcoder submits the webpage to a web server. In this way, it appears to the web server as if it is handling just one more ordinary interaction. In other words, it may be transparent to the web server that the user interacts with it through voice.
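The submission step might, for example, be implemented as in the following Python sketch, in which the collected, parsed responses are posted as an ordinary HTTP form submission. The URL and the field names are placeholders for this example.

from urllib.parse import urlencode
from urllib.request import urlopen

def submit_filled_form(action_url: str, responses: dict) -> bytes:
    """POST all collected user responses for the webpage in a single request,
    so that the web server sees a normal form submission."""
    data = urlencode(responses).encode("utf-8")
    with urlopen(action_url, data=data) as reply:
        return reply.read()

# e.g. submit_filled_form("http://www.some-post-action.com",
#                         {"T1": "John Smith", "B1": "Submit"})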
It is noted that the speech synthesis and speech recognition systems need not run on a user's device. This has the advantage that the user's voice browser needs less powerful computing resources. Moreover, having the speech synthesis and/or speech recognition at a voice server has the advantage that their quality can be higher, as a result of larger available resources, e.g., more computing power, larger memories, etc. Speech synthesis and speech recognition are known to be resource-intensive applications; moving them away from devices with relatively few computing resources is therefore an advantage.
Converting an existing HTML application to a VXML application has the advantage that the VXML application does not need to be built from scratch. This significantly reduces the investment that is necessary before the voice application is ready for use. Note that not only are individual HTML pages converted to VXML, but the process which is embedded in the web application (e.g., which page comes after which page) is converted as well.
In an embodiment, the first connection is different from the second connection.
This has the advantage that different data connection means may be used side by side, possibly simultaneously. Using different connections for the first and second connection has various advantages: the first voice data connection and the second data connection may each be adapted to their respective uses. For example, the first voice data connection may be a regular telephone connection, a connection to a mobile phone, a fixed phone, etc. The second data connection may use a connection adapted for data transmission, e.g., short message service (SMS), Ethernet, Internet, etc. On the other hand, both connections may also be the same. For example, both connections may be an internet connection, the voice data coming in, e.g., over VOIP, and the representation data using the HTTP protocol.
In an embodiment, the voice browser comprises an input device for selecting in the displayed visual data an incorrectly parsed data element of the parsed voice data.
Identifying which elements are incorrect has the advantage that the voice server may take corrective action. Since the voice server knows which elements are incorrect, the corrective action can depend thereon. For example, the voice server may ask the user to repeat only those data elements which are incorrect. The process of correcting parsing errors in the form is faster if only a subset of the data elements needs correcting.
The voice browser's user may use various input devices, other than his microphone, to select a response that was received incorrectly. Example input devices include a keyboard, a mouse pad, a touch-sensitive screen, etc. Furthermore, a webpage which comprises multiple responses can also be sent to the mobile phone in pieces, so that the pieces fit easily on the mobile phone's display.
Alternatively, identifying an incorrectly parsed data element may use voice commands. For example, the user may say a command which tells the voice server that the first element is incorrect.
In an embodiment, the visual data comprises textual data. A particularly efficient way to represent and communicate parsed voice data is to represent it as text, e.g., using ASCII symbols.
In an embodiment the voice browser is comprised in a mobile phone. A mobile phone is a convenient interface which allows one to browse on a voice server without the need to be at a fixed place.
A further aspect of the invention concerns a voice browser for use in any information distributing system according to the invention. A further aspect of the invention concerns a voice server for use in any information distributing system according to the invention.
A further aspect of the invention concerns an information distributing method. The method comprises sending voice data over a first voice data connection to the voice server, receiving the voice data, parsing the voice data, creating representation data representing the parsed voice data, and sending the representation data over a second data connection to the voice browser. The representation data comprises visual data representing the parsed voice data. The method further comprises receiving the visual data and displaying the visual data.
A further aspect of the invention concerns a method for receiving information on a voice browser comprising sending voice data over a first voice data connection to the voice server, receiving visual data and displaying the visual data, the visual data representing a parsing of the voice data.
A further aspect of the invention concerns a method for confirming information on a voice server comprising receiving voice data from a voice browser, parsing the voice data, creating representation data, sending the representation data over a data connection to the voice browser, the representation data comprising visual data representing the parsed voice data.
A method according to the invention may be implemented on a computer as a computer implemented method, or in dedicated hardware, or in a combination of both. Executable code for a method according to the invention may be stored on a computer program product. Examples of computer program products include memory devices, optical storage devices, integrated circuits, servers, online software, etc.
In a preferred embodiment, the computer program comprises computer program code means adapted to perform all the steps of a method according to the invention when the computer program is run on a computer. Preferably, the computer program is embodied on a computer readable medium.
BRIEF DESCRIPTION OF THE DRAWINGS
The invention is explained in further detail by way of example and with reference to the accompanying drawings, wherein:
Figure 1 is a block diagram illustrating an information distributing system,
Figure 2 is a schematic front view of a voice browser,
Figure 3 is a schematic view of a visual representation of parsed data,
Figure 4 is a flow chart representation of a method for information distribution,
Figure 5 is a block diagram illustrating a further information distributing system.
Throughout the Figures, similar or corresponding features are indicated by the same reference numerals.
List of Reference Numerals:
100, 500 information distributing system
110 web browser
115 voice browser
120 data network
125 voice network
130 web server
135 voice server
140 transcoder
160 back office system
170 data connection
200 mobile phone
210 display
220 input device
300 visual data
310 form element
315 filled-in input fields of a form
320, 325 radio button
410, 420, 430 method steps
DETAILED EMBODIMENTS
While this invention is susceptible of embodiment in many different forms, there is shown in the drawings and will herein be described in detail one or more specific embodiments, with the understanding that the present disclosure is to be considered as exemplary of the principles of the invention and not intended to limit the invention to the specific embodiments shown and described.
Figure 1 shows an information distributing system 100. Information distributing system 100 comprises a web server 130 which is currently connected through data network 120 to a web browser 110. Web server 130 is configured to serve a web site, e.g., some web application, to web browser 110. Web browser 110 may be any known web browser, e.g., Internet Explorer, Firefox, etc. The website comprises an information page (not shown) made in HTML. The information page may be sent by web server 130 to web browser 110 over network 120. Web server 130 is also connected to back office system 160.
It will be assumed that the information page comprises at least one input field, for example a radio button, an input box, etc. After the page is displayed, the user may enter a response using web browser 110. The response is received by web server 130 and eventually entered into back office system 160. Typically, back office system 160 comprises a business rules engine (BRE) of some sort and a database.
Information distributing system 100 further comprises a voice server 135 which is currently connected via voice network 125, i.e. a first voice data connection, to a voice browser 115. Voice server 135 is configured to generate speech using speech synthesis, and to receive responses using speech recognition software. Note that voice server 135 may also recognize dual-tone multi-frequency (DTMF) responses, such as may be generated by pressing numbers on a telephone's keypad. Voice network 125 may be a telephone network. Voice network 125 may also be a data network over which speech may be transferred in a suitable encoding. Web server 130 and voice server 135 are both connected to transcoder 140.
Figure 2 shows a front view of a voice browser, in this case a mobile phone 200. Mobile phone 200 is provided with a display 210 and an input device 220. The input device 220 may be a keyboard, a touchpad, etc. The keyboard may be numeric, alphanumeric, etc. Input device 220 may be integrated with display 210, e.g. if the display is touch sensitive. It is noted that the information distributing system 100 may omit certain features. For example, the invention may be used without web browser 110, data network 120, transcoder 140, back office system 160, or web server 130. For example, web server 130 may comprise a voice application written directly in VXML, and the representation data may be sent from web server 130 to voice browser 115.
Information pages without input fields are also possible. Even for an information page without an input field, voice data may be received, e.g., to receive navigation commands such as a request for the next page.
During operation, transcoder 140 receives from web server 130 an information page comprising an input field. The information page is to be filled out by the user. That is, the HTML page acts as a form. Transcoder 140 converts the information page to a VXML voice script, and forwards the voice script to voice server 135. Transcoder 140 comprises a script conversion device (not shown).
The script conversion device is configured to search inside the information pages for at least one pre-determined search string, possibly containing appropriate wild cards. The pattern selects the part of the information page which contains the input field. A conversion pattern converts the selected part to a script page which may be used by voice server 135.
For example, the HTML page may comprise the following HTML code:
<form method="POST" action="http://www.some-post-action.com">
<p>Name <input type="text" name="T1" size="20"></p>
<p><input type="submit" value="Submit" name="B1"><input type="reset" value="Reset" name="B2"></p>
</form>
This snippet of HTML code may be recognized by an HTML pattern. The HTML pattern may, for example, be configured to look for the words 'form' and 'input'. When a pattern matches against the HTML code it can extract elements which may be speech synthesized. In this HTML example, the pattern may extract the phrases 'Name', 'Submit' and 'Reset'. Voice server 135 uses a speech synthesis application to generate voice, which is then sent over network 125 to a voice browser 115. The voice browser may be a mobile telephone. For example, the speech synthesis application may generate sound files corresponding to the phrases identified above. Note that additional phrases may be added as required. For example, the application may add a phrase which instructs a user of the voice browser to speak an appropriate response to the phrase 'Name'.
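By way of illustration only, the following Python sketch shows how the extracted phrases might be turned into prompt sentences before speech synthesis; the wording of the added instructions is an assumption made for this example.

def build_prompts(extracted_phrases):
    """Turn phrases extracted from the HTML form into spoken prompts; the
    voice server would feed each prompt to its speech synthesizer and send
    the resulting audio over the first voice data connection 125."""
    prompts = []
    for phrase in extracted_phrases:
        if phrase == "Name":
            prompts.append("Please speak your name.")
        else:
            prompts.append(f"Say '{phrase}' to {phrase.lower()} the form.")
    return prompts

for prompt in build_prompts(["Name", "Submit", "Reset"]):
    print(prompt)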
A user of voice browser 115 can enter a voice response. The voice data is sent back to voice server 135 over first connection 125. A speech recognition application converts the voice response into a textual representation thereof. For example, the speech recognition application may translate a spoken name into a textual representation of a name. The speech recognition application may also convert a command to a code representing the same command. For example, referring to the HTML example above, a user may say 'reset' to indicate that he wishes to annul his previous commands. The speech recognition application may translate this into the code 'B2'. Other parts of system 100 can react in an appropriate manner to receiving a code, in this case code B2.
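A minimal Python sketch of such a command-to-code translation, reusing the button names of the HTML example above, could look as follows; the mapping itself is an assumption made for this example.

from typing import Optional

COMMAND_CODES = {
    "submit": "B1",
    "reset": "B2",
}

def command_to_code(recognized_text: str) -> Optional[str]:
    """Return the code for a recognized spoken command, or None if the
    utterance is ordinary field content rather than a command."""
    return COMMAND_CODES.get(recognized_text.strip().lower())

print(command_to_code("Reset"))  # -> 'B2'
print(command_to_code("John"))   # -> None, treated as field content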
A form in which the parsed voice data is filled in may be created as a visual representation of the parsed data. Figure 3 illustrates a possible visual representation 300. Shown is a form element 310, which in this case could contain the phrase 'Name'. After field 310, the parsed version of (part of) the voice data is shown. In this case, field 315 may contain the speech recognition's attempt at recognizing the name of the user. Finally, two radio buttons 320, 325 are shown, which may be labeled 'Submit' and 'Reset' respectively.
This filled-in form may be sent to voice browser 115 as a graphical picture, e.g., a JPG picture, but it is more efficient to use a textual representation. It is also possible to combine text and graphics in a single representation; for example, the representation may use HTML. The textual representation is sent via a connection 170, i.e., a second data connection, from transcoder 140 to the voice browser 115 for display on a display device. Connection 170 is typically a data connection such as the internet. Alternatively, the representation data may be sent from voice server 135 instead of from transcoder 140. Note that transcoder 140 may be integrated in voice server 135. The radio buttons 320 and 325 may be labeled to indicate whether they have been selected or can still be selected. For example, button 320 may be colored, e.g., red, to indicate that the submit button was selected. The buttons may be flashing to indicate that no selection has been made yet.
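As an illustration, the following Python sketch builds such a combined text/HTML representation in which the recognized name is filled in and the selected radio button is highlighted; the field names, layout and colour are assumptions made for this example.

def render_representation(name_value: str, selected: str) -> str:
    """Return an HTML page showing the parsed name and two radio buttons,
    with the selected button marked so the user can see the current state."""
    def radio(label: str) -> str:
        mark = ' checked style="background:red"' if label == selected else ""
        return f'<input type="radio" name="action"{mark}> {label}'
    return (
        "<html><body>"
        f'<p>Name <input type="text" name="T1" value="{name_value}"></p>'
        f"<p>{radio('Submit')} {radio('Reset')}</p>"
        "</body></html>"
    )

print(render_representation("John Smith", selected="Submit"))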
After the user has seen visual data 300, he may notice that his name is spelled incorrectly. He may use input device 220 to select the field which was incorrect.
Mobile phone 200 can forward this information to voice server 135, possibly reusing data connection 170. Voice server 135 may use this information to take corrective action. For example, it may request the voice data corresponding to the incorrect field again. It may also initiate a process in which the offending field is entered again, but using a different input medium. For example, the user may use input device 220, e.g., a keyboard, to enter the problematic field again manually. In this way a benefit is still achieved: most of the fields can be entered using voice, and only the occasional field with errors must be typed. In this way typing is avoided for most of the input fields.
The displayed visual representation of the parsed voice data may comprise a plurality of displayed data elements. The input device may be used for selecting, in the displayed visual data, an incorrectly parsed data element of the parsed voice data out of the plurality of displayed data elements of the parsed voice data.
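A small Python sketch of such targeted corrective action on the voice server side is given below; the message wording and the field name are assumptions made for this example.

def handle_correction(parsed_fields: dict, incorrect_field: str) -> str:
    """Discard only the rejected value and return the re-prompt for that
    field, instead of restarting the whole form."""
    parsed_fields.pop(incorrect_field, None)
    return f"Sorry, please repeat the value for {incorrect_field}."

fields = {"T1": "Jon Smyth", "B1": "Submit"}   # the name was misrecognized
print(handle_correction(fields, "T1"))         # only field T1 is asked again
print(fields)                                  # {'B1': 'Submit'} is kept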
Figure 4 shows a schematic flow chart of a method for information distribution 400. Three parts of the method are represented in the flow chart, indicated with reference numerals 410, 420 and 430. Step 410 comprises sending voice data over a first voice data connection to the voice server. Step 420 comprises receiving the voice data, parsing the voice data, creating representation data representing the parsed voice data, and sending the representation data over a second data connection to the voice browser. The representation data comprises visual data representing the parsed voice data. Step 430 comprises receiving the visual data and displaying the visual data.
A method according to the invention may be executed using software, which comprises instructions for causing a processor system to perform method 400.
Software may include steps taken by a voice server, or those taken by an on-demand voice browser, or both. The software may be stored in a suitable storage medium, such as a hard disk, a floppy, a memory, etc. The software may be sent as a signal along a wire, or wirelessly, or using a data network, e.g., the Internet. The software may be made available for download and/or for remote usage on a server. The following example gives a possible workflow. Using a transcoding method, a webpage is converted into one or more VoiceXML scripts using HTML patterns. After conversion, the VoiceXML scripts may be sent to a voice browser. Accordingly, a conversion project may be defined which defines a translation from a web application, including its page-to-page flow, into a speech application. An HTML pattern defines how a particular sequence of possible HTML keywords may be translated into a voice script, e.g., VoiceXML. The conversion project may be used as in the following example.
An incoming call is detected, e.g., coming from a voice browser. Based on the calling number a voice server retrieves a URL and a conversion project. First, a welcome interaction may take place with the user. An HTML page is retrieved from the URL by the voice server using a web browser. The page is analyzed by a voice transcoder and converted, e.g. partitioned, into a collection of HTML patterns. The structure of the collection of HTML patterns is compared to predetermined project documents. If a matching document exists, the project document is used to further translate the HTML page. If a predetermined project document is not found, then a new project document is created. The HTML patterns are converted to VoiceXML by a dialog manager. The VoiceXML may then be executed. Possibly, the VoiceXML is executed on the voice server, and voice data is sent to the calling voice browser. Possibly, the VoiceXML itself is sent to and interpreted and executed by the voice browser.
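The call flow described above may be summarized in the following Python sketch. Every helper function is a stub standing in for a component of the real system (voice server, web browser, transcoder, dialog manager), and the URL, page content and calling number are placeholders, so the sketch only illustrates the control flow.

def handle_incoming_call(calling_number: str) -> None:
    url, project = lookup_project(calling_number)  # URL and conversion project per caller
    play_welcome()                                 # welcome interaction
    html = fetch_page(url)                         # voice server retrieves the HTML page
    patterns = partition_into_patterns(html)       # transcoder partitions the page
    document = (find_project_document(project, patterns)
                or create_project_document(project, patterns))
    voicexml = to_voicexml(document, patterns)     # dialog manager emits VoiceXML
    execute_voicexml(voicexml)                     # run on the server, or send to the browser

# Stub implementations so that the sketch is self-contained:
def lookup_project(number): return ("http://example.invalid/form", "demo-project")
def play_welcome(): print("Welcome.")
def fetch_page(url): return '<p>Name <input type="text" name="T1"></p>'
def partition_into_patterns(html): return [html]
def find_project_document(project, patterns): return None
def create_project_document(project, patterns): return {"project": project}
def to_voicexml(document, patterns): return '<vxml version="2.1"><form/></vxml>'
def execute_voicexml(vxml): print("Executing:", vxml)

handle_incoming_call("+00 0000000")  # placeholder calling number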
The part of the HTML page corresponding to a pattern that is executed as VoiceXML is also shown on a display using contrasting means to make it stand out visually, e.g., by using a contrasting color. This color may appear on a display of the voice browser. Preferably, the color indicates to the user when he should give a spoken reply. After the speech reply of the user, a voice interaction takes place. For example, the speech may be recognized or further text may be played using speech synthesis.
Based on a recognized reply of the user, the dialog manager sends the result to the voice transcoder. The transcoder translates the recognized reply into a reply appropriate for the web page. For example, the reply is translated into filling in an input field, clicking a button, etc.
The filled-in page is then handled by the web browser and sent to a web server. A reply of the web browser is sent to the display of the user's device.
Figure 5 is a block diagram illustrating a further information distributing system 500. Information distributing system 500 comprises a web browser 110 connected via a data network 120 to a web server 130. The data network 120 may use the TCP/IP protocol. The data network 120 is preferably the Internet.
Web browser 110 is configured to obtain a web page (not shown), e.g. an HTML page, from web server 130 and display it on a displaying device to a user of web browser 110. Web browser 110 preferably runs on a personal computer, although any other device configured for displaying web pages and connecting to a web server is also possible.
The web page comprises a 'talk to me' button. The web page is configured to upload a connecting program to web browser 110 for connecting to voice server 135. The connecting program may be an applet, for example a SIP User Agent applet. The connecting program is configured to connect web browser 110 to a voice server 135. The voice server 135 may be reached using the same data network 120. The voice server 135 may be integrated with web server 130, but it may also be operated as a separate server.
On voice server 135 a speech application is installed which can send an emulation, e.g., a transcoding, of the web page to web browser 110, possibly through the connecting program. The exact appearance of the emulation may depend on choices made by a web designer of the web page. For example, if the emulation is made using HTML patterns, the web designer can select which parts of the HTML must be translated into parts of the emulated page.
Voice server 135 may be connected to web server 130 to receive the web page. In this situation the emulation page may be created on demand.
During operational use, a user of web browser 110 may press the button. Upon pressing the button, the web browser 110, preferably under control of the web page, loads the connecting program. The connecting program connects to voice server 135. On voice server 135 the speech application is started, or may already be running. Voice server 135 sends the emulation to web browser 110. There are many possibilities for the emulation page. For example, the emulation page may be described using, preferably synthesized, speech in a very complete manner. In this way the need to look at the web page is reduced to a minimum, if the need arises at all. This option is especially useful for users who, for whatever reason, are restricted in using visual input. For example, the user may be operating machinery, driving a car, etc. This option is also especially useful for users who have trouble reading a display, for example users who are astigmatic. In this way visual interaction with the web page is minimized, which is an advantage if such interaction is burdensome. Another possibility is for the emulation page to be described only minimally. A user of such a page would need to use the visual presentation of the web page to take in its contents. For example, the page could be described minimally by only referring to its title, or by not describing it at all. In both situations, the emulation page would prompt the user for input for at least one input field on the page. In this way interaction with input devices, such as a keyboard, is reduced. The user can supply this input using his voice, recorded by a microphone connected to web browser 110 (not shown). Between the types of emulation page described here, there are intermediate possibilities.
A user who is prompted for input can use his voice to give this input. It is preferred that the connection program records the voice of the user and sends the resulting voice data to voice server 135. Voice server 135 receives the input and recognizes the speech. Voice server 135 may send the parsed speech directly to web server 130, using a connection between them. However, it is preferred for voice server 135 to send the parsed voice data to web browser 110. Again there are several possibilities. For example, voice server 135 may create a version of the web page wherein input fields are filled in based upon the voice data. On the other hand, voice server 135 may also send the parsed voice data to web browser 110, where a filled-in version of the web page may be created, e.g., by the connection program.
The web browser 110 may show the parsed data in a visual representation, for example as text, e.g., in the form of a filled-in web page form. The web browser 110 may also read the parsed data back to the user. The speech synthesis is preferably done at voice server 135 and sent in the form of audio data to web browser 110, e.g., using the MP3 audio file format. The speech synthesis may also be done at web browser 110, using a local speech synthesizer.
The connection with voice server 135 is preferably initiated from web browser 110.
It will be appreciated that the invention also extends to computer programs, particularly computer programs on or in a carrier, adapted for putting the invention into practice. The program may be in the form of source code, object code, a code intermediate source and object code such as partially compiled form, or in any other form suitable for use in the implementation of the method according to the invention. It will also be appreciated that such a program may have many different architectural designs. For example, a program code implementing the functionality of the method or system according to the invention may be subdivided into one or more subroutines. Many different ways to distribute the functionality among these subroutines will be apparent to the skilled person. The subroutines may be stored together in one executable file to form a self-contained program. Such an executable file may comprise computer executable instructions, for example, processor instructions and/or interpreter instructions (e.g. Java interpreter instructions). Alternatively, one or more or all of the subroutines may be stored in at least one external library file and linked with a main program either statically or dynamically, e.g. at run-time. The main program contains at least one call to at least one of the subroutines. Also, the subroutines may comprise function calls to each other. An embodiment relating to a computer program product comprises computer executable instructions corresponding to each of the processing steps of at least one of the methods set forth. These instructions may be subdivided into subroutines and/or be stored in one or more files that may be linked statically or dynamically. Another embodiment relating to a computer program product comprises computer executable instructions corresponding to each of the means of at least one of the systems and/or products set forth. These instructions may be subdivided into subroutines and/or be stored in one or more files that may be linked statically or dynamically.
The carrier of a computer program may be any entity or device capable of carrying the program. For example, the carrier may include a storage medium, such as a ROM, for example a CD ROM or a semiconductor ROM, or a magnetic recording medium, for example a floppy disc or hard disk. Furthermore, the carrier may be a transmissible carrier such as an electrical or optical signal, which may be conveyed via electrical or optical cable or by radio or other means. When the program is embodied in such a signal, the carrier may be constituted by such cable or other device or means.
Alternatively, the carrier may be an integrated circuit in which the program is embedded, the integrated circuit being adapted for performing, or for use in the performance of, the relevant method.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. Use of the verb "comprise" and its conjugations does not exclude the presence of elements or steps other than those stated in a claim. The article "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the device claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Claims

1. An information distributing system (100) comprising a voice browser (115) and a voice server (135),
the voice browser comprising a first sending device for sending voice data over a first voice data connection (125) to the voice server,
the voice server comprising a first voice data receiving device for receiving the voice data, a speech recognition device for parsing the voice data, a representation data creation device for creating representation data representing the parsed voice data, and a second sending device for sending the representation data over a second data connection to the voice browser,
wherein
- the representation data comprises visual data representing the parsed voice data, and
the voice browser comprises a second visual data receiving device for receiving the visual data and a display for displaying the visual data.
2. An information distributing system as in any one of the preceding claims wherein creating representation data comprises entering the parsed data in an input field of a form.
3. An information distributing system as in Claim 2, wherein the voice server is connectable to a web server for receiving a representation of the form, and for sending a representation of the parsed data to the web server.
4. An information distributing system as in any one of Claims 2 and 3, wherein the voice server is connectable to a transcoder device for converting a webpage to a voice application.
5. An information distributing system as in any one of the preceding claims, wherein the first connection is different from the second connection.
6. An information distributing system as in any one of the preceding claims, wherein the voice browser comprises an input device for selecting in the displayed visual data an incorrectly parsed data element of the parsed voice data.
7. An information distributing system as in any one of the preceding claims, wherein the visual data comprises textual data.
8. An information distributing system as in any one of the preceding claims, wherein the voice browser is comprised in a mobile phone.
9. A voice browser for use in any one of the preceding claims.
10. A voice server for use in any one of the preceding claims.
11. An information distributing method comprising
sending voice data over a first voice data connection to the voice server, receiving the voice data, parsing the voice data, creating representation data representing the parsed voice data, sending the representation data over a second data connection to the voice browser, the representation data comprising visual data representing the parsed voice data,
receiving the visual data and displaying the visual data.
12. A method for receiving information on a voice browser comprising
sending voice data over a first voice data connection to the voice server, receiving visual data and displaying the visual data, the visual data
representing a parsing of the voice data.
13. A method for confirming information on a voice server comprising
receiving voice data from a voice browser, parsing the voice data, creating representation data, sending the representation data over a data connection to the voice browser, the representation data comprising visual data representing the parsed voice data.
14. A computer program comprising computer program code means adapted to perform all the steps of any one of claims 11, 12 and 13 when the computer program is run on a computer.
15. A computer program as claimed in claim 14 embodied on a computer readable medium.
PCT/EP2010/059874 2009-07-10 2010-07-09 Information distributing system with feedback mechanism WO2011004000A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP09165139 2009-07-10
EP09165139.8 2009-07-10

Publications (2)

Publication Number Publication Date
WO2011004000A2 true WO2011004000A2 (en) 2011-01-13
WO2011004000A3 WO2011004000A3 (en) 2011-03-10

Family

ID=43413348

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2010/059874 WO2011004000A2 (en) 2009-07-10 2010-07-09 Information distributing system with feedback mechanism

Country Status (1)

Country Link
WO (1) WO2011004000A2 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6920425B1 (en) * 2000-05-16 2005-07-19 Nortel Networks Limited Visual interactive response system and method translated from interactive voice response for telephone utility
US7158779B2 (en) * 2003-11-11 2007-01-02 Microsoft Corporation Sequential multimodal input
US7739117B2 (en) * 2004-09-20 2010-06-15 International Business Machines Corporation Method and system for voice-enabled autofill
US8370160B2 (en) * 2007-12-31 2013-02-05 Motorola Mobility Llc Methods and apparatus for implementing distributed multi-modal applications

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6335928B1 (en) 1997-06-06 2002-01-01 Lucent Technologies, Inc. Method and apparatus for accessing and interacting an internet web page using a telecommunications device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10372804B2 (en) 2016-05-17 2019-08-06 Bruce HASSEL Interactive audio validation/assistance system and methodologies
EP3246828A1 (en) * 2016-05-19 2017-11-22 Palo Alto Research Center Incorporated Natural language web browser
JP2017208086A (en) * 2016-05-19 2017-11-24 パロ アルト リサーチ センター インコーポレイテッド Natural language web browser
US11599709B2 (en) 2016-05-19 2023-03-07 Palo Alto Research Center Incorporated Natural language web browser

Also Published As

Publication number Publication date
WO2011004000A3 (en) 2011-03-10


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10737306

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC DATED 26.04.12

122 Ep: pct application non-entry in european phase

Ref document number: 10737306

Country of ref document: EP

Kind code of ref document: A2