AN APPARATUS AND METHOD FOR DETERMINING
EMOTIONAL AND CONCEPTUAL CONTEXT
FROM A USER INPUT
BACKGROUND
(1) Field of the Invention
The invention relates to computer/human interfaces. More specifically, the invention relates to increasing the ability of the computer to respond in a humanlike manner to an indeterminate input stream.
(2) Background
Improving the computer human interface has long been a goal, and software designers have tried making computers more user-friendly. Great strides have been made with the invention of graphical user interfaces, which obscure from the user most of the underlying complexity of carrying out a desired function. The ability to "point and click" and "drag and drop" vastly facilitated the acceptance of the computer as a mainstay in modern society. But computers remain intimidating to a wide variety of potential users, and almost no one would characterize a computer session as resembling a typical human interaction. In customer service situations, consumers resent being funneled through a computer attendant before reaching a "real person." Even the proliferation of speech recognition software that has permitted computers to take oral instruction to perform certain predefined tasks has not significantly resolved these issues. Such prior art speech recognition typically employs grammar parsers to discern what command has been issued, coupled with pattern matching to match the voice command to those existing in the set of possible commands.
Notwithstanding these strides in human computer interaction, computers are largely regarded as machines, like a car or a washing machine. However, unlike a car, computers can emulate certain human functions, but their ability to emulate those functions is constrained by the command set provided.
Moreover, computers have no humanesque embodiment identified with them. The general problem remains how to improve the ease of use in human computer interactions and make the user experience more enjoyable.
BRIEF SUMMARY OF THE INVENTION
A method and system of generating a contextually correct output for an input is disclosed. An input concept and an indication of an input emotional state are received. A state machine adjusts an output emotional state in response to the input concept and input emotional state. An output is generated factoring in the output emotional state and the input concept.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 is a block diagram of a system of one embodiment of the invention.
Figure 2 is a flow chart of an operation of a natural language filter of one embodiment of the invention.
Figure 3 is a flow chart of operation of the emotional scoring filter of one embodiment of the invention.
Figure 4 is a flow chart of operation of the context filter in one embodiment of the invention.
Figure 5 is a flow chart of the operation of a continuous emotional state machine of one embodiment of the invention.
Figure 6 is a flow chart of discrete state emotion suppression in one embodiment of the invention.
Figure 7 is a flow chart of output concept dialogue generation of one embodiment of the invention.
Figure 8 is a flow chart of operation of the vocal expression module in one embodiment of the invention.
Figure 9 is a flow chart of output generation in the facial animation
module of one embodiment of the invention.
Figure 10 is a flow chart of the body animation module of one embodiment of the invention.
DETAILED DESCRIPTION
The invention generally provides a new paradigm for human computer interaction by using as much information as can be derived about the context of the interaction and appropriately defining emotional states. The system uses a data driven algorithm to provide an animated embodiment to the interacting computer, as well as natural language response capability. Some embodiments may provide for only one of audio and graphical output. In any case, providing the computer with appropriate understanding of and response to context improves the overall user experience.
Some portions of the detailed descriptions which follow are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussions, it should be appreciated that throughout the present invention, discussions utilizing terms such as "processing" or "computing" or "calculating" or "determining" or "displaying"
or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices, including robotic devices, such as robots and dolls.
The present invention also relates to apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions, and each coupled to a computer system bus. The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general purpose machines may be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the required method steps. The required structure for a variety of these machines will appear from the description below. In addition, the present invention is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the invention as described herein.
Figure 1 is a block diagram of a system of one embodiment of the invention. In this embodiment of the invention, a context filter 10 accepts certain inputs and controls the generation of emotionally consistent audio and graphical output. A user interface 16 receives an input stream. The input stream may include typed input, spoken input, galvanic skin response or other biofeedback, video or still capture of user expressions, or any other information useful in derivation of context. Depending on the types of input accepted by the
user interface 16, the user interface 16 may contain a voice recognition module 56, a voice stress analyzer 54, and an expression identifier 52. Other modules for the user interface 16 are within the scope and contemplation of the invention. The form of the input stream is also likely to vary between different applications of the invention. For example, a phone center computer may receive only audio and touch tone input (and only provide audio output).
The language portion of the input stream, be it spoken or typed, is passed to a natural language filter 18. Natural language filter 18 derives an input concept from the natural language input. As used herein "concept" is abbreviated data reflecting a semantic meaning of the language input stream. In some embodiments, the natural language filter also receives biometric data to assist in the identification of, e.g., irony and sarcasm. The natural language filter 18 may access an idiom database and rule set 24 in the course of matching the natural language input with an input concept. It is well understood in the field of natural language that grammatical rules alone may, in many cases, fail to identify the semantic meaning of a language string. Thus, the use of an idiomatic database to identify input concepts significantly improves the probability that the natural language filter 18 will generate a semantically correct input concept. The natural language filter 18 forwards the input concept to the context filter 10 for later use.
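The idiom-first lookup performed by the natural language filter 18 might be sketched as follows; the idiom entries, stop-word list, and concept names are illustrative assumptions, not the actual contents of the idiom database and rule set 24.

```python
# Hypothetical sketch of the natural language filter's idiom-first concept
# lookup; the idioms and concept names below are invented for illustration.
IDIOM_DATABASE = {
    "what's up": "GreetingQuery",
    "piece of cake": "EasyAssessment",
}

STOP_WORDS = {"is", "a", "an", "the"}

def derive_input_concept(natural_language_input):
    """Prefer an idiom match over a literal keyword-derived concept, since
    grammatical rules alone would misread idiomatic phrases."""
    normalized = natural_language_input.lower().strip("?!. ")
    if normalized in IDIOM_DATABASE:
        return IDIOM_DATABASE[normalized]
    # Fallback: build a concept token from the significant words.
    words = [w.capitalize() for w in normalized.split() if w not in STOP_WORDS]
    return "".join(words)
```

Under these assumptions, "piece of cake" maps through the idiom table rather than being read literally, while an unlisted input such as "what is context?" falls through to the keyword-derived concept "WhatContext".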
Concurrently, the natural language input and other emotional indicators derived from the input stream are forwarded to an emotional scoring filter 14. In one embodiment, the emotional scoring filter is modeled after the work of Louis Gottschalk, who developed a method of psychological analysis known as the "Gottschalk-Gleser Method of Content Analysis." This system uses grammatical clauses as a unit of communication from which to derive the psychological message being conveyed by the speaker. More than three decades of empirical data backs the accuracy of the Gottschalk Method, which scores numerous parameters, such as: hostility, social alienation, cognitive and intellectual impairment, depression, hope, and anxiety.
While the Gottschalk parameters work well in a strict pathological context, it is desirable to avoid their psychological bias and generate more basic emotions for use in the preferred embodiment. Thus, the emotional scoring filter 14, in one embodiment of the invention, will score for common emotional states, such as fear, anger, joy, sadness, disgust, and disdain, rather than the wide range of pathologies typical of a traditional Gottschalk filter. The score will be representative of the user's mood and permit the context filter 10 to adjust the emotional state of the system responsive to the user's emotional state. Emotional scoring filter 14 may also use other emotional indicators, such as galvanic skin response or other biofeedback mechanisms, voice stress analysis, and expression identification as a contribution to or tags on the emotional score derived from the language input.
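A minimal sketch of such a scoring filter, assuming a simple keyword-per-clause scorer for the six basic emotional states and treating biometric indicators as tags appended to the score rather than folded into it; the keyword sets are invented for illustration.

```python
# Illustrative keyword sets; a real filter would score whole grammatical
# clauses per the modified Gottschalk approach, not bare keywords.
EMOTION_KEYWORDS = {
    "fear": {"afraid", "scared", "worried"},
    "anger": {"hate", "furious", "angry"},
    "joy": {"great", "love", "wonderful"},
    "sadness": {"sad", "unhappy", "lonely"},
    "disgust": {"gross", "disgusting", "awful"},
    "disdain": {"whatever", "pathetic", "beneath"},
}

def score_emotions(clause, biometric_tags=()):
    """Score a clause for the six basic emotional states; biometric
    indicators (voice stress, galvanic skin response, expression) are
    carried alongside as tags on the score."""
    words = set(clause.lower().replace(",", " ").replace(".", " ").split())
    score = {emotion: len(words & keywords)
             for emotion, keywords in EMOTION_KEYWORDS.items()}
    return {"score": score, "tags": list(biometric_tags)}
```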
The context filter 10 then employs the emotional score to determine emotional state and variable shifts for a plurality of emotional state variables which define the system's current emotional state. The emotional shift caused by any score and input concept may vary from system to system or character to character. For example, one character may be defined to meet hostility with hostility, while another meets hostility with placation. These emotional variable shifts are provided to a continuous emotional state machine 12. The system may have a default or base emotional state which it will exhibit at a first interaction with the user. In one embodiment, the default emotional state is defined to be random within a particular range. The state variables correspond to primary emotions, such as fear, joy, arrogance, disgust, sadness, anger, and coldness. An arbitrarily large number of other emotional states may be reflected in the state variables. Alternatively, other emotional states can be designed as a combination of the primary emotional states.
The continuous emotional state machine 12 employs a plurality of state vectors, referred to as match vectors, which describe idealized conditions under which a particular event occurs. For example, there might be a match vector corresponding to an angry response, and a match vector corresponding to a joyful response. The concept of match vectors is derived from fuzzy logic and neural network systems that categorize input values by similarity to given patterns. By comparing the existing emotional state to the set of match vectors, the closest matching vector dictates the emotional context of the response. Typically, the emotional variables will decay to their default value or fall within their default range. A decay function may be linear or non-linear relative to time and may vary from one emotion to the next. For example, fear may decay exponentially with time from the event giving rise to the fear, while joy may decay linearly. Similarly, two emotional state variables decaying linearly may decay at different rates. For example, joy may decay at a rate of 2t, while arrogance decays at a rate of t/2. Decay profiles may vary from one animated character to another within a single system or from system to system.
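The decay behavior described above might be sketched as follows, assuming a default value of zero for each emotion; the specific rates mirror the 2t and t/2 examples in the text, and the exponential rate for fear is an arbitrary illustration.

```python
import math

def linear_decay(value, t, rate):
    """Decay linearly toward the assumed default of 0, never overshooting."""
    return max(0.0, value - rate * t)

def exponential_decay(value, t, rate=1.0):
    """Decay exponentially with time from the triggering event."""
    return value * math.exp(-rate * t)

# Illustrative per-character decay profiles: joy decays at 2t,
# arrogance at t/2, and fear exponentially, per the examples above.
DECAY_PROFILES = {
    "joy": lambda v, t: linear_decay(v, t, 2.0),
    "arrogance": lambda v, t: linear_decay(v, t, 0.5),
    "fear": lambda v, t: exponential_decay(v, t),
}

def apply_decay(state, elapsed):
    """Apply each emotion's decay profile for the elapsed time."""
    return {emotion: DECAY_PROFILES[emotion](value, elapsed)
            for emotion, value in state.items()}
```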
A vector corresponding to the current existing emotional state is returned to the context filter 10 and retained as part of the current state 62 of the system. The context filter 10 may also employ match vectors to define the current context. In such an embodiment, the vector corresponding to the emotional state is only one of the variables that defines the current context. The current state 62 information may include other information about the user or the current context, such as user's name, gender, etc., as well as the number of I/O pairs and number of conversational threads currently in use.
The context filter 10 may also contain one or more triggers 64 which each cause the system to enter a discrete state 32 in which the emotions are suppressed or partially suppressed and the system is permitted to carry out any standard software function. An example of a possible discrete state is "make the sale" for a sales system. The trigger might be that the purchaser indicates a desire to buy. The underlying software functions may include credit card verification. Discrete state transitions are deemed instantaneous, and the system will never be in more than one discrete state at a time, nor will it be between discrete states. Context filter 10 is also responsible for driving the generation of output responses. In that capacity, the context filter 10 supplies the input concept and state information to an adaptive filter 20. The adaptive filter 20 uses the current state information to assess user understanding and depth of the conversation. The adaptive filter 20 is constantly trying to maintain the level of conversation consistent with the user's understanding and other contextual factors. Additionally, if after a threshold number of I/O pairs, for example, two or three, the system is unable to match the input concept with an output concept, the adaptive filter 20 may assume control of the conversation by asking the user questions to get back on track. Prior to reaching the threshold, if no match is found for an input concept, the adaptive filter 20 may merely provide, for example, a guiding response to attempt to move the user back into an area in which the system has knowledge.
The input concept and state information are tagged by the adaptive filter 20 with, for example, information reflecting a suitable level for output concept, and the adaptive filter checks the output concept database 22. If an output concept is identified in the output concept database 22, it is compared with the idiom database and rule set 24 to select a suitable idiom based on the state information. Idiom database and rule set 24 converts the output concept into a suitable word output (dialogue) and forwards it and the current state to a vocal expression module 30 and a facial animation module 26.
The vocal expression module 30 uses a variation of three variables — pitch, speed, and volume — to inflect emotional content into the audio output. The parameters pitch, speed, and volume are provided to the facial animation module 26 to permit lip synchronization between the animated face and the words based on the pitch, speed, and volume employed by the vocal expression module. The facial animation module 26 and body animation module 28 provide the graphical output of the system. The context filter 10 provides the current state to both modules 26, 28 and indicates the action required for the body animation module 28. In this manner, the facial expressions and body language of the animated character are made consistent with the emotional and semantic context of the interaction.
Figure 2 is a flow chart of an operation of a natural language filter of one embodiment of the invention. At functional block 100, the natural language filter receives a natural language input. As previously noted, this may be in the form of a typed character string or the output of speech recognition software. Various third party speech recognition software is available, such as Nuance, available from Nuance Communications of Menlo Park, California, and the L&H Speech Recognition System from Lernout & Hauspie Speech Products N.V. (L&H) of Belgium. Both of the above systems are multi-voice software suitable for a multi-user environment. In one embodiment, the natural language filter uses a pattern matching algorithm that includes anaphoric references and uses the ASL 1600 from L&H. Other embodiments may employ another superior recognizer as it becomes available. In some embodiments, the natural language filter parses the grammatical structure. In one embodiment, the natural language filter provides a mechanism for tagging anaphoric references, such as he, she, me, they, it, etc. Tagging of such anaphoric references is important to the generation of contextually correct output concepts. A layer of abstraction separates the natural language filter from the context filter, permitting ready adaptation as improved technology becomes available.
The natural language input received at functional block 100 is compared to an idiomatic database in an effort to determine its meaning, which might not be apparent from parsing its grammatical structure. A determination is made at decision block 104 whether an idiom exists within the idiomatic database. The idiomatic database includes libraries of concepts matching various idioms and an extensive dictionary including common misspellings. If an idiom exists, the natural language filter creates a concept matching the idiom at functional block 106. If no idiom is found, a concept is derived from the original natural language input at functional block 108. In one embodiment, the concept is expressed in tagged English based on recognition and matching of multiple key words within sentences related by word distance. By way of example, in one possible syntactic formation, the question "what is context?" would become "WhatContext." In one embodiment, this may be performed using a lookup table that matches a rule to a possible input string. Other syntax is, of course, within the scope and contemplation of the invention. Then, at functional block 110, the natural language filter provides the input concept to the context filter. The natural language filter's function is then complete for the current input/output pair.
Figure 3 is a flow chart of operation of the emotional scoring filter of one embodiment of the invention. At functional block 150, an emotional scoring filter is provided with the emotional content from the input stream. As previously indicated, this may include the natural language input, captured expression, voice stress analysis, and galvanic skin response or other biofeedback mechanism. Other sources of emotional content may also contribute to the input of the emotional scoring filter. The language portion of the stream is scored for emotional context using a modified Gottschalk approach (such as discussed above) at functional block 152. A determination is made at functional block 154 whether other data is present in the input stream from which emotional content may be derived. If it is, the score of the language portion may be modified based on that additional data, or the score may have tags appended to it reflecting the emotional weight of that additional data at functional block 156. In either case, the score is forwarded to the context filter at functional block 158, after which the emotional scoring filter has completed its function for the input/output pair.
Figure 4 is a flow chart of operation of the context filter in one embodiment of the invention. At functional block 200, the context filter receives the input concept and emotional score. At functional block 202, the context filter identifies the conversation thread corresponding to the input concept. It is possible that a user may carry on multiple conversational threads with the system, and tracking the thread to which the particular input concept applies is important to providing an appropriate contextualized output. A determination is made at decision block 204 whether a thread has been identified. If no thread has been identified matching the input concept, a new thread is deemed created at functional block 206. If a thread is identified or after a new thread is created, the context filter then defines the adjustment for emotional state variables based on the emotional score at functional block 208. Then at functional block 210, the adjustment is sent to the emotional state machine. The current emotional state is returned to the context filter at functional block 212. Then at functional block 214, the input concept and state information are forwarded for generation of a corresponding output concept. The input concept and any output concept corresponding thereto are stored in a neural network table at functional block 216. This has several advantages inasmuch as it permits follow-up questions because history is maintained. But also over time, an extensive dataset will be developed that may be data mined to teach other animated characters natural language skills. At functional block 218, the context filter provides the voice output subsystem with the then existing emotional context. Similarly, at functional block 220, the context filter provides the emotional context to the graphical animation subsystem which includes both facial animation and body animation.
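Thread identification (functional blocks 202 through 206) might be sketched as follows, assuming concepts are CamelCase tokens and that a shared word is sufficient to relate an input concept to an existing thread; both assumptions are illustrative, not a description of the actual matching rule.

```python
import re

def _concept_words(concept):
    """Split a CamelCase concept token such as 'WhatContext' into words."""
    return set(re.findall(r"[A-Z][a-z]*", concept))

def route_to_thread(threads, input_concept):
    """threads maps thread id -> set of concepts seen in that thread.
    Return the id of a matching thread (decision block 204), creating a
    new thread if no match is found (functional block 206)."""
    for thread_id, concepts in threads.items():
        if any(_concept_words(c) & _concept_words(input_concept)
               for c in concepts):
            concepts.add(input_concept)
            return thread_id
    new_id = len(threads) + 1
    threads[new_id] = {input_concept}
    return new_id
```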
Figure 5 is a flow diagram of the operation of a continuous emotional state machine of one embodiment of the invention. At functional block 250, the
state machine receives state variable adjustments, if any, from the context filter. Then at functional block 252, it adjusts those state variables based on the adjustments received and any decay which has occurred since the previous cycle. Then at functional block 254, it updates the current emotional state in the context filter to reflect the current emotional state after any adjustments or decay. In one embodiment, the emotional state is determined using match vectors. As adjustments (either positive or negative) are applied to the various defined emotions, the value of that emotion's contribution to the emotional state of the system changes. The aggregate of the various values of the defined emotions is compared with existing match vectors to yield the current emotional state.
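The match vector comparison might be sketched as a nearest-neighbor lookup over the emotional state variables; Euclidean distance is an assumption here, as the text does not specify a similarity measure, and the vector values are invented for illustration.

```python
def closest_match_vector(state, match_vectors):
    """Return the name of the match vector (idealized emotional condition)
    nearest the current aggregate emotional state."""
    def distance(vector):
        # Euclidean distance over the shared emotion variables (an
        # illustrative choice of similarity measure).
        return sum((state[e] - vector[e]) ** 2 for e in state) ** 0.5
    return min(match_vectors, key=lambda name: distance(match_vectors[name]))
```

For example, a state dominated by anger would select an "angry response" match vector over a "joyful response" one, dictating the emotional context of the reply.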
Figure 6 is a flow chart of discrete state emotion suppression in one embodiment of the invention. At functional block 270, a signal to enter the discrete state is received. In response to entering the discrete state, the context filter is signaled to suppress one or more emotions, either partially or totally, at functional block 272. Then, at functional block 274, a predefined software function is performed. Upon completion of the predefined software function, or if for another reason the triggering event ceases, the system will leave the discrete state and emotion suppression will be discontinued.
Figure 7 is a flow chart of output concept dialogue generation of one embodiment of the invention. At functional block 300, the adaptive filter receives the input concept and the state information from the context filter. At functional block 302, the adaptive filter assesses user understanding and conversational thread progression from the state information. The adaptive filter then accesses an output concept database for an output concept which is consistent with the input concept and the assessment of understanding at functional block 304. If an output concept is not found at decision block 306, a count of the I/O pairs during which no match was found is incremented at functional block 308. A determination is then made at decision block 310 whether the number of times no match was found exceeds the threshold. If the threshold has been exceeded, the adaptive filter enters an adaptive testing mode at functional block 314. In this adaptive testing mode, the adaptive filter formulates questions which steer the user back into a discourse in which matches may be found. By way of example, a first question may be an open-ended question. If the user's response to the open-ended question returns to a thread which permits matching, the adaptive test is over. If the system is unable to make sense of the user response, the focus of the question will be narrowed on consecutive questions until a bi-conditional response is called for. In this manner, it is expected that the adaptive testing procedure will permit the system to move the user back into a conversational context suitable to the system. If, at decision block 310, the no-match count does not exceed the threshold, the system will generate a guiding response to try to guide the user back into the context or possibly make some kind of a general comment relevant to the topic. The system will not yet assume control of the conversation, as in the case of the adaptive test.
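The escalation from guiding responses to adaptive testing might be sketched as follows; the threshold value and sample questions are illustrative, with the questions narrowing from open-ended toward a bi-conditional (yes/no) form as described.

```python
NO_MATCH_THRESHOLD = 3  # illustrative; the text suggests two or three

# Invented questions narrowing toward a bi-conditional response.
ADAPTIVE_QUESTIONS = [
    "What would you like to talk about?",
    "Are you asking about the product itself, or about your order?",
    "Is this about your order? (yes/no)",
]

def next_response(no_match_count):
    """Below the threshold, only a guiding nudge; at or above it, the
    system takes control of the conversation with adaptive testing."""
    if no_match_count >= NO_MATCH_THRESHOLD:
        question_index = min(no_match_count - NO_MATCH_THRESHOLD,
                             len(ADAPTIVE_QUESTIONS) - 1)
        return ("adaptive_test", ADAPTIVE_QUESTIONS[question_index])
    return ("guiding_response", None)
```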
After an output concept is found, either at decision block 306 or through the guiding response or adaptive test, the output concept is checked against the idiom database to generate a suitable dialogue string corresponding to the conversational context and the output concept at functional block 316. At functional block 318, the generated dialogue is sent to the voice expression module and the facial animation module.
Figure 8 is a flow chart of operation of the vocal expression module in one embodiment of the invention. The vocal expression module receives the dialogue and state information at functional block 340. Using the state information, the vocal expression module establishes a pitch, speed, and volume map to apply to the dialogue output to impart emotional and semantic content. By way of example, if the output is a question, the pitch may be elevated at the end of the question. If the emotional state is happy, pitch and speed may be adjusted to effect a lilting vocal output. The pitch, speed, and volume map may be established using library lookups to find closest matches to the existing state. Additionally, each variable within the map may be randomized within a range. This randomization reduces the machine-like quality that might otherwise result. The pitch, speed, and volume map is forwarded to the facial animation module at functional block 344. Then, at functional block 346, the dialogue is synthesized or played from pre-recorded voice recordings in synchronization with the facial animation using the pitch, speed, and volume map.
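The construction of the pitch, speed, and volume (PSV) map might be sketched as follows; the base multipliers per emotional state, the randomization range, and the question-ending pitch boost are all illustrative assumptions.

```python
import random

# Illustrative base PSV multipliers per emotional state.
BASE_PSV = {
    "happy": (1.2, 1.1, 1.0),   # lilting: raised pitch and speed
    "sad": (0.9, 0.85, 0.8),
    "neutral": (1.0, 1.0, 1.0),
}

def build_psv_map(emotional_state, is_question, rng=None):
    """Build a PSV map, randomizing each variable within a small range
    to reduce the machine-like quality of the output."""
    rng = rng or random.Random()
    base = BASE_PSV.get(emotional_state, BASE_PSV["neutral"])
    pitch, speed, volume = (v * rng.uniform(0.95, 1.05) for v in base)
    return {
        "pitch": pitch,
        "speed": speed,
        "volume": volume,
        # Elevate pitch at the end of a question.
        "end_pitch": pitch * 1.15 if is_question else pitch,
    }
```

Passing a seeded `random.Random` makes the randomization reproducible for testing while remaining varied in ordinary use.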
Figure 9 is a flow chart of output generation in the facial animation
module of one embodiment of the invention. At functional block 370, the facial animation module receives state information, dialogue, and the PSV (pitch, speed, and volume) map. At functional block 372, the facial vertices are adjusted to reflect the dominant emotion reflected in the state information. At functional block 374, the movement is synchronized with the dialogue and PSV map.
In one embodiment, the face is based on a face template with a fixed number of vertices. When modifying a copy of the template, vertices may be moved but not detached, reattached, deleted or inserted; the order of vertices as they appear in the vertex list and the parts of the face that they represent must be maintained. The following features will be included in the fixed length vertex list describing the face: hairline of the forehead to the whole jaw including the top of the neck, excluding the ears. The ears will be part of the rigid head geometry and will be represented by texturing a simple shape.
The system employs a vertex deformation protocol to deform vertices of the face to create different facial expressions on the fly. Rather than creating key frames on a time line, the system algorithmically interpolates between morph targets.
The facial art consists of the following morph targets: Target N — A neutral face that would be characteristic of listening to someone else speaking; Target A — The prototypical face of angry disgust for this character with teeth clenched; and Target B — The sleepy yawn face with eyes closed. It is expected that targets N, A, and B will be sufficient for recombining into all of the various facial expressions to be represented. Additional targets may be added, as needed.
Contiguous ranges of vertices within the fixed length vertex list will be assigned to distinct facial features. This will allow facial animation to be divided into subchannels. These subchannels will, for example, allow 80% of the left eye from target B to be combined with 50% of the entire target N and 10% of the entire target A. By defining different morph formulae to correspond to various emotional states, an appropriate facial animation of a dominant state is achieved on the fly.
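The subchannel blending described above might be sketched as follows, simplifying each vertex to a scalar displacement rather than a 3D position; the morph formula in the test mirrors the example of combining a subrange of target B at 80% with all of target N at 50% and all of target A at 10%.

```python
def blend_morph_targets(targets, formula, num_vertices):
    """Blend fixed-length morph targets per subchannel formulae.

    `targets` maps target name -> vertex list (scalars here for brevity).
    `formula` maps target name -> list of (weight, (start, end)) subchannel
    assignments; a range of None applies the weight to the whole target."""
    out = [0.0] * num_vertices
    for name, channels in formula.items():
        for weight, vertex_range in channels:
            start, end = vertex_range or (0, num_vertices)
            for i in range(start, end):
                out[i] += weight * targets[name][i]
    return out
```

Defining one such formula per emotional state lets an appropriate facial pose be synthesized on the fly, as the text describes.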
Figure 10 is a flow chart of the body animation module of one embodiment of the invention. At functional block 400, the body animation module receives the current state information and action required. At
functional block 402, the body of the animated character assumes the posture of the dominant emotion. At functional block 404, the body animation module accesses an action library to identify details of any action required. The action library is created using motion capture for actions for which such techniques are viable. Smaller actions are defined by key frame animation. In either case, a library lookup permits the character to synthesize motion in real time. At functional block 406, the body is animated to perform the action.
The body animation module controls bodily movements. That module tells body parts how to move in response to commands from the context filter. In one embodiment, 3D character modeling constraints include: (i) all models must be created near the 3D origin (0, 0, 0); (ii) whole units of distance must approximate inches; the default camera must be from the perspective of a child looking slightly up in a positive Z direction at an adult standing at the origin; the child is standing about 100 inches away in the negative Z; (iii) camera position (0, 50, -100); (iv) camera up vector (0, 1, 0); (v) camera look at (0, 60, 0); (vi) camera frustum will approximate a 35 mm lens; and (vii) the head will be attached to a piece of body geometry named "neckcenter."
In one embodiment, the body parts of each animated character are subject to certain additional constraints. All animation of the characters, except for the faces, will be based on quaternions applied to a hierarchy of segmented meshes. The fingers are all slightly curved at the knuckles so that the only joint that needs to be animated is the joint at the base of the finger. There are three separate rigid meshes representing the fingers for each hand, one for each thumb and index finger, and a single rigid mesh that groups the remaining three fingers. This will provide a reasonable approximation of a wide range of finger and hand positions with a minimum number of polygons and joints.
The hierarchy of body parts provides a structure for the relative positioning of body parts as shown in Table 1 below. To assist in diagnosing problems that may arise, all body parts will bear labels according to this standard. Each relationship between adjacent rigid components in the hierarchy is a joint to which quaternion animation can be applied. The position and animation of each body part is computed relative to the part immediately above it in the hierarchy. Every other part is positioned directly or indirectly relative to the
pelvis. An invisible frame of reference with its origin on the ground between the feet determines the position of the pelvis. The origin should appear at the world origin or position x = 0, y = 0, z = 0.
Table 1
Character Root Node - origin on ground between feet, no geometry
    Furniture Node - any 3D geometry that the character may interact with (e.g. a chair)
    Parent Node - mesh - the parent is located at the center of the pelvis - it should not be confused with the pelvis, which may move independently
        Pelvis Node - mesh
            Torso Node - mesh
                Neck Node - mesh
                    Head Node - mesh including ears, excluding face
                        Face Node - deformations of fixed length vertex list
                        Eyes Node - attached to head
                        Upper Teeth - attached to head
                        Headwear - i.e. glasses, visors, hats - attached to head
                Left Upper Arm Node - mesh
                    Left Forearm Node - mesh
                        Left Hand Node - mesh
                            Left Thumb Node - mesh
                            Left Index Finger Node - mesh
                            Left Minor Finger Group Node - mesh
                Right Upper Arm Node - mesh
                    Right Forearm Node - mesh
                        Right Hand Node - mesh
                            Right Thumb Node - mesh
                            Right Index Finger Node - mesh
                            Right Minor Finger Group Node - mesh
            Left Thigh Node - mesh
                Left Lower Leg Node - mesh
                    Left Foot Node - mesh
            Right Thigh Node - mesh
                Right Lower Leg Node - mesh
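The relative positioning rule of Table 1 might be sketched as follows, omitting quaternion rotations and using invented local offsets in inches; each part's world position is the sum of local offsets walking up the hierarchy toward the root.

```python
# Hypothetical subset of the Table 1 hierarchy: child -> parent
# (None marks the root of this sketch).
HIERARCHY = {
    "pelvis": None,
    "torso": "pelvis",
    "neck": "torso",
    "head": "neck",
}

# Invented local offsets in inches (y is up), relative to each parent.
LOCAL_OFFSETS = {
    "pelvis": (0.0, 40.0, 0.0),
    "torso": (0.0, 15.0, 0.0),
    "neck": (0.0, 8.0, 0.0),
    "head": (0.0, 5.0, 0.0),
}

def world_position(part):
    """Accumulate offsets up the parent chain, so every part is positioned
    directly or indirectly relative to the pelvis."""
    x = y = z = 0.0
    while part is not None:
        dx, dy, dz = LOCAL_OFFSETS[part]
        x, y, z = x + dx, y + dy, z + dz
        part = HIERARCHY[part]
    return (x, y, z)
```

A full implementation would replace the translation-only offsets with quaternion rotations at each joint, per the constraints above.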
While each of the above-described flow charts sets forth a particular ordering, it is within the scope and contemplation of the invention that the ordering of certain decisions and/or functional blocks may occur in parallel or in an order other than specified. These flow diagrams are merely exemplary and other orderings are within the scope and contemplation of the invention.
In the foregoing specification, the invention has been described with reference to specific embodiments thereof. It will, however, be evident that various modifications and changes can be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. Therefore, the scope of the invention should be limited only by the appended claims.