Video Application Node TECHNICAL FIELD OF THE INVENTION
The present invention relates to a video conversation application in a communication network.
BACKGROUND AND DESCRIPTION OF RELATED ART
As mobile telephony networks are deployed around the world, more and more people own a mobile telephone. At the same time, many applications that use the mobile network are being developed. Since one of those applications, video telephony, has been standardized, many of the latest mobile telephones have a video telephony client implemented.
Video telephony is a peer-to-peer communication application in which voice and video are transmitted simultaneously and in synchronization. The application enables a user to talk to the other party while seeing his/her face on the display of the mobile phone. The standard is described in ITU-T Recommendation H.324 (Terminal for low bit-rate multimedia communication) and in the 3GPP specification TS 26.111.
Recommendation H.324 covers the technical requirements for very low bit-rate multimedia telephone terminals operating over the General Switched Telephone Network, that is, the fixed and/or mobile telephony networks. H.324 references a number of other relevant ITU-T Recommendations, such as the H.223 multiplex/demultiplex, H.245 control, the H.263 video codec, and the G.723.1 audio codec.
The 3GPP specification TS 26.111 is based on H.324, but with additions to fully specify a multimedia codec for use in the 3rd Generation Mobile System.
Apart from the coding standards mentioned above, there are a number of other known solutions and techniques for encoding video.
One technique is specializing in sensing facial movements of a human being. By using a video camera and image processing, a video terminal can detect facial movements and characteristics in real time, transform these movements and characteristics into a data stream and send it to another video terminal. The latter terminal receives the information and generates a synthesized and animated image of a human face. In some literature this image is also referred to as an avatar.
One example of such a technique is described in the International patent application with publication number WO99/53443, 'Wavelet-based Facial Motion Capture for Avatar Communication'. This patent application discloses a method and an apparatus for generating an animated image based on facial sensing. It also suggests that the facial sensing is performed at one site and that the coded video data stream is sent over a network to a remote site where the animated image is reconstructed and displayed.
Another example is US patent 6,044,168, 'Model based faced coding and decoding using feature detection and eigenface coding'. This patent discloses an algorithm for sensing facial features and for synthesizing a face image at the receiving end by mapping the received facial feature locations onto a 3D face model.
Yet another example of relevant prior art is the Moving Picture Experts Group's MPEG-4 version 2 compression standard, ISO/IEC 14496. Among other things, this standard includes natural and synthetic coding as well as Facial and Body Animation, as described in 'Overview of the MPEG-4 Standard', ISO/IEC JTC1/SC29/WG11 N4030, March 2001. The MPEG-4 standard and its possible applicability to mobile terminals have also been discussed by I. S. Pandzic in "Facial animation framework for the web and mobile platforms", Proceedings of Web3D Symposium 2002.
A common characteristic of the algorithms for detecting facial features and transforming these into a synthesized image is that they offer large bandwidth savings when transmitting video over a communication channel, but require a large amount of computing capacity in the video terminals.
SUMMARY OF THE INVENTION
In video conversation applications between two or more human users (such as video conferences, video telephony etc.), conventional video coding techniques aim at displaying the image of the user's face as authentically as possible on the screen of the video terminal.
A problem addressed by the present invention is how to display an animated synthesized graphical representation of the user's face on the screen in lieu of a near-authentic image.
The algorithms for detecting and transforming facial characteristics and movements into a synthesized graphical representation are known to require a large amount of computing capacity. This makes it very difficult to implement the algorithms in video terminals with limited processing capacity such as for example mobile video telephones.
Another problem addressed by the present invention is that the users involved in the video conversation normally have no means of controlling the behaviour of the video coding algorithms.
The present invention resolves these problems by moving the implementation of said algorithms from the video terminal (the client) to a centralised server in the communication network, called a Video Application Node (VAN).
When a video channel is established between a first video terminal and a second video terminal, the video channel passes through the VAN.
The video data stream, which is encoded by the first video terminal using conventional video coding techniques, is received by the VAN and decoded by a first video codec. The decoded video data stream is sent to a Real Time Video Effecter part in the VAN. This Real Time Video Effecter part performs the transformation of the facial characteristics and movements, and also allows for additional graphical effects. The resulting synthesized graphical representation is sent to a second video codec where it is encoded using conventional video coding techniques. The encoded video data stream is sent from the VAN towards the second video terminal.
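The decode-transform-encode chain described above can be sketched in Python; every function name here is a hypothetical stand-in (the actual codecs, e.g. H.263, and the facial-animation algorithms are of course far more involved):

```python
def decode_frame(encoded_frame):
    """Stand-in for the first conventional video codec (decoding)."""
    return {"pixels": encoded_frame["payload"]}

def real_time_video_effecter(frame, effect_pattern):
    """Stand-in for facial-feature detection and avatar synthesis."""
    return {"pixels": frame["pixels"], "effect": effect_pattern}

def encode_frame(frame):
    """Stand-in for the second conventional video codec (encoding)."""
    return {"payload": frame["pixels"], "effect": frame["effect"]}

def van_process(encoded_frame, effect_pattern="default"):
    # Decode with the first codec, transform in the Real Time Video
    # Effecter, re-encode with the second codec.
    decoded = decode_frame(encoded_frame)
    synthesized = real_time_video_effecter(decoded, effect_pattern)
    return encode_frame(synthesized)
```

The point of the sketch is the ordering of the three stages, not the data representation, which is invented for the example.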
The video channel can be bi-directional, and the VAN can process the video data stream received from the second video terminal in the same way as it processes the video data stream received from the first video terminal.
The problem of the users not being able to control the behaviour of the video coding algorithm is resolved by implementing a subscriber database and an interface to this database in the VAN.
The subscriber database stores data for each video terminal user who subscribes to the feature of displaying synthesized animated graphical representations (the 'animation feature'). The data in the subscriber database comprises, for example, parameters that indicate which synthesized representation is to be displayed and under which conditions.
The subscribers have access to the subscriber database via a data communication link from the subscriber's video terminal to the database interface in the VAN. By sending control signalling over this interface, the subscriber can alter the parameters for the synthesized representation. Accessing this database can be done either when the video connection is established or at any other time at the leisure of the subscriber.
One object of the invention is to use the transformation of facial characteristics and movements to provide amusement features for users involved in a video conversation. One example is to display an animated cartoon or some other synthesized representation which mimics the face of the user. This synthesized representation could be unique for each user involved in the video conversation. The appearance of the synthesized representation can also depend on the combination of the identities of the involved users.
When a user, for example, calls family members or friends, a selected animated image is displayed, but when calling someone else, a default image with no connection to the calling user at all is displayed (for privacy purposes).
Another important object of the invention is to enhance the amusement features by allowing the subscriber to alter and update the parameters that control said synthesized representations.
The concept of implementing the transformation algorithm and the subscriber database centrally in a VAN offers several advantages.
An overall advantage of the invention is that the amusement aspect can boost the usage of video conversation applications, which will increase the traffic in the communication network, thereby increasing the revenues for communication network operators.
By using the invention, significant processor load is moved from the video terminal (the client) to the centralised VAN (the server). This means simpler design and lower cost for the terminals.
With a centralised VAN, network operators (controlling the VAN) can continuously offer new features, which easily can be made available to the users without software updates in the video terminals. Network operators can also apply different charging schemes to these features.
The invention will now be described in more detail, with reference to preferred embodiments and the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 is a block diagram showing a typical video telephony application between two mobile video telephones.
Figure 2 is a block diagram showing typical video telephony clients according to the ITU-T Recommendation H.324.
Figure 3 is a block diagram showing a video telephony client with local implementation of the transformation algorithms.
Figure 4 is a block diagram showing the transformation algorithms implemented centrally in the Video Application Node.
Figure 5 is a block diagram showing an example of how the 'animation feature' can be perceived by mobile video telephone users.
Figure 6 is a flow chart showing the steps of establishing and processing a video telephony call over a VAN.
Figure 7 is a block diagram showing the involved network elements in a call setup.
Figure 8 is a block diagram showing the involved network elements when a subscriber accesses the subscriber database in the VAN.
Figure 9 is a table over subscriber parameters in the subscriber database.
DETAILED DESCRIPTION OF EMBODIMENTS
The present invention is in a preferred embodiment applied to a video telephony application in a mobile telephony network. In Figure 1 (prior art), a video telephony application between two mobile video telephones is shown. A first video telephony user 101 is calling a second video telephony user 102, each user using a mobile video telephone (111 and 112 respectively). Each mobile video telephone 111, 112 is equipped with an inbuilt video camera and a display and enables one of the users to talk to the other user while simultaneously seeing his/her face on the display.
Figure 2 (prior art) shows a block diagram with two video telephony clients 210, 220. A typical video telephony client 210 comprises a number of function elements:
A video I/O equipment element 211 includes for example a video camera and a display. This element is connected to a video codec element 214, which carries out redundancy reduction encoding and decoding for video data streams.
An audio I/O equipment element 212 includes for example a microphone and a speaker. This element is connected to an audio codec element 215, which encodes the audio signal from the microphone for transmission, and decodes the audio code, which is output to the speaker.
A system control element 213 is an entity that uses a control protocol element 216 for end-to-end signalling for proper operation of the video telephony client (such as reversion to speech-only telephony mode etc.).
A multiplex/demultiplex element 217 connected to the elements 211, 212 and 213 multiplexes the video data stream, the audio signal and the control signal into a single bit stream, and demultiplexes a received single bit stream into a video data stream, an audio signal and a control signal. In addition, it performs logical framing, sequence numbering, error detection, and error correction by means of retransmission, as appropriate to each stream or signal.
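The multiplex/demultiplex role described above can be illustrated by the following toy sketch, which only tags each unit with a logical channel and a sequence number; the real H.223 framing, with its error detection and retransmission, is not modelled:

```python
import itertools

def multiplex(video, audio, control, counter=None):
    """Interleave the three signals into one tagged single stream."""
    seq = counter or itertools.count()
    return [("video", next(seq), video),
            ("audio", next(seq), audio),
            ("control", next(seq), control)]

def demultiplex(stream):
    """Split a tagged stream back into its three components."""
    out = {"video": [], "audio": [], "control": []}
    for channel, _seqno, payload in stream:
        out[channel].append(payload)
    return out
```

The function names and the tuple layout are assumptions made for the illustration.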
Figure 3 shows a first video telephony client 310 that is equipped with a video camera 311 and a video transformation functional block 312, both connected to each other. The video transformation functional block 312 detects facial characteristics and movements in a first video data stream 313 received from the video camera 311 and transforms these movements into a synthesized animated graphical representation 314 which is transmitted in a second video data stream 315 to a second video telephony client 320. The video transformation is made locally in the video telephony clients 310, 320. As mentioned earlier, the algorithms for this video transformation require a large amount of computing capacity which makes it unsuitable for video terminals with limited processing capacity such as mobile video telephones.
The essence of the present invention is therefore to instead implement the algorithms for the video transformation process in a centralized server, a Video Application Node (VAN).
A block diagram of a VAN 4100 is found in Figure 4.
The VAN 4100 comprises a number of functional elements such as multiplex/demultiplex elements 4101, 4102, video codec elements 4103, 4104, control protocol elements 4105, 4106 and a Real Time Video Effecter element 4107.
Two video telephony clients 4200 and 4300 correspond to the video telephony client 210 described in Figure 2.
The multiplex/demultiplex element 4101 is on one side connected towards the video telephony client 4200 and on the other side connected to two elements, the video codec element 4103 and the control protocol element 4105 respectively. The video codec element 4103 is also connected to the Real Time Video Effecter element 4107. The Real Time Video Effecter element 4107 is in turn connected to the video codec element 4104. The control protocol element 4105 is connected to another control protocol element 4106. The video codec element 4104 and the control protocol element 4106 are both connected to the multiplex/demultiplex element 4102, which in turn is connected towards the video telephony client 4300.
The video telephony client 4200 sends a multiplexed video telephony data stream 4400 to the VAN 4100. In the VAN 4100, the multiplexed video telephony data stream 4400 is demultiplexed into a first video data stream 4109, an audio signal 4108 and a control signal 4110 by the multiplex/demultiplex element 4101. The first video data stream 4109 is sent to the video codec 4103 and the control signal 4110 is sent to the control protocol element 4105. The audio signal 4108 is sent to the multiplex/demultiplex element 4102. A decoded first video data stream from the video codec 4103 is sent to the Real Time Video Effecter element 4107. This element detects facial characteristics and movements and transforms in real time the decoded first video data stream to a synthesized animated graphical representation.
The synthesized animated graphical representation is sent to the video codec 4104, where it is encoded to an encoded second video data stream 4111 which in turn is sent to the multiplex/demultiplex element 4102.
In the multiplex/demultiplex element 4102 the encoded video data stream is multiplexed together with the encoded audio signal and the control signal and sent as a second video telephony data stream 4500 towards the receiving video telephony client 4300.
The block diagram in Figure 4 is symmetrical and the notations 'sending' and 'receiving' video telephony clients are reciprocal. That is, a video telephony data stream sent from the video telephony client 4300 to the video telephony client 4200 is treated in the same way in the VAN as a video telephony data stream sent from 4200 to 4300.
In a preferred embodiment of the invention, all elements in Figure 4 except the Real Time Video Effecter element 4107 comply with the standardised functional elements in ITU-T Recommendation H.324; that is, the multiplex/demultiplex elements 4101, 4102 are according to H.223, the video codecs 4103, 4104 are according to H.263, and the control protocol elements 4105, 4106 are according to H.245. The video transformation process implemented in the Real Time Video Effecter element 4107 can, for example, use one of the several facial-sensing algorithms known from the prior art.
An example of how the 'animation feature' can be perceived by two mobile video telephone users is shown in Figure 5.
A user 501 using a mobile video telephone 503 is calling another user 502 with a mobile video telephone 504. The video telephony call passes through a VAN 505, which has the same functionality as the VAN 4100 of Figure 4. In this example, the screen on the mobile video telephone 503 displays an animated cartoon 506 mimicking user 502.
Likewise, the screen on the mobile video telephone 504 displays an animated cartoon 507 mimicking user 501.
An example of the establishment of a video telephony call, as seen by e.g. the VAN 4100 of Figure 4, is shown in a flow diagram in Figure 6.
Step 601: The VAN receives a call setup from a call originating video telephony client.
Step 602: The VAN sends a call set-up to a destination video telephony client.
Step 603: A video telephony call including a bi-directional video channel is established between the originating and the destination video telephony clients.
Step 604: The VAN receives an encoded first video data stream from a first one of the video telephony clients, which can be any of the originating or destination video telephony clients.
Step 605: The encoded first video data stream is decoded by a video codec in the VAN.
Step 606: Facial characteristics and movements are detected in the decoded first video data stream and a synthesized animated graphical representation is generated.
Step 607: The synthesized animated graphical representation is encoded into an encoded second video data stream.
Step 608: The encoded second video data stream is sent to a second one of the video telephony clients.
The clients mentioned in Figure 6 are e.g. the clients 4200 and 4300 of Figure 4.
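Steps 601-608 can be summarised as the following sketch, where each helper is a hypothetical stand-in for the corresponding VAN function and the data shapes are invented for the example:

```python
def process_video_call(setup, first_frame):
    """Walk through steps 601-608 of Figure 6 for one frame."""
    events = []
    events.append("601: setup received from " + setup["originating"])
    events.append("602: setup sent to " + setup["destination"])
    events.append("603: bi-directional video channel established")
    # Steps 604-608 repeat for every frame carried on the channel.
    decoded = first_frame["payload"]         # 604-605: receive and decode
    synthesized = {"avatar_of": decoded}     # 606: detect and synthesize
    encoded = {"payload": synthesized}       # 607: encode second stream
    events.append("608: encoded stream sent to second client")
    return events, encoded
```

The sketch only fixes the ordering of the steps; which of the two clients is "first" is symmetric, as the text notes.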
Figure 7 shows a system overview of involved network elements when a video telephony call is established between two mobile video telephones:
- A call originating mobile video telephone 701,
- A call destination mobile video telephone 702,
- Radio base stations 703, 704,
- Mobile Switching Centres, MSC 705, 706,
- A Gateway MSC, GMSC 707,
- A Video Application Node, VAN 708.
The call establishment uses a signalling protocol such as the ISUP (ISDN User Part). ISUP is a part of the standardised Signalling System Number 7 (SS7) protocol and consists of the signalling messages and signalling information elements necessary for call establishment. The VAN is equipped with a signalling device 709 in order to receive and send this SS7 signalling.
There are at least two possible options of implementing the call establishment procedure.
When, according to a first option, a mobile telephony call set-up is made from the call originating mobile video telephone 701 to the call destination mobile video telephone 702, a number of signalling information elements are sent from the mobile video telephone 701, including:
- A 'Called Party Number' information element comprising a telephone number to the call destination mobile video telephone 702,
- A 'Call Type' information element,
- A 'Calling Party Number' information element comprising a telephone number to the call originating mobile video telephone 701.
The call is routed over the radio base station 703 to the MSC 705. The MSC 705 analyzes the 'Call Type' information element received from the mobile video telephone 701. If the value in the information element is set to 'video telephony', the call is routed via the GMSC 707 to the VAN 708.
The VAN 708 analyses the received 'Calling and Called Party Number' information elements and checks if the call originating mobile video telephone 701 and the call destination mobile video telephone 702 are subscribers to the 'animation feature'. If one or both mobile video telephones are subscribers, the VAN routes the call back to the GMSC 707 which further routes the call towards the call destination mobile video telephone 702, via the MSC 706 and the radio base station 704. The video data stream between the two mobile video telephones is now passing through the VAN and is processed as illustrated in Figure 4.
If the VAN 708 concludes that none of the mobile video telephones are subscribers to the 'animation feature', the VAN will instead return the call control to the GMSC 707 which will route the call as an ordinary video telephony call to the call destination mobile video telephone 702. The VAN will not take any further part in the call and the video data stream will not pass through the VAN.
In a second option, the 'Called Party Number' information element sent from the originating mobile video telephone 701 comprises a telephone number to the VAN. In order to identify the call destination mobile video telephone, a 'User-to-user' information element is sent from the originating mobile video telephone 701 comprising a
Subscriber ID of the call destination mobile video telephone 702. The call is routed to the VAN over the radio base station 703, the MSC 705 and the GMSC 707 respectively. The VAN translates the Subscriber ID received in the 'User-to- user' information element to a telephone number to the call destination mobile video telephone 702. The VAN routes the call back to the GMSC 707 which further routes the call towards the call destination mobile video telephone 702, via the MSC 706 and the radio base station 704. The video data stream between the two mobile video telephones is passing through the VAN and is processed as illustrated in Figure 4.
The advantage of the second option is that it does not require any upgrades of the nodes in the existing core network (the radio base stations, the MSCs and the GMSCs).
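The two routing options can be contrasted in a short sketch; the subscriber table, the number formats, and the return values are assumptions made purely for illustration:

```python
# MSISDN -> subscriber ID, mirroring the 'animation feature' database.
SUBSCRIBERS = {"8190111": "Anne", "8190222": "Bob"}
ID_TO_MSISDN = {sid: msisdn for msisdn, sid in SUBSCRIBERS.items()}

def route_option_one(calling_number, called_number):
    """Option 1: the VAN inspects the Calling/Called Party Numbers."""
    if calling_number in SUBSCRIBERS or called_number in SUBSCRIBERS:
        return ("via_van", called_number)    # stream passes through the VAN
    return ("ordinary_call", called_number)  # control returned to the GMSC

def route_option_two(user_to_user_id):
    """Option 2: the Called Party Number addresses the VAN itself; the
    real destination arrives as a Subscriber ID in the User-to-user
    information element and is translated to an MSISDN."""
    return ("via_van", ID_TO_MSISDN[user_to_user_id])
```

Option two never falls back to an ordinary call, which matches the text: a caller who dials the VAN's number is by definition requesting the feature.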
The VAN comprises a subscriber database where data for each video telephony user subscribing to the 'animation feature' is stored including parameters that control the graphical representations. The subscribers have access to the subscriber database via a data communication link from the mobile video telephone to an interface in the VAN.
Figure 8 illustrates two types of database update signalling for accessing the subscriber database. The first type is carried on a separate packet-switched network such as, for example, the GPRS (General Packet Radio Service) network. The second type uses in-band signalling in the established video telephony data stream.
Involved network elements are:
- A mobile video telephone 803,
- A base station 804,
- A Serving GPRS Support Node, SGSN 805,
- A Gateway GPRS Support Node, GGSN 806,
- A packet router 807,
- A VAN 800 including, in addition to the functionality already described in Figure 4, a subscriber database 801, a web server 810 and a DTMF receiver 811,
- A Mobile Switching Centre, MSC 808,
- A Gateway MSC, GMSC 809.
The data link in the GPRS network from the mobile video telephone 803 to an interface in the VAN 800 is established over the radio base station 804, the SGSN 805, the GGSN 806 and the packet router 807.
The web server 810 in the VAN 800 has a peer-to-peer web interface towards the mobile video telephone 803. The user can access and alter the contents of the subscriber database 801 using an inbuilt web browser in the mobile video telephone 803. The database update signalling between the mobile video telephone 803 web client and the web server 810 is carried on an Internet protocol such as the Hypertext Transfer Protocol, HTTP or the Wireless Application Protocol, WAP.
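On the server side, a single database update arriving over the web interface might reduce to something like the following sketch; the in-memory table, the field names, and the login check are all assumptions:

```python
# Toy stand-in for the subscriber database 801.
SUBSCRIBER_DB = {
    "8190111": {"subscriber_id": "Anne", "password": "pw1",
                "default_effect": "princess"},
}

def handle_update(msisdn, password, parameter, value):
    """Apply one parameter change if the login credentials match."""
    entry = SUBSCRIBER_DB.get(msisdn)
    if entry is None or entry["password"] != password:
        return False              # standard login procedure failed
    entry[parameter] = value
    return True
```

In a real deployment this handler would sit behind the HTTP or WAP request parsing in the web server 810, which is omitted here.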
The web interface is normally used when the mobile video telephone user is not engaged in any call.
When accessing the subscriber database 801 during an existing call, a simplified user interface can be applied. Instead of accessing the database using a web browser, the mobile video telephone's numeric keypad can be used. For each pressed key, a DTMF (Dual Tone Multi-Frequency) signal is sent to the VAN. DTMF uses tone signalling which is sent in-band on the already established voice channel in a call. In order to process the DTMF signals, the VAN is equipped with the DTMF receiver 811. Figure 8 shows a voice channel
established over the base station 804, the MSC 808 and the GMSC 809 to the VAN 800.
The subscriber's access to the database can preferably be restricted in the sense that it requires a standard login procedure using a unique subscriber ID and a password for each subscriber.
An example of the content of a subscriber database is shown in a list in Figure 9. Each unique subscriber is identified by a key parameter (column 91), here the mobile video telephone's MSISDN (Mobile Subscriber ISDN) number. The MSISDN is basically the telephone number of a mobile video telephone. For each MSISDN a number of unique parameters are assigned. The rows in the list in Figure 9 show for each MSISDN a subscriber (or user) ID (column 92), a Password (column 93), a default effect pattern (i.e. the default synthesized graphical representation of the user) (column 94) and Specific Effect Pattern Conditions (column 95). The latter parameter controls which synthesized graphical representation is to be displayed as a function of one or several conditions.
The list comprises two examples of subscribers, MSISDN = 8190111 and MSISDN = 8190222. For MSISDN = 8190111, the Subscriber ID is 'Anne'. For MSISDN = 8190222, the Subscriber ID is 'Bob'. For both subscribers there is a Password defined. When Anne is calling another video telephone, an animated cartoon of a princess that mimics Anne's facial movements is displayed on the called party's video telephone by default. However, if Anne calls Bob, who has the MSISDN number 8190222, the animated cartoon is modified and eyeglasses are added to the cartoon (Condition 1 for MSISDN = 8190111). If Anne is calling a video telephony user who is not a subscriber to the 'animation feature', no animated image is shown whatsoever (Condition 2 for MSISDN = 8190111). On the other hand, when talking to Bob, Anne's video telephone will display an animated image of a lion (Condition 1 for MSISDN = 8190222).
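The lookup implied by Figure 9 can be sketched as follows. The encoding of the condition list is an assumption, and a pattern of None stands for 'no animated image'; only the two example rows from the text are modelled (Bob's default pattern is invented, since the text does not state it):

```python
EFFECT_DB = {
    "8190111": {"default": "princess",
                "conditions": [("called_is", "8190222", "princess+eyeglasses"),
                               ("called_not_subscriber", None, None)]},
    "8190222": {"default": "default_cartoon",
                "conditions": [("called_is", "8190111", "lion")]},
}

def select_effect(caller, called):
    """Return the effect pattern shown to the called party."""
    entry = EFFECT_DB.get(caller)
    if entry is None:
        return None               # caller has no animation subscription
    for kind, arg, pattern in entry["conditions"]:
        if kind == "called_is" and called == arg:
            return pattern
        if kind == "called_not_subscriber" and called not in EFFECT_DB:
            return pattern        # None -> no animated image at all
    return entry["default"]
```

Conditions are checked in row order, and the default effect pattern applies only when no specific condition matches.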
When the subscriber database is accessed using the simplified user interface with DTMF signalling, different synthesized animated graphical representations of the user can be selected in real time during the established call. When, for example, digit '0' is pressed on the keypad, the synthesized animated graphical representation of the user's face will change to the 'default' effect pattern. When digit '1' is pressed, effect pattern number 'one' will be used instead. The set of synthesized animated graphical representations and additional graphical effects can be pre-defined and configured by the subscriber using the earlier described web interface to the subscriber database.
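The in-call DTMF selection then amounts to a simple digit-to-pattern mapping, as in this sketch; the pattern names and the session structure are assumptions:

```python
# Digits the DTMF receiver 811 recognises, per the example above.
DIGIT_TO_PATTERN = {"0": "default", "1": "one", "2": "two"}

def on_dtmf_digit(call_session, digit):
    """Switch the active effect pattern mid-call on a recognised digit;
    unrecognised digits leave the session unchanged."""
    pattern = DIGIT_TO_PATTERN.get(digit)
    if pattern is not None:
        call_session["active_effect"] = pattern
    return call_session
```

The mapping itself would be populated from the subscriber's pre-configured patterns in the database.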
Again, the present invention is in a preferred embodiment applied to a mobile telephony network. It is, however, obvious to a person skilled in the art that the inventive concept can also be applied to video conversation applications in the fixed telephony network or on the Internet, where the video terminals can be, for example, personal computers (PCs) or similar.