WO2005124686A1 - Audio-visual sequence analysis


Info

Publication number
WO2005124686A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
data
visual
visual data
processing
Application number
PCT/IE2005/000068
Other languages
French (fr)
Inventor
David Sadlier
Alan Smeaton
Sean Marlow
Noel Murphy
Noel O'Connor
Original Assignee
Dublin City University
Application filed by Dublin City University
Publication of WO2005124686A1


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85 Assembly of content; Generation of multimedia applications
    • H04N21/854 Content authoring
    • H04N21/8543 Content authoring using a description language, e.g. Multimedia and Hypermedia information coding Expert Group [MHEG], eXtensible Markup Language [XML]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/90 Determination of colour characteristics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10 Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B27/19 Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
    • G11B27/28 Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/233 Processing of audio elementary streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25 Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/266 Channel or content management, e.g. generation and management of keys and entitlement messages in a conditional access system, merging a VOD unicast channel into a multicast channel
    • H04N21/2662 Controlling the complexity of the video stream, e.g. by scaling the resolution or bitrate of the video stream based on the client capabilities
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/84 Generation or processing of descriptive data, e.g. content descriptors
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845 Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456 Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10024 Color image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20021 Dividing image into blocks, subimages or windows
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30221 Sports video; Sports image
    • G06T2207/30228 Playing field

Abstract

A system (101) and method are provided for extracting semantic events from a digitised audio-visual sequence (504) depicting sporting events. The system (101) comprises digitised audio-visual sequence input means (206, 207, 211, 212), memory means (209), data processing means (208), and data outputting means (206, 207, 211, 212). Said memory means (209) stores instructions (503), which configure said processing means (208) to filter (304) said digitised audio-visual sequence (504) according to first image data processing criteria (403) to generate first output audio-visual data (506, 507). Said instructions (503) also configure said processing means (208) to process (701, 702, 703) said first output audio-visual data (506, 507) to generate second audio-visual data processing criteria (809, 914, 1016, 1207) and filter (704) said first output audio-visual data (506, 507) according to said second audio-visual data processing criteria to generate second output audio-visual data (513). Said instructions (503) also configure said processing means (208) to output (306) said second output audio-visual data (513) with said data outputting means.

Description

Title
Audio-Visual Sequence Analysis
Field of the Invention
The present invention relates to the analysis of audio-visual sequences. More particularly, the present invention relates to a system and method for detecting temporal events, or semantics, recorded in a digitised audio-visual sequence.
Background to the Invention
Modern developments in digital video compression technologies allow extensive archiving of audio-visual content. However, limited bandwidth, limited battery life and the disparity of device configurations for archive storage and reproduction, alone or in combination, impede the development of handheld wireless video applications built upon such archives. In this context, applications for highlighting or summarising such content play an increasingly crucial role.
By way of example, sports video analysis is a particularly buoyant area in the field of digital video summarisation. Demand for instant event summarisation is very high from avid sports followers equipped with handheld wireless devices of all types, but the costs associated with the purchase of such devices and the continued provision of data services thereto severely limit the amount of distributable data at any given time. Moreover, sporting events which are economically important enough to warrant regular capture on video are typically field-type sports having an extended play duration, of the order of one and a half to two hours for soccer, for instance. Having regard to the technological issues outlined above, it would be highly impractical to broadcast an entire event to said handheld devices. Therefore, the summarisation of any semantic event, such as a goal having been scored, and the broadcast thereof to handheld wireless devices, whether as a simple Short Message Service notification or as a short field or stream of audio-visual data, such as an MPEG file, is particularly advantageous.
Problems nonetheless persist in sports video analysis for applications as described above. Firstly, broadcast video styles vary dramatically between sports types: most research work in the area of sports video analysis is specific to a certain sport and therefore provides sport-specific applications, the underlying solutions of which may not be easily reusable across a broad spectrum of different sports. Secondly, the requirement for the instant provision of summaries is computationally expensive to meet. Indeed, in prior art documents WO 2004/014061, US2004017389 and WO0223891, the respective authors present techniques for the detection of semantic events, particularly goals, from soccer video. However, while attaining reasonable results, central to their solutions are complex algorithms performing tracking of soccer-specific objects such as the ball, players, referee, field lines, goalposts, etc. Few have successfully investigated the challenge of developing a solution or scheme that reveals the common structures of multiple events across multiple domains.
Object of the Invention
A system and method are therefore required for analysing and extracting semantic events from generic field-sport audio-visual footage, generally for optimising the process of content archiving and particularly for minimising, with a less computationally-intensive analysis, the latency between the capture of the real-time event and the distribution of the archived, most relevant audio-visual data representing said event to remote recipients. This more efficient system and method should preferably be adaptable to analyse and extract semantic events from multiple types of field-sport audio-visual content.
Summary of the Invention
According to an aspect of the present invention, a system is provided for extracting semantic events from a digitised audio-visual sequence depicting sporting events, which comprises digitised audio-visual sequence input means, memory means, data processing means, and data outputting means. Said memory means stores instructions, which configure said processing means to filter said digitised audio-visual sequence according to first image data processing criteria to generate first output audio-visual data. Said instructions also configure said processing means to process said first output audio-visual data to generate second audio-visual data processing criteria and to filter said first output audio-visual data according to said second audio-visual data processing criteria to generate second output audio-visual data. Said instructions also configure said processing means to output said second output audio-visual data with said data outputting means.
According to another aspect of the present invention, a method is provided for extracting semantics from a digitised audio-visual sequence depicting sporting events. Said method comprises the step of filtering said digitised audio-visual sequence according to first image data processing criteria to generate first output audio-visual data. Said method also comprises the steps of processing said first output audio-visual data to generate second audio-visual data processing criteria and filtering said first output audio-visual data according to said second audio-visual data processing criteria to generate second output audio-visual data. Said method also comprises the step of outputting said second output audio-visual data with data outputting means.
Brief Description of the Drawings
Figure 1 shows a preferred embodiment of the present invention in an environment, including at least one audio-visual data processing terminal and at least one remote terminal;
Figure 2 provides an example of the audio-visual data processing terminal, which includes processing means, memory means and communicating means;
Figure 3 details the processing steps according to which the audio-visual data processing device of Figures 1 and 2 operates, including steps of loading audio-visual data at runtime, processing loaded data and broadcasting processed data;
Figure 4 further details the audio-visual data loading step shown in Figure 3;
Figure 5 illustrates the contents of the memory means shown in Figure 2 further to the audio-visual data loading step shown in Figures 3 and 4, including audio-visual data processing instructions and a plurality of buffers;
Figure 6 shows audio-visual data de-multiplexed into audio and image buffers shown in Figures 4 and 5;
Figure 7 further details the audio-visual data processing step shown in Figure 3, including a step of processing audio data, steps of processing image data with functions and a step of analysing processed audio-visual data;
Figure 8 further details a first processing function performed upon audio data shown in Figure 7, including a step of submitting output data to a module;
Figure 9 further details a second processing function performed upon image data shown in Figure 7, including a step of submitting output data to a module;
Figure 10 further details a third processing function performed upon image data shown in Figure 7, including a step of submitting output data to a module;
Figure 11 provides an example of visual data divided into quadrants according to the step shown in Figure 10;
Figure 12 further details a fourth processing function performed upon image data shown in Figure 7, including a step of submitting output data to a module;
Figure 13 provides an example of visual data divided into quadrants according to the step shown in Figure 12; and
Figure 14 further details the step of analysing output data shown in Figures 8 to 13.
Detailed Description of the Drawings
A networked environment is shown in Figure 1, which includes a system 101 for processing audio-visual (AV) data received from a plurality of network-connected data-generating devices 102, 103. System 101 is also configured for distributing processed AV data to at least one data-receiving device 104, but preferably a plurality 104, 105, 106 thereof, over said network, wherein some of said network connections are wired (107) and some are wireless (108). Network-connected data-generating devices 102, 103 are for instance broadcast-quality cameras located at a sporting venue 109, such as a football stadium, at which they record live sporting action as AV data, which is relayed in real time to system 101 for processing. Network-connected recipients of processed AV data may be equipped with a variety of different data-receiving devices, examples of which include a mobile communication device 104 configured with wireless networking means conforming to the cellular GPRS protocol or, alternatively, to the digital Bluetooth® protocol; a personal digital assistant computer ('PocketPC') 105 configured with wireless LAN means similar to device 106 or, alternatively, with wireless networking means conforming to the cellular GPRS protocol; and a personal portable computer 106 configured with wireless local area networking (LAN) means conforming to the IEEE 802.11 'Wi-Fi' protocol. Mobile communication device 104 is for instance a mobile phone such as a Nokia 9500 manufactured by the Nokia® Group in Finland. Computer 105 is for instance a Palm m505® manufactured by PalmOne® Inc. of Milpitas, California, USA or a Portable Digital Computer (PDC) such as an IPAQ® manufactured by the Hewlett-Packard® Company of Palo Alto, California, USA. Computer 106 is for instance a laptop computer such as a Portege R100® manufactured by Toshiba Ltd of Kawasaki, Japan, or a tablet PC such as a Versa T400 manufactured by NEC Ltd of Kawasaki, Japan. All such recipient data-receiving devices are generally configured with processing means, output data display means, memory means, input means and wired or wireless network connectivity. Wired networks generally include a Local Area Network (LAN) or a Wide Area Network (WAN) and, in the example, connections 107 are WAN connections to the Internet 110. Wireless networks generally include cellular, microwave and radio telecommunications networks and, in the example, cameras 102 and 103 are connected to system 101 by way of a radio network within which live input AV data is relayed with a satellite 111. Processed AV data output by system 101 is relayed to remote device 106 by way of a wireless LAN connection (108) to a wired, WAN-connected router 112. Processed AV data output by system 101 is relayed to remote devices 104, 105 as a digital signal by the geographically-closest communication link relay 113 of a plurality thereof. Said plurality of communication link relays allows said digital signal to be routed between any data source, such as system 101, and its intended recipients 104, 105 by means of a remote gateway 114. Gateway 114 is for instance a communication network switch coupling digital signal traffic between wireless telecommunication networks, such as the network within which wireless data transmission takes place, and a wide area network (WAN), an example of which is the Internet 110. Said gateway 114 further provides protocol conversion if required, for instance if the mobile device 104 uses Wireless Application Protocol (WAP) and mobile device 105 uses HyperText Markup Language (HTML). Within the composite network described in Figure 1, there is therefore provided the scope for any one of the data-generating devices 102, 103 to distribute audio-visual data to any one of the data-receiving devices 104, 105, 106 over said networks.
An example of the system 101 at the media content provider shown in Figure 1 is provided in Figure 2. System 101 is a computer terminal configured with a data processing unit 201, data outputting means such as video display unit (VDU) 202, data inputting means such as a keyboard 203 and a pointing device (mouse) 204 and data inputting/outputting means such as network connections 107, 108, magnetic data-carrying medium reader/writer 206 and optical data-carrying medium reader/writer 207.
Within data processing unit 201, a central processing unit (CPU) 208, such as an Intel Pentium 4 manufactured by the Intel Corporation, provides task co-ordination and data processing functionality. Instructions and data for the CPU 208 are stored in main memory 209, and a hard disk storage unit 210 facilitates non-volatile storage of data and data processing instructions. A modem 211 provides the wired connection 107 to the Internet 110; alternatively, said wired connection 107 to the Internet 110, as well as network connection 108, may be provided through a Local Area Network (LAN) network card 212. A universal serial bus (USB) input/output interface 213 facilitates connection to the keyboard and pointing device 203, 204. All of the above devices are connected to a data input/output bus 214, to which said magnetic data-carrying medium reader/writer 206 and optical data-carrying medium reader/writer 207 are also connected. A video adapter 215 receives CPU instructions over said bus 214 for outputting processed data to VDU 202.
In the embodiment, data processing unit 201 is of the type generally known as a compatible Personal Computer ('PC'), but may equally be any device configured with data inputting, processing and outputting means providing at least the functionality described above. Any such device may include, but is not limited to, an Onyx 4®, Fuel® or Tezro® workstation manufactured by Silicon Graphics Inc. of Mountain View, California, USA, or a PowerMac® or iMac® computer manufactured by the Apple® Corporation of Cupertino, California, USA.
Processing steps according to which system 101 operates are described in Figure 3. System 101 is first switched on at step 301. At step 302, the operating system is loaded, which provides said system 101 with basic functionality, such as initialisation of data input and/or output devices, data file browsing, keyboard and/or mouse input processing, video data outputting, network connectivity and network data processing. At step 303, an application is loaded into memory 209, which is a set of instructions for configuring CPU 208 to process data according to rules described hereafter. Said application may be loaded from HDD 210, from an optical storage medium read by optical storage media reading device 207, from a magnetic storage medium read by magnetic storage media reading device 206, or from a remote location across networks 107, 108 through network card 212. Audio and video data is next loaded at step 304, which will be further described hereinbelow, and which is obtained via WAN connection 107 or radio networking connection 108 from cameras 102, 103. In the particular context of AV data depicting sporting events, empirical evidence suggests that immediately following a semantic event, such as a goal scored during a soccer match, the characteristics of the data content typically include a close-up shot of the player(s) and/or relevant parties involved, an increase in audio activity, particularly in the voice band corresponding to commentator vocals, a camera shot of the crowd celebrating, and a surge in motion activity as the camera captures the intense celebratory behaviour of the goal scorer. Such events may occur only sporadically throughout the duration of the content, but it is proposed that when they are found occurring within closely bound localities, an increase in the probability that a goal has been scored may be deduced. According to the present invention, low-level audio-visual evidence is first gathered from soccer video, which is subsequently mapped to higher-level content features based on relevance to goal scores, including: motion activity tracking, close-up shot modelling, crowd shot modelling, and audio speech-band energy. These features are extracted and quantified by processing input AV data 504 and are submitted to a support vector machine, which determines a binary condition of a goal having been scored or not.
At the next step 305, therefore, system 101 processes the data loaded at step 304 according to the present invention, for outputting processed AV data across said connection 108 to recipient devices 104, 105 and 106 at step 306. A question is asked at step 307 as to whether further AV data should be processed or, in an alternative embodiment of the present invention, whether further AV data has been received whilst steps 304 to 306 were performed according to the invention. If the question of step 307 is answered positively, control proceeds back to step 304, wherein said additional AV data is loaded, processed and output, and so on and so forth. The question of step 307 is eventually answered negatively, whereby the application loaded at step 303 may be closed and unloaded from memory 209, and the data processing unit 201 may eventually be switched off at the next step 309.
Figure 4 further details the audio-visual data loading step 304. Prior to the feature extraction stage, it is desirable to pre-filter the AV data 504, such that only those frames that correspond to global (main) camera perspectives of cameras 102, 103 are processed. The reason for this is that it is generally only global shots that have the potential to contain a goal. AV data is preferably formatted as an MPEG data file, which may be either downloaded in its entirety across the network in order to obtain file header data representing required processing parameters, or preferably streamed across said network in order to process AV data packets as they arrive, thus potentially configuring system 101 for real-time processing of real-time distributed AV data. At step 401, the application loaded at step 303 therefore processes metadata of the incoming MPEG sequence in order to obtain AV data processing parameters, which include encoding parameters, frame size, frame type and frequency, and audio track band configuration. Typically, an MPEG sequence contains entire image frames known as I-frames, partial image frames known as P-frames and extrapolated image frames known as B-frames. The frequency, or cycle, at which said I, P and B-frames appear in the sequence will depend upon the number of frames per second required to output seamless and artefact-free video. To achieve the desired input AV data discrimination, grass regions are segmented from the images. Grass colour clusters well in the hue space [60°-100°], and therefore it may be easily isolated from other colours in the images. For each image a value representing the grass coloured pixel ratio (GCPR) is calculated. This value represents the ratio of the number of pixels that fall into the grass-coloured interval to the overall number of pixels in the region. Upon determining said cycle according to step 401, the application selects an I-frame at step 402 and, at the next step 403, processes each picture screen element, or pixel, thereof, each such pixel having a colour defined by three component red, green and blue values, each such value ranging between zero and 255. Said processing is performed only on said I-frames because these are the only frames within the input AV data having full colour information for each pixel therein. Said processing is performed in order to obtain a normalised Grass Colour Pixel Ratio for the frame, wherein said GCPR determines that the frame depicts a global view of the sporting action if its normalised value equals or exceeds an empirical threshold value. According to the present invention, the GCPR processing of step 403 is preferably performed in order to delineate the portions of the input AV data of interest for further processing, so as to avoid the unnecessary overhead of processing the entire sequence. At step 404 a question is therefore asked as to whether said normalised frame GCPR equals or exceeds said empirical value. If the question of step 404 is answered positively, the application de-multiplexes the AV input data at step 405 in order to buffer separately the audio track and the I and P-frames. In the preferred embodiment of the present invention, said audio track and I-frames are buffered for a sequence length of eighteen seconds and said P-frames are buffered for a sequence length of thirty-five seconds.
Further to the buffering step 405, control proceeds back to step 402 in order to select and GCPR-process the next available I-frame and repeat the process if appropriate, whilst control also proceeds to the next AV data processing step 305, wherein said processing is performed upon said buffered AV data. Alternatively, the question of step 404 is answered negatively, whereby control also proceeds back to step 402.
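By way of illustration only, the following Python sketch shows one way the GCPR test of steps 402 to 404 could be realised, assuming decoded I-frame pixels are already available as an RGB array (the decoding itself is not shown). The hue interval is the [60°-100°] range quoted above; the 0.5 threshold default is a placeholder for the empirical value, which the description deliberately leaves open.

```python
import numpy as np

GRASS_HUE_RANGE = (60.0, 100.0)   # hue interval quoted in the description, in degrees

def grass_colour_pixel_ratio(rgb_frame: np.ndarray) -> float:
    """Return the fraction of pixels whose hue falls inside the grass interval.

    rgb_frame is an (H, W, 3) uint8 array of red, green and blue components.
    """
    rgb = rgb_frame.astype(np.float32) / 255.0
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    c_max = rgb.max(axis=-1)
    c_min = rgb.min(axis=-1)
    delta = c_max - c_min

    # Standard RGB-to-hue conversion, hue expressed in degrees [0, 360)
    hue = np.zeros_like(c_max)
    mask = delta > 1e-6
    r_max = mask & (c_max == r)
    g_max = mask & (c_max == g) & ~r_max
    b_max = mask & ~r_max & ~g_max
    hue[r_max] = (60.0 * ((g - b)[r_max] / delta[r_max])) % 360.0
    hue[g_max] = 60.0 * ((b - r)[g_max] / delta[g_max]) + 120.0
    hue[b_max] = 60.0 * ((r - g)[b_max] / delta[b_max]) + 240.0

    lo, hi = GRASS_HUE_RANGE
    grass_pixels = np.count_nonzero((hue >= lo) & (hue <= hi))
    return grass_pixels / hue.size

def is_global_view(rgb_frame: np.ndarray, threshold: float = 0.5) -> bool:
    """Step 404: retain the shot only if the normalised GCPR meets the empirical threshold.

    The 0.5 default is a placeholder; the description leaves the threshold as an empirical value.
    """
    return grass_colour_pixel_ratio(rgb_frame) >= threshold
```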
Figure 5 illustrates the contents of the memory 209 shown in Figure 2 further to the audio-visual data loading steps 304 and 401 to 405. An operating system is shown at 501 which, in the embodiment, is Windows™ XP™ Professional™ manufactured and licensed by the Microsoft Corporation of Redmond, Washington, USA. It will be understood by those skilled in the art that said OS is not limitative and, having regard to the alternative terminals described hereinabove, may be OSX™ provided by Apple™, PalmOS™ provided by PalmOne™ or LINUX, which is freely distributed. Network communication instructions are shown at 502, which may be integrated with the OS 501 or specific to the wireless radio network 109, and generally configure data processing unit 201 of system 101 to both receive and broadcast input and output AV data respectively over said wired and wireless networks. An application is shown at 503, which integrates the set of processor instructions according to which audio-visual data is distributed as described herein. Input AV data is shown at 504 which, in the example, is a streamed MPEG sequence but may be any other sequence of digitised AV data having a different encoding format, such as an AVI sequence, a DIVX or XVID sequence, a Real Media sequence or a Windows Media Video sequence, or other such network-distributed AV data, whether multiplexed or not. The metadata processed by the application 503 at step 401 is shown at 505 and the first output audio-visual data processed at step 405 is shown as buffered video data and buffered audio data at 506 and 507. The GCPR data of step 403 is shown at 508 and processing parameters are shown at 509, 510, 511 and 512 for final output AV data, which is shown at 513. Said processing parameters 509 to 512 are four normalised vectors: a "speech band energy level" output vector V1, a "visual motion activity level" output vector V2, a "crowd image confidence level" output vector V3 and a "close-up image confidence level" output vector V4 respectively. Second output audio-visual data 513 includes first output audio-visual data processed, selected and re-multiplexed by application 503 according to steps 304 to 306.
Figure 6 shows audio-visual data de-multiplexed into audio and image buffers 506, 507 according to step 405. A portion 601 of an MPEG sequence 504, 505 is shown as a cyclical succession of I-frames 602, P-frames 603 and B-frames 604. In accordance with the present description, a first I-frame 602 is stored in a first portion 605 of video data buffer 506 because it has a high GCPR parameter value, since the field of view of the camera 102 or 103 is very wide and mostly trained upon the playing field. Consequently, the input AV data stream is de-multiplexed into buffers 506, 507 as soon as said high GCPR parameter value is detected, wherein eighteen seconds of the audio track 606 are stored in buffer 507. Likewise, I-frames 607, 608 and 609, which are subsequent to I-frame 602 over the duration of the buffered sequence, are also stored in portion 605 of buffer 506, and a plurality of P-frames 603 are stored in a second portion 610 of buffer 506 for thirty-five seconds.
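The rolling buffers of Figures 5 and 6 can be sketched as the following illustrative data structure. The frame and audio record types, and the way the MPEG stream is parsed into them, are assumptions made for the example; the eighteen-second and thirty-five-second windows are those stated in the description.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque

I_WINDOW_S = 18.0   # audio track and I-frames buffered for eighteen seconds (buffer 507, portion 605)
P_WINDOW_S = 35.0   # P-frames buffered for thirty-five seconds (portion 610)

@dataclass
class Frame:
    kind: str         # 'I', 'P' or 'B'
    timestamp: float  # presentation time in seconds
    payload: bytes

@dataclass
class AudioChunk:
    timestamp: float
    payload: bytes

class DemuxBuffers:
    """Rolling buffers mirroring video buffer 506 (I and P portions) and audio buffer 507."""

    def __init__(self) -> None:
        self.i_frames: Deque[Frame] = deque()
        self.p_frames: Deque[Frame] = deque()
        self.audio: Deque[AudioChunk] = deque()

    def push_frame(self, frame: Frame) -> None:
        if frame.kind == 'I':
            self.i_frames.append(frame)
            self._trim(self.i_frames, I_WINDOW_S, frame.timestamp)
        elif frame.kind == 'P':
            self.p_frames.append(frame)
            self._trim(self.p_frames, P_WINDOW_S, frame.timestamp)
        # B-frames are not buffered: only I and P-frames are retained by step 405

    def push_audio(self, chunk: AudioChunk) -> None:
        self.audio.append(chunk)
        self._trim(self.audio, I_WINDOW_S, chunk.timestamp)

    @staticmethod
    def _trim(buffer, window_s: float, now: float) -> None:
        # Drop anything older than the buffer's window so a rolling sequence is kept
        while buffer and now - buffer[0].timestamp > window_s:
            buffer.popleft()
```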
The step 305 of processing input AV data 602, 603 and 606 to 609 stored in buffers 506, 507 is further detailed in Figure 7. At step 701, application 503 processes the de-multiplexed audio data stored in buffer 507 in memory 209, whereby the first "speech band energy level" vector 509 is output. At step 702, application 503 processes the de-multiplexed P-frame video data stored in portion 610 of buffer 506 in memory 209, whereby the second "visual motion activity level" vector 510 is output. At step 703, application 503 processes the de-multiplexed I-frame video data stored in portion 605 of buffer 506 in memory 209, whereby the third "crowd image confidence level" and fourth "close-up image confidence level" vectors 511, 512 are respectively output. At step 704, application 503 processes all four vectors 509 to 512 in order to determine whether the buffered input AV data 504 represents a semantic event, i.e. whether the buffered audio data and video data depict a goal having been scored on the playing field during the sporting event. A question is therefore asked at step 705 as to whether the processing of step 704 identifies such an event. If the question of step 705 is answered positively, control proceeds to step 306, which will be further described hereinbelow and wherein input AV data depicting a semantic event is broadcast to recipient devices 104 to 106. Alternatively, the question of step 705 is answered negatively, whereby application 503 clears buffers 506, 507 at step 706 and control returns to step 701, such that application 503 may process the next input AV data stored according to step 405.
The step 701 of processing audio data 606 buffered in buffer 507 in order to obtain the first "speech band energy level" output vector V1 is further detailed in Figure 8. Speech-band energy is estimated by examining the scalefactor weights of the encoded audio bitstream. A fundamental component of MPEG audio bitstreams is the scalefactor. Scalefactors are variables that normalise groups of audio samples such that they use the full range of the quantiser. The scalefactor for such a group is determined by the next largest value (in a lookup table) to the maximum of the absolute values of the samples. Hence, scalefactors provide an indication of the maximum power (volume) of a group of samples. The scalefactors may be individually extracted from any of 32 equally spaced frequency subbands, which uniformly divide up the 0-20 kHz audio bandwidth. Therefore, they provide an efficient audio filtering and envelope tracking technique. Subbands 2-7 correspond to the frequency range 0.6 kHz - 4.4 kHz, which approximates that of the speech band. Therefore, manipulation of scalefactors from these subbands provides for the establishment of a speech-band energy profile of the audio data. At step 801, application 503 initialises a counter SV of the scalefactor values which will be extracted from the scalefactor data within audio data 606. At the next step 802, application 503 selects said audio data 606, which is an audio track containing thirty-two audio bands reproducing sounds within a zero to 20 kilohertz frequency spectrum, such that six bands thereof, sub-bands two to seven, reproducing sounds within a 0.6 to 4.4 kilohertz frequency spectrum corresponding to the human voice, may be extracted therefrom at the next step 803.
Further to performing the extraction of step 803, application 503 then extracts scalefactor data at step 804 in order to process and temporarily store a root mean square scalefactor value (RMS SV) at step 805. At step 806, application 503 then increments the counter initialised at step 801 by one. A question is asked at step 807 as to whether another scalefactor value has been extracted at step 804, whereby if the question of step 807 is answered positively, control proceeds back to step 805 to process and store the RMS thereof and increment the counter according to step 806. The question of step 807 is eventually answered negatively, i.e. when all scalefactor data has been processed, and, at step 808, the first "speech band energy level" output vector V1 is processed as the sum of the stored RMS values of step 805 divided by the number of values counted by the counter of steps 801 and 806.
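A compact sketch of this V1 computation (steps 801 to 808) might look as follows, assuming the scalefactors have already been parsed from the MPEG audio bitstream into an array indexed by audio frame and subband; the parsing itself and the exact subband indexing convention are assumptions of the example.

```python
import numpy as np

SPEECH_SUBBANDS = slice(1, 7)   # subbands 2-7 of 32 (zero-based indices 1..6), roughly 0.6-4.4 kHz

def speech_band_energy_level(scalefactors: np.ndarray) -> float:
    """Compute output vector V1 as the mean of per-frame RMS speech-band scalefactor values.

    scalefactors: array of shape (n_audio_frames, 32), one scalefactor per subband of the
    0-20 kHz MPEG audio filter bank, covering the eighteen-second audio buffer 507.
    """
    speech = scalefactors[:, SPEECH_SUBBANDS].astype(np.float64)
    rms_per_frame = np.sqrt(np.mean(speech ** 2, axis=1))   # step 805: RMS scalefactor value per frame
    return float(rms_per_frame.mean())                      # step 808: sum of RMS values / number counted
```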
The step 702 of processing P-frame video data 603 buffered in portion 610 of buffer 506 in order to obtain the second "visual motion activity level" output vector V2 is further detailed in Figure 9. Application 503 selects the first buffered P-frame 603 at step 901 and initialises a macroblock counter and a Non-Zero Vector Value (NZVV) counter at the next step 902. The NZVV is calculated by counting up the number of macroblocks in the frame whose motion vector length is greater than an empirically selected threshold. This value is normalised such that it lies between zero and one. Only the motion vectors from P-macroblocks are considered, since I-macroblocks, although they possess zero-length motion vectors, do not represent zero motion. Secondly, the mean overall length value (MOLV) is calculated by an averaged superposition of all the motion vectors in the frame. This value is also normalised to lie between zero and one. These two statistics are calculated for each P-frame. A visual motion activity measure is calculated for each image by finding the average of its associated NZVV and MOLV values. A higher value indicates increased visual activity.
The first macroblock of sixteen pixels by sixteen pixels of said selected buffered P-frame is therefore selected at step 903 and the motion vector length L thereof is extracted and temporarily stored at the next step 904. A question is asked at step 905 as to whether said length L equals or exceeds an empirical threshold value. If the question of step 905 is answered positively, the NZVV counter is incremented at step 906, then the macroblock counter is incremented at step 907. Alternatively, the question of step 905 is answered negatively, whereby control proceeds directly to step 907 in order to increment the macroblock counter only. Thereafter, a question is asked at step 908 as to whether the macroblock counter has the value of 396, which is the total number of sixteen-pixel by sixteen-pixel macroblocks in a standard MPEG1 image frame having a width of 352 pixels and a height of 288 pixels. If the question of step 908 is answered negatively, control returns to step 903, whereby the next macroblock is selected, for instance by using a first macroblock pixel offset based upon the X,Y co-ordinates of the first pixel of said next macroblock. Eventually, all 396 macroblocks of the P-frame 603 are processed according to steps 903 to 908 and said question of step 908 is answered positively, whereby the value of the NZVV counter is normalised at step 909 as N1. At step 910, application 503 processes the average L1 of the stored motion vector lengths L of the frame, which it also normalises. At step 911, application 503 processes the average An of said normalised values N1 and L1, which is temporarily stored, whereby the stored motion vector lengths of step 904 are cleared from memory 209. A question is subsequently asked at step 912 as to whether another P-frame 603 remains to be processed in buffer portion 610. If the question of step 912 is answered positively, the reference An is incremented as An+1 at step 913 and control returns to step 901, in order to select the next buffered P-frame 603, initialise the counters and eventually arrive at another average An+1 of normalised values N2 and L2, and so on and so forth, until all P-frames 603 have been processed, whereupon the question of step 912 is answered negatively and, at step 914, application 503 outputs said second "visual motion activity level" output vector V2 as the average of said An averages.
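For illustration, the per-frame motion statistics of Figure 9 reduce to the sketch below, assuming the motion vectors of the P-macroblocks have been extracted from the compressed P-frames beforehand. Two readings are hedged here: the MOLV is taken as the mean vector length normalised by a supplied maximum, which is one plausible interpretation of the "averaged superposition" wording, and the NZVV is normalised by the number of vectors supplied rather than by the fixed 396 macroblocks.

```python
import numpy as np

def motion_activity(p_frame_vectors: list[np.ndarray],
                    length_threshold: float,
                    max_vector_length: float) -> float:
    """Compute output vector V2 over the buffered P-frames.

    p_frame_vectors: one (n_macroblocks, 2) array of motion vectors per P-frame, containing
    the P-macroblock vectors only (I-macroblocks are excluded, as the description requires).
    """
    per_frame_averages = []
    for vectors in p_frame_vectors:
        lengths = np.linalg.norm(vectors, axis=1)
        # Non-Zero Vector Value: fraction of macroblocks whose vector meets the threshold (steps 905-909)
        nzvv = np.count_nonzero(lengths >= length_threshold) / len(lengths)
        # Mean Overall Length Value: mean vector length, normalised to lie in [0, 1] (step 910)
        molv = min(float(lengths.mean()) / max_vector_length, 1.0)
        per_frame_averages.append((nzvv + molv) / 2.0)        # step 911: An = (N + L) / 2
    return float(np.mean(per_frame_averages))                 # step 914: average of the An values
```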
The step 703 of processing I-frame video data 602 buffered in portion 605 of buffer 506 in order to obtain the third "crowd image confidence level" output vector V3 is further detailed in Figure 10. Crowd image detection is performed by exploiting the inherent characteristic that such images are intrinsically detailed, in the context of an ordinarily non-complex image environment. Discrimination between detailed and non-detailed pixel blocks is made by examining the number of non-zero frequency (AC) Discrete Cosine Transform (DCT) coefficients used to represent the data in the frequency domain. It may be assumed that an (8x8) pixel block which is represented by very few AC-DCT coefficients contains consistent, non-detailed data, whereas a block which requires a considerable number of AC-DCT coefficients for its representation may comprise relatively more detailed information. In the context of sporting events, and the soccer video content of the example, the majority of images capture relatively sizeable monochromatic, homogeneous regions, e.g. the grassy pitch or a player's jersey. Therefore, in the context of this limited environment, crowd images are isolated by simply detecting these very high, uniform frequency images. At step 1001, application 503 selects the first I-frame in buffer portion 605 and, at step 1002, divides said frame into four quadrants Q1, Q2, Q3 and Q4 of 176 pixels by 144 pixels each, again corresponding to a standard MPEG1 frame of 352 pixels by 288 pixels. At step 1003, the first quadrant Q1 is selected and the first block of eight pixels by eight pixels thereof is selected at the next step 1004.
At step 1005, application 503 processes the luma AC-DCT coefficient count LC of the selected block, such that a question may be asked at step 1006 as to whether the LC value equals or exceeds an empirical threshold value. If the question of step 1006 is answered positively, a block counter is incremented at step 1007, whereby a question is asked at step 1008 as to whether another block of eight pixels by eight pixels should be processed in said selected quadrant, for instance by using the offset and block counting methods described above. Alternatively, the question of step 1006 is answered negatively and control proceeds to said question of step 1008 which, if answered positively, returns control to step 1004 and the selection and subsequent processing of the next block, and so on and so forth. The question of step 1008 is eventually answered negatively when all blocks of the selected quadrant have been processed, whereby the standard deviation of the quadrant is processed and temporarily stored at step 1009. Another question is asked at the next step 1010 as to whether another quadrant Qn remains to be processed according to steps 1003 to 1009. If the question of step 1010 is answered positively, control proceeds back to the quadrant selection step 1003 and the eventual processing of the standard deviation thereof according to step 1009.
Eventually, all blocks of all quadrants Q1 to Q4 have been processed and the question of step 1010 is answered negatively, whereby application 503 processes the mean of the incremented block counter values at step 1011, representing the mean number of high-frequency blocks, and processes the average of the standard deviations of the quadrants Q1 to Q4 at step 1012. At step 1013, a vector parameter An is calculated as the mean value of step 1011 minus the average value of step 1012. A question is then asked at step 1014 as to whether another I-frame 602, e.g. frame 607, remains to be processed within buffer portion 605. If the question of step 1014 is answered positively, the reference An is incremented as An+1 at step 1015 and control returns to step 1001. The question of step 1014 is eventually answered negatively when all I-frames have been processed according to steps 1001 to 1015, whereby application 503 outputs said third "crowd image confidence level" output vector V3 as the average of said An vector parameters.
An example of an I-frame 602 stored in buffer portion 605 of memory 209 is shown in Figure 11, which is divided according to step 1002 into four quadrants Q1 1101, Q2 1102, Q3 1103 and Q4 1104. For each quadrant 1101 to 1104 of the image frame, the AC-DCT coefficients of every (8x8) luminance pixel block 1105 are analysed. If the number of AC-DCT coefficients LC used to encode such a block is greater than an empirically selected threshold, it can be said that the block represents reasonably complex data, such as a crowd depicted with a variety of R, G and B pixel colour values, and it is counted according to step 1007. Thus a value is obtained representing the number of high-frequency blocks, per total number of blocks, for each quadrant. In the example, the question of step 1008 is answered negatively when the last block 1106 of the quadrant 1101 is processed, whereby the standard deviation of the quadrant 1101 is processed and temporarily stored at step 1009. The question of step 1010 is then answered positively, whereby quadrant 1102 is next selected and the blocks 1107 thereof are processed, and so on and so forth. The vector parameter An is eventually calculated for I-frame 602, whereby the question of step 1014 may then be answered positively in order to select the next I-frame 607 for processing, until such time as application 503 outputs said third "crowd image confidence level" output vector V3 as the average of said An vector parameters of I-frames 602, 607, 608 and 609.
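A sketch of the quadrant-based crowd measure of Figures 10 and 11 follows, assuming the count of non-zero AC-DCT coefficients for each (8x8) luminance block of an I-frame has already been recovered from the bitstream as a 2-D array. The coefficient threshold is a placeholder for the empirical value, and the "standard deviation of the quadrant" is read here as the spread of the per-block high-frequency indicator within each quadrant; other readings of step 1009 are possible.

```python
import numpy as np

def crowd_confidence(ac_counts_per_frame: list[np.ndarray],
                     coeff_threshold: int) -> float:
    """Compute output vector V3 over the buffered I-frames.

    ac_counts_per_frame: one (blocks_y, blocks_x) array per I-frame, each entry being the
    number of non-zero AC-DCT coefficients used to encode that (8x8) luminance block.
    """
    frame_scores = []
    for counts in ac_counts_per_frame:
        rows, cols = counts.shape
        quadrants = [counts[:rows // 2, :cols // 2], counts[:rows // 2, cols // 2:],
                     counts[rows // 2:, :cols // 2], counts[rows // 2:, cols // 2:]]
        ratios, spreads = [], []
        for quadrant in quadrants:
            is_high = (quadrant >= coeff_threshold).astype(np.float64)   # steps 1005-1007
            ratios.append(is_high.mean())    # high-frequency blocks per total blocks in the quadrant
            spreads.append(is_high.std())    # step 1009, under the reading stated above
        # Steps 1011-1013: mean high-frequency ratio minus the averaged quadrant spread
        frame_scores.append(float(np.mean(ratios) - np.mean(spreads)))
    return float(np.mean(frame_scores))      # average of the An parameters (step 1016)
```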
The step 703 of processing I-frame video data 602 buffered in portion 605 of buffer 506 in order to obtain the fourth "close-up image confidence level" output vector V4 is further detailed in Figure 12. Close-up image detection is performed by detecting the presence of skin-coloured pixels (i.e. a face), and the occlusion of a substantially uniformly-coloured pixel background, such as the green grass of the playing field, by a single homogeneous monochromatic region (i.e. the player's torso). In general, for field-sports video, the bottom regions of the images tend to capture some of the grass playing field irrespective of camera perspective. This knowledge can therefore be used, without assuming predictability of the image background as a whole, but rather at a localised level. In the present soccer example, it has been shown that both grass colour and skin colour cluster well in the hue space ([60°-100°] and [10°-55°] respectively), and therefore may be easily discriminated from other colours in the images.
At step 1201, application 503 delineates a first region R1 in the first buffered I-frame 602, as an 80 pixel by 80 pixel sub-division thereof substantially centred in relation to the vertical median of said frame and substantially centred in relation to the horizontal median of the upper half of said frame. At step 1202, application 503 next delineates three regions R2, R2' and R2" in said first buffered I-frame 602 by dividing the lower half of said I-frame into three substantially equal and adjoining portions, wherein region R2' is substantially centred in relation to the vertical median of said lower half of the frame. It will be readily understood by those skilled in the art that the respective pixel dimensions of R1, R2, R2' and R2" are described herein by way of example only and may vary to a lesser or larger extent, as well as vary in location within the frame, especially having regard to the applicability of the present invention to a vast range of sports audio-visual data. At step 1203, application 503 processes a first parameter 'a' as the Skin Colour Pixel Ratio (SCPR) of frame region R1, substantially as described in relation to the processing of the GCPR described in Figure 4, but wherein processing parameters are provided for identifying skin tone colours, as opposed to grass tone colours. At said step 1203 still, application 503 processes a second parameter 'b' as the Grass Colour Pixel Ratio (GCPR) of frame regions R2 and R2", substantially as described in relation to the processing of the GCPR described in Figure 4. Finally, at said step 1203, application 503 processes a third parameter 'c' as the Grass Colour Pixel Ratio (GCPR) of frame region R2', substantially as described in relation to the processing of the GCPR described in Figure 4.
Ideally, parameter 'a' should have a value close to one, indicating a close-up on the face of a player in region R1; parameter 'b' should also have a value close to one, indicating grass surrounding the player in close-up; and parameter 'c' should have a value close to zero, as the respective region R2' thereof should depict a close-up on the clothing of said player, preferably having a colour different from green. A vector parameter An is thus calculated at step 1204 as the multiplication by parameter 'a' of the difference between parameters 'b' and 'c'. At step 1205, a question is asked as to whether another I-frame 602, e.g. I-frame 607, remains to be processed within buffer portion 605. If the question of step 1205 is answered positively, the reference An is incremented as An+1 at step 1206 and control returns to step 1201 for the delineation of the next buffered frame and the output of a subsequent vector parameter An+1. The question of step 1205 is eventually answered negatively, whereby application 503 outputs said fourth "close-up image confidence level" output vector V4 as the average of said An vector parameters.
An example of an I-frame 607 stored in buffer portion 605 of memory 209 is shown in Figure 13, which is divided according to steps 1201 then 1202 into four regions R1 1301, R2 1302, R2' 1303 and R2" 1304. As described above, close-up image detection relies on detecting the presence of skin-coloured pixels (i.e. a face) and the occlusion of the substantially uniformly-coloured grass background by a single homogeneous monochromatic region (i.e. the player's torso). In accordance with the present description, R1 1301 is an 80 pixel by 80 pixel sub-division of the upper half 1305 of image frame 607, substantially centred in relation to the vertical median of said frame and substantially centred in relation to the horizontal median of the upper half of said frame. R2 1302, R2' 1303 and R2" 1304 are pixel sub-divisions of I-frame 607, obtained by dividing the lower half 1306 of said I-frame 607 into three substantially equal and adjoining portions, wherein region R2' 1303 is substantially centred in relation to the vertical median of said lower half of the frame. Application 503 processes the Skin Colour Pixel Ratio (SCPR) of frame region R1 1301, wherein processing parameters are provided for identifying skin tone colours 1307, as opposed to grass tone colours 1308. Application 503 processes the Grass Colour Pixel Ratio (GCPR) of frame regions R2 1302 and R2" 1304, wherein processing parameters are provided for identifying grass tone colours 1308. Finally, application 503 processes the Grass Colour Pixel Ratio (GCPR) of frame region R2' 1303, wherein processing parameters are provided for identifying grass tone colours 1308 but the output ratio value is expected to be close to zero.
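The close-up measure of Figures 12 and 13 can be sketched as below. The region geometry follows the description (an 80x80 pixel region R1 centred in the upper half, and the lower half split into thirds R2, R2' and R2"), the hue intervals are those quoted above, and parameter 'b' is taken here as the average of the two outer regions; matplotlib's rgb_to_hsv is used purely as a convenient stand-in for the hue computation.

```python
import numpy as np
from matplotlib.colors import rgb_to_hsv

SKIN_HUE_RANGE = (10.0, 55.0)    # degrees, as quoted in the description
GRASS_HUE_RANGE = (60.0, 100.0)

def hue_ratio(region: np.ndarray, hue_range: tuple[float, float]) -> float:
    """Fraction of the region's pixels whose hue falls inside hue_range (cf. the GCPR of Figure 4)."""
    hue_degrees = rgb_to_hsv(region.astype(np.float64) / 255.0)[..., 0] * 360.0
    lo, hi = hue_range
    return float(np.count_nonzero((hue_degrees >= lo) & (hue_degrees <= hi)) / hue_degrees.size)

def close_up_confidence(i_frames: list[np.ndarray]) -> float:
    """Compute output vector V4 over the buffered I-frames (each an (H, W, 3) RGB uint8 array)."""
    scores = []
    for frame in i_frames:
        h, w, _ = frame.shape
        upper, lower = frame[:h // 2], frame[h // 2:]

        # Region R1: an 80x80 pixel window centred in the upper half of the frame (step 1201)
        cy, cx = upper.shape[0] // 2, w // 2
        r1 = upper[max(cy - 40, 0):cy + 40, max(cx - 40, 0):cx + 40]

        # Regions R2, R2', R2": the lower half divided into three adjoining thirds (step 1202)
        third = w // 3
        r2, r2_centre, r2_far = lower[:, :third], lower[:, third:2 * third], lower[:, 2 * third:]

        a = hue_ratio(r1, SKIN_HUE_RANGE)                     # face in close-up
        b = (hue_ratio(r2, GRASS_HUE_RANGE)
             + hue_ratio(r2_far, GRASS_HUE_RANGE)) / 2.0      # grass either side of the player
        c = hue_ratio(r2_centre, GRASS_HUE_RANGE)             # torso occluding the grass
        scores.append(a * (b - c))                            # step 1204: An = a * (b - c)
    return float(np.mean(scores))                             # step 1207: average of the An parameters
```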
The step 704, at which application 503 processes all four vectors 509 to 512 in order to determine whether the buffered input AV data 504 represents a semantic event, i.e. whether the buffered audio data and video data depict a goal having been scored on the playing field during the sporting event, is further detailed in Figure 14. In accordance with the present description, each retained input AV sequence has an associated 4-dimensional feature vector, with each individual component V1 to V4 pertaining to a particular feature statistic.
At step 1401, application 503 retrieves output vectors V1 to V4 calculated according to steps 701 to 703 from memory 209 and submits each of said vectors to respective value tests in order to declare whether a semantic event took place. The role of V1, the Speech-Band Audio Activity Measure, is to detect and flag the segments that exhibit a noticeable increase in audio activity, particularly in the voice band, corresponding to commentator vocals. These patterns typically occur within a reasonable time after the goal has been scored. Therefore, for 18s after the shot-end-boundary of each shot, the average of the speech-band energy levels is calculated. This value becomes the first component of the feature vectors. At step 1402 a first question is asked as to whether the first "speech band energy level" output vector V1 equals or exceeds an empirical threshold value, which is a normalised value learned by a support vector machine implementing machine learning techniques known to those skilled in the art. In the example, and following a training procedure, said support vector machine infers a goal-score model based on the intelligence acquired from a corpus of 15 hours of soccer content used as the training set. During the training phase, the shot feature vectors were labelled as belonging to the positive class if they contained a goal, and to the negative class otherwise. The resulting trained classifier was tested on a 25-hour test corpus consisting of soccer content from various broadcast sources, distinct from that used in the training phase. If the question of step 1402 is answered negatively, control proceeds directly to step 705, the question of which is answered negatively, buffers 506, 507 are cleared and the next input AV data 504, 505 is processed substantially as hereinbefore described.
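For concreteness, a training-time sketch in the spirit of the support vector machine described above is given below, using scikit-learn as a stand-in implementation; the patent names no particular library, and the feature vectors and labels shown are purely illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# Each retained shot yields a 4-dimensional feature vector [V1, V2, V3, V4]; labels are 1
# where the shot contained a goal, 0 otherwise. The numbers below are illustrative only.
training_vectors = np.array([
    [0.82, 0.74, 0.69, 0.55],   # goal
    [0.31, 0.22, 0.18, 0.05],   # no goal
    [0.77, 0.81, 0.73, 0.61],   # goal
    [0.40, 0.35, 0.25, 0.12],   # no goal
])
training_labels = np.array([1, 0, 1, 0])

classifier = SVC(kernel="rbf")
classifier.fit(training_vectors, training_labels)

# At run time, the four measures computed for a candidate shot are classified in a single call
candidate = np.array([[0.79, 0.70, 0.66, 0.58]])
goal_scored = bool(classifier.predict(candidate)[0])
```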
The role of V2, the Visual Motion Activity Measure, is to detect and flag the periods of intense motion, which typically occur when the camera attempts to capture the intense celebratory behaviour of the player after he/she has scored a goal. This segment of interest generally occurs after the scoring of the goal, and before a scene cut to slow-motion replay. Empirical evidence suggests that the average length of such segments is 18s. Therefore, for 18s after the shot-end-boundary of each shot, the average of the visual motion activity measures is calculated. This value becomes the second component of the feature vectors.
Alternatively therefore, the question of step 1402 is answered positively, whereby a second question is asked at step 1403, as to whether the second "visual motion activity level" output vector V2 is equal to or exceeds an empirical threshold value, which is a learned value as described hereinbefore. If the question of step 1403 is answered negatively, control proceeds directly to step 705, the question of which is answered negatively, buffers 506, 507 are cleared and the next input AV data 504, 505 is processed substantially as hereinbefore described.
The role of V3, the Crowd Image Detector, is to detect and flag the segments that show the crowd celebrating after a player has scored a goal. These shots typically occur, sometimes intermittently, within a reasonable time after the goal has been scored. Empirical evidence suggests that such shots typically occur within 35s of the event. Therefore, for 35s after the shot-end-boundary of each shot, the maximum of the crowd image confidence values is determined. This value becomes the third component of the feature vectors. Alternatively therefore, the question of step 1403 is answered positively, whereby a third question is asked at step 1404, as to whether the third "crowd image confidence level" output vector V3 is equal to or exceeds an empirical threshold value, which is a learned value as described hereinbefore. If the question of step 1404 is answered negatively, control proceeds directly to step 705, the question of which is answered negatively, buffers 506, 507 are cleared and the next input AV data 504, 505 is processed substantially as hereinbefore described.
The role of V4, the Close-Up Image Detector, is to detect and flag the segments that exhibit a player in close-up. These shots typically occur, after the scoring of the goal, and before a scene cut to slow-motion replay. Empirical evidence suggests that the average length of such segments is 18s. Therefore, for 18s after the shot-end-boundary of each shot, the average of the close-up confidence values is calculated. This value becomes the fourth component of the feature vectors.
Alternatively therefore, the question of step 1404 is answered positively, whereby a fourth question is asked at step 1405, as to whether the fourth "close-up confidence level" output vector V4 is equal to or exceeds an empirical threshold value, which is a learned value as described hereinbefore. If the question of step 1405 is answered negatively, control proceeds directly to step 705, the question of which is answered negatively, buffers 506, 507 are cleared and the next input AV data 504, 505 is processed substantially as hereinbefore described. Alternatively, the question of step 1405 is answered positively, signifying that all four output vectors VI to V4 match respective conditions indicative of a semantic event which, in the example, is that a goal has been scored during the football game and application 503 therefore declares said occurrence of said event at step 1406. System 101 may then output processed input AV data containing a semantic event at step 306 to remote recipient devices 104, 105 and 106.
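The decision cascade of steps 1402 to 1405 reduces to four comparisons, sketched below with the thresholds left as parameters, since the description leaves them as values learned from the training corpus.

```python
def semantic_event_detected(v1: float, v2: float, v3: float, v4: float,
                            thresholds: tuple[float, float, float, float]) -> bool:
    """Steps 1402-1405: declare a goal only if every output vector meets its learned threshold."""
    t1, t2, t3, t4 = thresholds
    if v1 < t1:          # speech band energy level (step 1402)
        return False
    if v2 < t2:          # visual motion activity level (step 1403)
        return False
    if v3 < t3:          # crowd image confidence level (step 1404)
        return False
    return v4 >= t4      # close-up image confidence level (step 1405)
```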
The words "comprises/comprising" and the words "having/including" when used herein with reference to the present invention are used to specify the presence of stated features, integers, steps or components but does not preclude the presence or addition of one or more other features, integers, steps, components or groups thereof.

Claims

1. A system for extracting semantic events from a digitised audio-visual sequence depicting sporting events, comprising digitised audio-visual sequence input means, memory means, data processing means, and data outputting means, wherein said memory means stores instructions, which configure said processing means to filter said digitised audio-visual sequence according to first image data processing criteria to generate first output audio-visual data; process said first output audio-visual data to generate second audio-visual data processing criteria; filter said first output audio-visual data according to said second audio-visual data processing criteria to generate second output audio-visual data; and output said second output audio-visual data with said data outputting means.
2. A system according to claim 1, wherein said visual data includes a plurality of frames having picture screen elements (pixels) therein, each of said pixels being configured with a red, a green and a blue colour component value.
3. A system according to claim 2, wherein said instructions further configure said processing means to filter said audio-visual sequence by processing a grass colour pixel ratio from said RGB values of said pixels in at least one of said frames.
4. A system according to claim 2, wherein said instructions further configure said processing means to filter said first output audio-visual data by processing a grass colour pixel ratio from said RGB values of said pixels in at least one of said frames.
5. A system according to claim 2, wherein said instructions further configure said processing means to filter said first output audio-visual data by processing a skin colour pixel ratio from said RGB values of said pixels in at least one of said frames.
6. A system according to claim 1, wherein said second audio-visual data processing criteria includes a Speech-Band Audio Activity Measure, a Visual Motion Activity Measure, a Crowd Image Confidence Level and a Close-Up Image Confidence Level.
7. A system according to claim 6, wherein said instructions further configure said processing means to filter said first output audio-visual data with said Speech-Band Audio Activity Measure by processing the scalefactor weights of said first output audio data.
8. A system according to claim 6, wherein said instructions further configure said processing means to filter said first output audio-visual data with said Visual Motion Activity Measure by processing the number of macroblocks in at least one frame of said first output visual data whose motion vector length is greater than an empirically selected threshold, whereby a higher value indicates increased visual activity.
9. A system according to claim 6, wherein said instructions further configure said processing means to filter said first output audio-visual data with said Crowd Image Confidence Level by processing the number of non-zero frequency Discrete Cosine Transform coefficients used to represent the data in the frequency domain, whereby detailed and non-detailed pixel blocks are discriminated in at least one frame of said first output visual data.
10. A system according to claim 6, wherein said instructions further configure said processing means to filter said first output audio-visual data with said Close-Up Image Confidence Level by processing the number of pixels in a first region of at least one frame of said first output visual data having respective red, green and blue colour component values indicative of skin tone, and the number of pixels in a second region having respective red, green and blue colour component values indicative of grass tone.
11. A method for extracting semantic events from a digitised audio-visual sequence depicting sporting events, said method comprising the steps of filtering said digitised audio-visual sequence according to first image data processing criteria to generate first output audio-visual data; processing said first output audio-visual data to generate second audio-visual data processing criteria; filtering said first output audio-visual data according to said second audio-visual data processing criteria to generate second output audio-visual data; and outputting said second output audio-visual data with data outputting means.
12. A method according to claim 11, wherein said visual data includes a plurality of frames having picture screen elements (pixels) therein, each of said pixels being configured with a red, a green and a blue colour component value.
13. A method according to claim 12, wherein said method includes the further step of filtering said audio-visual sequence by processing a grass colour pixel ratio from said RGB values of said pixels in at least one of said frames.
14. A method according to claim 12, wherein said method includes the further step of filtering said first output audio-visual data by processing a grass colour pixel ratio from said RGB values of said pixels in at least one of said frames.
15. A method according to claim 12, wherein said method includes the further step of filtering said first output audio-visual data by processing a skin colour pixel ratio from said RGB values of said pixels in at least one of said frames.
16. A method according to claim 11, wherein said second audio-visual data processing criteria includes a Speech-Band Audio Activity Measure, a Visual Motion Activity Measure, a Crowd Image Confidence Level and a Close-Up Image Confidence Level.
17. A method according to claim 16, wherein said method includes the further step of filtering said first output audio-visual data with said Speech-Band Audio Activity Measure by processing the scalefactor weights of said first output audio data.
18. A method according to claim 16, wherein said method includes the further step of filtering said first output audio-visual data with said Visual Motion Activity Measure by processing the number of macroblocks in at least one frame of said first output visual data whose motion vector length is greater than an empirically selected threshold, whereby a higher value indicates increased visual activity.
19. A method according to claim 16, wherein said method includes the further step of filtering said first output audio-visual data with said Crowd Image Confidence Level by processing the number of non-zero frequency Discrete Cosine Transform coefficients used to represent the data in the frequency domain, whereby detailed and non-detailed pixel blocks are discriminated in at least one frame of said first output visual data.
20. A method according to claim 16, wherein said method includes the further step of filtering said first output audio-visual data with said Close-Up Image Confidence Level by processing the number of pixels in a first region of at least one frame of said first output visual data having respective red, green and blue colour component values indicative of skin tone, and the number of pixels in a second region having respective red, green and blue colour component values indicative of grass tone.
21. A data carrying medium having data-processing instructions thereon which, when processed by processing means, configure said processing means to extract semantic events from a digitised audio-visual sequence depicting sporting events, by performing the steps of filtering a digitised audio-visual sequence according to first image data processing criteria to generate first output audio-visual data; processing said first output audio-visual data to generate second audio-visual data processing criteria; filtering said first output audio-visual data according to said second audio-visual data processing criteria to generate second output audio-visual data; and outputting said second output audio-visual data with data outputting means.
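By way of illustration of the scalefactor-weight processing recited in claims 7 and 17, the following minimal sketch sums the scalefactors of an assumed set of speech-band subbands for one audio frame; the subband indices and the aggregation rule are assumptions introduced for this example and form no part of the claims.

    SPEECH_SUBBANDS = range(1, 8)   # assumed subbands covering the speech band

    def speech_band_activity(scalefactors):
        # scalefactors: per-subband scalefactor weights of one audio frame.
        # Louder speech-band content is encoded with larger scalefactors, so
        # their sum serves as a simple speech-band audio activity measure.
        return sum(scalefactors[sb] for sb in SPEECH_SUBBANDS
                   if sb < len(scalefactors))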
PCT/IE2005/000068 2004-06-18 2005-06-16 Audio-visual sequence analysis WO2005124686A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IE20040412A IE20040412A1 (en) 2004-06-18 2004-06-18 Audio-visual sequence analysis
IE2004/0412 2004-06-18

Publications (1)

Publication Number Publication Date
WO2005124686A1 true WO2005124686A1 (en) 2005-12-29

Family

ID=34969797

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IE2005/000068 WO2005124686A1 (en) 2004-06-18 2005-06-16 Audio-visual sequence analysis

Country Status (2)

Country Link
IE (1) IE20040412A1 (en)
WO (1) WO2005124686A1 (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040017389A1 (en) * 2002-07-25 2004-01-29 Hao Pan Summarization of soccer video content

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SADLIER D A ET AL: "A combined audio-visual contribution to event detection in field sports broadcast video. case study: Gaelic football", SIGNAL PROCESSING AND INFORMATION TECHNOLOGY, 2003. ISSPIT 2003. PROCEEDINGS OF THE 3RD IEEE INTERNATIONAL SYMPOSIUM ON DARMSTADT, GERMANY 14-17 DEC. 2003, PISCATAWAY, NJ, USA,IEEE, 14 December 2003 (2003-12-14), pages 552 - 555, XP010729214, ISBN: 0-7803-8292-7 *
SHU-CHING CHEN, MEI-LING SHYU, CHENGCUI ZHANG, LIN LUO, MIN CHEN: "Detection of Soccer Goal Shots Using Joint Multimedia Features and Classification Rules", PROCEEDINGS OF THE FOURTH INTERNATIONAL WORKSHOP ON MULTIMEDIA DATA MINING (MDM/KDD2003), IN CONJUNCTION WITH THE ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 24 August 2003 (2003-08-24), Washington, DC, USA, pages 36 - 44, XP002342134, Retrieved from the Internet <URL:www.cs.fiu.edu/~chens/PDF/mdm03.pdf> [retrieved on 20050823] *
YIHONG GONG ET AL: "Automatic parsing of TV soccer programs", PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON MULTIMEDIA COMPUTING AND SYSTEMS. WASHINGTON, MAY 15 - 18, 1995, LOS ALAMITOS, IEEE COMP. SOC. PRESS, US, 15 May 1995 (1995-05-15), pages 167 - 174, XP010154594, ISBN: 0-8186-7105-X *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2234070A2 (en) * 2009-03-25 2010-09-29 Roland Wehl Method and electronic journey log system for keeping a log of a motor vehicle's journeys
WO2011000747A1 (en) * 2009-06-30 2011-01-06 Nortel Networks Limited Analysis of packet-based video content
EP2428956A1 (en) * 2010-09-14 2012-03-14 iSporter GmbH i. Gr. Method for creating film sequences
WO2012034903A1 (en) * 2010-09-14 2012-03-22 Isporter Gmbh Method for producing film sequences

Also Published As

Publication number Publication date
IE20040412A1 (en) 2005-12-29

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KM KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NG NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

WWW Wipo information: withdrawn in national office

Country of ref document: DE

122 Ep: pct application non-entry in european phase