WO2012120531A2 - A method for fast and accurate audio content match detection - Google Patents

A method for fast and accurate audio content match detection

Info

Publication number
WO2012120531A2
WO2012120531A2 PCT/IN2012/000076
Authority
WO
WIPO (PCT)
Prior art keywords
audio
matched
sample
match
links
Prior art date
Application number
PCT/IN2012/000076
Other languages
French (fr)
Other versions
WO2012120531A3 (en)
Inventor
Makarand Prabhakar Karanjkar
Original Assignee
Makarand Prabhakar Karanjkar
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Makarand Prabhakar Karanjkar filed Critical Makarand Prabhakar Karanjkar
Publication of WO2012120531A2 publication Critical patent/WO2012120531A2/en
Publication of WO2012120531A3 publication Critical patent/WO2012120531A3/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content


Abstract

The present invention describes a method for detection of audio content match. The method is fast and capable of accurately detecting a match for an audio content. Further, the method is capable of improving match times for a sample of audio exposed to noise and distortion, in inverse proportion to the amount of noise in the sample. The method is also capable of matching a relatively noise-free audio sample very quickly, with incremental noise in the audio sample resulting in a gradual increase in time to match. Moreover, the method offers a substantial improvement in the ability to reject false positive matches, resulting from the use of a technique whereby, for the same set of matched feature points, a greater amount of information relating to matching of the clips is derived, significantly improving the ability to reject false positives.

Description

A Method for Fast and Accurate Audio Content Match Detection
Field of the Invention
The present invention relates to a method for detection of audio content match.
Background and Prior Art
With the increasing growth of broadcast media for delivery of content, the value of advertisements being broadcast has been constantly rising. Advertisement slots for advertising before and in between specific events are constantly rising in cost per second, and advertisers are therefore naturally interested in independent monitoring of the broadcasting of the advertisement they have paid for, in the slot they have paid for. Further, there is also an interest amongst advertisers in trying to quantify the impact or reception of their advertisement across a target area of coverage. All these factors are driving a need to monitor audio content as it is being broadcast, as close to real time as possible, if not actually in real time. This allows for rapid adjustment and potential recalibration of the advertisement or content broadcast strategy. Traditionally, much of this monitoring has been done using manual listening techniques, with logs recording the time and frequency of playback produced for the content being monitored. With the advent of digital techniques and the explosion in the number of channels being broadcast, this approach obviously presents serious limitations, and can result in delays and errors in reporting.
Approaches currently used to achieve automated monitoring of audio content typically analyze the piece of audio under consideration, extract features, and query a database of previously analyzed content whose features have been stored, looking for a measure of similarity between the features. Numerous content-based retrieval methods are available in the prior art. For example, U.S. Patent No. 5,437,050 issued to R.G. Lamb and E.F. Mazer discloses a method which includes the steps of receiving a set of broadcast information; converting the set of broadcast information into a frequency representation; dividing the frequency representation into a predetermined number of frequency segments, each frequency segment representing one of the frequency bands associated with the semitones of the music scale; forming an array, wherein the number of elements in the array corresponds to the predetermined number of frequency segments, and wherein each frequency segment with a value greater than a threshold value is represented by binary 1 and all other frequency segments are represented by binary 0; comparing the array to a set of reference arrays, each reference array representing a previously identified unit of information; and determining, based on the comparison, whether the set of broadcast information is the same as any of the previously identified units of broadcast information.
A method for recognizing an audio sample is disclosed in U.S. Patent No. 7,346,512 B2 issued to Wang and Smith, which locates the audio file that most closely matches the audio sample from a database indexing a large set of original recordings. Each indexed audio file is represented in the database index by a set of landmark timepoints and associated fingerprints. Landmarks occur at reproducible locations within the file, while fingerprints represent features of the signal at or near the landmark timepoints. To perform recognition, landmarks and fingerprints are computed for the unknown sample and used to retrieve matching fingerprints from the database. For each file containing matching fingerprints, the landmarks are compared with the landmarks of the sample at which the same fingerprints were computed. If a large number of corresponding landmarks are linearly related, i.e., if equivalent fingerprints of the sample and retrieved file have the same time evolution, then the file is identified with the sample. The method can be used for any type of sound or music, and is particularly effective for audio signals subject to linear and nonlinear distortion such as background noise, compression artefacts, or transmission dropouts. The sample can be identified in a time proportional to the logarithm of the number of entries in the database. Given sufficient computational power, recognition can be performed in nearly real time as the sound is being sampled. U.S. Patent No. 6,990,453 B2, also issued to Wang and Smith, discloses a similar recognition method for audio samples. Additionally, US Patent No. 7,580,832 B2 issued to Allamanche et al. discloses an apparatus for producing a fingerprint signal from an audio signal, which includes a means for calculating energy values for frequency bands of segments of the audio signal which are successive in time, so as to obtain, from the audio signal, a sequence of vectors of energy values; a means for scaling the energy values to obtain a sequence of scaled vectors; and a means for temporal filtering of the sequence of scaled vectors to obtain a filtered sequence which represents the fingerprint, or from which the fingerprint may be derived. Thus, a fingerprint is produced which is robust against disturbances due to problems associated with coding or with transmission channels, and which is especially suited for mobile radio applications.
Additionally, US Patent No. 7,529,659 B2 issued to Wold discloses a system for determining an identity of a received work. The system receives audio data for an unknown work. The audio data is divided into segments. The system generates a signature of the unknown work from each of the segments. Reduced dimension signatures are then generated from at least a portion of the signatures. The reduced dimension signatures are then compared to reduced dimension signatures of known works that are stored in a database. A list of candidates of known works is generated from the comparison. The signatures of the unknown work are then compared to the signatures of the known works in the list of candidates. The unknown work is then identified as the known work having signatures matching within a threshold.
One outstanding problem observed in each of the above prior art techniques is the relatively constant search time they exhibit, given that the characterization of the features extracted from the audio sample does not segregate the extracted features in any way that would provide search speed advantages for audio samples subject to reduced amounts of noise. The techniques described in the prior art profile a clip of audio, such as an advertisement or a song, store the resulting unique features into a database, and subsequently search the database for a match with features extracted from a sample of audio for which a match must be found. Profiling does not in any way alter the probabilistic match profile for any given set of features. The seek times for extracting matching features, and also the potential number of matching features, are fairly evenly spread for the audio sample under consideration.
Additionally, another outstanding problem faced by prior art methods is the relative difficulty in rejecting "false positives", where the result returned is a piece of content that does not actually match the audio sample. This problem relates broadly to the issue of "information density" achieved at the profiling stage. The greater the amount of information relating to the set of features, the greater is the ability of the system to distinguish between a real match and a false positive. The relatively even spread of information, typically one point of information derived per feature, results in a situation where results derived from the search provide a smaller differential between a real match and a false match, resulting in a greater probability of false positives. An attempt to increase the probability of a match using prior art techniques, by increasing the number of features extracted and searched for per unit time, results in a slowdown in search times.
Objects of the Invention
An object of the present invention is to provide a method for audio content match detection which is fast and capable of accurately detecting a match for an audio content.
Another object of the present invention is to provide a method for audio content match detection, which is capable of improving match times for a sample of audio exposed to noise and distortion, in inverse proportion to the amount of noise in the sample.
Another object of the present invention is to provide a method for audio content match detection which is capable of matching a relatively noise-free audio sample very quickly, with incremental noise in the audio sample resulting in a gradual increase in time to match, proportional to the noise. A further object of the present invention is to provide a method for audio content match detection that provides a substantial improvement in the ability to reject false positive matches, resulting from the use of a technique whereby, for the same set of matched feature points, a greater amount of information relating to matching of the clips is derived, significantly improving the ability to reject false positives.
Brief Description of the Diagrams
The objectives and advantages of the present invention will become apparent from the following description read in accordance with the accompanying drawings wherein,
Figure 1a shows a flow chart of the method for generating sets of links of a sample audio in accordance with the present invention;
Figure 1b shows a flow chart of the method for audio content match detection in accordance with the present invention;
Figure 2 shows a graph of signal value against time for a portion of an audio sample;
Figure 3 shows a graph of the Fast Fourier Transform obtained for a portion of the audio sample;
Figure 4 shows a three-dimensional spectrogram of each portion of the audio sample;
Figure 5 shows a top view of figure 4 illustrating the thresholded peaks of the spectrogram;
Figure 6 shows frequency peak links; and
Figures 7a and 7b show the links of peaks selected to form a set.
Detailed Description of the Invention
The foregoing objects of the present invention are accomplished and the problems and shortcomings associated with the prior art, techniques and approaches are overcome by the present invention as described below in the preferred embodiment.
The present invention provides a method for detection of audio content match. The method is fast and capable of accurately detecting a match for an audio content. Further, the method is capable of improving match times for a sample of audio exposed to noise and distortion, in inverse proportion to the amount of noise in the sample. The method is also capable of matching a relatively noise-free audio sample very quickly, with incremental noise in the audio sample resulting in a gradual increase in time to match. Moreover, the method offers a substantial improvement in the ability to reject false positive matches, resulting from the use of a technique whereby, for the same set of matched feature points, a greater amount of information relating to matching of the clips is derived, significantly improving the ability to reject false positives. The present invention is illustrated with reference to the accompanying drawings, throughout which reference numbers indicate corresponding parts in the various figures. These reference numbers are shown in brackets in the following description.
For the purpose of better explanation, the single method is bifurcated into two stages, as shown in figures 1a and 1b.
Referring now to figure 1a, a flow chart of the method (100) for audio content match detection in accordance with the present invention is illustrated. The method (100) starts at step (10).
At step (15), a predefined digital audio sample is divided into a plurality of equal portions by using digital sampling. A preferred embodiment uses a .WAV file, representing the analog values of the sampled audio, encoded into 16 bits and sampled at 8000 Hz, with a block of 1024 such samples. This is illustrated in figure 2. At step (20), the Fast Fourier Transform (FFT) for each portion of the digital audio sample is computed, as shown in figure 3. A preferred embodiment uses a set of 1024 audio samples, to which a Hamming window is applied and an FFT computed, with a time overlap of 128 samples for each FFT. Frequency information for each portion of the audio sample is placed along the 'Y' axis of a graph, and each frequency profile (FFT) computed is stacked up along the 'X' axis representing time. With frequency amplitude along the Z axis, the graph provides a frequency-time representation of the audio sample.
At step (25), the computed FFTs are stacked together to configure a spectrogram, as shown in figure 4. The spectrogram may also be built using overlapping, equally sized portions of the audio sample, to improve the time resolution of the representation. A preferred embodiment uses a time overlap of 128 audio samples.
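The framing, windowing, and FFT stacking of steps (15) to (25) can be sketched in Python with NumPy and SciPy. This is a minimal illustration only, not the patented implementation: the block size of 1024 samples, the Hamming window, and the 128-sample overlap follow the preferred embodiment above, while the function name, the file name, and the use of a magnitude FFT are illustrative assumptions.

# Minimal sketch of steps (15)-(25): frame the audio, apply a Hamming window,
# compute an FFT per frame, and stack the frames into a spectrogram.
# Block size (1024) and overlap (128) follow the preferred embodiment above;
# the file name and helper names are illustrative only.
import numpy as np
from scipy.io import wavfile

def spectrogram_from_wav(path, block=1024, overlap=128):
    rate, samples = wavfile.read(path)           # e.g. 16-bit mono audio
    samples = samples.astype(np.float64)
    hop = block - overlap                        # consecutive FFTs overlap by 128 samples
    window = np.hamming(block)
    frames = []
    for start in range(0, len(samples) - block + 1, hop):
        frame = samples[start:start + block] * window
        spectrum = np.abs(np.fft.rfft(frame))    # magnitude of the FFT for this portion
        frames.append(spectrum)
    # Stack FFTs along time (x axis), frequency along y axis, as in figure 4.
    return np.array(frames).T

# Usage: spec = spectrogram_from_wav("sample.wav")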
At step (30), significant and dominant peaks for each portion of the audio record are selected by first computing the absolute value of the FFT, followed by its logarithm; peaks are then detected using a threshold computed by applying a point spread to each peak in the FFT and accumulating and rescaling the resulting spread values, which yields a smooth threshold representative of the relative signal strength at each frequency along the FFT. Applying the threshold produces a set of peaks that are truly dominant for that block of audio samples, without suffering from clustering in a single portion of the spectrum, and these peaks are marked along frequency (y axis) versus time (x axis) as illustrated in figure 5. The improvement starts at step (35). At step (35), at least two peaks, and in a preferred embodiment 3 or 5 peaks, are linked together. The selection of peaks to link is based upon criteria such as a predetermined frequency range relative to the peak under consideration, and/or time distance with respect to the peak under consideration, and/or the value of the peak relative to the peak under consideration, to form frequency links of various fixed lengths (see figure 6). The selected peaks are linked to produce peak frequency links, which are clubbed into sets, each set containing frequency links produced from the same set of peaks, with distinguishing frequency peak links linking at least two peaks each. The peak link frequency values are concatenated to produce a string of numbers, which represents a frequency peak link. Each of these strings (representing the frequency peak link) and the frame/time of occurrence of the first peak within that link are clubbed together to form a complete feature.
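One possible reading of the peak selection and linking in steps (30) and (35) is sketched below. The Gaussian smoothing standing in for the "point spread" threshold, the margin above the threshold, and the frequency/time windows that govern which peaks may be linked are illustrative assumptions; the patent describes these criteria only in general terms.

# Sketch of steps (30)-(35): pick dominant peaks against a smooth threshold,
# then link nearby peaks across frames into fixed-length frequency peak links.
# The Gaussian point-spread and the linking windows are assumed parameters.
import numpy as np
from scipy.ndimage import gaussian_filter1d

def dominant_peaks(spec_column, spread=15, margin=1.0):
    log_mag = np.log(np.abs(spec_column) + 1e-10)
    threshold = gaussian_filter1d(log_mag, sigma=spread)    # smooth relative level
    peaks = []
    for k in range(1, len(log_mag) - 1):
        if (log_mag[k] > log_mag[k - 1] and log_mag[k] > log_mag[k + 1]
                and log_mag[k] > threshold[k] + margin):
            peaks.append(k)                                  # frequency-bin index
    return peaks

def link_peaks(peaks_per_frame, link_len=5, max_frame_span=40, max_bin_gap=60):
    """Greedily chain peaks from later frames into links of link_len peaks each."""
    features = []
    for t0, start_peaks in enumerate(peaks_per_frame):
        for f0 in start_peaks:
            chain, f = [f0], f0
            for t_next in range(t0 + 1, min(t0 + 1 + max_frame_span, len(peaks_per_frame))):
                cands = [p for p in peaks_per_frame[t_next] if abs(p - f) <= max_bin_gap]
                if cands:
                    f = min(cands, key=lambda p: abs(p - f))
                    chain.append(f)
                    if len(chain) == link_len:
                        break
            if len(chain) == link_len:
                # Concatenate the bin numbers into a string and pair it with the
                # frame of the first peak, forming one complete feature.
                features.append(("-".join(str(b) for b in chain), t0))
    return features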
At step (40), the clubbed frequency peak link sets are stored in a database. The database stores sets of features, of various peak links of fixed lengths (constructed using a varying number of peaks). An inverse relationship is maintained between the number of features in a set and their length (number of peaks). For example, to detect clear (noise-free) audio, five peaks are clubbed together to form a set, whereas to detect unclear audio, or an audio sample subjected to noise, a set containing three peaks is selected. The premise is that the probability of a match for longer peak links is smaller for a noisy sample, relative to that of finding a match for a shorter peak link. If the sample contains little noise, a higher probability exists for a match for longer peak links, since the absence of noise or distortion results in a higher probability of finding the same peaks, and the corresponding peak links. The greater the noise, the greater the probability of missing peaks, and therefore of missing matches for the corresponding long peak links. At step (45) the method ends.
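One way to organize the feature store of step (40), keeping features of different link lengths (for example five-peak and three-peak links) separately searchable, is sketched below. The in-memory dictionary simply stands in for whatever database is actually used, and all names are illustrative.

# Sketch of step (40): store features per link length, so that long (e.g. 5-peak)
# and short (e.g. 3-peak) link sets can be searched independently later.
from collections import defaultdict

class FeatureStore:
    def __init__(self):
        # {link_length: {link_string: [(clip_id, frame), ...]}}
        self.index = defaultdict(lambda: defaultdict(list))

    def add_clip(self, clip_id, features_by_length):
        # features_by_length: {5: [(link_string, frame), ...], 3: [...]}
        for length, features in features_by_length.items():
            for link_string, frame in features:
                self.index[length][link_string].append((clip_id, frame))

    def lookup(self, length, link_string):
        # Returns the (clip_id, frame) occurrences of a peak frequency link.
        return self.index[length].get(link_string, [])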
Referring now to figure 1b, a flow chart of the method (200) for audio content match detection in accordance with the present invention is illustrated. The method (200) starts at step (110). At step (120), the audio which is to be matched with the audio sample is divided into portions and the Fast Fourier Transform (FFT) of each portion is computed. At step (130), the computed FFTs of the audio to be matched are stacked together to form a spectrogram.
At step (140), the significant and dominant peaks from the spectrogram are selected.
At step (150), at least two peaks are linked to produce peak frequency links, which are clubbed into sets.
At step (160), the database is searched for a first set of matching peak frequency links, and each element of each set of peak frequency links is combined into an exhaustive set of combination pairs without repeating combinations; similarly, the database is searched for a second set of matching peak frequency links, and each element of each set of peak frequency links extracted from the database is combined into an exhaustive set of combination pairs without repeating combinations.
At step (170), these sets are compared with the sets stored in the database of the audio sample to detect an audio match.
While detecting, a single set of frequency peak links, representing the longest peak links, is computed from the audio sample, and a search is conducted in the database to seek a match for frequency peak links of the same size (number of peaks linked) that were computed and saved into the database for an earlier, similarly profiled piece of audio content. The search yields a list of matching peak links and their corresponding frame (time) of occurrence (refer to the Computation below).
The set of peak links computed from the audio sample for which matches were found in the database is used to form peak link pairs amongst themselves, ordered over time, resulting in a list of n!/(2·(n−2)!) = n(n−1)/2 pairs of peak links. Each peak link is paired with the peak links that follow it in time. Likewise, the matched links retrieved from the database are also paired. This results in two lists of matched peak link pairs, as shown in the computation below. It should be apparent that longer matching lists will result from a greater set of matched peak links. For example, if five peak links were matched, then 5!/(2·(5−2)!) = 10 pairs will result in each list. A check is performed to determine the degree of similarity between these two clusters of peak links, by finding how many of the link pairs in the two lists have the same (or similar) temporal relationship (matching time difference).
Computation:
Peak links from the audio sample (length 5 peaks):
1. PL1 = pf1->pf3->pf6->pf9->pf12 (occurs at t1)
2. PL2 = pf5->pf8->pf11->pf14->pf16->pf17 (occurs at t3)
3. PL3 = pf15->pf17->pf18->pf19->pf20 (occurs at t7)
Matching peak links (found in the database) for the audio to be matched (length 5 peaks):
1. PL1' = pf1->pf3->pf6->pf9->pf12 (occurs at t1')
2. PL2' = pf5->pf8->pf11->pf14->pf16->pf17 (occurs at t3')
3. PL3' = pf15->pf17->pf18->pf19->pf20 (occurs at t7')
Peak link pairs from the audio sample:
1. PL1, PL2 (time diff = td1)
2. PL1, PL3 (time diff = td2)
3. PL2, PL3 (time diff = td3)
Matched peak link pairs for the audio to be matched:
1. PL1', PL2' (time diff = td1')
2. PL1', PL3' (time diff = td2')
3. PL2', PL3' (time diff = td3')
For example, for detecting a match for an audio without disturbances on account of noise or distortion, a five-peak link set (PL1, PL2 & PL3) is matched with (PL1', PL2' & PL3'). Further, the time differences between peak link pairs in each set, PL1, PL2 (time diff. = td1), PL1, PL3 (time diff. = td2) and PL2, PL3 (time diff. = td3), are compared with the time differences of the matched sets, PL1', PL2' (time diff. = td1'), PL1', PL3' (time diff. = td2') and PL2', PL3' (time diff. = td3'), to determine whether a statistically significant number of time difference matches exist across the two sets to conclude the match. If the time differences match (td1 = td1', td2 = td2' and td3 = td3'), then the audio is considered an exact match. Similarly, a plurality of sets containing peak links formed using three peaks is considered for matching an audio subjected to noise.
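The pairing and time-difference check illustrated in the computation above can be sketched as follows, assuming each matched link is known by its frame of occurrence in both the audio sample and the database clip; the tolerance used to decide that two time differences "match" is an illustrative assumption.

# Sketch of the pairing and time-difference check.
# matched: list of (link_string, sample_frame, db_frame) for peak links found in
# both the audio sample and a database clip; n links give n*(n-1)/2 ordered pairs.
from itertools import combinations

def pair_time_differences(matched):
    ordered = sorted(matched, key=lambda m: m[1])            # order by sample time
    sample_diffs, db_diffs = [], []
    for (_, s_a, d_a), (_, s_b, d_b) in combinations(ordered, 2):
        sample_diffs.append(s_b - s_a)                        # td1, td2, td3, ...
        db_diffs.append(d_b - d_a)                            # td1', td2', td3', ...
    return sample_diffs, db_diffs

def count_agreeing_pairs(sample_diffs, db_diffs, tol=2):
    # A pair "agrees" when its two time differences match within a small
    # tolerance (in frames); the tolerance is an assumed parameter.
    return sum(abs(a - b) <= tol for a, b in zip(sample_diffs, db_diffs))

# With three matched links, as in the computation above, each list holds
# 3*(3-1)/2 = 3 pairs; with five matched links, 5*(5-1)/2 = 10 pairs.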
To match an audio with a higher or lower tempo as compared to the audio sample, the average of td1, td2 and td3 is computed. The time differences are then recomputed and represented as percentages of the computed average. Similar treatment is applied to td1', td2' and td3'. A comparison is then made between these two sets of values, each computed as a percentage of the average time difference between link pairs. This makes the method (100) robust to distortions in playback speed.
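The tempo normalization just described can be sketched as follows; the percentage tolerance is an illustrative assumption.

# Sketch of the tempo normalization described above: express each time
# difference as a percentage of the average difference in its own list before
# comparing, so a uniform speed-up or slow-down cancels out.
def as_percent_of_average(diffs):
    avg = sum(diffs) / len(diffs)
    return [100.0 * d / avg for d in diffs]

def tempo_robust_agreement(sample_diffs, db_diffs, tol_percent=5.0):
    # tol_percent is an assumed tolerance for "matching" percentages.
    s = as_percent_of_average(sample_diffs)
    d = as_percent_of_average(db_diffs)
    return sum(abs(a - b) <= tol_percent for a, b in zip(s, d))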
The fact that each matched peak link is paired with at least one other results in at least two bits of information contributing towards computation of the result. In fact, the greater the number of pairs, the greater the number of bits of information that each peak link contributes towards the computation of the result. This is in contrast to the prior art, in which a single feature typically contributes a single bit of information towards determining a match or mismatch. This additional information directly contributes to the ability of the method (100) to distinguish between a real match and a false positive in a much more robust way.
The speedup for a matching case, and the graceful degradation in speed for a miss during the matching method/search, is obtained by using sets of longer peak link pairs initially, followed by a check using successively shorter peak link sets. The database thus stores multiple sets of peak links of various fixed lengths (numbers of peaks), corresponding to originally profiled audio samples. For an audio sample that has been subjected to noise, the peaks computed from the audio sample will match fewer of the peaks computed from a potentially matching sample, profiled and stored (in the form of frequency peak links) into the database earlier. This is on account of distortion resulting in missing or misaligned peaks in the audio sample due to the presence of noise. While matching, the set of longer links is used first, since the probability of finding a longer matching link is lowered by missing or misaligned peaks in the audio sample, and a smaller number of potential matches is found. This results in a quicker determination of a match, given the smaller number of links to match. In the event that the method (100) yields an ambiguous result or fails to produce a match, the next set of shorter links (fewer peaks per link) is searched, increasing the probable matches, but also taking more time. This technique speeds up the overall search in the event the audio sample has not been subject to noise or distortion, but ensures that the time to match degrades gracefully in the event that the sample was subject to noise or other distortion. The trade-offs in this approach are additional computation and storage space. At step (180), the method (200) ends.
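The longest-links-first search strategy can be sketched as below, reusing the FeatureStore and the pairing helpers from the earlier sketches; the minimum number of agreeing pairs needed to declare a confident match is an illustrative assumption.

# Sketch of the cascading search described above: try the longest link sets
# first (fast, few candidates for clean audio), then fall back to shorter sets
# when the result is ambiguous or no match is found.
def cascaded_match(store, features_by_length, min_agreeing_pairs=3):
    # store: a FeatureStore as sketched earlier; features_by_length:
    # {5: [(link_string, frame), ...], 3: [...]} computed from the audio to match.
    for length in sorted(features_by_length, reverse=True):   # longest links first
        by_clip = {}
        for link_string, frame in features_by_length[length]:
            for clip_id, db_frame in store.lookup(length, link_string):
                by_clip.setdefault(clip_id, []).append((link_string, frame, db_frame))
        for clip_id, links in by_clip.items():
            if len(links) < 2:
                continue
            s_diffs, d_diffs = pair_time_differences(links)
            if count_agreeing_pairs(s_diffs, d_diffs) >= min_agreeing_pairs:
                return clip_id           # confident match at this link length
    return None                          # ambiguous or no match after all lengths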
The method (100 & 200) provides the advantage of increased accuracy in detecting a match for an audio sample with some disturbance. Further, the method (100 & 200) improves match times for a sample of audio exposed to noise and distortion, in inverse proportion to the amount of noise in the sample. Furthermore, the method (100 & 200) matches a relatively noise-free audio sample very quickly, with incremental noise in the audio sample resulting in a gradual increase in time to match. Moreover, the method (100 & 200) provides a substantial improvement in the ability to reject false positive matches, resulting from the use of a technique whereby, for the same set of matched feature points, a greater amount of information relating to matching of the clips is derived, significantly improving the ability to reject false positives.
The foregoing objects of the invention are accomplished and the problems and shortcomings associated with prior art techniques and approaches are overcome by the present invention described in the present embodiment. Detailed descriptions of the preferred embodiment are provided herein; however, it is to be understood that the present invention may be embodied in various forms. Therefore, specific details disclosed herein are not to be interpreted as limiting, but rather as a basis for the claims and as a representative basis for teaching one skilled in the art to employ the present invention in virtually any appropriately detailed system, structure, or matter. The embodiments of the invention as described above and the methods disclosed herein will suggest further modifications and alterations to those skilled in the art. Such further modifications and alterations may be made without departing from the spirit and scope of the invention.

Claims

I Claim:
1. A method for audio content match detection with respect to a pre-defined digital audio sample, the method comprising the steps of: feeding the predefined digital audio sample by input means to a server;
dividing the predefined digital audio sample into a plurality of equal portions; computing a Fast Fourier Transform (FFT) for each portion of the digital audio sample, characterized in that:
stacking together the computed FFTs of each portion of the digital audio sample to configure a spectrogram separately;
selecting significant and dominant peaks from each portion of the digital audio sample;
linking at least two selected peaks with a predefined time period therebetween to produce peak frequency links of the digital audio sample, the links being clubbed into first sets;
saving the generated first sets in a database;
receiving audio to be matched from a broadcasting means by the server; dividing the audio to be matched into a plurality of equal portions;
computing a Fast Fourier Transform (FFT) for each portion of the audio to be matched, characterized in that:
stacking together the computed FFTs of each portion of the audio to be matched to configure a spectrogram separately;
selecting significant and dominant peaks from each portion of the audio to be matched;
linking at least two selected peaks with a predefined time period therebetween to produce peak frequency links of the audio to be matched, the links being clubbed into second sets;
searching the database for a first set of matching peak frequency links, and combining each element of each set of peak frequency links into an exhaustive set of combination pairs without repeating combinations; similarly, searching the database for a second set of matching peak frequency links, and combining each element of each set of peak frequency links extracted from the database into an exhaustive set of combination pairs without repeating combinations; and comparing each of the second sets with each of the first sets stored in the database, wherein when each pair of the second sets is compared with each pair of the first sets, and the time difference between each pair of frequency peak links of the second sets and each respective pair of the first sets is constant, then the audio to be matched is considered a perfect match with the digital audio sample, and results are accordingly generated in the server.
2. The method as claimed in claim 1, wherein the input means is selected from a group consisting of a Compact Disc (CD), a Digital Versatile Disc (DVD), a hard disc, an MP3 player, a Universal Serial Bus (USB), and an audio playback device such as an FM/AM radio, TV, DVB receiver or STB.
3. The method as claimed in claim 1, wherein the broadcasting means is selected from a group consisting of a Television Broadcast (TV), a radio broadcast, online streaming, cable TV, and DTH receiver.
4. The method as claimed in claim 1, wherein the server is selected from a group consisting of a computer, a personal computer, an embedded processor, and an embedded PC.
5. The method as claimed in claim 1, wherein the database is stored on a hard disc.
6. The method as claimed in claim 1, wherein a larger number of selected peaks, for example five, are linked for obtaining a match when the audio to be matched has clear sound.
7. The method as claimed in claim 1, wherein a smaller number of selected peaks, for example three, are linked for obtaining a match when the audio to be matched has sound with disturbance.
8. The method as claimed in claim 1, wherein the server comprises hardware and software for processing the predefined digital audio sample and the audio to be matched.
9. The method as claimed in claim 1, wherein the digital audio sample has a format selected from a group consisting of WMA, WAV, MP3, and RM.
10. The method as claimed in claim 1, wherein the audio to be matched has a format selected from a group consisting of WMA, WAV, MP3, and RM.
PCT/IN2012/000076 2011-02-02 2012-02-02 A method for fast and accurate audio content match detection WO2012120531A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN292MU2011 2011-02-02
IN292/MUM/2011 2011-02-02

Publications (2)

Publication Number Publication Date
WO2012120531A2 true WO2012120531A2 (en) 2012-09-13
WO2012120531A3 WO2012120531A3 (en) 2013-03-14

Family

ID=46582734

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IN2012/000076 WO2012120531A2 (en) 2011-02-02 2012-02-02 A method for fast and accurate audio content match detection

Country Status (1)

Country Link
WO (1) WO2012120531A2 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2731030A1 (en) * 2012-11-13 2014-05-14 Samsung Electronics Co., Ltd Music information searching method and apparatus thereof
CN108039178A (en) * 2017-12-15 2018-05-15 奕响(大连)科技有限公司 A kind of audio similar determination methods of Fourier transformation time-domain and frequency-domain
CN108091346A (en) * 2017-12-15 2018-05-29 奕响(大连)科技有限公司 A kind of similar determination methods of the audio of Local Fourier Transform
CN109073614A (en) * 2016-04-13 2018-12-21 株式会社岛津制作所 Data processing equipment and data processing method
CN110019922A (en) * 2017-12-07 2019-07-16 北京雷石天地电子技术有限公司 A kind of audio climax recognition methods and device
CN114218244A (en) * 2022-02-23 2022-03-22 华谱科仪(北京)科技有限公司 Online chromatograph database updating method, data identification method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5437050A (en) 1992-11-09 1995-07-25 Lamb; Robert G. Method and apparatus for recognizing broadcast information using multi-frequency magnitude detection
US6990453B2 (en) 2000-07-31 2006-01-24 Landmark Digital Services Llc System and methods for recognizing sound and music signals in high noise and distortion
US7529659B2 (en) 2005-09-28 2009-05-05 Audible Magic Corporation Method and apparatus for identifying an unknown work
US7580832B2 (en) 2004-07-26 2009-08-25 M2Any Gmbh Apparatus and method for robust classification of audio signals, and method for establishing and operating an audio-signal database, as well as computer program

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004062813A (en) * 2002-07-31 2004-02-26 Sony Corp Apparatus and method for creating unique identifier of content and computer program
KR100774585B1 (en) * 2006-02-10 2007-11-09 삼성전자주식회사 Mehtod and apparatus for music retrieval using modulation spectrum

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5437050A (en) 1992-11-09 1995-07-25 Lamb; Robert G. Method and apparatus for recognizing broadcast information using multi-frequency magnitude detection
US6990453B2 (en) 2000-07-31 2006-01-24 Landmark Digital Services Llc System and methods for recognizing sound and music signals in high noise and distortion
US7346512B2 (en) 2000-07-31 2008-03-18 Landmark Digital Services, Llc Methods for recognizing unknown media samples using characteristics of known media samples
US7580832B2 (en) 2004-07-26 2009-08-25 M2Any Gmbh Apparatus and method for robust classification of audio signals, and method for establishing and operating an audio-signal database, as well as computer program
US7529659B2 (en) 2005-09-28 2009-05-05 Audible Magic Corporation Method and apparatus for identifying an unknown work

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2731030A1 (en) * 2012-11-13 2014-05-14 Samsung Electronics Co., Ltd Music information searching method and apparatus thereof
US9659092B2 (en) 2012-11-13 2017-05-23 Samsung Electronics Co., Ltd. Music information searching method and apparatus thereof
CN109073614A (en) * 2016-04-13 2018-12-21 株式会社岛津制作所 Data processing equipment and data processing method
CN109073614B (en) * 2016-04-13 2020-07-14 株式会社岛津制作所 Data processing apparatus and data processing method
CN110019922A (en) * 2017-12-07 2019-07-16 北京雷石天地电子技术有限公司 A kind of audio climax recognition methods and device
CN108039178A (en) * 2017-12-15 2018-05-15 奕响(大连)科技有限公司 A kind of audio similar determination methods of Fourier transformation time-domain and frequency-domain
CN108091346A (en) * 2017-12-15 2018-05-29 奕响(大连)科技有限公司 A kind of similar determination methods of the audio of Local Fourier Transform
CN114218244A (en) * 2022-02-23 2022-03-22 华谱科仪(北京)科技有限公司 Online chromatograph database updating method, data identification method and device

Also Published As

Publication number Publication date
WO2012120531A3 (en) 2013-03-14


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12740407

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase in:

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 12740407

Country of ref document: EP

Kind code of ref document: A2