CN104715033A

CN104715033A - Staged audio retrieval method

Info

Publication number: CN104715033A
Application number: CN201510113675.1A
Authority: CN
Inventors: 牛保宁; 姚姗姗; 王运生
Original assignee: Taiyuan University of Technology
Current assignee: Taiyuan University of Technology
Priority date: 2015-03-16
Filing date: 2015-03-16
Publication date: 2015-06-17

Abstract

A staged audio retrieval method comprises the following steps: a hash index table is built for an original audio fingerprint database using the Fibonacci hash algorithm; the original audio fingerprint database is converted by the BoF algorithm into a corresponding intermediate audio fingerprint database; three rounds of screening successively narrow the retrieval range and yield a third sequence-number set of possible results; and the original fingerprints corresponding to the third sequence-number set are matched exactly against the original fingerprint of the audio clip to be retrieved to obtain the final retrieval result. For fast audio retrieval, the method reduces the amount of computation without lowering precision, which improves efficiency and reduces memory use.

Description

A staged audio retrieval method

Technical field

The invention belongs to the field of content-based audio fingerprint retrieval, and specifically relates to a staged audio retrieval method based on the Philips audio fingerprint and the Bag-of-Features (BoF) algorithm.

Background technology

With the worldwide spread of the internet since the turn of the century, the rapid development of audio coding and decoding technology, and the advent of high-capacity storage media, the quantity of digital audio resources on the network has grown exponentially. While this mass of networked digital audio brings great convenience, the management systems and copyright protection schemes for digital audio on the internet are at present non-standardized and incomplete: network users can freely upload or download digital audio resources and even alter the audio content, which seriously infringes the legitimate rights and interests of the copyright owners.

Current audio retrieval methods fall into two broad classes, text-based and content-based, and content-based audio retrieval has become a focus of recent research at home and abroad.

Content-based audio fingerprint retrieval matches the fingerprint of the audio to be retrieved against the fingerprints in an audio fingerprint database and obtains the retrieval result by comparing similarities.

The Philips audio fingerprint is currently one of the more commonly used fingerprints. The most direct retrieval algorithm matches the fingerprint of the audio clip to be retrieved against the fingerprint of every reference audio in the library one by one, but as the library grows the retrieval efficiency of this approach falls far short of what users expect.

Audio fingerprints are usually high-dimensional, and similarity matching of high-dimensional fingerprints can make computation and storage costs grow exponentially. Given the high dimensionality of the data involved in audio fingerprint retrieval, the key problem is to design a retrieval method that is both fast and accurate.

Different audio fingerprints call for a fingerprint retrieval algorithm and similarity matching method suited to their data structure, characteristics and application scenario. At present, research on fast retrieval follows two main directions: dimensionality reduction and indexing.

Dimensionality reduction reduces the amount of fingerprint data, and thereby the computation needed for fingerprint similarity matching, in order to improve retrieval efficiency.

The OPCA-based dimensionality reduction technique proposed by Diamantaras and Kung is very effective for identifying streaming media, but its classification results are unsatisfactory and unstable. Building on it, Hu et al. proposed a weighted audio dimensionality reduction technique, w-PCA, which performs markedly well in reducing data dimensionality but is sensitive to the choice of dimension. Shen et al. proposed a summarization algorithm that significantly improves retrieval speed but is applicable only when the largest eigenvalue is much larger than the other eigenvalues. In addition, Zheng et al. proposed a weighted self-similarity quantization method that reduces the dimension of multidimensional audio feature-vector fingerprints before retrieval. Panagiotou et al. used a PCA-based aggregation technique to reduce the dimension of delta Mel-frequency cepstral coefficients or delta chroma features and then built a Markov model, effectively speeding up retrieval. However, while dimensionality reduction improves retrieval efficiency, it lowers the precision and recall of the fingerprint, which runs counter to the goal of retrieval matching.

The purpose of indexing is to associate fingerprints through an index so that the retrieval range can be narrowed quickly and retrieval becomes efficient. For the Philips fingerprint, Haitsma et al. proposed building a fast lookup table over all possible audio fingerprints and associating each fingerprint in the audio fingerprint database with the table, so that the songs associated with a queried fingerprint can be found quickly in the table. When the query audio is distorted, however, retrieval performance drops significantly. Chen et al. improved this method with a fast retrieval method based on Fibonacci hashing, which adjusts the size of the hash table to the available memory and effectively saves memory. Kurth et al. proposed indexing audio fingerprints in the form of a quantization codebook, which significantly improves retrieval speed, but the limit placed on the feature error rate when the index is built can raise the false-positive rate. Kurth and Muller also proposed a retrieval method based on inverted-file indexing that combines CENS (chroma energy normalized statistics) features with several fault-tolerance mechanisms and a multiple-query strategy. Vitola et al. used a hash function to extract fingerprints from frequency features and used the hash numbers to partition the search space, which scales well on parallel architectures. However, building an index costs extra storage space, and as the amount of audio data keeps growing this becomes a very serious problem.

Summary of the invention

To overcome two shortcomings of existing retrieval algorithms, namely that dimensionality reduction lowers fingerprint precision while improving retrieval efficiency and that indexing costs extra storage space, the invention provides an efficient staged audio retrieval method that reduces the amount of computation without lowering precision, improves efficiency, and reduces memory use.

The technical solution adopted by the invention to solve this technical problem is as follows:

1. An original audio fingerprint database is built;

2. A hash index table is built for the original fingerprint database using the Fibonacci hash algorithm;

3. The original fingerprint database is converted into an intermediate fingerprint database by the BoF algorithm;

4. Three rounds of screening are carried out on the intermediate fingerprint database;

The three rounds are the first screening, the second screening and the third screening;

The first screening filters using the Fibonacci hash;

The second and third screenings both filter using the threshold-based fixed-interval sampling matching method;

5. The original fingerprints corresponding to the result of the third screening are matched exactly against the original fingerprint of the query audio using the Philips algorithm to obtain the final retrieval result;

Based on the Philips audio fingerprint and the Bag-of-Features (BoF) technique, the invention designs an intermediate fingerprint with a smaller data volume, which is used to filter out dissimilar audio quickly. It also designs a threshold-based fixed-interval sampling matching method. When the fingerprint of the audio clip to be retrieved is matched against audio in the library, the clip is assumed to be shorter than the library audio, so a match is attempted once every fixed distance; within each match only sub-fingerprints a fixed interval apart are compared, and similarity is judged against thresholds. This reduces the number of comparisons and speeds up retrieval matching.

Adding the Fibonacci hash algorithm allows the size of the generated index to be adjusted to the size of memory, avoiding excessive use of storage space.
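The text does not give the hash function itself. As a minimal sketch, assuming the common 32-bit multiplicative form of Fibonacci hashing (multiply by 2^32 divided by the golden ratio and keep the top bits), the following Python shows how the table size could follow a memory budget. The names `fib_hash` and `table_bits_for_memory` and the 8-byte slot size are illustrative assumptions, not taken from the patent.

```python
GOLDEN_32 = 2654435769  # floor(2**32 / golden ratio), the usual 32-bit multiplier


def fib_hash(sub_fingerprint: int, table_bits: int) -> int:
    """Map a 32-bit sub-fingerprint to a slot in a table of 2**table_bits entries."""
    if not 1 <= table_bits <= 32:
        raise ValueError("table_bits must be in 1..32")
    product = (sub_fingerprint * GOLDEN_32) & 0xFFFFFFFF  # keep the low 32 bits
    return product >> (32 - table_bits)  # the top bits select the slot


def table_bits_for_memory(budget_bytes: int, bytes_per_slot: int = 8) -> int:
    """Largest table_bits whose table fits the budget (hypothetical helper)."""
    slots = max(budget_bytes // bytes_per_slot, 2)
    return min(slots.bit_length() - 1, 32)
```

Under these assumptions a smaller memory budget simply yields fewer index bits, so more sub-fingerprints share a slot and the later screening rounds have more candidates to reject.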

When performing fast audio retrieval, the invention reduces the amount of computation without lowering precision, improves efficiency, and reduces memory use.

The threshold-based fixed-interval sampling matching method is as follows:

1. Sub-fingerprint count threshold: if the total number of frames of the intermediate fingerprint of the audio clip to be retrieved is less than that of the reference audio, the reference audio is judged to be a possible result;

2. Single-frame distance threshold α and mean distance threshold ā: if the single-frame distance ε_i of the intermediate fingerprint of the audio clip to be retrieved is less than α, or the mean distance ∑ε_i/N_i over the first N_i frames is less than ā, the reference audio is directly judged to be a possible result. The fixed-interval sampling matching method is used in the calculation. α and ā are integers greater than 0; N_i is an integer greater than zero in the range 0 to N_m/Q; N_m, an integer greater than zero, is the total number of frames of the intermediate fingerprint of the audio clip to be retrieved; Q is a constant in the range 1 to N_m, and one similarity match is carried out every Q frames; N_m/Q is the total number of similarity matches required for the intermediate fingerprint of the audio clip to be retrieved;

3. Cumulative distance threshold β and cumulative count threshold Ω: during process 2, the distance ε_m of the first m frames of the intermediate fingerprint is accumulated; if ε_m is less than β, or m has not reached Ω, the reference audio is judged to be a possible result. β and Ω are integers greater than 0; m is an integer greater than zero in the range 0 to N_m/Q; N_m, an integer greater than zero, is the total number of frames of the intermediate fingerprint of the audio clip to be retrieved; Q is a constant in the range 1 to N_m, and one similarity match is carried out every Q frames; N_m/Q is the total number of similarity matches required for the intermediate fingerprint of the audio clip to be retrieved;

4. First-t-frame similarity threshold γ: each time a sliding-window match is performed with the original fingerprint, the similarity S_t of the first t frames of the fingerprint is compared first; when S_t > γ, the reference audio is judged to be a possible result and the similarity S_v of the whole fingerprint is calculated. γ, S_t and S_v are real numbers greater than 0. The fixed-interval sampling matching method is used in the calculation. t is an integer greater than zero in the range 0 to N_o/Q; N_o, an integer greater than zero, is the total number of frames of the original fingerprint of the audio clip to be retrieved; Q is a constant in the range 1 to N_o, and one similarity match is carried out every Q frames; N_o/Q is the total number of similarity matches required for the original fingerprint of the audio clip to be retrieved;

5. Accumulated similarity threshold η: during process 4, the similarity ε_n of the first n frames of the original fingerprint is accumulated; if ε_n < η, the reference audio is judged to be a possible result. η and ε_n are real numbers greater than 0; n is an integer greater than zero in the range 0 to N_o/Q; N_o is the total number of frames of the original fingerprint of the audio clip to be retrieved; Q is a constant in the range 1 to N_o, and one similarity match is carried out every Q frames; N_o/Q is the total number of similarity matches required for the original fingerprint of the audio clip to be retrieved;

6. Sliding interval threshold θ: when the similarity between fingerprints is lower than θ, the number of slides is increased before similarity matching is carried out again; θ is a real number greater than 0.

The fixed-interval sampling matching method is as follows:

For an audio clip to be retrieved of length N frames (N is an integer greater than zero), an audio segment of length N frames is first chosen in the reference audio. For the two segments, one sub-fingerprint is taken every Q frames and its similarity is calculated (Q is a constant in the range 1 to N). If the similarity reaches the single-frame distance threshold α, or the mean distance of the first N_i frames reaches the mean distance threshold ā, or the first-t-frame similarity threshold γ is reached, the window is slid onward, another segment of length N frames is chosen in the reference audio, and the judgment is repeated. This continues until a judgment fails to meet the threshold and matching stops, or until the window slides to the end of the audio, at which point the overall similarity of the audio is obtained and one match is complete.

The threshold-based fixed-interval sampling matching method above is applied to the filtering retrieval over both the intermediate fingerprints and the original fingerprints; it makes fast similarity judgments and achieves more efficient retrieval.
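One possible reading of the fixed-interval sampling procedure can be sketched in Python under stated assumptions: the distance is "smaller is more similar", a window is abandoned as soon as the running mean of the sampled distances exceeds a single threshold `stop_above`, and the window advances by `slide` frames. The function names and the single-threshold simplification are illustrative only; the separate thresholds α, ā and γ of the text are not modelled here.

```python
from typing import Callable, Optional, Sequence, Tuple


def sampled_mean_distance(query: Sequence, window: Sequence, q: int,
                          dist: Callable, stop_above: float) -> Optional[float]:
    """Mean distance over every q-th frame, or None if the running mean exceeds stop_above."""
    total = 0.0
    count = 0
    for i in range(0, len(query), q):  # compare only frames 0, q, 2q, ...
        total += dist(query[i], window[i])
        count += 1
        if total / count > stop_above:  # early exit: this window looks dissimilar
            return None
    return total / count if count else None


def best_window(query: Sequence, reference: Sequence, q: int, dist: Callable,
                stop_above: float, slide: int = 1) -> Optional[Tuple[int, float]]:
    """Slide an N-frame window over the reference; return (offset, mean distance) of the best survivor."""
    n = len(query)
    best = None
    for offset in range(0, len(reference) - n + 1, slide):
        window = reference[offset:offset + n]
        score = sampled_mean_distance(query, window, q, dist, stop_above)
        if score is not None and (best is None or score < best[1]):
            best = (offset, score)
    return best  # None when the reference is too short or no window survives
```

With Q frames between samples, each window costs roughly N/Q distance evaluations instead of N, which is where the saving in matching work comes from.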

Brief description of the drawings

Fig. 1 is a logic diagram of the retrieval system of the invention.

Fig. 2 is a schematic diagram of the method for selecting the fixed bit positions in the invention.

Fig. 3 is a schematic diagram of the fixed-interval sampling matching method of the invention.

Embodiment

First step: build the original audio fingerprint database. Before retrieval, original fingerprints (Philips fingerprints) are extracted from the reference audio to build the original audio fingerprint database, and a hash index table is built for the original fingerprint database using the Fibonacci hash algorithm;

Second step: convert the original fingerprint database into an intermediate fingerprint database by the BoF algorithm;

The BoF conversion generates from the original fingerprint database an intermediate fingerprint database with a relatively small data volume; the audio sequence numbers in the two databases correspond one to one, in order.

The algorithm for generating the intermediate fingerprint is as follows:

Input:

F[m] // Philips fingerprint comprising m sub-fingerprints

M // number of bits by which sub-fingerprints are classified

X // number of frames between two adjacent intermediate sub-fingerprints

Y // number of original sub-fingerprints that form one intermediate sub-fingerprint

Output:

MF[(m-Y)/X+1][2^M] // intermediate fingerprint

Start

1. i ← 0, j ← 0, k ← 0;

2. for i = 0 to (m-Y)/X do

3. for j = 0 to 2^M - 1 do

4. MF[i][j] ← 0 // initialize the intermediate fingerprint

5. end for

6. end for

7. i ← 0; while i ≤ m-Y do // i is the first original sub-fingerprint of intermediate frame k

8. for j = i to i+Y-1 do

9. g ← the last M bits of F[j]

10. MF[k][g] ← MF[k][g] + 1

11. end for

12. i ← i+X; k ← k+1

13. end while

End.
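The procedure above can be sketched in Python as follows. This is a minimal illustration, not the inventors' implementation: the function name `intermediate_fingerprint` is hypothetical, sub-fingerprints are assumed to be 32-bit integers, and "the last M bits" is assumed to mean the least-significant M bits, which depends on the bit ordering actually used when the fingerprint is stored.

```python
from typing import List, Sequence


def intermediate_fingerprint(f: Sequence[int], m_bits: int = 3,
                             x: int = 32, y: int = 480) -> List[List[int]]:
    """Turn a list of 32-bit sub-fingerprints into histogram frames of 2**m_bits counts."""
    if not 0 < x < y:
        raise ValueError("expected 0 < X < Y")
    mask = (1 << m_bits) - 1  # selects the last M bits (assumed least significant)
    frames = []
    # One frame starts every X sub-fingerprints and spans Y of them.
    for start in range(0, len(f) - y + 1, x):
        histogram = [0] * (1 << m_bits)
        for sub in f[start:start + y]:
            histogram[sub & mask] += 1  # count the class of this sub-fingerprint
        frames.append(histogram)
    return frames
```

A fingerprint shorter than Y sub-fingerprints yields no frames in this sketch; how such short clips should be handled is not stated in the text.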

First, the sub-fingerprints of the original fingerprint (i.e. the Philips fingerprint) are divided into 2^M classes according to the last M bits of each sub-fingerprint. These M bits are usually selected from the low-frequency region, which has higher energy. Because a sub-fingerprint of the original fingerprint consists of 32 bits, M can range from 1 to 32, and it should be as small as possible so as to reduce the fingerprint dimension. In the experiments the inventors increased M step by step from small values, and comparison showed that M=3 gave the best results.

Then the sub-fingerprints are divided into overlapping frames, each containing Y sub-fingerprints, with adjacent frames X sub-fingerprints apart, where 0 < X < Y < N_o; N_o, an integer greater than zero, is the total number of frames of the original fingerprint of the audio clip to be retrieved. Within each frame, each of the Y consecutive sub-fingerprints is assigned to one of the 2^M classes and the number of sub-fingerprints in each class is counted; these 2^M counts form the intermediate sub-fingerprint of the frame, and all the intermediate sub-fingerprints together form the intermediate fingerprint of the song.

In this embodiment the experiments use M=3, X=32 and Y=480. That is, one frame of the intermediate fingerprint comprises 480 sub-fingerprints, and adjacent frames are 32 sub-fingerprints apart. These 480 sub-fingerprints are divided into 8 classes according to their last 3 bits.
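As a worked illustration of these parameters (the fingerprint length m = 1024 sub-fingerprints is a hypothetical value chosen only for the arithmetic, not a figure from the experiments), the number of intermediate frames and the size of each frame follow directly from M, X and Y:

$$\left\lfloor \frac{m - Y}{X} \right\rfloor + 1 = \left\lfloor \frac{1024 - 480}{32} \right\rfloor + 1 = 18, \qquad 2^{M} = 2^{3} = 8$$

Under that assumption, 1024 original sub-fingerprints reduce to 18 intermediate sub-fingerprints of 8 counts each, and the 8 counts in every frame sum to Y = 480.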

The number of sub-fingerprints in each class is counted, forming an intermediate sub-fingerprint comprising 8 integers. The method for selecting the M fixed bit positions is shown in Fig. 2. Each bit takes the value 0 or 1, so 3 bits can represent 2³, i.e. 8, different cases, and the sub-fingerprints of the original fingerprint can accordingly be divided into 8 classes. For example, in Fig. 2 the last 3 bits of the first three sub-fingerprints are all 100, so they fall into the first class; those of the 4th and 7th sub-fingerprints are both 101, so they fall into the second class; those of the 5th and 8th sub-fingerprints are both 100, so they fall into the third class; and so on.

Third step:

The first screening is carried out in the original fingerprint database: the original fingerprint of the audio clip to be retrieved is used for an index search in the hash index table, which screens out sequence-number set I of possible results and narrows the retrieval range;
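A minimal sketch of such an index search, in Python, is given below. It assumes the common 32-bit multiplicative form of Fibonacci hashing and a bucket table that maps each slot to the set of reference sequence numbers seen there; the names `build_index` and `first_screening` and the `min_hits` parameter are hypothetical and are not described in the patent.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Set

GOLDEN_32 = 2654435769  # floor(2**32 / golden ratio)


def fib_hash(sub_fingerprint: int, table_bits: int) -> int:
    """Slot of a 32-bit sub-fingerprint in a table of 2**table_bits entries."""
    return ((sub_fingerprint * GOLDEN_32) & 0xFFFFFFFF) >> (32 - table_bits)


def build_index(library: Dict[int, Sequence[int]], table_bits: int) -> Dict[int, Set[int]]:
    """library maps a reference sequence number to its list of sub-fingerprints."""
    index: Dict[int, Set[int]] = defaultdict(set)
    for seq_no, subs in library.items():
        for sub in subs:
            index[fib_hash(sub, table_bits)].add(seq_no)
    return index


def first_screening(index: Dict[int, Set[int]], query_subs: Iterable[int],
                    table_bits: int, min_hits: int = 1) -> Set[int]:
    """Return the sequence numbers sharing at least min_hits slots with the query (set I)."""
    hits: Dict[int, int] = defaultdict(int)
    for sub in query_subs:
        for seq_no in index.get(fib_hash(sub, table_bits), ()):
            hits[seq_no] += 1
    return {seq_no for seq_no, count in hits.items() if count >= min_hits}
```

With fewer table bits, unrelated sub-fingerprints collide more often, so set I grows; the sketch leaves the removal of those false candidates to the later screening rounds.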

The second screening is carried out in the intermediate fingerprint database: the original fingerprint of the query audio is converted by Bag-of-Features (BoF) into the intermediate fingerprint of the audio clip to be retrieved, and this intermediate fingerprint is matched against the suspected matching intermediate fingerprints in the intermediate audio fingerprint database using the threshold-based fixed-interval sampling matching method. This quickly screens out sequence-number set II of possible results and narrows the retrieval range further;

The similarity between an intermediate sub-fingerprint of the audio clip to be retrieved and a suspected matching intermediate sub-fingerprint in the intermediate audio fingerprint database can be calculated in many ways. Euclidean distance is used here:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

where x_1, x_2, …, x_n denote the 1st, 2nd, …, n-th intermediate sub-fingerprints of the audio clip to be retrieved, and y_1, y_2, …, y_n denote the 1st, 2nd, …, n-th intermediate sub-fingerprints of a suspected matching intermediate fingerprint in the intermediate audio fingerprint database.

If each similarity obtained is less than the single-frame distance threshold α, or the mean similarity of the first N_i frames is less than the mean distance threshold ā, a possible result is obtained. The distance ε_m of the first m frames of the intermediate fingerprint is accumulated; if ε_m is less than the cumulative distance threshold β, or m has not reached the cumulative count threshold Ω, a narrowed possible result is obtained and placed in sequence-number set II. Here 0 < N_i < N_m/Q and 0 < m < N_m/Q, where N_m, an integer greater than zero, is the total number of frames of the intermediate fingerprint of the audio clip to be retrieved; Q is a constant in the range 1 to N_m, and one similarity match is carried out every Q frames; N_m/Q is the total number of similarity matches required for the intermediate fingerprint of the audio clip to be retrieved.
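How the four thresholds combine is not spelled out, so the following Python is only one hedged reading: a reference survives while every sampled frame is either close on its own (distance below α) or keeps the running mean of the first N_i samples below ā, and while the accumulated distance stays below β unless fewer than Ω samples have been compared. The function names and the parameter names `mean_bar`, `n_i` and `omega` are illustrative stand-ins for ā, N_i and Ω.

```python
import math
from typing import Sequence


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance between two intermediate sub-fingerprints (count vectors)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def passes_second_screening(query_mid: Sequence[Sequence[float]],
                            ref_mid: Sequence[Sequence[float]],
                            q: int, alpha: float, mean_bar: float, n_i: int,
                            beta: float, omega: int) -> bool:
    """True if the reference stays a candidate under the assumed threshold logic."""
    total = 0.0
    limit = min(len(query_mid), len(ref_mid))
    for m, i in enumerate(range(0, limit, q), start=1):  # sample every q-th frame
        d = euclidean(query_mid[i], ref_mid[i])
        total += d
        near = d < alpha or (m <= n_i and total / m < mean_bar)
        within_budget = total < beta or m < omega
        if not (near and within_budget):
            return False  # reject early, skipping the remaining frames
    return limit > 0
```

Because a rejection returns as soon as one sampled frame fails, dissimilar references usually cost only a few distance evaluations in this sketch.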

The third screening is carried out in the original fingerprint database: using the original fingerprint of the audio clip to be retrieved, filtering retrieval with the threshold-based fixed-interval sampling matching method is carried out over the suspected original fingerprints corresponding to the possible-result sequence numbers from the second screening, which screens out sequence-number set III of possible results and narrows the retrieval range again;

The similarity between the sub-fingerprints of the query audio's original fingerprint and those of a suspected original fingerprint is judged by the bit error rate (BER):

$$\mathrm{BER} = \frac{a}{b}$$

where a is the number of differing bits found during matching and b is the total length of the original fingerprint.

If the bit error rate of the first t frames is greater than the first-t-frame bit error threshold γ, a possible result is obtained, and the bit error rate S_v of the whole fingerprint is then calculated. The fixed-interval sampling matching method is used in the calculation. Here 0 < t < N_o/Q, where N_o, an integer greater than zero, is the total number of frames of the original fingerprint of the audio clip to be retrieved; Q is a constant in the range 1 to N_o, and one similarity match is carried out every Q frames; N_o/Q is the total number of similarity matches required for the original fingerprint of the audio clip to be retrieved.
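The bit error rate over sampled sub-fingerprints can be sketched in Python as below, assuming 32-bit sub-fingerprints stored as integers and counting only the bits actually compared in the denominator (an assumption; the text defines b as the total fingerprint length). The helper names are hypothetical, and the comparison against γ is deliberately left to the caller.

```python
from typing import Sequence, Tuple


def bit_errors(a: int, b: int) -> int:
    """Number of differing bits between two 32-bit sub-fingerprints."""
    return bin((a ^ b) & 0xFFFFFFFF).count("1")


def sampled_ber(query: Sequence[int], ref: Sequence[int], q: int = 1) -> float:
    """BER over every q-th sub-fingerprint: differing bits / compared bits."""
    positions = range(0, min(len(query), len(ref)), q)
    if len(positions) == 0:
        raise ValueError("nothing to compare")
    differing = sum(bit_errors(query[i], ref[i]) for i in positions)
    return differing / (32 * len(positions))


def prefix_and_full_ber(query: Sequence[int], ref: Sequence[int],
                        q: int, t: int) -> Tuple[float, float]:
    """BER of the first t sampled frames, then of all sampled frames."""
    prefix = sampled_ber(query[:t * q], ref[:t * q], q)
    return prefix, sampled_ber(query, ref, q)
```

A caller would compare the first value with γ to decide whether the second, more expensive value is worth computing; a real implementation would skip the full pass when the prefix test fails.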

Fourth step:

The original fingerprints corresponding to the result of the third screening are matched exactly against the original fingerprint of the query audio using the Philips algorithm to obtain the final retrieval result.

The fixed-interval sampling matching method is shown in Fig. 3, where Z is the fingerprint dimension. For an audio clip to be retrieved of length N frames (N is an integer greater than zero), an audio segment of length N frames is first chosen in the reference audio. For the two segments, one sub-fingerprint is taken every Q frames and its similarity is calculated (Q is a constant in the range 1 to N). If the similarity reaches the single-frame distance threshold α, or the mean distance of the first N_i frames reaches the mean distance threshold ā, or the first-t-frame similarity threshold γ is reached, the window is slid onward, another segment of length N frames is chosen in the reference audio, and the judgment is repeated. This continues until a judgment fails to meet the threshold and matching stops, or until the window slides to the end of the audio, at which point the overall similarity of the audio is obtained and one match is complete.

The Bag-of-Features (BoF) algorithm used in the invention is known in the art and is published in: F. Precioso, M. Cord, D. Gorisse, and N. Thome, "Efficient bag-of-features kernel representation for image similarity search," Proc. of the 18th IEEE International Conference on Image Processing (ICIP 2011), Brussels, pp. 109-112, September 2011.

Claims

1. A staged audio retrieval method, comprising:

(1) extracting original fingerprints from reference audio and building an original audio fingerprint database;

(2) building a hash index table for the original audio fingerprint database using the Fibonacci hash algorithm;

(3) converting the original audio fingerprint database by the BoF algorithm into an intermediate audio fingerprint database corresponding to the original fingerprint database;

(4) first screening: extracting the original fingerprint from the audio clip to be retrieved and carrying out an index search in the hash index table to screen out a first sequence-number set of possible results, narrowing the retrieval range;

(5) second screening: converting the original fingerprint of the audio clip to be retrieved by the BoF algorithm into the intermediate fingerprint of the audio clip to be retrieved, and using this intermediate fingerprint to carry out filtering retrieval in the intermediate fingerprint database with the threshold-based fixed-interval sampling matching method, to screen out a second sequence-number set of possible results, narrowing the retrieval range further;

(6) third screening: using the original fingerprint of the audio clip to be retrieved, carrying out filtering retrieval with the threshold-based fixed-interval sampling matching method over the original fingerprints corresponding to the first sequence-number set, to screen out a third sequence-number set of possible results, narrowing the retrieval range again;

(7) carrying out exact match retrieval between the original fingerprints corresponding to the third sequence-number set and the original fingerprint of the audio clip to be retrieved to obtain the final retrieval result.

2. The staged audio retrieval method according to claim 1, characterized in that the threshold-based fixed-interval sampling matching method comprises:

3. The staged audio retrieval method according to claim 2, characterized in that the fixed-interval sampling matching method comprises:

for an audio clip to be retrieved of length N frames, first choosing in the reference audio an audio segment of length N frames;

for the two segments, taking one sub-fingerprint every Q frames and calculating its similarity, Q being a constant in the range 1 to N and N being an integer greater than zero; if the similarity reaches the single-frame distance threshold α, or the mean distance of the first N_i frames reaches the mean distance threshold ā, or the first-t-frame similarity threshold γ is reached, sliding the window onward, choosing another segment of length N frames in the reference audio, and repeating the judgment; continuing until a judgment fails to meet the threshold and matching stops, or until the window slides to the end of the audio, whereupon the overall similarity of the audio is obtained and one match is complete.