WO2008083689A1 - System and method for qur'an recitation rules - Google Patents

System and method for qur'an recitation rules Download PDF

Info

Publication number
WO2008083689A1
WO2008083689A1 (PCT/EG2007/000013)
Authority
WO
WIPO (PCT)
Prior art keywords: recitation, user, pronunciation, duration, speaker
Prior art date
Application number
PCT/EG2007/000013
Other languages
French (fr)
Inventor
Mohsen Abdel-Razik Ali Rashwan
Salah El-Dien Hamed Moursi Metwally
Sherif Mahdi Abdo Essawy
Abdel-Rahman Samir Abdel-Rahman Mohamed
Ossama Abdel-Hamied Mohamed Abdel-Hamid
Walied Nazih Abdel-Kawy Ahmed
Moustafa Ali Abd-Alla Shahien
Original Assignee
The Engineering Company for the Development of Computer Systems (RDI)
Priority date
Filing date
Publication date
Application filed by The Engineering Company for the Development of Computer Systems (RDI)
Publication of WO2008083689A1 publication Critical patent/WO2008083689A1/en

Classifications

    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B19/00 Teaching not covered by other main groups of this subclass
    • G09B19/04 Speaking


Abstract

A system and a method for Qur'an recitation rules learning for all modes of recitation (rewayat, e.g. Hafs An Asem, Warsh An Nafee, etc.) through assessment of the user's pronunciation quality are presented. The present system analyzes the user's recitation of the Holy Qur'an, detects the types and positions of errors in the input utterance, and prepares an appropriate corrective feedback for each detected error. A module for the automatic generation of recitation error possibilities was invented and incorporated in the system. A phoneme duration classification algorithm was invented to detect recitation errors related to phoneme duration. A feedback layer was incorporated to generate the appropriate response and present it to the user in audio, text, and visual form. The decisions reached by the system and the method are accompanied by a confidence measure to reduce the effect of misleading system feedback.

Description

System and method for Qur'an recitation rules
Technical field
Usage of speech recognition in teaching the correct recitation of the Holy Qur'an.
Background Art
Two previous works close to this invention were found:
1. "Learning the Quran Karim", GB2298514A: The proposed system displays the relevant ayats (verses) in a selected calligraphic style, with translation text and a graphical wave image in the selected qirat, and pronounces them through a speaker in the selected qirat (mode of recitation). The user then repeats the ayats (verses) through a microphone; the system displays and records the response and plays back the attempt, a graphical image of the two recitations being displayed on the screen in waveform.
2-" Device for multimedia electronic book for holy quran, KR20020071056", A device for a multimedia electronic book for the holy Quran is provided to reproduce a desired part selected by the manipulation of a button through a display screen and a speaker by scanning the pages of the real holy Quran, recording the recitation, and storing the scanned data and the recorded sound data in a non- volatile memory.
Disclosure of invention
According to the invention, a system and methods are provided for Qur'an recitation rules learning through assessment of the user's pronunciation quality.
1. Automatic generation of recitation error hypotheses
In the present invention, the recitation error hypotheses are represented in the form of a linear lattice that is flexible enough to support addition, deletion and overlapping of probable recitation errors.
A block diagram of the recitation error hypotheses building used in the present system is shown in FIG. 4. The lattice generator is based upon the holy Qur'an transcription engine. The transcription engine is built in the form of multi-layer, event-driven modules. This architecture was selected to enable any higher-level analysis module to use this core engine and benefit from its results.
The events engine scans the input holy Qur'an Ottoman text searching for symbols and features and at each probably pronounced character it generates its code, its pronunciation status and its acoustic characteristics (such as voicing, place of articulation, nasalization and aspiration).
The transcription engine analyzes those codes and characteristics and generates the corresponding correct phonetic transcription according to the holy Qur'an recitation rules and their exceptions. The pattern engine gathers all the information from the preceding layers and generates recitation patterns at probable pronunciation locations.
The following rules present the way these recitation error hypotheses are generated:
(The table of recitation error hypothesis generation rules is reproduced only as an image in the source document.)
These pronunciation patterns are used for matching against the pronunciation variant rules at the lattice generator. The lattice generator sorts the matched rules in descending order of error relevance or impact; then all rules resulting in the same phoneme sequence are omitted except the first one. Finally, the lattice is generated with the remaining pronunciation variants in a format suitable for the speech recognizer. A mapping file is also generated that holds the locations of the suitable feedbacks for each pronunciation variant. In the present invention, the lattice unit was selected to be similar to that used in the traditional methods of Qur'an recitation learning, as users are not familiar with the concept of phonemes. In the present system, the text is divided into basic units: Consonant + short vowel (CV), Consonant + long vowel (CVV), Non-vowelled consonant (C), Repeated consonant + short vowel (CCV), Repeated consonant + long vowel (CCVV), Non-vowelled repeated consonant (CC).
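For illustration only, the sorting and de-duplication step described above can be sketched in Python; the rule and unit field names below are assumptions, not the patent's actual data layout:

    def build_lattice(units, matched_rules):
        """Linear lattice: one slot per recitation unit, each slot holding
        the correct phoneme sequence plus the surviving error variants."""
        # Sort matched error rules by descending error relevance/impact.
        ordered = sorted(matched_rules, key=lambda r: r["impact"], reverse=True)

        # Omit every rule whose phoneme sequence was already produced by a
        # higher-impact rule at the same unit (keep only the first one).
        seen, variants = set(), []
        for rule in ordered:
            key = (rule["unit_index"], tuple(rule["phonemes"]))
            if key not in seen:
                seen.add(key)
                variants.append(rule)

        lattice, feedback_map = [], {}
        for i, unit in enumerate(units):
            slot = [tuple(unit["correct_phonemes"])]
            for rule in variants:
                if rule["unit_index"] == i:
                    slot.append(tuple(rule["phonemes"]))
                    # Analogue of the mapping file: where to find the
                    # feedback for this pronunciation variant.
                    feedback_map[(i, tuple(rule["phonemes"]))] = rule["error_code"]
            lattice.append(slot)
        return lattice, feedback_map

The returned lattice is "linear" in the sense described above: a fixed sequence of unit slots, each offering the correct pronunciation plus its error variants, which a speech recognizer can search left to right.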
In the present system, a database containing 663 rules of pronunciation errors in holy Qur'an recitation was constructed. This database is connected via error codes to two other databases. The first is the feedback database, which holds the coloring codes, the readable feedbacks and the audible feedbacks. The second database holds links between each specific recitation error and the relevant holy Qur'an recitation rules; it is used to filter the pronunciation errors to concentrate on specific recitation rules for a given lesson.
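A minimal sketch of how the three linked databases could be laid out, assuming simple dictionaries keyed by error code; all field names and the sample entries are illustrative, and only the linkage via error code comes from the text:

    # Illustrative layout; the patent specifies only the error-code linkage.
    error_rules = {
        "E042": {"pattern": "CV", "context": ("C", "CVV"), "impact": 0.8},
        # ... the remaining entries of the 663 pronunciation-error rules
    }

    feedback_db = {  # first database, keyed by the same error code
        "E042": {
            "color_code": "#FF0000",
            "readable_feedback": "Shorten the vowel of this unit.",
            "audible_feedback": "fb_E042.wav",
        },
    }

    rule_links_db = {  # second database: error code -> recitation rules
        "E042": ["Madd"],
    }

    def errors_for_lesson(lesson_rules):
        """Filter error codes down to those relevant to a given lesson,
        mirroring how the second database focuses the detected errors."""
        return [code for code, rules in rule_links_db.items()
                if any(r in lesson_rules for r in rules)]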
2. Verification HMM models:
A Hidden Markov Model (HMM) is a network of states connected by directed transition branches. An HMM-based speech recognizer uses an HMM to model the production of speech sounds. The HMM recognizer represents each type of phone in a language by a phone model made up of a handful of connected states.
Each state in an HMM has an associated probability distribution over the acoustic features produced while in that state. The output distributions may be Gaussian distributions, weighted mixtures of Gaussians, etc. Each transition branch in an HMM has a transition probability indicating the probability of transiting from the branch's source state to its destination state. All transition probabilities out of any given state, including any self-transition probabilities, sum to one. The output and transition probability distributions for all states in an HMM are estimated from training data using standard HMM training algorithms such as the well-known forward-backward (Baum-Welch) algorithm.
All phone models plus the grammar can be considered as a vast virtual network called "the recognition HMMs". The HMM recognizer models every spoken sentence as having been produced by traversing a path through the states within the HMMs. In general, a frame of acoustic features is produced at each time-step along this path. The path identifies the sequence of states traversed. The path also identifies the duration of time spent in each state of the sequence, thereby defining the time-duration of each phone and each word of a sentence. Put another way, the path describes an "alignment" of the sequence of frames with a corresponding sequence of states of the HMMs.
An HMM search engine within the HMM recognizer computes a maximum likelihood path. The maximum likelihood path is a path through the hidden Markov models with the maximum likelihood of generating the acoustic feature sequence extracted from the speech of the user. The maximum likelihood path includes the sequence of states traversed and the duration of time spent in each state. The maximum likelihood path defines an acoustic segmentation of the acoustic features into a sequence of phones. The acoustic segmentation is a subset of the path information, including the time boundaries and the phone-type labels of the sequence of phones. The HMM search engine computes the maximum likelihood path through its HMMs according to a standard pruning HMM search algorithm that uses the well-known Viterbi search method.
In the application of pronunciation error detection, the sequence of spoken words from the speaker is known in advance by the pronunciation evaluation system. Using the known word sequence as an additional constraint can reduce recognition and segmentation errors and also reduce the amount of computation required by the HMM engine. The input speech is composed of its constituent words, which are in turn broken down into constituent phones, which are themselves broken down into constituent states.
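As a hedged sketch of this constrained search, the following Python function performs Viterbi forced alignment over a fixed left-to-right state sequence (the known-word-sequence constraint); log_emit and log_trans are assumed callables supplied by the acoustic models:

    import math

    def forced_align(frames, states, log_emit, log_trans):
        """Minimal Viterbi forced alignment: assign acoustic frames to a
        fixed left-to-right state sequence (known words fix the order).
        log_emit(state, frame) and log_trans(src, dst) return log-probs."""
        T, S = len(frames), len(states)
        NEG = -math.inf
        delta = [[NEG] * S for _ in range(T)]   # best log-score so far
        back = [[0] * S for _ in range(T)]      # backpointers
        delta[0][0] = log_emit(states[0], frames[0])
        for t in range(1, T):
            for s in range(S):
                # Left-to-right topology: stay in s, or advance from s-1.
                stay = delta[t - 1][s] + log_trans(states[s], states[s])
                adv = (delta[t - 1][s - 1] + log_trans(states[s - 1], states[s])
                       if s > 0 else NEG)
                best, ptr = max((stay, s), (adv, s - 1))
                delta[t][s] = best + log_emit(states[s], frames[t])
                back[t][s] = ptr
        # Backtrace from the forced final state to recover the alignment.
        path, s = [], S - 1
        for t in range(T - 1, -1, -1):
            path.append(s)
            s = back[t][s]
        return list(reversed(path)), delta[T - 1][S - 1]

Because the state order is fixed, the search cost is O(T x S) rather than a search over all word sequences, which reflects the computational saving mentioned above.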
3. User enrollment:
The target of this phase is to collect some utterances from the user and use this data to adapt the system's acoustic models to match the user's speech characteristics. The user enrolment process can be summarized in the following steps (an illustrative sketch follows the steps):
Step 1: Collect a few common sentences from the user in order to select the nearest cluster to the user's voice in the acoustic space. This nearest-cluster model will be used as a reference model for that user.
Step 2: Prompt the user to utter phrases and test them with the reference models generated in step 1. If the system decides that an utterance is free of pronunciation errors, add it to the group that is used in speaker adaptation.
Step 3: Continue until the amount of collected adaptation data is sufficient to produce a system with minimally acceptable performance, then apply the MLLR incremental speaker adaptation technique to transform the reference models to the speaker's domain.
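A minimal sketch of this three-step flow, with all system components (recording, cluster selection, verification, MLLR update) passed in as callables since the patent does not expose their interfaces; utterance objects are assumed to expose a .duration attribute, and the data target is illustrative:

    def enroll_user(record, select_nearest_cluster, verify, mllr_update,
                    clusters, common_sentences, phrases, min_secs=60.0):
        """Sketch of the three-step enrolment flow described above."""
        # Step 1: nearest speaker cluster in acoustic space -> reference model.
        calibration = [record(sentence) for sentence in common_sentences]
        reference_model = select_nearest_cluster(calibration, clusters)

        # Step 2: keep only utterances the verifier judges error-free.
        adaptation_data, collected = [], 0.0
        for phrase in phrases:
            utt = record(phrase)
            if verify(reference_model, utt):
                adaptation_data.append(utt)
                collected += utt.duration
            if collected >= min_secs:
                break

        # Step 3: incremental MLLR transform toward the speaker's domain.
        return mllr_update(reference_model, adaptation_data)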
4. Phoneme duration analysis:
Many of the Holy Qur'an recitation rules are concerned with phoneme durations. Vowel durations in Holy Qur'an recitation are measured in units called "motion", defined as the time it takes to fold or unfold a finger. The main possible extra lengthenings are 2, 4 and 6 motions.
For phonemes whose duration varies according to their location in the Holy Qur'an, this layer determines whether these phonemes have correct lengths or not. To overcome the inter-speaker and intra-speaker variability in recitation speed that may mislead the phone duration classification module, an algorithm for Recitation Rate Normalization (RRN) was invented.
To build the phoneme duration models, a database of one hour of a single speaker's recitation was segmented using HMM forced alignment. A manual revision was done to verify this automatic alignment. The speaker was chosen carefully such that his recitation has a constant rate.
For each phoneme having N occurrences in the database, the duration was assumed to follow a Gaussian (normal) distribution with mean and variance equal to the sample mean and sample variance.
After segmenting an input utterance using the HMM, each decoded phoneme has a duration error percentage ($DE_i$) calculated from its duration $d_i$ and its Gaussian duration model (mean $\mu_i$, standard deviation $\sigma_i$):

$$DE_i = \frac{\lvert d_i - \mu_i \rvert}{\mu_i}$$

The duration is judged correct when the deviation falls within the tolerated range, i.e. when $\lvert d_i - \mu_i \rvert \le T\,\sigma_i$, and incorrect otherwise. Here $T$ is a variable that represents the system tolerance; it is used to tune the accepted duration range according to the user preferences, and it takes a value between $T_{min}$, corresponding to the most strict system behavior, and $T_{max}$, corresponding to the most tolerant system behavior.
In the present invention, recitation speed is normalized because recitation rate varies widely both across speakers and for the same speaker in different sessions. After observing the characteristics of various phonemes, it was noticed that consonants usually have shorter durations and high duration variances, and they are also highly subject to segmentation inaccuracy. This means that their durations are not a good estimate of the recitation rate, so in this invention only vowel durations are used to calculate the recitation rate (RR); thus, for an utterance having N vowels, RR is calculated by the formula:
$$RR = \frac{\sum_{i=1}^{N} \left( T^{end}_{i} - T^{start}_{i} \right)}{N}$$

where $T^{start}_i$ and $T^{end}_i$ are the start and end times of the i-th vowel.
Then, to normalize a phoneme duration, we use

$$d^{norm}_i = \frac{T^{end}_i - T^{start}_i}{RR} = \frac{d_i}{RR}$$
So the duration error percentage ($DE_i$) is then computed using the normalized duration:

$$DE_i = \frac{\lvert d^{norm}_i - \mu_i \rvert}{\mu_i}$$
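Putting the pieces together, a hedged Python sketch of the duration layer, assuming the per-phoneme Gaussian models and the $DE_i$ definition reconstructed above; all names are illustrative:

    import statistics

    def fit_duration_models(durations_by_phoneme):
        """Fit a Gaussian (sample mean, sample standard deviation) to the
        observed durations of each phoneme in the single-speaker database."""
        return {ph: (statistics.mean(ds), statistics.stdev(ds))
                for ph, ds in durations_by_phoneme.items() if len(ds) > 1}

    def recitation_rate(vowel_segments):
        """RR: average vowel duration over the N vowels of the utterance."""
        return sum(end - start for start, end in vowel_segments) / len(vowel_segments)

    def classify_durations(segments, models, rr, tolerance):
        """Return (phoneme, DE_i, is_correct) for each decoded segment.
        tolerance plays the role of T, tuned between T_min (strict) and
        T_max (tolerant)."""
        results = []
        for ph, start, end in segments:
            mu, sigma = models[ph]
            d_norm = (end - start) / rr    # recitation-rate normalization
            de = abs(d_norm - mu) / mu     # duration error percentage
            results.append((ph, de, abs(d_norm - mu) <= tolerance * sigma))
        return results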
5. Confidence scoring:
In this invention, a confidence scoring algorithm is implemented. This algorithm receives the n-best decoded word sequences from the decoder and then analyzes their scores. The first alternative path model is used as the competing decoding model to calculate the confidence score based on a likelihood ratio, as shown in the following equation:
$$CS = \frac{1}{N} \sum_{t=S}^{E} \left( \log P^{hyp}_t - \log P^{1st\text{-}alt}_t \right)$$

where N is the number of frames of a hypothesized phone, S is the start frame, E is the end frame, $\log P^{hyp}_t$ is the hypothesized path score, and $\log P^{1st\text{-}alt}_t$ is the first-alternative competing path score.
Because the difference between these two competing paths may be significant in only a small portion of the path, those small portions should have the most significant effect on the computed confidence score. Therefore, in the present system the confidence score of each path is weighted by the distance between the two competing models.
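A small sketch of this scoring step, assuming per-frame log scores for the hypothesized path and its first alternative; the exact form of the distance weighting is an assumption, since the text only states that such a weighting is applied:

    def confidence_score(hyp_frame_scores, alt_frame_scores, frame_weights=None):
        """Weighted average per-frame log-likelihood ratio between the
        best path and its first alternative over frames S..E. The frame
        weights stand in for the inter-model distance weighting; uniform
        weights reduce this to the plain (1/N) likelihood-ratio score."""
        n = len(hyp_frame_scores)
        if frame_weights is None:
            frame_weights = [1.0] * n
        weighted = sum(w * (h - a) for w, h, a in
                       zip(frame_weights, hyp_frame_scores, alt_frame_scores))
        return weighted / sum(frame_weights)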
6. Feedback Generator:
This layer analyzes the results from the speech recognizer together with user-selectable options to produce useful feedback messages for the user. In the present system, the feedback response is designed based on the confidence score that was calculated by the speech recognizer. When the system suspects the presence of a pronunciation error with a low confidence score, the present system handles the issue with one of the following scenarios, based on the confidence score value (a scenario-selection sketch follows the list):
- Omit reporting the error altogether (which is good for novice Qur'an users, because reporting false alarms discourages them from continuing to learn correct recitation).
- Ask the user to repeat the utterance because it was not pronounced clearly.
- Report the existence of an unidentified error and ask the user to repeat the utterance (which is better for more advanced users than ignoring an existing error or reporting the wrong type of recitation error).
- Report the most probable recitation error (which, if wrong, can be very annoying to many users).
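The scenario selection can be sketched as a simple threshold rule on the confidence score; the threshold values and the novice/advanced switch are illustrative assumptions, and the error record is assumed to carry the readable feedback from the feedback database:

    def feedback_for(confidence, error, user_level,
                     repeat_threshold=0.4, report_threshold=0.7):
        """Choose one of the four feedback scenarios listed above.
        Thresholds and the novice/advanced distinction are illustrative;
        the patent does not fix their values."""
        if confidence < repeat_threshold:
            if user_level == "novice":
                return None  # omit reporting the suspected error entirely
            return "Please repeat the utterance; it was not pronounced clearly."
        if confidence < report_threshold:
            return ("An unidentified recitation error may be present; "
                    "please repeat the utterance.")
        # High confidence: report the most probable recitation error.
        return error["readable_feedback"]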
Brief description of the drawings
FIG. 1 shows the outer layout of the device. It consists of 101 speakers to hear the reference reading and the corrective audio feedback, 102 display, 103 microphone for the user to enter his response, 104 navigation buttons, 105 rewayat selection button to select the needed rewayat to be learnt, 106 verse selection button to select the needed verse to be practiced, and 107 an enter button to confirm the selection.
FIG. 2 shows the internal layout of the device. It consists of 201 the speech output controller, 202 the display controller, 203 Random Access Memory (RAM) to temporarily store results, 204 Read-Only Memory (ROM) to store the permanent system settings, 205 Digital Signal Processor (DSP) to perform the needed system operations, 206 speech input controller, 207 main controller to control all system operations, and 208 Electrically Erasable Programmable Read-Only Memory (EEPROM) to store the user settings and profile.
FIG. 3 shows the block diagram of the user enrolment phase. First, the 305 cluster transform is selected out of the 304 user clusters. Then the 308 user input is applied to the 307 verification HMM with the 306 enrollment-phase lattices to generate the 309 phonetic recognition and segmentation of the 308 user input. The 310 phonetic error detector decides whether the 308 user input is correct or not. Correct utterances are stored in the 315 correct-files database to be entered into 314 user adaptation to create the 313 user transform.
FIG. 4 shows the block diagram of the recitation variants generator block. It starts with the 401 Qur'an text and the 402 Qur'an symbols entered into the 403 events engine, producing 404 features for each character. Then the 405 transcription engine outputs the 406 phonetic transcription of the input 401 Qur'an text. The 410 lattice generator takes the 409 lattice generation rules and the 408 recitation patterns resulting from the 407 pattern engine to produce a 411 searchable lattice.
FIG. 5 shows the block diagram of the whole system judging a certain user-selected verse in a certain user-selected mode of recitation (rewayat). First the user selects 508 the needed rewayat and verse; then the 411 searchable lattice is generated using the 504 recitation variants generator (described in FIG. 4). The 307 verification HMM takes the 313 user transform (generated in the user enrollment phase described in FIG. 3) and the 505 acoustic features generated by the 504 feature extraction block to produce the 507 phonetic recognition and segmentation of the 508 user-selected verse, which is passed to the 509 confidence layer. The 511 phoneme duration analysis layer makes decisions on recitation rules related to phoneme duration (e.g. Ghonna, Madd, etc.); then the 512 user feedback generator produces the 513 corrective feedback to the user depending on his 510 configuration and preferences.

Claims

1. A system for Qur'an recitation rules learning through assessment of the user's pronunciation quality, comprising:
- Means for automatic generation of Holy Qur'an verse recitation variants:
It analyzes the current prompt and generates all possible pronunciation variants, which are fed to the speech recognizer in order to test them against the spoken utterance.
- Means for verification HMM models: the acoustic HMM models used in the system.
- Means for user enrolment:
In this phase, the reference acoustic models are adapted to suit the user's acoustic characteristics. It uses speaker classification, Maximum Likelihood Linear Regression (MLLR) speaker adaptation algorithms and a supervised incremental technique.
- Means for phoneme duration analysis:
For phonemes whose duration varies according to their location in the Holy Qur'an, this layer determines whether these phonemes have correct lengths or not. To overcome the inter-speaker and intra-speaker variability in recitation speed that may mislead the phone duration classification module, an algorithm for Recitation Rate Normalization (RRN) was invented.
- Means for confidence score analysis:
It receives the n-best decoded word sequences from the decoder, then analyzes their scores to determine whether to report each result or not.
- Means for feedback generation:
This layer analyzes the results from the speech recognizer and user-selectable options to produce useful feedback messages for the user.
2. According to claim 1, the process of automatic generation of Holy Qur'an verse recitation variants comprises:
- The events engine scans the input holy Qur'an Ottoman text searching for symbols and features, and at each probably pronounced character it generates its code, its pronunciation status and its acoustic characteristics.
- The transcription engine analyzes those codes and characteristics and generates the corresponding correct phonetic transcription according to the holy Qur'an recitation rules and their exceptions.
- The pattern engine gathers all the information from the preceding layers and generates recitation patterns at probable pronunciation locations.
- An algorithm for generating recitation patterns: the pronunciation patterns are used for matching against the pronunciation variant rules at the lattice generator, and the lattice is generated in a format suitable for the speech recognizer.
3. According to claim 2, an algorithm for generating recitation error patterns utilizes the current unit status, the previous unit status, and the next unit status; it then adds a record for an expected recitation error based on a previously built general recitation errors database.
4. According to claim 1, the user enrolment phase comprises:
- Selection of the nearest cluster to the user's voice in the acoustic space.
- Collection of adaptation utterances from the user and testing them with models adapted to the selected user cluster.
- Applying the MLLR incremental speaker adaptation technique to transform the reference models to the speaker's domain.
5. According to claim 1, the phoneme duration analysis layer comprises:
- Phoneme duration models built from a database of one hour of a single speaker's recitation at a constant rate.
- Each phone duration model is assumed to follow a Gaussian distribution with mean and variance equal to the sample mean and sample variance.
- Based on the utterance segmentation using the verification HMM models, a duration error percentage is calculated for each phoneme based on the normalized phoneme duration.
6. According to claim 1, a confidence score analysis layer verifies the segmentation provided by the verification HMM models utilizing a likelihood ratio between the best path and its first alternative, weighted by their inter-distance.
7. According to claim 1, a feedback generation layer comprises:
- A corrective feedback provided to the user based on the associated confidence score and on the user preferences.
- A description of each error provided to the user.
- In case of a low confidence score, the system gives a general corrective feedback.
- The provided feedback can be in audio, text, 2D graphics or 3D graphics form.
8. A method for Qur'an recitation rules learning through assessment of the user's pronunciation quality, comprising:
- Automatic generation of Holy Qur'an verse recitation variants:
It analyzes the current prompt and generates all possible pronunciation variants, which are fed to the speech recognizer in order to test them against the spoken utterance.
- Verification HMM models: the acoustic HMM models used in the system.
- User enrolment:
In this phase, the reference acoustic models are adapted to suit the user's acoustic characteristics. It uses speaker classification, Maximum Likelihood Linear Regression (MLLR) speaker adaptation algorithms and a supervised incremental technique.
- Phoneme duration analysis:
For phonemes whose duration varies according to their location in the Holy Qur'an, this layer determines whether these phonemes have correct lengths or not. To overcome the inter-speaker and intra-speaker variability in recitation speed that may mislead the phone duration classification module, an algorithm for Recitation Rate Normalization (RRN) was invented.
- Confidence score analysis:
It receives the n-best decoded word sequences from the decoder, then analyzes their scores to determine whether to report each result or not.
- Feedback generation:
This layer analyzes the results from the speech recognizer and user-selectable options to produce useful feedback messages for the user.
9. According to claim 8, the process of automatic generation of Holy Qur'an verse recitation variants comprises:
- The events engine scans the input holy Qur'an Ottoman text searching for symbols and features and at each probably pronounced character it generates its code, its pronunciation status and its acoustic characteristics.
- The transcription engine analyzes those codes and characteristics and generates the corresponding correct phonetic transcription according to the holy Qur'an recitation rules and their exceptions.
- The pattern engine gathers all the information from the preceding layers and generates recitation patterns at probable pronunciation locations.
- An algorithm for generating recitation patterns.
- The pronunciation patterns are used for matching against the pronunciation variant rules at the lattice generator.
- The lattice is generated in a format suitable for the speech recognizer.
10. According to claim 9, an algorithm for generating recitation error patterns utilizes the current unit status, the previous unit status, and the next unit status; it then adds a record for an expected recitation error based on a previously built general recitation errors database.
11. According to claim 8, the user enrolment phase comprises:
- Selection of the nearest cluster to the user's voice in the acoustic space.
- Collection of adaptation utterances from the user and testing them with models adapted to the selected user cluster.
- Applying the MLLR incremental speaker adaptation technique to transform the reference models to the speaker's domain.
12. According to claim 8, the phoneme duration analysis layer comprises:
- Phoneme duration models built from a database of one hour of a single speaker's recitation at a constant rate.
- Each phone duration model is assumed to follow a Gaussian distribution with mean and variance equal to the sample mean and sample variance.
- Based on the utterance segmentation using the verification HMM models, a duration error percentage is calculated for each phoneme based on the normalized phoneme duration.
13. According to claim 8, a confidence score analysis layer verifies the segmentation provided by the verification HMM models utilizing a likelihood ratio between the best path and its first alternative, weighted by their inter-distance.
14. According to claim 8, a feedback generation layer comprises:
- A corrective feedback for each detected error provided to the user based on the associated confidence score and on the user preferences.
- A description of each error provided to the user.
- In case of a low confidence score, the system gives a general corrective feedback.
- The provided feedback can be in audio, text, 2D graphics or 3D graphics form.
15. According to all previous claims, the system and methods can be applied to all Qera't (modes of recitation) of the Holy Qur'an.
PCT/EG2007/000013 2007-01-14 2007-04-26 System and method for qur'an recitation rules WO2008083689A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EG2007010016 2007-01-14
EG2007010016 2007-01-14

Publications (1)

Publication Number Publication Date
WO2008083689A1 (en) 2008-07-17

Family

Family ID: 39608385

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EG2007/000013 WO2008083689A1 (en) 2007-01-14 2007-04-26 System and method for qur'an recitation rules

Country Status (1)

Country Link
WO (1) WO2008083689A1 (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4710877A (en) * 1985-04-23 1987-12-01 Ahmed Moustafa E Device for the programmed teaching of arabic language and recitations
US5487671A (en) * 1993-01-21 1996-01-30 Dsp Solutions (International) Computerized system for teaching speech


Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 07722737; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 07722737; Country of ref document: EP; Kind code of ref document: A1)