WO2010134885A1 - Predicting the correctness of eyewitness' statements with semantic evaluation method (sem) - Google Patents


Info

Publication number
WO2010134885A1
Authority
WO
WIPO (PCT)
Application number
PCT/SE2010/050548
Other languages
French (fr)
Inventor
Farhan Sarwar
Sverker SIKSTRÖM
Original Assignee
Farhan Sarwar
Sikstroem Sverker
Application filed by Farhan Sarwar, Sikstroem Sverker
Publication of WO2010134885A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • A prediction of correctness (P) can be made by using multiple linear regression, where we find the coefficients (R) that best describe the linear relation between the semantic space (u') and the known correctness values of the statements (V):
  • R can be calculated by:
  • This formula can now be used to predict the last statements with an unknown correctness (and which has not been used during training).
  • the correlation between predicted variable and external variable is 0.87.
  • we would conclude that the statement "Rolf is stealing" is more likely to be correct than false, because P is closer to 1 than to 0.
  • % Word by context representation (x) of the words:
    % 'Lars' (1), 'is' (2), 'stealing' (3), 'judge' (4), and 'rolf' (5)
    x = [1 1 0; 1 1 1; 1 0 1; 0 1 0; 0 0 1]

Abstract

A method to predict the correctness of eyewitness' statements. The method comprises: collecting a text corpus comprising a set of words; generating a representation of the text corpus; creating a semantic space for the set of words; summarizing statements in the semantic space; training on a set of training statements where the correctness is known to identify a prediction model; and applying the model to new statements for predicting correctness. A computer program and a computer-readable means containing said computer program are also included.

Description

Predicting the Correctness of Eyewitness' Statements with Semantic Evaluation Method (SEM)
Introduction
Eyewitnesses are the key actors in crime situations and frequently the only source of information for investigators, lawyers and courts. Although blood, DNA, and other analyses do provide valuable information about a crime, eyewitnesses' testimonies still have a significant role in determining the nature of a crime and assigning responsibility for it. Accordingly, eyewitnesses are expected to provide credible information about a crime. However, eyewitnesses often fail to provide the required information. For example, relying on DNA evidence the well-known Innocence Project has exonerated 252 people who were on death row. In 74% of these cases conviction was based on eyewitness misidentification, and in 16% of these cases an informant testified against the defendant, leading to convictions on the basis of incorrect eyewitness testimony (Innocence Project, n.d.).
Research evidence shows that eyewitnesses' memories are generally very fragile and that there are a number of distortion factors. Such factors are, for example, simple forgetting, discussions among co-witnesses (Shaw-III, Garven, & Wood, 1997), the eyewitness's exposure to media coverage of the witnessed event (Loftus & Hoffman, 1989), questions asked by investigators, lawyers, and healthcare personnel, and source attribution errors of the eyewitnesses. How to evaluate eyewitness memory is another big issue. No other person but the eyewitness has direct experience of the event occurring at the crime scene, and no other criterion is available for evaluating the eyewitness testimony. There are some indirect methods to test the validity of eyewitness statements, for example the cognitive interview (CI, Allwood, Ask, & Granhag, 2005; Fisher & Schreiber, 2007), criteria based content analysis (CBCA, Kulkofsky, 2008; Vrij, 2003) and the reality monitoring (RM, Johnson & Raye, 1981) technique.
In spite of these methods, people in the criminal justice system mostly use eyewitness confidence as a yardstick to evaluate the credibility of eyewitness statements (Brewer & Burke, 2002; Brewer, Potter, Fisher, Bond, & Luszcz, 1999; Juslin, Olsson, & Winman, 1996). Confident eyewitnesses are considered more credible than those who are less confident, and vice versa. However, the relationship between confidence and accuracy is not constant across situations. An eyewitness's confidence can be influenced by social factors that are independent of perceptual and memorial processes (Luus & Wells, 1994). For example, multiple retellings of an event can increase confidence while memory accuracy does not increase, simply because of the reiteration effect (Hertwig, Gigerenzer, & Hoffrage, 1997). Consequently, researchers have reservations about using confidence as a barometer of accuracy in eyewitness statements (e.g. Brewer & Burke, 2002). The methods discussed above for evaluating the accuracy of eyewitness testimony are subjective and have limitations in applied contexts, which is why their use to evaluate the credibility of eyewitness testimony has been criticized in the research literature. Here we suggest a new statistical method, the Semantic Evaluation Method (SEM), to evaluate the accuracy of eyewitness statements. This method is objective, reproducible, and ecologically valid compared to the existing methods.
Theory behind the method
SEM is inspired by the theory of latent semantic analysis (LSA) (Landauer & Dumais, 1997; Landauer, McNamara, Dennis, & Kintsch, 2007). According to this theory, humans possess knowledge and express that knowledge through language (or words). How words occur in text in relation to each other essentially determines their meaning and communicates the knowledge. Based on experimental research we have found support that SEM significantly distinguishes between correct and incorrect statements.
Summary of the Invention
More particularly, the object of SEM is to provide a computer-implemented method and a system that allow for prediction of the correctness of a statement. The method is performed on at least one computer and comprises the steps of: collecting a text corpus comprising a set of words; generating a representation of the text corpus; creating a semantic space for the set of words; summarizing statements in the semantic space; training on a set of training statements where the correctness is known to identify a prediction model; and applying the model to statements for predicting correctness.
Here, a "text corpus" is a large and structured set of texts which is typically electronically stored and which may be electronically processed. The text corpus may contain texts in a single language or text data in multiple languages, and is collected by using conventional, known methods and systems.
A "semantic space" is the result of a mathematical algorithm that takes a text corpus as input and creates a high-dimensional space, where the dimensions in the space correspond to semantic qualities, or features, of the words in the corpus. For example, one dimension may represent whether the words relate to something that is alive, whereas another dimension may represent to what extent the word relates to an emotion. Synonyms are located nearby each other in the space, and the distance between words is a measure of how semantically close the words are. The distance between two words is typically measured by the cosine of the angle between the vectors representing the words, although other distance measures may also be used. Semantic spaces are created by using co-occurrence information, and examples of algorithms for creating semantic spaces include the known Latent Semantic Analysis (LSA) (Landauer & Dumais, 1997; Landauer, et al., 2007), Independent Component Analysis and the random indexing (RI) method (Sahlgren, 2007).
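As a minimal illustration of the cosine distance measure described above, the following sketch uses hypothetical two-dimensional word vectors (the words and coordinates are invented for illustration, not taken from any real semantic space):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two word vectors: values near 1.0
    indicate semantically close words, values near 0.0 unrelated ones."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical word vectors: two near-synonyms point in almost the
# same direction, an unrelated word points elsewhere.
car  = [0.9, 0.1]
auto = [0.8, 0.2]
tree = [0.1, 0.9]

print(cosine_similarity(car, auto))  # close to 1
print(cosine_similarity(car, tree))  # much smaller
```

In a real semantic space the vectors would have hundreds of dimensions, but the distance computation is the same.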
A location in the semantic space is a point in the semantic space, which represents e.g. a word, but may also represent several words or even set(s) of keywords. A "semantic dimension" is any judgment relating to the meaning (semantic) of a word (concept), such as positive or negative evaluations, trustworthiness, innovations, intelligence, etc.
By statements we mean any form of set of words generated by humans. Statements can be generated in writing or orally. Statements can be correct, incorrect, or partly correct. Comparing the statement with an external criterion may validate the correctness of the statement.
In a preferred embodiment of the computer-implemented method said text corpus is an electronically structured and processed set of texts, and said text corpus contains texts in a single language which is the same language as is used for said new statement whose correctness is to be predicted.
In a preferred embodiment of said method, said semantic space is calculated using an algorithm selected from the group of LSA or RI. Preferably, the semantic space is then compressed using the SVD algorithm.
In a preferred embodiment of said method, the frequency is normalized by taking the logarithm of the frequency when creating a semantic space using the LSA algorithm. In a preferred embodiment of said method, said text corpus comprises more than ten times, preferably more than fifty times, as many words as said statement whose correctness is to be predicted.
According to another aspect of the invention, a system for predicting a value of a variable associated with a target word is described. The system comprises at least one computer and is configured to: collect a text corpus comprising a set of words; generate a representation of the text corpus; create a semantic space for the set of words, based on the representation of the text corpus; define, for a location in the semantic space, a value of the variable; estimate, for the target word, a value of the variable, based on the semantic space and the defined variable value of the location in the semantic space; and calculate a predicted value of the target word, on the basis of the semantic space, the defined variable value of the location in the semantic space and the estimated variable value of the target word.
According to yet another aspect of the invention a computer readable medium is provided, having stored thereon a computer program having software instructions which when run on a computer cause the computer to perform the steps of: collecting a text corpus comprising a set of words that include the target word; generating a representation of the text corpus; creating a semantic space for the set of words, based on the representation of the text corpus; defining, for a location in the semantic space, a value of the variable; estimating, for the target word, a value of the variable, based on the semantic space and the defined variable value of the location in the semantic space; and calculating a predicted value of the target word, on the basis of the semantic space, the defined variable value of the location in the semantic space and the estimated variable value of the target word. The inventive system and computer readable medium may, as described, comprise, be configured to execute and/or have stored software instructions for performing any of the features described above in association with the inventive method, and have the corresponding advantages.
This invention predicts the degree of correctness of statements. The predictions are made by first converting the words in the statement to a representation in the semantic space. The relation between how correct a statement is and the semantic representation is identified by studying known examples. This relation can then be used to predict correctness of new statements. The invention has been validated in experimental studies. The invention can be divided into the following steps:
(1) Creating a semantic space
The creation of the semantic space requires a huge collection of text called a corpus. This corpus needs to be in the same language as the eyewitness data that is going to be analyzed, and it has to be large. In addition it is preferred, but not necessary, that the general semantic topic of the corpus at least vaguely relates to the eyewitness statements. However, it is more important that the corpus is large than that it resembles the eyewitness text. The text corpus can, for example, be collected by conventional, automatic search robots that scan the internet, text or news databases, databases of spoken language, electronic sources or other collections of text.
Next a semantic space is created from the text corpus, for example by using Latent Semantic Analysis (LSA), Independent Component Analysis (ICA) or random indexing (RI). Other equivalent algorithms that may transform words to distributed semantic representations may also be used. In brief, LSA first creates a table of words (rows) and local contexts (columns), where each table entry counts the frequency of a word in the local text context. Semantic spaces are created by the known data compression algorithm called singular value decomposition (SVD) (Golub & Kahan, 1965), which reduces the large number of contexts to a moderate number of semantic dimensions. The quality of the semantic space can be measured by testing the semantic space on synonym tests. In this invention the algorithm, the parameter settings and the distance measure that yield the best performance on such tests are preferred. The result of such an analysis (e.g. the parameter for the number of dimensions used, etc.) depends on the corpus that is used, and may therefore vary for different applications. The skilled person knows how to select appropriate algorithms and parameter settings. He may, for instance, use the information in the references cited in this application.
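The LSA pipeline just described can be sketched as follows. This is a minimal sketch: the logarithmic normalization and the choice of keeping two dimensions are assumptions for illustration, not prescriptions of the method.

```python
import numpy as np

def create_semantic_space(freq_matrix, n_dims, log_normalize=True):
    """Build a semantic space from a word-by-context frequency matrix
    via SVD, keeping only the first n_dims dimensions and normalizing
    each word vector to length 1."""
    x = np.asarray(freq_matrix, dtype=float)
    if log_normalize:
        x = np.log(1.0 + x)          # dampen high-frequency words
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    u = u[:, :n_dims]                # one row per word, n_dims columns
    # Normalize every word vector to length 1.
    return u / np.linalg.norm(u, axis=1, keepdims=True)

# Toy word-by-context counts (rows = words, columns = contexts).
space = create_semantic_space([[1, 1, 0],
                               [1, 1, 1],
                               [1, 0, 1],
                               [0, 1, 0],
                               [0, 0, 1]], n_dims=2)
print(space.shape)  # (5, 2)
```

With a real corpus the frequency matrix would have many thousands of rows and columns, and the number of retained dimensions would be tuned, e.g. against synonym tests as described above.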
(2) Representing a statement as a location in the semantic space
First, training statements are collected. These statements have a known truth value, i.e. are either correct or incorrect, and are representative of the statements whose correctness it is desirable to predict. For example, if the purpose is to predict statements made from eyewitness testimony, then a number of eyewitness training statements are collected for which it is known whether they are correct or incorrect. Secondly, the statements whose correctness is indeed to be predicted are also collected.
These statements are summarized in the semantic space. This is done by identifying the location in the semantic space associated with each word in the sentence. The statement is summarized as the mean location of the words in the sentence.
(3) Training the relation between the semantic space and correctness of statements
A model of the relation between the training statements and their correctness is built. This is conducted by known, suitable mathematical multidimensional optimization techniques, for example by using multiple linear regression where the dimensions in the semantic space are used as regressors for the correctness of the statement (Cohen, Cohen, West, & Aiken, 2003). However, other techniques for predicting the relation between the semantic space and an external variable may also be used, for example classifiers such as support vector machines (Meyer, Leisch, & Hornik, 2003). The predictor that produces the highest correlation between predicted accuracy and true accuracy of the statements is selected.
Multiple linear regression is a known form of regression analysis in which the relationship between one or more independent variables and another variable, called dependent variable, is modeled by a least squares function, called linear regression equation. This function is a linear combination of one or more model parameters, called regression coefficients. A linear regression equation with one independent variable represents a straight line, and the results are subject to statistical analysis. In this context, conventional multiple linear regression is used.
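The training step above can be sketched with an ordinary least-squares fit. The statement summaries and the correctness values below are hypothetical toy data, and the helper names are illustrative; an intercept term is added as is usual in multiple linear regression.

```python
import numpy as np

def train_correctness_model(statement_vectors, correctness):
    """Least-squares fit of coefficients R relating semantic-space
    coordinates of statements to known correctness values
    (0 = incorrect, 1 = correct). An intercept column is prepended."""
    X = np.column_stack([np.ones(len(statement_vectors)), statement_vectors])
    R, *_ = np.linalg.lstsq(X, np.asarray(correctness, dtype=float), rcond=None)
    return R

def predict_correctness(R, statement_vector):
    """Apply the trained coefficients to a new statement's coordinates."""
    return float(R @ np.concatenate([[1.0], statement_vector]))

# Hypothetical 2-dimensional statement summaries with known correctness.
train_X = [[-1.00, 0.00], [-0.78, 0.62], [-0.78, -0.62]]
train_V = [1, 0, 1]

R = train_correctness_model(train_X, train_V)
p = predict_correctness(R, [-0.90, -0.40])
print(p)  # closer to 1 than to 0 for these toy values
```

In practice the training set would be much larger than the number of semantic dimensions, so the fit would be a genuine least-squares approximation rather than an exact solution.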
(4) Predicting correctness of statements
Following training, the model can be applied to new data. This is done by summarizing the to-be-predicted statements in the semantic space and applying the multiple linear regression. The result is a predicted probability of correctness of the statement. A simple example of summarizing such statements and conducting multiple linear regression follows.
Example:
For providing an example with numerical values, the following corpus is considered as the text on which the semantic space is created. A real corpus needs to be huge (megabytes or larger); however, for practical purposes we here consider a small toy example:
document 1 : Lars is stealing, document 2: Lars is a judge, document 3: Rolf is stealing.
The first step is to create a semantic space. In this example LSA is used, but semantic spaces can also be created using several other methods, such as probabilistic latent semantic analysis, random indexing or ICA. First a context-by-word frequency table of the words included in our corpus is made, where the words are represented in the rows and the contexts in the columns, as indicated in Table 1 below.
Table 1: Word frequency table (matrix)
Word/Contexts   document 1   document 2   document 3
Lars            1            1            0
Is              1            1            1
Stealing        1            0            1
Judge           0            1            0
Rolf            0            0            1
In a word frequency table, high-frequency words not carrying any semantic information (e.g., "a" and "the") are not present. To improve performance, the frequency table may be normalized by taking the logarithm of the frequency, but this step is here omitted for simplicity. Each cell represents the number of occurrences of a word in the context. By context is meant either a document or a subset of a document.
To create a semantic space, singular value decomposition (SVD) (Golub & Kahan, 1965) is conducted. The method of performing singular value decomposition is known within the field of linear algebra and is available as a standard package in e.g. the commercially available linear algebra package LAPACK or in the GNU Scientific Library (Anderson, et al., 1999).
The following variables are written in matrix notation, where x is the context by word frequency table (the frequency matrix of Table 1), u is the semantic space, s contains the singular values, and v contains the right singular vectors. The SVD computes an approximation of x, labeled x': x' = u * s * v'
where u, s and v can be calculated from x by applying the known algorithm of SVD:
[u s v] = SVD(x)
For details of how to calculate SVD see (Golub & Kahan, 1965).
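The SVD step above can be sketched with NumPy standing in for the LAPACK/MATLAB routines referenced in the text (NumPy is an assumption of this sketch; the appendix uses MATLAB's svd, and the sign of each dimension is arbitrary across implementations):

```python
import numpy as np

# The word-by-context frequency matrix of Table 1 (rows = words)
x = np.array([[1, 1, 0],
              [1, 1, 1],
              [1, 0, 1],
              [0, 1, 0],
              [0, 0, 1]], dtype=float)

# Compute the SVD; s holds the singular values, largest first
u, s, vt = np.linalg.svd(x, full_matrices=False)

# Keep the first two dimensions and normalize each word vector to length 1
u2 = u[:, :2] / np.linalg.norm(u[:, :2], axis=1, keepdims=True)
print(s)   # singular values
print(u2)  # normalized two-dimensional semantic space
```

The singular values match the s matrix in the appendix output (2.5243, 1.4142, 0.7923), and the magnitudes of u2 match Table 2.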
The columns of u represent the dimensions in the space and the rows represent the words. Each word is normalized to a length of 1. This is done by calculating the length of the vector representing each word and dividing the dimension values of the word with this length:
ui' = ui / ||ui||
where ui represents the semantic representation of word i and ||ui|| is the length of vector ui. Hence, u' contains the normalized values of u. For example, if u1 = [1 2], then the normalized vector with a length of one is u1' = [1 2]/(1² + 2²)^(1/2) = [5^(-1/2) 2·5^(-1/2)].
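The normalization of the example vector above can be checked directly (a minimal Python sketch; the appendix performs the same operation in MATLAB):

```python
import math

def normalize(v):
    """Scale a vector to length 1."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

u1 = normalize([1, 2])
print(u1)  # [5**-0.5, 2 * 5**-0.5], i.e. approximately [0.4472, 0.8944]
```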
A feature of the SVD algorithm is that it orders the dimensions in u by how important they are in predicting x', so that the explained variance of the first dimensions is larger than that of the later dimensions. The dimensions represent features in the semantic space. To understand which features are represented, it is necessary to interpret the dimensions. For example, Table 2 shows the first two dimensions of u' following normalization:
Table 2: the normalized semantic space (u')
Word/Dimensions 1 2
Lars -0.68 0.73
Is -1.00 0.00
Stealing -0.68 -0.73
Judge -0.39 0.92
Rolf -0.39 -0.92
The statements in the semantic space are then summarized. This summary is made by averaging the corresponding vectors in the semantic space, and then normalizing the result so that the length of the resulting vector is one. For example, the semantic representations of 'Lars' [-0.68 0.73] and 'is' [-1.00 0.00] can be averaged and then normalized to a length of 1, and the result is [-0.92, 0.40]. The semantic space can now be used to make a prediction (P) of the correctness of a statement (V).
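The averaging and normalization just described can be reproduced with the rounded Table 2 values (a Python sketch; the numbers are taken from this example):

```python
import math

lars = [-0.68, 0.73]  # semantic vector for 'Lars' from Table 2 (rounded)
is_  = [-1.00, 0.00]  # semantic vector for 'is' from Table 2

# Average the two word vectors, then normalize to length 1
avg = [(a + b) / 2 for a, b in zip(lars, is_)]
length = math.sqrt(sum(x * x for x in avg))
summary = [x / length for x in avg]
print(summary)  # close to [-0.92, 0.40], as stated in the text
```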
A prediction of correctness (P) can be made by using multiple linear regression, where we find the coefficients (R) that best describe the linear relation between the semantic space (u') and the known value of correctness of the statements (V):
V ≈ u' * R
Following the well-known algorithm for solving multiple linear regression, R can be calculated by:
R = (u'ᵀ * u')⁻¹ * u'ᵀ * V
For example, assume that we have access to an eyewitness corpus (see first column of Table 3) where the correctness (V) of the first two statements is known, but the correctness of statement three is unknown (see second column in Table 3). Notice that this eyewitness corpus does not need to be as large as the corpus that the semantic space is created from. However, to avoid overfitting, the number of statements should be at least three times the number of dimensions used in the space.
The statements are first summarized by averaging the semantic representation of the words in the statements, and then normalizing, as described above. By applying the formula above on the first two statements, the following coefficients are obtained R = [1.0 0.0 -1.61] (where the first number represents a constant that is added to the prediction and the following numbers correspond to coefficients for dimension 1 and 2 respectively). The predicted correctness of all statements (P) can then be calculated by the following formula:
P = u' * R
This formula can now be used to predict the last statement, whose correctness is unknown (and which has not been used during training).
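As a check, the prediction step can be reproduced by multiplying the fitted coefficients back onto the summarized statements (a minimal Python sketch; the numbers are taken from this example and the appendix output):

```python
# Summarized statements with a leading constant column of ones
U = [[1.0, -1.0000,  0.0000],   # 'Lars is stealing'
     [1.0, -0.7836,  0.6213],   # 'Lars is a judge'
     [1.0, -0.7836, -0.6213]]   # 'Rolf is stealing'

# Coefficients fitted on the first two statements (from the text)
R = [1.0, 0.0, -1.6096]

# Predicted correctness P = U * R
P = [sum(u_i * r_i for u_i, r_i in zip(row, R)) for row in U]
print(P)  # approximately [1.0, 0.0, 2.0]
```

The first two predictions recover the known training values, and the third is the prediction for the held-out statement.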
Table 3
Statement           V (known correctness)   P (predicted correctness)
Lars is stealing    1                        1.00
Lars is a judge     0                       -0.00
Rolf is stealing    unknown                  2.00
Table 3 shows the words in the corpora, the known correctness of the statements V (0=incorrect, 1=correct) and the predicted correctness (P). The correlation between the predicted variable and the external variable is 0.87. In this example, we would conclude that the statement "Rolf is stealing" is more likely to be correct than false because P is closer to 1 than to 0 (here, P = 2).
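The reported correlation of 0.87 can be reproduced with a standard Pearson correlation over the values in Table 3 (a Python sketch; statement three is assumed correct, as in the appendix):

```python
import math

P = [1.0, 0.0, 2.0]  # predicted correctness
V = [1.0, 0.0, 1.0]  # known correctness (statement 3 assumed to be 1)

# Pearson correlation coefficient
mp = sum(P) / len(P)
mv = sum(V) / len(V)
cov = sum((p - mp) * (v - mv) for p, v in zip(P, V))
r = cov / (math.sqrt(sum((p - mp) ** 2 for p in P)) *
           math.sqrt(sum((v - mv) ** 2 for v in V)))
print(round(r, 4))  # 0.866, matching the appendix output
```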
The calculations underlying this example are disclosed in more detail in the following appendix:
Appendix
This Appendix implements the invention, using the example and numbers described above. The implementation is written in standard MATLAB code and is commented; comment lines commence with the sign "%". The produced output follows the code.
function patentExample
%Word by context representation (x) of the words:
%'Lars' (1), 'is' (2), 'stealing' (3), 'judge' (4), and 'rolf' (5)
x=[1 1 0;1 1 1;1 0 1;0 1 0;0 0 1]
%Calculation of the SVD
[u s v] = svd(x)
%Select the first two dimensions of u, and normalize each vector to a length of 1
u=normalize(u(:,1:2))
%Averaging and normalizing 'Lars is'
normalize(u(1,:)+u(2,:))
%Creating semantic representation of eyewitness statements (u2)
u2(1,:)=u(1,:)+u(2,:)+u(3,:); %'lars is stealing'
u2(2,:)=u(1,:)+u(2,:)+u(4,:); %'lars is judge'
u2(3,:)=u(5,:)+u(2,:)+u(3,:); %'rolf is stealing'
u2=normalize(u2); %Normalize
%Setting accuracy of statements (first two known)
V=[1; 0; 1];
%Adding the constant 1 to u2
U=[ones(length(V),1) u2]
%Solving for R (regression coefficients for the known V, in V=U*R)
R=U(1:2,:)\V(1:2)
%Predicting accuracy (P)
P=U*R
%Correlating predicted and known accuracy (known accuracy for statement 3 is assumed to be 1)
corr(P,V)

function u=normalize(u)
%Normalize the length of each vector to 1
[N1 N2]=size(u);
for i=1:N1
  u(i,:)=u(i,:)/sum(u(i,:).^2)^0.5;
end
» patentExample

x =

     1     1     0
     1     1     1
     1     0     1
     0     1     0
     0     0     1

u =

   -0.4692    0.5000   -0.3935    0.0836    0.6066
   -0.6838    0.0000    0.1800   -0.6277   -0.3256
   -0.4692   -0.5000   -0.3935    0.5441   -0.2810
   -0.2146    0.5000    0.5735    0.5441   -0.2810
   -0.2146   -0.5000    0.5735    0.0836    0.6066

s =

    2.5243         0         0
         0    1.4142         0
         0         0    0.7923
         0         0         0
         0         0         0

v =

   -0.6426         0   -0.7662
   -0.5418    0.7071    0.4544
   -0.5418   -0.7071    0.4544

u =

   -0.6843    0.7292
   -1.0000    0.0000
   -0.6843   -0.7292
   -0.3944    0.9189
   -0.3944   -0.9189

ans =

   -0.9177    0.3973

U =

    1.0000   -1.0000    0.0000
    1.0000   -0.7836    0.6213
    1.0000   -0.7836   -0.6213

R =

    1.0000
         0
   -1.6096

P =

    1.0000
   -0.0000
    2.0000

ans =

    0.8660

References
Allwood, C. M., Ask, K., & Granhag, P. A. (2005). The Cognitive Interview: Effects on the realism in witnesses' confidence in their free recall. Psychology, Crime & Law, 11(2), 183-198.
Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., et al. (1999). LAPACK Users' Guide (3rd ed.). Society for Industrial and Applied Mathematics. ISBN 0898714478.
Brewer, N., & Burke, A. (2002). Effects of testimonial inconsistencies and eyewitness confidence on mock-juror judgments. Law and Human Behavior, 26(3), 353-364.
Brewer, N., Potter, R., Fisher, R. P., Bond, N., & Luszcz, M. A. (1999). Beliefs and data on the relationship between consistency and accuracy of eyewitness testimony. Applied Cognitive Psychology, 13(4), 297-313.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Fisher, R. P., & Schreiber, N. (2007). Interview protocols for improving eyewitness memory. In The handbook of eyewitness psychology, Vol. 1: Memory for events (pp. 53-80). Mahwah, NJ: Lawrence Erlbaum Associates.
Golub, G. H., & Kahan, W. (1965). Calculating the singular values and pseudo-inverse of a matrix. Journal of the Society for Industrial and Applied Mathematics: Series B, Numerical Analysis, 2(2), 205-224.
Hertwig, R., Gigerenzer, G., & Hoffrage, U. (1997). The reiteration effect in hindsight bias. Psychological Review, 104(1), 194-202.
Innocent project (n.d.). Facts on post-conviction DNA exonerations. Retrieved April 6, 2010, from http://www.innocenceproject.org/Content/Facts_on_PostConviction_DNA_Exonerations.php
Juslin, P., Olsson, N., & Winman, A. (1996). Calibration and diagnosticity of confidence in eyewitness identification: Comments on what can be inferred from the low confidence-accuracy correlation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 22(5), 1304-1316.
Kulkofsky, S. (2008). Credible but inaccurate: Can Criterion-Based Content Analysis (CBCA) distinguish true and false memories? In Child sexual abuse: Issues and challenges (pp. 21-42). Hauppauge, NY: Nova Science Publishers.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211-240.
Latent semantic analysis (2007).
Loftus, E. F., & Hoffman, H. G. (1989). Misinformation and memory: The creation of new memories. Journal of Experimental Psychology: General, 118(1), 100-104.
Luus, C. A. E., & Wells, G. L. (1994). The malleability of eyewitness confidence: Co-witness and perseverance effects. Journal of Applied Psychology, 79(5), 714-724.
Meyer, D., Leisch, F., & Hornik, K. (2003). The support vector machine under test. Neurocomputing, 55(1-2), 169-186.
Sahlgren, M. (2007). An introduction to random indexing. Stockholm University, Stockholm.
Shaw III, J. S., Garven, S., & Wood, J. M. (1997). Co-witness information can have immediate effects on eyewitness memory reports. Law and Human Behavior, 21(5), 503-523.

Claims

1. A method for predicting correctness of statements, comprising: collecting a text corpus comprising a set of words; generating a representation of the text corpus; creating a semantic space for the set of words; summarizing statements in the semantic space; training on a set of training statements where the correctness is known to identify a prediction model; and applying the model to a new statement whose correctness is to be predicted.
2. The method according to claim 1, wherein said text corpus is an electronically structured and processed set of texts, and wherein said text corpus contains texts in a single language which is the same language as is used for said new statement whose correctness is to be predicted.
3. The method according to claim 1 or 2, wherein said semantic space is calculated using an algorithm selected from the group of LSA or RI followed by compression using the algorithm SVD.
4. The method according to claim 3, wherein the frequency is normalized by taking a logarithm of the frequency when creating a semantic space using the LSA algorithm.
5. The method according to any of claims 1 - 4, wherein said text corpus comprises more than ten times, preferably more than fifty times, as many words as said statement whose correctness is to be predicted.
6. A system for predicting a value of a variable associated with a target word, said system comprising at least one computer, wherein said system is configured to: collect a text corpus comprising a set of words; generate a representation of the text corpus; create a semantic space for the set of words, based on the representation of the text corpus; train on a set of training statements where the correctness is known to identify a prediction model; and apply the model to a new statement whose correctness is to be predicted.
7. A computer readable medium having stored thereon a computer program having software instructions which when run on a computer cause the computer to perform the steps of: collecting a text corpus comprising a set of words that include the target word; generating a representation of the text corpus; creating a semantic space for the set of words, based on the representation of the text corpus; training on a set of training statements where the correctness is known to identify a prediction model; and applying the model to a new statement whose correctness is to be predicted.
PCT/SE2010/050548 2009-05-20 2010-05-20 Predicting the correctness of eyewitness' statements with semantic evaluation method (sem) WO2010134885A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
SE0900685 2009-05-20
SE0900685-9 2009-05-20

Publications (1)

Publication Number Publication Date
WO2010134885A1 true WO2010134885A1 (en) 2010-11-25

Family

ID=43126389

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SE2010/050548 WO2010134885A1 (en) 2009-05-20 2010-05-20 Predicting the correctness of eyewitness' statements with semantic evaluation method (sem)

Country Status (1)

Country Link
WO (1) WO2010134885A1 (en)


Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4839853A (en) * 1988-09-15 1989-06-13 Bell Communications Research, Inc. Computer information retrieval using latent semantic structure
GB2391967A (en) * 2002-08-16 2004-02-18 Canon Kk Information analysing apparatus
US6847966B1 (en) * 2002-04-24 2005-01-25 Engenium Corporation Method and system for optimally searching a document database using a representative semantic space
US20050049867A1 (en) * 2003-08-11 2005-03-03 Paul Deane Cooccurrence and constructions
US7149695B1 (en) * 2000-10-13 2006-12-12 Apple Computer, Inc. Method and apparatus for speech recognition using semantic inference and word agglomeration
US20070217676A1 (en) * 2006-03-15 2007-09-20 Kristen Grauman Pyramid match kernel and related techniques
US20080114755A1 (en) * 2006-11-15 2008-05-15 Collective Intellect, Inc. Identifying sources of media content having a high likelihood of producing on-topic content
US20100094814A1 (en) * 2008-10-13 2010-04-15 James Alexander Levy Assessment Generation Using the Semantic Web


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SARGUR SRIHARI ET AL: "Automatic scoring of short handwritten essays in reading comprehension tests", ARTIFICIAL INTELLIGENCE, vol. 172, no. 2-3, February 2008 (2008-02-01), pages 300 - 324, XP022392587 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113240177A (en) * 2021-05-13 2021-08-10 北京百度网讯科技有限公司 Method for training prediction model, prediction method, prediction device, electronic device and medium
CN113240177B (en) * 2021-05-13 2023-12-19 北京百度网讯科技有限公司 Method for training prediction model, prediction method, device, electronic equipment and medium


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10778025

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC, EPO FORM 1205A DATED 29.02.2012

122 Ep: pct application non-entry in european phase

Ref document number: 10778025

Country of ref document: EP

Kind code of ref document: A1