US20130275461A1 - Computer-Implemented Systems and Methods for Identifying Factual Information in a Written Document - Google Patents
- Publication number
- US20130275461A1 (application US13/795,126)
- Authority
- US
- United States
- Prior art keywords
- noun phrases
- named entities
- query
- fact repository
- written document
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G06F17/30389—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/2423—Interactive query statement specification based on a database schema
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/2428—Query predicate definition using graphical user interfaces, including menus and forms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3346—Query execution using probabilistic model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3347—Query execution using vector based model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3349—Reuse of stored results of previous queries
Definitions
- FIG. 1 is a flow diagram illustrating an example of a method for identifying factual information in written documents.
- FIG. 2 is a flow diagram illustrating another example of a method for identifying factual information in written documents.
- FIG. 3 is a flow diagram illustrating another example of a method for identifying factual information in written documents.
- FIGS. 4A, 4B, and 4C are block diagrams illustrating example systems for use in identifying factual information in written documents.
- factual information may be important in a variety of contexts including the scoring of essays and the like.
- a fact can be understood in a number of different manners.
- argumentation e.g., an argumentative essay
- the notion of a fact may be characterized as data which is common to several beings and for which there is agreement as to the correctness of that data.
- a fact can be distinguished from a presumption which may be a statement about what is normal and/or likely.
- this distinction in the scope of required agreement may be related to the referential device used in a particular statement.
- the identified statements may be compared against a fact repository.
- the fact repository may be an encyclopedia, the World Wide Web, a repository based on Open Information Extraction (OIE), and/or the TextRunner system.
- FIG. 1 is a flow diagram illustrating an example of a method for identifying factual information in written documents.
- the process begins 105 with identifying a named entity (NE) 110 .
- the named entity may comprise at least one of people, proper names, locations, organizations, government, awards, events, science and technology, and/or art.
- the named entity may be the subject of a particular statement or sentence or the argument of the predicate of a sentence.
- the named entity may be identified from a written document by comparing the words and/or phrases in the written document with named entities contained within an existing set of data.
- the named entities may be identified using the Stanford Named Entity Recognizer.
- a noun phrase is generally a word or phrase which includes a noun and the modifiers which distinguish it. Selection of the noun phrase may be based on, for example, a grammar-based approach.
- noun phrase may be identified using a dependency path.
- the dependency paths may be obtained from the Stanford Dependency Parser.
- the dependency path may be an upward step followed by between one and four downward steps. For example, it is believed that the most prolific family of paths starts with an upward step followed by between one and four downward steps. The first upward step may connect the named entity to the predicate of which it is an argument.
- the downward step(s) may connect the predicate to the head of another argument (e.g., noun phrase) or to an argument's head's modifier.
- Some examples of statements with different dependency paths include: “a Nobel Prize in a science field” (one downward step); “Chaucer, in the 14th century . . . ” (one downward step); “the prestige of the Nobel Prize” (one upward step); “Kidman's talent” (one upward step); “Kroemer received the Nobel Prize” (one upward step followed by one downward step); and “Kroemer received the Nobel Prize for his work on the Heterojunction Bipolar Transistor” (one upward step followed by two downward steps).
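The path family described above (one upward step, then one to four downward steps) can be sketched on a toy dependency tree. This is a minimal illustration, not the patent's implementation: the head map below is hand-built rather than produced by the Stanford Dependency Parser, the function name `candidates_via_path` is hypothetical, and tokens are assumed to be unique within the sentence.

```python
from collections import defaultdict

def candidates_via_path(heads, start, max_down=4):
    """Return {token: number_of_downward_steps} for tokens reachable from
    `start` by one upward step to its head, then 1..max_down downward steps."""
    children = defaultdict(list)
    for token, head in heads.items():
        if head is not None:
            children[head].append(token)
    top = heads.get(start)
    if top is None:              # start is the root, so no upward step exists
        return {}
    found = {}
    frontier = [c for c in children[top] if c != start]
    depth = 1
    while frontier and depth <= max_down:
        next_frontier = []
        for token in frontier:
            found.setdefault(token, depth)
            next_frontier.extend(children[token])
        frontier = next_frontier
        depth += 1
    return found

# Hand-built (assumed) head map for "Kroemer received the Nobel Prize for his work"
heads = {
    "received": None,
    "Kroemer": "received",
    "Prize": "received",
    "the": "Prize",
    "Nobel": "Prize",
    "work": "received",
    "for": "work",
    "his": "work",
}
paths = candidates_via_path(heads, "Kroemer")
```

In this toy tree, "Prize" is one downward step from the predicate "received" and its modifier "Nobel" is two.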
- the noun phrase may be contained within the same sentence as the corresponding named entity or it may be located in a neighboring sentence to the one with the named entity.
- the noun phrase may be identified from a neighboring sentence if the corresponding named entity is a person and/or the neighboring sentence includes at least one of an appropriate personal pronoun and/or a portion of the named entity (e.g., just a last name of a person).
- the process may confirm that the gender of the pronoun matches that of the named entity and/or if the gender of the named entity cannot be confirmed, the process may not expand identification of the noun phrase into a neighboring sentence.
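The neighboring-sentence rule can be sketched as a simple check. The pronoun lists, the function name `may_use_neighbor`, and the handling of unknown gender (the pronoun test is skipped, but a shared name part still qualifies) are assumptions made for illustration; the text leaves these details open.

```python
import re

# Assumed pronoun lists, not taken from the source.
PRONOUNS = {
    "male": {"he", "him", "his"},
    "female": {"she", "her", "hers"},
}

def may_use_neighbor(entity, is_person, gender, neighbor_sentence):
    """Decide whether a noun phrase may be taken from a neighboring sentence."""
    if not is_person:
        return False
    words = re.findall(r"[A-Za-z']+", neighbor_sentence)
    # A portion of the named entity (e.g., just the last name) is enough.
    if any(part in words for part in entity.split()):
        return True
    # Without a confirmed gender, do not expand on pronoun evidence.
    if gender not in PRONOUNS:
        return False
    return any(w.lower() in PRONOUNS[gender] for w in words)
```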
- the written document from which the named entity and noun phrase are identified may be, e.g., a test taker's essay, and/or the identification of factual information may be utilized in the scoring of the test taker's essay.
- the named entity and the noun phrase are used to build a query 120 .
- the query may be structured as a 3-tuple query.
- the structure of the query may be <NE, ?, NP>.
- the “?” may be the predicate that links the named entity with the noun phrase.
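One way to represent the 3-tuple query is a small named tuple whose predicate slot is left open. The class and field names here are hypothetical; the source only specifies the <NE, ?, NP> shape.

```python
from typing import NamedTuple, Optional

class Query(NamedTuple):
    named_entity: str
    predicate: Optional[str]   # None stands for the unknown "?" slot
    noun_phrase: str

def build_query(named_entity, noun_phrase):
    """Combine a named entity with a noun phrase; the predicate is left open."""
    return Query(named_entity, None, noun_phrase)

q = build_query("Kroemer", "the Nobel Prize")
```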
- the query is submitted for comparison to a fact repository 125 .
- the fact repository may be an encyclopedia, the World Wide Web, a repository based on Open Information Extraction (OIE), and/or the TextRunner system.
- the comparison of the query with the fact repository assesses whether the query presents a factual assertion 130 .
- the query is built with the belief that the assertion is factual but it is unknown whether the assertion is actually true.
- the process determines whether there is a match within a data set that is believed to contain facts. If the query does match corresponding information within the fact repository, a match is returned 135 .
- the match may require that the fact repository contain a corresponding named entity and noun phrase to the ones in the query.
- the named entity may need to be contained within the fact repository but the noun phrase may not need to be exactly present.
- neither the named entity nor the noun phrase in the fact repository would need to be exactly matched to the query as long as some predetermined criteria are met.
- the predicate in the query may or may not need to be matched.
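A relaxed match of the kind described above might look like the following sketch. The repository is a made-up in-memory list of triples standing in for an OIE-style store; it does not reflect the TextRunner interface, and the containment criterion is only one possible "predetermined criteria".

```python
def find_matches(query, repository):
    """Return repository triples whose first argument contains the query's
    named entity and whose second argument contains the query's noun phrase.
    The predicate is not required to match."""
    named_entity, _, noun_phrase = query
    ne, np = named_entity.lower(), noun_phrase.lower()
    return [
        (arg1, pred, arg2)
        for arg1, pred, arg2 in repository
        if ne in arg1.lower() and np in arg2.lower()
    ]

# Toy repository invented for illustration.
repo = [
    ("Herbert Kroemer", "received", "the Nobel Prize in Physics"),
    ("Kroemer", "worked on", "the Heterojunction Bipolar Transistor"),
]
matches = find_matches(("Kroemer", None, "Nobel Prize"), repo)
```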
- the process determines whether there are any additional named entities and/or noun phrases 140 . If there are, the process begins again and if there are not, the process terminates 145 .
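The overall loop of FIG. 1 can be sketched as follows, with the reference numerals noted in comments. The `lookup` callable and the `KNOWN` table are hypothetical stand-ins for the fact repository comparison.

```python
def identify_facts(pairs, lookup):
    """Build one query per (named entity, noun phrase) pair, submit it,
    and keep the pairs for which the repository returns a match."""
    results = {}
    for named_entity, noun_phrase in pairs:        # 140: any pairs left?
        query = (named_entity, None, noun_phrase)  # 120: build <NE, ?, NP>
        matches = lookup(query)                    # 125/130: submit and assess
        if matches:                                # 135: a match is returned
            results[(named_entity, noun_phrase)] = matches
    return results                                 # 145: terminate

# Hypothetical stand-in for a fact repository lookup.
KNOWN = {("Kroemer", "Nobel Prize"): ["received"]}

def toy_lookup(query):
    named_entity, _, noun_phrase = query
    return KNOWN.get((named_entity, noun_phrase), [])

found = identify_facts(
    [("Kroemer", "Nobel Prize"), ("Kroemer", "Oscar")], toy_lookup
)
```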
- FIG. 2 is a flow diagram illustrating another example of a method for identifying factual information in written documents.
- FIG. 2 is similar to the example illustrated in FIG. 1 except that an additional series of steps 200 is included to create variants of the query built at 125.
- the variants are built 200 before submitting any queries for comparison to the fact repository.
- the variants may be created after the initial query is submitted and then individually or collectively submitted for comparison with the fact repository.
- numerous variants of the query may be created by, for example, modifying the noun phrase.
- the variants may assist in increasing the chances of finding a match for a particular named entity and noun phrase.
- one way a query may be modified is to remove determiners and/or pre-modifiers 210 . For example, if the noun phrase was “a very beautiful photograph,” the modified phrase may be “beautiful photograph.”
- the noun phrase can be modified to create a query variant that comprises a sequence of nouns ending with the head noun 220 .
- the noun phrase may be modified to “photograph.”
- the noun phrase can be modified to create a query variant that comprises only the word from the noun phrase that has the lowest frequency of occurrence.
- capitalized words may be given the lowest frequency, so that if the noun phrase contains any capitalized word, the variant might contain the leftmost capitalized word (e.g., the first capitalized word); if an out-of-vocabulary word is present in the noun phrase, the variant might contain the out-of-vocabulary word.
- the name may be split such that only the first name is taken in the variant. For example, in the noun phrase “that author Orhan Phamuk” the variant noun phrase may be “Orhan.” If no capitalized word exists, the variant may simply select the rarest word from within the phrase. For example, if the noun phrase was “category 3 hurricane” the variant noun phrase may be “hurricane.”
- the noun phrase can be modified to create a query variant that comprises only the rightmost capitalized word, if the noun phrase includes capitalized parts. For example, if the noun phrase was “the actress Nicole Kidman” the variant noun phrase would be “Kidman.” This variant may serve to select last names as a potential complement to the variant discussed above which potentially selects only first names.
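The four noun-phrase variants described above can be sketched over pre-tagged tokens. The sketch assumes Penn Treebank style tags supplied by hand, assumes the head noun is the last token, and uses made-up frequency counts in `TOY_FREQ`; all function names are hypothetical.

```python
def drop_determiners(tagged):
    """Remove determiners (DT) and adverbial pre-modifiers (RB)."""
    return [w for w, t in tagged if t not in {"DT", "RB"}]

def noun_sequence(tagged):
    """Trailing run of nouns that ends with the head noun."""
    out = []
    for word, tag in reversed(tagged):
        if not tag.startswith("NN"):
            break
        out.append(word)
    return out[::-1]

def rarest_word(tagged, freq):
    """Leftmost capitalized word if any; otherwise the lowest-frequency word.
    Words missing from `freq` count as frequency 0 (out of vocabulary)."""
    words = [w for w, _ in tagged]
    capitalized = [w for w in words if w[0].isupper()]
    if capitalized:
        return capitalized[0]
    return min(words, key=lambda w: freq.get(w.lower(), 0))

def rightmost_capitalized(tagged):
    """Rightmost capitalized word, or None if the phrase has none."""
    capitalized = [w for w, _ in tagged if w[0].isupper()]
    return capitalized[-1] if capitalized else None

TOY_FREQ = {"category": 500, "3": 900, "hurricane": 40}  # made-up counts

photo = [("a", "DT"), ("very", "RB"), ("beautiful", "JJ"), ("photograph", "NN")]
author = [("that", "DT"), ("author", "NN"), ("Orhan", "NNP"), ("Phamuk", "NNP")]
actress = [("the", "DT"), ("actress", "NN"), ("Nicole", "NNP"), ("Kidman", "NNP")]
storm = [("category", "NN"), ("3", "CD"), ("hurricane", "NN")]
```

With these inputs the functions reproduce the examples in the text: "beautiful photograph", "photograph", "Orhan", "hurricane", and "Kidman".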
- FIG. 3 is a flow diagram illustrating another example of a method for identifying factual information in written documents.
- FIG. 3 is similar to the example illustrated in FIG. 1 except that an additional series of steps 320 - 370 are included to filter out potentially undesirable matches from those returned by the comparison with the fact repository 135 .
- the filters illustrated in FIG. 3 may also be combined, for example, with the variants illustrated in FIG. 2 . In the example illustrated in FIG. 3 , the filtering is performed after each match is returned but could also be performed after some or all of the matches are returned. Filters such as those shown in FIG. 3 may be desirable in examples where matches are returned based on predetermined criteria that potentially yields some undesirable matches.
- Matches may be filtered if the fact (e.g., named entity and/or noun phrase) in the fact repository comprises modal or hedged predicates 310 . For example, matches based on predicates such as “might turn out to be” or “possibly attended” may be filtered out. Similarly, matches based on future tense predicates may be filtered out as well.
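A modal, hedged, or future-tense predicate filter could be approximated with a cue-word list, as in this sketch. The cue words are an assumption chosen to cover the examples above; a fuller treatment would likely need part-of-speech or tense information.

```python
import re

# Assumed cue words for modal, hedged, and future-tense predicates.
HEDGE_CUES = re.compile(
    r"\b(?:might|may|could|possibly|probably|perhaps|allegedly|will|shall)\b",
    re.IGNORECASE,
)

def is_hedged(predicate):
    """True if the predicate contains a modal, hedging, or future-tense cue."""
    return bool(HEDGE_CUES.search(predicate))
```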
- Matches may be filtered if the fact in the fact repository is more specific than the one in the query 320 .
- the match may be filtered if any of the following conditions are met.
- the match may be filtered if a capitalized word follows the fact in the fact repository but is not present in the sentence (or neighboring sentence) from which the query was identified 330 .
- the match may be filtered if more than one capitalized or rare word precedes the fact in the fact repository, those words are not present in the sentence (or neighboring sentence) from which the query was identified, and they are not honorifics 340.
- the match may be filtered if the fact in the fact repository is longer than eight words 350 .
- the match may be filtered if more than three words follow the fact in the fact repository 360 .
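The four specificity conditions (330 to 360) can be sketched as one function. Several details are assumptions: the honorific list, whitespace tokenization, reading "follows the fact" as the immediately following word, and counting the words of the whole repository text for the eight-word limit.

```python
HONORIFICS = {"mr", "mrs", "ms", "dr", "sir", "prof"}  # assumed list

def norm(word):
    return word.strip(".,;:!?\"'").lower()

def too_specific(repo_text, fact, source_sentence, rare_words=frozenset()):
    """True if the repository text is more specific than the queried fact."""
    words = repo_text.split()
    target = [norm(w) for w in fact.split()]
    n = len(target)
    normed = [norm(w) for w in words]
    start = next(
        (i for i in range(len(words) - n + 1) if normed[i:i + n] == target),
        None,
    )
    if start is None:
        return False  # fact not found in the repository text
    before, after = words[:start], words[start + n:]
    in_source = {norm(w) for w in source_sentence.split()}

    def novel(word):
        return norm(word) not in in_source

    # 330: a capitalized word right after the fact that the source lacks
    if after and after[0][0].isupper() and novel(after[0]):
        return True
    # 340: more than one capitalized or rare preceding word that is
    # absent from the source and is not an honorific
    extra = [
        w for w in before
        if (w[0].isupper() or norm(w) in rare_words)
        and novel(w)
        and norm(w) not in HONORIFICS
    ]
    if len(extra) > 1:
        return True
    # 350: repository text longer than eight words
    if len(words) > 8:
        return True
    # 360: more than three words follow the fact
    return len(after) > 3

SOURCE = "Kroemer received the Nobel Prize."
```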
- Matches may also be filtered if the ratio of negative to positive predicates among a plurality of matches is greater than a predetermined threshold 370 .
- a query such as <Barack Obama, ?, US citizen> may be filtered out based on the following pattern of matches:
- matches may be filtered if the matches themselves reflect a lack of consensus and/or an argumentative statement.
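The negative-to-positive ratio test can be sketched as below. The negation cues and the default threshold of 0.5 are assumptions (the source says only "a predetermined threshold"), and the predicate lists in the usage lines are invented for illustration.

```python
NEGATION_CUES = {"not", "never", "no", "n't"}  # assumed cue list

def is_negative(predicate):
    """True if the predicate contains a negation cue (handles "isn't")."""
    tokens = predicate.lower().replace("n't", " n't").split()
    return any(t in NEGATION_CUES for t in tokens)

def lacks_consensus(predicates, threshold=0.5):
    """True if the ratio of negative to positive predicates exceeds threshold."""
    negative = sum(1 for p in predicates if is_negative(p))
    positive = len(predicates) - negative
    if positive == 0:
        return negative > 0
    return negative / positive > threshold

disputed = lacks_consensus(["is", "is not", "is", "isn't"])
agreed = lacks_consensus(["is", "is", "was"])
```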
- Although the example filters are shown serially in FIG. 3, in an example only one or only two of the filters may be included. Additionally, the filters may be configured such that more than one filter needs to be satisfied before a match is filtered out. For example, the filters may be configured such that a match is not filtered out unless the noun phrase comprises a modal or hedged predicate and the fact in the fact repository is more specific than the one in the query.
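Configuring how many filters must fire before a match is dropped can be sketched with a simple count. The two filter predicates here are simplified, hypothetical stand-ins operating on an (argument, predicate, argument) triple.

```python
def should_filter(match, filters, required=1):
    """Filter the match out when at least `required` filters fire."""
    fired = sum(1 for f in filters if f(match))
    return fired >= required

# Two hypothetical filter predicates over an (arg1, predicate, arg2) match.
def hedged(match):
    return any(w in match[1].lower().split() for w in ("might", "possibly"))

def long_fact(match):
    return len(match[2].split()) > 8

m = ("Kidman", "possibly attended", "a school")
any_mode = should_filter(m, [hedged, long_fact], required=1)
all_mode = should_filter(m, [hedged, long_fact], required=2)
```

With `required=1` any single filter removes the match; with `required=2` both must fire, which corresponds to the combined configuration described above.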
- FIGS. 4A, 4B, and 4C depict example systems for use in implementing recognition of phrasal terms.
- FIG. 4A illustrates an exemplary system 400 that includes a standalone computer architecture where a processing system 402 (e.g., one or more computer processors located in a given computer or in multiple computers that may be separate and distinct from one another) includes a fact identification engine 404 being executed on it.
- the processing system 402 has access to at least one computer-readable memory 406 in addition to one or more data stores 408 .
- the one or more data stores 408 may include the queries (and/or written documents) 410 as well as a fact repository 412 .
- FIG. 4B depicts a system 420 that includes a client server architecture.
- One or more user PCs 422 access one or more servers 424 running a part of fact recognition engine 426 on a processing system 427 via one or more networks 428 .
- the one or more servers 424 may access a computer readable memory 430 as well as one or more data stores 432 .
- the one or more data stores 432 may contain queries (and/or written documents) 434 as well as a fact repository 436 .
- FIG. 4C shows a block diagram of exemplary hardware for a standalone computer architecture 450 , such as the architecture depicted in FIG. 4A that may be used to contain and/or implement the program instructions of system embodiments of the present invention.
- a bus 452 may serve as the information highway interconnecting the other illustrated components of the hardware.
- a processing system 454 labeled CPU (central processing unit), e.g., one or more computer processors at a given computer or at multiple computers, may perform calculations and logic operations required to execute a program.
- a non-transitory processor-readable storage medium such as read only memory (ROM) 456 and random access memory (RAM) 458 , may be in communication with the processing system 454 and may contain one or more programming instructions for performing the method of implementing a part of speech pattern scoring engine.
- program instructions may be stored on a non-transitory computer readable storage medium such as a magnetic disk, optical disk, recordable memory device, flash memory, or other physical storage medium.
- a disk controller 460 interfaces one or more optional disk drives to the system bus 452 .
- These disk drives may be external or internal floppy disk drives such as 462 , external or internal CD-ROM, CD-R, CD-RW or DVD drives such as 464 , or external or internal hard drives 466 .
- These various disk drives and disk controllers may be optional devices.
- Each of the element managers, real-time data buffer, conveyors, file input processor, database index shared access memory loader, reference data buffer and data managers may include a software application stored in one or more of the disk drives connected to the disk controller 460 , the ROM 456 and/or the RAM 458 .
- the processor 454 may access each component as required.
- a display interface 468 may permit information from the bus 452 to be displayed on a display 470 in audio, graphic, or alphanumeric format. Communication with external devices may optionally occur using various communication ports 472 .
- the hardware may also include data input devices, such as a keyboard 473 , or other input device 474 , such as a microphone, remote control, pointer, mouse and/or joystick.
- the methods and systems described herein may be implemented on many different types of processing devices by program code comprising program instructions that are executable by the device processing subsystem.
- the software program instructions may include source code, object code, machine code, or any other stored data that is operable to cause a processing system to perform the methods and operations described herein.
- Other implementations may also be used, however, such as firmware or even appropriately designed hardware configured to carry out the methods and systems described herein.
- the systems' and methods' data may be stored and implemented in one or more different types of computer-implemented data stores, such as different types of storage devices and programming constructs (e.g., RAM, ROM, Flash memory, flat files, databases, programming data structures, programming variables, IF-THEN (or similar type) statement constructs, etc.).
- data structures describe formats for use in organizing and storing data in databases, programs, memory, or other computer-readable media for use by a computer program.
- a module or processor includes but is not limited to a unit of code that performs a software operation, and can be implemented for example as a subroutine unit of code, or as a software function unit of code, or as an object (as in an object-oriented paradigm), or as an applet, or in a computer script language, or as another type of computer code.
- the software components and/or functionality may be located on a single computer or distributed across multiple computers depending upon the situation at hand.
Abstract
Description
- This application claims the benefit of U.S. Provisional Application No. 61/622,819, filed on Apr. 11, 2012, the entire contents of which are incorporated herein by reference.
- This document relates generally to identifying factual information and more particularly to computer implemented systems and methods for identifying factual information in a written document.
- Automated scoring of essays involves evaluating various aspects of the essay itself, including the grammar, usage, mechanics, organization, and substantive content. For assessment of content, the focus has traditionally been on the topical appropriateness of the vocabulary. Recently, other aspects such as detection of sentiment or figurative language have also been considered. Although it is well known that a misleading premise, an insufficient factual basis, or an example that contradicts the reader's knowledge all detract from the quality of an essay, the effect that factual information in an essay has on the overall quality of the essay has not been addressed. It is believed that the use of factual information in an essay is correlated to the overall quality of the essay. Accordingly, identification and verification of factual information is important in a variety of contexts, including the scoring of essays and the like.
- In accordance with the teachings herein, systems and methods are provided for identifying factual information in a written document. For example, a computer implemented method for identifying factual information in a written document may include identifying one or more named entities in the written document and identifying one or more noun phrases in the written document that are associated with a corresponding one or more named entity. Using a named entity and a noun phrase, at least one query may be built by combining the named entity with a respective noun phrase. Since it is unknown at this stage whether the query corresponds to a fact, the query corresponds to an assertion, i.e., a statement that is believed to correspond to a fact but has not been verified as a fact. The query is submitted for comparison with a fact repository and an assessment is made as to whether the query presents a factual assertion, i.e., whether the assertion represented in the query is present in the fact repository. If it is present, a match is returned.
- As another example, a system for identifying factual information in a written document may include one or more data processors and one or more computer readable mediums encoded with instructions for commanding the one or more data processors to perform processing steps. In the steps, one or more named entities in the written document may be identified and one or more noun phrases in the written document that are associated with a corresponding one or more named entity may also be identified. Using a named entity and a noun phrase, at least one query may be built by combining the named entity with a respective noun phrase. Since it is unknown at this stage whether the query corresponds to a fact, the query corresponds to an assertion, i.e., a statement that is believed to correspond to a fact but has not been verified as a fact. The query is submitted for comparison with a fact repository and an assessment is made as to whether the query presents a factual assertion, i.e., whether the assertion represented in the query is present in the fact repository. If it is present, a match is returned.
- As a further example, a computer readable medium may be encoded with instructions for commanding one or more data processors to perform processing steps. In the steps, one or more named entities in the written document may be identified and one or more noun phrases in the written document that are associated with a corresponding one or more named entity may also be identified. Using a named entity and a noun phrase, at least one query may be built by combining the named entity with a respective noun phrase. Since it is unknown at this stage whether the query corresponds to a fact, the query corresponds to an assertion, i.e., a statement that is believed to correspond to a fact but has not been verified as a fact. The query is submitted for comparison with a fact repository and an assessment is made as to whether the query presents a factual assertion, i.e., whether the assertion represented in the query is present in the fact repository. If it is present, a match is returned.
- In still further examples, the noun phrase may be identified from the same sentence as the corresponding named entity and/or the noun phrase may be identified from a neighboring sentence to the named entity. For example, the noun phrase may be identified from a neighboring sentence if the corresponding named entity is a person and the neighboring sentence, from which the noun phrase is identified, includes at least one of an appropriate personal pronoun or a portion of the named entity.
- In still further examples, the noun phrases may be identified using a dependency path of sentence structure. For example, the dependency path may be an upward step followed by between one and four downward steps (e.g., 1, 2, 3, or 4 downward steps).
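The path shape described above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical string encoding in which "U" stands for one upward step and "D" for one downward step; neither the encoding nor the function name comes from the disclosure.

```python
import re

# Hypothetical encoding (an assumption for illustration):
# "U" is one upward step, "D" is one downward step in the dependency tree.
_UP_THEN_DOWN = re.compile(r"UD{1,4}")


def is_candidate_path(steps: str) -> bool:
    """Return True for one upward step followed by one to four downward steps."""
    return _UP_THEN_DOWN.fullmatch(steps) is not None
```

Under this encoding, a path such as "UD" or "UDD" is accepted, while a purely upward or purely downward path is not.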
- In still further examples, the process may further comprise building variants of the query. For example, the variant of the query may be constructed by modifying the noun phrase. For example, a variant may be created by the removal of determiners and/or pre-modifiers from the noun phrase. A variant may be created by modifying the noun phrase to only include a sequence of nouns ending with the head noun. Another variant may be a noun phrase that is modified such that it comprises only the word from the identified noun phrase that has the lowest frequency of occurrence. A further example of a variant includes a noun phrase that is modified such that it comprises only the rightmost capitalized word of the identified noun phrase, if the identified noun phrase includes capitalized parts.
- In still further examples, the process may further comprise filtering matches to eliminate undesired matches. For example, the match may be filtered if the matched noun phrase in the fact repository comprises modal or hedged predicates. Additionally, the match may be filtered if the named entity or the noun phrase in the fact repository is more specific than the named entity or the noun phrase in the query. In a further example, the match may be filtered if any of a plurality of conditions is met. Such conditions may include, for example: (i) if a capitalized word follows the named entity or noun phrase in the fact repository but is not present in the portion of the written document from which the named entity or noun phrase is identified; (ii) if more than one capitalized or rare word precedes the named entity or noun phrase in the fact repository, those words are not present in the portion of the written document from which the named entity or noun phrase is identified, and those words are not honorifics; (iii) if the named entity or noun phrase in the fact repository is longer than eight words; or (iv) if more than three words follow the named entity or noun phrase in the fact repository. Additionally, the match may be filtered if the ratio of negative to positive predicates among a plurality of matches is greater than a predetermined threshold.
-
FIG. 1 is a flow diagram illustrating an example of a method for identifying factual information in written documents; -
FIG. 2 is a flow diagram illustrating another example of a method for identifying factual information in written documents; -
FIG. 3 is a flow diagram illustrating another example of a method for identifying factual information in written documents; and -
FIGS. 4A, 4B, and 4C are block diagrams illustrating example systems for use in identifying factual information in written documents. - As discussed above, identification and verification of factual information may be important in a variety of contexts including the scoring of essays and the like. A fact can be understood in a number of different manners. For example, in the context of argumentation (e.g., an argumentative essay) the notion of a fact may be characterized as data which is common to several beings and for which there is agreement as to the correctness of that data. In some examples, a fact can be distinguished from a presumption which may be a statement about what is normal and/or likely. In particular, this distinction in the scope of required agreement may be related to the referential device used in a particular statement. If the reference is more rigid, that is, less prone to change in time and to indeterminacy of the boundaries, the scope of necessary agreement is likely to be more precise. For example, statements made in connection with proper names may be more rigid than others (e.g., “Barack Obama” selects for one, and the same, person in 2010 and 1990 but “current U.S. president” selects for different people at different times).
- In addition to identification of facts, it is also important to be able to verify that the identified statements are actually true. As discussed throughout this disclosure, the identified statements may be compared against a fact repository. For example, the fact repository may be an encyclopedia, the world wide web, a repository based on Open Information Extraction (OIE), and/or the TextRunner system.
-
FIG. 1 is a flow diagram illustrating an example of a method for identifying factual information in written documents. As shown in FIG. 1, the process begins 105 with identifying a named entity (NE) 110. For example, the named entity may comprise at least one of people, proper names, locations, organizations, government, awards, events, science and technology, and/or art. In general, the named entity may be the subject of a particular statement or sentence or the argument of the predicate of a sentence. In an example, the named entity may be identified from a written document by comparing the words and/or phrases in the written document with named entities contained within an existing set of data. For example, the named entities may be identified using the Stanford Named Entity Recognizer. - In addition to identifying a named
entity 110, the process continues with the identification of a corresponding noun phrase (NP) 115. A noun phrase is generally a word or phrase which includes a noun and the modifiers which distinguish it. Selection of the noun phrase may be based on, for example, a grammar-based approach. For example, the noun phrase may be identified using a dependency path. In an example, the dependency paths may be obtained from the Stanford Dependency Parser. In particular, the dependency path may be an upward step followed by between one and four downward steps. For example, it is believed that the most prolific family of paths starts with an upward step and continues with between one and four downward steps. The first upward step may connect the named entity to the predicate of which it is an argument. The downward step(s) may connect the predicate to the head of another argument (e.g., noun phrase) or to an argument's head's modifier. Some examples of statements with different dependency paths include: “a Nobel Prize in a science field” (one downward step); “Chaucer, in the 14th century . . . ” (one downward step); “the prestige of the Nobel Prize” (one upward step); “Kidman's talent” (one upward step); “Kroemer received the Nobel Prize” (one upward step followed by one downward step); and “Kroemer received the Nobel Prize for his work on the Heterojunction Bipolar Transistor” (one upward step followed by two downward steps). - In an example, the noun phrase may be contained within the same sentence as the corresponding named entity or it may be located in a neighboring sentence to the one with the named entity. For example, the noun phrase may be identified from a neighboring sentence if the corresponding named entity is a person and/or the neighboring sentence includes at least one of an appropriate personal pronoun and/or a portion of the named entity (e.g., just a last name of a person). 
In an example, the process may confirm that the gender of the pronoun matches that of the named entity, and/or, if the gender of the named entity cannot be confirmed, the process may not expand identification of the noun phrase into a neighboring sentence.
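The neighboring-sentence test described above can be sketched as follows. This is an illustrative reading only: the pronoun lists, the "PERSON" label, the function name, and the choice to still accept a partial-name match when the gender is unknown are all assumptions, not details taken from the disclosure.

```python
import re
from typing import Optional

# Hypothetical pronoun lists, keyed by the (assumed) gender label of the entity.
PRONOUNS = {
    "female": {"she", "her", "hers", "herself"},
    "male": {"he", "him", "his", "himself"},
}


def may_use_neighbor(entity: str, entity_type: str,
                     gender: Optional[str], neighbor: str) -> bool:
    """Decide whether a noun phrase may be taken from a neighboring sentence."""
    if entity_type != "PERSON":
        return False
    tokens = re.findall(r"[A-Za-z']+", neighbor)
    # A portion of the named entity (e.g., just the last name) is sufficient.
    if set(entity.split()) & set(tokens):
        return True
    # Assumption: with unconfirmed gender, the pronoun route is not used.
    if gender is None:
        return False
    lowered = {t.lower() for t in tokens}
    return bool(PRONOUNS.get(gender, set()) & lowered)
```

For example, under these assumptions a sentence beginning "She won an award." would qualify for "Nicole Kidman" only when the entity's gender has been confirmed as female.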
- In an example, the written document from which the named entity and noun phrase are identified is, e.g., a test taker's essay, and/or the identification of factual information is utilized in the scoring of the test taker's essay.
- The named entity and the noun phrase are used to build a
query 120. For example, the query may be structured as a 3-tuple query. For example, the structure of the query may be <NE, ?, NP>. In examples, the “?” may be the predicate that links the named entity with the noun phrase. - The query is submitted for comparison to a
fact repository 125. For example, the fact repository may be an encyclopedia, the world wide web, a repository based on Open Information Extraction (OIE), and/or the TextRunner system. The comparison of the query with the fact repository assesses whether the query presents a factual assertion 130. In particular, the query is built with the belief that the assertion is factual but it is unknown whether the assertion is actually true. By comparing the query to the fact repository, the process determines whether there is a match within a data set that is believed to contain facts. If the query does match corresponding information within the fact repository, a match is returned 135. For example, the match may require that the fact repository contain a named entity and noun phrase corresponding to the ones in the query. In another example, the named entity may need to be contained within the fact repository but the noun phrase may not need to be exactly present. In another example, neither the named entity nor the noun phrase in the fact repository would need to exactly match the query as long as some predetermined criterion is met. In yet another example, the predicate in the query may or may not need to be matched. - After completing the matching process for the identified named entity and corresponding noun phrase, the process determines whether there are any additional named entities and/or
noun phrases 140. If there are, the process begins again and if there are not, the process terminates 145. -
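The query-building and matching steps of FIG. 1 can be illustrated with a small sketch. It assumes a toy in-memory repository of (named entity, predicate, noun phrase) triples and exact, case-insensitive matching; the names `Query`, `build_query`, `find_matches`, and `TOY_FACTS` are hypothetical, and a real fact repository such as one built by Open Information Extraction would be queried differently.

```python
from typing import List, NamedTuple, Optional, Tuple

Fact = Tuple[str, str, str]  # (named entity, predicate, noun phrase)


class Query(NamedTuple):
    named_entity: str
    predicate: Optional[str]  # None plays the role of "?" in <NE, ?, NP>
    noun_phrase: str


# Toy data for illustration only; not an actual fact repository.
TOY_FACTS: List[Fact] = [
    ("Kroemer", "received", "Nobel Prize"),
    ("Chaucer", "wrote", "Canterbury Tales"),
]


def build_query(named_entity: str, noun_phrase: str) -> Query:
    """Combine a named entity and a noun phrase, leaving the predicate open."""
    return Query(named_entity, None, noun_phrase)


def find_matches(query: Query, repository: List[Fact]) -> List[Fact]:
    """Return repository triples whose entity and noun phrase match the query."""
    return [
        fact for fact in repository
        if fact[0].lower() == query.named_entity.lower()
        and fact[2].lower() == query.noun_phrase.lower()
        and (query.predicate is None or fact[1] == query.predicate)
    ]
```

An empty result means only that the assertion was not found in the repository, not that it is false.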
FIG. 2 is a flow diagram illustrating another example of a method for identifying factual information in written documents. FIG. 2 is similar to the example illustrated in FIG. 1 except that an additional series of steps 200 is included to create variants of the query built at 120. In the example illustrated in FIG. 2, the variants are built 200 before submitting any queries for comparison to the fact repository. In other examples, the variants may be created after the initial query is submitted and then individually or collectively submitted for comparison with the fact repository. - As illustrated in
FIG. 2, numerous variants of the query may be created by, for example, modifying the noun phrase. The variants may assist in increasing the chances of finding a match for a particular named entity and noun phrase. In an example, one way a query may be modified is to remove determiners and/or pre-modifiers 210. For example, if the noun phrase was “a very beautiful photograph,” the modified phrase may be “beautiful photograph.” - In another example, the noun phrase can be modified to create a query variant that comprises a sequence of nouns ending with the
head noun 220. For example, using the same example above, the noun phrase may be modified to “photograph.” - In another example, the noun phrase can be modified to create a query variant that comprises only the word from the noun phrase that has the lowest frequency of occurrence. For example, capitalized words may be given the lowest frequency, so that if the noun phrase contains any capitalized word, the variant might contain the leftmost capitalized word (e.g., the first capitalized word); if an out-of-vocabulary word is present in the noun phrase, the variant might contain the out-of-vocabulary word. Accordingly, in an example, if the noun phrase contained a name, the name may be split such that only the first name is taken in the variant. For example, in the noun phrase “that author Orhan Phamuk” the variant noun phrase may be “Orhan.” If no capitalized word exists, the variant may simply select the rarest word from within the phrase. For example, if the noun phrase was “category 3 hurricane” the variant noun phrase may be “hurricane.”
- In another example, the noun phrase can be modified to create a query variant that comprises only the rightmost capitalized word, if the noun phrase includes capitalized parts. For example, if the noun phrase was “the actress Nicole Kidman” the variant noun phrase would be “Kidman.” This variant may serve to select last names as a potential complement to the variant discussed above which potentially selects only first names.
- Although each of the four examples of variants is shown in
FIG. 2 serially, in an example, only one, only two, or only three of the variants may be included. -
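The four noun-phrase variants of FIG. 2 can be sketched as below. The sketch assumes the noun phrase arrives as (word, part-of-speech tag) pairs with Penn Treebank style tags, that the head noun is the last token, and that word frequencies come from a lookup table; the tag set `DROP_TAGS`, the frequency counts, and all function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Tagged = List[Tuple[str, str]]  # (word, POS tag), assumed non-empty

# Assumption: determiners (DT) and adverbial pre-modifiers (RB) are dropped.
DROP_TAGS = {"DT", "RB"}


def strip_determiners_and_premodifiers(np: Tagged) -> str:
    return " ".join(word for word, tag in np if tag not in DROP_TAGS)


def noun_sequence(np: Tagged) -> str:
    """Keep the trailing run of nouns, assuming the head noun is the last token."""
    nouns: List[str] = []
    for word, tag in reversed(np):
        if not tag.startswith("NN"):
            break
        nouns.append(word)
    return " ".join(reversed(nouns))


def rarest_word(np: Tagged, freq: Dict[str, int]) -> str:
    """Leftmost capitalized word if any, else the lowest-frequency word."""
    capitalized = [word for word, _ in np if word[0].isupper()]
    if capitalized:
        return capitalized[0]
    # Out-of-vocabulary words default to frequency 0, i.e., rarest.
    return min((word for word, _ in np), key=lambda w: freq.get(w.lower(), 0))


def rightmost_capitalized(np: Tagged) -> Optional[str]:
    capitalized = [word for word, _ in np if word[0].isupper()]
    return capitalized[-1] if capitalized else None
```

Each variant would then replace the noun phrase in the query before the query is submitted to the fact repository.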
FIG. 3 is a flow diagram illustrating another example of a method for identifying factual information in written documents. FIG. 3 is similar to the example illustrated in FIG. 1 except that an additional series of steps 320-370 is included to filter out potentially undesirable matches from those returned by the comparison with the fact repository 135. The filters illustrated in FIG. 3 may also be combined, for example, with the variants illustrated in FIG. 2. In the example illustrated in FIG. 3, the filtering is performed after each match is returned but could also be performed after some or all of the matches are returned. Filters such as those shown in FIG. 3 may be desirable in examples where matches are returned based on predetermined criteria that potentially yield some undesirable matches. - Matches may be filtered if the fact (e.g., named entity and/or noun phrase) in the fact repository comprises modal or hedged predicates 310. For example, matches based on predicates such as “might turn out to be” or “possibly attended” may be filtered out. Similarly, matches based on future tense predicates may be filtered out as well.
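A modal or hedged predicate filter of this kind might be approximated with a cue-word list. The sketch below is a crude illustration under stated assumptions: the cue list `HEDGE_CUES` is hypothetical and incomplete, and treating "will" and "shall" as markers of future tense is a simplification.

```python
import re

# Hypothetical cue words; a real system would likely use a richer lexicon.
HEDGE_CUES = {
    "might", "may", "could", "possibly", "perhaps", "probably",
    "allegedly", "reportedly",
    "will", "shall",  # crude stand-ins for future tense
}


def is_modal_or_hedged(predicate: str) -> bool:
    """Return True if the predicate contains any modal, hedge, or future cue."""
    tokens = re.findall(r"[a-z']+", predicate.lower())
    return any(token in HEDGE_CUES for token in tokens)
```

A match whose repository predicate triggers this check would be dropped before the assertion is counted as factual.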
- Matches may be filtered if the fact in the fact repository is more specific than the one in the
query 320. For example, the match may be filtered if any of the following conditions is met. The match may be filtered if a capitalized word follows the fact in the fact repository but is not present in the sentence (or neighboring sentence) from which the query was identified 330. The match may be filtered if more than one capitalized or rare word precedes the fact in the fact repository, those words are not present in the sentence (or neighboring sentence) from which the query was identified, and those words are not honorifics 340. The match may be filtered if the fact in the fact repository is longer than eight words 350. The match may be filtered if more than three words follow the fact in the fact repository 360. - Matches may also be filtered if the ratio of negative to positive predicates among a plurality of matches is greater than a
predetermined threshold 370. For example, a query such as <Barack Obama, ?, US citizen> may be filtered out based on the following pattern of matches: -
Count | Predicate |
---|---|
10 | is not |
4 | is |
2 | was always |
1 | is really |
1 | isn't |
1 | was not |
- Additionally, matches may be filtered if the matches themselves reflect a lack of consensus and/or an argumentative statement.
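The negative-to-positive ratio can be worked through on the counts in the table above. The sketch below assumes that a predicate is negative when it contains "not", "never", or a contraction ending in "n't", and it uses a hypothetical threshold of 1.0; neither the negation test nor the threshold value is specified in the disclosure.

```python
from typing import Dict

# Predicate counts copied from the table above.
EXAMPLE_COUNTS: Dict[str, int] = {
    "is not": 10, "is": 4, "was always": 2,
    "is really": 1, "isn't": 1, "was not": 1,
}


def is_negative(predicate: str) -> bool:
    """Assumed negation test: 'not', 'never', or a contraction ending in n't."""
    tokens = predicate.lower().replace("\u2019", "'").split()
    return any(t in {"not", "never"} or t.endswith("n't") for t in tokens)


def negative_to_positive_ratio(counts: Dict[str, int]) -> float:
    negative = sum(c for p, c in counts.items() if is_negative(p))
    positive = sum(c for p, c in counts.items() if not is_negative(p))
    if negative == 0:
        return 0.0
    return negative / positive if positive else float("inf")


def should_filter(counts: Dict[str, int], threshold: float = 1.0) -> bool:
    """Filter the query when the ratio exceeds the (hypothetical) threshold."""
    return negative_to_positive_ratio(counts) > threshold
```

With the table's counts, the negative predicates total 12 and the positive ones total 7, so the ratio is 12/7 and the query would be filtered under the assumed threshold.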
- Although each of the examples of filters is shown in
FIG. 3 serially, in an example, only one or only two of the filters may be included. Additionally, the filters may be configured such that more than one filter needs to be satisfied before a match is filtered out. For example, the filters may be configured such that a match is not filtered out unless the noun phrase comprises a modal or hedged predicate and the fact in the fact repository is more specific than the one in the query. - Examples have been used to describe the invention herein and the scope of the invention may include other examples.
FIGS. 4A, 4B, and 4C depict example systems for use in implementing recognition of phrasal terms. For example, FIG. 4A illustrates an exemplary system 400 that includes a standalone computer architecture where a processing system 402 (e.g., one or more computer processors located in a given computer or in multiple computers that may be separate and distinct from one another) includes a fact identification engine 404 being executed on it. The processing system 402 has access to at least one computer-readable memory 406 in addition to one or more data stores 408. The one or more data stores 408 may include the queries (and/or written documents) 410 as well as a fact repository 412. -
FIG. 4B depicts a system 420 that includes a client server architecture. One or more user PCs 422 access one or more servers 424 running a part of fact recognition engine 426 on a processing system 427 via one or more networks 428. The one or more servers 424 may access a computer readable memory 430 as well as one or more data stores 432. The one or more data stores 432 may contain queries (and/or written documents) 434 as well as a fact repository 436. -
FIG. 4C shows a block diagram of exemplary hardware for a standalone computer architecture 450, such as the architecture depicted in FIG. 4A that may be used to contain and/or implement the program instructions of system embodiments of the present invention. A bus 452 may serve as the information highway interconnecting the other illustrated components of the hardware. A processing system 454 labeled CPU (central processing unit) (e.g., one or more computer processors at a given computer or at multiple computers), may perform calculations and logic operations required to execute a program. A non-transitory processor-readable storage medium, such as read only memory (ROM) 456 and random access memory (RAM) 458, may be in communication with the processing system 454 and may contain one or more programming instructions for performing the method of implementing a part of speech pattern scoring engine. Optionally, program instructions may be stored on a non-transitory computer readable storage medium such as a magnetic disk, optical disk, recordable memory device, flash memory, or other physical storage medium. - A
disk controller 460 interfaces one or more optional disk drives to the system bus 452. These disk drives may be external or internal floppy disk drives such as 462, external or internal CD-ROM, CD-R, CD-RW or DVD drives such as 464, or external or internal hard drives 466. These various disk drives and disk controllers may be optional devices. - Each of the element managers, real-time data buffer, conveyors, file input processor, database index shared access memory loader, reference data buffer and data managers may include a software application stored in one or more of the disk drives connected to the
disk controller 460, the ROM 456 and/or the RAM 458. The processor 454 may access each component as required. - A
display interface 468 may permit information from the bus 452 to be displayed on a display 470 in audio, graphic, or alphanumeric format. Communication with external devices may optionally occur using various communication ports 472. - In addition to the standard computer-type components, the hardware may also include data input devices, such as a
keyboard 473, or other input device 474, such as a microphone, remote control, pointer, mouse and/or joystick. - Additionally, the methods and systems described herein may be implemented on many different types of processing devices by program code comprising program instructions that are executable by the device processing subsystem. The software program instructions may include source code, object code, machine code, or any other stored data that is operable to cause a processing system to perform the methods and operations described herein. Other implementations may also be used, however, such as firmware or even appropriately designed hardware configured to carry out the methods and systems described herein.
- The systems' and methods' data (e.g., associations, mappings, data input, data output, intermediate data results, final data results, etc.) may be stored and implemented in one or more different types of computer-implemented data stores, such as different types of storage devices and programming constructs (e.g., RAM, ROM, Flash memory, flat files, databases, programming data structures, programming variables, IF-THEN (or similar type) statement constructs, etc.). It is noted that data structures describe formats for use in organizing and storing data in databases, programs, memory, or other computer-readable media for use by a computer program.
- The computer components, software modules, functions, data stores and data structures described herein may be connected directly or indirectly to each other in order to allow the flow of data needed for their operations. It is also noted that a module or processor includes but is not limited to a unit of code that performs a software operation, and can be implemented for example as a subroutine unit of code, or as a software function unit of code, or as an object (as in an object-oriented paradigm), or as an applet, or in a computer script language, or as another type of computer code. The software components and/or functionality may be located on a single computer or distributed across multiple computers depending upon the situation at hand.
- It should be understood that as used in the description herein and throughout the claims that follow, the meaning of “a,” “an,” and “the” includes plural reference unless the context clearly dictates otherwise. Also, as used in the description herein and throughout the claims that follow, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise. Finally, as used in the description herein and throughout the claims that follow, the meanings of “and” and “or” include both the conjunctive and disjunctive and may be used interchangeably unless the context expressly dictates otherwise; the phrase “exclusive or” may be used to indicate a situation where only the disjunctive meaning may apply.
- While this document uses examples to disclose the inventions described herein, it will be obvious to those skilled in the art that patentable scope of the invention may include other examples as well. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.
Claims (28)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/795,126 US20130275461A1 (en) | 2012-04-11 | 2013-03-12 | Computer-Implemented Systems and Methods for Identifying Factual Information in a Written Document |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201261622819P | 2012-04-11 | 2012-04-11 | |
US13/795,126 US20130275461A1 (en) | 2012-04-11 | 2013-03-12 | Computer-Implemented Systems and Methods for Identifying Factual Information in a Written Document |
Publications (1)
Publication Number | Publication Date |
---|---|
US20130275461A1 true US20130275461A1 (en) | 2013-10-17 |
Family
ID=49326047
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/795,126 Abandoned US20130275461A1 (en) | 2012-04-11 | 2013-03-12 | Computer-Implemented Systems and Methods for Identifying Factual Information in a Written Document |
Country Status (1)
Country | Link |
---|---|
US (1) | US20130275461A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10769962B1 (en) * | 2015-05-04 | 2020-09-08 | Educational Testing Service | Systems and methods for generating a personalization score for a constructed response |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040111408A1 (en) * | 2001-01-18 | 2004-06-10 | Science Applications International Corporation | Method and system of ranking and clustering for document indexing and retrieval |
US6996800B2 (en) * | 2000-12-04 | 2006-02-07 | International Business Machines Corporation | MVC (model-view-controller) based multi-modal authoring tool and development environment |
US20060149739A1 (en) * | 2004-05-28 | 2006-07-06 | Metadata, Llc | Data security in a semantic data model |
US20070055656A1 (en) * | 2005-08-01 | 2007-03-08 | Semscript Ltd. | Knowledge repository |
US20070230787A1 (en) * | 2006-04-03 | 2007-10-04 | Oce-Technologies B.V. | Method for automated processing of hard copy text documents |
US20070238084A1 (en) * | 2006-04-06 | 2007-10-11 | Vantage Technologies Knowledge Assessment, L.L.Ci | Selective writing assessment with tutoring |
US20080005090A1 (en) * | 2004-03-31 | 2008-01-03 | Khan Omar H | Systems and methods for identifying a named entity |
US7937265B1 (en) * | 2005-09-27 | 2011-05-03 | Google Inc. | Paraphrase acquisition |
EP2605150A1 (en) * | 2011-12-16 | 2013-06-19 | Presans | Method for identifying the named entity that corresponds to an owner of a web page |
US20130262086A1 (en) * | 2012-03-27 | 2013-10-03 | Accenture Global Services Limited | Generation of a semantic model from textual listings |
US20150193413A1 (en) * | 2012-02-22 | 2015-07-09 | Google Inc. | Correction of quotations copied from electronic documents |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: EDUCATIONAL TESTING SERVICE, NEW JERSEY Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:BEIGMAN KLEBANOV, BEATA;HIGGINS, DERRICK;REEL/FRAME:030175/0271 Effective date: 20130314 |
|
AS | Assignment |
Owner name: EDUCATIONAL TESTING SERVICE, NEW JERSEY Free format text: CORRECTIVE ASSIGNMENT TO CORRECT THE STATE OF INCORPORATION INSIDE ASSIGNMENT DOCUMENT PREVIOUSLY RECORDED AT REEL: 030175 FRAME: 0271. ASSIGNOR(S) HEREBY CONFIRMS THE ASSIGNMENT;ASSIGNORS:BEIGMAN KLEBANOV, BEATA;HIGGINS, DERRICK;REEL/FRAME:035717/0129 Effective date: 20130314 |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: FINAL REJECTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: FINAL REJECTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
STCB | Information on status: application discontinuation |
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |