WO2000026839A1 - Advanced model for automatic extraction of skill and knowledge information from an electronic document - Google Patents

Info

Publication number
WO2000026839A1
WO2000026839A1 PCT/US1999/026083 US9926083W WO0026839A1 WO 2000026839 A1 WO2000026839 A1 WO 2000026839A1 US 9926083 W US9926083 W US 9926083W WO 0026839 A1 WO0026839 A1 WO 0026839A1
Authority
WO
WIPO (PCT)
Prior art keywords
skill
electronic document
information
knowledge
knowledge information
Prior art date
Application number
PCT/US1999/026083
Other languages
French (fr)
Other versions
WO2000026839A9 (en
WO2000026839A8 (en
Inventor
Prabhat K. Andleigh
Nagaraju Pappu
Vasudeva V. Kalindindi
Original Assignee
Infodream Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from PCT/US1998/027664 external-priority patent/WO1999034307A1/en
Application filed by Infodream Corporation filed Critical Infodream Corporation
Publication of WO2000026839A1 publication Critical patent/WO2000026839A1/en
Publication of WO2000026839A8 publication Critical patent/WO2000026839A8/en
Priority to GB0113250A priority Critical patent/GB2359168A/en
Publication of WO2000026839A9 publication Critical patent/WO2000026839A9/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q99/00Subject matter not provided for in other groups of this subclass

Abstract

An apparatus, method, and computer readable medium for analyzing and extracting skill and knowledge information from an electronic document (104) and for storing the extracted skill and knowledge information into predefined fields or tables in a target database (110) comprises a content analysis and semantic network engine (216) for analyzing and extracting skill and knowledge information from the electronic document (104). A skill and knowledge information extractor (702) is coupled to the content analysis and semantic network engine (216), for determining a skill level for the skill information extracted from the electronic document (104). In a preferred embodiment, the skill and knowledge section processor (702) uses a non-monotonic reasoning principle to determine a skill level for skill information extracted from the electronic document (104). The content analysis and semantic network engine (216) further comprises a thesaurus (221) for linking together terms (402) and skill information (404), and for defining relationships between and among the terms (402) and skill information (404), and a semantic network (220) coupled to the thesaurus (221), for organizing the terms (402) and skill information (404) in the thesaurus (221), along with knowledge information (502) and categories (504), in a hierarchical structure.

Description

ADVANCED MODEL FOR AUTOMATIC EXTRACTION OF SKILL AND KNOWLEDGE INFORMATION FROM AN ELECTRONIC DOCUMENT
RELATED APPLICATION
This application is a continuing application of, and claims
priority from, U.S. patent application Serial No. 09/380,219, filed August 27, 1999,
descending in priority from PCT application PCT/US98/27664, filed on December 28,
1998, and entitled "Xtraction Server" by Prabhat K. Andleigh, Nagaraju Pappu, and
Vasudeva Kalidindi. Said two earlier applications are commonly assigned with the
instant application.
The subject matter of this application is also related to and claims priority from
U.S. Provisional Application Serial No. 60/107,063, filed November 4, 1998, and
entitled "Advanced Model for Automatic Extraction of Content, Skills, and Knowledge
from Resumes" by Prabhat K. Andleigh, Nagaraju Pappu, and Vasudeva Kalidindi,
which application is commonly assigned with the instant application, and is
incorporated herein by reference in its entirety.
TECHNICAL FIELD
This invention relates to the field of computer analysis of electronic documents.
More specifically, it relates to the field of information retrieval to convert and store
information in documents written in a natural language into a predefined structure
which can be retrieved and manipulated by computer program applications.
BACKGROUND OF THE INVENTION
Information to be sorted and stored in a computer database may reside in
numerous electronic documents. For example, information about people and their specific talents and skills may reside in electronic documents, such as resumes,
performance appraisals, design documents, publications, books, patent documents, and
email messages. When an individual is trying to organize and sort out specific
information from such electronic documents, the individual usually has to open each
document separately and manually analyze, retrieve, and store the relevant data in the
particular database. For example, a project manager who would like to find the best
employee for a specific job may have a specific job description. When searching for
an employee whose skills, knowledge and talent are best suited for the specific job
description, the project manager must sift through several documents which contain the
necessary information. Such a process is time consuming and inefficient, because the
project manager may have to read the documents several times and may have to review
and type the information into a computer database in order to organize the various
pieces of information into a coherent summary.
A computerized system which can analyze and extract pertinent information
from different electronic documents would provide a more efficient solution to this
problem. However, such text documents are often written in unstructured natural
language text for other people to understand. Thus, computer programs such as
database applications cannot efficiently process documents written in natural language
texts. Rather, computer programs can process only information which has been stored
in a highly structured fashion in order to retrieve and manipulate that information.
Additionally, these documents may be prepared in a variety of different file formats, such as Microsoft Word 97, Rich Text Format, PDF, WordPerfect, ASCII files, and
HTML, and may be stored in different areas within a computer.
There are a variety of information retrieval programs such as Internet search
engines that can retrieve documents that match a set of keywords. Their scope is very
limited in the context of the above mentioned problem, because they cannot understand
the text, and certainly they cannot make any connection between the document and the
person who is related to that document. Another problem is that the 'information of
interest' will vary significantly from one organization to another. For example, a health
care organization will be interested in the skills and talents related to the medical field,
but the skills related to computers may not be of significant interest, whereas a
software development organization will be interested in the computer and software
related skills, but may not be interested in medical or first-aid related skills. The
keyword based search engines cannot address this problem of retrieving only the
'information of interest'. As a result, there is a vast amount of information about
people which cannot be easily processed by computer programs.
For example, in today's large corporations and government organizations, it is
not uncommon to receive hundreds of thousands of resumes of potential candidates in
a very short time. Recruiting the right candidates from such a vast pool of applicants is
a very complicated problem. It is crucial for organizations to find the people with the
right knowledge and skill set. In essence, managers have to deal with a vast number of
resumes, try to understand the content within the resumes, and short-list candidates
who have the right skills and knowledge. For example, if an organization wants to recruit a middle level manager with 5 to 8 years of experience to lead a development
project, the organization will need to sort through thousands of resumes and determine
from each one whether that particular candidate has the requisite knowledge and skill
level. It is not possible to find the best resumes using a standard full text search engine
because such search programs search for a particular input string and retrieve only
resumes which contain that particular input string. Such an approach is not that useful,
because a particular skill may be written using many different terms (e.g. Microsoft
Word, MS Word, Word 97, etc.) even though the terms all refer to the same or
similar ideas. Moreover, in addition to not being able to correctly identify a
candidate's skills, a typical search program cannot identify the type of experience with
that skill, the duration of that experience, or the overall knowledge gained by the
candidate in a specific skill group. Additionally, it is also very desirable to have a
system for determining not only the knowledge and skills of a candidate but also the
proficiency level of a candidate in a particular skill.
Therefore, what is needed is a system for analyzing and extracting information
from an electronic document and for storing the extracted information in a database.
Additionally, what is needed is a system for analyzing and extracting skill and
knowledge information from an electronic document and for determining a skill level
for skill information and for mapping such skill level information to a qualitative scale.
DISCLOSURE OF INVENTION
The present invention is an apparatus, method, and computer-readable medium
for analyzing and extracting skill and knowledge information from an electronic document (104) and for storing the extracted skill and knowledge information into
predefined fields or tables in a target database (110). The system for analyzing and
extracting skill and knowledge information from an electronic document (104)
comprises a content analysis and semantic network engine (216) for analyzing and
extracting skill and knowledge information from the electronic document (104), and a
skill and knowledge information extractor (702) coupled to the content analysis and
semantic network engine (216), for determining a skill level for the skill information
extracted from the electronic document (104). In a preferred embodiment, the skill and
knowledge section processor (702) uses a non-monotonic reasoning principle to
determine a skill level for skill information extracted from the electronic document
(104). The content analysis and semantic network engine (216) further comprises a
thesaurus (221) for linking together terms (402) and skill information (404) and for
defining relationships between and among the terms (402) and skill information (404),
and a semantic network (220) coupled to the thesaurus (221), for organizing the terms
(402) and skill information (404) in the thesaurus (221), knowledge information (502),
and categories (504) in a hierarchical structure.
A method for extracting skill and knowledge information from an electronic
document (104) comprises the steps of: identifying skill and knowledge information in
the electronic document (802); determining a skill level for skill information from the
electronic document (804); and mapping the skill level to a qualitative scale (806).
The method further comprises the step of storing the skill information and qualitative
skill level scale mapping in the target database (808).
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 is a block diagram of a preferred embodiment of a system 100 in
accordance with the present invention.
Figure 2 is a block diagram of a preferred embodiment of an extraction server
108 in accordance with the present invention.
Figure 3 is a flow chart of a preferred embodiment of the steps performed by
the document pre-processor 210.
Figure 4 is a block diagram of a preferred embodiment of a thesaurus 221.
Figure 5 is a block diagram of a preferred embodiment of a semantic network
220.
Figure 6 is a flow chart of a preferred embodiment of the steps performed by
the extraction server 108.
Figure 7 is a block diagram of a preferred embodiment of a system 700 in
accordance with the present invention.
Figure 8 is a flow chart of a preferred embodiment of the steps performed by
the skill and knowledge information extractor 702.
Figure 9 is a screen shot of a user interface of a preferred embodiment of a
target database 110 display for skill information.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
Referring now to Figure 1, a system 100 upon which a preferred embodiment
of the present invention operates is shown. A host computer 102, using the method
and system described herein, operates upon an electronic document 104, derived from a text document which contains unstructured text. As used herein, "unstructured text"
refers to any document which has been written in a natural language such as English.
Examples of documents containing unstructured text include, but are not limited to, a
resume, performance appraisals, design documents, publications, books, patent
documents, and email messages. In a preferred embodiment, the host computer 102 is
a conventional computer having a keyboard and mouse for input (not shown), and a
conventional memory 106 associated with host computer 102 for storing the electronic
document 104. The electronic document 104 may be prepared in any electronic file
format, such as Microsoft Word 97, Rich Text Format, PDF, WordPerfect, ASCII files,
and HTML.
The electronic document 104 is processed by host computer 102 using the
present invention. Specifically, host computer 102 uses extraction server 108 to
analyze, retrieve and store words and word groups from the electronic document 104
into a predefined structure in target database 110. As used herein, the terms "words"
and "word groups" are used to mean any text that may be derived from document 104
including, but not limited to, individual words or numbers, phrases, whole sentences,
and blocks of text. The extraction server 108 identifies the document type of the
document 104 and determines which words and word groups are to be extracted from
the document 104. The structure and operation of the extraction server 108 is
described in more detail below with reference to Figures 2 through 6.
The target database 110 comprises predefined tables with predefined columns
for storing the words and word groups extracted from the electronic document 104. In a preferred embodiment, a predefined table and predefined columns correspond to a
particular document type. For example, if document 104 is a resume, then a predefined
table for a document type called "resume" may have predefined columns such as
"name and address", "education", and "skills and experience". As another example, if
document 104 is a patent document, then a predefined table for a document type called
"patent document" may have predefined columns such as "inventors", "company",
"patent number", and "field of search". The predefined tables and columns in target
database 110 are organized ahead of time, and one skilled in the art will realize that the
present invention is not limited to a particular document type or a predefined table, but
that many different compilations of predefined tables and columns may be stored in
target database 110 within the scope of this invention. The words and word groups
stored in the target database 110 can be stored in electronic form on any type of
computer data storage device or they may be printed out in a hard-copy printed format.
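The predefined tables described above can be illustrated with a minimal sketch. It assumes SQLite as the storage engine, which the text does not specify. The table and column identifiers are hypothetical names derived from the "resume" and "patent document" examples, and the helper functions `create_target_database` and `store_extracted` are illustrative only.

```python
import sqlite3

# Hypothetical mapping from document type to its predefined columns,
# based on the example column names given in the text.
PREDEFINED_TABLES = {
    "resume": ["name_and_address", "education", "skills_and_experience"],
    "patent_document": ["inventors", "company", "patent_number", "field_of_search"],
}


def create_target_database(conn):
    """Create one predefined table per document type, ahead of extraction."""
    for table, columns in PREDEFINED_TABLES.items():
        cols = ", ".join(f"{c} TEXT" for c in columns)
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {table} (id INTEGER PRIMARY KEY, {cols})"
        )


def store_extracted(conn, doc_type, word_groups):
    """Store extracted word groups in the table for doc_type.

    Columns with no extracted value are stored as NULL.
    """
    columns = PREDEFINED_TABLES[doc_type]
    values = [word_groups.get(c) for c in columns]
    placeholders = ", ".join("?" for _ in columns)
    cur = conn.execute(
        f"INSERT INTO {doc_type} ({', '.join(columns)}) VALUES ({placeholders})",
        values,
    )
    return cur.lastrowid
```

Keeping the schema in a single mapping means a new document type can be supported by adding one entry, which is in keeping with the statement that the invention is not limited to a particular document type.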
The process of extraction performed by the extraction server 108 preferably
uses a non-monotonic reasoning principle. As used herein, a "non-monotonic
reasoning principle" refers to a process whereby at every stage during extraction, the
extraction server 108 assumes a reasonable default value. That default value is
modified as further information becomes available. For example, a string '1987' is
first assumed to be a number, and if further information qualifying the string as a
date is available (for example, in this case, that the string is preceded by another string
'Jan'), then the assumption is changed. If again further information becomes available
to negate the previous assumption, the assumption is changed again.
Thus, the present invention advantageously allows a user to extract skill and
knowledge information from an electronic document directly into a database. More
specifically, the present invention analyzes an electronic copy of a text document and
extracts words and word groups relating to skill and knowledge information into a
target database comprising predefined tables and columns associated with a particular
document type. Moreover, the present invention operates upon electronic documents
in any electronic file format. The extracted skill and knowledge information stored in
the target database can then be retrieved and manipulated by other computer program
applications.
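The '1987' example given for the non-monotonic reasoning principle can be sketched as follows. This is only an illustration of revising a default assumption as evidence arrives, not the extraction server's actual rule set. The function `interpret`, the month list, and in particular the `COUNT_NOUNS` rule used as negating evidence are hypothetical.

```python
MONTHS = {"jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"}

# Hypothetical negating evidence: a following counted noun suggests
# the digits are a quantity after all, not a year.
COUNT_NOUNS = {"units", "employees", "users"}


def interpret(tokens, i):
    """Return the sequence of assumptions made about tokens[i].

    The last element is the current assumption; earlier elements are
    defaults that were later revised.
    """
    token = tokens[i]
    if not token.isdigit():
        return ["text"]

    history = ["number"]  # reasonable default for a digit string

    # Further information: a preceding month name qualifies it as a date.
    prev = tokens[i - 1].lower().rstrip(".") if i > 0 else ""
    if len(token) == 4 and prev in MONTHS:
        history.append("date")

    # Still further information may negate the revised assumption.
    nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
    if history[-1] == "date" and nxt in COUNT_NOUNS:
        history.append("number")

    return history
```

For the token list `["Jan", "1987"]`, calling `interpret(tokens, 1)` yields the default followed by its revision, so the history records both the original assumption and the one that replaced it.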
Referring now to Figure 2, a block diagram of a preferred embodiment of the
extraction server 108 is shown. The electronic document 104 may be any electronic
file stored in memory 106 which is accessible by the extraction server 108. For
example, the electronic document 104 may be an electronic form of a hard copy of a
document converted using a conventional optical scanner and Optical Character
Recognition (OCR) software 202, a Microsoft Word file 204, an ASCII text file 206 or
an email attachment 208. The database applications which manipulate the extracted
information in target database 110 are also preferably stored in memory 106. In a
preferred embodiment, the extraction server 108 comprises a document pre-processor
210 coupled to the memory 106 where the electronic document 104 is stored, a
heuristics engine 212 coupled to the document pre-processor 210, a morphological
analysis engine 214 coupled to the heuristics engine 212, a content analysis and
semantic network engine 216 coupled to the document pre-processor 210, and a database interface 222 coupled to the content analysis and semantic network engine
216 and to the target database 110. The content analysis and semantic network engine
216 preferably comprises section processors 218 and a semantic network 220.
The document pre-processor 210 retrieves the electronic document 104 from
memory 106 and performs the initial analysis of the electronic document 104.
Referring now to Figure 3, a flowchart of the steps of a preferred operation of the
document pre-processor 210 is shown. The document pre-processor 210 performs the
initial analysis and extraction of the electronic document 104 by first converting (302)
the electronic document 104 from its native file format into ASCII text. More
specifically, the document pre-processor 210 identifies the file format of the electronic
document 104 and extracts the ASCII text out of the document 104. For example, if the
electronic document 104 is a Microsoft Word file, then the document pre-processor
210 identifies the file by the Microsoft Word signature and uses the Microsoft Object
Linking and Embedding Software Development Kit (Microsoft OLE 2.0 SDK) to
extract text from the Microsoft Word File.
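By way of illustration only, identification of a file format by its signature may be sketched as follows. The function name identify_format and the returned labels are hypothetical. The eight-byte value shown is the header signature of the OLE compound file container used by legacy Microsoft Word files; the actual text extraction through the OLE SDK is not shown.

```python
# Header signature of an OLE compound file (the container of legacy .doc files).
OLE2_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def identify_format(data: bytes) -> str:
    """Return a coarse, hypothetical format label for the raw file bytes."""
    if data.startswith(OLE2_SIGNATURE):
        return "ole2"
    try:
        data.decode("ascii")      # plain ASCII text decodes without error
    except UnicodeDecodeError:
        return "unknown"
    return "ascii"
```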
Next, the document pre-processor 210 filters out (304) any unnecessary and
unwanted information such as, but not limited to, email headers, OCR headers, blank
pages, and unwanted characters. Preferably, any information that is not part of the
original document is treated as unnecessary information. For example, email headers,
non-ASCII characters at the beginning or at the end of the file, extra blank lines and
blank spaces are removed from the text. Additionally, if the text contains vertical
tables, these tables are preferably converted into horizontal tables. If the text contains multiple columns, it is preferably converted into a single column. The document
pre-processor 210 then stores (306) formatting information for the document 104 such as,
but not limited to, the fonts used, font sizes, section titles, and subsections.
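By way of illustration only, the filtering step (304) may be sketched as follows. The header field names, the function name filter_text, and the policy of keeping at most one consecutive blank line are assumptions. As a simplification, the sketch drops non-ASCII characters everywhere rather than only at the beginning or end of the file, and it does not handle tables or multiple columns.

```python
import re

# Assumed set of email header fields that may precede the document text.
HEADER_RE = re.compile(r"^(From|To|Subject|Date|Cc):", re.IGNORECASE)


def filter_text(text: str) -> str:
    lines = text.splitlines()
    # Drop the leading block of email header lines and blank lines.
    while lines and (HEADER_RE.match(lines[0]) or not lines[0].strip()):
        lines.pop(0)

    cleaned = []
    for line in lines:
        line = line.encode("ascii", "ignore").decode("ascii")  # drop non-ASCII
        line = re.sub(r"[ \t]+", " ", line).strip()             # collapse blanks
        # Keep text lines, and a blank line only after a text line.
        if line or (cleaned and cleaned[-1]):
            cleaned.append(line)

    while cleaned and not cleaned[-1]:                          # trailing blanks
        cleaned.pop()
    return "\n".join(cleaned)
```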
The document pre-processor 210 then performs paragraph identification
heuristics (308) on the electronic document 104. During this step, the beginning and
end of each paragraph is identified, and the paragraph characteristics are gathered. As
used herein, the phrase "paragraph characteristics" refers to the statistical properties of
the paragraph. Paragraph characteristics include, but are not limited to, the number of
words in the paragraph, the number of lines in the paragraph, the average number of
words per line, whether any line has a bullet as the starting character, and whether
there are any underlined sentences in the paragraph.
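By way of illustration only, the gathering of paragraph characteristics may be sketched as follows. The rule that a blank line ends a paragraph, the bullet characters, and the names split_paragraphs, characterize and ParagraphStats are assumptions. Underlining is omitted because it cannot be recovered from plain text alone.

```python
from dataclasses import dataclass

BULLETS = ("-", "*", "\u2022")   # assumed bullet characters


@dataclass
class ParagraphStats:
    words: int
    lines: int
    words_per_line: float
    has_bullet: bool


def split_paragraphs(text):
    """Treat each run of non-blank lines as one paragraph (list of lines)."""
    paragraphs, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            paragraphs.append(current)
            current = []
    if current:
        paragraphs.append(current)
    return paragraphs


def characterize(lines):
    """Compute the statistical properties of one non-empty paragraph."""
    words = sum(len(line.split()) for line in lines)
    return ParagraphStats(
        words=words,
        lines=len(lines),
        words_per_line=words / len(lines),
        has_bullet=any(line.lstrip().startswith(BULLETS) for line in lines),
    )
```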
Finally, the document pre-processor 210 performs paragraph grouping
heuristics (310) on the electronic document 104. Once the paragraphs have been
identified, the document pre-processor 210 groups the paragraphs into sections.
During this step, the paragraphs are grouped into sections based on the paragraph
characteristics as well as using any section titles that precede the paragraphs. Starting
at the beginning of the electronic document 104, the first heading or section title is
identified, and the following paragraphs until the next section title are grouped into one
section. If no section titles are found, then using the paragraph characteristics, all the
similar paragraphs are grouped into sections. Additionally, paragraphs that have the same
or similar characteristics are grouped together into sections.
The heuristics engine 212 applies a set of heuristics, that is, a set of rules, to the
electronic document 104 for analyzing information in the electronic document 104.
The set of heuristics which is applied to the electronic document 104 is associated
with a particular document type. For example, if the document type is a "resume",
then the set of heuristics associated with the document type "resume" is applied to the
electronic document 104. Heuristics are described in more detail in commonly
assigned U.S. patent application Serial No. 09/380,219 entitled "Extraction Server" by
Prabhat K. Andleigh, Nagaraju Pappu, and Vasudeva Kalidindi, which is incorporated
herein by reference in its entirety.
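By way of illustration only, the paragraph grouping step (310) of the document pre-processor 210 described above may be sketched as follows. The title test (a single short line that is all upper case or ends with a colon) and the names is_title and group_into_sections are assumptions. The fallback of grouping by similar paragraph characteristics when no titles are found is not shown.

```python
def is_title(paragraph):
    """Assumed title test for a paragraph given as a list of lines."""
    if len(paragraph) != 1:
        return False
    line = paragraph[0].strip()
    return len(line.split()) <= 4 and (line.isupper() or line.endswith(":"))


def group_into_sections(paragraphs):
    """Group paragraphs under the section title that precedes them."""
    sections = []
    current = {"title": None, "paragraphs": []}
    for paragraph in paragraphs:
        if is_title(paragraph):
            # A new title closes the section collected so far, if any.
            if current["title"] is not None or current["paragraphs"]:
                sections.append(current)
            current = {"title": paragraph[0].strip(), "paragraphs": []}
        else:
            current["paragraphs"].append(paragraph)
    if current["title"] is not None or current["paragraphs"]:
        sections.append(current)
    return sections
```

Paragraphs that appear before the first title are kept in a section whose title is None, so that no text is lost.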
The morphological analysis engine 214 is used for target language analysis and
is preferably the LinguistiX 2.0 application programming interface (API) from InXight
Corporation in Palo Alto, CA. The LinguistiX 2.0 API is a language neutral
programming interface. In other words, the LinguistiX API can analyze documents in
any language such as English, French or German. Because the heuristics engine 212
and the LinguistiX API are external to and separate from the document pre-processor
210 and the content analysis and semantic network engine 216, the present invention
can extract information from documents in the English, French or German language,
and any other languages which will be supported by the LinguistiX API in the future.
Preferably, the Heuristics Engine 212 uses the following features provided by
the LinguistiX API: tokenization, lexical analysis, tagging, and noun-phrase extraction.
Before text from the electronic document 104 can be analyzed in terms of its linguistic
roots and function, it must first be segmented into words, punctuation and idiomatic phrases. LinguistiX tokenization includes the ability to recognize multi-word
constructs such as HTML tags. The lexical analysis feature identifies the grammatical
features of a word in addition to its root forms. The tagging feature identifies the
grammatical category of words by their context. The noun-phrase extraction identifies
multi-word phrases in documents. LinguistiX phrase extraction technology enables
software to work with these larger concepts to provide improved information analysis
and retrieval. For example, 'Windows Programming' will be identified as one phrase,
instead of two distinct words 'Windows' and 'Programming'. This feature is used by the
semantic network 220 to identify the multi-word noun phrases.
These features of the LinguistiX API are used to implement the heuristics. For
example, by using the tagging feature, the extraction server 108 may discover that a
particular word is a proper noun. Whether that word is the name of the person or the
name of a company will depend on where the word occurred in a document. For
example, if the word occurs in a contact information section of a document, then it may
be the name of the person, or name of the street, city and so on. If the word occurs in
an experience section of a document, and if it is followed by the name of a city and
state, it may be a company name.
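By way of illustration only, a heuristic that combines a part-of-speech tag with the section in which a word occurs may be sketched as follows. The function name classify_proper_noun, the section labels, and the returned labels are hypothetical; the tagging itself, which the embodiment obtains from the LinguistiX API, is assumed to have already identified the word as a proper noun.

```python
def classify_proper_noun(section, followed_by_city_state=False):
    """Return the candidate readings of a proper noun found in `section`."""
    if section == "contact":
        # In contact information several readings remain possible.
        return ["person", "street", "city"]
    if section == "experience" and followed_by_city_state:
        return ["company"]
    return ["unknown"]
```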
The database interface 222 is a set of APIs that provide a mechanism for
retrieving and storing information to and from the target database 110. This is done in
such a way that the underlying implementation of the target database 110 is hidden
from the application using the database interface. Thus, the extraction server 108 can
work with any industry standard relational database software such as Oracle or Microsoft SQL Server without having to change the software or its implementation.
Additionally, the database interface 222 provides the following mechanisms: a method
to connect to the target database, a method to maintain the connection to the database,
a transaction model to maintain the consistency of the database, and various methods
to retrieve, query, update, insert and delete information from the target database 110.
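By way of illustration only, an interface that hides the underlying database implementation may be sketched as follows. The class and method names are hypothetical, and SQLite (through the Python standard library) is swapped in for the Oracle or Microsoft SQL Server products named above purely so that the sketch is self-contained. Table and column names are assumed to be trusted identifiers supplied by the application, not user input.

```python
import sqlite3
from abc import ABC, abstractmethod


class DatabaseInterface(ABC):
    """What the extraction code sees; no vendor details leak through."""

    @abstractmethod
    def insert(self, table, row): ...

    @abstractmethod
    def query(self, table, where): ...


class SqliteInterface(DatabaseInterface):
    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)      # connection is kept open

    def insert(self, table, row):
        cols = ", ".join(row)
        marks = ", ".join("?" for _ in row)
        with self.conn:                        # commit, or roll back on error
            self.conn.execute(
                f"INSERT INTO {table} ({cols}) VALUES ({marks})",
                tuple(row.values()),
            )

    def query(self, table, where):
        cond = " AND ".join(f"{col} = ?" for col in where) or "1 = 1"
        cur = self.conn.execute(
            f"SELECT * FROM {table} WHERE {cond}", tuple(where.values())
        )
        return cur.fetchall()
```

Under this arrangement, supporting another database product would mean adding another subclass while the calling code remains unchanged.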
The content analysis and semantic network engine 216 analyzes the content of
the electronic document 104, extracts words and word groups from the document 104,
and stores the extracted information in the appropriate tables in the target database 110.
In a preferred embodiment, the content analysis and semantic network engine 216
comprises section processors 218 which extract information from a particular section
of interest, and a semantic network 220. The semantic network 220 uses a thesaurus
221 and a phrase extraction process to identify the meta-concepts and categories in the
electronic document 104 and extracts related words and word groups into the target
database 110. In a preferred embodiment, the present invention may be implemented
to run on a Windows NT Server and Oracle Database.
Referring now to Figure 4, a block diagram of a preferred embodiment of a
thesaurus 221 is shown. The thesaurus 221 is a vocabulary database for the extraction
server 108 and is organized by skills. The thesaurus 221 groups all related terms 402
in a language under a language independent concept 404. As used herein, a "term" 402
refers to all the individual words or word groups that belong to a particular language
along with their alternatives. As used herein, a "concept" or "skill" 404 comprises a
set of terms 402 that are language specific and alternatives to one another. However, the skill 404 itself is language independent. Skills 404 establish synonymous
relationships among all terms 402 in the thesaurus 221 that have the same meaning. In
other words, skills 404 connect all the different names for the same skill 404 that are
known to the thesaurus 221 and specify certain characteristics for each name.
Preferably, each skill 404 has a unique skill identifier (ConceptID). The ConceptID
by itself has no intrinsic meaning. Each term 402 in each language in the thesaurus
221 has a unique term identifier. The same term 402 in different languages, for
example, in English and Spanish, will have a different term identifier for each
language.
To illustrate the relation between terms 402 and skills 404, consider an example
in which term1 402A may consist of 'MS VC++', term2 402B may consist of
'Microsoft Visual C++' and term3 402C may consist of 'MS Visual C++'. All these
terms 402 are linked to the skill 404 'Visual C++'. In other words, if the electronic
document 104 uses any of the words or word groups 'MS VC++', 'Microsoft Visual
C++' or 'MS Visual C++', the thesaurus 221 allows the extraction server 108 to
recognize the words or word groups as being linked to the skill1 404A 'Visual C++'. In
another example, term4, term5 and term6 are respectively 'JDK 1.1', 'Symantec Cafe',
and 'JDBC', and all these terms 402 are linked to the skill2 404B called 'Java'. Thus, if
the electronic document 104 uses any of the words or word groups 'JDK 1.1',
'Symantec Cafe', and 'JDBC', the thesaurus 221 allows the extraction server 108 to
recognize the word or word group as being linked to the skill2 404B 'Java'.
The thesaurus 221 may also comprise other information such as the attributes
of a skill 404 or attributes of a term 402. Attributes provide additional information that
helps to define the meaning of a skill 404 and explain how it may be used in a
document. In other words, the different senses of a particular word or word groups are
captured using the attributes.
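By way of illustration only, the linking of language-specific terms 402 to a language-independent skill 404 may be sketched as follows, using the 'Visual C++' and 'Java' examples given above. The identifier values, the Term record layout, and the names build_index and lookup_skill are hypothetical; attributes of skills and terms are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    term_id: int      # unique per term and per language
    language: str
    text: str
    concept_id: int   # ConceptID of the skill the term belongs to


# ConceptID -> skill name; the numeric values carry no meaning of their own.
SKILLS = {1: "Visual C++", 2: "Java"}

TERMS = [
    Term(101, "en", "MS VC++", 1),
    Term(102, "en", "Microsoft Visual C++", 1),
    Term(103, "en", "MS Visual C++", 1),
    Term(104, "en", "JDK 1.1", 2),
    Term(105, "en", "Symantec Cafe", 2),
    Term(106, "en", "JDBC", 2),
]


def build_index(terms):
    """Index terms by (language, lower-cased text) for fast lookup."""
    return {(t.language, t.text.lower()): t.concept_id for t in terms}


def lookup_skill(index, text, language="en"):
    """Return the skill name for a term, or None if the term is unknown."""
    return SKILLS.get(index.get((language, text.lower())))
```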
In addition to the relationship between a skill 404 and a set of terms 402, the
thesaurus 221 also comprises relationships among skills 404. Preferably, these
relationships are non-subsumption relationships. As used herein, the term "non-
subsumption" refers to relationships that include related skills, co-occurring skills
and/or associated expressions. In other words, non-subsumption refers to relationships
that are not based on subsumption. For example, C++ and Java are related, but neither
subsumes the other. All these relationships among skills 404 indicate that the skills
404 linked together are not exactly similar but are associated with each other in
different ways. One skilled in the art will realize that the terms and skills of the
thesaurus 221 are not limited to the examples given herein but may contain any
number of terms and skills which have been predefined and stored in the thesaurus 221
prior to the processing of the electronic document 104. Thus, the thesaurus
advantageously allows the present invention to link together terms and skills used in
specific industries, disciplines, and technologies for which the thesaurus is being used,
and preserves the meanings and hierarchical connections between those terms and
skills. Additionally, the thesaurus facilitates the access to concept relationships and to
term and skill attributes irrespective of the term used as a point of entry.
Referring now to Figure 5, a block diagram of a preferred embodiment of a
semantic network 220 is shown. The semantic network 220 provides a way of
arranging all the skills 404 at the lowest level and then builds a taxonomy or network
of higher level knowledge-concepts and categories. The semantic network 220
comprises skills 404 at the lowest level, "knowledge" or knowledge-concepts 502 at a
second level, and categories 504 at the highest level. The semantic network 220
together with the thesaurus 221 provides a four level hierarchy of terms 402, skills 404,
knowledge-concepts 502 and categories 504.
A category 504 is the highest level in the semantic network 220. Broad
categories 504 may be created according to a specific industry which fully subsume
other knowledge-concepts 502 and skills 404. The semantic network 220 categorizes
all knowledge-concepts 502 into categories 504. Knowledge-concepts 502 comprise
the next level in the semantic network 220 hierarchy. Each knowledge-concept 502 is
a collection of skills 404 that add to the body of knowledge. The semantic network
220 categorizes all skills 404 into knowledge-concepts 502. As described earlier with
reference to Figure 4, skills 404 are generic and language independent from all related
terms 402. The semantic network 220 categorizes all terms 402 into skills 404. As
described earlier with reference to Figure 4, terms 402 comprise language dependent
strings that are found in the electronic document 104. Terms 402 comprise the lowest
level in the semantic network 220 hierarchy.
The entire semantic network 220, separate from the thesaurus 221, comprises
language independent knowledge that is arranged as a taxonomy. Preferably, the relationships between skills 404 and knowledge-concepts 502 as well as the
relationships between knowledge-concepts 502 and categories 504 are many to many.
In other words, a single knowledge-concept 502 can comprise several skills 404 and a
single skill 404 can be linked to several knowledge-concepts 502. Similarly, several
knowledge-concepts 502 may comprise a category 504 and several categories may
have links to a single knowledge-concept 502.
To illustrate the terms 402, skills 404, knowledge-concepts 502, and categories
504 of a semantic network 220, the two concepts discussed earlier with reference to
Figure 4, namely 'Visual C++' and 'Java', will be used. Both these skills 404 may be
grouped under a knowledge-concept 502 'Object Oriented programming languages'.
Additionally, the skill 404 'Visual C++' may also belong to the knowledge-concept 502
'Visual Programming Environment'. The knowledge-concept 502 'Visual
Programming Environment' may also be linked to other skills 404 such as 'Visual
Basic'.
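By way of illustration only, the many-to-many relationship between skills 404 and knowledge-concepts 502 in the example above may be sketched as a list of links indexed in both directions. The names LINKS and index_links are hypothetical.

```python
from collections import defaultdict

# (skill, knowledge-concept) pairs taken from the example above.
LINKS = [
    ("Visual C++", "Object Oriented programming languages"),
    ("Java", "Object Oriented programming languages"),
    ("Visual C++", "Visual Programming Environment"),
    ("Visual Basic", "Visual Programming Environment"),
]


def index_links(links):
    """Return (skill -> concepts, concept -> skills) lookup tables."""
    concepts_of, skills_of = defaultdict(set), defaultdict(set)
    for skill, concept in links:
        concepts_of[skill].add(concept)
        skills_of[concept].add(skill)
    return concepts_of, skills_of
```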
The semantic network 220 uses subsumption as the basis for the hierarchical
organization of skills 404, knowledge-concepts 502, and categories 504. In other
words, the relationship between skills 404 and knowledge-concepts 502 and
knowledge-concepts 502 and categories 504 in the semantic network 220 are based on
conceptual subsumption, where a more general object 'subsumes' a more specific
object. The concept of subsumption is more general than the concept of synonymy.
An object is subsumed by another object if the subsuming object is much more general
than any other subsumed objects and effectively summarizes the subsumed objects. Truly synonymous objects mutually subsume each other. If only synonymous based
relationships are allowed, then the granularity between the objects cannot be captured
effectively as there are not many truly synonymous objects. The difference between the
shades of meaning will not allow correct retrieval in a synonym-based network. The
subsumption-based network removes these drawbacks and aids in retrieving related
concepts more accurately, since a subsumption is more general compared to a
synonym. For example, the object 'JDBC' is subsumed by a more general object called
'Java Programming Language' (a knowledge-concept 502), which is further subsumed
by an even more generic object 'Software Engineering' (a category 504).
An object may also be subsumed by more than one higher level object. For
example, the skill 404 'JDBC' may be subsumed by at least two knowledge-concepts
502 such as 'Java Programming Language' and 'Database Connectivity Library'. Each
of these knowledge-concepts 502 may in turn be subsumed by several categories 504.
Hence, the conceptual subsumption also allows many-to-many relationships between
skills 404 and knowledge-concepts 502 and between knowledge-concepts 502 and
categories 504.
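By way of illustration only, retrieval of every object that subsumes a given object may be sketched as a traversal of the subsumption links. The names SUBSUMED_BY and ancestors are hypothetical, and the link from 'Database Connectivity Library' to 'Software Engineering' is an assumption added so that both knowledge-concepts have a category.

```python
# object -> list of more general objects that subsume it
SUBSUMED_BY = {
    "JDBC": ["Java Programming Language", "Database Connectivity Library"],
    "Java Programming Language": ["Software Engineering"],
    "Database Connectivity Library": ["Software Engineering"],  # assumed link
}


def ancestors(node, parents=SUBSUMED_BY):
    """Return every object that directly or indirectly subsumes `node`."""
    seen, stack = set(), [node]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:      # an object may be reached by two paths
                seen.add(parent)
                stack.append(parent)
    return seen
```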
Referring now to Figure 6, a flowchart of the steps of a preferred embodiment
of a method performed by the content analysis and semantic network engine 216 is
shown. First, identification heuristics are performed (602) on the electronic document
104 to identify the beginning and end of the known sections of interest. The sections of
interest are configured by the user when the extraction server 108 is first installed. The
sections are then analyzed (604) and information is extracted from the sections. The extracted information is stored (606) in a predefined structure in the target database
110. Using the semantic network 220, words and word groups are analyzed (608) and
the relationships between the different words and word groups are determined and
stored in the target database 110. Thus, the present invention advantageously extracts
meaningful information from electronic documents, and stores it in a predefined
structure in a target database. The extracted information stored in the target database
can then be retrieved and manipulated by computer program applications accessing the
database. Moreover, the present invention provides a powerful semantic network and
thesaurus for defining terms, concepts, meta-concepts, and categories and the
relationship between and among such terms, concepts, meta-concepts, and categories.
Thus, the semantic network can store information relating to any field, industry or
technology, and allows the extraction server 108 to process various types of documents
pertaining to such fields, industries or technologies.
The section processors 218 extract information from sections of interest in an
electronic document 104. The particular sections of interest from which information is
extracted are determined by the document type. The content analysis and semantic
network engine 216 comprises a section processor 218 for extracting words or word
groups from each section of interest in an electronic document.
Section processors 218 are configured to operate on a specific document type,
and the set for a given document type may contain one or several section processors 218. For example, resumes
typically contain several sections such as a cover letter, contact information, an
objective section, an experience section, an education section, a patents section, a publications section, an awards and honors received section, and a courses attended
section. In a preferred embodiment, section processors 218 for a resume document
type may comprise a cover letter section processor for extracting information from a
cover letter, a contact information section processor for extracting contact information
for a candidate, a skills and experience section processor for extracting the skills and
experience of a candidate, an education section processor for extracting educational
information from a candidate, an awards and honors section processor for extracting
any awards and honors received by a candidate, a patents section processor for
extracting information about patents obtained by a candidate, and a publications
section processor for extracting any articles or documents published by a candidate.
Each section processor 218 analyzes a particular section in the electronic document
104 and extracts specific words and word groups from that section into a specific
record in the target database 110. Additionally, as described in more detail in
commonly assigned U. S. Patent Application Serial No. 09/380,219 entitled "Xtraction
Server" by Prabhat K. Andleigh, Nagaraju Pappu, and Vasudeva Kalidindi, each
section processor 218 applies a set of heuristics to the particular section of interest in
order to analyze and extract the desired information.
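By way of illustration only, the selection of section processors by document type may be sketched as a registry of functions. The processor functions, their regular expressions, the section labels, and the name process_sections are hypothetical and far simpler than the heuristics of the referenced application.

```python
import re


def contact_processor(text):
    """Extract an email address, if any, from a contact section."""
    match = re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
    return {"email": match.group(0) if match else None}


def education_processor(text):
    """Extract a few assumed degree abbreviations from an education section."""
    return {"degrees": re.findall(r"\b(?:B\.S\.|M\.S\.|Ph\.D\.)", text)}


# document type -> section name -> section processor
SECTION_PROCESSORS = {
    "resume": {"contact": contact_processor, "education": education_processor},
}


def process_sections(document_type, sections):
    """Run the matching processor on each section of interest."""
    processors = SECTION_PROCESSORS.get(document_type, {})
    return {name: processors[name](text)
            for name, text in sections.items() if name in processors}
```

Sections for which no processor is registered are ignored in this sketch.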
Referring now to Figure 7, there is shown a preferred embodiment of the
present invention comprising a skills and knowledge information extractor 702. The
skills and knowledge information extractor 702 allows the system to automatically
extract from a document, such as a resume, the skills of a candidate, the candidate's
knowledge in a particular area, and to determine the proficiency level of the candidate in any given skill. Thus, the skills and knowledge information extractor 702 allows a
user to automatically determine a "career profile" of a candidate from his or her
resume. As used herein, a "career profile" refers to any qualitative and quantitative
information about a candidate's work history, experience, and proficiency. For
example, such information includes, but is not limited to, how long a candidate worked
in a particular profession, when, where, and at what depth did the candidate gain
experience in a particular skill, what is the candidate's overall knowledge level in a
particular area, how much management experience a candidate has, etc.
As used herein, "terms" refers to the actual word or words which are found in a
resume, "skill" or "skill information" refers to the skills 404 in the thesaurus 221 and
semantic network 220 which relate to those terms, and "knowledge" or "knowledge
information" refers to the knowledge-concepts 502 relating to the skills. For example,
in a resume, a candidate may have used the terms "Microsoft Visual C++" or "MS
VC++". The present invention would identify these terms as belonging to the skill
"C++", which in turn is related to the knowledge "object oriented programming" which
in turn may be related to the category "Software." Thus, although the only terms
actually used in and extracted from the resume were "Microsoft Visual C++" and "MS
VC++", the present invention is able to determine that the candidate has "skill" in C++
and has "knowledge" of object oriented programming even though the words C++ and
object oriented programming were never used in the document.
The skill and knowledge information extractor 702 uses a non-monotonic
reasoning principle to determine a candidate's skill level. As described above, non-monotonic reasoning refers to the use of default assumptions which are made about the
state of unknown factors. These default assumptions may be changed as new
information or evidence becomes available. Additionally, default assumptions may be
changed due to the absence of certain information or evidence. The operation of the
non-monotonic reasoning approach used by the skill and knowledge information
extractor 702 is best illustrated using an example.
During operation, the present invention finds a skill, X, in a candidate's
resume, R. In the absence of any other knowledge, the skill and knowledge
information extractor 702 assumes that the skill level of the candidate for the skill X is
average. As the skill and knowledge information extractor 702 obtains additional
information from the resume R about skill X, the assumption of the skill level for skill
X is refined. Additional knowledge that may be used to refine the skill level includes,
but is not limited to, the section in which the skill X is found. For example, if the skill
X is found in the Objective Section of a resume R, a positive numerical value, or
objective weightage factor W(O), will be added to the skill level. Additionally, a
positive weight for each project in which the skill X is used, represented here by W(Pj),
may be added to the skill level. Preferably, this weightage value is computed for all
projects in resume R. The number of associated skills that are also used, W(K), may
also be added to the skill level. As used herein, associated skills are the skills related
to the main skill; knowing a main skill implies that a person also knows all associated
skills. For example, if one is an expert in the skill "database programming" or
"database administration," this person must be knowledgeable in the associated skill "SQL." Associated skills can be determined using the semantic network 220 and the
thesaurus 221. For a given skill X, all its associated skills (X1, ..., Xn) are linked with X
through the semantic network 220 and thesaurus 221. For example, a thesaurus 221
entry for the skill "database administration" would contain links to the skills "database
server administration," "database user management," and/or "SQL." Also, the number
of years of experience for the skill X, W(Y) may also be added to the skill level.
Moreover, the number of years since the skill X was used may represent a negative
factor, W(LU), which is subtracted from the skill level. Thus, in a preferred
embodiment, a summation of the weights described above gives a specific skill level
for the skill X. A mathematical representation for determining the skill level of a
particular skill is as follows:
SkillLevel(X') = SkillLevel(X) + W(O) + ∑ W(Pj) + W(K) + W(Y) - W(LU)
The weightage functions are computed using the total number of skill levels
that are defined, and the distance from the current skill level to the next skill level.
One skilled in the art will realize that the weightage factors used to adjust the skill
level are not limited to those listed in the above example but can comprise any number
of factors to be determined by the system creator.
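By way of illustration only, the summation given above may be sketched as follows. The function name skill_level and its parameter names are hypothetical, and the numeric weights in the usage example are arbitrary; the manner in which the weightage functions themselves are computed is left to the system creator and is not shown.

```python
def skill_level(base, w_objective=0.0, w_projects=(), w_associated=0.0,
                w_years=0.0, w_last_used=0.0):
    """SkillLevel(X') = SkillLevel(X) + W(O) + sum W(Pj) + W(K) + W(Y) - W(LU)."""
    return (base                 # default (average) level assumed for skill X
            + w_objective        # W(O): skill appears in the Objective Section
            + sum(w_projects)    # sum of W(Pj) over projects using the skill
            + w_associated       # W(K): associated skills that are also used
            + w_years            # W(Y): years of experience
            - w_last_used)       # W(LU): years since the skill was last used
```

For example, starting from an assumed default of 5.0, weights of 0.5 for the objective, 0.25 for each of two projects, 0.5 for associated skills, 1.0 for years of experience and 0.75 for time since last use give a level of 6.75.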
The computation of the skill level of a particular skill for a candidate can also
be demonstrated using an example. Initially, the skill and knowledge information
extractor 702 assumes that a person has an average skill level for a particular skill such as C++. If the candidate's resume states that the candidate took a course in C++, that
fact would add a positive weightage factor to the skill level, thus adjusting the average
skill level to a higher value. If the candidate's resume also states that the candidate has
two years of work experience in C++, that fact would add another positive weightage
factor to the skill level and adjust the average skill level to another higher value. The
values by which the average skill level is adjusted for the C++ course and the two
years of work experience are not necessarily the same but may reflect the value
attributed by the system creator. Each mention of C++ in the resume would thus be
used to adjust the skill level either up or down. Additionally, the use of terms in the
resume which are related in the semantic network 220 and thesaurus 221 to the concept
or skill C++ could also be used to adjust the skill level of the candidate. After all the
relevant terms in the candidate's resume have been extracted and evaluated, the skill
and knowledge information extractor 702 determines a single value for the skill level
for the candidate for the particular skill.
After a final skill value for a particular skill has been determined, the skill and
knowledge information extractor 702 then maps the skill value to a scale for
qualitatively illustrating the proficiency of the candidate in that particular skill. For
example, if a final skill value for a particular candidate has been determined to be the
number 6.8, that number may map to a rating of "good" on a scale of 1 to 10, with 1
being poor and 10 being excellent. Thus, the present invention allows a user to
determine the proficiency of a candidate's skill level for a particular skill and to ascribe
a qualitative value to that proficiency level. One skilled in the art will realize that the qualitative scales used to describe a particular skill value may be any type of scale with
a range of numerical values and/or adjective descriptors. For example, a qualitative
scale may map the final skill value to a scale comprising numbers such as 1 to 5 or 1 to
10. A scale may map the final skill value to a scale comprising numbers and adjectives
such as 1 (poor) to 10 (excellent). The qualitative scale may be determined by the
system creator.
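By way of illustration only, the mapping of a final skill value to a qualitative scale may be sketched as follows. The threshold values, the intermediate label "fair", and the name qualitative_rating are assumptions chosen so that 6.8 maps to "good" on a scale of 1 (poor) to 10 (excellent), as in the example above.

```python
import bisect

THRESHOLDS = [3.0, 5.0, 8.0]                      # assumed band boundaries
LABELS = ["poor", "fair", "good", "excellent"]    # one label per band


def qualitative_rating(value, low=1.0, high=10.0):
    """Map a numeric skill value onto an adjective descriptor."""
    value = max(low, min(high, value))            # clamp onto the scale
    return LABELS[bisect.bisect_right(THRESHOLDS, value)]
```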
The categories, knowledge, skills and terms are preferably set up in a relational
database prior to the extraction process. As described above with reference to Figures 4
and 5, in a preferred embodiment, the relationship between categories and knowledge
is many-to-many, the relationship between knowledge and skills is many-to-many, and
the relationship between skills and terms is one-to-many.
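By way of illustration only, a relational layout with these cardinalities may be sketched as follows. All table and column names are hypothetical and are not the tables of the preferred embodiment; SQLite is used only so that the sketch is self-contained. The two many-to-many relationships are carried by link tables, and the one-to-many relationship by a foreign key on the term table.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Category  (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE Knowledge (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE Skill     (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
-- one skill, many terms
CREATE TABLE Term      (id INTEGER PRIMARY KEY, text TEXT, language TEXT,
                        skill_id INTEGER REFERENCES Skill(id));
-- many-to-many link tables
CREATE TABLE CategoryKnowledge (
    category_id  INTEGER REFERENCES Category(id),
    knowledge_id INTEGER REFERENCES Knowledge(id),
    PRIMARY KEY (category_id, knowledge_id));
CREATE TABLE KnowledgeSkill (
    knowledge_id INTEGER REFERENCES Knowledge(id),
    skill_id     INTEGER REFERENCES Skill(id),
    PRIMARY KEY (knowledge_id, skill_id));
"""


def create_network_db():
    """Create an empty in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn
```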
Referring now to Figure 8, there is shown a flow chart of a preferred
embodiment of a method for the present invention. In a preferred embodiment, a
resume is evaluated (802) for a particular skill. The skill level for that particular skill
is then determined (804) using the above described techniques. After a final skill level
value is determined, the skill level is mapped (806) to a qualitative scale. Finally, the
skill and the qualitative scale value of the skill level are stored (808) in the target
database. More specifically, the categories, knowledge, skills and terms (i.e. the
semantic network) are loaded into main memory. The electronic document text is then
passed to the skill and knowledge information extractor 702. In a preferred
embodiment, knowledge, skills, skill levels and number of years are extracted from the
electronic document in the following manner: first, all the terms in the database are checked against the document, then an initial scan of the document collects all the
terms. The frequency of appearance of the term is recorded. Afterwards, the weightage
factors for the skill level calculation are applied. A second scan of the electronic
document analyzes the document and a running list is maintained for all terms to
calculate the experience duration where the term is maintained. On completion of the
second scan, all the terms are rolled up into skills according to the semantic network
and thesaurus, all the skills are rolled up into knowledge according to the semantic
network, and all the knowledge items are rolled up into categories. Additionally,
categories specifically mentioned are added. Thus, based on this information, the skill
levels and years of experience are computed as described above.
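By way of illustration only, the collection of term frequencies and the subsequent roll-up of terms into skills, skills into knowledge, and knowledge into categories may be sketched as follows. The mapping tables and the names count_terms and roll_up are hypothetical, matching is by simple case-insensitive substring search, and the calculation of experience duration during the second scan is not shown.

```python
from collections import Counter

TERM_TO_SKILL = {
    "ms vc++": ["Visual C++"],
    "microsoft visual c++": ["Visual C++"],
    "jdbc": ["Java"],
}
SKILL_TO_KNOWLEDGE = {
    "Visual C++": ["Object Oriented programming languages"],
    "Java": ["Object Oriented programming languages"],
}
KNOWLEDGE_TO_CATEGORY = {
    "Object Oriented programming languages": ["Software"],
}


def count_terms(text):
    """Record how often each known term appears in the document text."""
    lowered = text.lower()
    return Counter({term: lowered.count(term)
                    for term in TERM_TO_SKILL if term in lowered})


def roll_up(counts, mapping):
    """Carry each count one level up; an item may have several parents."""
    rolled = Counter()
    for name, n in counts.items():
        for parent in mapping.get(name, []):
            rolled[parent] += n
    return rolled
```

A possible use of the sketch is to call count_terms once on the document text and then apply roll_up three times in succession with the three mapping tables.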
Referring now to Figure 9, there is shown a screen shot of a user interface of a
preferred embodiment of a target database for a skill and knowledge information
extractor 702. Window 902 displays the particular skills analyzed from a candidate's
resume, the qualitative level determined by the skill and knowledge information
extractor 702, and the years of experience the candidate has for the particular skill. For
example, the highlighted portion of window 902 indicates that the candidate has some
skill as an analyst, that the qualitative proficiency of the candidate's skill as an analyst
is "excellent", and that the candidate has 4 years of experience as an analyst. Thus, the
present invention advantageously allows a user to extract, determine, and display from
a candidate's resume the proficiency of a particular skill of the candidate.
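As a rough illustration of how a row such as the one highlighted in window 902 might be produced, the sketch below maps a numeric skill level to a qualitative label and formats it with the years of experience. The cut-off values, the labels other than "excellent", and the names `qualitative_level` and `SkillRow` are hypothetical; the source does not specify them here.

```python
from bisect import bisect_right
from dataclasses import dataclass

# Hypothetical cut-offs and labels; only "excellent" appears in the text above.
THRESHOLDS = [25, 50, 75]
LABELS = ["basic", "average", "good", "excellent"]


def qualitative_level(score: float) -> str:
    """Map a numeric skill level onto the assumed qualitative scale."""
    return LABELS[bisect_right(THRESHOLDS, score)]


@dataclass
class SkillRow:
    skill: str
    score: float
    years: float

    def display(self) -> str:
        """Format one row: skill, qualitative level, years of experience."""
        return f"{self.skill:<12}{qualitative_level(self.score):<12}{self.years:g} yrs"
```

For instance, `SkillRow("Analyst", 80, 4).display()` produces a row pairing the skill "Analyst" with the label "excellent" and 4 years, under the assumed thresholds.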
The present invention is designed as a set of Object Oriented Libraries and
contains the following major Object Libraries:
Figure imgf000030_0001
In a preferred embodiment, the present invention may be implemented to run
on a Windows NT Server with any relational database, such as Oracle Database.
Database tables may be used to define how information is represented in a relational or
object-oriented database. In an object-oriented implementation, any relational table is preferably represented as an object class. The following section describes a preferred
embodiment of the content and type of the fields that are extracted into a relational
database, and also the definitions of the categories, knowledge, skills and terms. The
supporting tables are also explained. One skilled in the art will realize that these tables
are not limited to the specific information illustrated therein but may be created as
needed, depending on the document type being processed.
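The idea of representing a relational table as an object class can be sketched as follows, using the Courses table (Table 13) as the example. This is an assumption-laden illustration rather than the described system: SQLite stands in for the relational database, the column types are guesses, the column names follow the table as printed, and `save_course` and `load_courses` are hypothetical helpers.

```python
import sqlite3
from dataclasses import astuple, dataclass


@dataclass
class Course:
    """Object-class view of one row of the Courses table (types are assumed)."""
    COCourseName: str
    CandidateID: int
    CODDateDD: int
    CODateMM: int
    CODateYYYY: int
    CoNotes: str


# Relational view of the same structure; SQLite is used only for illustration.
DDL = """
CREATE TABLE Courses (
    COCourseName TEXT,
    CandidateID  INTEGER,
    CODDateDD    INTEGER,
    CODateMM     INTEGER,
    CODateYYYY   INTEGER,
    CoNotes      TEXT
)
"""


def save_course(conn, course):
    """Insert one Course object as a row."""
    conn.execute("INSERT INTO Courses VALUES (?, ?, ?, ?, ?, ?)", astuple(course))


def load_courses(conn, candidate_id):
    """Read a candidate's rows back as Course objects."""
    rows = conn.execute(
        "SELECT * FROM Courses WHERE CandidateID = ?", (candidate_id,)
    )
    return [Course(*row) for row in rows]
```

With an in-memory connection (`sqlite3.connect(":memory:")`) and the table created from `DDL`, a saved `Course` can be read back unchanged through `load_courses`.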
Table 1
AutoEntryDocuments
Table 1 holds the documents that are to be extracted. It holds the following information:
Figure imgf000031_0001
Table 2 AutoEntrySchedule
Table 2 holds information about the scheduled extraction tasks.
Figure imgf000031_0002
Table 3 Candidate
Table 3 holds personal information such as the person's name, contact address, current employer, and resume summary. The XtractionXpert automatically extracts the following information from the resume:
Figure imgf000031_0003
Figure imgf000032_0001
Figure imgf000033_0001
Figure imgf000034_0001
Table 9
KnowledgeRecord
Figure imgf000035_0001
RECandidateDate End Date association ended
Table 13 Courses
COCourseName    Name of the Course taken
CandidateID     Database Id of the Candidate
CODDateDD       Date course taken (date)
CODateMM        Date course taken (month)
CODateYYYY      Date course taken (year)
CoNotes         Description of the course
Table 14 AwardsHonors
Awhighlight     Name and highlight of the Award or Honor
CandidateID     Database Id of the candidate
AWNotes         Description of the Award or Honor
Table 15
MiscellenousInformation
Figure imgf000036_0001
Table 16 Category
Table 16 provides information regarding the relationships between categories and knowledge information.
Figure imgf000036_0002
Table 17 MetaConcept
Table 17 provides knowledge information for semantic network 220.
Figure imgf000036_0003
Table 18 Concept
Table 18 provides information relating to skills.
Figure imgf000037_0001
Table 19 ConceptRelation
Table 19 provides information on relationships between skills and knowledge.
Figure imgf000037_0002
Table 20 Term
Table 20 provides information on terms.
Figure imgf000037_0003
Table 21 Language
Table 21 stores information about different languages to which the terms belong.
Figure imgf000037_0004
Table 23 CaWordList
Figure imgf000038_0001
Table 24 CaWordPosition
Figure imgf000038_0002
From the above description, it will be apparent that the invention disclosed
herein provides a novel and advantageous system and method for extracting and
analyzing skill and knowledge information from an electronic document. The
foregoing discussion discloses and describes merely exemplary methods and
embodiments of the present invention. As will be understood by those familiar with
the art, the invention may be embodied in other specific forms without departing from
the spirit of the invention or essential characteristics thereof. Accordingly, the
disclosure of the present invention is intended to be illustrative, but not limiting, of the
scope of the invention, which is set forth in the following claims.

Claims

1. An apparatus for extracting skill and knowledge information from an
electronic document and for storing skill and knowledge information into a target
database, the apparatus comprising:
a content analysis and semantic network engine for analyzing and extracting
skill and knowledge information from the electronic document; and
a skill and knowledge information extractor, coupled to the content analysis
and semantic network engine, for determining a skill level for the skill information
extracted from the electronic document and for storing the skill level in the target
database.
2. The apparatus of claim 1 wherein the skill and knowledge information
extractor also maps the skill level for the skill to a qualitative scale.
3. The apparatus of claim 1 wherein the content analysis and semantic
network engine further comprises:
a thesaurus for linking together terms and skills; and
a semantic network, coupled to the thesaurus, for organizing terms and skills of
the thesaurus, knowledge, and categories, and for defining relationships between and
among the terms, skills, knowledge, and categories.
4. The apparatus of claim 1 wherein the skill and knowledge information
extractor determines a skill level, at least in part, by using the mathematical equation:
$\mathrm{SkillLevel}(X) = \mathrm{SkillLevel}(X) + W(O) + \sum_{i} W(P_i) + W(K) + W(Y) - W(LU)$.
5. The apparatus of claim 1 wherein the skill and knowledge information
extractor determines a skill level using a non-monotonic and default reasoning
approach.
6. The apparatus of claim 2 wherein a skill extracted from the electronic
document and the skill mapping to a qualitative scale are displayed on a computer.
7. An apparatus for analyzing and extracting skill and knowledge
information from an electronic document into a target database having predefined
fields, the apparatus comprising:
a thesaurus for linking together terms and skills and for defining relationships
between and among the terms and skills; and
a semantic network coupled to the thesaurus for organizing terms and skills in
the thesaurus, knowledge, and categories in a hierarchical structure;
wherein the thesaurus and semantic network are used to analyze skill and
knowledge information in the electronic document.
8. The apparatus of claim 7 further comprising:
a document pre-processor coupled to the semantic network for classifying the
electronic document as a document type and for performing an initial analysis on the
electronic document.
9. The apparatus of claim 7 further comprising:
a heuristics engine coupled to the semantic network for applying a set of
heuristics to the electronic document.
10. The apparatus of claim 7 further comprising:
a skill and knowledge information extractor for extracting skill and knowledge
information from the electronic document and for determining a skill level for skill
information extracted from the electronic document.
11. The apparatus of claim 10 further comprising:
a target database coupled to the semantic network for storing skill and skill
level information in predefined fields in the target database.
12. A method for determining a skill level for skill information extracted
from an electronic document, the method comprising the steps of:
identifying skill and knowledge information in the electronic document;
extracting the skill and knowledge information from the electronic document;
and
determining a skill level for skill information extracted from the electronic
document.
13. The method of claim 12 wherein the step of determining a skill level is
performed by a skill and knowledge information extractor.
14. The method of claim 12 wherein the step of identifying skill and
knowledge information is performed using a semantic network.
15. A method for processing skill and knowledge information from an
electronic document, the method comprising the steps of:
identifying skill and knowledge information in the electronic document;
extracting the skill and knowledge information from the electronic document;
determining a skill level for skill information extracted from the electronic
document; and
mapping the skill level to a qualitative scale.
16. A computer implemented method for extracting and displaying skill and
knowledge information from an electronic document, the method comprising the steps
of:
identifying skill and knowledge information in the electronic document;
extracting the skill and knowledge information from the electronic document;
determining a skill level for skill information extracted from the electronic
document; and
mapping the skill level to a qualitative scale.
17. A computer-readable medium for extracting and displaying skill and
knowledge information from an electronic document, the computer-readable medium
comprising code for performing the steps of:
identifying skill and knowledge information in the electronic document;
extracting the skill and knowledge information from the electronic document;
determining a skill level for skill information extracted from the electronic
document; and
mapping the skill level to a qualitative scale.
PCT/US1999/026083 1998-11-04 1999-11-03 Advanced model for automatic extraction of skill and knowledge information from an electronic document WO2000026839A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB0113250A GB2359168A (en) 1998-11-04 2001-05-31 Advanced model for automatic extraction of skill and knowledge information from an electronic document

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US10706398P 1998-11-04 1998-11-04
US60/107,063 1998-11-04
PCT/US1998/027664 WO1999034307A1 (en) 1997-12-29 1998-12-28 Extraction server for unstructured documents
USPCT/US98/27664 1998-12-28
US38021999A 1999-08-27 1999-08-27
US09/380,219 1999-08-27

Publications (3)

Publication Number Publication Date
WO2000026839A1 true WO2000026839A1 (en) 2000-05-11
WO2000026839A8 WO2000026839A8 (en) 2000-10-12
WO2000026839A9 WO2000026839A9 (en) 2001-08-02

Family

ID=26804347

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1999/026083 WO2000026839A1 (en) 1998-11-04 1999-11-03 Advanced model for automatic extraction of skill and knowledge information from an electronic document

Country Status (1)

Country Link
WO (1) WO2000026839A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5197004A (en) * 1989-05-08 1993-03-23 Resumix, Inc. Method and apparatus for automatic categorization of applicants from resumes
US5297039A (en) * 1991-01-30 1994-03-22 Mitsubishi Denki Kabushiki Kaisha Text search system for locating on the basis of keyword matching and keyword relationship matching
US5416694A (en) * 1994-02-28 1995-05-16 Hughes Training, Inc. Computer-based data integration and management process for workforce planning and occupational readjustment
WO1998039716A1 (en) * 1997-03-06 1998-09-11 Electronic Data Systems Corporation System and method for coordinating potential employers and candidates for employment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
NESTOROV S ET AL: "Inferring structure in semistructured data", SIGMOD RECORD,US,SIGMOD, NEW YORK, NY, vol. 26, no. 4, May 1997 (1997-05-01), pages 39 - 45-43, XP002099175, ISSN: 0163-5808 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005024692A1 (en) * 2003-09-03 2005-03-17 Yahoo! Inc. Automatically identifying required job criteria
EP1706845A2 (en) * 2003-12-02 2006-10-04 Unisys Corporation Improved cargo handling security handling system and method
EP1706845A4 (en) * 2003-12-02 2008-08-06 Unisys Corp Improved cargo handling security handling system and method
US9959525B2 (en) 2005-05-23 2018-05-01 Monster Worldwide, Inc. Intelligent job matching system and method
EP1920364A4 (en) * 2005-07-27 2010-10-13 John Harney System and method for providing profile matching with an unstructured document
EP1920364A2 (en) * 2005-07-27 2008-05-14 John Harney System and method for providing profile matching with an unstructured document
US10181116B1 (en) 2006-01-09 2019-01-15 Monster Worldwide, Inc. Apparatuses, systems and methods for data entry correlation
US10387839B2 (en) 2006-03-31 2019-08-20 Monster Worldwide, Inc. Apparatuses, methods and systems for automated online data submission
US8021163B2 (en) * 2006-10-31 2011-09-20 Hewlett-Packard Development Company, L.P. Skill-set identification
US9779390B1 (en) 2008-04-21 2017-10-03 Monster Worldwide, Inc. Apparatuses, methods and systems for advancement path benchmarking
US9830575B1 (en) 2008-04-21 2017-11-28 Monster Worldwide, Inc. Apparatuses, methods and systems for advancement path taxonomy
US10387837B1 (en) 2008-04-21 2019-08-20 Monster Worldwide, Inc. Apparatuses, methods and systems for career path advancement structuring
US10997560B2 (en) 2016-12-23 2021-05-04 Google Llc Systems and methods to improve job posting structure and presentation
US9996523B1 (en) 2016-12-28 2018-06-12 Google Llc System for real-time autosuggestion of related objects
US10607273B2 (en) 2016-12-28 2020-03-31 Google Llc System for determining and displaying relevant explanations for recommended content
CN113240400A (en) * 2021-06-02 2021-08-10 北京金山数字娱乐科技有限公司 Candidate determination method and device based on knowledge graph

Also Published As

Publication number Publication date
WO2000026839A9 (en) 2001-08-02
WO2000026839A8 (en) 2000-10-12


Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref country code: US

Ref document number: 1999 380219

Date of ref document: 19991112

Kind code of ref document: A

Format of ref document f/p: F

AK Designated states

Kind code of ref document: A1

Designated state(s): CA GB IN US

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
AK Designated states

Kind code of ref document: C1

Designated state(s): CA GB IN US

CFP Corrected version of a pamphlet front page
CR1 Correction of entry in section i
ENP Entry into the national phase

Ref country code: GB

Ref document number: 200113250

Kind code of ref document: A

Format of ref document f/p: F

WWE Wipo information: entry into national phase

Ref document number: 09831064

Country of ref document: US

AK Designated states

Kind code of ref document: C2

Designated state(s): CA GB IN US

COP Corrected version of pamphlet

Free format text: PAGES 1-36, DESCRIPTION, REPLACED BY NEW PAGES 1-33; PAGES 37-41, CLAIMS, REPLACED BY NEW PAGES 34-37; PAGES 1/8-8/8, DRAWINGS, REPLACED BY NEW PAGES 1/9-9/9; DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE