DE19955717A1

DE19955717A1 - Converting unstructured data into structured data involves suggesting data structure element for selected input data segment that can be structured, allocating structure element as target element

Info

Publication number: DE19955717A1
Application number: DE19955717A
Authority: DE
Inventors: Frank Leymann; Dieter Roller
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 1998-11-11
Filing date: 1999-11-04
Publication date: 2000-08-24

Abstract

The method is implemented by a computer system and involves a data selection step, in which at least one data segment is selected, whereby this data segment contains part of the input data and can be converted into a structured data segment. At least one data structure element is suggested in a suggestion step. An allocation step involves allocating a data structure element as the target structure element for storing the selected data segment. The segment is then extracted from the input data and stored in the target data structure element. Independent claims are also included for a system for implementing the method, for a data processing program and for a computer program.

Description

1. Background of the Invention

1.1 Scope of application of the present invention

The present invention relates to a method in the field of information mining. More specifically, the present invention relates to a method for handling unstructured input data.

1.2 Description and disadvantages of the previous situation

Organizations generate and collect large amounts of data that they use in their daily operations. Nevertheless, many companies are unable to exploit the full potential of this data because its information content is not easy to discern. Operational systems record transactions as they arrive, day and night, and store the transaction data in files and databases. Documents are created and placed in shared files or in repositories provided by document management systems. The growing pervasiveness of the Internet and its increasing worldwide acceptance as the main channel both for communication between individuals and for conducting business (for example by e-mail) have multiplied the sources of information and thus the opportunities for gaining competitive advantage. Business intelligence solutions is a term that describes the processes used together to achieve better decision making. Information mining denotes the process of data mining and/or text mining. It uses advanced technology to gain valuable insights from these sources, which enable the business user to make the right decisions and thus obtain the competitive advantage needed to succeed in today's competitive environment. Information mining in general produces previously unknown, comprehensible and actionable information from any source, such as transactions, documents, e-mail, web pages, etc. This information can form the basis for important business decisions.

Data is the raw material. It may be a set of discrete facts about events, in which case it is usefully described as structured records of transactions, usually in alphanumeric form. Documents and web pages, however, are also a source of unstructured data, delivered as a bit stream and decoded into the words and sentences of text in a particular national language.

Industry analyses assume that unstructured data makes up 80% of all data in an enterprise and that only 20% is structured; the unstructured data comes from different sources, for example text, image, video and audio. The bulk of the unstructured data, however, is text.

Data mining exploits the infrastructure of stored data (that is, the meta-information of the data to be processed, for example the layout of the data, certain labels, relationships, etc.) to obtain further useful information. Mining a customer database, for example, might reveal that everyone who buys product A also buys products B and C, only six months later.

An application can handle and process its input data only if that data adheres to a predefined structure known to the application.

Because the availability of structured data is a prerequisite for any further processing of the potential components that make up unstructured data, text mining, for example, was developed. Text mining is the application of the principle of data mining to unstructured or loosely structured text files. Unlike data mining, text mining must operate in a less structured environment. Documents rarely have a strong internal infrastructure (and where they do, it usually concerns the document format rather than the content). In text mining, metadata about documents is extracted from the documents. The metadata is a way of enriching the content of a document so that mining software can subsequently manipulate the document. Text mining is a method for extending data mining to the immense and ever-growing volumes of stored text by means of an automated process that creates structured data describing documents. Within text mining there are many different technologies for generating metadata for a document, with the aim of determining the type of a document, deriving its structure, and so on. Some examples:

Feature extraction: used to find and extract information or knowledge from text documents.

Clustering technology: used to sort documents by topic, to find the main topics in a document collection, etc.

All of these technologies are effective to a certain degree and help in navigating this huge number of unstructured information sources. Ultimately, however, they cannot reliably and automatically extract the structured information content from an unstructured input document. They can only provide certain indications about the nature of the input data and offer no tools for converting unstructured input data into structured input data.

1.3 Objective of the present invention

The present invention is based on the objective of providing a method for extracting structured data from unstructured input, thereby supporting an application that requires structured input data for its processing.

2. Summary and advantages of the present invention

The objectives of the present invention are achieved as set out in claim 1.

The present invention relates to a method, executed by a computer system, for converting unstructured input data into structured output data. The method comprises a data selection step in which at least one data segment is selected, said data segment comprising part of said input data and being convertible into a data structure element. The method further comprises a suggestion step in which at least one data structure element is suggested. Finally, the method comprises an allocation step in which a data structure element is allocated as the target data structure element for storing said selected data segment, and in which said selected data segment is extracted from said input data and stored in said target data structure element.
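The three steps named above (data selection, suggestion, allocation) can be pictured with a minimal Python sketch. It is an illustration only, not the claimed implementation: the type `StructureElement`, the function names and the sample letter are all hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StructureElement:
    """A named slot of a target data structure (hypothetical type)."""
    name: str
    value: Optional[str] = None


def select_segment(input_data: str, start: int, end: int) -> str:
    """Data selection step: pick one part of the unstructured input."""
    return input_data[start:end]


def suggest_elements(elements: List[StructureElement]) -> List[str]:
    """Suggestion step: offer the names of the available structure elements."""
    return [element.name for element in elements]


def allocate(elements: List[StructureElement], target_name: str,
             segment: str) -> StructureElement:
    """Allocation step: store the selected segment in the chosen target element."""
    for element in elements:
        if element.name == target_name:
            element.value = segment
            return element
    raise KeyError(target_name)


# Assumed sample input; in practice this would be a letter, e-mail, etc.
letter = "My name is Jane Doe and I would like to order 3 widgets."
elements = [StructureElement("customer_name"), StructureElement("quantity")]

start = letter.index("Jane Doe")
segment = select_segment(letter, start, start + len("Jane Doe"))
suggestions = suggest_elements(elements)          # what a menu would list
target = allocate(elements, suggestions[0], segment)
```

In this sketch the user's choice from the suggestions is simulated by taking the first entry; an interactive implementation would present the list and wait for a selection.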

The present invention makes it possible to cope with the huge and ever-growing volumes of unstructured electronic data. From the user's perspective, its advantage is that it simplifies the task of extracting data from unstructured input data for applications that expect structured data. The time needed for data extraction is reduced significantly, and error-prone retyping is no longer necessary. The user can freely select any data segment for extraction and is not restricted by the system. The user no longer needs to know the potential data structures in advance. Instead, the method offers the possible data structures for selection in the suggestion step.

According to a further embodiment of the present invention, the method also comprises a storage step in which said target data structure element is stored persistently.

Persistent storage of the captured input data allows any application to access it at a later time.

According to a further embodiment of the present invention, data structure elements and/or data structures are suggested in the suggestion step, said data structures comprising one or more data structure elements and/or one or more further data structures.

This further embodiment avoids restrictions on the layout of the data structures involved. Any data structure can be composed of atomic data elements and/or further data structures (with the same kind of substructure). The present invention imposes no restrictions on the recursive layout.
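A recursive layout of this kind can be modelled, for illustration, as a structure whose members are either atomic elements or further structures. The following Python sketch is one possible representation under that assumption; the names `Element`, `Structure`, `leaf_paths` and the sample `order` structure are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Element:
    """Atomic data structure element (cannot be decomposed further)."""
    name: str


@dataclass
class Structure:
    """Data structure made of elements and/or nested structures."""
    name: str
    members: List[Union["Element", "Structure"]] = field(default_factory=list)


def leaf_paths(node: Union[Element, Structure], prefix: str = "") -> List[str]:
    """Walk the structure recursively and list the path of every atomic element."""
    path = f"{prefix}/{node.name}" if prefix else node.name
    if isinstance(node, Element):
        return [path]
    paths: List[str] = []
    for member in node.members:
        paths.extend(leaf_paths(member, path))
    return paths


# Assumed example: an order that contains a customer, which contains an address.
order = Structure("Order", [
    Element("order_number"),
    Structure("Customer", [
        Element("name"),
        Structure("Address", [Element("street"), Element("city")]),
    ]),
])

paths = leaf_paths(order)
```

Because `Structure` may contain `Structure`, the nesting depth is not limited by the representation, which mirrors the absence of restrictions on the recursive layout.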

According to a further embodiment of the present invention, the data selection step is preceded by an application context determination step in which at least one target application is determined that may process the structured output data. In the application context determination step, said input data can be classified automatically by said computer system and assigned to at least one target application. Alternatively or additionally, in said step a user can select at least one target application from a group of applications. Finally, in said suggestion step only those data structure elements are suggested that relate to said target application.

The ability to select an application context allows a significant reduction in the number of potential target data structures in the suggestion step. Automatic classification of the input data brings further advantages. The classification can result in an automatically assigned application context or in a preselected one. In the latter case, the preselected application context can be refined further by the user.
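How an application context narrows the suggestions can be sketched as follows. The catalog contents, the application names and the keyword-based `classify` function are assumptions made for this illustration; the specification does not prescribe a particular classification technique.

```python
from typing import Dict, List, Optional

# Hypothetical catalog: which data structures belong to which target application.
CATALOG: Dict[str, List[str]] = {
    "order_entry": ["Order", "Customer"],
    "complaints": ["Complaint", "Customer"],
}


def classify(input_data: str) -> str:
    """Toy stand-in for automatic classification: a simple keyword match."""
    return "complaints" if "complain" in input_data.lower() else "order_entry"


def suggest_structures(input_data: str, context: Optional[str] = None) -> List[str]:
    """Suggest only structures of the target application.

    The context is either chosen by the user (`context`) or, if none is
    given, assigned automatically by classifying the input data.
    """
    chosen = context if context is not None else classify(input_data)
    return CATALOG[chosen]


auto = suggest_structures("I wish to complain about a late delivery.")
manual = suggest_structures("Some letter", context="order_entry")
```

Here `auto` is derived from the automatically assigned context and `manual` from a context the user picked explicitly; in both cases the menu would list two structures instead of the whole catalog.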

According to a further embodiment of the present invention, in said data selection step a parser analyzes said input data, classifies potential data segments and preselects data segments that may be relevant to a target application.

On this basis, the present invention also simplifies the selection process. The method proposes preselected data segments, which a user can accept or supplement on the basis of additional knowledge.

According to a further embodiment of the present invention, it is proposed to integrate the method into a target application and/or a mailing system and/or a word processing program.

The advantage is that the method is made available at those points in a system where unstructured data enters the system. The conversion into structured data therefore takes place as early as possible, which allows all applications executed later to benefit from the structured data.

3. Brief description of the drawings

Fig. 1 shows an example of manual data capture using forms according to the prior art.

Fig. 2 illustrates the capture of structured output data from unstructured input data in accordance with the present invention.

Fig. 3 illustrates the further embodiment in which an application context limits the set of potential target data structures.

Fig. 4 is a summary of the method of the present invention.

4. Description of the preferred embodiment

Where the description of the present invention refers to electronic data, an electronic document, etc., all types of data are meant.

4.1 Introduction

Operational systems (that is, systems that run the daily operations of an enterprise) work with structured data. The composition of such data from simple atomic data types (in simple cases integers, strings, etc.) is predefined and known to the systems that process the data. Without such metadata, none of the classic applications or even algorithms works: data that has not been structured essentially cannot be processed (at least not with respect to its potential constituents).

When people communicate with one another, data is used mainly in 'unstructured' form.

Examples of such data are text, images and speech exchanged between people who communicate by letter, fax, e-mail, telephone call, etc. This data has no structure that is available to applications or algorithms. Consequently, a human must extract the relevant data from the unstructured input and structure it according to the requirements of the application if the unstructured input would otherwise have a negative effect on the operations of an enterprise.

Fig. 1 shows what is normally done in such situations today: the unstructured input (100) (in this case a word processing attachment to an e-mail) is read by a human. The human understands the letter and knows what kind of data the target application needs. The human therefore extracts the required data from the unstructured input and enters it (field by field) into a form (101) - step 1 in Fig. 1. This form could, for example, already be presented in the interface of the target application. As soon as the form has been filled in completely, the human notifies the application, which then uses the now structured input to manipulate, for example, a database (or a file, etc.) (102) - step 2 in Fig. 1.

This procedure is not only tedious and time-consuming but also known to be error-prone: the human must remember the data from the unstructured input (for example the name of a customer and its spelling) and enter it into the application's form. The human can of course also write this data down by hand on a sheet of paper, or simply display the data source and the target form in two different windows on the screen at the same time, but the susceptibility to errors and the effort remain high.

4.2 The solution

The solution in accordance with the present invention is shown in Fig. 2.

The present invention is based on the use of menus that support the capture of structured data from unstructured input. Such menus could be offered, for example, in the form of cascading context menus. For this purpose, the software used to browse the unstructured input (201) (for example a word processing program) could be extended by an implementation of the proposed method in accordance with the present invention.

In a selection step, the described method allows the selection (for example by marking) of a part of the input (202) that a human has identified as relevant to an application requiring structured input. As an additional enhancement, the method could use individual elements of data mining. If, for example, "feature extraction technology" and "classification technology" are used, a parser can automatically recognize and classify potential data segments that are relevant to a target application. Based on this parser step, recognized data segments that may be relevant could already be preselected by the described method. A user could then use these preselected data segments during the course of the method, or add to the preselection proposed by the method.
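A parser-based preselection of this kind can be approximated, for illustration, with a few regular expressions. This is a deliberately simple stand-in: the specification speaks of feature extraction and classification technology without prescribing one, and the patterns, the segment kinds and the function `preselect` are assumptions of this sketch.

```python
import re
from typing import List, Tuple

# Hypothetical segment classes and the patterns used to recognize them.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def preselect(text: str) -> List[Tuple[int, int, str, str]]:
    """Return (start, end, kind, segment) for every recognized candidate segment,
    ordered by position in the text."""
    found = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.end(), kind, match.group()))
    return sorted(found)


text = "Please contact jane.doe@example.com before 1999-11-04."
candidates = preselect(text)
# A user may accept the preselection or add segments of their own,
# for instance a name the parser did not recognize.
```

Each candidate carries its character offsets, so a user interface could highlight the preselected segments in the original input and let the user accept, drop or extend them.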

Next, in a suggestion step, a context menu is opened that lists all possible data structures of the target application relevant to the data types (overall menu (203) in Fig. 2, step 1). The context menu could be opened in the way that is usual for word processing programs today, for example by pressing the right mouse button while the mouse pointer is over the marked item. Note that there are various alternatives here (which can also be combined):

The overall menu could list all data structures that are relevant to the company.

The overall menu could list all data structures that are relevant to a specific target application or a specific group of target applications.

Using the classification result of a parser that has analyzed the input data, the overall menu could list all data structures that are relevant to the document type to which the input data belong.

At this point, the target input structure is selected, and the list of next-level data structures (204) that make up the target is displayed (step 2 in Fig. 2). For simplicity, we assume that the latter data structures are already atomic (that is, they cannot themselves be broken down into further sub-elements), so that no further refinement is necessary; this is why the second box in Fig. 2 is called the attribute menu. Otherwise, the refinement continues by opening additional list boxes; that is, structure elements that are themselves structures could contain further structure elements, and so on. Selecting an item from the attribute menu assigns the selected part of the unstructured input as the value of that attribute, which completes the assignment step of the method.
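The staged refinement and the assignment step can be sketched as follows. This is only an illustration under stated assumptions: the schema `SCHEMA`, its structure and attribute names, and the functions `menu`, `is_atomic` and `assign` are hypothetical; the description does not prescribe any particular representation.

```python
# Hypothetical target-application schema. A dict value is a nested
# structure (it opens a further menu); None marks an atomic attribute.
SCHEMA = {
    "Customer": {"name": None, "address": {"street": None, "city": None}},
    "Order": {"item": None, "quantity": None},
}


def _node(path):
    """Walk the schema along the chosen menu path."""
    node = SCHEMA
    for key in path:
        node = node[key]
    return node


def menu(path=()):
    """List the entries shown at one menu level; () is the overall menu."""
    node = _node(path)
    if node is None:
        raise ValueError("an atomic attribute has no submenu")
    return sorted(node)


def is_atomic(path):
    return _node(path) is None


def assign(instance, path, unstructured, start, end):
    """Store the selected part of the input as the value of an atomic attribute.

    path[0] names the target structure; the remaining keys lead to the attribute.
    """
    if not is_atomic(path):
        raise ValueError("refine the selection until an atomic attribute is reached")
    target = instance
    for key in path[1:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = unstructured[start:end]
    return instance
```

Here `menu(())` plays the role of the overall menu and `menu(("Customer",))` that of the next-level menu; when every entry of a level is atomic, that level corresponds to the attribute menu.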

Of course, it would also be possible to present the potential data structures and their substructures in a single dialog. This affects only the usability of the described method. The decision whether to use staged menus depends on the complexity of the data structures involved.

In this way, data structures that correspond to the input forms of the target applications are filled (that is, instantiated). The instances of these structured data could be kept in memory (volatile cache (205), step 3 in Fig. 2) until the person indicates that all required data have been captured.

Next, the cached instances are passed to the target application (step 4 in Fig. 2) in order to store the captured and structured output data permanently (206). This storing step completes the method.
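Steps 3 and 4 can be illustrated by a small sketch of a volatile cache that forwards its contents only when the user confirms that capture is complete. The class `CaptureSession` and the callable `target_application` are hypothetical; how a real target application accepts instances is not specified in the description.

```python
class CaptureSession:
    """Keeps filled instances in memory until the user confirms completion."""

    def __init__(self, target_application):
        # target_application: any callable accepting (structure_name, instance);
        # a hypothetical stand-in for the target application's input interface.
        self._target = target_application
        self._cache = []  # volatile cache (205): lost if the session is dropped

    def add(self, structure_name, instance):
        # Copy the instance so later edits by the caller do not alter the cache.
        self._cache.append((structure_name, dict(instance)))

    def pending(self):
        return len(self._cache)

    def commit(self):
        """Forward all cached instances to the target application, in order."""
        forwarded = 0
        while self._cache:
            name, instance = self._cache[0]
            self._target(name, instance)  # if this raises, the entry stays cached
            self._cache.pop(0)
            forwarded += 1
        return forwarded
```

Removing an entry from the cache only after the target application has accepted it is a design choice of this sketch; it keeps data that has not yet been stored available for a retry.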

To further improve the usability of data capture, the present invention proposes dividing the list displayed in the overall menu (203) into subgroups by selecting the appropriate application context. An application context can be thought of as one or more applications that are able to process data contained in the unstructured input. As shown in Fig. 3, this selection could be supported by adding an application context menu to the menu bar of the software used to browse the unstructured input (301).

When the application context menu is selected, a list (302) of the available overall menus is displayed. When an item is selected from this list, only the corresponding subgroup of data structures is displayed in the overall menu (203) described above. This completes the application context determination step in accordance with the present invention. A side effect of selecting an item from the application context menu is that the browser software knows that the standard context menus (for example, the text menu when the selected item is a piece of text in a word processing program) are not the ones to pop up for selected parts of the unstructured input. Instead, the context menus that belong to the selected application are displayed.

As with the proposal step, it should be noted that various alternatives exist for this component of the method (which can likewise be combined):

The application context menu could list all application contexts that are relevant to the company.

Using the classification result of a parser that has analyzed the input data, the application context menu could list only those application contexts that are relevant to the document type to which the input data belong.
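How an application context narrows the overall menu can be sketched as follows. The registries `CONTEXTS` and `CONTEXTS_BY_DOCUMENT_TYPE`, all context, structure and document type names, and both functions are assumptions made purely for illustration.

```python
# Hypothetical registry: application context -> data structures it offers.
CONTEXTS = {
    "Order entry": ["Customer", "Order"],
    "Customer care": ["Customer", "Complaint"],
    "Travel booking": ["Traveler", "Itinerary"],
}

# Hypothetical mapping from a classified document type to relevant contexts.
CONTEXTS_BY_DOCUMENT_TYPE = {
    "purchase e-mail": ["Order entry", "Customer care"],
}


def application_context_menu(document_type=None):
    """List all contexts, or only those relevant to a classified document type."""
    if document_type is None:
        return sorted(CONTEXTS)
    return sorted(CONTEXTS_BY_DOCUMENT_TYPE.get(document_type, []))


def overall_menu(selected_contexts=None):
    """List data structures, restricted to the selected contexts if any are given."""
    names = CONTEXTS if selected_contexts is None else selected_contexts
    return sorted({structure for name in names for structure in CONTEXTS[name]})
```

With no context selected, `overall_menu()` corresponds to the company-wide list; passing one or more contexts yields the subgroup shown in the overall menu (203).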

The principle of the present invention could be implemented by having application developers provide components (for example, Java Beans) that create application contexts or context menus. Based on these components, the browser software could then assemble the menu bars and context menus. If the resulting browser software also includes such components (for example by reference), the software supports data capture immediately.

In summary, the described method comprises the following steps, which are shown in Fig. 4:

1. In an optional application context determination step (401), one or more applications that would be able to process the unstructured data are selected. The set of possible application contexts described above can be predefined by a modern document classification or determined dynamically. In the course of the described method, the selected application context is used to limit the number of possible data structures in the following steps.
2. In the next step, the data selection step (402), data segments within the input can be selected as elements of a data structure. As a possible extension, a parser with feature extraction can be used to preprocess the input so that data segments are already preselected.
3. In the proposal step (403), possible data structures that could contain the selected data segment are proposed. The group of proposed data structures could be limited by the application context or by the classification result of a parser that applies data mining to the complete unstructured input document or to the selected data segment.
4. In the assignment step (404), a target data structure element is assigned and used to store the selected data segment. The method extracts the selected data segment from the input data and stores it in the target data structure element.
5. Finally, the captured and structured output data are stored permanently; that is, the method is completed by the storing step (405).

4.3 Advantages

More and more communication is supported by devices and software that generate unstructured data (such as text). For example, e-mail is spreading rapidly: it affects not only supply chains but also customer value chains and customer care systems. On the other hand, existing applications that already support the respective business processes expect structured data. There is thus a mismatch between the information source, which generates unstructured data, and the information target (data processing applications), which requires structured data.

From a user's point of view, the advantage of the present invention is that it simplifies the extraction of data from unstructured input for applications that expect structured data: the time needed for data extraction is reduced, and errors caused by re-entering data are avoided. This translates directly into savings for the user's employer.

From the point of view of a programmer of software for browsing unstructured data, the principle of the present invention offers a simple way to implement a method for data capture.

Claims

1. A method carried out by a computer system for converting unstructured input data into structured output data,
said method comprising a data selection step in which at least one data segment is selected, said data segment containing part of said input data and said data segment can be converted into a data structure element; and
said method comprising a proposal step in which at least one data structure element is proposed; and
said method comprising an assignment step in which a data structure element is assigned as a target data structure element for storing said selected data segment, and wherein said selected data segment is extracted from said input data and stored in said target data structure element.

2. A method for converting unstructured input data into structured output data according to claim 1, said method being completed by a storing step in which said target data structure element is stored permanently.

3. A method for converting unstructured input data into structured output data according to claim 1 or 2, wherein in said proposal step
data structure elements and/or
data structures are proposed, said data structures comprising one or more data structure elements and/or one or more further data structures.

4. A method for converting unstructured input data into structured output data according to claim 1, 2 or 3,
wherein said data selection step is preceded by an application context determination step for determining at least one target application for potential processing of said structured output data,
wherein in said application context determination step said input data is automatically classified by said computer system and assigned to at least one target application; and/or
wherein in said application context determination step, a user selects at least one target application from a group of applications;
and wherein only those data structure elements that are related to said target application are proposed in said proposal step.

5. A method for converting unstructured input data into structured output data according to claim 1, 2, 3 or 4,
wherein in said data selection step a parser analyzes said input data, said parser classifying potential data segments and preselecting data segments that may be relevant to a target application.

6. A method for converting unstructured input data into structured output data according to claim 1, 2, 3, 4 or 5,
said method being integrated in a target application; and/or
said method being integrated in a mailing system; and/or
said method being integrated in a word processing program.

7. A system comprising means designed to execute the steps of the method of the present invention according to any one of claims 1 to 6.

8. A data processing program that can be run in a data processing system and comprises software code parts for executing a method according to any one of claims 1 to 6.

9. A computer program stored on a storage medium that can be read by a computer, said program comprising computer-readable program means that cause a computer to execute a method according to any one of claims 1 to 6.