US20150248379A1 - Formatting module, system and method for formatting an electronic character sequence - Google Patents
- Publication number
- US20150248379A1 (application US14/428,972)
- Authority
- US
- United States
- Prior art keywords
- rules
- language
- sequence
- rule
- character sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
- G06F17/22
- G—PHYSICS > G06—COMPUTING; CALCULATING OR COUNTING > G06F—ELECTRIC DIGITAL DATA PROCESSING > G06F40/00—Handling natural language data > G06F40/10—Text processing > G06F40/12—Use of codes for handling textual entities
- G—PHYSICS > G06—COMPUTING; CALCULATING OR COUNTING > G06F—ELECTRIC DIGITAL DATA PROCESSING > G06F16/00—Information retrieval; Database structures therefor; File system structures therefor > G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data > G06F16/24—Querying > G06F16/245—Query processing > G06F16/2455—Query execution > G06F16/24564—Applying rules; Deductive queries
- G06F17/30507
- G—PHYSICS > G06—COMPUTING; CALCULATING OR COUNTING > G06F—ELECTRIC DIGITAL DATA PROCESSING > G06F40/00—Handling natural language data > G06F40/10—Text processing > G06F40/12—Use of codes for handling textual entities > G06F40/163—Handling of whitespace
- G—PHYSICS > G06—COMPUTING; CALCULATING OR COUNTING > G06F—ELECTRIC DIGITAL DATA PROCESSING > G06F40/00—Handling natural language data > G06F40/20—Natural language analysis > G06F40/274—Converting codes to words; Guess-ahead of partial word inputs
- G—PHYSICS > G06—COMPUTING; CALCULATING OR COUNTING > G06F—ELECTRIC DIGITAL DATA PROCESSING > G06F40/00—Handling natural language data > G06F40/20—Natural language analysis > G06F40/263—Language identification
Definitions
- the present invention relates to the formatting of spaces in an electronic character sequence.
- it relates to a formatting module, system and method for formatting spaces in an electronic character sequence.
- Punctuation marks are symbols that indicate the structure and organization of written language, as well as intonation and pauses to be observed when reading aloud. The appearance and usage of punctuation marks varies between languages and scripts but in most cases they are vital to disambiguate the meaning of sentences. The use and interpretation of punctuation marks can be heavily context-dependent. For example, a full stop “.” can be used as sentence-ending punctuation, an abbreviation indicator, a decimal point, and so on. Punctuation is also present in mathematical and scientific formulae.
- Some punctuators appear in pairs and one cannot exist without the other. For example, left parenthesis ‘(’ and right parenthesis ‘)’. However, in some scenarios a single character is used to represent two punctuators, creating ambiguity, for example in the case of the single quote mark: ‘.
- a space is a blank area, often used to separate words, letters, numbers, and punctuation.
- Conventions for the formatting of spaces vary among languages. For example, the correct formatting of spaces around a question mark “?” in English is “word?”, with no space between the word and the question mark, and a space following the question mark. However, in French the convention is “word ?”, where a space is inserted either side of the question mark.
- a number of current-market text input systems exhibit some form of space formatting. For example, when a user enters one of the following characters [ ? ! : ; , . ] after entering a space, the Android default keyboard formats spaces either side of the punctuation mark by removing the leading space and adding a trailing space, irrespective of the language in which the text is being entered.
- a formatting module supporting at least one language and configured to format spaces in an electronic character sequence written in a supported language, the formatting module comprising:
- formatting spaces in the electronic character sequence comprises inserting and/or deleting spaces in the electronic character sequence.
- the character identifier comprises:
- the comparison mechanism is preferably configured to compare each rule of one of the at least one set of rules to the electronic character sequence only when a supported language is identified.
- the formatting module supports a plurality of languages and the language identifier is configured further to identify the most likely language of the supported languages that the electronic character sequence is written in.
- the character identifier may be configured to identify a punctuation mark and the formatting module may be configured to format the spaces either side of the punctuation mark on the basis of the punctuation mark.
- the character identifier may be configured to identify a particular context in the electronic character sequence and the formatting module may be configured to format the spaces in the electronic character sequence on the basis of the context.
- the character identifier may be configured to identify a punctuation mark in the electronic character sequence
- the formatting module may be configured to format the spaces either side of the punctuation mark on the basis of the category of punctuation mark.
- the one or more actions may comprise a sequence of actions, wherein when a rule is found to be applicable, the comparison mechanism is configured to apply the sequence of actions to the electronic character sequence.
- the character identifier preferably comprises a plurality of sets of rules, one set of rules for each language that is supported, where the comparison mechanism is configured to compare each rule of the set of rules that corresponds to the most likely language to the electronic character sequence.
- the formatting module may comprise sets of rules relating to each language, each family of languages, and all languages in the world, wherein the rules are applied in a hierarchal structure such that, once a supported language has been identified, the comparison mechanism first compares each rule from the set of rules specific to that language, followed by each rule from the set of rules applicable to the family of languages to which that language belongs, followed by each rule of the set of rules which are applicable to all languages until an applicable rule is identified or no applicable rule is identified and all rules are exhausted.
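The hierarchical lookup described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the formatting module: the rule representation (a label plus a predicate), the tables `LANGUAGE_RULES`, `FAMILY_OF`, `FAMILY_RULES` and `GLOBAL_RULES`, the example rules themselves and the function `first_applicable` are all hypothetical.

```python
from typing import Callable, Optional

# A rule is modelled as (label, predicate); both are hypothetical stand-ins.
Rule = tuple[str, Callable[[str], bool]]

LANGUAGE_RULES: dict[str, list[Rule]] = {
    "fr": [("fr: space before '?'", lambda text: text.endswith("?"))],
}
FAMILY_OF: dict[str, str] = {"fr": "romance", "es": "romance"}
FAMILY_RULES: dict[str, list[Rule]] = {
    "romance": [("romance: space after ','", lambda text: text.endswith(","))],
}
GLOBAL_RULES: list[Rule] = [
    ("all: space after '.'", lambda text: text.endswith(".")),
]


def first_applicable(language: str, text: str) -> Optional[str]:
    """Search language rules, then family rules, then global rules."""
    tiers = [
        LANGUAGE_RULES.get(language, []),
        FAMILY_RULES.get(FAMILY_OF.get(language, ""), []),
        GLOBAL_RULES,
    ]
    for tier in tiers:
        for label, applies in tier:
            if applies(text):
                return label  # stop at the first applicable rule
    return None  # all rules exhausted, none applicable
```

The search stops as soon as one rule applies, so a language-specific rule always shadows a family-level or global rule for the same input.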
- the comparison mechanism is preferably configured to compare the rules in a specific predetermined order.
- the set of rules preferably comprises context rules, character rules and category rules and the comparison mechanism is preferably configured to compare the rules in the following order until an applicable rule is identified or no applicable rule is identified and all rules are exhausted: context rules, character rules, and then category rules.
- a formatting module supporting at least one language and configured to format spaces in an electronic character sequence, the formatting module comprising:
- a system for inputting text into an electronic device comprising:
- a method for formatting, using a formatting module supporting at least one language and having a character identifier, spaces in an electronic character sequence, the method comprising:
- the formatting module may comprise a language identifier to identify whether the electronic character sequence is written in a language supported by the formatting module.
- the formatting module supports a plurality of languages and the method further comprises identifying with the language identifier the most likely language of the electronic character sequence.
- the most likely language of the electronic character sequence may be identified by a text prediction engine, where the method further comprises transmitting the most likely language to the formatting module which identifies whether the most likely language is supported by the formatting module.
- the character identifier preferably comprises at least one set of rules and a comparison mechanism, each rule defining the formatting of spaces in the electronic character sequence, wherein the method further comprises:
- the comparison mechanism compares each rule of one of the at least one set of rules to the electronic character sequence only when a supported language is identified.
- Each rule may relate to a particular character or sequence of characters to be identified and each rule is associated with one or more actions which describe the format of spaces to be applied by the formatting module to the electronic character sequence given a supported language and the particular character or sequence of characters.
- the step of applying the applicable rule preferably comprises applying the one or more actions associated with that applicable rule to the electronic character sequence.
- Identifying a particular character may comprise identifying a punctuation mark and formatting the spaces in the electronic character sequence may comprise formatting the spaces either side of the punctuation mark on the basis of the form of the punctuation mark.
- Identifying a particular sequence of characters may comprise identifying a particular context and formatting the spaces in the electronic character sequence may comprise formatting the spaces on the basis of the context.
- Identifying a particular character may comprise identifying a punctuation mark and formatting the spaces in the electronic character sequence may comprise formatting the spaces either side of the punctuation mark on the basis of the category of punctuation mark.
- the one or more actions may comprise a sequence of actions, wherein the sequence of actions is applied sequentially to the electronic character sequence.
- the character identifier may comprise a plurality of sets of rules, one set of rules for each language supported, and comparing each rule to the electronic character sequence comprises comparing each rule of the set of rules that corresponds to the most likely language.
- the formatting module may comprise sets of rules relating to each supported language, each family of languages, and all languages in the world, and the method comprises applying the rules in a hierarchal structure such that, once a language has been identified, the comparison mechanism first compares each rule from the set of rules specific to that language, followed by each rule from the set of rules applicable to the family of languages to which that language belongs, followed by each rule of the set of rules which are applicable to all languages until an applicable rule is identified or no applicable rule is identified and all rules are exhausted.
- the comparison mechanism preferably compares the rules in a specific predetermined order.
- the set of rules may comprise context rules, character rules and category rules, and the method preferably comprises comparing the rules in the following order until an applicable rule is identified or no applicable rule is identified and all rules are exhausted: context rules, character rules, and then category rules.
- a computer program product comprising a computer readable medium having stored thereon computer program means for causing a processor to carry out a method as described above.
- FIG. 1 is a schematic of a system comprising a prediction engine and a formatting module in accordance with the present invention
- FIG. 2 is a schematic of a formatting module in accordance with the present invention.
- FIG. 3 is a schematic of the formatting module of FIG. 2 shown in greater detail
- FIG. 4 is an illustration of a structure of specific types of rules within a set of rules for a given language, and shows the order in which a comparison mechanism compares the rules, in accordance with the present invention
- FIG. 5 is an illustration of how the rules are structured for the English language and the order in which the comparison mechanism compares the rules, in accordance with the present invention.
- the present invention provides a formatting module that is configured to format the spaces for a particular sentence on the basis of the conventions for the language in which the sentence is written.
- the formatting module formats the spaces by inserting and/or deleting spaces in the electronic character sequence.
- the formatting module 10 is part of a system, such as an electronic device 100 , comprising a text prediction engine 30 , as shown in FIG. 1 .
- the electronic device is preferably a mobile device, such as a PDA, tablet, laptop computer or mobile phone.
- the formatting module may be used to format the spaces in an electronic character sequence entered by a user for a text message.
- the user interacts with a text entry system 50 of the electronic device 100 by entering text via an input mechanism such as a virtual keyboard.
- the text prediction engine 30 may be configured to correct mistyped or misspelt words and may also be configured to predict what the user is going to write next, thus improving the performance and quality of the text input into the device.
- An example of such a text prediction engine 30 is described in PCT/GB2011/001419, which is hereby incorporated by reference in its entirety.
- a character sequence is input into the device 100 .
- the character sequence is passed to a text prediction engine 30 which may modify that character sequence to correct misspelt words and/or to predict words.
- the character sequence, so modified by the text prediction engine 30, is passed to the formatting module 10.
- the formatting module 10 is configured to output a space formatted version of the modified character sequence, as shown in FIGS. 1 and 2 .
- the formatting module formats the spaces of a character sequence by inserting and/or deleting spaces in the sequence.
- the formatting module 10 formats the spaces for an electronic character sequence, if the language in which that character sequence is written is supported by the formatting module 10 .
- a formatting module 10 in accordance with the present invention is shown in FIG. 2 .
- the formatting module 10 is configured to support at least one language.
- the formatting module 10 comprises a language identifier 20 configured to identify whether an electronic character sequence is written in a language supported by the formatting module 10 .
- the language identifier 20 makes use of one or more statistical language models, the general properties of which are known in the art, in order to identify whether the electronic character sequence is written in a language supported by the formatting module 10 .
- the formatting module 10 supports a plurality of languages.
- the language identifier 20 comprises a plurality of statistical language models, each statistical language model corresponding to a different language supported by the formatting module 10, and the language identifier 20 is configured further to identify the most likely supported language of the electronic character sequence.
- the formatting module 10 is configured to maintain a list of “active languages”, each of which is associated with a language model.
- One process for identifying the most likely current language is to maximize the probability of a language, given a context, i.e. maximizing P(language | context). By Bayes' rule, P(language | context) = P(context | language) · P(language) / P(context). As the absolute values of P(language | context) are not important, since only the ranking of languages matters, the term P(context), which does not depend on language, may be dropped from the expression. Additionally, a uniform prior over languages, P(language) = k, may also be dropped since it is constant with respect to language. The only quantity that the language identifier is required to estimate is therefore P(context | language). The context is just a sequence of words, therefore P(context | language) can be estimated from the probability of each word given the words that precede it.
- Each language is therefore separately modelled by a smoothed n-gram language model (known in the art and as described in WO 2012/042217), capable of estimating the probability of a word, given local context.
- HMM: Hidden Markov Model
- SVM: support vector machine
- the language identifier 20 uses a tokenizer as is known in the art.
- the prediction engine 30 may comprise a language identifier, rather than it being provided in the formatting module 10 .
- the language identifier will comprise a tokeniser and a plurality of language models, which may already be present in the prediction engine, such as the prediction engine described in WO 2012/042217, which is hereby incorporated by reference in its entirety.
- the language identifier 20 is configured to calculate, in turn, the likelihood of the context in each supported language, and to select the language with the maximum likelihood.
- the likelihood of the context (a sequence of terms) is the product of the probability of each term, given preceding terms, which is computed by a smoothed n-gram model, as has been described in relation to a text prediction engine in WO 2012/042217.
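The maximum-likelihood selection can be sketched in Python as follows. This is only an illustration under stated assumptions: an add-one smoothed unigram model stands in for the smoothed n-gram model of WO 2012/042217, the tiny training strings are invented, and the names `train_unigram`, `log_likelihood`, `most_likely_language` and the `window` parameter are hypothetical. Log-probabilities are summed instead of multiplying probabilities, which gives the same ranking.

```python
import math
from collections import Counter


def train_unigram(corpus: str) -> Counter:
    """Count word frequencies; a stand-in for a real smoothed n-gram model."""
    return Counter(corpus.lower().split())


def log_likelihood(context: list[str], model: Counter) -> float:
    """log P(context | language) under add-one smoothing (an assumption)."""
    total = sum(model.values())
    vocab = len(model) + 1  # one extra slot for unseen words
    return sum(math.log((model[w.lower()] + 1) / (total + vocab)) for w in context)


def most_likely_language(context: list[str], models: dict[str, Counter],
                         window: int = 6) -> str:
    """Pick the language that maximises the likelihood of the recent context."""
    recent = context[-window:]  # only the most recent words are scored
    return max(models, key=lambda lang: log_likelihood(recent, models[lang]))


# Toy models built from invented example text.
MODELS = {
    "en": train_unigram("how are you doing talk to you soon my friend"),
    "fr": train_unigram("bonjour mon ami comment vas tu a bientot"),
}

print(most_likely_language("Bonjour mon ami".split(), MODELS))  # fr
```

Because the prior over languages is uniform and P(context) is the same for every language, comparing the likelihoods alone is enough to rank the languages.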
- the formatting of the spaces around the punctuation marks may differ between the sentences, dependent on the language in which it is written, e.g. “Bonjour mon ami ! How are you doing? Talk to you soon.”
- the language identifier 20 is preferably configured to limit the amount of context used to make the estimate of the most likely language. This provides a basic form of recency in the model for identifying the most likely language—languages used more recently are intuitively more likely than languages used much earlier in a document. For instance, in one embodiment, the language identifier 20 may use the six most recent words of context. However, the number of most recent words of context could be chosen dependent on the frequency at which a user switches between languages and the length of their input stream in any given language.
- the language identifier 20 is preferably configured to identify whether the language in which the electronic character sequence is written is supported by the formatting module 10 .
- the language identifier 20 may identify that the electronic character sequence is written in an unsupported language if none of the context terms of the sequence are present in one of the language identifier's language models, where each language model corresponds to a supported language. Thus, if one or more of the context terms are determined to be present in one of the language models, the language identifier determines that the electronic character sequence is written in a supported language.
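The supported-language test just described can be sketched in Python as follows. It is a minimal sketch that assumes each language model exposes its vocabulary as a set of lower-case words; the function name `is_supported` and the example vocabularies are hypothetical.

```python
def is_supported(context_terms: list[str], vocabularies: dict[str, set[str]]) -> bool:
    """True if at least one context term occurs in any supported language model."""
    terms = {term.lower() for term in context_terms}
    return any(terms & vocabulary for vocabulary in vocabularies.values())


# Invented vocabularies standing in for the language models.
VOCABULARIES = {
    "en": {"how", "are", "you"},
    "fr": {"bonjour", "mon", "ami"},
}
```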
- a variation on this example is one in which the language identifier 20 is configured to identify whether a certain fraction or ratio of the context words are present in a language model, e.g.
- the character identifier 40 preferably comprises a set of rules 70 , each rule relating to a character or particular sequence of characters to be identified, and a comparison mechanism 60 configured to compare each rule of the set of rules 70 to the electronic character sequence to determine whether a rule is applicable. If the rule is applicable, then a character or particular sequence of characters is identified, e.g. if the rule relates to a particular punctuation mark and the rule is found to be applicable, it is because that punctuation mark is within the electronic character sequence.
- the electronic character sequence is preferably passed to the formatting module 10 sequentially, e.g. a character at a time, with the comparison mechanism 60 comparing each rule to the last character or last sequence of characters received.
- the character identifier 40 uses the rules to identify when a particular character or sequence of characters, such as a punctuation mark, occurs in the electronic character sequence. Furthermore, the rules define, by one or more actions associated with the rule, the space formatting to apply to the electronic character sequence, i.e. whether spaces should be inserted and/or deleted. Thus, once a rule has been found to be applicable to a particular character or sequence of characters, the actions associated with that rule are applied to the electronic character sequence to format the spaces within the electronic character sequence, e.g. in the case of the particular character being a punctuation mark, the actions may define the formatting of the spaces either side of the punctuation mark, as will be described in more detail below.
- the set of rules 70 preferably comprises a plurality of sets of rules, a set of rules for each language supported by the formatting module 10 .
- the comparison mechanism 60 is configured to compare the set of rules relating to the language identified by the language identifier 20 as the most likely supported language.
- the comparison mechanism 60 comprises a single set of rules 70 corresponding to that language, and the comparison mechanism 60 is configured to compare the set of rules 70 to the electronic character sequence if the language of the character sequence is identified as being the supported language. If the language of the character sequence is not identified as a supported language, the comparison mechanism 60 does not search for applicable rules.
- the formatting module 10 is configured such that a system designer is able to manually add new rules, with associated actions, to the formatting module.
- the rules and associated actions can be updated without affecting the other components of the formatting module.
- a rule is preferably defined by a four-tuple, as follows: Rule :: (C, s, A, S)
- C is a condition taking the form of a regular expression, implementing a function of type F :: [character] → {true, false}, e.g. taking the incoming character sequence and returning a boolean denoting whether or not a rule is applicable and thus whether or not to apply the sequence of actions associated with that rule.
- the comparison mechanism 60 identifies a particular character or sequence of characters in an electronic character sequence by implementing the function of the type F :: [character] → {true, false}. This field is therefore essential and is never empty.
- s represents a state that allows the system to “remember” previous rule applications in some cases.
- the state may be “None” when the system is not required to maintain a status, or the state may be “Open” or “Close” where punctuators appear in pairs and one cannot exist without the other, e.g. left parenthesis ‘(’ and right parenthesis ‘)’.
- A is a sequence of Actions, i.e. A :: [Action]. In special cases this could be an empty sequence represented by [ ].
- Actions are the means by which the formatting module 10 describes the space formatting that should be applied to, for example, a punctuation mark given a particular character sequence context (e.g. where the punctuation mark is found in the context of a mathematical equation).
- When a punctuation mark of the electronic character sequence is determined by the comparison mechanism 60 to match one of the rules, each action held by the rule is applied, preferably sequentially, to the punctuation mark to ensure the correct formatting of the spaces either side of the punctuation mark.
- the Action might be to delete the space before the full stop (if such a space is present) and to insert a space after the full stop (if such a space is missing), where the most likely language is English.
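The full-stop example can be sketched in Python as follows. This is a minimal sketch that assumes the punctuation mark is the last character received and that the text around it is tracked as a (before, mark, after) triple; the function names `delete_space_before`, `insert_space_after` and `apply_actions` are hypothetical.

```python
def delete_space_before(before: str, mark: str, after: str):
    """Remove any spaces immediately before the punctuation mark."""
    return before.rstrip(" "), mark, after


def insert_space_after(before: str, mark: str, after: str):
    """Add one space after the punctuation mark if none is present."""
    if not after.startswith(" "):
        after = " " + after
    return before, mark, after


def apply_actions(sequence: str, actions) -> str:
    """Apply each action, in order, to the mark at the end of `sequence`."""
    if not sequence:
        return sequence
    before, mark, after = sequence[:-1], sequence[-1], ""
    for action in actions:
        before, mark, after = action(before, mark, after)
    return before + mark + after


# Sequence of actions for a full stop when the most likely language is English.
ENGLISH_FULL_STOP = [delete_space_before, insert_space_after]

print(repr(apply_actions("Hello world .", ENGLISH_FULL_STOP)))  # 'Hello world. '
```

Both actions are conditional on the current spacing, so applying them to text that is already correctly formatted leaves it unchanged apart from ensuring the trailing space.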
- The formatting module may comprise two types of action: type A and type B.
- An action of type A is a function that operates on a sequence of characters and returns a formatted sequence of characters, without changing the sequence of characters, other than by formatting them:
- Action A :: [character] → [character]
- An action of type B is a function that given a sequence of characters returns a code that represents the state of the system, without changing the sequence of characters:
- Action B :: [character] → new state
- The new state is any of the possible states that the system might be in, e.g. the shift state to define whether the next character should be capitalised or not, e.g. “Word.” → “shift state of system”.
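The two kinds of action can be sketched in Python as follows. This is an illustration only: the function names and the state codes "SHIFT" and "NO_SHIFT" are assumptions, and the rule that a sentence-ending mark triggers the shift state is a simplification chosen for the example.

```python
def delete_space_before_last(chars: str) -> str:
    """Type A: returns a reformatted sequence; only the spaces change."""
    if len(chars) < 2:
        return chars
    return chars[:-1].rstrip(" ") + chars[-1]


def shift_state(chars: str) -> str:
    """Type B: returns a state code and leaves the sequence untouched."""
    stripped = chars.rstrip(" ")
    return "SHIFT" if stripped.endswith((".", "!", "?")) else "NO_SHIFT"
```

A type A action returns a new character sequence, while a type B action only reports state, so the two can be mixed freely within one sequence of actions.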
- S is a recursive sequence of rules, known as “secondary rules”, i.e. S :: [Rule].
- If a Rule does not describe any secondary rules, S will be represented by ∅.
- the secondary rules will be checked before the actions of the parent rules are applied, allowing an alternative behaviour for condition C depending on factors described by the secondary rules.
- the input for the secondary rules is the same electronic character sequence as for the parent rules; however, the focus of the condition C for the secondary rule is the character in the sequence that precedes the character that triggered the parent rule.
- the comparison mechanism compares each parent rule to the last character received. If a parent rule is found to be applicable, and that parent rule comprises at least one secondary rule, the comparison mechanism compares the at least one secondary rule to the penultimate character in the sequence (since the condition C for the parent rule is focused on the final character, whereas for the secondary rule the focus is on the penultimate character).
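The four-tuple and the parent/secondary lookup can be sketched in Python as follows. This is a minimal sketch, not the module's implementation: the `Rule` dataclass, the use of action names as plain strings, the function `select_actions` and the decimal-point secondary rule are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Rule:
    condition: str                                  # C: a regular expression
    state: str = "None"                             # s: "None", "Open" or "Close"
    actions: list = field(default_factory=list)     # A: sequence of action names
    secondary: list = field(default_factory=list)   # S: secondary rules


def select_actions(rule: Rule, sequence: str):
    """Return the actions to apply for the last character, or None if no match."""
    if not sequence or not re.fullmatch(rule.condition, sequence[-1]):
        return None
    # Secondary rules are checked before the parent's actions are applied,
    # and they focus on the penultimate character of the sequence.
    for sub in rule.secondary:
        if len(sequence) >= 2 and re.fullmatch(sub.condition, sequence[-2]):
            return sub.actions
    return rule.actions


# Hypothetical rule: format a full stop, unless it follows a digit
# (a likely decimal point), in which case no spaces are changed.
FULL_STOP = Rule(
    condition=r"\.",
    actions=["DeleteSpaceBefore", "InsertSpaceAfter"],
    secondary=[Rule(condition=r"\d", actions=[])],
)
```

Returning the secondary rule's actions in place of the parent's gives the alternative behaviour for the same condition C that the secondary rules are meant to provide.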
- the sequence of actions associated with a rule can be selected by a designer from a predetermined set of candidate actions.
- the sequence of actions may contain any number of the candidate actions in any order and with any number of repetitions.
- the formatting module 10 allows a system designer of the formatting module 10 , to manually extend and adapt the associated actions to the requirements of the languages or the text entry system.
- the formatting module comprises three specialisations of the Rule described above: Context Rules, Category Rules, and Character Rules.
- the specialised rules provide a powerful tool to capture the way punctuation is used in natural language.
- a context rule is a rule of the form: Context Rule :: (C, None, A, ∅).
- the regular expression present in C is applied only to the context, e.g. the regular expression corresponds to a particular character sequence in the context of the electronic character sequence, for example “www”. Since the state is “None”, a Context Rule will never have or maintain state.
- the Context Rules have no “secondary rules”.
- An example of a context rule is a rule for URLs which states that when “www” is in the context, no spaces should be inserted automatically on either side of the punctuator “.”, e.g. “www.site.com”
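The URL rule can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the context is taken to be the current space-delimited token, the default full-stop behaviour (delete the space before, insert a space after) is a hypothetical fallback, and the function name `format_full_stop` is invented for the example.

```python
import re


def format_full_stop(sequence: str) -> str:
    """Format the spaces around a '.' just typed at the end of `sequence`."""
    if not sequence.endswith("."):
        return sequence
    current_token = sequence.split(" ")[-1]
    if re.search(r"www", current_token):
        # Context rule: "www" is in the context, so the action list is
        # empty and no spaces are inserted or deleted.
        return sequence
    # Hypothetical default rule for a full stop.
    return sequence[:-1].rstrip(" ") + ". "


print(repr(format_full_stop("www.")))     # 'www.'
print(repr(format_full_stop("Hello .")))  # 'Hello. '
```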
- a Category Rule preferably takes the form: Category Rule :: (C, None, A, S)
- This rule will match the Unicode category of the character in the electronic character sequence to the Unicode category defined by the rule, e.g. the Unicode category of a punctuation mark.
- C is a regular expression that is limited to matching the Unicode category of the punctuation mark. Therefore, this type of Rule is only applied to a single character.
- S is a sequence of secondary rules, e.g. a context rule, a character rule or a category rule. Alternatively this field can be empty, ∅, in the case where no secondary rules are defined.
- P corresponds to a category of punctuation marks, e.g. a category that includes ‘!’ and ‘?’ because they should be formatted with the same spaces.
- characters within the Unicode standard have a range of properties associated with them.
- One of these properties is the category to which a character belongs.
- the condition C of a category rule can relate to a Unicode category.
- the General Category value for a character serves as a basic classification of that character, based on its primary usage.
- the property extends the widely used subdivision of ASCII characters into letters, digits, punctuation, and symbols—a useful classification that needs to be elaborated and further subdivided to remain appropriate for the larger and more comprehensive scope of the Unicode standard.
- Each Unicode code point is assigned a normative General Category value.
- Each value of the General Category is given a two-letter property value alias, where the first letter gives information about a major class and the second letter designates a subclass of that major class.
- the subclass “other” merely collects the remaining characters of the major class.
- the subclass “No” (“Number, other”) includes all characters of the Number class that are not a decimal digit or letter. These characters may have little in common besides their membership in the same major class.
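The two-letter General Category aliases can be inspected with Python's standard `unicodedata` module, as in the short sketch below. The helper names `general_category` and `major_class` are hypothetical; `unicodedata.category` is the standard-library call.

```python
import unicodedata


def general_category(ch: str) -> str:
    """Two-letter General Category alias, e.g. 'Po' for '!'."""
    return unicodedata.category(ch)


def major_class(ch: str) -> str:
    """First letter of the alias: the major class, e.g. 'P' for punctuation."""
    return unicodedata.category(ch)[0]


for ch in "!(5½+a":
    print(repr(ch), general_category(ch), major_class(ch))
```

A category rule that targets all punctuation only needs the major class, whereas a rule for opening brackets alone would test for the full alias “Ps”.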
- a character rule preferably takes the form: Character Rule :: (C, s, A, S). This rule matches a character defined by the rule to a character in the electronic character sequence. Therefore, in this type of rule, C can only contain a single character.
- C is a regular expression consisting of a single character matched against a character of the electronic character sequence.
- C may define the Unicode code point of the particular character of interest, this code point being matched to the code point in the electronic character sequence.
- s preferably defines two new states, in addition to the None state: {Open, Close, None}. s therefore dictates actions for ambiguous pairs that might be in an Open or Close state, and also includes no state for non-ambiguous characters.
- the system will define two rules, e.g. for the English language, to format the following sentence correctly:
- rule1 = Character Rule :: (‘“’, Open, [InsertSpaceBefore, DeleteSpaceAfter], ∅)
- rule2 = Character Rule :: (‘”’, Close, [DeleteSpaceBefore, InsertSpaceAfter], ∅)
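The Open and Close states can be sketched in Python as follows for the ambiguous case in which a single quote character is used for both roles. This is an illustration under stated assumptions: the state simply alternates on each occurrence of the quote character, and the names `ACTIONS_BY_STATE` and `quote_states` are hypothetical.

```python
# Actions associated with each state, as in the two rules above.
ACTIONS_BY_STATE = {
    "Open": ["InsertSpaceBefore", "DeleteSpaceAfter"],
    "Close": ["DeleteSpaceBefore", "InsertSpaceAfter"],
}


def quote_states(text: str, quote: str = '"') -> list[tuple[int, str]]:
    """Return (index, state) for each quote, alternating Open and Close."""
    states = []
    is_open = False
    for index, ch in enumerate(text):
        if ch == quote:
            is_open = not is_open  # remember the previous rule application
            states.append((index, "Open" if is_open else "Close"))
    return states


print(quote_states('He said "hi" now'))  # [(8, 'Open'), (11, 'Close')]
```

The remembered state is what lets one ambiguous character select between two different sequences of actions.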
- S is a sequence of secondary rules, which may include any of the three types of rules: Context, Category or Character. It can also define no further rules, in which case this field is denoted by ∅.
- rule1 = Character Rule :: (‘!’, None, [InsertSpaceBefore, InsertSpaceAfter], [rule2])
- rule2 = Category Rule :: (‘P’, None, [DeleteSpaceBefore, InsertSpaceAfter], ∅)
- P relates to the Unicode category Punctuation, matched against the character that occurs prior to the current character of interest, e.g. if the formatting module 10 receives “!” in the sequence “?!”, P is the category that encompasses “?”.
- When the user types the first ‘!’, rule1 will be triggered. Within the trigger routine of rule1 all the secondary rules will be checked but no matches will happen, since there is no punctuation mark preceding the ‘!’, so the default actions for rule1 will be applied by the formatting module 10. On subsequent insertions of the exclamation mark, the step for matching secondary rules will trigger rule2, as ‘!’ is within the Category Punctuation of the Unicode standard, and the actions defined in rule2 will be applied.
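The repeated-exclamation behaviour can be sketched in Python as follows. This is a minimal sketch with two labelled assumptions: the secondary rule looks at the nearest preceding non-space character (so that a space inserted automatically after the first mark does not hide it), and "insert a space" is taken to mean "ensure exactly one space". The function name `format_exclamation` is hypothetical.

```python
import unicodedata


def format_exclamation(buffer: str) -> str:
    """Format the spaces around a '!' just typed at the end of `buffer`."""
    if not buffer.endswith("!"):
        return buffer
    before = buffer[:-1].rstrip(" ")
    previous = before[-1:]  # preceding non-space character, "" if there is none
    if previous and unicodedata.category(previous).startswith("P"):
        # rule2 (secondary category rule): DeleteSpaceBefore, InsertSpaceAfter
        return before + "! "
    # rule1 (parent character rule): InsertSpaceBefore, InsertSpaceAfter
    return before + " ! "


print(repr(format_exclamation("Hello!")))     # 'Hello ! '
print(repr(format_exclamation("Hello ! !")))  # 'Hello !! '
```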
- the formatting module 10 can succinctly specify the formatting patterns of the spaces for a given language.
- For example, suppose a formatting module comprises two rules: a first category rule that states that all the characters in a Maths category will have spaces on either side of the Maths character, and a second character rule that defines that the character “−” (minus) will not have any spaces either side of it, because it is most likely to be used as a hyphen. If the user were to insert ‘−’, and the category rule was prioritised over the character rule, the character rule would never be triggered. Thus, to format the sequence correctly, the character rule should be prioritised over the category rule.
- the rule that is prioritised is applied, and the comparison mechanism 60 stops the search for applicable rules.
- a different rule type may be applied as part of a secondary set of rules.
- the comparison mechanism is configured to compare, and the formatting module is configured to apply, the rules for an individual language in accordance with the following prioritisation structure:
- the comparison mechanism 60 is preferably configured to identify the type of rule.
- the comparison mechanism 60 can be configured to identify the rule type by any suitable means.
- each rule can be labelled with its rule type, where the comparison mechanism 60 is configured to identify all of the rules of a first rule type before comparing those rules to the electronic character sequence to see if one of them is applicable.
- the rules of a given type can be placed in a container, so that the comparison mechanism 60 compares all rules in a given container, before moving on to the next container.
- the comparison mechanism 60 may comprise code to identify the different rule types.
- the rules themselves could be ordered according to the prioritisation structure, e.g. listed in accordance with the prioritisation structure.
- the comparison mechanism 60 finds that a rule is applicable, it does not continue through the prioritisation structure, e.g. if the context rule is found to be applicable to www.site.com, then the character rule is not compared or applied, since the comparison mechanism 60 has stopped searching for applicable rules. Otherwise this character rule, if applied after the context rule, would result in the incorrect formatting “www. Site. Com” as described above.
- the rule that is applied may comprise secondary rules of the other rule types, e.g. the formatting module dealing with repeated punctuation where the triggered character rule comprises a secondary category rule:
- rule1 Character Rule :: (‘!’, None, [InsertSpaceBefore, InsertSpaceAfter], [rule2])
- rule2 ⁇ Category Rule :: (‘P *’, None, [DeleteSpaceBefore, InsertSpaceAfter], ∅)
- Rules may be applicable to a particular language, e.g. English, and the family of languages to which that language belongs, e.g. Latin, or to all languages in the world. There are multiple conventions for punctuation that are common to a number of languages. For example, in all languages URLs are written the same way and therefore they all must have the necessary rules for the correct formatting of these elements.
- the language identifier 20 is configured to pass the identified language to the comparison mechanism 60
- the comparison mechanism 60 is configured to compare the rules from the set of rules 70 that are relevant given the particular language so identified.
- the set of rules is preferably ordered into a hierarchal structure, in order to avoid repeating the same rules.
- comparison mechanism 60 is further configured to compare the rules in a particular order of increasing generality:
- the comparison mechanism 60 is preferably configured to identify the language generalisation rule i.e. whether the rule is a language specific rule, a language family rule or a worldwide rule.
- the comparison mechanism 60 may be coded to recognise the language generalisation rule or each rule may be labelled to identify the type of language generalisation rule, and containers may be used, as explained above when discussing the rule type prioritisation structure. As stated above, an alternative could be to order the rules into the generalisation structure.
- the comparison mechanism 60 is preferably configured to identify the rule type and the language generalisation rule, e.g. context rule applicable to French (language specific rule).
- the comparison mechanism 60 compares the rules in accordance with the priority system described above, until a rule is found to be applicable: first all the “context rules” will be compared in order of increasing generality of language, e.g. the context rules are checked first for language specific rules, then for family rules, and then for worldwide rules; the comparison mechanism 60 then proceeds to compare the next type of rule, character rules, through increasing generality in language, and then compares the category rules in the same way, until a rule is found to be applicable, at which point the comparison mechanism 60 stops the search for an applicable rule. Alternatively, the comparison mechanism 60 compares all of the rules to find that no rule is applicable and all rules are exhausted.
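- By way of a non-limiting illustration, the search order described above (rule type first, then increasing generality of language, stopping at the first applicable rule) might be sketched in Python as follows. The Rule class, the type and scope labels, the three example rules and the crude “www.” test are assumptions made for this sketch only and are not the rules of any particular embodiment.

```python
import unicodedata
from dataclasses import dataclass
from typing import Callable, List, Optional

TYPE_ORDER = ["context", "character", "category"]    # rule type priority
SCOPE_ORDER = ["language", "family", "worldwide"]    # increasing generality


@dataclass
class Rule:
    name: str
    rule_type: str                        # one of TYPE_ORDER
    scope: str                            # one of SCOPE_ORDER
    matches: Callable[[str, int], bool]   # (sequence, index of current character)


def find_applicable(rules: List[Rule], text: str, i: int) -> Optional[Rule]:
    """Return the first applicable rule in priority order, or None if all are exhausted."""
    for rule_type in TYPE_ORDER:
        for scope in SCOPE_ORDER:
            for rule in rules:
                if (rule.rule_type == rule_type and rule.scope == scope
                        and rule.matches(text, i)):
                    return rule           # stop searching once a rule applies
    return None


def in_www_token(text: str, i: int) -> bool:
    """Crude stand-in for a URL context rule: a '.' inside a token starting with 'www.'."""
    start = text.rfind(" ", 0, i) + 1
    end = text.find(" ", i)
    token = text[start:] if end == -1 else text[start:end]
    return text[i] == "." and token.startswith("www.")


# Deliberately listed out of priority order: the search order comes from
# TYPE_ORDER and SCOPE_ORDER, not from the position of a rule in this list.
RULES = [
    Rule("any_punctuation", "category", "worldwide",
         lambda t, i: unicodedata.category(t[i]).startswith("P")),
    Rule("full_stop", "character", "language", lambda t, i: t[i] == "."),
    Rule("url", "context", "worldwide", in_www_token),
]


def applicable_name(text: str, i: int) -> Optional[str]:
    rule = find_applicable(RULES, text, i)
    return rule.name if rule else None
```

- In this sketch the first full stop of “see www.site.com now” is claimed by the context rule, so the character rule for ‘.’ is never compared; the full stop of “Hello.” falls through to the character rule, and the semicolon of “Hello;” falls through to the category rule.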
- the comparison mechanism 60 is configured to compare each of the rules to each character in the electronic character sequence in turn.
- the comparison mechanism 60 discovers that a rule is applicable to a character of the character sequence, the formatting module 10 applies this rule to the electronic character sequence to format the spaces of the electronic character sequence, and the comparison mechanism moves on to comparing the rules to the next character in the character sequence.
- the comparison mechanism 60 moves on to comparing the rules to the next character in the electronic character sequence.
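- By way of a non-limiting illustration, the character-by-character pass described above might be sketched in Python as follows. The dictionary lookup stands in for the comparison mechanism 60, and the two rule tables, the collapsing of runs of spaces and the trimming of the final trailing space are assumptions made for this sketch only.

```python
from typing import Dict, Tuple

# Hypothetical rule tables: character -> (space before, space after).
ENGLISH: Dict[str, Tuple[bool, bool]] = {
    ",": (False, True), "!": (False, True), "?": (False, True),
}
FRENCH: Dict[str, Tuple[bool, bool]] = {
    ",": (False, True), "!": (True, True), "?": (True, True),
}


def format_sequence(text: str, rules: Dict[str, Tuple[bool, bool]]) -> str:
    """Visit each character in turn, applying the matching rule if there is one."""
    out = ""
    for ch in text:
        if ch == " ":
            if not out.endswith(" "):
                out += " "            # assumption: collapse runs of spaces
            continue
        rule = rules.get(ch)          # stand-in for the comparison mechanism
        if rule is None:
            out += ch                 # no applicable rule: move on to the next character
            continue
        space_before, space_after = rule
        out = (out.rstrip(" ") + (" " if space_before else "")
               + ch + (" " if space_after else ""))
    return out.rstrip(" ")            # assumption: drop the final trailing space


english = format_sequence("Hello , world !How are you ?", ENGLISH)
french = format_sequence("Bonjour mon ami!Comment vas-tu?", FRENCH)
```

- With these assumed tables, english is “Hello, world! How are you?” and french is “Bonjour mon ami ! Comment vas-tu ?”.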
- the language identifier 20 is configured to identify whether the language in which the electronic character sequence is being written is supported and, preferably, which is the most likely supported language.
- the language identifier 20 may be configured to identify the current language periodically, e.g. for every term (where the electronic character sequence is converted into a sequence of terms or words by a tokeniser) or for example every three terms, in order to identify whether the language has been switched by the user and thus to change the set of rules that are being compared by the comparison mechanism 60 to the electronic character sequence. Any other frequency of checking may be used. If the language identifier 20 determines that the language of the character sequence is not supported by the formatting module 10 , the comparison mechanism 60 stops searching for an applicable rule.
- a formatting module 10 or system 100 comprising a formatting module 10 in accordance with the present invention provides language detection and rule mechanisms that provide automatic dynamic punctuation. Unlike existing systems which neglect the possibility of having different behaviours for the same punctuation mark depending on the context in which the punctuation mark occurs, the formatting module 10 of the present invention is able to format the spaces either side of a punctuation mark on the basis of the context of the punctuation mark.
- the formatting module 10 of the present invention is therefore able to increase the productivity of the user by reducing the interaction required to produce correctly formatted punctuation appropriate to the target language.
- the formatting module 10 is preferably able to automatically adjust the space formatting to the language currently being entered. This allows the user to focus on the message being delivered rather than formatting conventions specific to various target languages.
- the formatting module 10 of the present invention provides a separate layer that defines the behaviour of the formatting of the spaces for the punctuation, i.e. the rules and their associated actions. This allows independent manual updates of the rules and their associated actions for a particular language, to change the space formatting for that language, without affecting the space formatting for the other languages or requiring an upgrade of the entire formatting module 10 .
- the present invention also provides a corresponding method for formatting spaces in an electronic character sequence that has preferably been entered by a user.
- the method comprises identifying whether the electronic character sequence is written in a language supported by the formatting module; identifying, with the character identifier 40 (see FIG. 2 ), a particular character or a particular sequence of characters in the electronic character sequence; and formatting, with the formatting module 10 if a supported language is identified, spaces in the electronic character sequence on the basis of the language identified and the particular character or sequence of characters identified.
- the formatting module preferably supports a plurality of languages, and the most likely supported language can be identified by a language identifier 20 of the formatting module 10 or a language identifier of the prediction engine 30 of the system 100 .
- the formatting module comprises a language identifier to identify whether the electronic character sequence is written in a language supported by the formatting module and to identify the most likely language of the electronic character sequence.
- the method, by analogy to the formatting module, will also comprise selecting, with a comparison mechanism 60, the set of rules that correspond to the most likely language identified, etc.
- the present invention also provides a computer program product comprising a computer readable medium having stored thereon computer program means for causing a processor to carry out the method according to the present invention.
- the computer program product may be a data carrier having stored thereon computer program means for causing a processor external to the data carrier, i.e. a processor of an electronic device, to carry out the method according to the present invention.
- the computer program product may also be available for download, for example from a data carrier or from a supplier over the internet or other available network, e.g. downloaded as an app onto a mobile device (such as a mobile phone) or downloaded onto a computer, the mobile device or computer comprising a processor for executing the computer program means once downloaded.
Abstract
Description
- The present invention relates to the formatting of spaces in an electronic character sequence. In particular, it relates to a formatting module, system and method for formatting spaces in an electronic character sequence.
- Punctuation marks are symbols that indicate the structure and organization of written language, as well as intonation and pauses to be observed when reading aloud. The appearance and usage of punctuation marks varies between languages and scripts but in most cases they are vital to disambiguate the meaning of sentences. The use and interpretation of punctuation marks can be heavily context-dependent. For example, a full stop “.” can be used as sentence-ending punctuation, an abbreviation indicator, a decimal point, and so on. Punctuation is also present in mathematical and scientific formulae.
- Some punctuators appear in pairs and one cannot exist without the other. For example, left parenthesis ‘(’ and right parenthesis ‘)’. However, in some scenarios a single character is used to represent two punctuators, creating ambiguity, for example in the case of the single quote mark: ‘.
- A space is a blank area, often used to separate words, letters, numbers, and punctuation. Conventions for the formatting of spaces vary among languages. For example, the correct formatting of spaces around a question mark “?” in English is “word?”, with no space between the word and the question mark, and a space following the question mark. However, in French the convention is “word ?”, where a space is inserted either side of the question mark.
- A number of current-market text input systems exhibit some form of space formatting. For example, when a user enters one of the following characters [ ? ! : ; , . ] after entering a space, the Android default keyboard formats spaces either side of the punctuation mark by removing the leading space and adding a trailing space, irrespective of the language in which the text is being entered.
- It is an object of the present invention to provide a means for formatting automatically the spaces in an electronic character sequence, such that a user can concentrate on the content of a message without worrying about whether the spaces are correctly formatted in the electronic character sequence. It is also an object of the invention to provide a means for correctly formatting spaces in an electronic character sequence on the basis of the conventions of the language in which the electronic character sequence is written.
- In a first aspect of the present invention, there is provided a formatting module supporting at least one language and configured to format spaces in an electronic character sequence written in a supported language, the formatting module comprising:
- a language identifier configured to identify whether the electronic character sequence is written in a supported language;
- a character identifier configured to identify a particular character or a particular sequence of characters in the electronic character sequence;
- wherein the formatting module is configured to format spaces in the electronic character sequence on the basis of the language identified and the particular character or sequence of characters identified, when a supported language is identified.
- Preferably, formatting spaces in the electronic character sequence comprises inserting and/or deleting spaces in the electronic character sequence.
- In a preferred embodiment, the character identifier comprises:
- at least one set of rules, each rule relating to a particular character or sequence of characters to be identified in the electronic character sequence; and
- a comparison mechanism configured to compare each rule of one of the at least one set of rules to the electronic character sequence to identify whether a rule is applicable;
- wherein each rule is associated with one or more actions which describe the format of spaces to be applied by the formatting module to the electronic character sequence given a supported language and the particular character or sequence of characters; and
- wherein the formatting module is configured to format spaces in the electronic character sequence by applying the one or more actions associated with the applicable rule to the electronic character sequence.
- The comparison mechanism is preferably configured to compare each rule of one of the at least one set of rules to the electronic character sequence only when a supported language is identified.
- Preferably, the formatting module supports a plurality of languages and the language identifier is configured further to identify the most likely language of the supported languages that the electronic character sequence is written in.
- The character identifier may be configured to identify a punctuation mark and the formatting module may be configured to format the spaces either side of the punctuation mark on the basis of the punctuation mark.
- The character identifier may be configured to identify a particular context in the electronic character sequence and the formatting module may be configured to format the spaces in the electronic character sequence on the basis of the context.
- The character identifier may be configured to identify a punctuation mark in the electronic character sequence, and the formatting module may be configured to format the spaces either side of the punctuation mark on the basis of the category of punctuation mark.
- The one or more actions may comprise a sequence of actions, wherein when a rule is found to be applicable, the comparison mechanism is configured to apply the sequence of actions to the electronic character sequence.
- When the formatting module is configured to support a plurality of languages, the character identifier preferably comprises a plurality of sets of rules, one set of rules for each language that is supported, where the comparison mechanism is configured to compare each rule of the set of rules that corresponds to the most likely language to the electronic character sequence.
- The formatting module may comprise sets of rules relating to each language, each family of languages, and all languages in the world, wherein the rules are applied in a hierarchal structure such that, once a supported language has been identified, the comparison mechanism first compares each rule from the set of rules specific to that language, followed by each rule from the set of rules applicable to the family of languages to which that language belongs, followed by each rule of the set of rules which are applicable to all languages until an applicable rule is identified or no applicable rule is identified and all rules are exhausted.
- The comparison mechanism is preferably configured to compare the rules in a specific predetermined order. The set of rules preferably comprises context rules, character rules and category rules and the comparison mechanism is preferably configured to compare the rules in the following order until an applicable rule is identified or no applicable rule is identified and all rules are exhausted: context rules, character rules, and then category rules.
- In a second aspect of the invention there is provided a formatting module supporting at least one language and configured to format spaces in an electronic character sequence, the formatting module comprising:
- a punctuation mark identifier configured to identify a punctuation mark in the electronic character sequence;
- wherein the formatting module is configured to format spaces in the electronic character sequence on the basis of the language in which the electronic character sequence is written, the punctuation mark identified, and a context of the punctuation mark, when a supported language is identified.
- In a third aspect of the invention there is provided a system for inputting text into an electronic device comprising:
- a text prediction engine configured to receive an electronic character sequence as input and configured to generate and output a corrected electronic character sequence; and
- a formatting module as described above, wherein the formatting module is configured to receive the modified electronic character sequence as input, and to generate a formatted character sequence by formatting spaces in the modified electronic character sequence when a supported language is identified.
- In a fourth aspect of the invention there is provided a system for inputting text into an electronic device comprising:
- a text prediction engine configured to receive an electronic character sequence as input, the text prediction engine comprising:
- a language identifier configured to identify which language the electronic character sequence is most likely written in, and to correct the electronic character sequence on the basis of the identified language;
- wherein the text prediction engine is configured to generate and output a corrected electronic character sequence and to output the language identified;
- a formatting module supporting at least one language and configured to receive the language identified and the corrected electronic character sequence, and configured to format spaces in the electronic character sequence when the identified language is supported, the formatting module comprising:
- a character identifier configured to identify a particular character or a particular sequence of characters in the electronic character sequence;
- wherein, the formatting module is configured to format spaces in the electronic character sequence on the basis of the language identified and the particular character or the particular sequence of characters identified.
- In a fifth aspect of the invention there is provided a method of formatting, with a formatting module supporting at least one language and having a character identifier, spaces in an electronic character sequence, the method comprising:
- identifying whether the electronic character sequence is written in a language supported by the formatting module;
- identifying, with the character identifier, a particular character or a particular sequence of characters in the electronic character sequence;
- formatting, with the formatting module, spaces in the electronic character sequence on the basis of the language identified and the particular character or sequence of characters identified, when a supported language is identified.
- The formatting module may comprise a language identifier to identify whether the electronic character sequence is written in a language supported by the formatting module. Preferably, the formatting module supports a plurality of languages and the method further comprises identifying with the language identifier the most likely language of the electronic character sequence.
- The most likely language of the electronic character sequence may be identified by a text prediction engine, where the method further comprises transmitting the most likely language to the formatting module which identifies whether the most likely language is supported by the formatting module.
- The character identifier preferably comprises at least one set of rules and a comparison mechanism, each rule defining the formatting of spaces in the electronic character sequence, wherein the method further comprises:
- comparing, with the comparison mechanism, each rule of one of the at least one set of rules to the electronic character sequence to identify whether a rule is applicable to the character sequence;
- identifying, with the comparison mechanism, that a particular rule is applicable to the character sequence; and
- applying the applicable rule to the electronic character sequence to format the spaces in the electronic character sequence.
- Preferably, the comparison mechanism compares each rule of one of the at least one set of rules to the electronic character sequence only when a supported language is identified.
- Each rule may relate to a particular character or sequence of characters to be identified and each rule is associated with one or more actions which describe the format of spaces to be applied by the formatting module to the electronic character sequence given a supported language and the particular character or sequence of characters. In the method, the step of applying the applicable rule preferably comprises applying the one or more actions associated with that applicable rule to the electronic character sequence.
- Identifying a particular character may comprise identifying a punctuation mark and formatting the spaces in the electronic character sequence may comprise formatting the spaces either side of the punctuation mark on the basis of the form of the punctuation mark.
- Identifying a particular sequence of characters may comprise identifying a particular context and formatting the spaces in the electronic character sequence may comprise formatting the spaces on the basis of the context.
- Identifying a particular character may comprise identifying a punctuation mark and formatting the spaces in the electronic character sequence may comprise formatting the spaces either side of the punctuation mark on the basis of the category of punctuation mark.
- Where each rule is associated with one or more actions, the one or more actions may comprise a sequence of actions, wherein the sequence of actions is applied sequentially to the electronic character sequence.
- Where the formatting module supports a plurality of languages, the character identifier may comprise a plurality of sets of rules, one set of rules for each language supported, and comparing each rule to the electronic character sequence comprises comparing each rule of the set of rules that corresponds to the most likely language.
- The formatting module may comprise sets of rules relating to each supported language, each family of languages, and all languages in the world, and the method comprises applying the rules in a hierarchal structure such that, once a language has been identified, the comparison mechanism first compares each rule from the set of rules specific to that language, followed by each rule from the set of rules applicable to the family of languages to which that language belongs, followed by each rule of the set of rules which are applicable to all languages until an applicable rule is identified or no applicable rule is identified and all rules are exhausted.
- The comparison mechanism preferably compares the rules in a specific predetermined order.
- The set of rules may comprise context rules, character rules and category rules, and the method preferably comprises comparing the rules in the following order until an applicable rule is identified or no applicable rule is identified and all rules are exhausted: context rules, character rules, and then category rules.
- In a sixth aspect of the invention there is provided a computer program product comprising a computer readable medium having stored thereon computer program means for causing a processor to carry out a method as described above.
- The present invention will now be described in detail with reference to the accompanying drawings, in which:
- FIG. 1 is a schematic of a system comprising a prediction engine and a formatting module in accordance with the present invention;
- FIG. 2 is a schematic of a formatting module in accordance with the present invention;
- FIG. 3 is a schematic of the formatting module of FIG. 2 shown in greater detail;
- FIG. 4 is an illustration of a structure of specific types of rules within a set of rules for a given language, and shows the order in which a comparison mechanism compares the rules, in accordance with the present invention;
- FIG. 5 is an illustration of how the rules are structured for the English language and the order in which the comparison mechanism compares the rules, in accordance with the present invention.
- The present invention provides a formatting module that is configured to format the spaces for a particular sentence on the basis of the conventions for the language in which the sentence is written. The formatting module formats the spaces by inserting and/or deleting spaces in the electronic character sequence.
- Preferably, but not necessarily, the formatting module 10 is part of a system, such as an electronic device 100, comprising a text prediction engine 30, as shown in FIG. 1. The electronic device is preferably a mobile device, such as a PDA, tablet, laptop computer or mobile phone. The formatting module may be used to format the spaces in an electronic character sequence entered by a user for a text message. The user interacts with a text entry system 50 of the electronic device 100 by entering text via an input mechanism such as a virtual keyboard. In the particular case of a predictive text entry system, the text prediction engine 30 may be configured to correct mistyped or misspelt words and may also be configured to predict what the user is going to write next, thus improving the performance and quality of the text input into the device. An example of such a text prediction engine 30 is described in PCT/GB2011/001419, which is hereby incorporated by reference in its entirety.
- As can be seen from FIG. 1, a character sequence is input into the device 100. The character sequence is passed to a text prediction engine 30 which may modify that character sequence to correct misspelt words and/or to predict words. The character sequence, so modified by the text prediction engine 30, is passed to the formatting module 10. The formatting module 10 is configured to output a space formatted version of the modified character sequence, as shown in FIGS. 1 and 2. The formatting module formats the spaces of a character sequence by inserting and/or deleting spaces in the sequence. The formatting module 10 formats the spaces for an electronic character sequence, if the language in which that character sequence is written is supported by the formatting module 10.
- A formatting module 10 in accordance with the present invention is shown in FIG. 2. The formatting module 10 is configured to support at least one language. The formatting module 10 comprises a language identifier 20 configured to identify whether an electronic character sequence is written in a language supported by the formatting module 10. The language identifier 20 makes use of one or more statistical language models, the general properties of which are known in the art, in order to identify whether the electronic character sequence is written in a language supported by the formatting module 10.
- In a preferred embodiment, the formatting module 10 supports a plurality of languages. Thus, in the preferred embodiment, the language identifier 20 comprises a plurality of statistical language models, each statistical language model corresponding to a different language supported by the formatting module 10, and the language identifier 20 is configured further to identify the most likely supported language of the electronic character sequence. At any given stage, the formatting module 10 is configured to maintain a list of “active languages”, each of which is associated with a language model.
- One process for identifying the most likely current language is to maximize the probability of a language, given a context, i.e. maximizing P(language|context), according to the following expression (using Bayes' rule):
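- $$P(\text{language} \mid \text{context}) = \frac{P(\text{context} \mid \text{language})\, P(\text{language})}{P(\text{context})}$$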
- As the absolute values of P(language|context) are not important, since only the ranking of languages matters, the term P(context), which does not depend on language, may be dropped from the expression. Additionally, a uniform prior over languages, P(language)=k, may also be dropped since it is constant with respect to language. With these assumptions, the only quantity that the language identifier is required to estimate is P(context|language). Typically context is just a sequence of words, therefore to estimate P(context|language), the language identifier preferably uses a ‘chain’ of conditional probability estimates, making a ‘Markovian’ conditional independence assumption:
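- $$P(\text{context} \mid \text{language}) = P(w_1, \ldots, w_N \mid \text{language}) \approx \prod_{i=1}^{N} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}, \text{language})$$ where $w_1, \ldots, w_N$ are the words of the context and $n$ is the order of the n-gram model.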
- Each language is therefore separately modelled by a smoothed n-gram language model (known in the art and as described in WO 2012/042217), capable of estimating the probability of a word, given local context.
- There are other ways of estimating P(context|language), using different types of language models, e.g. those that include syntactic and/or semantic information. Another possibility would be to use a Hidden Markov Model (HMM) to estimate a progression of unobserved language “states”. A further possibility would be to use a supervised discriminative classification model to predict language, e.g. a support vector machine (SVM) or neural network.
- To transform the incoming sequence of characters into a sequence of terms the language identifier 20 uses a tokenizer as is known in the art.
- In a system such as that illustrated in FIG. 1, the prediction engine 30 may comprise a language identifier, rather than it being provided in the formatting module 10. As described above, the language identifier will comprise a tokeniser and a plurality of language models, which may already be present in the prediction engine, such as the prediction engine described in WO 2012/042217, which is hereby incorporated by reference in its entirety.
- To estimate the most likely language given context, the language identifier 20 is configured to calculate the likelihood of the context in each language which is supported in turn, and selects the language with the maximum likelihood. The likelihood of the context (a sequence of terms) is the product of the probability of each term, given preceding terms, which is computed by a smoothed n-gram model, as has been described in relation to a text prediction engine in WO 2012/042217.
- If the user switches languages whilst typing, the formatting of the spaces around the punctuation marks may differ between the sentences, dependent on the language in which it is written, e.g. “Bonjour mon ami ! How are you doing? Talk to you soon.”
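- By way of a non-limiting illustration, the maximum-likelihood selection of a language described above might be sketched in Python as follows. The BigramModel class (a toy add-one-smoothed bigram model standing in for the smoothed n-gram models), the two tiny training corpora, the regular-expression tokeniser and the window parameter that restricts the context to the most recent words are all assumptions made for this sketch only. A uniform prior over languages is assumed, so only the likelihood of the context is compared.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenise(text: str) -> List[str]:
    """Very simple tokeniser: lower-cased runs of word characters."""
    return re.findall(r"\w+", text.lower())


class BigramModel:
    """Toy add-one-smoothed bigram model (assumption; not the model of WO 2012/042217)."""

    def __init__(self, corpus: str) -> None:
        words = tokenise(corpus)
        self.unigrams = Counter(words)
        self.bigrams = Counter(zip(words, words[1:]))
        self.vocab_size = len(self.unigrams) + 1      # +1 for unseen words

    def log_prob(self, prev: str, word: str) -> float:
        return math.log((self.bigrams[(prev, word)] + 1)
                        / (self.unigrams[prev] + self.vocab_size))

    def log_likelihood(self, context: List[str]) -> float:
        """Sum of log P(term | preceding term) over the context."""
        padded = ["<s>"] + context
        return sum(self.log_prob(p, w) for p, w in zip(padded, padded[1:]))


# Hypothetical miniature corpora, for illustration only.
MODELS: Dict[str, BigramModel] = {
    "en": BigramModel("how are you doing talk to you soon see you soon my friend"),
    "fr": BigramModel("bonjour mon ami comment vas tu à bientôt mon ami bonjour"),
}


def most_likely_language(text: str, models: Dict[str, BigramModel],
                         window: int = 6) -> str:
    """Pick the language whose model gives the most recent words the highest likelihood."""
    context = tokenise(text)[-window:]
    return max(models, key=lambda lang: models[lang].log_likelihood(context))


first = most_likely_language("Bonjour mon ami !", MODELS)
later = most_likely_language(
    "Bonjour mon ami ! How are you doing? Talk to you soon.", MODELS)
```

- With these assumed corpora, first is “fr” and later is “en”, since only the most recent words of the mixed-language input are scored.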
- To provide a formatting module 10 that is capable of identifying a change in language, for example where a user has switched languages between sentences, the language identifier 20 is preferably configured to limit the amount of context used to make the estimate of the most likely language. This provides a basic form of recency in the model for identifying the most likely language: languages used more recently are intuitively more likely than languages used much earlier in a document. For instance, in one embodiment, the language identifier 20 may use the six most recent words of context. However, the number of most recent words of context could be chosen dependent on the frequency at which a user switches between languages and the length of their input stream in any given language.
- The language identifier 20 is preferably configured to identify whether the language in which the electronic character sequence is written is supported by the formatting module 10. By way of a non-limiting example, the language identifier 20 may identify that the electronic character sequence is written in an unsupported language if none of the context terms of the sequence are present in any of the language identifier's language models, where each language model corresponds to a supported language. Thus, if one or more of the context terms are determined to be present in one of the language models, the language identifier determines that the electronic character sequence is written in a supported language. A variation on this example is one in which the language identifier 20 is configured to identify whether a certain fraction or ratio of the context terms, e.g. a quarter, two-thirds or any other fraction or ratio, is present in one of the language models, in order to determine that the electronic character sequence is written in a supported language. Any other suitable method for determining whether the language of the electronic character sequence is supported can be used.
- As shown in FIG. 3, the character identifier 40 preferably comprises a set of rules 70, each rule relating to a character or particular sequence of characters to be identified, and a comparison mechanism 60 configured to compare each rule of the set of rules 70 to the electronic character sequence to determine whether a rule is applicable. If the rule is applicable, then a character or particular sequence of characters is identified, e.g. if the rule relates to a particular punctuation mark and the rule is found to be applicable, it is because that punctuation mark is within the electronic character sequence. The electronic character sequence is preferably passed to the formatting module 10 sequentially, e.g. a character at a time, with the comparison mechanism 60 comparing each rule to the last character or last sequence of characters received.
- Thus, the character identifier 40 uses the rules to identify when a particular character or sequence of characters, such as a punctuation mark, occurs in the electronic character sequence. Furthermore, the rules define, by one or more actions associated with the rule, the space formatting to apply to the electronic character sequence, i.e. whether spaces should be inserted and/or deleted. Thus, once a rule has been found to be applicable to a particular character or sequence of characters, the actions associated with that rule are applied to the electronic character sequence to format the spaces within the electronic character sequence, e.g. in the case of the particular character being a punctuation mark, the actions may define the formatting of the spaces either side of the punctuation mark, as will be described in more detail below.
- The set of rules 70 preferably comprises a plurality of sets of rules, a set of rules for each language supported by the formatting module 10. The comparison mechanism 60 is configured to compare, to the electronic character sequence, the set of rules relating to the language identified by the language identifier 20 as the most likely supported language. In an embodiment in which the language identifier 20 supports a single language, the comparison mechanism 60 comprises a single set of rules 70 corresponding to that language, and the comparison mechanism 60 is configured to compare the set of rules 70 to the electronic character sequence if the language of the character sequence is identified as being the supported language. If the language of the character sequence is not identified as a supported language, the comparison mechanism 60 does not search for applicable rules.
- The formatting module 10 is configured such that a system designer is able to manually add new rules, with associated actions, to the formatting module. The rules and associated actions can be updated without affecting the other components of the formatting module.
- A rule is preferably defined by a four-tuple, as follows: Rule :: (C, s, A, S)
- :: is an operator that can be read “has type of”.
- C is a condition taking the form of a regular expression, implementing a function of type F :: [character]→{true, false}, i.e. taking the incoming character sequence and returning a boolean denoting whether or not a rule is applicable and thus whether or not to apply the sequence of actions associated with that rule. The comparison mechanism 60 identifies a particular character or sequence of characters in an electronic character sequence by implementing the function of the type F :: [character]→{true, false}. This field is therefore essential and is never empty.
- s represents a state that allows the system to “remember” previous rule applications in some cases. For example, the state may be “None” when the system is not required to maintain a status, or the state may be “Open” or “Close” where punctuators appear in pairs and one cannot exist without the other, e.g. left parenthesis ‘(’ and right parenthesis ‘)’.
- A is a sequence of Actions, i.e. A :: [Action]. In special cases this could be an empty sequence represented by [ ]. Actions are the means by which the formatting module 10 describes the space formatting that should be applied to, for example, a punctuation mark given a particular character sequence context (e.g. where the punctuation mark is found in the context of a mathematical equation). When a punctuation mark of the electronic character sequence is determined by the comparison mechanism 60 to match one of the rules, each action held by the rule is applied, preferably sequentially, to the punctuation mark to ensure the correct formatting of the spaces either side of the punctuation mark. For example, if the punctuation mark is a full stop, the Action might be to delete the space before the full stop (if such a space is present) and to insert a space after the full stop (if such a space is missing), where the most likely language is English.
- There are two types of actions that the formatting module may comprise: type A and type B.
- An action of type A is a function that operates on a sequence of characters and returns a formatted sequence of characters, without changing the sequence of characters, other than by formatting them:
- Action A :: [character]→[character]
- For example, in the case of “word.word”→“word. word”
- An action of type B is a function that given a sequence of characters returns a code that represents the state of the system, without changing the sequence of characters:
- Action B :: [character]→new state
- The new state is any of the possible states that the system might be in, e.g. the shift state to define whether the next character should be capitalised or not, e.g. “Word.”→“shift state of system”.
- S is a recursive sequence of rules, known as “secondary rules”, i.e. S :: [Rule]. When the Rule does not describe any secondary rules, S will be represented by Ø. The secondary rules will be checked before the actions of the parent rules are applied, allowing an alternative behaviour for condition C depending on factors described by the secondary rules. The input for the secondary rules is the same electronic character sequence as for the parent rules; however, the focus of the condition C for the secondary rule is the character in the sequence that precedes the character that triggered the parent rule.
- For example, in the preferred embodiment where the electronic character sequence is passed to the formatting module sequentially, e.g. a character at a time, the comparison mechanism compares each parent rule to the last character received. If a parent rule is found to be applicable, and that parent rule comprises at least one secondary rule, the comparison mechanism compares the at least one secondary rule to the penultimate character in the sequence (since the condition C for the parent rule is focused on the final character, whereas for the secondary rule the focus is on the penultimate character).
- The application of secondary rules will be described in more detail below. Since secondary rules are not essential, the general form of the Rule could omit this field.
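- The four-tuple and the two action types described above can be sketched as a small data structure. This is an illustration under stated assumptions, not the patented implementation: the names `Rule`, `apply_rule` and `next_shift_state`, and the choice to pass the index of the triggering character to each type A action, are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Type A action: takes the text and the index of the triggering character,
# returns the reformatted text and the (possibly shifted) index.
Action = Callable[[str, int], Tuple[str, int]]


def delete_space_before(text: str, pos: int) -> Tuple[str, int]:
    start = pos
    while start > 0 and text[start - 1] == " ":
        start -= 1
    return text[:start] + text[pos:], start


def insert_space_after(text: str, pos: int) -> Tuple[str, int]:
    if pos + 1 >= len(text) or text[pos + 1] != " ":
        return text[: pos + 1] + " " + text[pos + 1 :], pos
    return text, pos


def next_shift_state(text: str) -> str:
    """Type B action: report a new system state without changing the text."""
    return "Shift" if text.rstrip(" ").endswith((".", "!", "?")) else "NoShift"


@dataclass
class Rule:
    condition: str                                   # C: regular expression
    state: Optional[str] = None                      # s: None, "Open" or "Close"
    actions: List[Action] = field(default_factory=list)    # A: ordered actions
    secondary: List["Rule"] = field(default_factory=list)  # S: empty list for Ø

    def applies(self, text: str) -> bool:
        # F :: [character] -> {true, false}
        return re.search(self.condition, text) is not None


def apply_rule(rule: Rule, text: str, pos: int) -> str:
    """Apply each action of the rule, in order, around the character at pos."""
    for action in rule.actions:
        text, pos = action(text, pos)
    return text


FULL_STOP = Rule(r"\.", None, [delete_space_before, insert_space_after])
```

With this sketch, `apply_rule(FULL_STOP, "word.word", 4)` reproduces the “word.word”→“word. word” example given for type A actions, and `next_shift_state("Word.")` illustrates a type B action that only reports state.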
- When designing the formatting module 10, the sequence of actions associated with a rule can be selected by a designer from a predetermined set of candidate actions. The sequence of actions may contain any number of the candidate actions in any order and with any number of repetitions. As stated above, the formatting module 10 allows its system designer to manually extend and adapt the associated actions to the requirements of the languages or the text entry system.
- In a preferred embodiment, the formatting module comprises three specialisations of the Rule described above: Context Rules, Category Rules, and Character Rules. The specialised rules provide a powerful tool to capture the way punctuation is used in natural language.
- A context rule is a rule of the form: Context Rule :: (C, None, A, Ø). The regular expression present in C is applied only to the context, e.g. the regular expression corresponds to a particular character sequence in the context of the electronic character sequence, for example “www”. Since the state is “None”, a Context Rule will never have or maintain state. The Context Rules have no “secondary rules”.
- An example of a context rule is a rule for URLs which states that when “www” is in the context, no spaces should be inserted automatically on either side of the punctuator “.”, e.g. “www.site.com”
- Thus, an example of a context rule is:
- Context Rule :: (‘www’, None, [DeleteSpaceBefore, DeleteSpaceAfter], Ø).
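- A minimal sketch of how such a context condition might be evaluated when a full stop is typed is given below. The regular expression (which restricts “www” to the token currently being typed) and the function name `format_full_stop` are assumptions made for illustration; the text only states that “www” is in the context.

```python
import re

# Hypothetical context condition: "www" occurs in the token being typed.
WWW_CONTEXT = re.compile(r"www\S*$")


def format_full_stop(text: str) -> str:
    """Format the spaces around a full stop just typed (text ends with '.')."""
    context = text[:-1]
    if WWW_CONTEXT.search(context):
        # Context rule: [DeleteSpaceBefore, DeleteSpaceAfter], so no space is added.
        return context.rstrip(" ") + "."
    # Otherwise fall back to an English-style rule:
    # [DeleteSpaceBefore, InsertSpaceAfter].
    return context.rstrip(" ") + ". "
```

Typing the two full stops of “www.site.com” therefore leaves the URL intact, whereas an ordinary sentence-ending full stop still receives a trailing space.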
- A Category Rule preferably takes the form: Category Rule :: (C, None, A, S)
- This rule will match the Unicode category of the character in the electronic character sequence to the Unicode category defined by the rule, e.g. the Unicode category of a punctuation mark.
- C is a regular expression that is limited to matching the Unicode category of the punctuation mark. Therefore, this type of Rule is only applied to a single character. S is a sequence of secondary rules, e.g. a context rule, a character rule or a category rule. Alternatively this field can be empty, Ø, in the case where no secondary rules are defined.
- An example of a category rule is:
- Category Rule :: (‘P’, None, [DeleteSpaceBefore, InsertSpaceAfter], Ø) where P corresponds to a category of punctuation marks, e.g. a category that includes ‘!’ and ‘?’ because they should be formatted with the same spaces.
- As is known in the art, characters within the Unicode standard have a range of properties associated with them. One of these properties is the category to which a character belongs. The condition C of a category rule can relate to a Unicode category. The General Category value for a character serves as a basic classification of that character, based on its primary usage. The property extends the widely used subdivision of ASCII characters into letters, digits, punctuation, and symbols—a useful classification that needs to be elaborated and further subdivided to remain appropriate for the larger and more comprehensive scope of the Unicode standard.
- Each Unicode code point is assigned a normative General Category value. Each value of the General Category is given a two-letter property value alias, where the first letter gives information about a major class and the second letter designates a subclass of that major class. In each class, the subclass “other” merely collects the remaining characters of the major class. For example, the subclass “No” (Number, other) includes all characters of the Number class that are not a decimal digit or letter. These characters may have little in common besides their membership in the same major class.
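- The two-letter General Category values described above can be inspected with Python's standard `unicodedata` module, which gives one way to implement the condition of a category rule. The helper name `matches_category` is hypothetical.

```python
import unicodedata


def matches_category(ch: str, prefix: str) -> bool:
    """True if the General Category of ch starts with prefix.

    A one-letter prefix such as "P" matches a whole major class;
    a two-letter value such as "Po" matches a single subclass.
    """
    return unicodedata.category(ch).startswith(prefix)


# First letter = major class, second letter = subclass.
EXAMPLES = {ch: unicodedata.category(ch) for ch in "!?(5+"}
```

Here ‘!’ and ‘?’ both fall in the subclass “Po” (Punctuation, other), so a single rule with condition ‘P’ covers them, while ‘+’ belongs to “Sm” (Symbol, math) and would be handled by a different category rule.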
- A character rule preferably takes the form: Character Rule :: (C, s, A, S). This rule matches a character defined by the rule to a character in the electronic character sequence. Therefore, in this type of rule, C can only contain a single character. C is a regular expression consisting of a single character matched against a character of the electronic character sequence. C may define the Unicode for the particular character of interest, this Unicode being matched to the Unicode in the electronic character sequence.
- s preferably defines two new states, in addition to the None state: {Open, Close, None}. s therefore dictates actions for ambiguous pairs that might be in an Open or Close state, and also includes no state for non-ambiguous characters. By this definition, if two different sequences of Actions are required for one punctuation mark in different states, the system will define two rules. For example, for the English language, the following sentence is to be formatted correctly:
- And he said “Goodbye” and left. It was surprising.
- The character rules that define the formatting of spaces in this sentence are as follows:
- rule1→Character Rule :: (‘“’, Open, [InsertSpaceBefore, DeleteSpaceAfter], Ø)
- rule2→Character Rule :: (‘”’, Close, [DeleteSpaceBefore, InsertSpaceAfter], Ø)
- rule3→Character Rule :: (‘.’, None, [DeleteSpaceBefore, InsertSpaceAfter], Ø)
- S is a sequence of secondary rules, which may include any of the three types of rules: Context, Category or Character. It can also define no further rules, in which case this field is denoted by Ø.
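- The Open/Close behaviour of the three character rules above can be sketched with a character-at-a-time formatter. The sketch assumes a straight double quote, which is ambiguous between opening and closing and so needs the state; the function name `format_english` and the flag names are hypothetical.

```python
def format_english(raw: str) -> str:
    """Apply rule1 (Open quote), rule2 (Close quote) and rule3 (full stop)."""
    out = ""
    quote_open = False     # state s for the quotation mark
    suppress = False       # DeleteSpaceAfter is in effect
    for ch in raw:
        if ch == " ":
            # Drop spaces after an opening quote and avoid doubled spaces.
            if suppress or out.endswith(" "):
                continue
            out += ch
            continue
        suppress = False
        if ch == '"':
            if not quote_open:
                # rule1, state Open: [InsertSpaceBefore, DeleteSpaceAfter]
                if out and not out.endswith(" "):
                    out += " "
                out += ch
                suppress = True
            else:
                # rule2, state Close: [DeleteSpaceBefore, InsertSpaceAfter]
                out = out.rstrip(" ") + ch + " "
            quote_open = not quote_open
        elif ch == ".":
            # rule3, state None: [DeleteSpaceBefore, InsertSpaceAfter]
            out = out.rstrip(" ") + ch + " "
        else:
            out += ch
    # The trailing space inserted after the last mark is trimmed for display.
    return out.rstrip(" ")
```

The same character triggers different actions depending only on the remembered state, which is why two rules are defined for it.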
- An example to explain the interaction of the rules and, in particular, how secondary rules are applied is now provided. In French, a space is placed either side of an exclamation mark “!” or a question mark “?”, e.g. “Bonjour ! Ça va ?”. However, there is an exception to this rule when another exclamation mark precedes the current one, e.g. “Bonjour !!! Ça va ?”. In order for the system to deal with this situation properly, secondary rules can be defined:
- rule1→Character Rule :: (‘!’, None, [InsertSpaceBefore, InsertSpaceAfter], [rule2])
- rule2→Category Rule :: (‘P’, None, [DeleteSpaceBefore, InsertSpaceAfter], Ø)
- In this example, P relates to the Unicode category Punctuation, matched against the character that occurs prior to the current character of interest, e.g. if the formatting module 10 receives “!” in the sequence “?!”, P is the category that encompasses “?”. In the example above, when the user types the first ‘!’, rule1 will be triggered. Within the trigger routine of rule1 all the secondary rules will be checked but no matches will happen, since there is no punctuation mark preceding the ‘!’, so the default actions for rule1 will be applied by the formatting module 10. On subsequent insertions of the exclamation mark, the step for matching secondary rules will trigger rule2, as the preceding ‘!’ is within the category Punctuation of the Unicode standard, and the actions defined in rule2 will be applied.
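- The interplay of rule1 and its secondary rule2 can be sketched as below. As an assumption of this sketch, the "preceding character" is taken to be the last non-space character, so that the space automatically inserted after the first ‘!’ does not hide it; the function names are hypothetical.

```python
import unicodedata


def is_punctuation(ch: str) -> bool:
    """Condition of rule2: Unicode major class P (punctuation)."""
    return unicodedata.category(ch).startswith("P")


def type_exclamation_fr(text: str) -> str:
    """Return the text after the user types '!' with French spacing applied."""
    head = text.rstrip(" ")
    # Secondary rules are checked before the parent rule's own actions.
    if head and is_punctuation(head[-1]):
        # rule2: [DeleteSpaceBefore, InsertSpaceAfter]
        return head + "! "
    # rule1 default actions: [InsertSpaceBefore, InsertSpaceAfter]
    return (head + " " if head else "") + "! "
```

Starting from “Bonjour”, three successive calls first apply rule1 and then rule2 twice, so the marks stay together while a single space separates them from the word.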
- For a given language it is generally required to have multiple rules defined to ensure correct formatting of spaces in an electronic character sequence. The different types of rules, i.e. context, category and character, are preferably applied using a priority scheme, such that the formatting module 10 can succinctly specify the formatting patterns of the spaces for a given language.
- A couple of examples which demonstrate why it is preferable to prioritise the application of the types of rules are provided below.
- For the specific case of URLs, assume that there are two rules: a context rule which defines that when “www” is in the context, no space should be inserted automatically; and a character rule that says that when the full stop “.” punctuation mark is introduced, a space should be inserted afterwards. In this situation, if the character rule is applied first and the user enters “www.site.com”, the result from the punctuator will be “www. site. com”, because the character rule for the full stop will have preference. To format such a URL correctly, the context rule should have preference over the character rule and should therefore be applied first.
- In another example, a formatting module comprises two rules: a category rule stating that all the characters in a Maths category will have spaces on either side of the Maths character, and a character rule defining that the character “−” (minus) will not have any spaces either side of it, because it is most likely to be used as a hyphen. If the user were to insert ‘−’, and the category rule was prioritised over the character rule, the character rule would never be triggered. Thus, to format the sequence correctly, the character rule should be prioritised over the category rule.
- The rule that is prioritised is applied, and the comparison mechanism 60 stops the search for applicable rules. However, as described above, a different rule type may be applied as part of a secondary set of rules.
- Thus, in the preferred embodiment, as illustrated in FIG. 4, the comparison mechanism is configured to compare, and the formatting module is configured to apply, the rules for an individual language in accordance with the following prioritisation structure:
- Context Rules→Character Rules→Category Rules
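- The prioritisation structure can be sketched as a search that visits rule types in a fixed order and stops at the first applicable rule. The rule table, the lambda conditions and the name `find_rule` are hypothetical; they encode the URL and minus-sign examples given above.

```python
import re
import unicodedata

PRIORITY = ["context", "character", "category"]

# (rule type, condition on the text typed so far, action names)
RULES = [
    ("category", lambda t: unicodedata.category(t[-1]) == "Sm",
     ["InsertSpaceBefore", "InsertSpaceAfter"]),
    ("character", lambda t: t[-1] == ".",
     ["DeleteSpaceBefore", "InsertSpaceAfter"]),
    ("character", lambda t: t[-1] == "\u2212",  # minus sign
     ["DeleteSpaceBefore", "DeleteSpaceAfter"]),
    ("context", lambda t: re.search(r"www\S*\.$", t) is not None,
     ["DeleteSpaceBefore", "DeleteSpaceAfter"]),
]


def find_rule(text: str):
    """Return (rule type, actions) of the first applicable rule, or None."""
    if not text:
        return None
    for rule_type in PRIORITY:
        for kind, condition, actions in RULES:
            if kind == rule_type and condition(text):
                return kind, actions   # stop searching once a rule applies
    return None
```

Although the rules are listed in an arbitrary order, the full stop in “www.” is claimed by the context rule before the character rule is ever consulted, and the minus sign is claimed by its character rule before the Maths category rule.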
- To implement the prioritisation structure, the comparison mechanism 60 is preferably configured to identify the type of rule. The comparison mechanism 60 can be configured to identify the rule type by any suitable means. For example, each rule can be labelled with its rule type, where the comparison mechanism 60 is configured to identify all of the rules of a first rule type before comparing those rules to the electronic character sequence to see if one of them is applicable. The rules of a given type can be placed in a container, so that the comparison mechanism 60 compares all rules in a given container, before moving on to the next container. In another embodiment, the comparison mechanism 60 may comprise code to identify the different rule types.
- Alternatively, the rules themselves could be ordered according to the prioritisation structure, e.g. listed in accordance with the prioritisation structure.
- As will be apparent from the description above, if the comparison mechanism 60 finds that a rule is applicable, it does not continue through the prioritisation structure, e.g. if the context rule is found to be applicable to www.site.com, then the character rule is not compared or applied, since the comparison mechanism 60 has stopped searching for applicable rules. Otherwise this character rule, if applied after the context rule, would result in the incorrect formatting “www. Site. Com” as described above. However, the rule that is applied may comprise secondary rules of the other rule types, e.g. the formatting module dealing with repeated punctuation where the triggered character rule comprises a secondary category rule:
- rule1→Character Rule :: (‘!’, None, [InsertSpaceBefore, InsertSpaceAfter], [rule2])
- rule2→Category Rule :: (‘P *’, None, [DeleteSpaceBefore, InsertSpaceAfter], Ø)
- Rules may be applicable to a particular language, e.g. English, and the family of languages to which that language belongs, e.g. Latin, or to all languages in the world. There are multiple conventions for punctuation that are common to a number of languages. For example, in all languages URLs are written the same way and therefore they all must have the necessary rules for the correct formatting of these elements.
- In the preferred embodiment in which the formatting module 10 supports a plurality of languages, the language identifier 20 is configured to pass the identified language to the comparison mechanism 60, and the comparison mechanism 60 is configured to compare the rules from the set of rules 70 that are relevant given the particular language so identified. The set of rules is preferably ordered into a hierarchical structure, in order to avoid repeating the same rules.
- Thus, in addition to the comparison mechanism 60 being configured to compare the rules according to the rule prioritisation structure, e.g. context rule→character rule→category rule, as described above, the comparison mechanism 60 is further configured to compare the rules in a particular order of increasing generality:
- language specific rules→language family rules→worldwide rules
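- One way to combine the two orderings is to keep the rules in containers keyed by rule type and scope, and to visit the containers with rule type as the outer loop and increasing generality as the inner loop. The container keys, the rule labels and the function names below are hypothetical placeholders.

```python
RULE_TYPES = ["context", "character", "category"]

# Hypothetical containers: each rule is stored once, at its most general scope.
RULES = {
    ("context", "worldwide"): ["url-rule"],
    ("character", "fr"): ["fr-exclamation"],
    ("character", "latin"): ["latin-full-stop"],
    ("category", "worldwide"): ["punctuation-default"],
}


def search_order(language: str, family: str):
    """Containers to visit: by rule type, then by increasing generality."""
    scopes = [language, family, "worldwide"]
    return [(rule_type, scope) for rule_type in RULE_TYPES for scope in scopes]


def candidate_rules(language: str, family: str):
    """Flatten the containers into the order in which rules are compared."""
    ordered = []
    for key in search_order(language, family):
        ordered.extend(RULES.get(key, []))
    return ordered
```

Because the worldwide and family containers are shared, a rule such as the URL rule is written once and is reached for every language, which is the point of the hierarchical structure.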
- To enable the comparison mechanism 60 to compare the rules in this order, the comparison mechanism is preferably configured to identify the language generalisation rule, i.e. whether the rule is a language specific rule, a language family rule or a worldwide rule. The comparison mechanism 60 may be coded to recognise the language generalisation rule or each rule may be labelled to identify the type of language generalisation rule, and containers may be used, as explained above when discussing the rule type prioritisation structure. As stated above, an alternative could be to order the rules into the generalisation structure.
- Thus, the comparison mechanism 60 is preferably configured to identify the rule type and the language generalisation rule, e.g. context rule applicable to French (language specific rule).
- As can be seen from FIG. 5, the comparison mechanism 60 compares the rules in accordance with the priority system described above, until a rule is found to be applicable: first, all the “context rules” are compared in order of increasing generality of language, i.e. the language specific rules are checked first, then the family rules, and then the worldwide rules; the comparison mechanism 60 then proceeds to compare the next type of rule, character rules, through increasing generality in language, and then compares the category rules in the same way. When a rule is found to be applicable, the comparison mechanism 60 stops the search for an applicable rule. Alternatively, the comparison mechanism 60 compares all of the rules, finds that no rule is applicable, and stops once all rules are exhausted.
- Preferably, the comparison mechanism 60 is configured to compare each of the rules to each character in the electronic character sequence in turn. Thus, if the comparison mechanism 60 discovers that a rule is applicable to a character of the character sequence, the formatting module 10 applies this rule to the electronic character sequence to format the spaces of the electronic character sequence, and the comparison mechanism moves on to comparing the rules to the next character in the character sequence. Likewise, if no rule is found to be applicable to that character, the comparison mechanism 60 moves on to comparing the rules to the next character in the electronic character sequence.
- As will be understood from above, the language identifier 20 is configured to identify whether the language in which the electronic character sequence is being written is supported and, preferably, which is the most likely supported language. The language identifier 20 may be configured to identify the current language periodically, e.g. for every term (where the electronic character sequence is converted into a sequence of terms or words by a tokeniser) or, for example, every three terms, in order to identify whether the language has been switched by the user and thus to change the set of rules that are being compared by the comparison mechanism 60 to the electronic character sequence. Any other frequency of checking may be used. If the language identifier 20 determines that the language of the character sequence is not supported by the formatting module 10, the comparison mechanism 60 stops searching for an applicable rule.
- A formatting module 10 or system 100 comprising a formatting module 10 in accordance with the present invention provides language detection and rule mechanisms that provide automatic dynamic punctuation. Unlike existing systems which neglect the possibility of having different behaviours for the same punctuation mark depending on the context in which the punctuation mark occurs, the formatting module 10 of the present invention is able to format the spaces either side of a punctuation mark on the basis of the context of the punctuation mark.
- The formatting module 10 of the present invention is therefore able to increase the productivity of the user by reducing the interaction required to produce correctly formatted punctuation appropriate to the target language. For multilingual users, the formatting module 10 is preferably able to automatically adjust the space formatting to the language currently being entered. This allows the user to focus on the message being delivered rather than formatting conventions specific to various target languages.
- Furthermore, the formatting module 10 of the present invention provides a separate layer that defines the behaviour of the formatting of the spaces for the punctuation, i.e. the rules and their associated actions. This allows independent manual updates of the rules and their associated actions for a particular language, to change the space formatting for that language, without affecting the space formatting for the other languages or requiring an upgrade of the entire formatting module 10.
- The present invention also provides a corresponding method for formatting spaces in an electronic character sequence that has preferably been entered by a user. Turning to FIG. 1 and the above described formatting module 10 and system 100 comprising a formatting module 10, the method comprises identifying whether the electronic character sequence is written in a language supported by the formatting module; identifying, with the character identifier 40 (see FIG. 2), a particular character or a particular sequence of characters in the electronic character sequence; and formatting, with the formatting module 10 if a supported language is identified, spaces in the electronic character sequence on the basis of the language identified and the particular character or sequence of characters identified. As will be apparent from the description of the formatting module 10 and the system 100 comprising a formatting module 10, the formatting module preferably supports a plurality of languages, and the most likely supported language can be identified by a language identifier 20 of the formatting module 10 or a language identifier of the prediction engine 30 of the system 100.
- Other aspects of the method of the present invention can be readily determined by analogy to the above system description. For example, the formatting module comprises a language identifier to identify whether the electronic character sequence is written in a language supported by the formatting module and to identify the most likely language of the electronic character sequence. The method, by analogy to the formatting module, will also comprise selecting, with a comparison mechanism 60, the set of rules that correspond to the most likely language identified, etc.
- The computer program product may be a data carrier having stored thereon computer program means for causing a processor external to the data carrier, i.e. a processor of an electronic device, to carry out the method according to the present invention. The computer program product may also be available for download, for example from a data carrier or from a supplier over the internet or other available network, e.g. downloaded as an app onto a mobile device (such as a mobile phone) or downloaded onto a computer, the mobile device or computer comprising a processor for executing the computer program means once downloaded.
- It will be appreciated that this description is by way of example only; alterations and modifications may be made to the described embodiment without departing from the scope of the invention as defined in the claims.
Claims (33)
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB1216640.1 | 2012-09-18 | ||
GBGB1216640.1A GB201216640D0 (en) | 2012-09-18 | 2012-09-18 | Formatting module, system and method for formatting an electronic character sequence |
PCT/GB2013/052443 WO2014045032A1 (en) | 2012-09-18 | 2013-09-18 | Formatting module, system and method for formatting an electronic character sequence |
Related Parent Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/GB2013/052443 A-371-Of-International WO2014045032A1 (en) | 2012-09-18 | 2013-09-18 | Formatting module, system and method for formatting an electronic character sequence |
Related Child Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US18/136,730 Continuation US20230252222A1 (en) | 2012-09-18 | 2023-04-19 | Formatting module, system and method for formatting an electronic character sequence |
Publications (1)
Publication Number | Publication Date |
---|---|
US20150248379A1 true US20150248379A1 (en) | 2015-09-03 |
Family
ID=47144444
Family Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/428,972 Abandoned US20150248379A1 (en) | 2012-09-18 | 2013-09-18 | Formatting module, system and method for formatting an electronic character sequence |
US18/136,730 Pending US20230252222A1 (en) | 2012-09-18 | 2023-04-19 | Formatting module, system and method for formatting an electronic character sequence |
Country Status (6)
Country | Link |
---|---|
US (2) | US20150248379A1 (en) |
EP (1) | EP2898426A1 (en) |
JP (1) | JP6273285B2 (en) |
CN (1) | CN104641367B (en) |
GB (1) | GB201216640D0 (en) |
WO (1) | WO2014045032A1 (en) |
2012
- 2012-09-18: GB application GBGB1216640.1A (GB201216640D0), not active, Ceased
2013
- 2013-09-18: WO application PCT/GB2013/052443 (WO2014045032A1), active, Application Filing
- 2013-09-18: US application US14/428,972 (US20150248379A1), not active, Abandoned
- 2013-09-18: EP application EP13771173.5A (EP2898426A1), not active, Ceased
- 2013-09-18: CN application CN201380048564.6A (CN104641367B), active, Active
- 2013-09-18: JP application JP2015531650A (JP6273285B2), active, Active
2023
- 2023-04-19: US application US18/136,730 (US20230252222A1), active, Pending
Also Published As
Publication number | Publication date |
---|---|
EP2898426A1 (en) | 2015-07-29 |
JP2015534171A (en) | 2015-11-26 |
CN104641367A (en) | 2015-05-20 |
CN104641367B (en) | 2019-01-11 |
WO2014045032A1 (en) | 2014-03-27 |
JP6273285B2 (en) | 2018-01-31 |
GB201216640D0 (en) | 2012-10-31 |
US20230252222A1 (en) | 2023-08-10 |
Similar Documents
Publication | Title |
---|---|
US10262062B2 (en) | Natural language system question classifier, semantic representations, and logical form templates |
US20230252222A1 (en) | Formatting module, system and method for formatting an electronic character sequence |
JP5901001B1 (en) | Method and device for acoustic language model training |
CN112016310A (en) | Text error correction method, system, device and readable storage medium |
US20090150322A1 (en) | Predicting Candidates Using Information Sources |
US9645988B1 (en) | System and method for identifying passages in electronic documents |
US10977155B1 (en) | System for providing autonomous discovery of field or navigation constraints |
JP5809381B1 (en) | Natural language processing system, natural language processing method, and natural language processing program |
US20200272435A1 (en) | Systems and methods for virtual programming by artificial intelligence |
CN112579733A (en) | Rule matching method, rule matching device, storage medium and electronic equipment |
CN110008807B (en) | Training method, device and equipment for contract content recognition model |
US9275035B2 (en) | Method and system to determine part-of-speech |
CN116151220A (en) | Word segmentation model training method, word segmentation processing method and device |
CN113688615B (en) | Method, equipment and storage medium for generating field annotation and understanding character string |
CN111090720B (en) | Hot word adding method and device |
Ouersighni | Robust rule-based approach in Arabic processing |
CN111401057B (en) | Semantic analysis method, storage medium and terminal equipment |
US20150294008A1 (en) | System and methods for providing learning opportunities while accessing information over a network |
Peng et al. | Prompt as a Knowledge Probe for Chinese Spelling Check |
Littell et al. | Parser combinators for Tigrinya and Oromo morphology |
Eger | Designing and comparing G2P-type lemmatizers for a morphology-rich language |
Ullah et al. | Part-Of-Speech Tagging for Balochi Language: A Data driven application of Conditional Random Fields |
CN114661917A (en) | Text amplification method, system, computer device and readable storage medium |
CN105094358A (en) | Information processing device and method for inputting target language characters through outer codes |
JP2005092279A (en) | Natural language processing system, natural language processing method and computer program |
Legal Events
Code | Title | Description |
---|---|---|
AS | Assignment | Owner name: TOUCHTYPE LIMITED, UNITED KINGDOM. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:MEDLOCK, BENJAMIN;MARTINEZ DEL CORRAL, DAVID;SIGNING DATES FROM 20150303 TO 20150305;REEL/FRAME:035246/0681 |
AS | Assignment | Owner name: MICROSOFT CORPORATION, WASHINGTON. Free format text: MERGER;ASSIGNOR:TOUCHTYPE, INC.;REEL/FRAME:047259/0625. Effective date: 20171211. Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MICROSOFT CORPORATION;REEL/FRAME:047259/0974. Effective date: 20181002 |
AS | Assignment | Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON. Free format text: CORRECTIVE ASSIGNMENT TO CORRECT THE ASSIGNMENT FROM MICROSOFT CORPORATION TO MICROSOFT TECHNOLOGY LICENSING, LLC IS NOT RELEVANT TO THE ASSET. PREVIOUSLY RECORDED ON REEL 047259 FRAME 0974. ASSIGNOR(S) HEREBY CONFIRMS THE THE CURRENT OWNER REMAINS TOUCHTYPE LIMITED.;ASSIGNOR:MICROSOFT CORPORATION;REEL/FRAME:047909/0353. Effective date: 20181002. Owner name: MICROSOFT CORPORATION, WASHINGTON. Free format text: CORRECTIVE ASSIGNMENT TO CORRECT THE RECEIVING PARTY PREVIOUSLY RECORDED AT REEL: 047259 FRAME: 0625. ASSIGNOR(S) HEREBY CONFIRMS THE MERGER;ASSIGNOR:TOUCHTYPE, INC.;REEL/FRAME:047909/0341. Effective date: 20171211 |
STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED |
STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED |
STCV | Information on status: appeal procedure | Free format text: NOTICE OF APPEAL FILED |
AS | Assignment | Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:TOUCHTYPE LIMITED;REEL/FRAME:053965/0124. Effective date: 20200626 |
STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED |
STCV | Information on status: appeal procedure | Free format text: NOTICE OF APPEAL FILED |
STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED |
STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |