US20010025320A1

US20010025320A1 - Multi-language domain name service

Info

Publication number: US20010025320A1
Application number: US09/792,438
Authority: US
Inventors: Ching Seng; Jun Yin; Mingliang Jiang
Original assignee: I-DNS NET INTERNATIONAL Pte Ltd
Current assignee: I-DNS NET INTERNATIONAL Pte Ltd
Priority date: 1999-02-26
Filing date: 2001-02-23
Publication date: 2001-09-27
Also published as: CN1238804C; KR20000076575A; EP1059789A3; JP3492580B2; TW461209B; KR100444757B1; EP1059789A2; EA002513B1; CN1812407A; WO2000050966A2; EA200000136A2; JP2000253067A; WO2000050966A3; US6446133B1; HK1096499A1; CN1812407B; HK1029418A1; US20010047429A1; CN1266237A; SG91854A1

Abstract

A multilingual apparatus detects the linguistic encoding type of a digital string encoding a domain name. It accomplishes this using a tree or graph comprised of nodes holding linguistic digits representing the digital sequence of a character or a portion of a character. These nodes are compared against digital sequences of characters in the domain name under consideration. Each comparison results in a step down the graph. Then another comparison is performed, often with the next successive character in the domain name. Ultimately the process reaches a terminal node of the graph. This terminal node specifies the encoding type of the domain name under consideration.

Description

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation-in-part of U.S. patent application Ser. No. 09/258,690 filed Feb. 26, 1999 in the name of Ching Hong Seng et al. and titled “MULTI-LANGUAGE DOMAIN NAME SERVICE.” That application is incorporated herein by reference for all purposes.

BACKGROUND OF THE INVENTION

The present invention relates to the Domain Name Service used to resolve network domain names into corresponding network addresses. More particularly, the invention relates to an alternative or modified Domain Name Service that accepts domain names provided in many different encoding formats, not just ASCII.

The Internet has evolved from a purely research and academic entity to a global network that reaches a diverse community with different languages and cultures. In all areas the Internet has progressed to address the localization needs of its audience. Today, electronic mail is exchanged in most languages. Content on the World Wide Web is now published in many different languages as multilingual-enabled software applications proliferate. It is possible to send an e-mail message to another person in Chinese or to view a World Wide Web page in Japanese.

The Internet today relies entirely on the Domain Name System to resolve human readable names to numeric IP addresses and vice versa. The Domain Name System (DNS) is still based on a subset of the Latin-1 alphabet, and thus is still mainly English. To provide universality, e-mail addresses, Web addresses, and other Internet addressing formats adopt ASCII as the global standard to guarantee interoperation. No provision is made to allow for e-mail or Web addresses to be in a non-ASCII native language. The implication is that any user of the Internet has to have some basic knowledge of ASCII characters.

While this does not pose a problem to technical or business users who, generally speaking, are able to understand English as an international language of science, technology, business and politics, it is a stumbling block to the rapid proliferation of the Internet to countries where English is not widely spoken. In those countries, the Internet neophyte must understand basic English as a prerequisite to sending e-mail in her own native language because the e-mail address cannot support the native language even though the e-mail application can. Corporate intranets have to use ASCII to name their department domain names and Web documents simply because the protocols do not support anything other than ASCII in the domain name field, even though filenames and directory paths can be multilingual in the native locale.

Moreover, users of European languages have to approximate their domain names without accents and other diacritics. A company like Citroën wishing to have a corporate identity has to approximate its name with the closest ASCII equivalent and use “www.citroen.fr”, and Mr. François from France has to constantly bear the irritation of deliberately mis-typing his e-mail address as “francois@email.fr” (a fictitious example).

Currently, user-ids in an e-mail address field can be in multilingual scripts as operating systems can be localized to provide fonts in the relevant locale. Directories and filenames can also be rendered in multilingual scripts. However, the domain name portion of these names is restricted to the characters permitted by the Internet standard in RFC1035, the standard setting forth the Domain Name System.

Based on RFC1035, valid domain names are currently restricted to a subset of the ISO-8859 Latin-1 alphabet, which comprises the letters A-Z (case insensitive), the digits 0-9, and the hyphen (-) only. This restriction effectively limits domain names to English or to languages with a romanized form, such as Malay or Romaji in Japanese, or a roman transliteration, such as transliterated Tamil. No other script is acceptable; even the extended ASCII characters cannot be used.
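The character restriction described above can be illustrated with a minimal Python sketch. The helper name `is_rfc1035_name` is hypothetical, and the rule it checks is the preferred name syntax of RFC 1035 (each label starts with a letter, ends with a letter or digit, contains only letters, digits, and hyphens, and is at most 63 characters long); it is an illustration, not part of the described invention.

```python
import re

# One label under the RFC 1035 preferred syntax: starts with a letter, ends
# with a letter or digit, interior characters are letters, digits, or hyphens,
# and the whole label is at most 63 characters long.
LABEL_RE = re.compile(r"[A-Za-z]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_rfc1035_name(name: str) -> bool:
    """Return True if every dot-separated label satisfies LABEL_RE."""
    labels = name.rstrip(".").split(".")
    return all(LABEL_RE.fullmatch(label) is not None for label in labels)


# The ASCII approximation passes; the accented names from the text do not.
ascii_ok = is_rfc1035_name("www.citroen.fr")
accented_ok = is_rfc1035_name("www.citro\u00ebn.ch")
```

Under this check, any name containing an accented or non-Latin character is rejected, which is the limitation the multilingual service is meant to work around.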

Unicode is a character encoding system in which nearly every character of most important languages is uniquely mapped to a 16-bit value. Since Unicode has laid down the foundations for a unique, non-overlapping encoding system, some researchers have begun to explore how Unicode can be used as the basis for a future DNS namespace, which can embrace the rich diversity of languages present in the world today. See M. Durst, “Internationalization of Domain Names,” Internet Draft “draft-duerst-dns-i18n-02.txt,” which can be found at the IETF home page, http://www.ietf.cnri.reston.va.us/ID.html, July 1998. This document is incorporated herein by reference in its entirety and for all purposes. The new namespace should be able to offer multilingual and multiscript functionality that will make it easier for non-English speakers to use the Internet.

Adopting Unicode as the standard character set for a new Domain Name System avoids overlapping code space for different language scripts. In this way, it may allow the Internet community to use domain names in their native scripts such as

www.citroën.ch

www.genève-city.ch

Unfortunately, several difficulties would preclude modifying the DNS server and client applications to implement a multilingual Domain Name System. For example, all future client applications and all future DNS servers have to be modified. As both client and server have to be modified for the system to work, the transition from the old system to the new system could be difficult. Further, very few available client applications use native Unicode. Instead, most multilingual client applications use non-Unicode encodings, and have strong followings.

One proposed compromise solution to this problem is the so-called “multilingual.com.” In this approach, the popular “.com” (“dot com”) top level domain is represented in ASCII characters, but the second and lower level domains are represented in a non-ASCII format. Such formats allow non-Roman characters. For example, the non-ASCII encoding type BIG-5 encodes Chinese characters. Thus, a Chinese language second level domain name could be registered and used with a “.com” top-level domain name. However, to make use of the existing infrastructure for resolving domain names, the BIG-5 encoded second level domain name would first have to be converted to an ASCII representation. The transformed multilingual.com second level domain could then be used by conventional name servers to resolve the address.

In view of these and other issues, it would be highly desirable to have a technique allowing the many linguistic encodings to be used with DNS.

SUMMARY OF THE INVENTION

The present invention pertains to methods and apparatus that detect the linguistic encoding type of a digital string encoding a domain name. This is accomplished using a tree or graph comprised of nodes holding linguistic digits representing the digital sequence of a character or a portion of a character. These nodes are compared against digital sequences of characters in the domain name under consideration. Each comparison results in a step down the graph. Then another comparison is performed, often with the next successive character in the domain name. Ultimately the process reaches a terminal node of the graph. This node specifies the encoding type of the domain name under consideration.

One specific aspect of the invention pertains to a method of detecting a linguistic encoding type of a domain name. Such method may be characterized by the following sequence: (a) receiving a digital representation of the domain name; and (b) using the digital representation to traverse a tree structure having multiple nodes connected by paths and having terminal nodes that uniquely identify distinct encoding types. By using the digital representation to traverse the tree in this manner and reach a terminal node, the method detects the linguistic encoding type of the domain name.

The tree structure may take various forms. In a preferred embodiment, the tree structure is a ternary tree structure. Typically, the nodes of the tree structure comprise digital sequences of linguistic digits from characters of multiple encoding types.

Typically, the method traverses the tree structure by considering individual characters of the domain name (or portions of those characters) to determine how to move between nodes on the tree structure. In specific embodiments, the tree structure is traversed by comparing digital representations of linguistic digits in the nodes of the tree structure against digital representations of individual characters of the domain name or portions of those characters. The comparisons determine how to move between nodes on the tree structure. For example, if the digital value of a node's linguistic digit is greater than the digital value of the corresponding character of the domain name, one path is chosen. Other paths are chosen if the comparison shows different relationships between the digital values.

In certain specific embodiments, the method also involves reversing the sequence of the digital representation of the domain name prior to using the representation to traverse the tree structure. In this manner, a digital representation of a last character of the domain name is compared to a root node on the tree structure. Next, a digital representation of a next to last character of the domain name is compared to a second node of the tree structure. The method continues in this manner (i) using a next previous character of the domain name to identify a next lower level node of the tree structure; and (ii) repeating (i) until reaching a terminal node of the tree structure. Ultimately, a terminal node of the tree structure is reached. Typically, the terminal node itself specifies the linguistic encoding type of the domain name.
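The traversal described above can be sketched in Python as a ternary search tree keyed on reversed 8-bit linguistic digits. This is a minimal illustration under stated assumptions, not the patented algorithm itself: the names `Node`, `insert`, and `detect` are hypothetical, the `insert` helper exists only so the example has a tree to search, and the sample names and their encodings are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    digit: int                      # one 8-bit linguistic digit
    lo: Optional["Node"] = None     # path taken when the input digit is smaller
    eq: Optional["Node"] = None     # path taken on a match (next digit, next level)
    hi: Optional["Node"] = None     # path taken when the input digit is larger
    encoding: Optional[str] = None  # set only on terminal nodes


def insert(root: Optional[Node], encoded: bytes, encoding: str) -> Node:
    """Add a non-empty encoded name, last digit first, and return the root."""
    digits = encoded[::-1]
    if root is None:
        root = Node(digits[0])
    node, i = root, 0
    while True:
        d = digits[i]
        if d < node.digit:
            if node.lo is None:
                node.lo = Node(d)
            node = node.lo
        elif d > node.digit:
            if node.hi is None:
                node.hi = Node(d)
            node = node.hi
        else:
            i += 1
            if i == len(digits):
                node.encoding = encoding  # terminal node for this name
                return root
            if node.eq is None:
                node.eq = Node(digits[i])
            node = node.eq


def detect(root: Optional[Node], encoded: bytes) -> Optional[str]:
    """Walk the tree with the reversed name; return the encoding or None."""
    digits = encoded[::-1]
    node, i = root, 0
    while node is not None and i < len(digits):
        d = digits[i]
        if d < node.digit:
            node = node.lo
        elif d > node.digit:
            node = node.hi
        else:
            i += 1
            if i == len(digits):
                return node.encoding
            node = node.eq
    return None


root = insert(None, b"abc.com", "ASCII")
root = insert(root, b"\xcc\xa8\xcd\xe5.com", "GB-2312")
```

Because the names are reversed, both sample names share the nodes for `moc.` near the root and diverge only at the first differing digit, where the less-than or greater-than comparison selects a sibling path.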

Another aspect of the invention pertains to apparatus for detecting the linguistic encoding type of a domain name. The apparatus may be characterized by the following features: (a) one or more processors; (b) memory coupled to said one or more processors and configured to store a tree structure having multiple nodes connected by paths and having terminal nodes that uniquely identify distinct encoding types; and (c) a network interface configured to receive domain names from network nodes. The one or more processors are configured or designed to traverse the tree structure using information from a domain name to thereby detect the linguistic encoding type of the domain name. The tree structure may have a form as described above.

In a preferred embodiment, the apparatus also includes a logical module for converting the domain name from its linguistic encoding type to a DNS compatible encoding type (e.g., ASCII). The apparatus may also include a logical module for resolving domain names in the DNS compatible encoding type.

Another specific aspect of the invention pertains to methods for creating an encoding detection tree of the types described above (e.g., ternary tree structures). Such methods may be characterized by the following sequence: (a) receiving a representation of a digitally represented first domain name which is encoded in a first linguistic encoding type; (b) adding the first domain name, and its first linguistic encoding type, to the encoding detection tree to create a first path through the encoding detection tree; (c) receiving a representation of a digitally represented second domain name which is encoded in a second linguistic encoding type; and (d) adding the second domain name, and its second linguistic encoding type, to the encoding detection tree to create a second path through the encoding detection tree. Part of the method may also involve determining whether the first domain name (or some part of it) already exists in the encoding detection tree.

Generally, the first and second paths each include separate terminal nodes, one or more intermediate nodes, and a common root node. As part of the process, the system may add an identifier of the first and second linguistic encoding types to the terminal nodes of the first and second paths, respectively. The system may also add a sequence of the first domain name to the terminal node of the first path. In a preferred embodiment, the encoding detection tree presents the domain names in reverse order. Thus, for example, the first path in the tree presents the first domain name in reverse order of linguistic digits when moving from the root node to the terminal node.

In one approach to constructing the tree, the first domain name is included in the tree by adding a new node to the encoding detection tree for each linguistic digit of the first domain name having a digital sequence that does not appear at a corresponding location in the encoding detection tree. The positions of the new nodes with respect to existing nodes are determined by comparing the digital sequence of an existing node with the digital sequence of a corresponding linguistic digit from the first domain name. The process may also include adding to the encoding detection tree a linguistic equivalent node of one of the linguistic digits in the first domain name.
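The construction approach described above can be sketched as follows. This is a simplified Python illustration, assuming 8-bit linguistic digits and reversed insertion order; the class and method names (`EncodingDetectionTree`, `add`) are hypothetical, and linguistic equivalent nodes are omitted for brevity. The `add` method returns the number of nodes it created, so a return value of zero indicates that the name was already present in the tree.

```python
from typing import Optional


class Node:
    def __init__(self, digit: int) -> None:
        self.digit = digit
        self.lo: Optional["Node"] = None  # smaller digits at the same position
        self.eq: Optional["Node"] = None  # next position in the reversed name
        self.hi: Optional["Node"] = None  # larger digits at the same position
        self.encoding: Optional[str] = None
        self.name: Optional[bytes] = None


class EncodingDetectionTree:
    def __init__(self) -> None:
        self.root: Optional[Node] = None
        self.node_count = 0

    def _new(self, digit: int) -> Node:
        self.node_count += 1
        return Node(digit)

    def add(self, encoded: bytes, encoding: str) -> int:
        """Add a non-empty encoded name; return how many nodes were created."""
        before = self.node_count
        digits = encoded[::-1]
        if self.root is None:
            self.root = self._new(digits[0])
        node = self.root
        for pos, d in enumerate(digits):
            # Find, or create, the node holding digit d at this position.
            while d != node.digit:
                side = "lo" if d < node.digit else "hi"
                nxt = getattr(node, side)
                if nxt is None:
                    nxt = self._new(d)
                    setattr(node, side, nxt)
                node = nxt
            if pos == len(digits) - 1:
                node.encoding = encoding  # terminal node records the encoding
                node.name = encoded       # and the name's digit sequence
            else:
                if node.eq is None:
                    node.eq = self._new(digits[pos + 1])
                node = node.eq
        return self.node_count - before


tree = EncodingDetectionTree()
created_first = tree.add(b"abc.com", "ASCII")
created_second = tree.add(b"xyz.com", "ASCII")
created_again = tree.add(b"abc.com", "ASCII")
```

In this sketch the second name reuses the four nodes for the reversed suffix `moc.` and adds nodes only for the digits that do not already appear at the corresponding location.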

Yet another aspect of the invention pertains to apparatus for creating an encoding detection tree. Such apparatus may be characterized by the following features: (a) one or more processors; (b) memory coupled to said one or more processors and configured to store a partially created tree structure having multiple nodes connected by paths and having terminal nodes that uniquely identify distinct encoding types; and (c) an interface configured to receive domain names from a collection of domain names. In this apparatus, the one or more processors are configured or designed to receive representations of digitally represented domain names which are encoded in linguistic encoding types and add those domain names, together with their linguistic encoding types, to the encoding detection tree to create paths through the encoding detection tree.

Another aspect of the invention pertains to computer program products including a machine-readable medium on which are stored program instructions for implementing a portion of or an entire method as described above. Any of the methods of this invention may be represented, in whole or in part, as program instructions that can be provided on such computer readable media.

These and other features and advantages of the invention will be described in more detail below with reference to the associated drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of an exemplary system for resolving a non-ASCII domain name to its numeric IP address.
FIG. 2 is a process flow diagram of operations between a client and two multilingual DNS servers according to an embodiment of the invention.
FIG. 3A is a flow chart of the conversion of the domain name from one linguistic encoding type to a second linguistic encoding type according to an embodiment of the invention.
FIG. 3B is a block diagram of a multilingual domain name server according to an embodiment of the invention.
FIG. 4A is a schematic diagram of the reversed linguistic digit sequence of a domain name and a corresponding encoding detection tree according to an embodiment of the invention.
FIG. 4B is a flow chart of an algorithm to determine an encoding type of a domain name using an encoding detection tree according to an embodiment of the invention.
FIG. 4C is a flow chart of an algorithm, used with the algorithm of FIG. 4B, to search a list of terminal nodes according to an embodiment of the invention.
FIG. 4D is a schematic diagram of another example of an encoding detection tree.
FIG. 5A is a flow chart of an algorithm to construct the data structure according to an embodiment of the invention.
FIG. 5B is a schematic diagram of the reversed linguistic digit sequence, the linguistic encoding type, the linguistic equivalent, and the encoding detection tree according to an embodiment of the invention.
FIG. 5C is a flow chart of an algorithm to check whether a linguistic digit has been inserted into the encoding detection tree according to an embodiment of the invention.
FIGS. 5D-5I are schematic diagrams of the reversed linguistic digit sequence, the linguistic equivalent, the encoding detection tree, and the pointers according to an embodiment of the invention when a linguistic digit sequence “A1A2B1B2.com” is added to the encoding detection tree.
FIG. 5J is a flow chart of an algorithm to adjust sub-EDT structure according to an embodiment of the invention.
FIGS. 5K-5V are schematic diagrams of the reversed linguistic digit sequence, the linguistic equivalent, the encoding detection tree, and the pointers according to an embodiment of the invention when a linguistic digit sequence “A1A2B1B2.com” is added to the encoding detection tree.
FIGS. 5W-5AG are schematic diagrams of the reversed linguistic digit sequence, the linguistic equivalent, the encoding detection tree, and the pointers according to an embodiment of the invention when a linguistic digit sequence “C1C2D1D2.com” is added to the encoding detection tree.
FIG. 5AH is a schematic diagram of the reversed linguistic digit sequence, the linguistic equivalent, the encoding detection tree, and the pointers according to an embodiment of the invention when a linguistic digit sequence “D1E2F1F2.com” is added to the encoding detection tree.
FIG. 5AI is a flow chart of an algorithm to reinsert a sub-EDT structure according to an embodiment of the invention.
FIG. 5AJ is a schematic diagram of the reversed linguistic digit sequence of a domain name and a corresponding encoding detection tree according to an embodiment of the invention.
FIG. 6 is a schematic diagram depicting the construction of an encoding detection tree using the procedure depicted in FIG. 5AI, when a sub-EDT structure is destroyed and reinserted in the encoding detection tree.
FIG. 7 is a simplified block diagram of a typical computer system of the type that may be employed to implement the procedures of this invention.

DETAILED DESCRIPTION OF THE INVENTION

1. INTRODUCTION
The present invention provides a technology for efficiently and accurately identifying encoding types of domain names. It uses a tree or graph structure having nodes corresponding to “linguistic digits.” In a typical application, these linguistic digits are sequentially compared against digital representations of characters in the domain name. Each comparison results in a decision on which available path to take in the graph structure. This moves a pointer through a tree sequentially until reaching a terminal node associated with an encoding type. Thus, at the end of the process, the encoding type is detected. This information can then be employed to convert the characters of the multilingual domain name to a format compatible with the DNS standard (e.g., RFC 1035).
While the invention is described below in terms of a “multilingual.com” embodiment, other domain name formats may be employed as well. In general, any domain name system that recognizes domain names from more than one encoding type can be used with this invention. Depending upon the range of acceptable domain names, the encoding detection graph/tree will have various forms.
In a typical embodiment, the present invention transforms multilingual multiscript names to a form that is compliant with DNS (e.g., DNS as explained in RFC1035). These transformed names may then be relayed as DNS queries to a conventional DNS server. An exemplary process of how a localized domain name is resolved to its numeric IP address is illustrated by FIG. 1 below.
As background for this invention, understand that programs rarely refer to hosts and other resources by their binary network addresses. Instead of binary numbers, they use ASCII strings, such as www.pobox.org.sg. Nevertheless, the network itself only understands binary addresses, so some mechanism is required to convert the ASCII strings to network addresses. This mechanism is provided by the Domain Name System.
The essence of DNS is a hierarchical, domain-based naming scheme and a distributed database system for implementing this naming scheme. It is primarily used for mapping host names and e-mail destinations to IP addresses, but can be used for other purposes. As mentioned, DNS is defined in RFCs 1034 and 1035.
As noted, the DNS protocol is currently based upon a subset of ASCII, and is thus limited to the Latin alphabet. Numerous other encodings provide digital representations for other character sets of the world. Examples include BIG5 and GB-2312 for Chinese character scripts (traditional and simplified respectively), Shift-JIS and EUC-JP for Japanese character scripts, KSC-5601 for Korean character scripts, and the extended ASCII characters for French and German characters, for instance.
Beyond these language-specific encoding types, there exists the Unicode standard (a “universal linguistic encoding type”) that provides the capacity to encode all the characters used in the written languages of the world. In a preferred embodiment, domain name strings in various different encoding types are all first converted to Unicode and then to ASCII—if necessary.
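The two-step conversion described above (native encoding to Unicode, then Unicode to ASCII) can be sketched in Python. This is an illustration only. The text does not specify the ASCII representation, so Python's built-in `punycode` codec is swapped in here as a stand-in; the `ml--` label prefix and the function name `to_dns_compatible` are hypothetical.

```python
# Hypothetical marker for labels that were converted from a non-ASCII script.
PREFIX = "ml--"


def to_dns_compatible(encoded: bytes, encoding: str) -> str:
    """Convert an encoded domain name to an all-ASCII form.

    Step 1 decodes the bytes from the detected encoding to Unicode.
    Step 2 maps each non-ASCII label to ASCII. Punycode is used here as a
    stand-in; it is an assumption, not the conversion named in the text.
    """
    unicode_name = encoded.decode(encoding)
    labels = []
    for label in unicode_name.split("."):
        if label.isascii():
            labels.append(label)
        else:
            labels.append(PREFIX + label.encode("punycode").decode("ascii"))
    return ".".join(labels)


# The same name supplied in two different encodings maps to one ASCII form.
from_big5 = to_dns_compatible("台灣.com".encode("big5"), "big5")
from_utf8 = to_dns_compatible("台灣.com".encode("utf-8"), "utf-8")
```

Routing every encoding through Unicode means only one Unicode-to-ASCII mapping is needed, rather than one mapping per native encoding.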
Unicode uses a 16-bit encoding that provides code points for more than 65,000 characters. Unicode scripts include Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Georgian, Tibetan, Japanese Kana, the complete set of modern Korean Hangul, and a unified set of Chinese/Japanese/Korean (CJK) ideographs. Many more scripts and characters are to be added shortly, including Ethiopic, Canadian Syllabics, Cherokee, additional rare ideographs, Sinhala, Syriac, Burmese, Khmer, and Braille.
A single 16-bit number is assigned to each code element defined by the Unicode Standard. Each of these 16-bit numbers is called a code value and, when referred to in text, is listed in hexadecimal form following the prefix “U+”. For example, the code value U+0041 is the hexadecimal number 0041 (equal to the decimal number 65). It represents the character “A” in the Unicode Standard.
Each character is also assigned a unique name that specifies it and no other. For example, U+0041 is assigned the character name “LATIN CAPITAL LETTER A.” U+0A1B is assigned the character name “GURMUKHI LETTER CHA.” These Unicode names are identical to the names for the same characters in ISO/IEC 10646.
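The code values and character names mentioned above can be inspected with Python's standard `unicodedata` module. The helper name `describe` is hypothetical and is used only for this illustration.

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return the "U+XXXX" code value and the Unicode name of one character."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch)


latin_a = describe("A")        # code value 0x0041, decimal 65
gurmukhi = describe("\u0A1B")  # the Gurmukhi letter cited in the text
```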
The Unicode Standard groups characters together by scripts in code blocks. A script is any system of related characters. The standard retains the order of characters in a source set where possible. When the characters of a script are traditionally arranged in a certain order—alphabetic order, for example—the Unicode Standard arranges them in its code space using the same order whenever possible. Code blocks vary greatly in size. For example, the Cyrillic code block does not exceed 256 code values, while the CJK code block has a range of thousands of code values.
Code elements are grouped logically throughout the range of code values, called the “codespace.” The coding starts at U+0000 with the standard ASCII characters, and continues with Greek, Cyrillic, Hebrew, Arabic, Indic and other scripts, followed by symbols and punctuation. The code space continues with Hiragana, Katakana, and Bopomofo. The unified Han ideographs are followed by the complete set of modern Hangul. A surrogate range of code values is reserved for future expansion with UTF-16. Towards the end of the codespace is a range of code values reserved for private use, followed by a range of compatibility characters. The compatibility characters are character variants that are encoded only to enable transcoding to earlier standards and old implementations, which made use of them.
Character encoding standards define not only the identity of each character and its numeric value, or code position, but also how this value is represented in bits. The Unicode Standard endorses at least three forms that correspond to ISO/IEC 10646 transformation formats, UTF-7, UTF-8 and UTF-16.
The ISO/IEC 10646 transformation formats UTF-7, UTF-8 and UTF-16 are essentially ways of turning the encoding into the actual bits that are used in implementation. UTF-16 assumes 16-bit characters and allows for a certain range of characters to be used as an extension mechanism in order to access an additional million characters using 16-bit character pairs. The Unicode Standard, Version 2.0, Addison Wesley Longman (1996) (with updates and additions added via “The Unicode Standard, Version 2.1”) has adopted this transformation format as defined in ISO/IEC 10646. This reference is incorporated herein by reference in its entirety and for all purposes.
The second transformation format is known as UTF-8. This is a way of transforming all Unicode characters into a variable length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set end up having the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites. The Unicode Consortium also endorses the use of UTF-8 as a way of implementing the Unicode Standard. Any Unicode character expressed in the 16-bit UTF-16 form can be converted to the UTF-8 form and back without loss of information.
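The two properties of UTF-8 noted above (ASCII characters keep their byte values, and conversion to and from UTF-16 is lossless) can be checked with Python's built-in codecs. The sample string is illustrative.

```python
text = "citro\u00ebn"  # "citroën": six ASCII letters plus one accented letter

utf8 = text.encode("utf-8")       # variable length: 1 byte per ASCII letter
utf16 = text.encode("utf-16-be")  # fixed 2 bytes per character here

# ASCII characters have identical byte values in ASCII and UTF-8.
ascii_same = "abc.com".encode("utf-8") == "abc.com".encode("ascii")

# Converting UTF-16 to UTF-8 and back loses no information.
round_trip = utf16.decode("utf-16-be").encode("utf-8").decode("utf-8")
```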
2. TERMINOLOGY
Some of the terms used herein are not commonly used in the art. Other terms may have multiple meanings in the art. Therefore, the following definitions are provided as an aid to understanding the description that follows. The invention as set forth in the claims should not necessarily be limited to these definitions.
Linguistic encoding type—any character or glyph encoding type (e.g., ASCII or BIG5) now known or used in the future. Each encoding type has its own mapping between linguistic characters (e.g., “a” in the Latin alphabet and an “o” with an umlaut in the German alphabet) and corresponding digital representations (e.g., hexadecimal number 0041 for “A” in ASCII).
Digitally represented—the way characters are presented as a result of encoding (e.g., in a bit stream, a hexadecimal format, etc.)
Digital sequence—a particular sequence of ones and zeros, hexadecimal characters, or other constituents in a digital representation.
Encoded domain name—the digital sequence of domain name characters represented in binary or hexadecimal for example. More specifically, the string of concatenated digital representations for the characters comprising the domain name under consideration is the “encoded domain name.” For example, the ASCII encoded domain name for abc.com is \0x61\0x62\0x63\0x2E\0x63\0x6F\0x6D. As another example, the GB-2312 encoded domain name of “Taiwan.com” is \0xCC\0xA8\0xCD\0xE5\0x2E\0x63\0x6F\0x6D.
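The notion of an encoded domain name can be made concrete with a short Python sketch that prints the digits in the same `\0x..` notation used above. The helper name `show_digits` is hypothetical; the GB-2312 byte values are the ones given in the text.

```python
def show_digits(encoded: bytes) -> str:
    """Render each byte of an encoded domain name in \\0xNN notation."""
    return "".join(f"\\0x{b:02X}" for b in encoded)


ascii_name = show_digits("abc.com".encode("ascii"))

# Byte sequence quoted in the text for the GB-2312 example.
gb_bytes = b"\xCC\xA8\xCD\xE5\x2E\x63\x6F\x6D"
gb_name = gb_bytes.decode("gb2312")  # two Chinese characters plus ".com"
```

Note that the last four bytes (`.com`) are identical in both examples, which is why the detection tree can share those nodes across encodings.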
Encoding detection tree—This is a tree or graph structure used to unambiguously determine the encoding types of arbitrary bit strings that are digital sequences of domain names. In typical examples, using information in the digital sequence of the domain name, an encoding detection algorithm can traverse the tree to reach a leaf node (terminal node). There the encoding type can be unambiguously determined. Specific examples of such trees and their use are presented below.
Linguistic digit—this is the digital sequence employed at nodes of an encoding detection tree. In one embodiment, it is 8 bits long. It is typically derived from the digital representation of an encoded linguistic character. For example, one linguistic digit employed in an encoding detection tree may be the hexadecimal value 0041, which is the ASCII representation of “A.” In another example, a linguistic digit is one byte of a two byte string used to represent a particular Chinese character in GB-2312.
The length of the linguistic digit may be chosen to provide an optimal balance between the size of the encoding detection tree and the speed at which it can be traversed. Smaller linguistic digits (1 bit in the extreme case) require more nodes and hence more storage space. Larger linguistic digits require longer comparison times.
Linguistic equivalent—this refers to two nominally different characters that have very similar linguistic meanings. Examples include uppercase and lowercase characters in the Latin and Greek alphabets.
DNS encoding type—an encoding type supported by the DNS protocol of a network or the Internet; e.g., a limited set of ASCII characters specified in RFC 1035.
Non-DNS encoding type—an encoding type not supported by the DNS protocol under consideration, e.g., BIG5 under the RFC 1035 standard.
Universal linguistic encoding type—any linguistic encoding type, now known or developed in the future, that encompasses more than one character or glyph set within its encoding. Unicode is one example. BIG5, iso-8859-11, and GB-2312 are others.
3. INTERNATIONAL DOMAIN NAME SYSTEMS
Turning now to FIG. 1, some important components of a network 10 used in a first embodiment of this invention include a client 12, a corresponding node 14 with whom client 12 wishes to communicate, a multilingual DNS (“mlDNS”) server 16 and a conventional DNS server 18. The mlDNS server 16 may listen on a DNS port (currently addressed to the domain name port 53) for multilingual domain name queries in place of a normal DNS server. Server 16 may include the Berkeley Internet Name Domain (‘BIND’ and its executable version ‘named’) which is a widely used DNS server written by Paul Vixie (http://www.isc.org/).
To understand the role of these components, assume that client 12 is used by a Chinese student who wishes to inquire about employment in a Hong Kong business that operates corresponding node 14. The student has previously communicated with the business and has obtained the domain name of that business. The domain name is provided in native Chinese characters. Client 12 is outfitted with a keyboard that can type Chinese language characters and is configured with software that can recognize encoded Chinese characters and accurately display them on a computer screen.
Now, the student prepares a message to the Hong Kong business, encloses her resume, and types in the Chinese domain name as the destination. When she instructs client 12 to send the message to corresponding node 14, the system shown in FIG. 1 takes the following actions. First, the corresponding node domain name is submitted, in the native language, to mlDNS server 16 via a DNS request. The mlDNS server 16 recognizes that the domain name is not in a format that can be handled by a conventional DNS server. Therefore it translates the Chinese domain name to a format that can be used with a conventional DNS server (normally a limited set of the ASCII characters). The mlDNS server 16 then repackages the DNS request, with the translated corresponding node domain name, and transmits that request to conventional DNS server 18. DNS server 18 then uses the normal DNS protocol to obtain a network address for the domain name it received in the DNS request. The resulting network address is the network address of corresponding node 14. DNS server 18 packages that network address according to conventional DNS protocol and forwards the address back to mlDNS server 16. The mlDNS server 16, in turn, transmits the needed network address back to client 12, where it is placed in the student's message. The message is packetized, with each packet having a destination network address corresponding to node 14. Client 12 then sends the message packets over the Internet to node 14.
While FIG. 1 shows the mlDNS server 16 and conventional DNS server 18 as separate blocks, often the two entities can be represented as a single logical block. Often both entities will reside on a single hardware device, such as a network workstation. Further, the functions of the two entities can be executed using a single block of program code or tightly coupled blocks of program code. FIG. 2 shows a network having multiple mlDNS servers, each of which performs the logical operations of mlDNS server 16 and conventional DNS server 18.
As shown in FIG. 2, client 12 is depicted by a vertical line on the left-hand side of the figure, a default mlDNS server 17 is depicted by a vertical line in the center of the figure, and a second mlDNS server 19 is depicted by a vertical line on the right-hand side of the figure.
Initially, at 203, an application running on client 12 generates a message intended for a network destination. The domain name for that destination is input in a non-DNS-compatible text encoding format. Thus, the text is encoded in a linguistic encoding type that digitally represents the characters of the text. As mentioned, ASCII is but one linguistic encoding type. In preferred embodiments, the invention handles a wide range of encoding types. Examples of some in wide use include GB2312, BIG5, Shift-JIS, EUC-JP, KSC5601, extended ASCII, and others.
After the client application creates the message at 203, the client operating system creates a DNS request to resolve the domain name at 205. The DNS request may resemble a conventional DNS request in most regards. However, the domain name provided in the request will be provided in a non-DNS encoding format. The client operating system transmits its DNS request to default mlDNS server 17 at 207. Note that the client operating system may be configured to send DNS requests to mlDNS server 17. In other words, the default DNS server of client 12 is mlDNS server 17.
Default mlDNS server 17 extracts the encoded domain name from the DNS request and generates a transformed DNS request presenting the domain name in a DNS compatible encoding format (presently the reduced set ASCII specified in RFC 1035). See 209. Server 17 then attempts to resolve the DNS compatible domain name. It may use the conventional DNS protocol for this purpose; i.e., to obtain the IP address of the domain name used in the client's communication. If server 17 cannot itself resolve the domain name presented to it, it will attempt to identify another mlDNS server that is authoritative for the domain name under consideration. Regardless of the outcome of operation 209, default mlDNS server 17 then transmits a message back to client 12. See 217. This message may include the IP address of the domain name under consideration or it may include a reference to another mlDNS server.
If it does not include the IP address, [0085] client 12 then sends its DNS request (with the multilingual domain name) to second m1DNS server 19. See 219. Server 19 then attempts to resolve the request locally. See 221. Regardless of its success, it sends a reply to client 12. See 223. That reply will include either the IP address of the multilingual domain name, the name of a referred server, or a failure message. Upon receipt of the reply, client 12 either sends a communication using the IP address of the resolved multilingual domain name or reports a failure to establish a connection (because servers 17 and 19 each failed to resolve the domain name). See 225. It is of course possible that server 19 sent a referral for yet another multilingual domain name server. If that is the case, then client 12 may try to send its multilingual domain name request to the newly referred server.
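The referral-following behavior of client 12 in FIG. 2 can be sketched as a simple loop. This is only an illustration under stated assumptions: the `Reply` record, the `resolve` function, the `max_hops` limit, and the in-memory `servers` table are hypothetical stand-ins for real DNS messages and network transport, none of which are specified here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Reply:
    """Hypothetical reply: an IP address, a referral, or neither (failure)."""
    address: Optional[str] = None
    referral: Optional[str] = None


def resolve(name: bytes,
            default_server: str,
            servers: Dict[str, Callable[[bytes], Reply]],
            max_hops: int = 4) -> Optional[str]:
    """Send the multilingual name to the default server, then follow referrals."""
    server = default_server
    for _ in range(max_hops):          # max_hops is an assumed safety limit
        reply = servers[server](name)  # stands in for operations 207/219
        if reply.address is not None:
            return reply.address       # resolved: client can now connect (225)
        if reply.referral is None:
            return None                # failure message: nothing left to try
        server = reply.referral        # try the newly referred server
    return None
```

For example, with a default server that only refers and a second server that resolves, `resolve` returns the address supplied by the second server; if every server fails, it returns `None`.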
[0086] As indicated above, the domain name must, at some point, be converted from a non-DNS encoding type to a DNS compatible encoding type. In the above examples, this is accomplished with a m1DNS server (or a proxy m1DNS server). This need not be the case, however, as the functionality necessary for conversion may be embodied elsewhere. In alternative embodiments, the functions performed by the m1DNS server are implemented in whole (or in part) on the client and/or on the DNS name server.
In a specific embodiment, operations including detecting an encoding type, translating a non-DNS encoded domain name to a DNS encoded domain name and identifying a default name server (operations [0087] 305-311 of the FIG. 3A flow chart discussed below) are implemented on an Internet application (e.g., a multilingual-enabled Web browser). In this embodiment, code detection and code conversion are automatically done prior to dispatching a DNS resolution request to a DNS name server.
In another specific embodiment, operations [0088] 305-311 can be implemented entirely on a proxy m1DNS server. Other embodiments include collapsing all or some fraction of these operations into a conventional DNS name server. For example, code for some m1DNS functions can be collapsed into BIND code as a compilable module.
In FIG. 2, the conversion of the domain name from one linguistic encoding type to a second linguistic encoding type (compatible with DNS) is performed at [0089] 209 or 221 (depending upon which of servers 17 and 19 is the authoritative server). As shown in FIG. 3A, in accordance with a preferred embodiment of this invention, this conversion may take place via a process 301. The process begins at 303 with the system identifying the encoding type of the domain name in the DNS request. This is necessary when the system may be confronted with multiple different encoding types. Typically, the detection will involve analyzing a bit string making up the domain name under consideration. A preferred approach to this process is described below in detail. However, in alternative embodiments, an application can present explicitly defined linguistic encoding which obviates the need for encoding type detection.
After the encoding type has been identified, the system next determines whether the domain name was encoded in a DNS compatible encoding type at [0090] 305. Currently, that requires determining whether the domain name is encoded in the reduced set ASCII encoding type. If so, further conversion is unnecessary and process control is directed to 311, which will be described below.
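The test at 305 can be illustrated with a short check that a byte string uses only the reduced ASCII set. This is a sketch, not the claimed method: the function name `is_dns_compatible` is hypothetical, and the label rule used here (letters, digits, and interior hyphens, at most 63 characters per label) is an assumption that is slightly more permissive than the preferred syntax of RFC 1035, which also asks that labels start with a letter.

```python
import re

# One label: letters/digits at both ends, hyphens allowed inside, max 63 chars.
_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_dns_compatible(name: bytes) -> bool:
    """Return True if every label of the name is in the reduced ASCII set."""
    try:
        text = name.decode("ascii")    # any byte >= 0x80 is not DNS compatible
    except UnicodeDecodeError:
        return False
    return all(_LABEL.fullmatch(label) is not None for label in text.split("."))
```

A name that passes this check would go directly to 311; any other name would be routed to the translation at 307.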
In the interesting case, the domain name is encoded in a non-DNS format. When this occurs, process control is directed to [0091] 307 where the system translates the domain name to a universal encoding type. In a preferred embodiment, this universal encoding type is Unicode. In this case, the characters identified in the native encoding type are then identified in the Unicode standard and converted to the Unicode digital sequences for those characters.
The newly translated domain name is then further transformed from the universal encoding type to a DNS compatible encoding type. See [0092] 309. Thus, this final encoding type may be the reduced set of ASCII specified in RFC 1035. Note that the translation from the DNS incompatible format to the DNS compatible format takes place in two operations through an intermediate universal encoding type. This two operation procedure will be detailed below. It should be understood, however, that it may be possible to directly convert, in one operation, the DNS incompatible domain name to the DNS compatible domain name. This may be accomplished in a system having multiple conversion algorithms, each designed to convert a specific encoding type to ASCII (or some other future DNS-compatible encoding type). In one example, these algorithms may be modeled after the “Dürst algorithm” identified above. Many other suitable algorithms are known or can be developed with routine effort.
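The two-operation translation (307 followed by 309) might be sketched as follows. The sketch assumes that the encoding type is already known, that Python's built-in codecs stand in for the native-to-Unicode tables, and that the DNS compatible form is a UTF-5 style transformation in which each character is written as hexadecimal digits with the first digit drawn from the letters G through V. The function names, and the choice to leave labels that are already ASCII untouched, are hypothetical.

```python
def to_utf5(text: str) -> str:
    """UTF-5 style transform: hex digits per character, first digit as G..V."""
    out = []
    for ch in text:
        digits = format(ord(ch), "X")                       # no leading zeros
        first = "GHIJKLMNOPQRSTUV"[int(digits[0], 16)]      # marks character start
        out.append(first + digits[1:])
    return "".join(out)


def to_dns_compatible(raw: bytes, encoding: str) -> str:
    """Operation 307 (native -> Unicode), then 309 (Unicode -> reduced ASCII)."""
    text = raw.decode(encoding)                             # 307
    labels = text.split(".")
    # Assumption: labels already in ASCII (such as "com") are passed through.
    return ".".join(l if l.isascii() else to_utf5(l) for l in labels)   # 309
```

Because both operations go through Unicode, the same character yields the same DNS compatible label regardless of which native encoding type the client used.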
With a DNS compatible domain name now in hand, the system need only determine which conventional DNS name server it should forward the domain name to. According to normal DNS protocol, the DNS request might be forwarded to a top-level name server. As will be described in more detail below, it may be convenient to have different root name servers handle different linguistic domains. For example, the Chinese government may maintain a root name server for Chinese language domain names, the Japanese government or a Japanese corporation may maintain a root name server for Japanese language domain names, the Indian government may maintain a root name server for Hindi language domain names, etc. In any event, the system must identify the appropriate name server at [0093] 311 as indicated in FIG. 3A. After this has been accomplished, the conversion process is complete and the DNS request can be transmitted to the DNS system for handling according to convention.
[0094] Preferably, the process depicted in FIG. 3A is performed solely on a m1DNS server. However, some of the process may be performed on a client or a conventional DNS server. For example, 303 and 305 could be performed on a client and 309 could be performed on a conventional DNS server.
[0095] In some preferred embodiments, such as those depicted in FIG. 2, the name server is co-located with the m1DNS server. In that case, operation 311 would involve nothing more than determining that the server performing the encoding type detection and conversion can also resolve a DNS request for the domain name in question.
[0096] A preferred division of labor for the m1DNS function is depicted in FIG. 3B. As shown there, a m1DNS server 327 performs the necessary detection of encoding type and conversion to a DNS compatible format. Server 327 also performs normal DNS resolution. An encoding detection tree (EDT) 321 and associated logic performs the operations of FIG. 3A. In addition, a normal DNS resolution subsystem 323 performs the standard DNS resolving protocol.
The EDT and associated logic detects all necessary linguistic encoding types and can convert all encoding types to Unicode (or other suitable universal encoding type). In the depicted embodiment, a [0097] client 325 submits a domain name for a corresponding node 331 in its native language. The m1DNS server 327 receives the domain name and a conventional DNS resolution sub-system 323 performs the standard DNS resolving protocol. It returns the IP address for corresponding node 331, allowing client 325 to communicate directly with node 331.
[0098] In one implementation, EDT 321 and associated logic runs on a machine (identified by i2.i-dns.com, for example) on a designated port (e.g., port number 2000). It accepts a whole portion of a digitally represented domain name in any linguistic encoding type and returns a whole portion of a digitally represented domain name in Unicode transformed to a DNS encoding type (UTF-5). Normal DNS subsystem 323 returns an IP address for the domain name under consideration.
As indicated in the discussion of FIG. 3A and elsewhere, when the system handles multiple encoding types, it must be capable of distinguishing one encoding type from the next. See [0099] block 303. An example of this process employing an encoding detection tree is detailed in FIGS. 4A-4C.
4. MATCHING ALGORITHM [0100]
[0101] FIGS. 4A-4C depict an embodiment of this invention for identifying an encoding type of a domain name string using an encoding detection tree of this invention. As shown in FIG. 4A, an encoding detection tree 401 (which may also be viewed technically as a “graph”) includes various nodes such as node 403 and connections between those nodes such as “eq” connection 405. FIGS. 4B and 4C present a process flow for using a tree structure such as tree 401 to unambiguously detect an encoding type.
[0102] To understand how the matching algorithm works, consider a very simple registry having only 3 registered domain names: A1A2B1B2.com, C1C2D1D2.com and E1E2F1F2.com. As described in more detail below, a registry with these domain names will be used to generate the tree structure 401 depicted in FIG. 4A. When an international domain name system is presented with a domain name, it detects the encoding type of that domain name using the tree 401. Imagine, for the moment, that the international domain name system receives the domain name A1A2B1B2.com.
Consider now the [0103] process flow 450 depicted in FIG. 4B. At the beginning of the process, the linguistic encoding type, T, is unknown. At 452, the international domain name system (e.g., an m1DNS server as described above) receives the linguistic digit sequence, S, for the domain name of interest and reverses that sequence to produce a reversed digit sequence S′. In FIG. 4A, the reversed linguistic digit sequence S′ is depicted by sequence 407 (for the domain name A1A2B1B2.com). Assume for example that “.com” is represented in ASCII and A1A2 represent one 16 bit Chinese character encoded in BIG5 and B1B2 represent a second Chinese character also encoded in BIG5.
Returning to process [0104] 450 in FIG. 4B, the international domain name system next sets a pointer P1 to a first digit of the reversed sequence S′ and sets a pointer P2 to the root of the encoding detection tree 401. See operation 454. Pointers P1 and P2 are depicted in FIG. 4A. During the course of the encoding detection process, these pointers move from digit to digit (in the case of P1) and from node to node (in the case of P2).
[0105] Next, the international domain name system compares the value at pointer P1 against the value at pointer P2. See 456. This comparison involves the digital values of the character (or portion of a character) from the domain name at the current location of pointer P1 and the linguistic digit represented in the node currently at pointer P2. In the example of FIG. 4A, pointer P2 is initially at node 403, which corresponds to the digital value of m. In the case of a multilingual.com system, the linguistic digit at node 403 will be the ASCII value of the letter “m.” The value at pointer P1 is also the digital value for m. Therefore, the comparison presented at 456 indicates that the values at pointers P1 and P2 are equal. With this outcome, process flow 450 (FIG. 4B) proceeds to 458 where the position of pointer P2 is moved to the equal child node of parent node 403. In this case, the equal child node of parent node 403 is the node 409. This node contains the linguistic digit of the letter “o” in ASCII.
Next, at [0106] 460, the international domain name system determines whether the pointer P2 is currently pointing to a terminal node of the tree structure. If so, it will determine the encoding type from that node. In the current example, however, there are many additional nodes between the pointer P2 and the terminal node. (Examples of terminal nodes are indicated by nodes 421 and 429 in FIG. 4A.) Therefore, decision 460 is answered in the negative. Then process control is directed to a decision 462, which determines whether the next digit in the domain name string represents the end of that string. In the example at hand, the pointer P1 currently points to the m character. Therefore, several more digits exist between the end of string 407 and the current digit. Hence, decision 462 is answered in the negative.
Process control is next directed to block [0107] 464 where the international domain name system moves pointer P1 to the next digit of S′, the reversed character sequence 407. See FIG. 4A. In the example at hand, this results in P1 moving from the letter m to the letter o in sequence 407. Process control is then directed back to decision operation 456.
[0108] This time through the process, the locations of pointers P1 and P2 have changed. P1 points to the letter o and P2 points to the node 409 containing the linguistic digit for the letter o. At 456, the system determines that the digital representations at pointers P1 and P2 are equal. Therefore, as before, the process flow is directed to block 458, where pointer P2 moves down the equal child path to a node 411.
Next, [0109] decision block 460 determines whether the new location of P2 is a terminal node. As it is not in this case, the process moves to decision block 462 where it determines that the next digit in the S′ character string is not the end of that string. Then, process block 464 moves P1 to the next digit in S′, the letter “c.”
[0110] During the next two passes through the process flow 450, the pointer P1 is located at the c and the “.” respectively. At the end of these cycles, the pointer P2 points to the node 413 containing the linguistic digit that is the digital representation of character B2. Also, at this time, the pointer P1 points to the character B2 in reverse sequence 407. Now, the international domain name system compares the values located at pointers P1 and P2. See 456. As before, these values are equal. Therefore, the location of pointer P2 moves down the “eq” path to node 415, which harbors the linguistic digit for B1. Because this is not a terminal node and because the current location of pointer P1 is upstream from the end of reversed sequence 407, the process loops back to decision block 456. It proceeds in this manner through nodes 417 and 419, corresponding to linguistic digits A2 and A1. When proceeding through the loop associated with linguistic digit A1, the pointer P2 moves, at 458, to a terminal node 421. At this point, decision block 460 is answered in the affirmative. As a consequence, process control is directed to a new decision block, decision 466, where it is determined whether there is only one terminal node; in other words, whether only one encoding type is associated with the terminal node.
In the situation at hand, there is only one encoding type associated with the [0111] terminal node 421. Therefore, decision 466 is answered in the affirmative. At 468, the international domain name system identifies the linguistic encoding type, T, associated with the terminal node 421.
It is possible that coincidentally, two domain names, having different encoding types, have the same digital sequence. When this occurs, the terminal node will include two separate encoding types. In this situation, [0112] decision block 466 is answered in the negative. To account for this situation, each encoding type represented at the terminal node is also associated with its unique character string, S. The international domain name system can then search through a list of character strings for the exact match of the sequence S. See 470. When the exact match of the digital sequence S is found, the corresponding encoding type is selected.
[0113] FIG. 4C depicts one example of a process flow that may be employed to search through the list of terminal nodes as depicted at process block 470. In FIG. 4C, a process 480 begins at 482 with selection of the first terminal node in the list of terminal nodes. Next, the process normalizes the sequence S associated with terminal node L based on the linguistic encoding type of that sequence. See 484. Next, the process determines whether the linguistic digital sequence S of the domain name under consideration matches the linguistic digital sequence associated with the terminal node under consideration. See 486. Assuming that this is the case, process control is directed to block 488 where the system returns the terminal node currently visited in the list. This terminal node has an associated encoding type, which is the encoding type of the domain name under consideration.
[0114] Assume for the moment that the comparison rendered at 486 indicates that the sequences are not identical. As a result, process control is directed to decision block 490, which checks to determine whether the end of the list of terminal nodes has been reached. If not, the next terminal node, T, in the list is visited at 492. From there, process control is directed back to block 484 and the process continues as described above. Now, in the case where the end of the list of terminal nodes has been reached but no matching strings have been found, decision block 490 will be answered in the affirmative. As a result, the system sets a pointer to the list of terminal nodes to point to nothing. See 494. From there, process control is directed to 488, which returns no match in this case.
The need for the process of FIG. 4C can be understood as follows. The traversing path may lead to a list of terminal nodes rather than a single match. Therefore, some mechanism to determine which terminal node is correct is required. For example “??.com” is a valid GB encoded binary sequence and also a valid iso8859-1 encoded binary sequence. Both of them have the same traversing path in the encoding detection tree and will be chained up if both were previously inserted into the encoding detection tree. For iso8859-1 encoded characters, upper case and lower case are considered linguistically equivalent, while for GB encoded characters, the case of the character is sometimes significant. Chinese characters in GB will be double bytes. Therefore, these two bytes are case significant. So, for the iso8859-1 encoding type, the detection tree could have a terminal node with detectable string “\0xEC\0xA8\0xED\0xE5\0x2E\0x63\0x6F\0x6D” (all valid iso8859-1 characters are lower-cased) and an encoding type of “iso8859-1”. And, for the GB encoding type, the encoding tree could also have another terminal node with detectable string “\0xCC\0xA8\0xCD\0xE5\0x2E\0x63\0x6F\0x6D” (valid GB characters will be preserved, while ASCII characters will be lower-cased since they are not case significant) and an encoding type of “GB”. Both of them are chained up under the same traversing path. [0115]
If a match request comes in for “\0xCC\0xA8\0xCD\0xE5\0x2E\0x63\0x6F\0x6D”, after lower casing all the ASCII characters, it will match with the GB encoded string exactly and will take precedence over the iso8859-1 encoded string. And if a match request comes in as “\0xCC\0xA8\0xED\0xE5\0x2E\0x63\0x6F\0x6D”, it will not match with the GB encoded string after lower casing all the ASCII characters but will match with the iso8859-1 encoded string exactly after lower casing the iso8859-1 characters and ASCII characters. [0116]
The normalization process (see [0117] 484) will utilize the encoding information contained in the terminal node and lower case characters that are not case significant in the query string and then do exact match on the normalized query string with detectable string stored in the terminal node.
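The normalize-then-compare search of FIG. 4C can be sketched for the two-entry chain discussed above. This is an illustration only: the function names, the representation of the chain as a list of (encoding type, detectable string) pairs, and the rule used to recognize GB double-byte characters (any byte of 0x80 or above starts a two-byte character) are assumptions, not details taken from the figures.

```python
from typing import List, Optional, Tuple


def normalize(query: bytes, encoding: str) -> bytes:
    """Lower-case only the characters that are not case significant (484)."""
    if encoding == "iso8859-1":
        # Every character is case insensitive, including accented letters.
        return query.decode("latin-1").lower().encode("latin-1")
    if encoding == "GB":
        out = bytearray()
        i = 0
        while i < len(query):
            if query[i] >= 0x80:            # assumed lead byte of a double-byte character
                out += query[i:i + 2]       # preserve both bytes unchanged
                i += 2
            else:
                out += bytes([query[i]]).lower()   # plain ASCII: case insensitive
                i += 1
        return bytes(out)
    raise ValueError(f"no normalization rule for {encoding!r}")


def select_encoding(query: bytes,
                    chain: List[Tuple[str, bytes]]) -> Optional[str]:
    """Visit chained terminal entries in order; return the first exact match."""
    for encoding, detectable in chain:      # 482, 492
        if normalize(query, encoding) == detectable:   # 484, 486
            return encoding                 # 488
    return None                             # 494: end of list, no match


CHAIN = [
    ("GB", b"\xCC\xA8\xCD\xE5\x2E\x63\x6F\x6D"),
    ("iso8859-1", b"\xEC\xA8\xED\xE5\x2E\x63\x6F\x6D"),
]
```

With this chain, a query of `\xCC\xA8\xCD\xE5` followed by “.com” selects “GB”, while `\xCC\xA8\xED\xE5` followed by “.com” selects “iso8859-1”. Placing the GB entry first in the list is what gives it precedence in this sketch.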
[0118] Returning now to FIGS. 4A and 4B, consider the possibility where the domain name under question is A1A2B1B2.coM. In this case, the domain name is linguistically equivalent to the previous domain name under consideration. It so happens that it is presented with an upper case letter “M,” rather than the lower case letter “m.” On the first pass through process 450, the pointer P1 points to the M in the sequence, S′, and the pointer P2 points to the root node 403 of tree 401. At 456, where the values at P1 and P2 are compared, the system will discover that the value at P1 is less than the value at P2. This is because the digital sequence representing M has a lower value than the digital sequence representing m. As a result, process control proceeds to a process block 472, which moves the pointer P2 to the low child node branching from root node 403. In this case, that is node 423, populated with the digital sequence associated with the M.
[0119] From block 472, the system next determines whether pointer P2 is pointing to nothing. See decision block 473. In this case, that is not true, so process control is directed back to decision block 456, where the value associated with pointer P1 and the value at the new position of pointer P2 are compared. This time, the values will match, so process control is directed to 458. There, the pointer P2 is moved down the “eq” branch to node 409, where the linguistic digit for the letter “o” resides. The process then proceeds down the tree, loop by loop, until reaching terminal node 421 as described in the previous example.
[0120] Considering the domain name D1E2F1F2.com, the procedure will traverse tree 401 as described above until pointer P1 reaches linguistic digit F2 and pointer P2 reaches node 413, containing linguistic digit B2. At that point, the comparison of the values at P1 and P2 (decision block 456) indicates that the digital value of linguistic digit F2 is greater than the digital value of linguistic digit B2. At this point, process control is directed to a block 474. There, the pointer P2 is moved down the “hi” path to a node 425. The system checks whether P2 points to nothing (473). As that is not the case, process control loops back to 456 where the value of P1 (F2) is compared with the value at the new location of pointer P2 (node 425). This time, the comparison will indicate a match. Then, process block 458 moves pointer P2 down the eq path on tree 401 to a node 427. The process will continue in this manner until reaching a terminal node 429. There, the encoding type of domain name D1E2F1F2.com will be identified.
[0121] One other case of interest should be discussed in the context of matching linguistic digit sequences. Specifically, if one ever encounters a situation where the current location of P2 is not a terminal node but the next digit of the reversed sequence S′ is the end of that sequence, then the encoding type cannot be determined unambiguously. This situation is captured in process 450 when decision block 462 is answered in the affirmative. At that point, a process block 476 returns a “not found” message. The same result occurs when decision block 473 is answered in the affirmative (i.e., P2 points to nothing).
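The traversal of FIG. 4B can be sketched as a lookup in a ternary tree keyed on the reversed byte sequence. Several simplifications are assumed here and are not taken from the figures: the `Node` record and the `insert` helper are hypothetical (ordinary ternary search tree insertion, not the building routine described later), the encoding types are stored on the node holding the last digit rather than on a separate terminal node, and the byte strings used to populate the tree are arbitrary illustrations.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    digit: int                                  # one 8-bit linguistic digit
    lokid: Optional["Node"] = None
    eqkid: Optional["Node"] = None
    hikid: Optional["Node"] = None
    encodings: List[str] = field(default_factory=list)   # non-empty when terminal


def insert(root: Optional[Node], seq: bytes, encoding: str) -> Node:
    """Hypothetical helper: standard ternary insertion of the reversed sequence."""
    rev = seq[::-1]

    def put(node: Optional[Node], i: int) -> Node:
        if node is None:
            node = Node(rev[i])
        if rev[i] < node.digit:
            node.lokid = put(node.lokid, i)
        elif rev[i] > node.digit:
            node.hikid = put(node.hikid, i)
        elif i + 1 < len(rev):
            node.eqkid = put(node.eqkid, i + 1)
        else:
            node.encodings.append(encoding)     # end of sequence: mark terminal
        return node

    return put(root, 0)


def detect(root: Optional[Node], seq: bytes) -> Optional[List[str]]:
    """Return the candidate encoding types, or None for "not found" (476)."""
    rev = seq[::-1]                             # 452: reverse S to get S'
    p1, p2 = 0, root                            # 454: P1 at first digit, P2 at root
    while p2 is not None and p1 < len(rev):     # 473: stop when P2 points to nothing
        if rev[p1] < p2.digit:                  # 456: compare the two digits
            p2 = p2.lokid                       # 472
        elif rev[p1] > p2.digit:
            p2 = p2.hikid                       # 474
        elif p1 == len(rev) - 1:
            return p2.encodings or None         # terminal reached, or 462 -> 476
        else:
            p2 = p2.eqkid                       # 458
            p1 += 1                             # 464
    return None
```

When `detect` returns more than one encoding type, the exact-match search of FIG. 4C would be needed to choose among them.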
5. TREE STRUCTURE [0122]
To completely ensure that every encoding type can be detected, the tree should embody representations of all domain names that are registered with a particular host system—e.g., a particular Internet Service Provider. Thus, in one embodiment, the encoding detection tree is periodically rebuilt when new multilingual domain names are registered. In an extreme example, the tree is recomputed every time a new domain name is registered. More typically, the tree is computed only after a defined number of new domains have been registered since the tree was last computed or a set length of time has expired (e.g., 12 hours) since the tree was last computed. [0123]
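The rebuild schedule described above amounts to a two-condition trigger. A minimal sketch follows; the class name `RebuildPolicy`, its field names, and the default threshold of 1000 pending registrations are assumptions chosen for illustration, while the 12 hour interval comes from the example in the text.

```python
from dataclasses import dataclass


@dataclass
class RebuildPolicy:
    max_pending: int = 1000                 # assumed threshold of new registrations
    max_age_seconds: float = 12 * 3600      # e.g., 12 hours since the last build

    def should_rebuild(self, pending: int, seconds_since_build: float) -> bool:
        """Rebuild when enough names are pending or the tree is too old."""
        return (pending >= self.max_pending
                or seconds_since_build >= self.max_age_seconds)
```

Setting `max_pending` to 1 corresponds to the extreme example in which the tree is recomputed every time a new domain name is registered.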
To ensure that the tree can unambiguously distinguish encoding types for every registered domain name, the registrar may enforce certain restrictions on registration. In a preferred embodiment, two restrictions are imposed. First, the registrar should not register two domain names having the exact same digital sequence. Second, it should not register two domain names that are linguistically equivalent. [0124]
Considering the first restriction, unrelated domain names in different encoding types might coincidentally have the same digital representation. If this situation were allowed to occur, then the encoding detection system would be unable to unambiguously determine the encoding type when presented with one of these domain names. Hence the system could not guarantee that it would return the proper IP address. [0125]
Considering the second restriction, the domain names grasshopper.com and GrassHopper.COM are linguistically equivalent. In the traditional Latin alphabet based domain name system, domain names are case insensitive. Entities and individuals obtaining domain names expect to own rights to all linguistic equivalents of a given name. Hence, the registrar should prevent registration of two linguistically equivalent domain names. To allow this, the encoding detection tree preferably contains paths for multiple linguistic equivalents of a single domain name. [0126]
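The two registration restrictions can be illustrated with a small admission check. This is a sketch under assumptions: the `Registrar` class is hypothetical, Python codecs stand in for the encoding tables, and linguistic equivalence is approximated by lower-casing the decoded name, which captures the grasshopper.com example but not every possible notion of equivalence.

```python
class Registrar:
    """Hypothetical registrar enforcing the two restrictions described above."""

    def __init__(self) -> None:
        self._raw = set()          # exact digital sequences already registered
        self._normalized = set()   # case-folded names already registered

    def register(self, raw: bytes, encoding: str) -> bool:
        key = raw.decode(encoding).lower()
        # Restriction 1: same digital sequence, regardless of encoding type.
        # Restriction 2: linguistically equivalent to an existing name.
        if raw in self._raw or key in self._normalized:
            return False
        self._raw.add(raw)
        self._normalized.add(key)
        return True
```

For example, after grasshopper.com is registered, an attempt to register GrassHopper.COM is refused, and a byte sequence already registered under one encoding type is refused when presented again under another.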
Preferably, the tree is designed considering one or more of four objectives: [0127]
1. Enable the Domain Name System with multilingual capability; [0128]
2. Require a reasonably short period of time to build a data structure for Domain Name System hosting large number of domain names; [0129]
3. Require relatively little memory consumption for a Domain Name System hosting large number of domain names; and [0130]
4. Enable efficient detection of the linguistic encoding type of a digital sequence. [0131]
In a preferred embodiment, the encoding detection tree is extended from a data structure called a “ternary search tree.” Each node of the tree is associated with a record holding a single linguistic digit and pointers to its children nodes. The linguistic digit represents the digital encoding of a particular character (or a portion of that character) in a particular encoding type. The size of the linguistic digit used in the nodes is chosen to balance rapid searching against low memory usage. At one extreme, each node contains a one bit linguistic digit. This structure could be searched very fast, but would occupy too much memory. Trees having 16 bit linguistic digits at each node would occupy less memory, but would be searched more slowly. In a specific embodiment, each node of the tree includes 8 bits. [0132]
In the example shown in FIG. 4A, each node can have at most three children nodes, which are named “lokid”, “eqkid” and “hikid”. During encoding type detection, as explained, the linguistic digit stored with a node is compared against the digital sequences of characters in the domain name under analysis. “Lokid” will be visited if the digital value of the linguistic digit from the appropriate position of the incoming digital sequence is less than the value of the linguistic digit held by the current node of the tree. If both of the digits have the same value, “eqkid” is visited. When the digital value of the linguistic digit from the appropriate position of the incoming digital sequence is greater than the value of the linguistic digit held by the current node of the tree, “hikid” will be visited. Non-ternary search trees may be employed in alternative embodiments. [0133]
FIG. 4D depicts a slightly more complex version of a ternary encoding detection tree. This tree would be suitable for a multilingual domain name system that registers “.com” and “.tm” top level domain names. The tree itself would be traversed in the manner described above for [0134] tree 401 of FIG. 4A.
6. BUILDING A TREE STRUCTURE [0135]
[0136] Referring to FIG. 5A, an algorithm 500 for building the encoding detection tree (EDT) will be described. The algorithm 500 is executed by a computer system (e.g., system 700, which will be described later referring to FIG. 7) periodically, for example, when the system adds a new linguistic digit sequence of a domain name to its encoding detection tree. An exemplary process in which the system ultimately creates the EDT shown in FIG. 4A will be described in detail below. Here, it is supposed that the system receives three linguistic digit sequences, namely, “A1A2B1B2.com,” “C1C2D1D2.com,” and “D1E2F1F2.com” in this order. The EDT may be stored in various types of memory devices in the system as long as its topological structure is preserved exactly.
[0137] ADDING “A1A2B1B2.com” TO EDT
[0138] First, the process for adding the linguistic digit sequence “A1A2B1B2.com” to the EDT will be described below referring to FIGS. 4A and 5A-5V.
In block [0139] 501, the routine 500 receives a linguistic digit sequence S (i.e., A1A2B1B2.com), and a linguistic encoding type of the linguistic digit sequence S (e.g., GB2312). The system then reverses the order of the linguistic digit sequence S, and substitutes the reversed sequence of S for the reversed linguistic digit sequence S′. In block 501, the system substitutes a linguistic encoding type of the linguistic digit sequence S for a linguistic encoding type T. Here, the reversed linguistic sequence S′, and the linguistic encoding type T are “moc.B2B1A2A1,” and “GB2312,” respectively. FIG. 5B illustrates the reversed linguistic sequence S′, and the linguistic encoding type T.
The system may store the linguistic encoding type T, which is represented in ASCII characters. Alternatively, the system may store a code corresponding to the linguistic encoding type T in order to reduce a memory space for storing the variable T. [0140]
In block 503, the routine 500 initializes pointers P1, P1′, P2, and P2′ as indicated in FIG. 5B. The pointer P1, which points to a current position in the linguistic digit sequence, is set to the first digit of the sequence S′, namely, “m.” The pointer P1′ points to a linguistic equivalent of the linguistic digit pointed to by the pointer P1, which is “M.” As mentioned, a linguistic equivalent of a linguistic digit in ASCII code is an upper case letter of the linguistic digit. The pointer P2, which points to a currently-visited node in the EDT, is initially set to a root of the EDT. The pointer P2′, which points to a currently-visited node in the EDT for insertion of the linguistic digit at the location of pointer P1′, is also set to the root of the EDT. Since the EDT has no nodes yet, the pointers P2 and P2′ point to a null node 539 shown by the broken line. [0141]
The system stores the linguistic equivalent of the linguistic digit (e.g., “M”) in a single-digit (e.g., one-byte) buffer. Alternatively, the system may store the linguistic equivalent of the linguistic digit in a row of a multiple-digit buffer having the same structure as the one where the sequence S′ is stored. In both cases, if the linguistic equivalent does not exist, those buffers do not store any digit. [0142]
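A minimal sketch of the linguistic-equivalent lookup described above, assuming that only lower case ASCII letters have an equivalent (their upper case form) and that every other digit has none. The function name linguistic_equivalent and the use of None for the “empty value” are illustrative choices, not taken from the specification.

```python
def linguistic_equivalent(digit):
    """Return the linguistic equivalent of a digit, or None if it has none.

    Assumption: only single lower case ASCII letters have an equivalent,
    namely the corresponding upper case letter.
    """
    if len(digit) == 1 and "a" <= digit <= "z":
        return digit.upper()
    return None  # models the "empty value" the pointer P1' refers to
```

With this convention, “m” maps to “M,” while “.” and a multi-byte digit such as “B2” map to no value at all.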
Block 505 checks whether the linguistic digit pointed to by the pointer P1 exists in the EDT by calling a subroutine 540 labeled “check_existence(D, P),” which will be described in detail below referring to FIG. 5C. The subroutine 540 takes two local variables: D, which is a linguistic digit passed from the main routine 500 to be checked for existence, and P, which is a pointer to the currently-visited node of the EDT (e.g., P2). The subroutine 540 returns the local variable P as a computed result to the main routine 500. [0143]
FIG. 5C illustrates the subroutine 540, check_existence(D, P), for checking whether a linguistic digit has already been inserted into the EDT. Specifically, in block 505, the subroutine 540 takes a linguistic digit Value(P1) and the pointer P2, and returns the result as a pointer P3. In this specification, the function “Value(P1)” returns the linguistic digit to which P1 points. In the case of FIG. 5B, Value(P1) and the pointer P2 are passed as the linguistic digit D and the pointer P, respectively. A decision 541 is made based on whether P2 points to a null node. Here, since the pointer P2 points to the null node 539 as shown in FIG. 5B, control moves to block 549. Block 549 creates a new empty node 559, shown by the solid line in FIG. 5D, at the position to which the pointer P2 points. [0144]
Here, a “null node” (e.g., 539) means that the pointer P2 points to a position where a node has not been created, while an “empty node” (e.g., 559) means that the pointer P2 points to a node which stores no linguistic digit therein. In the drawings of this document, a null node is represented by a circle drawn with a broken line, and an empty node is represented by a circle drawn with a solid line. Next, block 551 returns the position of the empty node 559 to which the pointer P2 points to block 505 as the pointer P3, as shown in FIG. 5D. In other words, the pointer P3 points to the same node as the pointers P2 and P2′ do, which is the empty node 559. [0145]
Referring back to FIG. 5A, a decision 507 is made based on whether the pointer P3 points to an empty node. Since the subroutine 540 returns the pointer P3, which points to the empty node 559, control moves to block 509. Block 509 assigns the linguistic digit Value(P1) to Value(P3). Here, the linguistic digit Value(P1), i.e., “m,” is stored in the node to which the pointer P3 points, i.e., the formerly empty node 559, as shown in FIG. 5E. Then, control moves to a decision 511. [0146]
The decision 511 is made based on whether the linguistic digit which the pointer P1′ points to is an “empty value,” or has no value. Here, the pointer P1′ points to a linguistic digit of “M” as shown in FIG. 5B, which is not a digit with an empty value, and thus, control moves to block 513. Block 513 checks whether the linguistic digit pointed to by the pointer P1′ exists in the EDT by calling the subroutine 540. Referring to FIG. 5C again, the decision 541 is made based on whether the pointer P2′ points to a null node. Here, since the pointer P2′ points to the node 559, which now stores a linguistic digit of “m” as shown in FIG. 5E, control moves to a decision 543. The decision 543 is made based on whether the pointer P2′ points to an empty node. Since the pointer P2′ points to the node 559 storing the linguistic digit “m,” which is not an empty node, control moves to a decision 545. [0147]
The decision 545 is made based on whether Value(D) is greater than, equal to, or less than Value(P), where the local variables D and P are the pointers P1′ and P2′ from the main routine 500, respectively. Here, the pointer P1′ points to “M” (FIG. 5B), and the pointer P2′ points to “m” (FIG. 5E). Supposing that Value(“M”)<Value(“m”), control proceeds to block 547, which moves the pointer P to a “lokid” child node of the “m” node (i.e., the node 559) which the pointer P (i.e., P2′) originally points to, as shown in FIG. 5F. The lokid child node of the “m” node has no node, which is indicated by the broken line in FIG. 5F. Control then returns to the decision 541. [0148]
At the decision 541, the pointer P now points to a null node since the lokid of the “m” node has no node. Thus, control goes to block 549, where the check_existence routine 540 creates a new empty node at the position to which the pointer P points. As a result, after block 549 has been executed, the pointer P points to the new empty node at the lokid of the “m” node as shown in FIG. 5G. Then, control moves to block 551, which returns the position of the pointer P as the pointer P3′ as shown in FIG. 5H. [0149]
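The descent performed by check_existence (decisions 541, 543, and 545 and blocks 547, 549, 551, and 553) can be sketched as below. This is a sketch under stated assumptions: the Node and Tree classes are hypothetical, a pointer to a position is modeled as an (owner, slot name) pair so that a null position can be filled in place, and digits are compared with Python's ordinary string ordering in place of Value().

```python
class Node:
    """One EDT node; field names follow the text, the class is hypothetical."""
    def __init__(self):
        self.digit = None   # None models an "empty node" (no digit stored yet)
        self.lokid = None   # a child slot holding None models a "null node"
        self.eqkid = None
        self.hikid = None

class Tree:
    """Hypothetical holder so the root position is an assignable slot too."""
    def __init__(self):
        self.root = None

def check_existence(digit, owner, slot):
    """Return the node for `digit`, starting at the position owner.slot."""
    while True:
        node = getattr(owner, slot)
        if node is None:                      # decision 541: null node
            node = Node()                     # block 549: create an empty node
            setattr(owner, slot, node)
            return node                       # block 551
        if node.digit is None:                # decision 543: empty node
            return node                       # block 551
        if digit == node.digit:               # decision 545: equal values
            return node                       # block 551
        # Blocks 547 and 553: step to the lokid or hikid position and loop.
        owner, slot = node, ("lokid" if digit < node.digit else "hikid")

# Mirrors FIGS. 5B-5I: "m" lands at the root, "M" at the lokid of "m".
tree = Tree()
p3 = check_existence("m", tree, "root")
p3.digit = "m"                                # block 509
p3_eq = check_existence("M", tree, "root")
p3_eq.digit = "M"                             # block 517
```

Returning the node rather than a yes/no answer lets the caller distinguish a freshly created empty node (which must be filled) from a node that already holds the digit.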
Control returns from the subroutine 540 and proceeds to a decision 515 in the main routine 500. The decision 515 is made based on whether the pointer P3′ points to an empty node. Since the subroutine 540 returns the new empty node as the pointer P3′, control moves to block 517. Block 517 assigns the linguistic digit Value(P1′) to Value(P3′). Here, the linguistic digit Value(P1′), i.e., “M,” is stored in the node to which the pointer P3′ points, as shown in FIG. 5I. Then, control moves to a decision 519. [0150]
The decision 519 is made based on whether an eqkid of the node which the pointer P3 points to is equal to an eqkid of the node which the pointer P3′ points to. As an exceptional rule, if both (i) an eqkid of the node which the pointer P3 points to, and (ii) an eqkid of the node which the pointer P3′ points to are null nodes, control takes the “NO” branch and proceeds to block 521. Block 521 calls a subroutine 560, which is labeled “adjust_subgraph(P, P′),” as described in detail referring to FIG. 5J. [0151]
The subroutine 560 takes two local variables: P, which is a pointer to a linguistic digit in the EDT, and P′, which is a pointer to a linguistic equivalent of the linguistic digit in the EDT. A decision 555 is made based on whether both (i) the “eqkid” child node of the node which the pointer P3 points to, and (ii) the “eqkid” child node of the node which the pointer P3′ points to, are null nodes. Here, as shown in FIG. 5I, (i) the eqkid child node of the node which the pointer P3 points to is a null node, and (ii) the eqkid child node of the node which the pointer P3′ points to is a null node. Thus, control proceeds to block 557. Block 557 creates a single new empty node that serves as the eqkid child node of both (i) the node which the pointer P3 points to, and (ii) the node which the pointer P3′ points to, as shown in FIG. 5K. [0152]
Control returns from the subroutine 560 and proceeds to a decision 523 in the main routine 500. The decision 523 is made based on whether the pointer P1 has reached the end of the linguistic digit sequence S′. As shown in FIG. 5B, the pointer P1 still points to the first linguistic digit of the linguistic digit sequence S′. Thus, control moves to block 525. In block 525, the system sets the pointers P1, P1′, P2, and P2′ to add the next digit in the sequence to the EDT as shown in FIG. 5L. In block 525, the system moves the pointer P1 to the next linguistic digit of the linguistic digit sequence S′, namely, “o.” The system sets the pointer P1′ to the linguistic equivalent of the linguistic digit pointed to by the pointer P1, which is “O.” The system also sets the pointers P2 and P2′ to the empty node which was created as the eqkid child node of the “m” node and the “M” node. Then, control returns to block 505. [0153]
In block 505, the main routine 500 again calls the subroutine check_existence(Value(P1), P2). At block 541 of the subroutine 540, the pointer P2 does not point to a null node as shown in FIG. 5L, and thus, control moves to block 543. At block 543, the pointer P2 points to an empty node as shown in FIG. 5L. Therefore, control moves to block 551, which returns the local variable P, i.e., the pointer P2 which points to the empty node, to block 505 of the main routine 500. In other words, the system sets the pointer P3 to the same node as the pointers P2 and P2′ point to, as shown in FIG. 5M. Control returns from the subroutine 540 to block 505 of the main routine 500, and proceeds to block 507. [0154]
At this point, as shown in FIG. 5M, the pointer P3 points to an empty node. Thus, decision 507 causes control to move to block 509, which assigns Value(P1) to Value(P3). Here, the system stores the linguistic digit “o” in the node which the pointer P3 points to as shown in FIG. 5N. Control moves on to the decision 511. In block 511, since the linguistic digit which the pointer P1′ points to is not empty, control moves to block 513, which again calls the subroutine check_existence(Value(P1′), P2′). Control jumps from block 513 of the main routine 500 to block 541 of the subroutine 540. [0155]
In block 541, the pointer P2′ does not point to a null node as shown in FIG. 5N, and thus control moves to block 543. At block 543, the pointer P2′ does not point to an empty node, and thus control proceeds to block 545. Block 545 compares Value(P1′) (i.e., “O”) and Value(P2′) (i.e., “o”). Supposing that Value(“O”)<Value(“o”), control proceeds to block 547, which moves the pointer P to a “lokid” child node of the “o” node which the pointer P (i.e., P2′) originally points to, as shown in FIG. 5O. The lokid child node of the “o” node has no node, which is indicated by the broken line in FIG. 5O. Control then returns to the decision 541. [0156]
The pointers P1, P1′, P2, P2′, and P shown in FIG. 5O are located similarly to those in FIG. 5F. Therefore, the system creates the EDT illustrated in FIG. 5P by executing the above scheme for another iteration as described referring to FIGS. 5F-5O. FIG. 5P shows the pointers and variables of the system immediately after the execution of block 509. [0157]
The “.” (dot) which the pointer P1 points to has no linguistic equivalent. Therefore, as indicated in FIG. 5P, the pointer P1′, which is a pointer to the linguistic equivalent of Value(P1), points to an empty value. Referring back to FIG. 5A, in block 511, the pointer P1′ points to an empty value, and thus, control moves to block 523. Unlike other characters (e.g., “m,” “o,” and “c”) which have linguistic equivalents, the “.” leads to blocks 523 and 525, bypassing blocks 513, 515, and 517. Blocks 523 and 525 move the pointers to the next character without adding a portion of the tree structure representing the linguistic equivalents (e.g., “M,” “O,” and “C”). As described earlier, for the other characters having linguistic equivalents, like “m,” “o,” and “c,” the portions representing the linguistic equivalents, namely, “M,” “O,” and “C,” are created by blocks 513, 515, and 517. [0158]
In block 523, the pointer P1 still has not reached the end of the linguistic digit sequence S′, and control moves to block 525. Block 525 sets the pointers P1, P1′, P2, and P2′ as indicated in FIG. 5Q, and control returns to block 505. Block 505 creates an empty node at the position to which the pointer P2 points, and sets the pointer P3 to the newly created empty node as shown in FIG. 5R. The scheme in the subroutine 540 is similar to that described referring to FIGS. 5L and 5M. Then, control returns from the subroutine 540 to the decision 507 of the main routine 500. [0159]
In block 507, the pointer P3 points to an empty node as shown in FIG. 5R, and thus, control moves to block 509. Block 509 stores the linguistic digit which the pointer P1 points to, i.e., B2, in the node which the pointer P3 points to, as shown in FIG. 5S. Then, in the decision 511, the pointer P1′ points to an empty value. Thus, control moves to block 523. In the decision 523, the end of the linguistic digit sequence has not been reached yet, and thus, control continues to block 525. Block 525 sets the pointers P1, P1′, P2, and P2′ as shown in FIG. 5T. [0160]
By iterating the process described referring to FIGS. 5Q-5T, the system creates the three nodes “B1,” “A2,” and “A1” under the eqkid node of the “B2” node, as shown in FIG. 5U. Control moves on to blocks 511, and then 523. In the decision 523, this time, the pointer P1 points to the end of the reversed linguistic digit sequence S′, and therefore, control proceeds to a decision 527. The decision 527 is made based on whether the tree structure of “m-M-o-O-c-C-.-B2-B1-A2-A1” has a terminal node. Here, the tree structure has no terminal node, and thus, control proceeds to block 529. [0161]
Block 529 creates a new terminal node N1 which is associated with the tree structure “m-M-o-O-c-C-.-B2-B1-A2-A1”. The terminal node N1 contains the linguistic digit sequence S and the linguistic encoding type T of the sequence S. The terminal node N1 is sometimes referred to as a “leaf” node of the tree structure, which uniquely identifies distinct encoding types, thereby unambiguously specifying the linguistic encoding type T of the domain name corresponding to the tree pathway (e.g., “m-M-o-O-c-C-.-B2-B1-A2-A1”). The system calls subroutines detect_string and detect_type, which return the linguistic digit sequence S and the linguistic encoding type T, respectively, to the main routine 500. FIG. 5V illustrates the tree structure “m-M-o-O-c-C-.-B2-B1-A2-A1” and the associated terminal node N1 containing the linguistic digit sequence S and the linguistic encoding type T, which are “A1A2B1B2.com” and “GB2312,” respectively. [0162]
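Putting the steps of routine 500 together, the insertion of one sequence might be sketched as follows. Everything here is illustrative: the class and function names, the (owner, slot) modeling of pointers, the list of (sequence, encoding type) pairs standing in for terminal nodes, and the use of Python string ordering in place of Value() are all assumptions. The handling of the case where only one eqkid is null is likewise assumed, and the reinsertion case (both eqkids present and different) is deliberately left out.

```python
class Node:
    def __init__(self):
        self.digit = None                            # None: "empty node"
        self.lokid = self.eqkid = self.hikid = None  # None: "null node"
        self.terminals = []                          # (sequence S, type T) pairs

class Tree:
    def __init__(self):
        self.root = None

def equivalent(digit):
    # Assumption: only lower case ASCII letters have a linguistic equivalent.
    return digit.upper() if len(digit) == 1 and "a" <= digit <= "z" else None

def check_existence(digit, owner, slot):
    while True:
        node = getattr(owner, slot)
        if node is None:                             # null: create empty node
            node = Node()
            setattr(owner, slot, node)
            return node
        if node.digit is None or node.digit == digit:
            return node
        owner, slot = node, ("lokid" if digit < node.digit else "hikid")

def insert(tree, digits, encoding):
    owner, slot, last = tree, "root", None
    for digit in reversed(digits):                   # block 501: walk S'
        p3 = check_existence(digit, owner, slot)     # block 505
        p3.digit = digit                             # block 509 (no-op if present)
        eq = equivalent(digit)
        if eq is not None:                           # decision 511
            p3_eq = check_existence(eq, owner, slot) # block 513
            p3_eq.digit = eq                         # block 517
            # Decision 519 and adjust_subgraph: make both nodes share one eqkid.
            if p3.eqkid is None and p3_eq.eqkid is None:
                p3.eqkid = p3_eq.eqkid = Node()      # block 557
            elif p3.eqkid is None:
                p3.eqkid = p3_eq.eqkid
            elif p3_eq.eqkid is None:
                p3_eq.eqkid = p3.eqkid
            elif p3.eqkid is not p3_eq.eqkid:
                raise NotImplementedError("reinsertion is not sketched here")
        owner, slot, last = p3, "eqkid", p3          # block 525
    last.terminals.append(("".join(digits), encoding))  # blocks 529-533

tree = Tree()
insert(tree, ["A1", "A2", "B1", "B2", ".", "c", "o", "m"], "GB2312")
insert(tree, ["C1", "C2", "D1", "D2", ".", "c", "o", "m"], "BIG5")
```

Note that Python orders the token “D2” after “B2,” so in this sketch the second name branches to the hikid of the “B2” node; the text instead supposes Value(“D2”)<Value(“B2”) and branches to the lokid. The shape of the divergence is the same either way.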
ADDING “C1C2D1D2.com” TO EDT [0163]
Now the process for adding the linguistic digit sequence “C1C2D1D2.com” to the EDT, which already contains the sequence “A1A2B1B2.com,” will be described referring to FIGS. 4A, 5A, and 5W-5AG. [0164]
In block 501, the routine 500 receives a linguistic digit sequence S (i.e., C1C2D1D2.com), and a linguistic encoding type of the linguistic digit sequence S (e.g., BIG5). The computer system then reverses the order of the linguistic digit sequence S and assigns the reversed sequence to the reversed linguistic digit sequence S′. Also in block 501, the system assigns the linguistic encoding type of the sequence S to a linguistic encoding type T. Here, the reversed linguistic sequence S′ and the linguistic encoding type T are “moc.D2D1C2C1” and “BIG5,” respectively. FIG. 5W illustrates the reversed linguistic sequence S′ and the linguistic encoding type T. [0165]
In block 503, the routine 500 initializes pointers P1, P1′, P2, and P2′ as indicated in FIG. 5W. The pointer P1, which points to a current position in the linguistic digit sequence, is set to the first digit of the sequence S′, namely, “m.” The pointer P1′ points to a linguistic equivalent of the linguistic digit pointed to by the pointer P1, which is “M.” The linguistic equivalent of a linguistic digit in ASCII code is a capital letter of the linguistic digit. The pointer P2, which points to a currently-visited node in the EDT, is set to a root of the EDT, namely a node storing “m.” The pointer P2′, which points to a currently-visited node in the EDT for insertion of the digit at the pointer P1′, is also set to the root of the EDT, i.e., the “m” node. [0166]
Block 505 checks whether the linguistic digit pointed to by the pointer P1 exists in the EDT by calling the subroutine 540 check_existence(D, P). Referring again to FIG. 5C, the decision 541 is made based on whether the pointer P, i.e., the pointer P2, points to a null node. Here, the pointer P2 points to a node with a linguistic digit of “m,” and thus, control moves to the decision 543. The decision 543 is made based on whether the pointer P points to an empty node. The pointer P2 points to the “m” node, not an empty node, and thus, control proceeds to the decision 545. The decision 545 is made based on whether Value(D) is greater than, equal to, or less than Value(P), where the local variables D and P are the pointers P1 and P2 from the main routine 500, respectively. Here, the pointer P1 points to “m,” and the pointer P2 points to “m.” Therefore, control moves to block 551, which returns the position of the pointer P2 to the main routine 500 as the pointer P3. [0167]
In the decision 507, the pointer P3 points to the “m” node as shown in FIG. 5X, which is not an empty node, and thus, control moves to the decision 511. The decision 511 is made based on whether the pointer P1′ points to an empty value. Here, the pointer P1′ points to a linguistic digit, “M,” which is not an empty value, and thus, control moves on to block 513. In block 513, the subroutine 540 checks whether the EDT has already stored the linguistic digit which the pointer P1′ points to. Specifically, referring to FIG. 5C, the decision 541 checks if the pointer P2′ points to a null node. The pointer P2′ points to an “m” node as shown in FIG. 5X, and thus, control moves to the decision 543. [0168]
The decision 543 checks if the pointer P2′ points to an empty node. Since the pointer P2′ does not point to an empty node, control moves to the decision 545. Here, the relationship Value(D)<Value(P) (i.e., Value(“M”)<Value(“m”)) is satisfied, and thus control moves to block 547, where the system moves the pointer P to the lokid node of the “m” node which the pointer P2′ currently points to. Then, the subroutine 540 returns the position of the pointer P to the main routine 500 as the pointer P3′. As a result, the pointers P and P3′ are set to the “M” node of the EDT as shown in FIG. 5X. [0169]
Control returns from the subroutine 540 to block 513 of the main routine 500. Next, in the decision 515, the pointer P3′ points to an “M” node, and thus, control moves to the decision 519. The decision 519 is made based on whether the eqkid node of the node which the pointer P3 points to and the eqkid node of the node which the pointer P3′ points to are the same or not. Here, the pointer P3 points to the “m” node, and the pointer P3′ points to the “M” node. The eqkid nodes of the “m” node and the “M” node are the same, namely, the “o” node, as shown in FIG. 5X. Consequently, control moves to the decision 523, where control proceeds further to block 525 since the end of the sequence S′ has not been reached yet. [0170]
In block 525, the system sets the pointers P1, P1′, P2, and P2′ to add the next digit in the sequence to the EDT as shown in FIG. 5Y. Control loops back to block 505 to check if the EDT has the linguistic digit which the pointer P1 points to. Since the pointer P2 points to the “o” node now, control proceeds to the decisions 541, 543, and 545. By comparing the value of the “o” pointed to by the pointer P1 with the value of the “o” pointed to by the pointer P2, control moves to block 551, which returns the position of the pointer P2 to the block 505 of the main routine 500 as the pointer P3, as shown in FIG. 5Z. In the decision 507, since the pointer P3 does not point to an empty node, control goes on to the decision 511, which checks whether the pointer P1′ points to an empty value. Here, the pointer P1′ points to an “O” digit, and thus, control moves to block 513. [0171]
Block 513 checks if the EDT has the linguistic digit which the pointer P1′ points to. Since the pointer P2′ points to the “o” node now, control proceeds to the decisions 541, 543, and 545. Here, the relationship Value(D)<Value(P) (i.e., Value(“O”)<Value(“o”)) is satisfied, and thus control moves to block 547, where the system moves the pointer P to the lokid node of the “o” node which the pointer P2′ currently points to. Then, the subroutine 540 returns the position of the pointer P to the main routine 500 as the pointer P3′. As a result, the pointers P and P3′ are set to the “O” node of the EDT as shown in FIG. 5Z. [0172]
By repeating the process described referring to FIGS. 5X-5Z, the pointers P1, P1′, P2, P2′, P3, and P3′ are set as shown in FIG. 5AA, which is similar to FIG. 5Y, when the block 525 is executed. Control then loops back to block 505 to check if the EDT has the linguistic digit “.” (dot). Control jumps from the main routine 500 to the subroutine 540, where control proceeds to decisions 541, 543, and 545, and the block 551. As a result, the subroutine 540 returns the position of the pointer P to the main routine 500 as the pointer P3, thereby setting the pointer P3 as shown in FIG. 5AB. [0173]
Control returns to decision 507 of the main routine 500, and goes directly to the decision 511 since the pointer P3 does not point to an empty node. In the decision 511, the pointer P1′ points to an empty value since the linguistic digit “.” has no linguistic equivalent, as shown in FIG. 5AB. Thus, control moves to the decision 523, and then block 525 since the end of the sequence S′ has not been reached. Block 525 sets the pointers P1, P1′, P2, and P2′ as shown in FIG. 5AC. [0174]
Control loops back to block 505, which calls the subroutine 540. Control moves to the decisions 541, 543, and 545 since the pointer P2 points to neither a null node nor an empty node. The decision 545 is made based on whether Value(D) is greater than, equal to, or less than Value(P), where the local variables D and P are the pointers P1 and P2 from the main routine 500, respectively. [0175]
Here, the pointer P1 points to “D2”, and the pointer P2 points to “B2.” Supposing that Value(“D2”)<Value(“B2”), control proceeds to block 547, which moves the pointer P to a “lokid” child node of the “B2” node which the pointer P (i.e., the pointer P2) originally points to, as shown in FIG. 5AC. The lokid child node of the “B2” node has no node, which is indicated by the broken line in FIG. 5AC. Control then returns to the decision 541. Significantly, since the linguistic digit “D2” does not have a value equal to that of the linguistic digit “B2,” the structure of the EDT diverges for the first time at the “B2” node. This causes the subsequent nodes for C1C2D1D2.com to follow a unique path different from that of A1A2B1B2.com below the “.” level, as shown in FIG. 5AG. [0176]
At the decision 541, the pointer P now points to a null node since the lokid of the “B2” node has no node. Thus, control goes to block 549, which creates a new empty node at the position to which the pointer P points. As a result, after block 549, the pointer P points to the new empty node at the lokid of the “B2” node as shown in FIG. 5AC. Then, control moves to block 551, which returns the position of the pointer P as the pointer P3 as shown in FIG. 5AD. [0177]
Referring back to FIG. 5A, the decision 507 is made based on whether the pointer P3 points to an empty node. Since the subroutine 540 returns the pointer P3 which points to the empty node, control moves to block 509. Block 509 assigns the linguistic digit Value(P1) to Value(P3). Here, the linguistic digit Value(P1), i.e., “D2,” is stored in the node to which the pointer P3 points, as shown in FIG. 5AE. Then, control moves to the decision 511. [0178]
In the decision 511, the pointer P1′ points to an empty value, and thus control goes to the decision 523, and then block 525, which sets the pointers P1, P1′, P2, and P2′ as shown in FIG. 5AF. The positions of the pointers P1, P1′, P2, and P2′ are similar to those of FIG. 5T. Accordingly, by iterating the process described referring to FIGS. 5T-5V, the routine 500 creates the EDT as illustrated by FIG. 5AG. [0179]
ADDING “D1E2F1F2.com” TO EDT [0180]
Now the process for adding the linguistic digit sequence “D1E2F1F2.com” to the EDT, which already contains the linguistic digit sequences “A1A2B1B2.com” and “C1C2D1D2.com,” will be described below. The computer system adds the sequence “D1E2F1F2.com” to the EDT by executing the main routine 500, which calls the subroutine 540, in a manner similar to that described in detail above referring to FIGS. 5W-5AG with respect to adding “C1C2D1D2.com.” Adding the linguistic digit sequence “D1E2F1F2.com” to the EDT is similar to adding the sequence “C1C2D1D2.com” to the EDT with the exception that the node “F2” is created in a hikid node of the node “B2” shown in FIG. 4A. [0181]
Specifically, after block 525 sets the pointers P1, P1′, P2, and P2′ as shown in FIG. 5AH, control loops back to block 505, which calls the subroutine 540. Control moves to the decisions 541, 543, and 545 since the pointer P2 points to neither a null node nor an empty node. The decision 545 is made based on whether Value(D) is greater than, equal to, or less than Value(P), where the local variables D and P are the pointers P1 and P2 from the main routine 500, respectively. Here, the pointer P1 points to “F2”, and the pointer P2 points to “B2.” Supposing that Value(“B2”)<Value(“F2”), control proceeds to block 553, which moves the pointer P to a “hikid” child node of the “B2” node which the pointer P (i.e., the pointer P2) originally points to, as shown in FIG. 5AH. The hikid child node of the “B2” node has no node, which is indicated by the broken line in FIG. 5AH. Control then returns to the decision 541. [0182]
At the decision 541, the pointer P now points to a null node since the hikid of the “B2” node has no node. The process following the decision 541 is similar to that described referring to FIGS. 5AC-5AG, and ultimately creates the EDT illustrated in FIG. 4A. [0183]
Finally, a hypothetical case will be described below. Suppose that a new domain name “X1X2Y1Y2.com” encoded in BIG5 is being added to the EDT 401 shown in FIG. 4A, and that the values of the linguistic digits X1, X2, Y1, and Y2 are exactly the same as those of the linguistic digits A1, A2, B1, and B2, which are encoded in GB2312. This is a rare case, but it can happen, especially when characters with the same value are used in different linguistic encoding types. [0184]
In this case, control follows nodes 403, 409, 411, 413, 415, 417, and 419 since each of the values of the linguistic digits for A1A2B1B2.com matches the corresponding one of the values of the linguistic digits for X1X2Y1Y2.com. Finally, when the pointer P1 points to the end of the string S′, namely, X1, control takes the “YES” branch of the decision 523 in FIG. 5A. Unlike the cases for A1A2B1B2.com, C1C2D1D2.com, and D1E2F1F2.com, at decision 527, control moves on to block 531 since the terminal node 421 is not null. Specifically, the terminal node 421 contains the linguistic digit sequence S, which is “A1A2B1B2.com,” and the linguistic encoding type T, which is “GB2312.” [0185]
Block 531 creates a new terminal node 431 which is associated with the tree structure “m-M-o-O-c-C-.-Y2-Y1-X2-X1”. The terminal node 431 contains the linguistic digit sequence and the linguistic encoding type, which are “X1X2Y1Y2.com” and “BIG5,” respectively. In block 533, the new terminal node 431 is appended at the end of the tree structure of “A1A2B1B2.com.” As a result, the tree has two leaf nodes on this path, namely the terminal nodes 421 and 431. FIG. 5AJ illustrates the completed EDT, in which the values of corresponding linguistic digits match each other for two different domain names. [0186]
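The handling of terminal nodes at the end of a path (decision 527 and blocks 529, 531, and 533) can be illustrated in isolation. The PathEnd class below is a hypothetical stand-in for the last node of a tree path, and modeling the chain of terminal nodes as a Python list is an assumption made for brevity.

```python
class PathEnd:
    """Hypothetical stand-in for the last node on a tree path (e.g., node 419)."""
    def __init__(self):
        self.terminals = []   # each entry models one terminal node: (S, T)

def attach_terminal(path_end, sequence, encoding):
    """Attach a terminal; return True if it is the first one on this path."""
    is_first = not path_end.terminals                 # decision 527
    # Block 529 (first terminal) or blocks 531 and 533 (append another one).
    path_end.terminals.append((sequence, encoding))
    return is_first

end = PathEnd()
first = attach_terminal(end, "A1A2B1B2.com", "GB2312")
second = attach_terminal(end, "X1X2Y1Y2.com", "BIG5")
candidate_types = [encoding for _, encoding in end.terminals]
```

A path that ends in more than one terminal signals that the digit values alone do not determine the encoding type, so a caller would have to treat every listed type as a candidate.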
THE REINSERT ROUTINE [0187]
The preceding discussion has shown how the algorithm can build up a tree structure without moving or adjusting previously completed substructures within the tree. However, it may at times become desirable to adjust a substructure within the tree. This might occur when, for example, neighboring nodes in the tree coincidentally represent linguistic equivalents of a character in some encoding type. Consider the situation in FIG. 6 where the registry contains three domain names: “A1A2E1E2” in a first encoding type, “C1D3H3” in a second encoding type, and “aba” in ASCII. An encoding detection tree 601 is created when the first two domain names are considered. Now assume that by coincidence the digital sequence of E2 in its encoding type is identical to the ASCII value of “a” and the digital sequence of H3 in its encoding type is identical to the ASCII value of “A.” [0188]
The tree building algorithm applies the third domain name as follows. Initially, pointer P2 points to node E2 and pointer P1 points to the last letter “a” in the domain name aba. See block 503 of FIG. 5A. Next, the algorithm runs the check_existence subroutine 540 of FIG. 5C. At 545 in this subroutine, the algorithm determines that the digital values of “a” and “E2” are equal. Thus, it returns a pointer P3 pointing to node E2. Now, returning to primary algorithm 500, the process continues through decision blocks 507 and 511. At 511, it is determined that P1′ is not an empty value; it is “A.” Then, at 513, the algorithm runs the check_existence subroutine 540 using P1′ = A and P2′ = the E2 node. Two loops through subroutine 540 return pointer P3′ pointing to the H3 node. [0189]
Now, returning to the main algorithm, decision 515 determines that P3′ is not pointing to an empty node. Then decision 519 directs the process to 521, where the adjust_subgraph subroutine 560 is executed. There, the system determines that both P3 and P3′ point to a non-null node. Therefore, each of decisions 555, 559, and 561 is answered in the negative and process control is directed to process block 563, where a “reinsert_subgraph” subroutine is executed. This subroutine is depicted in FIG. 5AI by process 570. [0190]
In process 570, the system initially creates pointers P and P′ that point to nodes located immediately under pointers P3 and P3′. In the example tree 601 of FIG. 6, P points to node E1 and P′ points to node D3. At 571, the subroutine determines whether P′ is null. In this case, it is not, so the process moves to 573 where the check_existence subroutine is executed. This time the input parameters are P and P′. Executing the subroutine in this way compares the values of D3 and E1 and moves P to the hikid of E1. Thereafter a new empty node is created at that position and a new pointer P″ is returned at the position of this new node. Now returning to routine 570, a decision block 575 determines whether P″ points to an empty node. It does in this case, so a process operation 577 inserts the digit at the position of P′. In this case, that is the digit D3. [0191]
The process then moves pointer P′ recursively down the branch on which it sits and destroys the previous node of P′. See 579 and 581. Process control then returns to 571 where the procedure for removing the next node on the P′ branch is executed. The process continues in this manner until the last node remaining in the P′ branch has been moved to the P branch. Then P′ points to a null node and the process completes as indicated at 571. [0192]
In general, reinsertion of a substructure is necessary when a digit and its linguistic equivalent digit have both been inserted into the EDT and both of them have substructures. Because the two must now converge, the substructures of one need to be redistributed into the substructures of the other to ensure that no information is lost. [0193]
[0194] 6. HARDWARE/SOFTWARE
[0195] Embodiments of the present invention relate to an apparatus for performing the above-described m1DNS operations. This apparatus may be specially constructed (designed) for the required purposes, or it may be a general-purpose computer selectively activated or configured by a computer program stored in the computer. The processes presented herein are not inherently related to any particular computer or other apparatus. In particular, various general-purpose machines may be used with programs written in accordance with the teachings herein, or it may be more convenient to construct a more specialized apparatus to perform the required method operations. The required structure for a variety of these machines will appear from the description given above.
[0196] In addition, embodiments of the present invention further relate to computer readable media that include program instructions for performing various computer-implemented operations. The media may also include, alone or in combination with the program instructions, data files, data structures, tables, and the like. The media and program instructions may be those specially designed and constructed for the purposes of the present invention, or they may be of the kind well known and available to those having skill in the computer software arts. Examples of computer-readable media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM disks; magneto-optical media such as floptical disks; and hardware devices that are specially configured to store and perform program instructions, such as read-only memory devices (ROM) and random access memory (RAM). The media may also be a transmission medium such as optical or metallic lines, wave guides, etc. including a carrier wave transmitting signals specifying the program instructions, data structures, etc. Examples of program instructions include both machine code, such as produced by a compiler, and files containing higher level code that may be executed by the computer using an interpreter.
[0197] FIG. 7 illustrates a typical computer system in accordance with an embodiment of the present invention. The computer system 700 includes any number of processors 702 (also referred to as central processing units, or CPUs) that are coupled to storage devices including primary storage 706 (typically a random access memory, or “RAM”) and primary storage 704 (typically a read only memory, or “ROM”). As is well known in the art, primary storage 704 acts to transfer data and instructions uni-directionally to the CPU and primary storage 706 is used typically to transfer data and instructions in a bi-directional manner. Both of these primary storage devices may include any suitable type of the computer-readable media described above. A mass storage device 708 is also coupled bi-directionally to CPU 702 and provides additional data storage capacity and may include any of the computer-readable media described above. The mass storage device 708 may be used to store programs, data and the like and is typically a secondary storage medium such as a hard disk that is slower than primary storage. It will be appreciated that the information retained within the mass storage device 708 may, in appropriate cases, be incorporated in standard fashion as part of primary storage 706 as virtual memory. A specific mass storage device such as a CD-ROM 714 may also pass data uni-directionally to the CPU.
[0198] CPU 702 is also coupled to an interface 710 that includes one or more input/output devices such as video monitors, track balls, mice, keyboards, microphones, touch-sensitive displays, transducer card readers, magnetic or paper tape readers, tablets, styluses, voice or handwriting recognizers, or other well-known input devices such as, of course, other computers. Finally, CPU 702 optionally may be coupled to a computer or telecommunications network using a network connection as shown generally at 712. With such a network connection, it is contemplated that the CPU might receive information from the network (e.g., requests to resolve domains), or might output information to the network in the course of performing the above-described method operations. The above-described devices and materials will be familiar to those of skill in the computer hardware and software arts.
[0199] The CPU 702 may take various forms. It may include one or more general purpose microprocessors that are selectively configured or reconfigured to implement the functions described herein. Or it may include one or more specially designed processors or microcontrollers that contain logic and/or circuitry for implementing the functions described herein. Any of the logical devices serving as CPU 702 may be designed as general purpose microprocessors, microcontrollers, application specific integrated circuits (ASICs), digital signal processors (DSPs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), and the like. They may execute instructions under the control of the hardware, firmware, software, reconfigurable hardware, combinations of these, etc.
[0200] The hardware elements described above may be configured (usually temporarily) to act as one or more software modules for performing the operations of this invention. For example, separate modules created from program instructions for detecting an encoding type, transforming that encoding type, and identifying a default name server may be stored on mass storage device 708 or 714 and executed on CPU 702 in conjunction with primary memory 706. See FIG. 3B for example.
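As a non-limiting illustration, the module for detecting an encoding type might be sketched in software as shown below. The sketch assumes, as recited in the claims, that the domain name is reversed before traversal and that the node reached by the final digit carries the encoding type. It further assumes that one byte is treated as one linguistic digit, and the `Node`, `insert`, and `detect` names, the `lokid` and `eqkid` field names, and the encoding labels are hypothetical. The transforming and name-server modules are not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    digit: int                       # one byte, treated here as one digit
    lokid: Optional["Node"] = None   # smaller digit at the same level
    eqkid: Optional["Node"] = None   # next digit of the reversed name
    hikid: Optional["Node"] = None   # larger digit at the same level
    encoding: Optional[str] = None   # label stored where a name ends


def insert(root, digits: bytes, encoding: str):
    """Add an already reversed byte sequence to the tree; return the root."""
    if not digits:
        return root
    head, rest = digits[0], digits[1:]
    if root is None:
        root = Node(head)
    if head < root.digit:
        root.lokid = insert(root.lokid, digits, encoding)
    elif head > root.digit:
        root.hikid = insert(root.hikid, digits, encoding)
    elif rest:
        root.eqkid = insert(root.eqkid, rest, encoding)
    else:
        root.encoding = encoding
    return root


def detect(root, name: bytes) -> Optional[str]:
    """Reverse the name, walk the tree, and return the label found, if any."""
    digits = name[::-1]              # last character is compared to the root
    if not digits:
        return None
    node, i = root, 0
    while node is not None:
        d = digits[i]
        if d < node.digit:
            node = node.lokid
        elif d > node.digit:
            node = node.hikid
        else:
            i += 1
            if i == len(digits):
                return node.encoding
            node = node.eqkid
    return None                      # name is not represented in the tree


# Usage: the same two-character label registered under two encodings.
label = "\u4e2d\u6587.com"
tree = None
tree = insert(tree, label.encode("gb2312")[::-1], "GB2312")
tree = insert(tree, label.encode("big5")[::-1], "BIG5")
```

Because both registered names end in the same ASCII suffix, their reversed forms share the upper part of the tree and diverge only at the first byte that differs between the two encodings.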
[0201] Although the foregoing invention has been described in some detail for purposes of clarity of understanding, it will be apparent that certain changes and modifications may be practiced within the scope of the appended claims.

Claims

What is claimed is:

1. A method, implemented on an apparatus, of detecting a linguistic encoding type of a domain name, the method comprising:

receiving a digital representation of the domain name; and

using said representation to traverse a tree structure having multiple nodes connected by paths and having terminal nodes that uniquely identify distinct encoding types, thereby detecting the linguistic encoding type of the domain name.

2. The method of claim 1, wherein the tree structure is traversed by considering individual characters of the domain name or portions of said characters to determine how to move between nodes on the tree structure.

3. The method of claim 1, wherein the tree structure is traversed by comparing digital representations of linguistic digits in the nodes of the tree structure against digital representations of individual characters of the domain name or portions of said characters, and thereby determining how to move between nodes on the tree structure.

4. The method of claim 3, wherein a linguistic digit of a root of the tree structure is compared against a digital representation of a last character of the domain name and

wherein a linguistic digit of a terminal node of the tree structure is compared against a digital representation of a first character of the domain name.

5. The method of claim 4, wherein the terminal node specifies the linguistic encoding type of the domain name.

6. The method of claim 1, further comprising reversing the sequence of the digital representation of the domain name prior to using said representation to traverse the tree structure, wherein a digital representation of a last character of the domain name is compared to a root node on the tree structure.

7. The method of claim 6, wherein a digital representation of a next to last character of the domain name is compared to a second node of the tree structure.

8. The method of claim 6, further comprising:

(a) using a next previous character of the domain name to identify a next lower level node of the tree structure; and

(b) repeating (a) until reaching a terminal node of the tree structure.

9. The method of claim 1, wherein the tree structure is a ternary tree structure.

10. The method of claim 1, wherein the nodes of the tree structure comprise digital sequences of linguistic digits from characters of multiple encoding types.

11. A computer program product comprising a machine readable medium on which is stored program code instructions for performing a method of detecting a linguistic encoding type of a domain name, the program instructions comprising:

program code for receiving a digital representation of the domain name; and

program code for using said representation to traverse a tree structure having multiple nodes connected by paths and having terminal nodes that uniquely identify distinct encoding types, thereby detecting the linguistic encoding type of the domain name.

12. The computer program product of claim 11, wherein the tree structure is traversed by executing program code for considering individual characters of the domain name or portions of said characters to determine how to move between nodes on the tree structure.

13. The computer program product of claim 11, wherein the tree structure is traversed by executing program code for comparing digital representations of linguistic digits in the nodes of the tree structure against digital representations of individual characters of the domain name or portions of said characters, and thereby determining how to move between nodes on the tree structure.

14. The computer program product of claim 13

15. The computer program product of claim 14

16. The computer program product of claim 11, further comprising program code for reversing the sequence of the digital representation of the domain name prior to using said representation to traverse the tree structure, wherein executing program code compares a digital representation of a last character of the domain name to a root node on the tree structure.

17. The computer program product of claim 16

18. The computer program product of claim 16, further comprising:

(a) program code for using a next previous character of the domain name to identify a next lower level node of the tree structure; and

(b) program code for repeating (a) until reaching a terminal node of the tree structure.

19. The computer program product of claim 11, wherein the tree structure is a ternary tree structure.

20. The computer program product of claim 11

21. An apparatus for detecting a linguistic encoding type of a domain name, the apparatus comprising:

one or more processors;

memory coupled to said one or more processors and configured to store a tree structure having multiple nodes connected by paths and having terminal nodes that uniquely identify distinct encoding types; and

a network interface configured to receive domain names from network nodes;

wherein the one or more processors are configured or designed to traverse the tree structure using information from a domain name to thereby detect the linguistic encoding type of the domain name.

22. The apparatus of claim 21, wherein the tree structure is a ternary tree structure.

23. The apparatus of claim 21

24. The apparatus of claim 21, wherein the one or more processors are further configured or designed to traverse the tree structure by considering individual characters of the domain name or portions of said characters to determine how to move between nodes on the tree structure.

25. The apparatus of claim 21, wherein the one or more processors are further configured or designed to traverse the tree structure by comparing digital representations of linguistic digits in the nodes of the tree structure against digital representations of individual characters of the domain name or portions of said characters, and thereby determining how to move between nodes on the tree structure.

26. The apparatus of claim 21, wherein the one or more processors are further configured or designed to (i) reverse the sequence of the digital representation of the domain name prior to using said representation to traverse the tree structure, and (ii) compare a digital representation of a last character of the domain name to a root node on the tree structure.

27. The apparatus of claim 26, wherein the one or more processors are further configured to compare a digital representation of a next to last character of the domain name to a second node of the tree structure.

28. The apparatus of claim 21, further comprising a logical module for converting the domain name from its linguistic encoding type to a DNS compatible encoding type.

29. The apparatus of claim 28, further comprising a logical module for resolving domain names in the DNS compatible encoding type.

30. The apparatus of claim 28, wherein the DNS compatible encoding type is ASCII.

31. A method, implemented on an apparatus, of creating an encoding detection tree comprising nodes connected by paths, with the paths representing domain names in various encoding types, the method comprising:

receiving a representation of a digitally represented first domain name which is encoded in a first linguistic encoding type;

adding the first domain name, and its first linguistic encoding type, to the encoding detection tree to create a first path through the encoding detection tree;

receiving a representation of a digitally represented second domain name which is encoded in a second linguistic encoding type; and

adding the second domain name, and its second linguistic encoding type, to the encoding detection tree to create a second path through the encoding detection tree.

32. The method of claim 31, further comprising determining whether the first domain name already exists in the encoding detection tree.

33. The method of claim 31, wherein the first and second paths each comprise separate terminal nodes, one or more intermediate nodes, and a common root node.

34. The method of claim 33, further comprising adding an identifier of the first and second linguistic encoding types to the terminal nodes of the first and second paths, respectively.

35. The method of claim 34, further comprising adding a sequence of the first domain name to the terminal node of the first path.

36. The method of claim 33, wherein the first path presents the first domain name in reverse order of linguistic digits when moving from the root node to the terminal node.

37. The method of claim 31, wherein each parent node of the encoding detection tree branches into three children nodes at most.

38. The method of claim 31, wherein adding the first domain name comprises adding a new node to the encoding detection tree for each linguistic digit of the first domain name having a digital sequence that does not appear at a corresponding location in the encoding detection tree.

39. The method of claim 31, wherein the position of the new nodes with respect to existing nodes is determined by comparing the digital sequence of an existing node with the digital sequence of a corresponding linguistic digit from the first domain name.

40. The method of claim 31, wherein adding the first domain name includes adding to the encoding detection tree a linguistic equivalent node of the one of the linguistic digits in the first domain name.

41. A computer program product comprising a machine readable medium on which is provided program instructions for creating an encoding detection tree comprising nodes connected by paths, with the paths representing domain names in various encoding types, the instructions comprising:

program code for receiving a representation of a digitally represented first domain name which is encoded in a first linguistic encoding type;

program code for adding the first domain name, and its first linguistic encoding type, to the encoding detection tree to create a first path through the encoding detection tree;

program code for receiving a representation of a digitally represented second domain name which is encoded in a second linguistic encoding type; and

program code for adding the second domain name, and its second linguistic encoding type, to the encoding detection tree to create a second path through the encoding detection tree.

42. The computer program product of claim 41, further comprising program code for determining whether the first domain name already exists in the encoding detection tree.

43. The computer program product of claim 41

44. The computer program product of claim 43, further comprising program code for adding an identifier of the first and second linguistic encoding types to the terminal nodes of the first and second paths, respectively.

45. The computer program product of claim 44, further comprising program code for adding a sequence of the first domain name to the terminal node of the first path.

46. The computer program product of claim 43

47. The computer program product of claim 41

48. The computer program product of claim 41, wherein the program code for adding the first domain name comprises program code for adding a new node to the encoding detection tree for each linguistic digit of the first domain name having a digital sequence that does not appear at a corresponding location in the encoding detection tree.

49. The computer program product of claim 41

50. The computer program product of claim 41, wherein the program code for adding the first domain name includes program code for adding to the encoding detection tree a linguistic equivalent node of the one of the linguistic digits in the first domain name.

51. An apparatus for creating an encoding detection tree comprising nodes connected by paths, with the paths representing domain names in various encoding types, the apparatus comprising:

one or more processors;

memory coupled to said one or more processors and configured to store a partially created tree structure having multiple nodes connected by paths and having terminal nodes that uniquely identify distinct encoding types; and

an interface configured to receive domain names from a collection of domain names;

wherein the one or more processors are configured or designed to receive representations of digitally represented domain names which are encoded in linguistic encoding types and add those domain names, together with their linguistic encoding types, to the encoding detection tree to create paths through the encoding detection tree.

52. The apparatus of claim 51, wherein the interface is configured to receive domain names from a registry of domain names.

53. The apparatus of claim 51, wherein the paths each comprise separate terminal nodes, one or more intermediate nodes, and a common root node.

54. The apparatus of claim 53, wherein the one or more processors are further designed or configured to add identifiers of the linguistic encoding types to the terminal nodes of the paths.

55. The apparatus of claim 54, wherein the one or more processors are further designed or configured to add a sequence of the domain names to the terminal nodes of the paths.

56. The apparatus of claim 53, wherein the paths present the domain names in reverse order of linguistic digits when moving from the root node to the terminal node.

57. The apparatus of

claim 51