Web Page Extract that is Updated on an Ongoing Basis
FIELD OF THE INVENTION
The present invention pertains to the field of HTML and Internet applications such as Web browsers.
BACKGROUND OF THE INVENTION
Web pages display various kinds of information. On some pages the information is static - it is not updated on a regular basis. A web page containing a history article is an example of a static page.
Other web pages show information that changes - it is updated on a regular basis and can be considered live information. An example of live information is the current weather on a weather page. The following site http://search.excite.com/search.gw?search=san+diego+weather contains a paragraph with the current weather. The weather paragraph is updated regularly (at least once a day) so that it displays live information.
Users of the Internet currently view both static and live information the same way: A Web browser (such as Netscape's Navigator or Microsoft's Internet Explorer) is used to view the entire page.
This approach is not very efficient for obtaining live information. Whenever the user needs an update, the entire page of interest has to be downloaded afresh. The downloading of a web page can take a substantial amount of time depending on Internet congestion and the bandwidth of the user's connection to the Internet. Waiting during this time is unproductive and annoying to the user. Once the page is loaded, the user needs to locate the paragraph of interest visually. Sometimes it is necessary to scroll down the page if the information is not located on the first page view. This is also unproductive and inconvenient.
The problem is exacerbated when a user needs to check many pieces of live information on a regular basis. A typical user might check stock prices, traffic conditions, mortgage rates, news headlines and what is showing at the local theater on a daily basis. The cumulative time taken to download and find this information can be considerable. In a large company where several thousand employees repeat a similar exercise several times a day, the cumulative cost of the time wasted can be excessive.
Several solutions have been proposed and implemented in an attempt to address this problem. They include what are known as Push technologies. Examples are Castanet from Marimba Inc and Netcaster from Netscape Inc. A user can subscribe to what is known as a "channel". Each channel supplies a certain kind of live information, similar to channels on a cable television system. For example, a news channel provides news headlines, a weather channel provides weather information.
Unfortunately, the number of channels is quite limited when compared to the number of live web pages available on the Internet. This is because channel providers need to format the information a special way for it to work with the channel viewer. Another unsatisfactory characteristic of this approach is that the channel content is configured by the channel provider rather than the user.
The benefit of the present invention is that the new technology, Web Snippets, allows a user to configure the live information as the user desires. It also allows the user to obtain the live information from any web page accessible to that user, rather than a limited set of channels.
SUMMARY OF THE INVENTION
The present invention introduces a new kind of Internet technology called WebSnippet™ technology. WebSnippet™ technology provides a live view of a portion (preferably a small portion) of text taken from an existing Internet Web page. This portion of the text is updated from the original Web page at a specified time interval. This creates the impression that the updated text portion is a live view of the text on the original Web page.
As an example consider Figure 1A which shows what a weather page might look like as displayed on a Web Browser. Note that the entire page is displayed as one entity. Figure 1B shows the current temperature extracted from the page. The extract is the text portion that is updated. The extract can be used as a distinct entity within other electronic documents or computer programs.
The text portion would not be very useful if it simply contained the same value ("63 deg F" in this case) for all time after its creation. The real benefit of the WebSnippet™ technology is that it maintains the same value on the user's computer as that shown on the original Web page on its Web server. In this way, the WebSnippet™ technology provides a live view onto a portion of the original Web page.
The updateable text portion or Snippet™ is live in that when the original page changes on its server, so does the value of the updateable text portion or Snippet™, as far as the user is able to discern. Continuing with the previous example, when the weather page shown in Figure 1A changes to that shown in Figure 2A, the updateable text portion or Snippet™ changes too, as shown in Figure 2B. Note that even though the length and content of the original Web page have changed, the updateable text portion or Snippet™ continues to display the correct information.
WebSnippet™ technology can be used in a variety of applications. The preferred embodiment of the present invention displays a list of recently updated text portions or Snippets™. The update interval is specified by the user. The user has the option of emailing the list after each update. Another embodiment shows how computer calculations and comparisons can be performed on two updateable text portions or Snippets™.
WebSnippet™ technology allows computer programs to leverage the vast amount of live information on the public Internet (as well as the information on restricted intranets for those users with access).
NOTATIONS AND NOMENCLATURE
In the context of this invention, 'word' refers to one or more consecutive characters that do not contain whitespace. For example, in Figure 3, 301, 302, 303 and 304 are examples of words. Whitespace is defined as ASCII characters 0-32 and 255. The most common whitespace characters are the 'space' or 'blank' character, and the carriage return character(s) at the end of a line.
In the context of this invention, a 'paragraph' is a set of one or more words.
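By way of illustration only, the definition of a 'word' given above might be expressed in Python as follows. This is a minimal sketch: the names `WORD` and `words` are hypothetical and do not appear in the original description, and "ASCII characters 0-32 and 255" is interpreted here as character codes 0 to 32 and 255.

```python
import re

# Whitespace per the definition above: character codes 0-32 and 255.
# A word is any maximal run of characters outside that set.
WORD = re.compile(r"[^\x00-\x20\xff]+")


def words(text):
    """Return the words of `text`, in order of appearance."""
    return WORD.findall(text)
```

For instance, the text "63 deg F" would yield three words under this sketch: "63", "deg" and "F".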
The present invention makes use of terms and notation associated with HTML, the HyperText Markup Language. An HTML tag is text that appears within the '<' and '>' symbols. Web pages are defined by means of HTML text. HTML text contains both HTML tags and ordinary text which is not enclosed within the '<' and '>' symbols.
In the context of this invention, the word 'document' means HTML text that defines a Web page. The text can be stored in a file on a server or produced dynamically by the server in response to a request.
Useful machines for performing the operations of the present invention include general purpose digital computers or similar devices such as portable computers, smartphones and PDAs (Personal Digital Assistants). The general purpose computer may be selectively activated or reconfigured by a computer program stored in the computer. A special purpose computer may also be used to perform the operations of the present invention. In short, use of the methods described and suggested herein is not limited to a particular computer configuration.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG 1A is an example of an HTML document viewed through a Web Browser.
FIG 1B shows a Snippet™ extract of the temperature taken from the weather page in FIG 1A.
FIG 2A shows the same web page as FIG 1A viewed on a different day when the weather conditions have changed.
FIG 2B shows the updateable text portion or Snippet™ extract of the temperature taken from the weather page in FIG 2A, with the updateable text portion or Snippet™ now showing a different value for the temperature.
FIG 3 shows examples of what the term 'word' refers to in the context of this invention.
FIG 4 illustrates the primary components of a Uniform Resource Locator ("URL").
FIG 5 illustrates the HTML text source that describes the web page of FIG 1A.
FIG 6 shows the HTML text source for a web page whose content changes.
FIG 7A shows the HTML text source for a weather page on a day when weather conditions are good and the temperature is 63 degrees.
FIG 7B shows the HTML text source for the same weather page as FIG 7A on a day when weather conditions are worse and the temperature has changed to 54 degrees.
FIG 8 shows the HTML text source for a web page where the text in a certain paragraph changes from day to day.
FIG 9 illustrates a flowchart for retrieving an updateable text portion or Snippet™ whose location has been specified according to the preferred method of the invention.
FIG 10 shows a flowchart for the algorithm that extracts a word from an HTML text document at a specified paragraph count and a specified word count within the paragraph.
FIG 11 illustrates the user interface of the preferred embodiment as it appears on a personal computer.
FIG 12 is a flowchart that illustrates the algorithm for retrieving a list of updated text portions or Snippets™.
FIG 13 shows a sample email message that is generated by the program of the preferred embodiment shown in FIG 11.
FIG 14 illustrates the user interface for an alternative embodiment that performs calculations on two updateable text portions or Snippets™.
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the present invention make use of WebSnippet™ technology. An updateable text portion or Snippet™ is made up of one or more words extracted on an ongoing basis from a Web page.
A Web page is defined by means of an HTML document that describes to the Web browser how to display the Web page. The Web page seen previously in Figure 1A is defined by the HTML text shown in Figure 5. The HTML text contains HTML tags that instruct the browser how to display ordinary text (ordinary text is text which is not within the '<' and '>' symbols). For example, the tag 501 tells the browser to start using the bold typeface. The tag 502 tells the browser to stop using the bold typeface. A paragraph break is induced by the <p> tag, and so on. The updateable text portion or Snippet™ shown previously in Figure 1B is part of the HTML text shown in Figure 5. It is the word 503.
Thus, the problem of extracting the updateable text portion or Snippet™ from a Web page is that of extracting the updateable text portion or Snippet™ words from the HTML text that defines the Web page. However, the HTML language was designed with the assumption that the whole document would be viewed as one entity. It was not designed to facilitate the extraction of small portions of the document - currently there are no tags that say "this is the first extractable piece from this document", "this is the second extractable portion of this document" etc. Hence a method is required for defining the location of an updateable text portion or Snippet™ in such a way that it can be extracted.
The ongoing nature of the extraction means that the words are extracted at many different times, either at preprogrammed intervals or at sporadic intervals in response to the key press of a user. Between each time of extraction it is possible that the content of the Web page changes. The difficulty in designing an extraction method is that it must be robust enough to account for these content changes and still find the correct words that make up the updateable text portion or Snippet™.
The process of extracting the words comprises:
1) Locating and downloading the HTML document which contains the HTML text.
2) Locating the words of the updateable text portion or Snippet™ within the document. Once located, the extraction of the words is straightforward for anyone skilled in the art of computer programming. The words are simply copied into one or more string variables.
For the first step, a method for uniquely specifying the location of an HTML document already exists. The URL (Uniform Resource Locator) uniquely specifies the location of a document on the Internet. Figure 4 illustrates the primary components of a URL.
A service type 401 is a required part of a URL. The service type specifies the protocol of how to contact the server to retrieve the document. The most common service type is the HyperText Transfer Protocol or http. A system name 402 is also a required part of a URL. The system name is the fully qualified domain name of the server which stores the document. A port 403 is an optional part of a URL. A port and domain name together define a TCP connection end point. The port is optional because a default value of 80 is used for the http service. A port is only needed if the server does not communicate on the default port for that service. A directory path 404 is a required part of a URL. It defines the location of the document within the filesystem of the server. A filename 405 is an optional part of a URL. The filename is the data file itself. The server can be configured so that if a filename is not provided, a default file or directory listing is returned. Additional input data 406 is another optional part of a URL. It is used as input for programs that create and return a document. An example is a stock quote URL that takes the ticker symbol as additional input data. The program returns the recent stock price in an HTML document.
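The URL components described above can be separated programmatically. The following Python sketch is offered only as an illustration and uses the standard library function `urllib.parse.urlsplit`. The function name `url_components`, the dictionary keys and the table `DEFAULT_PORTS` are hypothetical names chosen for this example.

```python
from urllib.parse import urlsplit

# Default port assumed when the URL does not name one (80 for http, as stated above).
DEFAULT_PORTS = {"http": 80}


def url_components(url):
    """Split a URL into the components 401-406 described in the text."""
    parts = urlsplit(url)
    # Everything up to the last "/" is the directory path; the rest is the filename.
    directory, _, filename = parts.path.rpartition("/")
    return {
        "service": parts.scheme,                                  # 401
        "system": parts.hostname,                                 # 402
        "port": parts.port or DEFAULT_PORTS.get(parts.scheme),    # 403
        "directory": directory + "/",                             # 404
        "filename": filename,                                     # 405
        "input": parts.query,                                     # 406
    }
```

Applied to the weather URL cited in the background section, this sketch would report the service as "http", the system name as "search.excite.com", the port as the default 80, the filename as "search.gw" and the additional input data as "search=san+diego+weather".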
The second step of extracting the updateable text portion or Snippet™ is to locate the updateable text portion or Snippet™ word(s) within the document. For clarity, the description below explains the method for locating an updateable text portion made up of a single word. The process of locating a multi-word updateable text portion is an extension of the method shown. A simple (though inefficient) extension is to repeat the method for each word in a multi-word phrase. For example, to locate the three words in a three word updateable text portion or Snippet™, the single word method can be invoked three times.
Locating a word inside an HTML document requires a definition of a measure of distance. In general, the definition of distance measure can be one of several metrics. For example, distance in an HTML document could be measured by counting characters. In Figure 6 the word "12:05am" 601 is located at a distance of 4 characters from the end of the word "time" 602. The letters 'i' and 's' plus the two spaces make 4. Another example of distance measure is the number of words. The word "12:05am" 601 is the 33rd word in the document.
Once a measure of distance is chosen, one way to define the location of the updateable text portion or Snippet™ word is by its distance from the beginning of the document. Using the beginning of the document as the distance reference presents a problem, though. Documents with live information can change any portion of their content, not just the snippet word. This is best illustrated with an example. Figure 7A shows the temperature updateable text portion or Snippet™ 701 within the HTML document that defines a weather page on the Web. Figure 7B shows the HTML text for the same page on a different day when parts of the page other than the updateable text portion or Snippet™ have also changed. Using a character or word metric for distance measurement, the distance of the updateable text portion or Snippet™ 702 from the beginning of the document has changed. The distance to the end of the document has also changed. Suppose the location of the updateable text portion or Snippet™ word is specified as "the updateable word is the 43rd word from the beginning of the document". On the day shown in Figure 7A, the correct word "63" 701 is located. But on the second day shown in Figure 7B, the correct word "54" 702 is not located because it is not the 43rd word in the document. Thus, specifying the location of the updateable word by means of a distance measure from the start or end of the document does not locate the updateable word on different days.
However, in both versions of the document, the distance between the updateable temperature text or Snippet™ and the word "Current" 703 in the heading "Current Temperature" remains the same. This constant distance leads to the preferred method for specifying the location of the updateable word within a known document. The method involves two steps:
1) Specifying a Location Anchor. The Location Anchor is one or more consecutive words that are always present in the document. The word(s) must be at a fixed distance to the updateable text portion or Snippet™. The word(s) must be unique within a configurable search window.
2) Specifying the distance between the Location Anchor and the updateable text portion or Snippet™.
In the contrived example of Figures 7A and 7B, the Location Anchor distance to the updateable text portion word is fixed, both in terms of a character metric and in terms of a word metric. In practice, small changes in content between the Location Anchor and the updateable text portion or Snippet™ word are normal. A measure of distance is required that can absorb these small changes. Experimentation with updateable text portions or Snippets™ on real Web pages has shown that an effective measure of distance is a two dimensional one, a combination of two counts expressed as a pair (p,w):
The first, p, is the paragraph count. It specifies the paragraph in which the updateable text portion or Snippet™ word is found. A paragraph is one or more words separated by one or more consecutive HTML tags. The following tags are excluded and do not increment the paragraph count: SUP, SUB, B, CITE, CODE, DFN, EM, KBD, SAMP, STRONG, VAR, BIG, BLINK, I, SMALL, S, TT, BASEFONT, PRE, CENTER. They are called non-paragraph tags because they mostly change the appearance of a font and do not indicate a paragraph boundary in the ordinary sense of paragraphs.
The second, w, is the word count. It specifies the distance within the identified paragraph as a number of words. Non-paragraph tags and their enclosing "<" and ">" symbols are excluded from the word count.
An example serves to illustrate the (p,w) distance measure. In Figure 8 the updateable text portion or Snippet™ 801 is at a distance of (3,6) from the beginning of the Location Anchor 802. Counting paragraphs from the anchor, the first HTML tag is 803, the CENTER tag. This is one of the tags on the list of excluded tags, so the paragraph count remains at zero. The next tag is the "start of anchor" tag 804. It counts as a paragraph boundary so the paragraph count is incremented to 1. The "end of anchor" tag 805 also increments the paragraph count. The paragraph tag 806 (the P tag, so called in HTML because it tells the browser to start a new paragraph) does not, because it is a consecutive tag (it follows the </a> tag). The paragraph count is now 2. The bold tags 808 are also on the excluded list and do not increment the paragraph count. Finally, the paragraph tag 809 increments the paragraph count to 3. The word "X" 801 is the 6th word in the 3rd paragraph and hence the second value in the distance metric is 6.
The algorithm for determining the paragraph and word count is shown in Figure 10 as a flowchart.
Including the paragraph count in the two dimensional preferred distance measurement is important because the paragraph count does not depend on the number of words in the paragraphs. It makes for a more robust distance measurement in pages where the content of paragraphs changes but the number of paragraphs between the Location Anchor and the updateable text portion or Snippet™ remains constant. For example, in Figure 8 the text pointed to by 807 may change from day to day. The text between the tags 808 may also change from day to day. Regardless of the changes, the distance between the updateable text portion or Snippet™ 801 and location anchor 802 remains at (3,6). This is because the changing text falls within a paragraph. The two dimensional distance metric has proven satisfactory in experimental versions of the embodiments.
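The robustness of the (p,w) measure can be illustrated with a short Python sketch. It is not taken from the experimental versions mentioned above. The names `paragraphs`, `word_from_anchor`, `DAY1` and `DAY2`, and the two sample pages themselves, are hypothetical. The sketch assumes a single-word Location Anchor, a paragraph count p of at least 1, and that non-paragraph tags are ignored entirely, so that they do not interrupt a run of consecutive tags.

```python
import re

# Tags that do not mark a paragraph boundary (list taken from the description).
NON_PARAGRAPH_TAGS = {
    "SUP", "SUB", "B", "CITE", "CODE", "DFN", "EM", "KBD", "SAMP", "STRONG",
    "VAR", "BIG", "BLINK", "I", "SMALL", "S", "TT", "BASEFONT", "PRE", "CENTER",
}
# A token is a whole tag or a word (whitespace = character codes 0-32 and 255).
TOKEN = re.compile(r"<[^>]*>|[^<\x00-\x20\xff]+")


def paragraphs(html):
    """Split HTML text into paragraphs; list index = paragraph count from the start."""
    result = [[]]
    in_tag_run = False
    for token in TOKEN.findall(html):
        if token.startswith("<"):
            name = re.match(r"</?\s*([A-Za-z0-9]*)", token).group(1).upper()
            if name in NON_PARAGRAPH_TAGS:
                continue  # assumption: non-paragraph tags are ignored entirely
            if not in_tag_run:
                result.append([])  # consecutive tags open only one new paragraph
            in_tag_run = True
        else:
            in_tag_run = False
            result[-1].append(token)
    return result


def word_from_anchor(html, anchor, p, w):
    """Return the word at distance (p, w) from a single-word anchor, with p >= 1."""
    paras = paragraphs(html)
    # Paragraph containing the anchor; raises StopIteration if the anchor is absent.
    a = next(i for i, para in enumerate(paras) if anchor in para)
    return paras[a + p][w - 1]


# Hypothetical pages: the middle paragraph changes length between the two days.
DAY1 = ("<html><body><h1>Current Temperature</h1><p>Sunny and mild.</p>"
        "<p>Now: 63 deg F</p></body></html>")
DAY2 = ("<html><body><h1>Current Temperature</h1><p>Heavy rain with <b>gusty</b> winds.</p>"
        "<p>Now: 54 deg F</p></body></html>")
```

With these sample pages, the distance (2,2) from the anchor "Current" yields "63" on the first day and "54" on the second, whereas the seventh word counted from the start of the document is "63" on the first day but "winds." on the second.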
In summary, the preferred method for specifying the location of an updateable text portion or Snippet™ includes the following information:
1) The URL of the HTML document that contains the updateable text portion or Snippet™.
2) The location of the updateable text portion or Snippet™ within the document, specified by:
a) A Location Anchor within the document.
b) The distance between the Location Anchor and the updateable text portion or Snippet™, expressed as the paragraph count and the word count.
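One possible way to hold this location information in a program is a small record type. The Python sketch below is illustrative only. The class name `SnippetLocation` and its field names are hypothetical, and the sample URL and counts are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnippetLocation:
    """Location of an updateable text portion, per the summary above."""
    url: str              # URL of the HTML document containing the text portion
    anchor: str           # Location Anchor: word(s) always present in the document
    paragraph_count: int  # p: paragraphs between the anchor and the text portion
    word_count: int       # w: word position within the identified paragraph


# Hypothetical example: a temperature located at distance (3, 6) from "Current".
example = SnippetLocation(
    url="http://www.example.com/weather.html",
    anchor="Current",
    paragraph_count=3,
    word_count=6,
)
```

Making the record immutable is a design choice of this sketch: a location is configured once by the user and then reused unchanged at every refresh.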
Figure 9 shows an algorithm for retrieving an updateable text portion given its location. The updateable text portion or Snippet™ location is the input 901 to the algorithm and is specified in terms of the preferred method above. The output of the algorithm is the updateable word or text portion.
The first step 902 in finding the updateable text or word is to get the latest copy of the document that contains the updateable text or word. Most modern network programming environments provide functionality to get a document given its URL. One example is in the Java language where the classes 'URL' and 'URLConnection' provide this functionality.
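The same functionality is available in other environments. As an illustration only, step 902 might be sketched in Python with the standard library function `urllib.request.urlopen`. The function name `fetch_document`, the 10 second timeout and the fallback to the Latin-1 character set are assumptions of this sketch and are not specified in the description.

```python
from urllib.request import urlopen


def fetch_document(url, timeout=10):
    """Download the document at `url` and return its HTML text."""
    with urlopen(url, timeout=timeout) as response:
        # Use the character set declared by the server, if any (fallback is an assumption).
        charset = response.headers.get_content_charset() or "latin-1"
        return response.read().decode(charset, errors="replace")
```

Because a fresh copy is requested on every call, each refresh works from the current content of the page rather than from a stored copy held by the program.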
The next step, 903, is to find the distance of the Location Anchor from the beginning of the document (the beginning of the document is a convenient reference point, but any other well known position can be used, e.g. the <body> tag, which occurs once early in the document). This process involves comparing the words of the document with the text of the Location Anchor, and keeping count of the current position within the document. The algorithm for doing this step is a variation of the algorithm for the next step, 904, which is detailed in Figure 10. The distance of the Location Anchor is saved for the next step using the variables (a, b).
Once the distance of the Location Anchor from the beginning of the document is determined, the updateable text portion or Snippet™ location can be calculated as shown in step 904. The distance between the Location Anchor and the updateable text portion or Snippet™ is given as (x,y). There are two cases:
1) x is zero, i.e. the Location Anchor and the updateable text portion or Snippet™ are in the same paragraph. In this case the updateable text portion or Snippet™ word can be found at location (a,y+b) from the beginning of the document. Note that the paragraph count is x+a or simply a since x is zero. The algorithm for this case is a variation of the algorithm for the second case and is not shown explicitly.
2) x is non-zero (greater or less than zero). In this case the updateable text portion or Snippet™ is found at location (x+a, y) from the beginning of the document. Note that the value b is not included in the word count. This is because the word count of the updateable text portion or Snippet™ starts at the beginning of its paragraph and is independent of the number of words in the paragraph in which the Location Anchor is found. The updateable text portion or Snippet™ word or text portion is then found by moving the distance (x+a, y) from the beginning of the document. The details of this step are shown in a stepwise fashion in Figure 10.
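The two cases reduce to a small calculation, sketched here in Python for illustration. The function name `absolute_location` is hypothetical. The variables a, b, x and y have the meanings given above.

```python
def absolute_location(anchor_position, distance):
    """Combine the anchor position (a, b) with the relative distance (x, y).

    Returns the (paragraph, word) location measured from the start of the document.
    """
    a, b = anchor_position
    x, y = distance
    if x == 0:
        # Same paragraph as the anchor: word counts add, paragraph is unchanged.
        return (a, y + b)
    # Different paragraph: the word count restarts, so b plays no part.
    return (x + a, y)
```

For example, an anchor at (4,3) with a distance of (0,5) gives location (4,8), while the same anchor with a distance of (2,5) gives location (6,5).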
In Figure 10, the input 1001 to the algorithm is the document text itself and the distance of the updateable text portion or Snippet™ word from the beginning of the document. The text of the document is processed word by word. The decision step 1002 ensures that consecutive paragraph tags increment the paragraph count only once. Whenever a paragraph tag is found 1005, the paragraph_count variable is incremented. A non-paragraph tag is ignored, as shown in the conditional branch at point 1003. When the paragraph_count variable matches the input value p (the paragraph containing the updateable text portion or Snippet™), the algorithm moves on to locating the updateable text or word 1007. The document continues to be read word by word (ignoring non-paragraph tags), and the word_count variable is incremented 1008. The word_count is compared 1009 with the input value w (the word count distance to the updateable text portion or Snippet™ in the paragraph). When this matches, the current word is the desired updateable text portion and is returned.
For clarity, the algorithm does not show recovery from error conditions. The algorithm shown is for a single-word updateable text portion, but as mentioned earlier, this core logic can easily be extended to support multi-word Snippets™.
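The logic described for Figure 10 might be sketched in Python as follows. This is an interpretation of the prose description, not a transcription of the flowchart. The names `TOKEN`, `is_paragraph_tag` and `extract_word` are hypothetical. Returning `None` when the word is not found is an assumption added here, since the description leaves error handling open. The sketch also assumes that text before the first tag belongs to paragraph 0 and that non-paragraph tags do not interrupt a run of consecutive tags.

```python
import re

# Tags that do not mark a paragraph boundary (list taken from the description).
NON_PARAGRAPH_TAGS = {
    "SUP", "SUB", "B", "CITE", "CODE", "DFN", "EM", "KBD", "SAMP", "STRONG",
    "VAR", "BIG", "BLINK", "I", "SMALL", "S", "TT", "BASEFONT", "PRE", "CENTER",
}
# A token is a whole tag or a word (whitespace = character codes 0-32 and 255).
TOKEN = re.compile(r"<[^>]*>|[^<\x00-\x20\xff]+")


def is_paragraph_tag(token):
    """Return True for a tag that is not on the non-paragraph list."""
    name = re.match(r"</?\s*([A-Za-z0-9]*)", token).group(1)
    return name.upper() not in NON_PARAGRAPH_TAGS


def extract_word(document, p, w):
    """Return the w-th word (counting from 1) of paragraph p, or None if absent."""
    paragraph_count = 0
    word_count = 0
    previous_was_boundary = False
    for token in TOKEN.findall(document):
        if token.startswith("<"):
            if not is_paragraph_tag(token):
                continue  # non-paragraph tags are ignored
            if not previous_was_boundary:
                paragraph_count += 1  # consecutive tags increment only once
                word_count = 0
            previous_was_boundary = True
            if paragraph_count > p:
                return None  # paragraph p ended before word w was reached
            continue
        previous_was_boundary = False
        if paragraph_count == p:
            word_count += 1
            if word_count == w:
                return token
    return None
```

A single pass over the document suffices because the paragraph count only ever increases, so the search can stop as soon as the count passes p.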
THE PREFERRED EMBODIMENTS
The preferred embodiment of the invention is a computer program that allows a user to view and refresh a list of updateable text portions or Snippets™. The list can be updated automatically at a user specified interval. For automatic updates, there is an option to send the updated list by email to a user specified email address.
The embodiment is best described by its interface to the user when executing on a computer. Figure 11 shows the user interface on a computer with GUI (graphical user interface) capabilities, such as a typical Personal Computer. The computer has a mouse or pointer device that is used with the GUI.
The list window 1101 contains the list of updateable text portions or words 1102 and corresponding labels entered by the user 1103. The updateable text portions are views onto different web pages on the Internet. The labels are used to identify each updateable text portion or Snippet™ for the user, if necessary. The labels do not change when the list is refreshed. For example, the stock price updateable text portion or Snippet™ 1104 is a number. Without a label the user does not know what the number refers to, whether it is the current temperature or the price of some other company's stock. The label 1105 identifies the number as the price of IBM stock.
The list is refreshed manually when the user presses the 'Manual Refresh' button 1110.
The list can also be updated automatically. The user enters an update interval 1112 by clicking the up/down arrows 1113. The available update intervals are every half hour and every 1, 2, 3, 4, 6, 12, 24, 48, 72 and 96 hours.
Whether the list is updated manually or automatically, the updating process is the same. It is described by the algorithm shown in Figure 12. The algorithm is executed after the main window display 1101 is cleared. The input to the algorithm is the list of Snippet™ Locations and the number N of updateable text portions or Snippets™ in the list. Each location is specified as described in the Preferred Method previously. The algorithm keeps a count of the updateable text portions or Snippets™ in the snippet_count variable. The algorithm performs a loop N times. The first step in the loop 1201 is to retrieve the updateable text portion or Snippet™ at the snippet_count position in the list. Retrieving the updateable text given its location is shown in Figure 9 and has been described previously. Once an updateable text portion or Snippet™ is retrieved, the label corresponding to the given updateable text portion or Snippet™ is displayed on a fresh line in the main list window 1101. The retrieved updateable text portion or Snippet™ is placed next to its label. After N cycles the entire list is refreshed.
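The refresh loop might be sketched in Python as below. This is a simplified illustration rather than the program of the preferred embodiment. The function name `refresh_list` and the parameter `retrieve` are hypothetical. `retrieve` stands for any function that returns the updateable text portion for a given location, and the display window is represented simply as a list of text lines.

```python
def refresh_list(labels, locations, retrieve):
    """Return the refreshed display lines, one per updateable text portion."""
    lines = []  # the list window starts out cleared
    for snippet_count, location in enumerate(locations):
        value = retrieve(location)           # retrieve the text portion at this position
        label = labels[snippet_count]        # the label does not change on refresh
        lines.append(f"{label} {value}")     # value is placed next to its label
    return lines
```

Passing the retrieval step in as a parameter keeps the loop independent of how a text portion is located and downloaded, so the same loop serves both manual and automatic refreshes.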
For automatic updates it is possible to have the updateable text portion or Snippet™ list sent by email to one or more email addresses. To do this the user must select the checkbox 1115 with a click of the mouse. One or more email addresses can be entered in the text box 1117. If the checkbox is checked, the program will send the content of the main list window 1101 to the given email addresses after each automatic update. An example of the email that would be received is shown in Figure 13.
The GUI components of the embodiment, namely the button, the list, the checkbox and the selection arrows, are standard components in window programming environments. Those skilled in the art of computer programming are familiar with implementing them. Examples of programming tools that supply these components are Visual Studio from Microsoft Inc. and Visual Cafe PDE from Symantec Inc. Sending email to a list of email addresses is also a well-known operation amongst those skilled in the art and requires no further explanation.
An alternative embodiment is shown in Figure 14. It highlights the very useful nature of updateable text portions or Snippets™ in that it shows how a computer program can perform calculations and comparisons on the updateable text portion or Snippet™ values automatically. Figure 14 shows a Snippet™ Calculator that can perform various operations on one or two Snippets™. The user can refresh the updateable text portions or Snippets™ and perform the calculations either manually or automatically, as in the previous embodiment. The user can select the operation to be performed from a pull-down selection box 1401. The available operations are addition ("A+B"), subtraction ("A-B"), multiplication ("A*B") and division ("A/B"). The result of this operation can be compared to a constant K 1402 that is entered by the user. The comparison operation is also user selectable from a pull-down selection box 1403. The three comparison choices are equality ("="), greater than (">") and less than ("<"). Whenever the result of the comparison is true, email containing the two Snippet™ values A and B is sent to the recipients on the user specified email list. In Figure 14 the user has configured the program to calculate the spread between long and short term interest rates. If the spread exceeds 4 percentage points, the user is notified by email, where one of the email addresses is an interface to the user's alpha pager.
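The calculation and comparison performed by such a calculator might be sketched in Python as follows. The function name `comparison_holds` and the tables `OPERATIONS` and `COMPARISONS` are hypothetical. The sketch assumes that both Snippet™ values are plain decimal numbers that Python's `float` can parse, and it does not guard against division by zero.

```python
import operator

# Operation and comparison choices offered by the selection boxes 1401 and 1403.
OPERATIONS = {
    "A+B": operator.add,
    "A-B": operator.sub,
    "A*B": operator.mul,
    "A/B": operator.truediv,
}
COMPARISONS = {"=": operator.eq, ">": operator.gt, "<": operator.lt}


def comparison_holds(a_text, b_text, operation, comparison, k):
    """Apply `operation` to the two values, then compare the result with K."""
    a = float(a_text)  # assumption: the values are plain decimal numbers
    b = float(b_text)
    result = OPERATIONS[operation](a, b)
    return COMPARISONS[comparison](result, k)
```

In the interest rate example, a long term rate of "6.5" and a short term rate of "2.0" with the operation "A-B", the comparison ">" and K equal to 4 would make the comparison true, which is the condition under which the email is sent.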