US20130238980A1 - Method and Apparatus for Processing World Wide Web Page - Google Patents

Method and Apparatus for Processing World Wide Web Page

Info

Publication number
US20130238980A1
Authority
US
United States
Prior art keywords
page
www
www page
dom
dom tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US13/823,603
Other versions
US8739024B2 (en)
Inventor
Shudong Ruan
Yu Xu
Mo Peng
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Assigned to TENCENT TECHNOLOGY (SHENZHEN) COMPANY LIMITED reassignment TENCENT TECHNOLOGY (SHENZHEN) COMPANY LIMITED ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: PENG, MO, RUAN, SHUDONG, YU, XU
Publication of US20130238980A1
Application granted
Publication of US8739024B2
Current legal status: Active
Anticipated expiration

Classifications

    • G06F17/2247
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/12: Use of codes for handling textual entities
    • G06F40/14: Tree-structured documents
    • G06F40/143: Markup, e.g. Standard Generalized Markup Language [SGML] or Document Type Definition [DTD]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/957: Browsing optimisation, e.g. caching or content distillation
    • G06F16/9574: Browsing optimisation, e.g. caching or content distillation of access to content, e.g. by caching

Abstract

Embodiments of the present invention provide a method for processing a World Wide Web (WWW) page, which includes: determining at least one website to be optimized; generating a corresponding page template for each type of WWW page in each website, and storing the page template; grabbing WWW pages from each website, matching each grabbed WWW page with a page template, filtering redundant HTML data from the WWW page according to a matching result, and storing the filtered WWW page; and, after receiving a request sent by a terminal for accessing a WWW page, determining whether there is a stored filtered WWW page corresponding to the WWW page requested by the terminal and, if so, returning the filtered WWW page to the terminal. Embodiments of the present invention also provide an apparatus for processing a WWW page. With the scheme of the present invention, redundant information may be efficiently eliminated.

Description

  • FIELD OF THE INVENTION
  • The present invention relates to Internet technology, and more particularly, to a method and apparatus for processing a World Wide Web (WWW) page.
  • BACKGROUND OF THE INVENTION
  • With the popularity of broadband Internet, the content displayed on WWW pages is becoming increasingly rich. However, redundant information, such as advertising information, is also constantly increasing. When a user browses a WWW page on a terminal with a limited size, such as a mobile terminal, the redundant information makes browsing considerably less convenient.
  • SUMMARY OF THE INVENTION
  • In view of the above, embodiments of the present invention provide a method and an apparatus for processing a WWW page, so as to effectively eliminate redundant information.
  • The method for processing a WWW page provided by embodiments of the present invention includes:
      • determining at least one website to be optimized; generating a corresponding page template for each type of WWW page in each website, and storing the corresponding page template;
      • constantly grabbing WWW pages from each website, matching each grabbed WWW page with the page template corresponding to the grabbed WWW page, filtering redundant Hypertext Markup Language (HTML) data from the WWW page according to a matching result, and storing the filtered WWW page without the redundant HTML data;
      • after receiving a request, sent by a terminal, for accessing a WWW page, determining whether there is a stored filtered WWW page without the redundant HTML data corresponding to the WWW page requested by the terminal; and, when there is such a stored filtered WWW page, returning the filtered WWW page without the redundant HTML data to the terminal.
  • The apparatus for processing a WWW page provided by embodiments of the present invention includes:
      • a first processing unit configured to determine at least one website to be optimized; generate a corresponding page template for each type of WWW page in each website, and store the corresponding page template; constantly grab WWW pages from each website, match each grabbed WWW page with the page template corresponding to the grabbed WWW page, filter redundant Hypertext Markup Language (HTML) data from the WWW page according to a matching result, and store the filtered WWW page without the redundant HTML data;
      • a second processing unit configured to, after receiving a request sent by a terminal for accessing a WWW page, determine whether a filtered WWW page without the redundant HTML data corresponding to the WWW page requested by the terminal is stored in the first processing unit; and, when such a filtered WWW page is stored, obtain the filtered WWW page without the redundant HTML data from the first processing unit and return it to the terminal.
  • As can be seen, by adopting the technical solutions of the present invention, redundant Hypertext Markup Language (HTML) information, such as advertising information, may be filtered from a grabbed WWW page according to a page template. Redundant information is thus efficiently eliminated, which facilitates the user's browsing. In addition, the technical solutions of the present invention are simple and convenient to implement.
  • BRIEF DESCRIPTION OF DRAWINGS
  • Exemplary embodiments of the present invention are described in detail below with reference to the accompanying drawings, so as to make the above-mentioned and other features and advantages of the present invention clearer to one skilled in the art. In the accompanying drawings:
  • FIG. 1 is a flowchart illustrating a method for processing a WWW page according to an embodiment of the present invention;
  • FIG. 2 is a schematic diagram illustrating a structure of an apparatus for processing a WWW page according to an embodiment of the present invention.
  • DETAILED DESCRIPTION OF THE INVENTION
  • In view of the above problem in the prior art, the present invention provides a new scheme for processing a WWW page.
  • In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in detail below with reference to the accompanying drawings.
  • FIG. 1 is a flowchart illustrating a method for processing a WWW page according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following processes.
  • In block 11, a website to be optimized is determined.
  • In practice, a background administrator may determine the websites to be optimized (that is, the websites whose WWW pages will be filtered by the method described below), compile the determined websites into a website link list, and input the website link list to the background processing system.
  • In theory, the more websites the website link list includes, the better. However, taking into account factors such as maintenance costs, the website link list may include only some commonly used websites.
  • In block 12, for each type of WWW page in each website, a corresponding page template is generated and stored.
  • Specifically, for each website X in the website link list, the following processes are performed. According to a received instruction of the background administrator, one WWW page is obtained for each of the various types of WWW pages in website X. Each obtained WWW page is analyzed to construct a Document Object Model (DOM) tree. According to a received instruction of the background administrator, each DOM node that does not need to be retained is deleted from each DOM tree. Each DOM tree from which such nodes have been deleted is then transformed back into a WWW page, and that WWW page is stored as a page template. Analyzing a WWW page to construct a DOM tree and transforming a DOM tree into a WWW page may both be implemented with existing technologies.
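  • The template-generation steps of block 12 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the sample page is well-formed XHTML so that Python's standard-library XML parser can stand in for an HTML parser, and it assumes the administrator's selection is expressed as a set of element `id` values. The names `PAGE`, `make_template` and `ids_to_delete` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page; assumed to be well-formed XHTML so that the
# standard-library XML parser can stand in for a real HTML parser.
PAGE = (
    "<html><body>"
    "<div id='nav'>Home</div>"
    "<div id='ad'>Buy now</div>"
    "<div id='article'><p>Story text</p></div>"
    "</body></html>"
)

def make_template(page_source, ids_to_delete):
    """Parse a page, delete the nodes marked for removal, and serialize."""
    root = ET.fromstring(page_source)            # page -> DOM-like tree
    for parent in list(root.iter()):             # snapshot before mutating
        for child in list(parent):
            if child.get("id") in ids_to_delete:
                parent.remove(child)             # delete the unwanted node
    return ET.tostring(root, encoding="unicode")  # tree -> page template

# The administrator marks the advertising block for deletion.
template = make_template(PAGE, {"ad"})
print(template)
```

  • Real-world HTML is rarely well-formed XML, so a production system would need a tolerant HTML parser; the parse, delete, serialize sequence would stay the same.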
  • In practice, a suitable plug-in may be developed and installed in a browser, such as the Firefox browser, of the background processing system. The background administrator may then access different types of WWW pages in different websites via the Firefox browser with the plug-in. Specifically, for each type of WWW page in each website, such as the news type or the BBS type, the administrator may select one WWW page at random, access it, and use a mouse to select the contents to be retained and the contents to be deleted. According to the operations of the background administrator, the plug-in performs the corresponding functions, including analyzing the WWW page to construct a DOM tree, deleting a DOM node, and transforming a DOM tree into a WWW page.
  • After the process described in block 12, a series of page templates is obtained. For example, suppose the website link list includes three websites (this number is chosen only for illustration; in practice, the number of websites in the website link list may far exceed three), in which the first website includes five different types of WWW pages, the second website includes six, and the third website includes four. A total of 5+6+4=15 page templates is then obtained.
  • In block 13, WWW pages are constantly grabbed from each website. Each grabbed WWW page is matched with the page template corresponding to the grabbed WWW page. According to a matching result, redundant Hypertext Markup Language (HTML) data is filtered from the grabbed WWW page, and the grabbed WWW page without the redundant HTML data is stored.
  • The background processing system may constantly grab WWW pages from each website in the website link list. The grabbing operation may be performed in real time or periodically at a set interval. The background processing system grabs all WWW pages in each website.
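  • The interval-based grabbing described above can be sketched as a simple due-check. This is an assumption-laden illustration: the source does not specify how the interval is tracked, and the function `sites_due`, the `last_grabbed` mapping, the example host names and the 30-minute interval are all hypothetical.

```python
from datetime import datetime, timedelta

def sites_due(last_grabbed, now, interval):
    """Return the websites whose last grab is at least `interval` old.

    A website that has never been grabbed (value None) is always due.
    """
    return sorted(
        site for site, when in last_grabbed.items()
        if when is None or now - when >= interval
    )

# Hypothetical bookkeeping: website -> time of the last completed grab.
now = datetime(2011, 11, 21, 12, 0)
last_grabbed = {
    "news.example.com": now - timedelta(minutes=45),
    "bbs.example.com": now - timedelta(minutes=5),
    "blog.example.com": None,
}

due = sites_due(last_grabbed, now, timedelta(minutes=30))
print(due)  # ['blog.example.com', 'news.example.com']
```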
  • For each grabbed WWW page Y, the following processes are performed. WWW page Y is analyzed to construct a DOM tree, referred to as DOM tree 1. Page template Y corresponding to WWW page Y is analyzed to construct a DOM tree, referred to as DOM tree 2. For each DOM node in DOM tree 1, it is determined whether there is a matching DOM node in DOM tree 2. If there is a matching DOM node in DOM tree 2, no operation is performed on the DOM node in DOM tree 1; otherwise, the DOM node in DOM tree 1 is deleted. DOM tree 1, from which each DOM node that does not need to be retained has been deleted, is transformed into a WWW page, and DOM tree 2 is transformed into page template Y. Determining whether a DOM node has a matching node may be implemented with existing technologies. With this method, it is possible to filter out redundant HTML data, such as advertising information, from a WWW page.
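  • The comparison of DOM tree 1 against DOM tree 2 can be sketched as follows. The source leaves the matching criterion open, so this sketch assumes one: a node in DOM tree 1 matches if DOM tree 2 contains a node with the same root-to-node path of (tag, `id`) pairs. It also assumes well-formed XHTML input and that the two root elements match. The names `signature`, `template_paths`, `prune` and `filter_page` are hypothetical.

```python
import xml.etree.ElementTree as ET

def signature(node):
    # Assumed matching key: tag name plus the id attribute (if any).
    return (node.tag, node.get("id"))

def template_paths(node, prefix=()):
    """Collect the root-to-node signature path of every template node."""
    path = prefix + (signature(node),)
    paths = {path}
    for child in node:
        paths |= template_paths(child, path)
    return paths

def prune(node, path, allowed):
    """Delete children of `node` that have no matching template node."""
    for child in list(node):
        child_path = path + (signature(child),)
        if child_path in allowed:
            prune(child, child_path, allowed)  # matched: keep and descend
        else:
            node.remove(child)                 # unmatched: delete subtree

def filter_page(page_source, template_source):
    tree1 = ET.fromstring(page_source)      # DOM tree 1 (grabbed page)
    tree2 = ET.fromstring(template_source)  # DOM tree 2 (page template)
    allowed = template_paths(tree2)
    prune(tree1, (signature(tree1),), allowed)
    return ET.tostring(tree1, encoding="unicode")

PAGE_Y = (
    "<html><body>"
    "<div id='nav'>Home</div>"
    "<div id='ad'>Buy now</div>"
    "<div id='article'><p>Breaking news</p><p>More text</p></div>"
    "</body></html>"
)
TEMPLATE_Y = (
    "<html><body>"
    "<div id='nav'>Home</div>"
    "<div id='article'><p>Sample</p></div>"
    "</body></html>"
)

filtered = filter_page(PAGE_Y, TEMPLATE_Y)
print(filtered)
```

  • Deleting an unmatched node removes its whole subtree, so its descendants do not need to be checked individually. Under the assumed criterion, both paragraphs of the article survive because they share one path with the single paragraph in the template.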
  • The above-mentioned page template Y corresponding to WWW page Y is a page template of the same type as WWW page Y that belongs to the same website as WWW page Y. In practice, when each page template is stored, the Uniform Resource Locator (URL) of the page template may be stored with it. The URL may reflect information such as the website to which the page template belongs and the type of the page template. Thus, before each grabbed WWW page is matched with its corresponding page template, the corresponding page template can be determined according to the URL of the grabbed WWW page.
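  • Selecting a page template from a URL can be sketched as follows. The source says only that the URL may reflect the website and the page type; the rule used here, that the first path segment names the page type, is an assumption made purely for illustration, as are the names `template_key`, `templates` and `find_template` and the example URLs.

```python
from urllib.parse import urlsplit

def template_key(url):
    """Derive (website, page type) from a URL.

    Assumes, for illustration only, that the first path segment names
    the page type, e.g. /news/123.html -> "news".
    """
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    page_type = segments[0] if segments else ""
    return (parts.netloc.lower(), page_type)

# Templates are indexed by the key of the URL they were built from.
templates = {
    template_key("http://www.example.com/news/1.html"): "<news template>",
    template_key("http://www.example.com/bbs/thread-9.html"): "<bbs template>",
}

def find_template(url):
    return templates.get(template_key(url))  # None if no template exists

print(find_template("http://www.example.com/news/20111121.html"))
```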
  • In block 14, when a request sent by a terminal for accessing a WWW page is received, it is determined whether there is a stored filtered WWW page, with the redundant HTML data removed, corresponding to the WWW page requested by the terminal. If there is, the filtered WWW page without the redundant HTML data is returned to the terminal.
  • Specifically, when receiving a request sent by a terminal for accessing a WWW page, the background processing system first determines whether a filtered WWW page, with the redundant HTML data removed, corresponding to the requested WWW page is stored locally; that is, it determines whether the WWW page requested by the terminal has already been grabbed and optimized. If so, the corresponding filtered WWW page without the redundant HTML data is returned to the terminal; otherwise, the real-time transformation process of the WWW page is implemented according to existing technologies.
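  • The request handling of block 14 can be sketched as a lookup with a fallback. This is a minimal sketch under stated assumptions: the store of filtered pages is modeled as an in-memory dictionary keyed by URL, and the real-time transformation, which the source attributes to existing technologies, is represented by a stand-in function. The names `handle_request`, `filtered_store` and `fake_transform` are hypothetical.

```python
def handle_request(url, filtered_store, transform_in_real_time):
    """Serve a stored filtered page if one exists, else transform live."""
    page = filtered_store.get(url)
    if page is not None:
        return page, "stored"
    return transform_in_real_time(url), "real-time"

# Hypothetical store of pages that were already grabbed and filtered.
filtered_store = {
    "http://www.example.com/news/1.html": "<html>filtered</html>",
}

def fake_transform(url):
    # Stand-in for the existing real-time transformation process.
    return "<html>transformed " + url + "</html>"

hit = handle_request(
    "http://www.example.com/news/1.html", filtered_store, fake_transform
)
miss = handle_request(
    "http://www.example.com/news/2.html", filtered_store, fake_transform
)
print(hit[1], miss[1])  # stored real-time
```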
  • Based on the above description, FIG. 2 is a schematic diagram illustrating a structure of an apparatus for processing a WWW page according to an embodiment of the present invention. As shown in FIG. 2, the apparatus includes a first processing unit 21 and a second processing unit 22.
  • The first processing unit 21 is configured to determine at least one website to be optimized. For each type of WWW page in each website, the first processing unit 21 generates and stores a corresponding page template. It also constantly grabs WWW pages from each website, compares each grabbed WWW page with its corresponding page template, filters redundant HTML data from the grabbed WWW page according to the comparison result, and stores the filtered WWW page without the redundant HTML data.
  • The second processing unit 22 is configured to, when receiving a request for accessing a WWW page sent by a terminal, determine whether there is a filtered WWW page corresponding to the WWW page requested by the terminal stored in the first processing unit 21. When there is a filtered WWW page corresponding to the WWW page requested by the terminal stored in the first processing unit 21, the second processing unit 22 obtains the filtered WWW page from the first processing unit 21, and returns the filtered WWW page to the terminal.
  • The second processing unit 22 is further configured to, when there is no filtered WWW page corresponding to the WWW page requested by the terminal stored in the first processing unit 21, implement the real-time transformation process for the WWW page.
  • The first processing unit 21 may further include (to simplify the drawings, the detailed structure of the first processing unit is not illustrated) a first processing sub-unit, a second processing sub-unit and a third processing sub-unit.
  • The first processing sub-unit is configured to receive at least one website to be optimized inputted by a background administrator.
  • The second processing sub-unit is configured to perform the following operations for each website X: according to a received instruction of the background administrator, obtain one WWW page for each of the various types of WWW pages in website X; analyze each obtained WWW page to construct a Document Object Model (DOM) tree; according to a received instruction of the background administrator, delete from each DOM tree each DOM node that does not need to be retained; and transform each DOM tree from which DOM nodes have been deleted into a WWW page, and store the WWW page as a page template.
  • The third processing sub-unit is configured to constantly grab WWW pages from each website and, for each grabbed WWW page Y, to perform the following operations: analyze WWW page Y to construct a DOM tree, and obtain DOM tree 1; analyze page template Y corresponding to WWW page Y to construct a DOM tree, and obtain DOM tree 2; for each DOM node in DOM tree 1, determine whether there is a matching DOM node in DOM tree 2; when there is a matching DOM node in DOM tree 2, perform no operation on the DOM node in DOM tree 1; otherwise, delete the DOM node in DOM tree 1; transform DOM tree 1, from which each DOM node that does not need to be retained has been deleted, into a WWW page; and transform DOM tree 2 into page template Y.
  • For the specific processes of the apparatus embodiment shown in FIG. 2, reference may be made to the corresponding description of the method embodiment shown in FIG. 1; no further description is provided here. In addition, the terminal mentioned in the embodiments shown in FIG. 1 and FIG. 2 is generally a mobile terminal.
  • The foregoing describes only preferred embodiments of the present invention and is not intended to limit its protection scope. Any modification, equivalent substitution, or improvement made without departing from the principle of the present invention falls within the protection scope of the present invention.

Claims (10)

1. A method for processing a World Wide Web, WWW, page, the method comprises:
determining at least one website to be optimized;
generating a corresponding page template for each of WWW pages with different types in each website, and storing the corresponding page template;
constantly grabbing WWW pages from each website, matching each grabbed WWW page with a page template corresponding to the grabbed WWW page, filtering redundant Hyper Text Mark-up Language, HTML, data from the WWW page according to a matching result, and storing the filtered WWW page without the redundant HTML data;
after receiving a request, sent by a terminal, for accessing a WWW page, determining whether there is a stored filtered WWW page without the redundant HTML data corresponding to the WWW page requested by the terminal; and
when there is a stored filtered WWW page without the redundant HTML data corresponding to the WWW page requested by the terminal, returning the filtered WWW page without the redundant HTML data to the terminal.
2. The method according to claim 1, the method further comprises:
when there is no stored filtered WWW page without the redundant HTML data corresponding to the WWW page requested by the terminal, implementing a real-time transformation process for the WWW page requested by the terminal.
3. The method according to claim 1, wherein generating a corresponding page template for each of WWW pages with different types in each website, and storing the corresponding page template comprises performing the following operations for each respective website X: obtaining a WWW page from each of the various types of WWW pages in the website X according to a received instruction of a background administrator;
respectively analyzing each obtained WWW page to construct a Document Object Model, DOM, tree; deleting each DOM node unnecessary to be reserved from each DOM tree according to a received instruction of a background administrator;
respectively transforming each DOM tree in which each DOM node unnecessary to be reserved is deleted into a WWW page; and
storing the WWW page as a page template.
4. The method according to claim 1, wherein matching each grabbed WWW page with a page template corresponding to the grabbed WWW page, filtering redundant HTML data from the WWW page according to a matching result comprises:
for each grabbed WWW page Y, performing the following processes:
analyzing the WWW page Y to construct a DOM tree, and obtaining DOM tree 1;
analyzing page template Y corresponding to the WWW page Y to construct a DOM tree, and obtaining DOM tree 2;
for each DOM node in DOM tree 1, determining whether there is a matched DOM node in DOM tree 2;
when there is a matched DOM node in DOM tree 2, performing no operations on the DOM node in DOM tree 1, otherwise, deleting the DOM node from DOM tree 1;
transforming DOM tree 1 in which each DOM node unnecessary to be reserved is deleted into a WWW page; and
transforming DOM tree 2 into page template Y.
5. The method according to claim 1, the method further comprises:
storing a Uniform Resource Locator, URL, of each page template; and
before matching each grabbed WWW page with a page template corresponding to the grabbed WWW page, further comprising: determining the page template corresponding to the grabbed WWW page according to the URL of the grabbed WWW page.
6. The method according to claim 1, wherein the terminal is a mobile terminal.
7. An apparatus for processing a World Wide Web, WWW, page, the apparatus comprises:
a first processing unit configured to determine at least one website to be optimized, generate a corresponding page template for each of WWW pages with different types in each website, store the corresponding page template; constantly grab WWW pages from each website, match each grabbed WWW page with a page template corresponding to the grabbed WWW page, filter redundant Hyper Text Mark-up Language (HTML) data from the WWW page according to a matching result, and store the filtered WWW page without the redundant HTML data; and
a second processing unit configured to, after receiving a request sent by a terminal for accessing a WWW page, determine whether there is a filtered WWW page without the redundant HTML data corresponding to the WWW page requested by the terminal stored in the first processing unit, when there is a stored filtered WWW page without the redundant HTML data corresponding to the WWW page requested by the terminal, obtain the filtered WWW page without the redundant HTML data from the first processing unit, and return the filtered WWW page without the redundant HTML data to the terminal.
8. The apparatus according to claim 7, wherein the second processing unit is further configured to, when there is no filtered WWW page without the redundant HTML data corresponding to the WWW page requested by the terminal stored in the first processing unit 21, implement a real-time transformation process for the WWW page requested by the terminal.
9. The apparatus according to claim 7, wherein the first processing unit comprises:
a first processing sub-unit, configured to receive at least one website to be optimized inputted by a background administrator;
a second processing sub-unit, configured to perform the following operations for each website X:
according to a received instruction of the background administrator, obtain one WWW page from each of the various types of WWW pages in the website X,
respectively analyze each obtained WWW page to construct a DOM tree,
according to a received instruction of the background administrator, delete each DOM node unnecessary to be reserved from each DOM tree,
transform each DOM tree in which each DOM node unnecessary to be reserved is deleted into a WWW page respectively, and
store the WWW page as a page template; and
a third processing unit, configured to constantly grab WWW pages from each website and, for each grabbed WWW page Y, perform the following processes respectively:
analyze the WWW page Y to construct a DOM tree, and obtain DOM tree 1,
analyze the page template Y corresponding to the WWW page Y to construct a DOM tree, and obtain DOM tree 2,
for each DOM node in DOM tree 1, determine whether there is a matched DOM node in DOM tree 2, when there is a matched DOM node in DOM tree 2, perform no operations on the DOM node in DOM tree 1, otherwise, delete the DOM node in DOM tree 1,
transform DOM tree 1 in which each DOM node unnecessary to be reserved is deleted into a WWW page, and
transform DOM tree 2 into page template Y.
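The DOM matching step recited in claim 9 can be sketched in Python as follows. This is an illustrative sketch, not the patented implementation: it assumes well-formed markup so that the standard-library `xml.etree.ElementTree` parser can stand in for an HTML DOM parser, and it assumes that two nodes match when they sit on the same path of (tag, `id`, `class`) signatures. The claim itself does not define the matching rule, and all function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def signature(node):
    # Assumed match key: tag name plus the id and class attributes.
    return (node.tag, node.get("id"), node.get("class"))


def template_paths(node, prefix=()):
    """Collect the signature path of every node in the template tree (DOM tree 2)."""
    path = prefix + (signature(node),)
    paths = {path}
    for child in node:
        paths |= template_paths(child, path)
    return paths


def prune(node, allowed, prefix=()):
    """Delete descendants of `node` (DOM tree 1) that have no matched node in DOM tree 2."""
    path = prefix + (signature(node),)
    for child in list(node):  # iterate over a copy because children are removed
        if path + (signature(child),) in allowed:
            prune(child, allowed, path)  # matched: keep the node, check its children
        else:
            node.remove(child)  # unmatched: delete the node and its subtree


def filter_page(page_markup, template_markup):
    tree1 = ET.fromstring(page_markup)      # DOM tree 1: the grabbed WWW page Y
    tree2 = ET.fromstring(template_markup)  # DOM tree 2: the page template
    # The root element is assumed to match and is always kept.
    prune(tree1, template_paths(tree2))
    return ET.tostring(tree1, encoding="unicode")


page = '<html><body><div id="ad">Buy</div><div id="content"><p>Hello</p></div></body></html>'
template = '<html><body><div id="content"><p>x</p></div></body></html>'
filtered = filter_page(page, template)
```

In this example the `div` with `id="ad"` has no counterpart in the template and is removed, while the content node keeps the text of the grabbed page rather than the text of the template. Real-world HTML is rarely well-formed XML, so a practical system would need a tolerant HTML parser in place of `ElementTree`.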
10. The apparatus according to claim 7, wherein the terminal is a mobile terminal.
US13/823,603 2010-12-03 2011-11-21 Method and apparatus for processing world wide web page Active US8739024B2 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201010586269 2010-12-03
CN201010586269.4A CN102486799B (en) 2010-12-03 2010-12-03 World wide web (WWW) page processing method and device
CN201010586269.4 2010-12-03
PCT/CN2011/082504 WO2012071993A1 (en) 2010-12-03 2011-11-21 Processing method and device for world wide web page

Publications (2)

Publication Number Publication Date
US20130238980A1 US20130238980A1 (en) 2013-09-12
US8739024B2 US8739024B2 (en) 2014-05-27

Family

ID=46152292

Family Applications (1)

Application Number Title Priority Date Filing Date
US13/823,603 Active US8739024B2 (en) 2010-12-03 2011-11-21 Method and apparatus for processing world wide web page

Country Status (4)

Country Link
US (1) US8739024B2 (en)
EP (1) EP2605155A4 (en)
CN (1) CN102486799B (en)
WO (1) WO2012071993A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130269014A1 (en) * 2012-04-09 2013-10-10 Justin Brock GERBER Method and apparatus for browser interface, account management, and profile management
CN110955428A (en) * 2019-11-27 2020-04-03 北京奇艺世纪科技有限公司 Page display method and device, electronic equipment and medium
US10754917B2 (en) * 2013-03-04 2020-08-25 Alibaba Group Holding Limited Method and system for displaying customized webpage on double webview

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102880679B (en) * 2012-09-11 2016-01-13 北京易云剪客科技有限公司 A kind of info web storage means and device
CN104239369A (en) * 2013-06-24 2014-12-24 腾讯科技(深圳)有限公司 Method, device and system for filtering out webpage advertisements
WO2015070795A1 (en) * 2013-11-15 2015-05-21 北京奇虎科技有限公司 Method, apparatus, client terminal and system for saving favourite items and providing status change alerts
CN104750463B (en) * 2013-12-26 2018-05-22 任子行网络技术股份有限公司 A kind of developing plug method and system
CN104765592B (en) * 2014-01-03 2018-09-18 任子行网络技术股份有限公司 A kind of plug-in management method and its device of object web page acquisition tasks
CN108280109A (en) * 2017-04-17 2018-07-13 广州市动景计算机科技有限公司 Page data filter method, device and user terminal
CN110968821A (en) * 2018-09-30 2020-04-07 北京国双科技有限公司 Website processing method and device
CN111125587B (en) * 2019-12-31 2023-08-04 北京百度网讯科技有限公司 Webpage structure optimization method, device, equipment and storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010037490A1 (en) * 2000-03-17 2001-11-01 Hiang-Swee Chiang Web application generator
US20040255233A1 (en) * 2003-06-11 2004-12-16 Croney Joseph K. Utilizing common layout and functionality of multiple web pages
US20050108259A1 (en) * 2003-11-14 2005-05-19 Fujitsu Limited Method of and apparatus for gathering information, system for gathering information, and computer program
US6944817B1 (en) * 1997-03-31 2005-09-13 Intel Corporation Method and apparatus for local generation of Web pages
US6955298B2 (en) * 2001-12-27 2005-10-18 Samsung Electronics Co., Ltd. Apparatus and method for rendering web page HTML data into a format suitable for display on the screen of a wireless mobile station
US7047318B1 (en) * 2001-04-20 2006-05-16 Softface, Inc. Method and apparatus for creating and deploying web sites with dynamic content
US20100199197A1 (en) * 2008-11-29 2010-08-05 Handi Mobility Inc Selective content transcoding
US7873680B2 (en) * 2005-02-15 2011-01-18 International Business Machines Corporation Hierarchical inherited XML DOM
US20110066662A1 (en) * 2009-09-14 2011-03-17 Adtuitive, Inc. System and Method for Content Extraction from Unstructured Sources
US7945556B1 (en) * 2008-01-22 2011-05-17 Sprint Communications Company L.P. Web log filtering

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7415538B2 (en) 2001-03-19 2008-08-19 International Business Machines Corporation Intelligent document filtering
CN101276362B (en) 2007-03-26 2011-05-11 国际商业机器公司 Apparatus and method for customizing web page
CN101192234A (en) * 2007-06-07 2008-06-04 腾讯科技(深圳)有限公司 Searching system and method based on web page extraction
US8762556B2 (en) * 2007-06-13 2014-06-24 Apple Inc. Displaying content on a mobile device
US9405847B2 (en) * 2008-06-06 2016-08-02 Apple Inc. Contextual grouping of a page
CN101625700A (en) 2009-08-12 2010-01-13 中兴通讯股份有限公司 Method and device for optimizing display network page on terminal

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6944817B1 (en) * 1997-03-31 2005-09-13 Intel Corporation Method and apparatus for local generation of Web pages
US20010037490A1 (en) * 2000-03-17 2001-11-01 Hiang-Swee Chiang Web application generator
US7047318B1 (en) * 2001-04-20 2006-05-16 Softface, Inc. Method and apparatus for creating and deploying web sites with dynamic content
US6955298B2 (en) * 2001-12-27 2005-10-18 Samsung Electronics Co., Ltd. Apparatus and method for rendering web page HTML data into a format suitable for display on the screen of a wireless mobile station
US20040255233A1 (en) * 2003-06-11 2004-12-16 Croney Joseph K. Utilizing common layout and functionality of multiple web pages
US7529771B2 (en) * 2003-11-14 2009-05-05 Fujitsu Limited Method of and apparatus for gathering information, system for gathering information, and computer program
US20050108259A1 (en) * 2003-11-14 2005-05-19 Fujitsu Limited Method of and apparatus for gathering information, system for gathering information, and computer program
US7873680B2 (en) * 2005-02-15 2011-01-18 International Business Machines Corporation Hierarchical inherited XML DOM
US7945556B1 (en) * 2008-01-22 2011-05-17 Sprint Communications Company L.P. Web log filtering
US8195638B1 (en) * 2008-01-22 2012-06-05 Sprint Communications Company L.P. Web log filtering
US20100199197A1 (en) * 2008-11-29 2010-08-05 Handi Mobility Inc Selective content transcoding
US20110066662A1 (en) * 2009-09-14 2011-03-17 Adtuitive, Inc. System and Method for Content Extraction from Unstructured Sources
US8073865B2 (en) * 2009-09-14 2011-12-06 Etsy, Inc. System and method for content extraction from unstructured sources

Also Published As

Publication number Publication date
CN102486799B (en) 2014-10-15
CN102486799A (en) 2012-06-06
US8739024B2 (en) 2014-05-27
WO2012071993A1 (en) 2012-06-07
EP2605155A4 (en) 2013-08-14
EP2605155A1 (en) 2013-06-19

Similar Documents

Publication Publication Date Title
US8739024B2 (en) Method and apparatus for processing world wide web page
CN101388768B (en) Method and device for detecting malicious HTTP request
CN103577595B (en) Keyword method for pushing and device based on current browse webpage
CN103577596B (en) Keyword search methodology and device based on current browse webpage
CN103577392B (en) Keyword method for pushing and device based on current browse webpage
US9348811B2 (en) Obtaining data from electronic documents
CN105243159A (en) Visual script editor-based distributed web crawler system
CN109614569A (en) Page rendering method, apparatus and intelligent terminal
KR102222287B1 (en) Web Crawler System for Collecting a Structured and Unstructured Data in Hidden URL
US9009850B2 (en) Database management by analyzing usage of database fields
CN106874271A (en) A kind of method and system that PC webpages are converted to mobile terminal webpage
WO2017124692A1 (en) Method and apparatus for searching for conversion relationship between form pages and target pages
CN102646135A (en) Webpage collecting method, device and system
CN102857369A (en) Website log saving system, method and apparatus
CN104391978A (en) Method and device for storing and processing web pages of browsers
CN103177025A (en) Recommendation method and device of answer information by information interactive question and answer system
CN107958009A (en) Company information acquisition methods, device and equipment
CN110808868A (en) Test data acquisition method and device, computer equipment and storage medium
CN109862074B (en) Data acquisition method and device, readable medium and electronic equipment
CN106055591B (en) Weather pushing method and device
CN111797297B (en) Page data processing method and device, computer equipment and storage medium
CN103246680B (en) A kind of method in browser, web page contents polymerization being represented and device
US20190332859A1 (en) Method for identifying main picture in web page
JP5737249B2 (en) Load simulation apparatus, simulation apparatus, load simulation method, simulation method, and program
CN107784054B (en) Page publishing method and device

Legal Events

Date Code Title Description
AS Assignment

Owner name: TENCENT TECHNOLOGY (SHENZHEN) COMPANY LIMITED, CHI

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:RUAN, SHUDONG;YU, XU;PENG, MO;REEL/FRAME:030013/0532

Effective date: 20130311

STCF Information on status: patent grant

Free format text: PATENTED CASE

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551)

Year of fee payment: 4

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 8