CN103440239A - Functional region recognition-based webpage segmentation method and device - Google Patents


Info

Publication number
CN103440239A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013101765519A
Other languages
Chinese (zh)
Other versions
CN103440239B (en)
Inventor
郭瑞
牛正雨
吴一璞
李乐丁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201310176551.9A
Publication of CN103440239A
Application granted
Publication of CN103440239B
Legal status: Active
Anticipated expiration


Abstract

The invention discloses a webpage segmentation method and device based on functional region recognition. The method comprises the following steps: generating a Document Object Model (DOM) tree for a webpage, the DOM tree containing the content for webpage display; extracting position information and size information of the DOM tree nodes; parsing the border attribute and the margin attribute from the Cascading Style Sheets (CSS) attributes; labeling the webpage using a webpage block labeling algorithm to label functional and semantic regions, and marking the labeled blocks as granularity candidates; scanning the remaining part of the webpage for image-text mixed blocks according to the DOM tree structure, and marking the image-text mixed blocks found as granularity candidates; scanning the remaining blocks and, if the border attribute and margin attribute of a scanned block are not zero, marking that block as a granularity candidate; and marking the remaining unmarked blocks in the DOM tree as granularity candidates.

Description

A webpage segmentation method and device based on functional region recognition
Technical field
The present invention relates to a webpage segmentation method and device, and in particular to a method and device that recognize the functional regions of a webpage and segment the webpage accordingly.
Background
At present, when a browser displays a webpage, it mostly lays out the page by parsing the HTML source code. A page displayed in a browser can be divided according to functional or semantic differences, which makes it possible to determine which part is the main content of the webpage. In addition, the number of users who browse webpages on mobile phones keeps growing, yet the display area of a mobile phone screen is very small, so it is necessary to determine which content in a webpage should be grouped together for display.
Because webpages on the Internet are very complex, and many webpages do not follow the standards of the webpage markup language, it is difficult to define a single standard for deciding which blocks a webpage should be segmented into. Therefore, a method is needed that segments a webpage at a suitable granularity, so that subsequent applications of the webpage structure and content are better targeted.
Summary of the invention
The present invention proposes a webpage segmentation method and device based on functional region recognition. The method divides a webpage into a single tiled layer of granularity units; each granularity unit is an independent functional, semantic, or content region, and applications can use the granularity unit as their unit of work. For example, a mobile browser can push content to the user one granularity unit at a time; an importance score can be computed for each output granularity unit, and the webpage content can be analyzed according to importance.
According to one aspect of an exemplary embodiment of the present invention, a webpage segmentation method based on functional region recognition is provided. The method comprises: generating a Document Object Model (DOM) tree for a webpage, the DOM tree containing the content for webpage display; extracting position information and size information of the DOM tree nodes; parsing the border attribute and the margin attribute from the Cascading Style Sheets (CSS) attributes; labeling the webpage using a webpage block labeling algorithm to label functional and semantic regions, and marking the labeled blocks as granularity candidates; scanning the remaining part of the webpage for image-text mixed blocks according to the DOM tree structure, and marking the image-text mixed blocks found as granularity candidates; scanning the remaining blocks and, if the border attribute and margin attribute of a scanned block are not 0, marking that block as a granularity candidate; and marking the remaining unmarked blocks in the DOM tree as granularity candidates.
The functional and semantic regions may include a navigation bar, breadcrumbs, copyright, a pagination bar, and a picture frame.
An image-text mixed block may be a block that satisfies the following conditions: the image size is greater than 5000 pixels; there is a text node in addition to the image node; the number of links is less than or equal to 5; and the sibling blocks have the same DOM tree structure.
The webpage segmentation method may further comprise: merging blocks among all of the granularity candidates according to merging conditions, and erasing the candidate marks of the sub-blocks of each merged block.
Sub-blocks may be merged if the sub-blocks are image-text mixed blocks, the sub-blocks to be merged do not contain blocks of two or more different functional and semantic types, and the main nodes of the sub-blocks do not contain two or more types of tags among TAG_UL, TAG_TABLE, and TAG_FORM.
Sub-blocks may be merged if the sub-blocks are plain-text blocks, the sub-blocks to be merged do not contain blocks of two or more different functional and semantic types, and the main nodes of the sub-blocks do not contain two or more types of tags among TAG_UL, TAG_TABLE, and TAG_FORM.
Sub-blocks may be merged if the sub-blocks have the same structure in the DOM tree, the sub-blocks to be merged do not contain blocks of two or more different functional and semantic types, and the main nodes of the sub-blocks do not contain two or more types of tags among TAG_UL, TAG_TABLE, and TAG_FORM.
According to another aspect of an exemplary embodiment of the present invention, a webpage segmentation device based on functional region recognition is provided. The device comprises: a Document Object Model (DOM) tree generation unit, which generates a DOM tree for a webpage, the DOM tree containing the content for webpage display; a DOM information extraction unit, which extracts position information and size information of the DOM tree nodes; a Cascading Style Sheets (CSS) parsing unit, which parses the border attribute and the margin attribute from the CSS attributes; a labeling unit, which labels the webpage to label functional and semantic regions; an image-text mixed block scanning unit, which scans the remaining part of the webpage for image-text mixed blocks according to the DOM tree structure; a remaining-block scanning unit, which scans the remaining blocks for blocks whose border attribute and margin attribute are not 0; and a granularity candidate marking unit, which marks the blocks labeled by the labeling unit as granularity candidates, marks the image-text mixed blocks found by the image-text mixed block scanning unit as granularity candidates, marks the blocks found by the remaining-block scanning unit as granularity candidates, and marks the remaining unmarked blocks in the DOM tree as granularity candidates.
The webpage segmentation device may further comprise a merging unit, which merges blocks among all of the granularity candidates marked by the granularity candidate marking unit according to merging conditions, and erases the candidate marks of the sub-blocks of each merged block.
Brief description of the drawings
The above and other objects and features of the present invention will become apparent from the following description taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a block diagram of a webpage segmentation device based on functional region recognition according to an exemplary embodiment of the present invention;
Figs. 2 to 5 are schematic diagrams of a webpage segmentation example based on functional region recognition according to an exemplary embodiment of the present invention;
Fig. 6 is a flowchart of a webpage segmentation method based on functional region recognition according to an exemplary embodiment of the present invention.
Detailed description
The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of the exemplary embodiments of the present invention as defined by the claims and their equivalents. The description includes various specific details to assist in that understanding, but these details are to be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present invention. In addition, descriptions of well-known functions and constructions may be omitted for clarity and conciseness.
Fig. 1 is a block diagram of a webpage segmentation device based on functional region recognition according to an exemplary embodiment of the present invention.
With reference to Fig. 1, the webpage segmentation device 100 comprises a Document Object Model (DOM) tree generation unit 110, a DOM information extraction unit 120, a Cascading Style Sheets (CSS) parsing unit 130, a labeling unit 140, an image-text mixed block scanning unit 150, a remaining-block scanning unit 160, and a granularity candidate marking unit 170.
The DOM tree generation unit 110 generates a DOM tree for a webpage; the DOM tree contains the content for webpage display. The DOM information extraction unit 120 extracts position information and size information of the DOM tree nodes. The CSS parsing unit 130 parses the border attribute and the margin attribute from the CSS attributes. Here, the border edge is a boundary line in the page; if the border width is set to zero, the border edge coincides with another CSS edge, the padding edge. The margin area lies outside the node and specifies how much blank space should be kept around the node; if its width is set to zero, the margin edge coincides with the border edge. The labeling unit 140 labels the webpage using a webpage block labeling algorithm to label functional and semantic regions, such as a navigation bar, breadcrumbs, copyright, a pagination bar, and a picture frame. The patent application with publication number CN102637172A provides a webpage block labeling method and system. That method can automatically generate block-labeling training samples using a machine learning algorithm and iterate automatically; combined with manually set training samples, it derives classification rules and builds a classification model to perform webpage block labeling. With that method, functional and semantic regions such as the navigation bar, breadcrumbs, copyright, pagination bar, and picture frame can be labeled; a detailed description is omitted here. The image-text mixed block scanning unit 150 scans the remaining part of the webpage for image-text mixed blocks according to the DOM tree structure. The remaining-block scanning unit 160 scans the remaining blocks for blocks whose border attribute and margin attribute are not 0.
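The border and margin test applied by the remaining-block scanning unit can be illustrated with a minimal Python sketch. The names `BoxEdges`, `is_nonzero`, and `is_css_separated` are hypothetical and do not appear in the patent. The sketch assumes that an attribute counts as "not 0" when any of its four sides has a non-zero width, and it follows the literal reading that both the border and the margin must be non-zero; the patent does not spell out either detail.

```python
from dataclasses import dataclass
from typing import Tuple

Sides = Tuple[float, float, float, float]  # (top, right, bottom, left) widths in CSS pixels


@dataclass(frozen=True)
class BoxEdges:
    """Hypothetical container for the two CSS attributes the parser extracts."""
    border: Sides = (0, 0, 0, 0)
    margin: Sides = (0, 0, 0, 0)


def is_nonzero(widths: Sides) -> bool:
    # Assumption: one non-zero side is enough for the attribute to count as "not 0".
    return any(w != 0 for w in widths)


def is_css_separated(box: BoxEdges) -> bool:
    # Literal reading of the text: border AND margin must both be non-zero.
    return is_nonzero(box.border) and is_nonzero(box.margin)
```

A block with zero border width and zero margin has edges that coincide with those of its neighbours, so under this reading it offers no visual separation and is left for the final pass.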
Image-text mixed layout is a common structure in webpage layout and may appear inside various types of webpages (for example, "homepage" type, "list" type, and so on). An image-text mixed block is a block that satisfies the following conditions: the image is large enough, for example larger than 5000 pixels; there is a text node in addition to the image node; the number of links is less than or equal to 5; and the sibling blocks have the same DOM tree structure and are themselves image-text mixed blocks.
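The four conditions can be sketched as a predicate over a toy DOM. This is an illustrative sketch, not the patented implementation: the `Node` class and all function names are hypothetical, "image size" is assumed to mean width times height, links are assumed to be `a` elements, and a block with no siblings is assumed to pass the sibling condition.

```python
from dataclasses import dataclass, field
from typing import List

MIN_IMAGE_AREA = 5000  # threshold from the text; interpreting it as an area is an assumption
MAX_LINKS = 5


@dataclass
class Node:
    """Hypothetical simplified DOM node."""
    tag: str
    width: int = 0
    height: int = 0
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def walk(node):
    """Yield the node and all of its descendants."""
    yield node
    for child in node.children:
        yield from walk(child)


def signature(node):
    """Structural fingerprint: tag names and nesting, ignoring text and sizes."""
    return (node.tag, tuple(signature(c) for c in node.children))


def meets_local_conditions(block):
    nodes = list(walk(block))
    has_big_image = any(
        n.tag == "img" and n.width * n.height > MIN_IMAGE_AREA for n in nodes
    )
    has_text = any(n.tag != "img" and n.text.strip() for n in nodes)
    link_count = sum(1 for n in nodes if n.tag == "a")
    return has_big_image and has_text and link_count <= MAX_LINKS


def is_image_text_block(block, siblings):
    """`siblings` are the block's sibling blocks, excluding the block itself."""
    if not meets_local_conditions(block):
        return False
    sig = signature(block)
    return all(signature(s) == sig and meets_local_conditions(s) for s in siblings)
```

Comparing structural signatures rather than content is one simple way to express "the same DOM tree structure"; a production system might tolerate small differences between siblings.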
The granularity candidate marking unit 170 marks the blocks labeled by the labeling unit 140 as granularity candidates, marks the image-text mixed blocks found by the image-text mixed block scanning unit 150 as granularity candidates, marks the blocks found by the remaining-block scanning unit 160 as granularity candidates, and marks the blocks in the DOM tree that remain unmarked after the above marking operations as granularity candidates.
Preferably, the webpage segmentation device 100 further comprises a merging unit 180. When blocks marked by the granularity candidate marking unit 170 satisfy a merging condition, the merging unit 180 merges the blocks that satisfy it. The merging conditions are as follows:
1) the sub-blocks are image-text mixed blocks, the blocks to be merged do not contain blocks of two or more different functional and semantic types, and the main nodes of the sub-blocks do not contain two or more types of tags among TAG_UL, TAG_TABLE, and TAG_FORM; or
2) the sub-blocks are plain-text blocks, the blocks to be merged do not contain blocks of two or more different functional and semantic types, and the main nodes of the sub-blocks do not contain two or more types of tags among TAG_UL, TAG_TABLE, and TAG_FORM; or
3) the sub-blocks have the same structure in the DOM tree, the blocks to be merged do not contain blocks of two or more different functional and semantic types, and the main nodes of the sub-blocks do not contain two or more types of tags among TAG_UL, TAG_TABLE, and TAG_FORM.
Afterwards, the merging unit 180 erases the candidate marks of the sub-blocks of each merged block. The blocks that still carry a candidate mark constitute a segmentation of the current page at a suitable granularity.
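The three merging conditions share two guards and differ only in the first clause, which the following Python sketch makes explicit. The `Block` class and its fields are hypothetical, and the threshold "two or more tag types" reflects one reading of the translated text rather than a confirmed detail of the patent.

```python
from dataclasses import dataclass, field
from typing import List, Optional

SPECIAL_TAGS = frozenset({"TAG_UL", "TAG_TABLE", "TAG_FORM"})


@dataclass
class Block:
    """Hypothetical block record; field names are illustrative only."""
    kind: str                            # "image_text", "text", or "other"
    function_type: Optional[str] = None  # e.g. "navigation"; None if unlabeled
    main_tags: frozenset = frozenset()   # tag types found on the block's main node
    structure: tuple = ()                # structural fingerprint in the DOM tree
    candidate: bool = True
    children: List["Block"] = field(default_factory=list)


def can_merge(sub_blocks):
    if not sub_blocks:
        return False
    # Guard shared by all three conditions: at most one functional/semantic type.
    types = {b.function_type for b in sub_blocks if b.function_type}
    if len(types) >= 2:
        return False
    # Shared guard: no main node mixes two or more of UL/TABLE/FORM (assumed threshold).
    if any(len(SPECIAL_TAGS & b.main_tags) >= 2 for b in sub_blocks):
        return False
    all_image_text = all(b.kind == "image_text" for b in sub_blocks)  # condition 1
    all_plain_text = all(b.kind == "text" for b in sub_blocks)        # condition 2
    same_structure = len({b.structure for b in sub_blocks}) == 1      # condition 3
    return all_image_text or all_plain_text or same_structure


def merge(parent):
    """Merge the parent's sub-blocks if allowed and erase their candidate marks."""
    if not can_merge(parent.children):
        return False
    for child in parent.children:
        child.candidate = False
    parent.candidate = True
    return True
```

After `merge` succeeds, only the parent carries a candidate mark, which matches the statement that the remaining marked blocks form the final segmentation.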
Figs. 2 to 5 are schematic diagrams of a webpage segmentation example based on functional region recognition according to an exemplary embodiment of the present invention.
With reference to Figs. 2 to 5, an example of webpage segmentation based on functional region recognition according to an exemplary embodiment of the present invention is described in detail.
Taking a Baidu Zhidao (Baidu Knows) page as an example, visual information and CSS attributes are first added to the DOM tree and the page is laid out; the original webpage is shown in Fig. 2.
A webpage block labeling algorithm is used to label the blocks of the original webpage and identify the functional and semantic regions in it, such as the picture block, navigation bar, interaction block, and breadcrumbs outlined in red in Fig. 3.
According to the border attribute and margin attribute in the CSS attributes, the remaining blocks are scanned for blocks that qualify, as outlined in blue in Fig. 4. Because this page has no image-text mixed structure, the recognition of image-text mixed blocks is omitted.
At this point, some blocks remain in the upper right corner of the page. These blocks have no specific function, and their border attribute and margin attribute are 0, so these remaining blocks are merged into one nearby block, as outlined in green in Fig. 5.
This completes the webpage segmentation. The webpage is finally segmented into 8 blocks: 4 functional blocks, 3 blocks separated according to CSS attributes, and 1 block merged from the leftover blocks. These 8 blocks are a suitable segmentation granularity for this webpage.
Fig. 6 is a flowchart of a webpage segmentation method based on functional region recognition according to an exemplary embodiment of the present invention.
With reference to Fig. 6, at step S601, a Document Object Model (DOM) tree is generated for a webpage; the DOM tree contains the content for webpage display.
At step S602, position information and size information of the DOM tree nodes are extracted.
At step S603, the border attribute and the margin attribute are parsed from the CSS attributes.
At step S604, the webpage is labeled using a webpage block labeling algorithm to label functional and semantic regions, and the labeled blocks are marked as granularity candidates. The operation of labeling a webpage using a webpage block labeling algorithm is disclosed in the patent application with publication number CN102637172A, so a detailed description is omitted here.
At step S605, the remaining part of the webpage is scanned for image-text mixed blocks according to the DOM tree structure, and the image-text mixed blocks found are marked as granularity candidates. Here, an image-text mixed block is a block that satisfies the following conditions: the image is large enough, for example larger than 5000 pixels; there is a text node in addition to the image node; the number of links is less than or equal to 5; and the sibling blocks have the same DOM tree structure and are themselves image-text mixed blocks.
At step S606, the remaining blocks are scanned, and if the border attribute and margin attribute of a scanned block are not 0, that block is marked as a granularity candidate.
At step S607, the remaining unmarked blocks in the DOM tree are marked as granularity candidates.
The webpage segmentation method based on functional region recognition according to an exemplary embodiment of the present invention may further comprise step S608. At step S608, blocks among all of the granularity candidates marked in steps S604 to S607 are merged according to the merging conditions, and the candidate marks of the sub-blocks of each merged block are erased. The merging conditions have been described above.
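The order of the marking passes in steps S604 to S607 can be summarized in a short Python sketch. The function `mark_candidates` and its callback parameters are hypothetical; the block labeling algorithm of CN102637172A and the two scanning predicates are passed in as plain callables, because their internals are outside the scope of this sketch.

```python
from typing import Callable, Dict, Hashable, Iterable, Optional


def mark_candidates(
    blocks: Iterable[Hashable],
    label_of: Callable[[Hashable], Optional[str]],
    is_image_text: Callable[[Hashable], bool],
    is_css_separated: Callable[[Hashable], bool],
) -> Dict[Hashable, str]:
    """Return a mapping from each block to the pass that marked it as a candidate."""
    passes = [
        ("functional", lambda b: label_of(b) is not None),  # S604: labeled regions
        ("image_text", is_image_text),                      # S605: image-text mixed blocks
        ("css", is_css_separated),                          # S606: non-zero border and margin
        ("leftover", lambda b: True),                       # S607: everything still unmarked
    ]
    reasons: Dict[Hashable, str] = {}
    remaining = list(blocks)
    for reason, predicate in passes:
        still_unmarked = []
        for block in remaining:
            if predicate(block):
                reasons[block] = reason
            else:
                still_unmarked.append(block)
        # Each later pass only sees blocks that earlier passes left unmarked.
        remaining = still_unmarked
    return reasons
```

Because each pass works only on what the previous passes left behind, every block receives exactly one mark, and a block that satisfies several predicates is attributed to the earliest pass.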
The present invention builds on an underlying block labeling system and uses high-level features such as the block labeling results to recognize the functional regions of a page, thereby completing the segmentation of the webpage. The present invention segments the webpage into regions with different functions and semantics, merges the remaining blocks on that basis, and finally forms a single tiled layer of blocks that do not intersect one another and together cover the whole page, that is, a suitable segmentation granularity for the webpage. Each granularity unit is an independent functional, semantic, or content region, and applications can use the granularity unit as their unit of work. For example, a mobile browser can push content to the user one granularity unit at a time; an importance score can be computed for each output granularity unit, and the webpage content can be analyzed according to importance.
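The tiling property stated above (no two blocks intersect, and together they cover the whole page) can be checked with a small Python sketch. It assumes that every final block is an axis-aligned rectangle taken from the position and size information of its DOM node; the `Rect` type and the function names are hypothetical.

```python
from itertools import combinations
from typing import NamedTuple


class Rect(NamedTuple):
    """Hypothetical block geometry: top-left corner plus width and height, in pixels."""
    x: int
    y: int
    w: int
    h: int


def overlaps(a: Rect, b: Rect) -> bool:
    # Rectangles that only share an edge do not count as overlapping.
    return a.x < b.x + b.w and b.x < a.x + a.w and a.y < b.y + b.h and b.y < a.y + a.h


def inside(r: Rect, page: Rect) -> bool:
    return (page.x <= r.x and page.y <= r.y
            and r.x + r.w <= page.x + page.w
            and r.y + r.h <= page.y + page.h)


def is_tiling(page: Rect, blocks) -> bool:
    if any(not inside(b, page) for b in blocks):
        return False
    if any(overlaps(a, b) for a, b in combinations(blocks, 2)):
        return False
    # Disjoint rectangles inside the page cover it exactly when their areas add up to its area.
    return sum(b.w * b.h for b in blocks) == page.w * page.h
```

Real pages contain floating and absolutely positioned elements, so a geometric check of this kind is at best a sanity test on the output of the segmentation.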
Although the present invention has been shown and described with reference to certain exemplary embodiments, those skilled in the art will understand that various changes in form and detail may be made without departing from the spirit and scope of the present invention as defined by the claims and their equivalents.

Claims (14)

1. A webpage segmentation method based on functional region recognition, the method comprising:
generating a Document Object Model (DOM) tree for a webpage, the DOM tree containing the content for webpage display;
extracting position information and size information of the DOM tree nodes;
parsing the border attribute and the margin attribute from the Cascading Style Sheets (CSS) attributes;
labeling the webpage using a webpage block labeling algorithm to label functional and semantic regions, and marking the labeled blocks as granularity candidates;
scanning the remaining part of the webpage for image-text mixed blocks according to the DOM tree structure, and marking the image-text mixed blocks found as granularity candidates;
scanning the remaining blocks and, if the border attribute and margin attribute of a scanned block are not 0, marking that block as a granularity candidate;
marking the remaining unmarked blocks in the DOM tree as granularity candidates.
2. The webpage segmentation method according to claim 1, wherein the functional and semantic regions include a navigation bar, breadcrumbs, copyright, a pagination bar, and a picture frame.
3. The webpage segmentation method according to claim 1, wherein an image-text mixed block is a block that satisfies the following conditions: the image size is greater than 5000 pixels; there is a text node in addition to the image node; the number of links is less than or equal to 5; and the sibling blocks have the same DOM tree structure.
4. The webpage segmentation method according to claim 1, further comprising: merging blocks among all of the granularity candidates according to merging conditions, and erasing the candidate marks of the sub-blocks of each merged block.
5. The webpage segmentation method according to claim 4, wherein the sub-blocks are merged if the sub-blocks are image-text mixed blocks, the sub-blocks to be merged do not contain blocks of two or more different functional and semantic types, and the main nodes of the sub-blocks do not contain two or more types of tags among TAG_UL, TAG_TABLE, and TAG_FORM.
6. The webpage segmentation method according to claim 4, wherein the sub-blocks are merged if the sub-blocks are plain-text blocks, the sub-blocks to be merged do not contain blocks of two or more different functional and semantic types, and the main nodes of the sub-blocks do not contain two or more types of tags among TAG_UL, TAG_TABLE, and TAG_FORM.
7. The webpage segmentation method according to claim 4, wherein the sub-blocks are merged if the sub-blocks have the same structure in the DOM tree, the sub-blocks to be merged do not contain blocks of two or more different functional and semantic types, and the main nodes of the sub-blocks do not contain two or more types of tags among TAG_UL, TAG_TABLE, and TAG_FORM.
8. A webpage segmentation device based on functional region recognition, the device comprising:
a Document Object Model (DOM) tree generation unit, which generates a DOM tree for a webpage, the DOM tree containing the content for webpage display;
a DOM information extraction unit, which extracts position information and size information of the DOM tree nodes;
a Cascading Style Sheets (CSS) parsing unit, which parses the border attribute and the margin attribute from the CSS attributes;
a labeling unit, which labels the webpage to label functional and semantic regions;
an image-text mixed block scanning unit, which scans the remaining part of the webpage for image-text mixed blocks according to the DOM tree structure;
a remaining-block scanning unit, which scans the remaining blocks for blocks whose border attribute and margin attribute are not 0;
a granularity candidate marking unit, which marks the blocks labeled by the labeling unit as granularity candidates, marks the image-text mixed blocks found by the image-text mixed block scanning unit as granularity candidates, marks the blocks found by the remaining-block scanning unit as granularity candidates, and marks the remaining unmarked blocks in the DOM tree as granularity candidates.
9. The webpage segmentation device according to claim 8, wherein the functional and semantic regions include a navigation bar, breadcrumbs, copyright, a pagination bar, and a picture frame.
10. The webpage segmentation device according to claim 8, wherein an image-text mixed block is a block that satisfies the following conditions: the image size is greater than 5000 pixels; there is a text node in addition to the image node; the number of links is less than or equal to 5; and the sibling blocks have the same DOM tree structure.
11. The webpage segmentation device according to claim 8, further comprising: a merging unit, which merges blocks among all of the granularity candidates marked by the granularity candidate marking unit according to merging conditions, and erases the candidate marks of the sub-blocks of each merged block.
12. The webpage segmentation device according to claim 11, wherein the merging unit merges the sub-blocks if the sub-blocks are image-text mixed blocks, the sub-blocks to be merged do not contain blocks of two or more different functional and semantic types, and the main nodes of the sub-blocks do not contain two or more types of tags among TAG_UL, TAG_TABLE, and TAG_FORM.
13. The webpage segmentation device according to claim 11, wherein the merging unit merges the sub-blocks if the sub-blocks are plain-text blocks, the sub-blocks to be merged do not contain blocks of two or more different functional and semantic types, and the main nodes of the sub-blocks do not contain two or more types of tags among TAG_UL, TAG_TABLE, and TAG_FORM.
14. The webpage segmentation device according to claim 11, wherein the merging unit merges the sub-blocks if the sub-blocks have the same structure in the DOM tree, the sub-blocks to be merged do not contain blocks of two or more different functional and semantic types, and the main nodes of the sub-blocks do not contain two or more types of tags among TAG_UL, TAG_TABLE, and TAG_FORM.
CN201310176551.9A 2013-05-14 2013-05-14 Webpage segmentation method and device based on functional region recognition Active CN103440239B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310176551.9A CN103440239B (en) 2013-05-14 2013-05-14 Webpage segmentation method and device based on functional region recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310176551.9A CN103440239B (en) 2013-05-14 2013-05-14 Webpage segmentation method and device based on functional region recognition

Publications (2)

Publication Number Publication Date
CN103440239A true CN103440239A (en) 2013-12-11
CN103440239B CN103440239B (en) 2016-08-10

Family

ID=49693930

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310176551.9A Active CN103440239B (en) 2013-05-14 2013-05-14 Webpage segmentation method and device based on functional region recognition

Country Status (1)

Country Link
CN (1) CN103440239B (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050071755A1 (en) * 2003-07-30 2005-03-31 Xerox Corporation Multi-versioned documents and method for creation and use thereof
CN101251855A (en) * 2008-03-27 2008-08-27 腾讯科技(深圳)有限公司 Equipment, system and method for cleaning internet web page
CN102637172A (en) * 2011-02-10 2012-08-15 北京百度网讯科技有限公司 Webpage blocking marking method and system
CN102841920A (en) * 2012-06-30 2012-12-26 北京百度网讯科技有限公司 Method and device for extracting webpage frame information

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105677827A (en) * 2016-01-04 2016-06-15 百度在线网络技术(北京)有限公司 Method and device for obtaining form
CN105677827B (en) * 2016-01-04 2019-03-29 百度在线网络技术(北京)有限公司 A kind of acquisition methods and device of list
CN109657208A (en) * 2017-10-10 2019-04-19 株式会社理光 Webpage similarity calculating method, device, equipment, computer readable storage medium
CN109492177A (en) * 2018-11-02 2019-03-19 中国搜索信息科技股份有限公司 A kind of web page release method based on web page semantics structure
CN111353115A (en) * 2018-12-24 2020-06-30 中移(杭州)信息技术有限公司 Method and device for generating Spanish chart
CN111353115B (en) * 2018-12-24 2023-10-27 中移(杭州)信息技术有限公司 Method and device for generating snowplow map
CN112906559A (en) * 2021-02-10 2021-06-04 网易有道信息技术(北京)有限公司 Machine-implemented method for correcting formulas and related product
CN113806665A (en) * 2021-09-24 2021-12-17 刘秀萍 Webpage blocking method based on non-patterned Web data model
CN114186164A (en) * 2021-12-17 2022-03-15 北京大学 Method and system for determining and dividing boundary of webpage content block
CN114186164B (en) * 2021-12-17 2023-06-09 北京大学 Method and system for determining and dividing boundary of webpage content block

Also Published As

Publication number Publication date
CN103440239B (en) 2016-08-10


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant