WO2007048444A1 - Method of identifying multi zone block edges in an electronic document - Google Patents

Publication number
WO2007048444A1
Authority
WO
WIPO (PCT)
Prior art keywords
block
edge
zone
zones
far
Prior art date
Application number
PCT/EP2005/055518
Other languages
French (fr)
Inventor
Jose Abad
Sherif Yacoub
Original Assignee
Hewlett-Packard Development Company, L.P.
Priority date
Filing date
Publication date
Application filed by Hewlett-Packard Development Company, L.P. filed Critical Hewlett-Packard Development Company, L.P.
Priority to PCT/EP2005/055518 priority Critical patent/WO2007048444A1/en
Publication of WO2007048444A1 publication Critical patent/WO2007048444A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/412 Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • G06V30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text

Abstract

A method of identifying multi-zone blocks, such as rows or columns, within an electronic document comprises dividing the document into zones and looking for partially overlapping zones. The method finds applications in the automated analysis of scanned copies of pages from books, newspapers or magazines.

Description

METHOD OF IDENTIFYING MULTI ZONE BLOCK EDGES IN AN ELECTRONIC DOCUMENT
The present invention relates to a method of identifying multi zone block edges in an electronic document. In a typical application, the invention may be used to find either the vertical edges of columns in a multi column document, or the horizontal edges of rows. The invention finds particular although not exclusive application in the automated analysis of scanned copies of pages of printed magazines, newspapers or books.
A typical page 10 from a magazine is shown in Figure 1, and this may typically be processed using a document analysis system, the main features of which are shown schematically in Figure 2. Once the printed page has been scanned or otherwise electronically captured, it is first analysed using page segmentation and optical character recognition (OCR) components 200 to identify the individual content-bearing zones on the page. As shown in Figure 1, the identified zones may include text zones 24, 26, 28, headings 12, 14, images 32, captions 34 and so on. OCR is carried out within each zone, wherever text can be recognised.
Turning back to Figure 2, the next step in the process is structure analysis 220. As shown at box 260, this includes both the clustering of zones into semantically meaningful blocks or areas, as well as identification of a table of contents (TOC) if one is present. As shown at box 280, the clustering of the zones may be undertaken using a variety of techniques including row and column clustering, inset and background clustering, alignment clustering, layout clustering and font type clustering. Once the structure analysis is complete, control passes to a logical analysis section 240. As shown by the box 300, this includes such things as the detection of advertisements, schematic labelling, article detection, and the reconstruction of text flow and reading order.
One of the most important aspects of carrying out structure analysis is the determination of whether the page under study uses text zones which are laid out in a single column or in multiple columns. Likewise, it is important to know whether the page contains multiple rows of text zones. Multiple rows and/or columns may conveniently be identified using the method of the present invention, as described in more detail below. Once any rows and/or columns have been identified, clustering may be carried out, for example within each row or each column, in an effort to link together zones which are semantically related to each other.
A typical prior art system, which simplistically identifies multiple columns simply by looking for a suitably large white space, is disclosed in US-A-5164899.
According to the present invention there is provided a method of identifying opposing near and far edges of an elongate block of content-bearing zones within an electronic document divided into a plurality of said zones, the method comprising:
(a) identifying a base line to one side of and in a direction parallel to an expected elongate block;
(b) selecting as a first current zone a zone having a near edge which is closest to the base line, in a measurement direction perpendicular to the base line;
(c) defining a near edge of the block as a line coincident with the said near edge;
(d) determining whether a further zone exists which extends further from the base line and which partly overlaps the current zone; and
(i) if so, taking the further zone as a new current zone and repeating (d); and
(ii) if not, defining a far edge of the block as a line coincident with a far edge of the current zone.
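Steps (a) to (d) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each zone is reduced to the pair of coordinates measured perpendicular to the base line, and the `Zone` type, the `find_block_edges` name and the "take the first candidate" choice are hypothetical and not taken from the specification.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Zone:
    near: float  # edge closest to the base line
    far: float   # edge furthest from the base line


def find_block_edges(zones: Sequence[Zone], base_line: float) -> Optional[Tuple[float, float]]:
    """Return (near_edge, far_edge) of the first block beyond base_line, or None."""
    # (b) first current zone: the zone whose near edge is closest to the base line
    beyond = [z for z in zones if z.near >= base_line]
    if not beyond:
        return None
    current = min(beyond, key=lambda z: z.near)
    near_edge = current.near  # (c)
    while True:
        # (d) a further zone partly overlaps the current zone and extends
        #     further from the base line than the current zone does
        further = [z for z in zones if z.near < current.far < z.far]
        if not further:
            return near_edge, current.far  # (d)(ii)
        current = further[0]               # (d)(i): any candidate may be taken
```

The loop always terminates, because the far edge of the current zone strictly increases on every pass and the number of zones is finite.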
It will be understood, of course, that the defining of the near edge of the block as being coincident with the said near edge (as set out in part (c) above) need not necessarily be carried out in the sequential order shown. It could be carried out at a different time (for example at the same time as defining the far edge of the block), or alternatively could be omitted entirely in situations where the only requirement is to identify the far edge of the block, for example the right-hand side of a column or the lower edge of a row.
The invention extends to a computer program for carrying out the method, to a computer-readable carrier carrying such a program, and to a computer system when programmed to carry out the method.
According to a second aspect, there is provided:
A computer system for identifying opposing near and far edges of an elongate block of content-bearing zones within an electronic document divided into a plurality of said zones, the system including:
(a) means for identifying a base line to one side of and in a direction parallel to an expected elongate block;
(b) means for selecting as a first current zone a zone having a near edge which is closest to the base line, in a measurement direction perpendicular to the base line;
(c) means for defining a near edge of the block as a line coincident with the said near edge;
(d) means for determining whether a further zone exists which extends further from the base line and which partly overlaps the current zone;
The invention may be carried into practice in a number of ways and one specific embodiment will now be described, by way of example, with reference to the accompanying drawings, in which:
Figure 1 is an exemplary page from a magazine, after completion of zone identification and OCR;
Figure 2 illustrates schematically the stages of analysis that may be applied to a page such as that shown in Figure 1; and
Figure 3 illustrates the way in which an individual column may be identified according to an embodiment of the present invention.
The method by which an individual column on the page may be identified will now be described with reference to Figure 3. Here, we see a variety of partially overlapping content-bearing zones 310, 320, 330, 340, 350, positioned near a left-hand edge 300 of the page.
The method proceeds as follows. First, a vertical baseline is chosen which lies to the left of the various zones under consideration. In this example, the page edge 300 will be taken as the baseline. Next, we look to the right of the baseline until we reach the left-hand edge of the zone which lies closest to the baseline. In this case, we find zone 340. The left-hand edge of that zone is then defined as the left-hand edge 400 of the potential column.
Next, taking the zone 340 as the current zone, we define a current slider position at the right-hand edge 344 of that zone.
Then, we look for a further zone which partially overlaps the current zone 340 and which has a far edge to the right of the existing slider position 344. Or, to put it another way, we look for a further zone which has a near edge to the left of the current slider position and a right edge to the right of that position. Here, there are four possible candidates, namely zones 310, 320, 330, 350.
In the present embodiment, any one of those zones may be chosen, at random. For the sake of example, let us assume that 350 is chosen. This zone becomes the current zone, the slider position is moved to the right-hand edge 354 of the new current zone, and the process is repeated. This time, there are three zones available for consideration, 310, 320 and 330. If 330 is chosen at random, that then becomes the current zone and the slider is moved to its right-hand edge 334.
Finally, there are two remaining possibilities available for selection, 310 and 320. Let us assume that 320 is chosen. That zone then becomes the current zone and the slider is moved to the right-hand edge of 320.
When the process is repeated, we find that there are no further possibilities available for selection since although the zone 310 does overlap the zone 320, it does not extend any further to the right. When there are no further candidates, the right-hand edge of the current zone 320 is taken to represent the right-hand edge of the possible column. Thus, in the example of Figure 3, we have identified a possible column of zones having a left-hand edge 400 and a right-hand edge 500. Typically, in a multicolumn document, the edge 500 may represent the left-hand edge of the narrow column separator separating this column from an adjacent column. In order to identify any adjacent column, the procedure described above is repeated, but this time taking the vertical line 500 as being the initial baseline.
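The walk-through above can be traced in code. The coordinates below are purely hypothetical, since the specification gives no dimensions for Figure 3; they were chosen only so that the candidate sets at each slider position match the narrative (four candidates, then three, then two, then none). The names `ZONES`, `candidates` and `trace_column` are likewise illustrative.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical (left, right) coordinates for the zones of Figure 3,
# measured from the page edge 300. These numbers are assumptions.
ZONES: Dict[int, Tuple[float, float]] = {
    310: (22, 75),
    320: (28, 80),
    330: (25, 60),
    340: (10, 30),
    350: (20, 45),
}


def candidates(slider: float) -> List[int]:
    """Zones whose left edge is left of the slider and whose right edge is right of it."""
    return sorted(n for n, (left, right) in ZONES.items() if left < slider < right)


def trace_column(rng: random.Random) -> Tuple[float, float, List[int]]:
    """Follow the slider from the zone nearest the baseline to the column's right edge."""
    current = min(ZONES, key=lambda n: ZONES[n][0])  # zone 340 with these numbers
    left_edge = ZONES[current][0]                    # corresponds to edge 400
    slider = ZONES[current][1]
    visited = [current]
    while True:
        options = candidates(slider)
        if not options:
            return left_edge, slider, visited        # slider now marks edge 500
        current = rng.choice(options)                # any candidate may be chosen
        slider = ZONES[current][1]
        visited.append(current)


left, right, path = trace_column(random.Random(0))
print(left, right, path)
```

The sequence of zones visited depends on the random choices, but the resulting edges do not: the slider can only stop at a position that no zone straddles, and it can never jump past such a position.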
Depending upon the application, it is not essential that the further zone is chosen at random from the available partially overlapping zones, as described above. The preferred further zone for selection may be determined or restricted by some algorithm, for example based on the proximity of the possible further zone to the current zone, the zone size, the font size, the overall zone width and so on. One simple alternative is to select as the further zone that zone, out of all the available possibilities, which extends furthest to the right.
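To illustrate how the selection rule can be swapped without changing the result, the sketch below passes the rule in as a function. The zone coordinates and the names `furthest_right`, `smallest_step` and `right_edge` are assumptions made for this example; `smallest_step` is simply a contrasting rule, not one named in the specification.

```python
from typing import Callable, Sequence, Tuple

Zone = Tuple[float, float]  # (left, right); hypothetical representation
Chooser = Callable[[Sequence[Zone]], Zone]


def furthest_right(options: Sequence[Zone]) -> Zone:
    """Pick the candidate that extends furthest to the right."""
    return max(options, key=lambda z: z[1])


def smallest_step(options: Sequence[Zone]) -> Zone:
    """Contrasting rule: pick the candidate that moves the slider least."""
    return min(options, key=lambda z: z[1])


def right_edge(zones: Sequence[Zone], start: Zone, choose: Chooser) -> Tuple[float, int]:
    """Return the column's right edge and the number of slider moves needed."""
    slider, steps = start[1], 0
    while True:
        options = [z for z in zones if z[0] < slider < z[1]]
        if not options:
            return slider, steps
        slider = choose(options)[1]
        steps += 1


zones = [(22, 75), (28, 80), (25, 60), (10, 30), (20, 45)]
print(right_edge(zones, (10, 30), furthest_right))  # (80, 1)
print(right_edge(zones, (10, 30), smallest_step))   # (80, 4)
```

Both rules arrive at the same edge; choosing the zone that extends furthest to the right merely reaches it in fewer iterations.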
One specific algorithm for finding the edges of column separators (that is, the vertical lines 400, 500) is as follows:
1. Exclude zones that are semantically irrelevant. This assumes that some pre-analysis has been performed. For example, excluding headers, margins, page numbers, etc.
2. Set counters columnCount = 0 and slider = 0.
3. ∀ zones, find zone j such that |x_j1 − slider| < |x_i1 − slider| ∀ i ≠ j.
4. If j == null, go to 11.
5. slider = x_j2.
6. Find zone j such that x_j1 < slider and x_j2 > slider.
7. If j ≠ null, go to 5.
8. Check validity of slider. If slider is not valid, go to 5.
9. Set columnSeparator[columnCount] = slider. columnCount = columnCount + 1.
10. Go to 3.
11. If columnCount == 0, columnSeparator[columnCount] = pageWidth, i.e. the page is just one column.
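One possible reading of the numbered steps is sketched below in Python. It is an interpretation, not the definitive algorithm, and it makes three labelled assumptions: x1 and x2 denote the left and right edges of a zone; step 3 only considers zones starting at or beyond the slider, so that zones already swept are not selected again; and when the validity check of step 8 fails, the slider is simply not recorded and scanning resumes at step 3 (the text's "go to 5" is ambiguous when no zone j is available). The function name and the `is_valid` callback are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Zone = Tuple[float, float]  # (x1, x2): left and right edges of a zone


def find_column_separators(
    zones: Sequence[Zone],
    page_width: float,
    is_valid: Callable[[float], bool] = lambda slider: True,
) -> List[float]:
    """Return the slider positions accepted as column separators."""
    separators: List[float] = []  # step 2: columnCount is len(separators)
    slider = 0.0
    while True:
        # Step 3 (assumption: only zones at or beyond the slider compete)
        ahead = [z for z in zones if z[0] >= slider]
        if not ahead:             # step 4: no zone j found
            break
        _, slider = min(ahead, key=lambda z: z[0] - slider)  # step 5
        # Steps 6 and 7: keep moving while some zone straddles the slider
        while True:
            straddling = [z for z in zones if z[0] < slider < z[1]]
            if not straddling:
                break
            slider = straddling[0][1]
        # Step 8 (assumption: an invalid slider is skipped, not recorded)
        if is_valid(slider):
            separators.append(slider)  # step 9
        # Step 10: loop back to step 3
    if not separators:            # step 11: the page is a single column
        separators.append(page_width)
    return separators


two_columns = [(10, 30), (20, 45), (25, 60), (28, 80), (22, 75), (90, 160), (95, 150)]
print(find_column_separators(two_columns, page_width=200))  # [80, 160]
```

Each pass through the outer loop moves the slider strictly to the right, so the procedure terminates once no zone starts at or beyond it.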
As will be evident from step 1 of the above algorithm, it is desirable where possible to carry out at least some pre-processing in order to exclude zones which self-evidently do not form part of any possible column structure. In Figure 1, for example, the main heading 12 extends across the entire width of the page and may thus be excluded from any attempt to identify multiple columns.
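A very simple form of this pre-processing is to drop any zone that spans most of the page width, as a heading such as 12 would. The sketch below assumes zones are given as (x1, x2) pairs; the `exclude_page_wide` name and the 0.9 threshold are illustrative choices, not values from the specification.

```python
from typing import List, Sequence, Tuple

Zone = Tuple[float, float]  # (x1, x2): left and right edges of a zone


def exclude_page_wide(
    zones: Sequence[Zone], page_width: float, max_fraction: float = 0.9
) -> List[Zone]:
    """Keep only zones narrower than max_fraction of the page width."""
    limit = max_fraction * page_width
    return [z for z in zones if (z[1] - z[0]) < limit]


zones = [(5, 195), (10, 80), (90, 160)]   # the first zone is a page-wide heading
print(exclude_page_wide(zones, page_width=200))  # [(10, 80), (90, 160)]
```

A width test alone would not catch a heading that spans only two of three columns, so in practice it would probably be combined with other cues such as zone type or font size.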
Once the column edges have been tentatively identified as described above, further checks may be carried out to exclude column edges which might have been identified as a result of a coincidence of page layout. For example, text alignment checks may be made since text will normally be left-justified, and will often be right-justified, too, within a column. The spacing between adjacent columns may be measured as a function of distance down the page, since spacers of a constant width are more likely to represent column spacers than if the width is variable. It may also be convenient to define a minimum permitted spacer width between adjacent columns, since extremely small gaps can sometimes result from accidental vertical coincidences of word spacing, and may not represent true column separators at all. To further avoid false identification of columns, when the variable slider in the algorithm set out above indicates a column separator, the following additional checks may be made at step 8 to verify its validity:
1. Alignment check. The percentage of the slider length that is covered by zones terminating close to the slider.
2. The number of zones terminating close to the slider.
3. Spacing. The distance between the slider and the beginning of the following column. Too narrow a separation may be a false alarm since column separators tend to be significant in width.
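The three checks might be computed along the following lines. This is a sketch under assumptions: zones carry full bounding boxes, "close to the slider" means within a tolerance `tol`, the slider length is supplied by the caller (for example the page or column height), and all thresholds are placeholder values. None of the names or numbers come from the specification.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Zone:
    x1: float
    y1: float
    x2: float
    y2: float


def covered_length(intervals: Iterable[Tuple[float, float]]) -> float:
    """Total length of the union of one-dimensional intervals."""
    total, end = 0.0, None
    for lo, hi in sorted(intervals):
        if end is None or lo > end:
            total += hi - lo
            end = hi
        elif hi > end:
            total += hi - end
            end = hi
    return total


def slider_metrics(
    zones: Sequence[Zone], slider: float, slider_length: float, tol: float = 2.0
) -> Tuple[float, int, Optional[float]]:
    """Return (coverage fraction, zones ending near slider, gap to next column)."""
    ending = [z for z in zones if abs(z.x2 - slider) <= tol]
    coverage = covered_length((z.y1, z.y2) for z in ending) / slider_length  # check 1
    starts = [z.x1 for z in zones if z.x1 >= slider]
    gap = min(starts) - slider if starts else None                           # check 3
    return coverage, len(ending), gap                                        # check 2 is the count


def is_valid_separator(
    zones: Sequence[Zone],
    slider: float,
    slider_length: float,
    min_coverage: float = 0.5,
    min_zones: int = 2,
    min_gap: float = 5.0,
) -> bool:
    coverage, count, gap = slider_metrics(zones, slider, slider_length)
    # Assumption: with no following column there is no gap to reject.
    return coverage >= min_coverage and count >= min_zones and (gap is None or gap >= min_gap)
```

A function of this kind could serve as the validity test of step 8, with the thresholds tuned to the documents being processed.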
It will be understood, of course, that the embodiment described above may equally well be used to identify rows of zones running across the page.
Instead of the baseline being vertical and to the left of the zones to be analysed, the baseline in such an arrangement would be horizontal and normally above the relevant zones. The slider then moves down the page rather than across it.
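Because only the measurement direction changes, one routine can serve both cases by projecting each zone onto the chosen axis. The sketch below assumes bounding boxes of the form (x1, y1, x2, y2) with y increasing down the page; the `block_edges` name and its parameters are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), y growing down the page


def block_edges(
    boxes: Sequence[Box], axis: str, base_line: float = 0.0
) -> Optional[Tuple[float, float]]:
    """Near and far edges of the first block beyond base_line along the given axis.

    axis="x" finds a column (vertical baseline, slider moves to the right);
    axis="y" finds a row (horizontal baseline, slider moves down the page).
    """
    lo, hi = (0, 2) if axis == "x" else (1, 3)
    spans = [(b[lo], b[hi]) for b in boxes if b[lo] >= base_line]
    if not spans:
        return None
    near, slider = min(spans)  # zone whose near edge is closest to the baseline
    while True:
        straddling = [far for start, far in spans if start < slider < far]
        if not straddling:
            return near, slider
        slider = max(straddling)  # here: the candidate extending furthest


boxes = [(0, 10, 100, 30), (0, 25, 50, 60), (0, 80, 100, 120)]
print(block_edges(boxes, "y"))                # (10, 60)
print(block_edges(boxes, "y", base_line=60))  # (80, 120)
```

In this example the first two boxes overlap vertically and form one row, while the third starts a second row that is found by restarting from the first row's lower edge.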
Once all the columns and/or rows have been identified, that information may be fed into clustering algorithms to assist in clustering the zones into semantically meaningful groups. Typically, zones will be clustered within each column and within each row. Mathematically, the clustering procedure may be defined as follows:
1. Set index = 0.
2. ∀ zone j such that zone j ∉ cluster[i] ∀ i < index && x_j2 < columnSeparator[index], then cluster[index] = cluster[index] ∪ zone j.
3. If index == columnCount, END.
4. Set index++, go to 2.
Once the columns and/or rows are identified, additional statistics may be obtained for each column such as the total area of the column, text area, graphics area, empty area, number of fonts used, average font size, zone density, average zone area, font histogram, etc. These statistics are then used in the logical and semantic analysis phase.
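The clustering procedure and a few of the per-column statistics could be sketched as follows. Several details are assumptions made for the example: zones carry a bounding box and a hypothetical `kind` label, the comparison uses `<=` rather than `<` so that the zone whose right edge defines a separator stays in its column, and the empty area is computed on the assumption that zones do not overlap in two dimensions.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Zone:
    x1: float
    y1: float
    x2: float
    y2: float
    kind: str = "text"  # hypothetical label: "text" or "graphics"

    @property
    def area(self) -> float:
        return (self.x2 - self.x1) * (self.y2 - self.y1)


def cluster_by_column(zones: Sequence[Zone], separators: Sequence[float]) -> List[List[Zone]]:
    """Assign each zone to the first column whose separator lies at or beyond its right edge."""
    clusters: List[List[Zone]] = [[] for _ in separators]
    assigned = set()
    for index, separator in enumerate(separators):
        for j, zone in enumerate(zones):
            # Assumption: <= keeps the zone whose right edge defines the separator.
            if j not in assigned and zone.x2 <= separator:
                clusters[index].append(zone)
                assigned.add(j)
    return clusters


def column_statistics(
    cluster: Sequence[Zone], left: float, right: float, height: float
) -> Dict[str, float]:
    """A few of the statistics listed above, for one column."""
    total = (right - left) * height
    text = sum(z.area for z in cluster if z.kind == "text")
    graphics = sum(z.area for z in cluster if z.kind == "graphics")
    return {
        "total_area": total,
        "text_area": text,
        "graphics_area": graphics,
        "empty_area": total - text - graphics,  # assumes no 2-D overlap between zones
        "zone_count": len(cluster),
        "average_zone_area": (text + graphics) / len(cluster) if cluster else 0.0,
    }
```

Font-related statistics such as the font histogram would need OCR output for each zone and are left out of this sketch.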
The extracted statistics may be used in a variety of different ways to complete the logical analysis of the page under consideration. An exemplary approach may be found in US application 11/189930, filed 27 July 2005.
It will be understood that any or all of the method described above may be implemented using a suitably-programmed general-purpose computer, or alternatively may be implemented in purpose-designed hardware.

Claims

1. A method of identifying opposing near and far edges of an elongate block of content-bearing zones within an electronic document divided into a plurality of said zones, the method comprising:
(a) identifying a base line to one side of and in a direction parallel to an expected elongate block;
(b) selecting as a first current zone a zone having a near edge which is closest to the base line, in a measurement direction perpendicular to the base line;
(c) defining a near edge of the block as a line co-incident with the said near edge;
(d) determining whether a further zone exists which extends further from the base line and which partly overlaps the current zone; and
(i) if so, taking the further zone as a new current zone and repeating (d); and
(ii) if not, defining a far edge of the block as a line coincident with a far edge of the current zone.
2. A method as claimed in claim 1 in which the elongate block is a column and in which the near and far edges are left and right vertical column edges.
3. A method as claimed in claim 1 in which the elongate block is a row and in which the near and far edges are upper and lower horizontal row edges.
4. A method as claimed in any one of the preceding claims including the initial step of excluding from consideration zones which are expected to be semantically irrelevant.
5. A method as claimed in any one of the preceding claims including a verification step of accepting the far edge of the block as valid only if a sufficiently high proportion of the overall block length is covered by zones which terminate at or near the far edge of the block.
6. A method as claimed in any one of claims 1 to 4 including a verification step of accepting the far edge of the block as valid only if a sufficiently high number or proportion of zones terminate at or near the far edge of the block.
7. A method as claimed in any one of claims 1 to 4 including a verification step of accepting the far edge of the block as valid only if the spacing between the said far edge of the block and an edge of an adjacent block is sufficiently large.
8. A method as claimed in any one of claims 1 to 4 including a verification step of accepting the far edge of the block as valid only if the closest spacing between the said far edge of the block and an edge of an adjacent block does not vary substantially along the edge of the block.
9. A computer program comprising a series of instructions which, when executed on a computer, carry out a method as claimed in any one of claims 1 to 8.
10. A computer-readable data carrier carrying a computer program as claimed in claim 9.
11. A computer system for identifying opposing near and far edges of an elongate block of content-bearing zones within an electronic document divided into a plurality of said zones, the system including:
(a) means for identifying a base line to one side of and in a direction parallel to an expected elongate block;
(b) means for selecting as a first current zone a zone having a near edge which is closest to the base line, in a measurement direction perpendicular to the base line;
(c) means for defining a near edge of the block as a line co-incident with the said near edge;
(d) means for determining whether a further zone exists which extends further from the base line and which partly overlaps the current zone;
12. A computer system as claimed in claim 11 including means for excluding from consideration zones which are expected to be semantically irrelevant.
13. A computer system as claimed in claim 11 or claim 12 including means for verifying the far edge of the block as valid.
PCT/EP2005/055518 2005-10-25 2005-10-25 Method of identifying multi zone block edges in an electronic document WO2007048444A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2005/055518 WO2007048444A1 (en) 2005-10-25 2005-10-25 Method of identifying multi zone block edges in an electronic document

Publications (1)

Publication Number Publication Date
WO2007048444A1 true WO2007048444A1 (en) 2007-05-03

Family

ID=36090888

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2005/055518 WO2007048444A1 (en) 2005-10-25 2005-10-25 Method of identifying multi zone block edges in an electronic document

Country Status (1)

Country Link
WO (1) WO2007048444A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0843276A1 (en) * 1996-11-18 1998-05-20 Canon Information Systems, Inc. HTML generator
US6377704B1 (en) * 1998-04-30 2002-04-23 Xerox Corporation Method for inset detection in document layout analysis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HADJAR K ET AL: "Newspaper page decomposition using a split and merge approach", DOCUMENT ANALYSIS AND RECOGNITION, 2001. PROCEEDINGS. SIXTH INTERNATIONAL CONFERENCE ON SEATTLE, WA, USA 10-13 SEPT. 2001, LOS ALAMITOS, CA, USA, IEEE COMPUT. SOC, US, 10 September 2001 (2001-09-10), pages 1186 - 1189, XP010560689, ISBN: 0-7695-1263-1 *

Similar Documents

Publication Publication Date Title
CN105589841B (en) A kind of method of PDF document Table recognition
EP1907946B1 (en) A method for finding text reading order in a document
Shafait et al. Table detection in heterogeneous documents
EP2080113B1 (en) Media material analysis of continuing article portions
US7937653B2 (en) Method and apparatus for detecting pagination constructs including a header and a footer in legacy documents
US8249356B1 (en) Physical page layout analysis via tab-stop detection for optical character recognition
CN106802884B (en) Method for fragmenting text of layout document
CN102541929B (en) Method and device for extracting format file catalogue
CN106250830A (en) Digital book structured analysis processing method
US8208737B1 (en) Methods and systems for identifying captions in media material
Harit et al. Table detection in document images using header and trailer patterns
CN107291682B (en) Multi-electronic-document segmentation algorithm based on skip processing and double verification
Zuyev Table image segmentation
Ohta et al. A cell-detection-based table-structure recognition method
Klampfl et al. An unsupervised machine learning approach to body text and table of contents extraction from digital scientific articles
Anh et al. A hybrid method for table detection from document image
Melinda et al. Parameter-free table detection method
Berg et al. Towards high-quality text stream extraction from PDF. Technical background to the ACL 2012 Contributed Task
Huang et al. Associating text and graphics for scientific chart understanding
WO2007048444A1 (en) Method of identifying multi zone block edges in an electronic document
KR102572180B1 (en) text classification
CN112541505B (en) Text recognition method, text recognition device and computer-readable storage medium
Garz et al. Multi-scale texture-based text recognition in ancient manuscripts
Kaur et al. TxtLineSeg: text line segmentation of unconstrained printed text in Devanagari script
Ohta et al. Table-structure recognition method using neural networks for implicit ruled line estimation and cell estimation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 05801486

Country of ref document: EP

Kind code of ref document: A1