WO2007048444A1

WO2007048444A1 - Method of identifying multi zone block edges in an electronic document

Info

Publication number: WO2007048444A1
Application number: PCT/EP2005/055518
Authority: WO
Inventors: Jose Abad; Sherif Yacoub
Original assignee: Hewlett-Packard Development Company, L.P.
Priority date: 2005-10-25
Filing date: 2005-10-25
Publication date: 2007-05-03

Abstract

A method of identifying multi-zone blocks, such as rows or columns, within an electronic document comprises dividing the document into zones and looking for partially overlapping zones. The method finds applications in the automated analysis of scanned copies of pages from books, newspapers or magazines.

Description

METHOD OF IDENTIFYING MULTI ZONE BLOCK EDGES IN AN ELECTRONIC DOCUMENT

The present invention relates to a method of identifying multi zone block edges in an electronic document. In a typical application, the invention may be used to find either the vertical edges of columns in a multi column document, or the horizontal edges of rows. The invention finds particular although not exclusive application in the automated analysis of scanned copies of pages of printed magazines, newspapers or books.

A typical page 10 from a magazine is shown in Figure 1, and this may typically be processed using a document analysis system, the main features of which are shown schematically in Figure 2. Once the printed page has been scanned or otherwise electronically captured, it is first analysed using page segmentation and optical character recognition (OCR) components 200 to identify the individual content-bearing zones on the page. As shown in Figure 1, the identified zones may include text zones 24, 26, 28, headings 12, 14, images 32, captions 34 and so on. OCR is carried out within each zone, wherever text can be recognised.

Turning back to Figure 2, the next step in the process is structure analysis 220. As shown at box 260, this includes both the clustering of zones into semantically meaningful blocks or areas, as well as identification of a table of contents (TOC) if one is present. As shown at box 280, the clustering of the zones may be undertaken using a variety of techniques including row and column clustering, inset and background clustering, alignment clustering, layout clustering and font type clustering. Once the structure analysis is complete, control passes to a logical analysis section 240. As shown by the box 300, this includes such things as the detection of advertisements, schematic labelling, article detection, and the reconstruction of text flow and reading order.

One of the most important aspects of carrying out structure analysis is the determination of whether the page under study uses text zones which are laid out in a single column or in multiple columns. Likewise, it is important to know whether the page contains multiple rows of text zones. Multiple rows and/or columns may conveniently be identified using the method of the present invention, as described in more detail below. Once any rows and/or columns have been identified, clustering may be carried out, for example within each row or each column, in an effort to link together zones which are semantically related to each other.

A typical prior art system, which identifies multiple columns simply by looking for a suitably large white space, is disclosed in US-A-5164899.

According to the present invention there is provided a method of identifying opposing near and far edges of an elongate block of content-bearing zones within an electronic document divided into a plurality of said zones, the method comprising:

(a) identifying a base line to one side of and in a direction parallel to an expected elongate block;

(b) selecting as a first current zone a zone having a near edge which is closest to the base line, in a measurement direction perpendicular to the base line;

(c) defining a near edge of the block as a line co-incident with the said near edge;

(d) determining whether a further zone exists which extends further from the base line and which partly overlaps the current zone; and

(i) if so, taking the further zone as a new current zone and repeating (d); and

(ii) if not, defining a far edge of the block as a line coincident with a far edge of the current zone.

It will be understood, of course, that the defining of the near edge of the block as being co-incident with the said near edge (as set out in part (c) above) need not necessarily be carried out in the sequential order shown. It could be carried out at a different time (for example at the same time as defining the far edge of the block), or alternatively could be omitted entirely in situations where the only requirement is to identify the far edge of the block, for example the right-hand side of a column or the lower edge of a row.
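The steps (a) to (d) set out above can be sketched in a few lines of Python. This is only an illustrative sketch, not the claimed implementation: it assumes each zone has already been reduced to a hypothetical `(near, far)` pair of distances from the base line, and the function name `find_block_edges` is invented for the example. "Partly overlaps and extends further" is interpreted here as the further zone starting before the current far edge and ending beyond it.

```python
from typing import List, Optional, Tuple

# Assumed representation: a zone is its extent along the measurement
# direction, (near, far), both measured as distances from the base line.
Zone = Tuple[float, float]


def find_block_edges(zones: List[Zone]) -> Optional[Tuple[float, float]]:
    """Return (near_edge, far_edge) of the block closest to the base line."""
    if not zones:
        return None
    # (b) first current zone: the one whose near edge is closest to the base line
    current = min(zones, key=lambda z: z[0])
    # (c) the near edge of the block coincides with that zone's near edge
    near_edge = current[0]
    while True:
        # (d) look for a further zone that starts before the current far edge
        # (partial overlap) and ends beyond it (extends further)
        further = next((z for z in zones if z[0] < current[1] < z[1]), None)
        if further is None:
            # (ii) no such zone: the block's far edge is the current far edge
            return near_edge, current[1]
        # (i) take the further zone as the new current zone and repeat (d)
        current = further


print(find_block_edges([(10, 40), (30, 75), (60, 90), (120, 200)]))  # → (10, 90)
```

The zone `(120, 200)` is not reached because nothing bridges the gap between 90 and 120, so it would belong to a later block.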

The invention extends to a computer program for carrying out the method, to a computer-readable carrier carrying such a program, and to a computer system when programmed to carry out the method.

According to a second aspect, there is provided:

A computer system for identifying opposing near and far edges of an elongate block of content-bearing zones within an electronic document divided into a plurality of said zones, the system including:

(a) means for identifying a base line to one side of and in a direction parallel to an expected elongate block;

(b) means for selecting as a first current zone a zone having a near edge which is closest to the base line, in a measurement direction perpendicular to the base line;

(c) means for defining a near edge of the block as a line co-incident with the said near edge;

(d) means for determining whether a further zone exists which extends further from the base line and which partly overlaps the current zone;

The invention may be carried into practice in a number of ways and one specific embodiment will now be described, by way of example, with reference to the accompanying drawings, in which:

Figure 1 is an exemplary page from a magazine, after completion of zone identification and OCR;

Figure 2 illustrates schematically the stages of analysis that may be applied to a page such as that shown in Figure 1; and

Figure 3 illustrates the way in which an individual column may be identified according to an embodiment of the present invention.

The method by which an individual column on the page may be identified will now be described with reference to Figure 3. Here, we see a variety of partially overlapping content-bearing zones 310, 320, 330, 340, 350, positioned near a left-hand edge 300 of the page.

The method proceeds as follows. First, a vertical baseline is chosen which lies to the left of the various zones under consideration. In this example, the page edge 300 will be taken as the baseline. Next, we look to the right of the baseline until we reach the left-hand edge of the zone which lies closest to the baseline. In this case, we find zone 340. The left-hand edge of that zone is then defined as the left-hand edge 400 of the potential column.

Next, taking the zone 340 as the current zone, we define a current slider position at the right-hand edge 344 of that zone.

Then, we look for a further zone which partially overlaps the current zone 340 and which has a far edge to the right of the existing slider position 344. Or, to put it another way, we look for a further zone which has a near edge to the left of the current slider position and a right edge to the right of that position. Here, there are four possible candidates, namely zones 310, 320, 330, 350.

In the present embodiment, any one of those zones may be chosen, at random. For the sake of example, let us assume that 350 is chosen. This zone becomes the current zone, the slider position is moved to the right-hand edge 354 of the new current zone, and the process is repeated. This time, there are three zones available for consideration, 310, 320 and 330. If 330 is chosen at random, that then becomes the current zone and the slider is moved to its right-hand edge 334.

Finally, there are two remaining possibilities available for selection, 310 and 320. Let us assume that 320 is chosen. That zone then becomes the current zone and the slider is moved to the right-hand edge of 320.

When the process is repeated, we find that there are no further possibilities available for selection since although the zone 310 does overlap the zone 320, it does not extend any further to the right. When there are no further candidates, the right-hand edge of the current zone 320 is taken to represent the right-hand edge of the possible column. Thus, in the example of Figure 3, we have identified a possible column of zones having a left-hand edge 400 and a right-hand edge 500. Typically, in a multicolumn document, the edge 500 may represent the left-hand edge of the narrow column separator separating this column from an adjacent column. In order to identify any adjacent column, the procedure described above is repeated, but this time taking the vertical line 500 as being the initial baseline.
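The walkthrough above can be replayed in code. The sketch below is illustrative only: the source gives no coordinates for Figure 3, so the `(left, right)` values in `ZONES` are hypothetical, chosen so that the candidate lists match the description (four candidates after zone 340, and zone 310 overlapping zone 320 without extending beyond it). The function name `trace_column` is likewise an assumption.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical (left, right) x-coordinates for the zones of Figure 3; the
# source gives no coordinates, so these only reproduce the walkthrough.
ZONES: Dict[int, Tuple[int, int]] = {
    310: (30, 85),
    320: (15, 90),
    330: (25, 70),
    340: (10, 40),
    350: (20, 55),
}


def trace_column(zones: Dict[int, Tuple[int, int]], rng: random.Random):
    """Return (left_edge, right_edge, visited zone labels) for the first column."""
    current = min(zones, key=lambda k: zones[k][0])  # zone nearest the baseline
    left_edge = zones[current][0]                    # plays the role of edge 400
    slider = zones[current][1]                       # right-hand edge of that zone
    visited: List[int] = [current]
    while True:
        # Candidates start to the left of the slider and end to the right of it.
        candidates = sorted(k for k, (x1, x2) in zones.items() if x1 < slider < x2)
        if not candidates:
            return left_edge, slider, visited        # slider is now edge 500
        current = rng.choice(candidates)             # arbitrary (random) pick
        slider = zones[current][1]
        visited.append(current)


for seed in range(3):
    print(trace_column(ZONES, random.Random(seed)))
```

The order of visited zones differs from seed to seed, but with these coordinates every run ends with the same pair of edges, 10 and 90. The slider only ever stops at a position that no zone straddles, and it cannot jump over such a position, so the random choice affects the path taken rather than the result.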

Depending upon the application, it is not essential that the further zone is chosen at random from the available partially overlapping zones, as described above. The preferred further zone for selection may be determined or restricted by some algorithm, for example based on the proximity of the possible further zone to the current zone, the zone size, the font size, the overall zone width and so on. A simple alternative is to select as the further zone that zone, out of all the available possibilities, which extends furthest to the right.
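The "furthest to the right" alternative can be sketched as follows. This is a minimal illustration under assumed names (`far_edge_greedy`) and an assumed `(left, right)` zone representation; the coordinates are hypothetical. It also counts the jumps made, to show that this selection rule can reach the far edge in fewer iterations than an arbitrary choice.

```python
from typing import List, Tuple

Zone = Tuple[float, float]  # (left, right) x-coordinates, an assumed representation


def far_edge_greedy(zones: List[Zone], slider: float) -> Tuple[float, int]:
    """Advance the slider, always jumping to the candidate reaching furthest right.

    Returns the final slider position and the number of jumps made.
    """
    jumps = 0
    while True:
        # Candidates partially overlap the slider position and extend beyond it.
        candidates = [z for z in zones if z[0] < slider < z[1]]
        if not candidates:
            return slider, jumps
        slider = max(z[1] for z in candidates)  # furthest-right candidate
        jumps += 1


zones = [(30, 85), (15, 90), (25, 70), (10, 40), (20, 55)]
print(far_edge_greedy(zones, slider=40))  # → (90, 1)
```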

One specific algorithm for finding the edges of column separators (that is, the vertical lines, 400, 500) is as follows:

1. Exclude zones that are semantically irrelevant. This assumes that some pre-analysis has been performed. For example, excluding headers, margins, page numbers, etc.

2. Set counters columnCount = 0 and slider = 0

3. $\forall$ zones, find zone $j$ such that $|x_{j1} - \mathrm{slider}| < |x_{i1} - \mathrm{slider}|\ \ \forall\, i \neq j$

4. If $j$ == null, go to 11

5. slider = $x_{j2}$

6. Find zone $j$ such that $x_{j1} <$ slider and $x_{j2} >$ slider

7. If $j \neq$ null, go to 5

8. Check validity of slider. If slider is not valid go to 5

9. Set columnSeparator[columnCount] = slider. columnCount = columnCount + 1

10. Go to 3

11. If columnCount == 0, columnSeparator[columnCount] = pageWidth, i.e. the page is just one column.
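The numbered algorithm above might be rendered in Python roughly as follows. This is a sketch under stated assumptions, not a definitive implementation. Zones are assumed to be `(x1, x2)` pairs; step 3 is restricted to zones starting at or beyond the slider so that the scan always moves rightwards; and an invalid slider at step 8 is simply not recorded before the scan continues, which is one possible reading of that step. The names `find_column_separators` and `is_valid` are hypothetical, and step 1 (excluding irrelevant zones) is assumed to have been done by the caller.

```python
from typing import Callable, List, Tuple

Zone = Tuple[float, float]  # (x1, x2): left and right edges of a zone


def find_column_separators(
    zones: List[Zone],
    page_width: float,
    is_valid: Callable[[float], bool] = lambda slider: True,
) -> List[float]:
    # Assumption: ignore degenerate zones so the slider always advances.
    zones = [z for z in zones if z[1] > z[0]]
    separators: List[float] = []   # step 2: columnCount is len(separators)
    slider = 0.0
    while True:
        # Step 3: zone whose left edge is nearest the slider. Assumption: only
        # zones starting at or beyond the slider are eligible.
        ahead = [z for z in zones if z[0] >= slider]
        if not ahead:              # step 4: no zone left, finish
            break
        slider = min(ahead, key=lambda z: z[0] - slider)[1]  # step 5
        # Steps 6-7: keep jumping while some zone straddles the slider.
        while True:
            straddling = [z for z in zones if z[0] < slider < z[1]]
            if not straddling:
                break
            slider = straddling[0][1]
        # Steps 8-9: record the separator only if it passes the validity check
        # (assumed handling of an invalid slider: skip it and carry on).
        if is_valid(slider):
            separators.append(slider)
        # Step 10: back to step 3.
    if not separators:             # step 11: the page is just one column
        separators.append(page_width)
    return separators


zones = [(10, 100), (12, 98), (110, 200), (115, 198)]
print(find_column_separators(zones, page_width=210))  # → [100, 200]
```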

As will be evident from step 1 of the above algorithm, it is desirable where possible to carry out at least some pre-processing in order to exclude zones which self evidently do not form part of any possible column structure. In Figure 1, for example, the main heading 12 extends across the entire width of the page and may thus be excluded from any attempt to identify multiple columns.

Once the column edges have been tentatively identified as described above, further checks may be carried out to exclude column edges which might have been identified as a result of a coincidence of page layout. For example, text alignment checks may be made since text will normally be left-justified, and will often be right-justified, too, within a column. The spacing between adjacent columns may be measured as a function of distance down the page, since spacers of a constant width are more likely to represent column spacers than spacers of variable width. It may also be convenient to define a minimum permitted spacer width between adjacent columns, since extremely small gaps can sometimes result from accidental vertical coincidences of word spacing, and may not represent true column separators at all. To further avoid false identification of columns, when the variable slider in the algorithm set out above indicates a column separator, the following additional checks may be made at step 8 to verify its validity:

1. Alignment check. The percentage of the slider length that is covered by zones terminating close to the slider.

2. The number of zones terminating close to the slider.

3. Spacing. The distance between the slider and the beginning of the following column. Too narrow a separation may be a false alarm since column separators tend to be significant in width.
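The three checks listed above could be computed along the following lines. Everything specific here is an assumption made for illustration: the `Zone` tuple layout, the tolerance `tol` used to decide that a zone terminates "close to" the slider, the caller-supplied `slider_length`, and the threshold values in `is_valid_slider`. The source does not specify any of them.

```python
from typing import List, NamedTuple, Tuple


class Zone(NamedTuple):
    x1: float  # left
    y1: float  # top
    x2: float  # right
    y2: float  # bottom


def slider_metrics(zones: List[Zone], slider: float, slider_length: float,
                   tol: float = 2.0) -> Tuple[float, int, float]:
    """Return (alignment coverage, zones ending near slider, gap to next column)."""
    ending = [z for z in zones if abs(z.x2 - slider) <= tol]
    # Check 1: fraction of the slider length covered by zones ending near it.
    # Vertical extents are merged so overlapping zones are not counted twice.
    covered, reach = 0.0, float("-inf")
    for z in sorted(ending, key=lambda z: z.y1):
        top = max(z.y1, reach)
        if z.y2 > top:
            covered += z.y2 - top
            reach = z.y2
    # Check 3: distance from the slider to the start of the following column.
    starts = [z.x1 for z in zones if z.x1 >= slider]
    gap = min(starts) - slider if starts else float("inf")
    # Check 2 is simply the number of zones ending near the slider.
    return covered / slider_length, len(ending), gap


def is_valid_slider(metrics: Tuple[float, int, float], min_coverage: float = 0.5,
                    min_zones: int = 2, min_gap: float = 5.0) -> bool:
    """Apply hypothetical thresholds to the three metrics."""
    coverage, count, gap = metrics
    return coverage >= min_coverage and count >= min_zones and gap >= min_gap


zones = [Zone(10, 0, 100, 40), Zone(10, 50, 99, 100), Zone(110, 0, 200, 100)]
metrics = slider_metrics(zones, slider=100, slider_length=100)
print(metrics, is_valid_slider(metrics))  # → (0.9, 2, 10) True
```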

It will be understood, of course, that the embodiment described above may equally well be used to identify rows of zones running across the page.

Instead of the baseline being vertical and to the left of the zones to be analysed, the baseline in such an arrangement would be horizontal and normally above the relevant zones. The slider then moves down the page rather than across it.
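Because only the measurement direction changes, a single routine can serve for both columns and rows. The sketch below assumes zones are bounding boxes `(x1, y1, x2, y2)` with y increasing down the page, and uses a hypothetical `axis` argument to pick which pair of coordinates acts as the near and far edges. The function name and the coordinates are illustrative, and the furthest-reaching candidate is chosen at each step.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), y increasing downwards


def block_edges(boxes: List[Box], axis: str) -> Tuple[float, float]:
    """Near and far edges of the first block; axis "x" finds a column, "y" a row."""
    lo, hi = (0, 2) if axis == "x" else (1, 3)
    current = min(boxes, key=lambda b: b[lo])   # zone nearest the baseline
    near, slider = current[lo], current[hi]
    while True:
        # Zones that straddle the slider along the chosen axis.
        nxt = [b for b in boxes if b[lo] < slider < b[hi]]
        if not nxt:
            return near, slider
        slider = max(b[hi] for b in nxt)


boxes = [(10, 20, 90, 60), (100, 30, 190, 80), (10, 120, 190, 160)]
print(block_edges(boxes, "y"))  # → (20, 80)
print(block_edges(boxes, "x"))  # → (10, 190)
```

In the row direction the first two boxes form one row ending at 80. In the column direction the third box spans the full width and bridges both columns, which illustrates why page-wide zones are best excluded before the column search.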

Once all the columns and/or rows have been identified, that information may be fed into clustering algorithms to assist in clustering the zones into semantically meaningful groups. Typically, zones will be clustered within each column and within each row. Mathematically, the clustering procedure may be defined as follows:

1. Set index = 0

2. $\forall$ zone $j$ such that zone $j \notin \mathrm{cluster}[i]\ \ \forall\, i < \mathrm{index}$ and $x_{j2} < \mathrm{columnSeparator}[\mathrm{index}]$, set $\mathrm{cluster}[\mathrm{index}] = \mathrm{cluster}[\mathrm{index}] \cup \{\text{zone } j\}$

3. If index == columnCount, END

4. Set index++, go to 2

Once the columns and/or rows are identified, additional statistics may be obtained for each column such as the total area of the column, text area, graphics area, empty area, number of fonts used, average font size, zone density, average zone area, font histogram, etc. These statistics are then used in the logical and semantic analysis phase.
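As an illustration of the clustering procedure and of one of the per-column statistics mentioned above (average zone area), consider the sketch below. The function names and the `(x1, y1, x2, y2)` zone layout are assumptions. One deliberate deviation is flagged in the code: the comparison against the separator is non-strict (`<=`), because a separator coincides with the right-hand edge of some zone and a strict comparison would push that zone into the next cluster.

```python
from typing import List, Set, Tuple

Zone = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


def cluster_by_column(zones: List[Zone], separators: List[float]) -> List[List[Zone]]:
    """Assign each zone to the first column whose separator it does not cross."""
    clusters: List[List[Zone]] = [[] for _ in separators]
    assigned: Set[int] = set()
    for index, separator in enumerate(separators):
        for j, zone in enumerate(zones):
            # Assumption: "<=" rather than "<", so a zone ending exactly on the
            # separator stays in this column.
            if j not in assigned and zone[2] <= separator:
                clusters[index].append(zone)
                assigned.add(j)
    return clusters


def average_zone_area(cluster: List[Zone]) -> float:
    """One example statistic: mean bounding-box area of the zones in a column."""
    if not cluster:
        return 0.0
    return sum((x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in cluster) / len(cluster)


zones = [(10, 0, 100, 40), (110, 0, 200, 100), (12, 50, 98, 100)]
clusters = cluster_by_column(zones, [100, 200])
print(clusters)
print([average_zone_area(c) for c in clusters])  # → [3950.0, 9000.0]
```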

The extracted statistics may be used in a variety of different ways to complete the logical analysis of the page under consideration. An exemplary approach may be found in US application 11/189930, filed 27 July 2005.

It will be understood that any or all of the method described above may be implemented using a suitably-programmed general-purpose computer, or alternatively may be implemented in purpose-designed hardware.

Claims

1. A method of identifying opposing near and far edges of an elongate block of content-bearing zones within an electronic document divided into a plurality of said zones, the method comprising:

(a) identifying a base line to one side of and in a direction parallel to an expected elongate block;

(b) selecting as a first current zone a zone having a near edge which is closest to the base line, in a measurement direction perpendicular to the base line;

(c) defining a near edge of the block as a line co-incident with the said near edge;

(d) determining whether a further zone exists which extends further from the base line and which partly overlaps the current zone; and

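(i) if so, taking the further zone as a new current zone and repeating (d); and

(ii) if not, defining a far edge of the block as a line coincident with a far edge of the current zone.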

2. A method as claimed in claim 1 in which the elongate block is a column and in which the near and far edges are left and right vertical column edges.

3. A method as claimed in claim 1 in which the elongate block is a row and in which the near and far edges are upper and lower horizontal row edges.

4. A method as claimed in any one of the preceding claims including the initial step of excluding from consideration zones which are expected to be semantically irrelevant.

5. A method as claimed in any one of the preceding claims including a verification step of accepting the far edge of the block as valid only if a sufficiently high proportion of the overall block length is covered by zones which terminate at or near the far edge of the block.

6. A method as claimed in any one of claims 1 to 4 including a verification step of accepting the far edge of the block as valid only if a sufficiently high number or proportion of zones terminate at or near the far edge of the block.

7. A method as claimed in any one of claims 1 to 4 including a verification step of accepting the far edge of the block as valid only if the spacing between the said far edge of the block and an edge of an adjacent block is sufficiently large.

8. A method as claimed in any one of claims 1 to 4 including a verification step of accepting the far edge of the block as valid only if the closest spacing between the said far edge of the block and an edge of an adjacent block does not vary substantially along the edge of the block.

9. A computer program comprising a series of instructions which, when executed on a computer, carry out a method as claimed in any one of claims 1 to 8.

10. A computer-readable data carrier carrying a computer program as claimed in claim 9.

11. A computer system for identifying opposing near and far edges of an elongate block of content-bearing zones within an electronic document divided into a plurality of said zones, the system including:

(a) means for identifying a base line to one side of and in a direction parallel to an expected elongate block;

(b) means for selecting as a first current zone a zone having a near edge which is closest to the base line, in a measurement direction perpendicular to the base line;

12. A computer system as claimed in claim 11 including means for excluding from consideration zones which are expected to be semantically irrelevant.

13. A computer system as claimed in claim 11 or claim 12 including means for verifying the far edge of the block as valid.