ICDAR Competition 2011: Establishing the physical model of a table: <p>table location and/or segmentation</p>

Printable explanation + Appendices     Training data for the Location Sub-Competition     Training data for the Segmentation Sub-Competition     Original text files

 

 

 

 

 

 

Table competition at ICDAR 2011

Establishing the physical model of a table: table location and/or segmentation

 

 

Motivation: Tables are a prominent element of communication in documents, often containing information that would take many a paragraph to write otherwise. The first step to table understanding is to draw the tables physical model, i.e. identify its location and component cells, rows ad columns. Several authors have dedicated themselves to these tasks, using diverse methods, however it is difficult to know which methods work best under which circumstance because of the diverse testing conditions used by each. This competition aims at addressing this lacuna in our field.

 

Tasks: This competition will involve two independent sub-competitions. Authors may choose to compete for one task or the other or both.
  1. Table location sub-competition:

    This task consists of identifying which lines in the document belong to one same table area or not;

     
  2. Table segmentation sub-competition:

    This task consists of identifying which column the cells of each table belong to, i.e. identifying which cells belong to one same column. Each cell should be attributed a start and end column index (which will be different from each other for spanning cells). Identifying row spanning cells is not relevant for this competition.

 

Description of the datasets: We have gathered 22 PDF financial statements. Our documents have lengths varying between 13 and 235 pages with very diverse page layouts, for example, pages can be organised in one or two columns and page headers and footers are included; each document contains between 3 and 162 tables. In Appendix A, we present some examples of pages in our dataset with tables that we consider hard to locate or segment. We randomly chose 19 documents for training and 3 for validation; our tougher cases turned out to be in the training set.

We then converted all files to ASCII using the pdttotext linux utility2 (2Red Hat Linux 7.2 (Enigma), October 22, 2001, Linux 2.4.7-10, pdftotext version 0.92., copyright 1996-2000 Derek B. Noonburg.). As a result of the conversion, each line of each document became a line of ASCII, which when imported into a database becomes a record in a relational table. Apart from this, we collected an extra 19 PDF financial statements to form the test set; these were converted into ASCII using the same tool as the training set.

Table 1 underneath shows the resulting dimensions of the datasets and how they compare to those used by other authors (Wang et al. (2002)’s tables were automatically generated and Pinto et al. (2003)’s belong to the same government statistics website). The sizes of the datasets in other papers are not distant from ours. An exception would be Cafarella et al. (2008), who created the first large repository of HTML tables, with 154 million tables. These consist of non-marked up HTML tables detected using Wang and Hu (2002)’s algorithm, which is naturally subject to mistakes.

We have then manually created the ground-truth for this data, which involved: a) identifying which lines belong to tables and which do not; b) for each line, identifying how it should be clipped into cells; c) for each cell, identifying which table column it belongs to.

 

Table 1: Sizes of the datasets compared with those used by other table processing scientists.

 
Our datasets
GranularityTraining setValidation setTest setOther datasets
# lines86,6279,53786,20879,966Pinto et al. (2003)
# tables1,003120968*1,740Wang and Hu (2002)
482Wang et al. (2002)
50Hurst (2003)
# table lines17,6192,47815,99815,287Pinto et al. (2003)
# table cells61,2867,12856,065*

* Estimated (for now).

 

Description of the fields in the datasets: For each of the competitions, the training set with the following fields is made available:

  1. Table location sub-competition:

    A text file containing all documents in the training set will be made available, such that each line in the original document will be a line in the excel file. Each will be identified with a document identifier, a page number within that document and a line number within each page; for each line, the sequential ID of the table area it belongs to is supplied. The ASCII content of each line will naturally also be supplied.

     
    Field NameDescription
    KEYUniquely identifies each line in the dataset
    DOCIdentifies the document the line originates from
    PAGEIdentifies the page within that document the line originates from
    LINEIdentifies the line index within each page the line originates from
    NonEmptyLineCounts the number of non-empty lines that occurred from the start of the document
    LineContentIdentifies the string the cell is a part of, i.e. the content of the respective line
    TABLEIdentifies whether the line belongs to a table (1), its titles (2), or footnotes (3)
    TABLEIDIdentifies the table the element originates from
     

    Ground-truth’ing was not always straight forward, Hu et al. (2001). One of the difficulties, for example, is agreeing on a common table definition: some authors tend to consider that table titles and footnotes should be part of the table and others do not. In terms of the datasets for the location task, our ground-truth is specified in such a way that both options are allowed, as long as they are applied consistently. Table lines are marked with 1 under the variable "Table", while table titles get a 2 and table footers get a 3. Location performance will be measured as the maximum between including titles and footers or not. We recommend authors learn their methods using just table lines, i.e. those marked with 1. All remaining lines, including empty lines, regardless of their location, get a 0.

    The last two fields are what the competing authors should be able to fill in for an unseen dataset, of which they know only the content of the remainder fields. Under the field "Table", we expect to find only 1s.

    The training set is provided here. It is a tab separated text file of which the first line contains column headers. The file can be readily imported as a relational table.

  2. Table segmentation sub-competition:

    The input will include the string content of each cell in the table and its start and end position in terms of the number of characters (white space or not) that stand to its left from the edge of the page; for each cell, the sequential ID of the column it belongs to is supplied, as well as how many other cells appear to its left, together with the document, page and line identifier.

     
    Field NameDescription
    KEYUniquely identifies each cell in the dataset
    DOCIdentifies the document the cell originates from
    PAGEIdentifies the page within that document the cell originates from
    LINEIdentifies the line within each page the cell originates from
    TABLEIDIdentifies the table the element originates from
    SequenceNrCounts the number of cells to the left of the current cell + 1
    NonEmptyLineCounts the number of non-empty lines that occurred from the start of the document
    StartPositionCounts the number of characters to the left of the current cell + 1
    Mid_positionCounts the number of characters to the left of the mid-point of the current cell + 1
    End_PositionCounts the number of characters to the left of the last character on the current cell + 1
    CellContentIdentifies the characters that appear within the cell
    LineContentIdentifies the string the cell is a part of, i.e. the content of the respective line
    Start_ColIDIdentifies the start column of the cell
    End_ColIDIdentifies the end column of the cell

    The last two fields are what the competing authors should be able to fill in for an unseen dataset, of which they know only the content of the remainder fields.

    The training set is provided here. It is a tab separated text file of which the first line contains column headers. The file can be readily imported as a relational table.

 

Performance evaluation: We propose a detailed performance evaluation approach, which rests upon the following four axes.

* Binary completeness and purity

These metrics will be used for identifying which method is the winner, Silva (2010a). They are recognisably more adequate than precision and recall for the problems or division or aggregation that are dominant in document analysis research. In location (segmentation), completeness is defined as the number of completely identified tables (columns) in the dataset with respect to the total number of existing tables (columns). For example, in order to be completely identified, a table must contain all of its lines and a column must contain all of its cells. Purity is the proportion of purely detected elements (i.e. that contain no parts that actually belong in any other elements) with respect to the total number of detected elements; a pure element is one whose components belong only to one original element.

Specifically, in table location, a table will be considered complete and pure if the first and last non-empty line number of the table area are the same between the true table and the detected one.

Two algorithms that present either higher purity or higher completeness to each other are both considered efficient and a choice between them is always user specific. Different approaches can be used for tailoring alternative efficient algorithms to users’ needs. Some common ways of choosing a favourite algorithm would be to minimize total cost or to maximise each user’s CPF. In this competition, we shall look at a visually exciting alternative.

* Production Possibility Curve and Utility Curves

When we represent all algorithms that are available for a given task in a scatter plot that has CP in the axis, and we connect the efficient algorithms to each other, we obtain a line that represents the maximum combinations of CP attainable with the resources available. This is called a Production Possibility Curve (PPC). We shall be drawing PPCs from the results of the competing algorithms.

The degree by which one user prefers completeness to purity or vice-versa can also be represented in the chart using Utility Curves (UCs) (Varian, 1999). A utility curve joins together all the combinations of CP that please a user equally; UCs closer to the top right corner of the plot, i.e. farther from the origin, represent a higher level of utility for the user, leaving him more satisfied; the UCs of a user who is consistent in his preferences never cross.

In Figure 1, we draw a PPC from the results of five different algorithms. Only algorithms D and E are efficient, since the remainder represent simultaneously less CP than these two; therefore only D and E appear on the PPC. We also draw two sets of UCs representing the preferences of two different users, X and Y. User X likes purity and completeness equally, thus the slope of his utility curves is -1, while user Y considers purity to be 6 times less serious than completeness, thus his UCs’ slope is -1/6.

To determine each user’s favourite algorithm, we identify the point where the user’s UCs intersect the PPC farthest from the origin. We can clearly see that both users prefer method E. In our competitions, the winners for each task will be those who better please user X.

Figure 1: A user’s favourite algorithm lies on the farthest from the origin intersection between their Utility Curves and the PPC.

* Statistical tests

One algorithm may seem better than another when we compare their performance metrics together, however the difference might not be statistically significant, i.e. they might not derive from an actual difference in the goodness of the algorithms but instead might be due to chance from running the experiment on this particular data on not another (this is called sampling error, in statistics). One methods that does not present a statistically signicant improvement in either completeness or purity to another can be considered its equivalent.

If we identify that one of the non-winning algorithms is statistically equivalent to the winning in terms of both completeness and purity, we shall name it the runner up of the competition.

We shall be using some of the statistical tests suggested by Demsar (2006) for comparing several algorithms to one another, in particular we shall start with the Friedman test as corrected by Iman and Davenport (1980) to determine whether there exist significant differences among the overall results of the different algorithms; and then apply the Nemenyi test (Nemenyi, 1963) to compare any pairs of algorithms to each other.

* Specific evaluation

However relevant generic evaluation is and however distant a winning algorithm is from all others, we still believe it is incomplete as an evaluation approach. Different table authors have defended the idea that different input and algorithms are capable of capturing different aspects of different tables (Klein et al., 2001; Long, 2009), it being unlikely that one algorithm will do significantly better than all others on all types of tables. We shall be applying a systematic methodology developed in Silva (2010b) for determining which algorithms with which features and parameter settings perform better on which types of data. In fact, specific evaluation aims at discovering whether and in what conditions one algorithm performs better than the other on an identifiable subset of the reference universe.

Determining which algorithm works best for which parts of the data a posteriori is interesting in itself, but if we manage to determine a priori, with some degree of accuracy, which algorithm is more likely to function better or even equivalently for an unseen case, then we can determine a policy for deciding which algorithm to use under which circumstances and as such hopefully increase results or efficiency algorithm over applying the algorithms independently.

Specific evaluation roughly consists of, for each table, identifying which algorithm performs best. In order to do this, we shall use continuous completeness and purity as performance metrics (which meaning and calculation we shall explain in the next subsection).

Having identified which is the winner algorithm for each table, we apply standard data analysis techniques to determine, based on the training set and a large set of features, which we present in Appendix A, in which circumstances and with which probability each algorithm is best for an unseen table. In our case, we shall build a decision tree to decide when each competing algorithm is likely to be the better at treating a given table. We use decision trees because we are interested in actually reading and trying to make human sense of the rules, which might clarify the strengths and weaknesses of each algorithm.

By then applying this classifier to the test set and measuring the respective accuracy rate (in terms of precision and recall, which are adequate metrics for classification tasks), we shall know whether the sort of tables on which a given algorithm outperforms have some identifiable characteristic.

If so, we shall proceed to combine different algorithms using the policy suggested by the classifiers. If the combined result on the test set is better, in terms of C&P or in terms of total run time, than that of the individual algorithms, we have identified experts in part of the space of tables and found a way of combining them that makes the most of their expertise. We present more on the particulars of this methodology in Appendix B.

 

Submission format: The competitors are expected to supply the executables that allow, from an input file with similar configuration as the training set, producing an output file with the same fields as the training set, where the last two are filled in by the competitors. Results should be submitted by email to the competition organisers until mid-night of the submission deadline, GMT.

We also expect to receive between 1 and 5 pages explaining the methods used. We believe this is important because one of the main contributions a competition can provide is insights into which techniques work better and why. In this description should be included a description of the operating environment and programs that are required for a smooth running of the executables. We assure the authors that, if there are issues with the running of the codes, they shall be contacted for clarification.

In order to encourage submissions, we allow competitors to opt for their names not being disclosed if they are not among the winning algorithms.

 

Other resources made available:

Completeness and purity are simple concepts to grasp. However, depending on the data structure used, calculating them in practise is not always simple.

In order to contribute to harmonising the way these measurementes are made as well as to provide competitors with a clear view of how their results will be assessed, we created an Access database with the training data of both competitions. These data tables include, apart from the fields mentioned above, two more that correspond to a given team's best bet for the two competition's goals. For demonstrative purposes, we filled in the authors' bets fields with the same values as the ground-truth. A set of queries is there-in made available. In particular, Evaluation_Completeness_LOCATION", "Evaluation_Completeness_SEGMENTATION", "Evaluation_Purity_LOCATION" and "Evaluation_Purity_SEGMENTATION" calculate completeness and purity automatically for each of the tasks by comparing the ground-truth with the authors' bets. These calculations accomodate for spanning cells.

When authors replace the two last fields with their own bets, the only adaptation that must be made to the queries in in query "DetectedSpanningCells_1_OneByOne", where more union alls must be included until 8 is replaced by the value answered by the query "ExistingSpanningCells_0". This last sentence will seem less cryptic once you do it yourself. If in the future, you would like to use this database to calculate completess and purity for other data, please cite this competition.

An Access database with the training data, which calculates your performance metrics

 

Important dates:

February 26, 2011Training set is made available on the Competition Website
April 15, 2011Competition registration, which consists of expressing interest in competing, by email, to the competition organisers
May 13, 2011Validation set is made available on the Competition Website
May 15, 2011Submission of results by competitors, which should be executable files; if at all impossible, the test data will be given out to competitors, but results must be submitted within no more than one hour (negotiable)
June 15, 2011Submission of summary paper for ICDAR's proceedings, already including the identification of the competition's winner
September, 2011Test set is made available on the Competition Website
September, 2011Announcement of the results will be made during ICDAR’2011, the competition session

 

Competition organisers:

Ana Costa e Silva.

anycostasilva at hotmail dot com

LIADD-INESC, Porto, L.A.

Universidade do Porto

 

 

References:  

Cafarella M., Halevy A. Y., Zhang Y., Wang D. Z., Wu E., "Uncovering the Relational Web", 11th International Workshop on the Web and Databases, Canada, 2008.

Demsar, J., "Statistical Comparisons of Classifiers over Multiple Data Sets, Journal of Machine Learning Research, Vol.7, 2006, pp. 1-30.

Hurst M., A Constraint-based Approach to Table Structure Derivation, ICDAR Proceedings, UK, 2003., pp. 911-916.

Klein, B., Gokkus S., Kieninger T., Dengel A., "Three Approaches to ``Industrial'' Table Spotting}, ICDAR Proceedings, USA, 2001, pp. 513-517.

Long, V., "An RDF-Based Blackboard Architecture for Improving Table Analysis", ICDAR Proceedings, 2009, Spain, pp. 916-920.

Pinto D., McCallum A., Wei X., Croft W. B., "Table extraction using conditional random fields", SIGIR Proceedings, 2003, pp. 235-242, Canada.

Silva (2010a), "Metrics for evaluating performance in document analysis - application to tables", IJDAR Special Issue on Competitions, 2010.

Silva (2010b), "Parts that add up to a whole: a framework for the analysis of tables", PHD Thesis, University of Edinburgh, 2010, UK.

Wang, Y., Hu J., "A Machine Learning Based Approach for Table Detection on the Web", ICWWW Proceedings, 2002, Hawaii, pp. 242-250.

Wang et al. (2002), Wang, Y., Phillips I.T., Haralick R.M.}, "Table Detection via Probability Optimization", DAS Proceedings, 2002, USA, pp. 272-282.