DocReader Table Extraction
Unlock the information stored in your document’s tables
The Table extraction module is an information extraction system based on computer vision and machine learning. Its purpose is to detect tables, understand their structure, and enable the user to structure and store their content in a database.
This system takes your PDF documents as input and runs the extraction in two steps:
- It parses your documents' pages and detects the tables and their cells.
- Then, it extracts and structures the data locked in those tables.
The Table extraction system’s output is structured in a way that allows you to derive business value from it.
Our detection model uses the latest advances in computer vision and deep learning to detect tables and their cells in documents.
Customizable detection lets you set your own detection levels and define the detection scope in the document by specifying pages or page ranges.
Batch processing and parallelization allow you to parse even large documents in a few seconds.
Tabular data extraction enables you to disambiguate and structure the data stored in rows and columns of the tables.
Easy to use: submit your document to the app as a PDF file or through the API, and you are ready to extract the table data from it.
No code: once your app is set up in the 2OS AI Builder, everyone can use it. No need to be a data scientist or a machine learning practitioner to tune the system to your needs.
Any document: across industries, tables are used in documents as a way of presenting information. Our module is not limited to a particular business case.
Less time-consuming: thanks to our architecture, even documents up to 500 pages are parsed in a matter of seconds.
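The page scope mentioned above (specific pages or page ranges) can be illustrated with a small sketch. The scope syntax used here (for example "1-3,7") and the function name `parse_page_scope` are assumptions made for illustration; the actual DocReader settings may be expressed differently.

```python
def parse_page_scope(spec, page_count):
    """Expand a scope string such as "1-3,7" into sorted 1-based page numbers.

    The string format is a hypothetical convention, not the module's own syntax.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range such as "2-4" covers both endpoints.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


selected_pages = parse_page_scope("1-3,7", page_count=10)
```

Restricting detection to the selected pages means the remaining pages never need to be rendered or analyzed.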
How it works
Our Table Extraction pipeline is simple: we start by detecting tables and their cells, then we extract the data from the detected tables.
Tables and cells detection
We use a state-of-the-art detection algorithm trained on a dataset specifically curated by our team of annotators.
What is a detection algorithm?
In computer vision, a detection algorithm takes an image as input (in our case, a page of a document) and outputs bounding boxes around the objects it detects in that image. Let’s visualize an example. The figure below shows a page with a detected table (in green) and its detected cells (in blue).
Figure 1. Example of a table and cells detection
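To make the idea of detection output concrete, here is a minimal sketch of how bounding boxes for tables and cells could be represented, and how cells could be attached to the table that contains them. The `Box` class, its field names, and the coordinates are hypothetical and are not the module's actual output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    label: str  # "table" or "cell" (hypothetical labels)
    x0: float   # left
    y0: float   # top
    x1: float   # right
    y1: float   # bottom


def contains(outer, inner):
    """Return True if `inner` lies entirely within `outer`."""
    return (outer.x0 <= inner.x0 and outer.y0 <= inner.y0
            and inner.x1 <= outer.x1 and inner.y1 <= outer.y1)


def group_cells_by_table(boxes):
    """Map each table box to the list of cell boxes it contains."""
    tables = [b for b in boxes if b.label == "table"]
    cells = [b for b in boxes if b.label == "cell"]
    return {t: [c for c in cells if contains(t, c)] for t in tables}


# Made-up coordinates in page units.
detections = [
    Box("table", 50, 100, 550, 400),
    Box("cell", 60, 110, 300, 150),
    Box("cell", 300, 110, 540, 150),
    Box("cell", 600, 110, 700, 150),  # outside the table, so not grouped
]
grouped = group_cells_by_table(detections)
```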
What is a “trained” algorithm?
Through supervised learning, we can teach our algorithm to detect tables and cells. To do so, we feed our system labeled data: in our case, images of PDF pages with the true bounding boxes of tables and cells.
By comparing the true bounding boxes to the bounding boxes of the detected objects, we can quantify the error made by the algorithm and change its parameters accordingly.
By doing so on large datasets, we teach the system to correctly detect tables and cells.
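The text above does not say which measure is used to compare true and detected boxes. One standard measure in object detection is intersection over union (IoU), sketched below as an illustration only. Boxes are assumed to be `(x0, y0, x1, y1)` tuples.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap, clamped to zero when boxes are disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


true_box = (0, 0, 2, 2)
predicted_box = (1, 1, 3, 3)
score = iou(true_box, predicted_box)  # overlap 1, union 7
```

A score of 1 means the predicted box matches the true box exactly, and a score of 0 means they do not overlap at all.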
How are predictions made?
Once the algorithm is trained, you can send it new images and obtain predictions. The figure below shows how inference works:
Figure 2. Inference Process
Table data extraction
Once tables and cells are detected, the second block of our pipeline runs through the detected rows and columns of those tables and structures their content so that the user can easily extract business value from it.
The whole pipeline is shown in the figure below:
Figure 3. Table Extraction Pipeline
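The second block described above, turning detected cells into rows and columns, can be sketched in a simplified form. This is not the module's actual algorithm. It assumes each cell is a tuple `(x0, y0, x1, y1, text)` and groups cells into rows by their vertical centre. The coordinates are made up and `row_tolerance` is a hypothetical parameter.

```python
def cells_to_rows(cells, row_tolerance=5.0):
    """Group cells (x0, y0, x1, y1, text) into rows, each ordered left to right."""
    ordered = sorted(cells, key=lambda c: (c[1] + c[3]) / 2)
    rows = []
    current = []
    current_y = None
    for cell in ordered:
        centre_y = (cell[1] + cell[3]) / 2
        # Start a new row when the cell sits too far below the current row.
        if current and abs(centre_y - current_y) > row_tolerance:
            rows.append(current)
            current = []
        if not current:
            current_y = centre_y
        current.append(cell)
    if current:
        rows.append(current)
    # Within each row, order cells by their left edge and keep only the text.
    return [[c[4] for c in sorted(row, key=lambda c: c[0])] for row in rows]


cells = [
    (300, 10, 400, 30, "69,709,500"),
    (10, 10, 150, 30, "Total Gross generation (MWh)"),
    (150, 11, 300, 31, "Electricity"),
    (150, 40, 300, 60, "Heat"),
    (300, 40, 400, 60, "52,746,141"),
    (10, 40, 150, 60, "Total Gross generation (MWh)"),
]
structured_rows = cells_to_rows(cells)
```

Real tables with merged or misaligned cells would need a more robust grouping than a fixed tolerance.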
Extracting companies’ energy consumption data from sustainability reports
Much of the information published by large companies sits in PDF documents such as annual reports and sustainability reports. Tables are often the preferred way to present some of that data, such as financial figures.
For this use case, let’s say you want to extract energy-related data from the sustainability report of a company. How can you use our Table Extraction module and the AI-enabled applications built in the 2OS platform to carry out your task?
Step 1: Upload your document
Here, we will be interested in the page below, extracted from a sustainability report:
Figure 4. Example of a document with a table
Step 2: Table and cells detection
Let’s visualize the detection results.
Figure 5. Example of a table and cells detection
We can see that one table was detected (in green) alongside its cells (in blue).
Step 3: Data extraction
Our algorithm will “read” the table for you and present the results as follows:
| Total Gross generation (MWh) | Electricity | 69,709,500 |
| Total Gross generation (MWh) | Heat | 52,746,141 |
| Total Gross generation (MWh) | Steam | 29,081,748 |
| Total Gross generation (MWh) | Cooling | 0,00 |
| Generation that is consumed by the organization (MWh) | Electricity | 50,874,827 |
| Generation that is consumed by the organization (MWh) | Heat | 52,739,047 |
| Generation that is consumed by the organization (MWh) | Steam | 27,600,035 |
| Generation that is consumed by the organization (MWh) | Cooling | 0,00 |
| Gross generation from renewable sources (MWh) | Electricity | 22,125.92 |
| Gross generation from renewable sources (MWh) | Heat | 0,00 |
| Gross generation from renewable sources (MWh) | Steam | 0,00 |
| Gross generation from renewable sources (MWh) | Cooling | 0,00 |
| Generation from renewable sources that is consumed by the organization (MWh) | Electricity | 22,125.92 |
| Generation from renewable sources that is consumed by the organization (MWh) | Heat | 0,00 |
| Generation from renewable sources that is consumed by the organization (MWh) | Steam | 0,00 |
| Generation from renewable sources that is consumed by the organization (MWh) | Cooling | 0,00 |
Table 1. Data extraction result from the table
Data is now ready to be exported and used for your own project.
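As an illustration of the export step, the sketch below writes a few rows of Table 1 to CSV using Python's standard `csv` module. The column names (`indicator`, `energy_type`, `value`) and the function `rows_to_csv` are assumptions, since the source table has no header row. Values are kept as strings because the table mixes number formats.

```python
import csv
import io

# The first three rows of Table 1, copied as text.
rows = [
    ("Total Gross generation (MWh)", "Electricity", "69,709,500"),
    ("Total Gross generation (MWh)", "Heat", "52,746,141"),
    ("Total Gross generation (MWh)", "Steam", "29,081,748"),
]


def rows_to_csv(rows, header=("indicator", "energy_type", "value")):
    """Serialize extracted rows to CSV text with a (hypothetical) header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows)
```

The writer quotes any value that contains a comma, so figures such as 69,709,500 stay in a single field when the file is read back.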