Text Processing Algorithms

Overview

Enhance your existing AI with advanced text processing

The 2OS text processing bundle improves your existing 2OS AI applications with five modules: Table detection, Document structure parser, Layout segmentation, TOC extraction, and Lexical field generation. You can also use each module separately.

Algorithms

 

Table detection

2OS Table detection finds structured data in documents.

It uses state-of-the-art computer vision architectures to automatically detect tables in the provided documents. No training is needed: we provide a trained model that is ready to use on any type of document. If you need a model tailored to your specific application, you can train your own table detection model directly in a 2OS application, even without machine learning knowledge.

 

Use case 1
Table extraction

Table detection is already used within 2OS table extraction as a preprocessing step to locate tables. Following the process below, table extraction then detects the structure and cells of each table and extracts the content:

Figure 1. Use case 1 of Table Detection: Table Extraction Pipeline

 

Use case 2
Pre-annotation

ML practitioners such as data scientists and ML engineers can use table detection as a pre-annotation tool. The idea is to reduce annotation cost and speed up the process of improving your existing system.

 

How it works

Figure 2. Table detection pipeline


The module returns the detected tables in JSON format, which contains the coordinates of each table in the document.
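
As an illustration of how such a JSON output could be consumed, here is a minimal Python sketch. The schema is an assumption made for this example only: the field names `document`, `tables`, `page` and `bbox`, and the `[x0, y0, x1, y1]` coordinate order, are hypothetical and may differ from the actual 2OS output.

```python
import json

# Hypothetical output: field names and bbox order are assumptions,
# not the documented 2OS schema.
sample_output = """
{
  "document": "report.pdf",
  "tables": [
    {"page": 1, "bbox": [50.0, 120.0, 540.0, 380.0]},
    {"page": 3, "bbox": [72.0, 90.0, 500.0, 210.0]}
  ]
}
"""


def table_areas(raw):
    """Return a (page, area) pair for each detected table.

    Assumes each bbox is [x0, y0, x1, y1] in page units.
    """
    data = json.loads(raw)
    areas = []
    for table in data["tables"]:
        x0, y0, x1, y1 = table["bbox"]
        areas.append((table["page"], (x1 - x0) * (y1 - y0)))
    return areas


areas = table_areas(sample_output)
for page, area in areas:
    print(f"page {page}: table area {area}")
```

A downstream step could use these coordinates to crop the table region from the page before running extraction or annotation.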

 

Document structure parser

2OS Document structure parser reads a document and structures it in a comprehensive format. It detects titles, paragraphs, and sentences and provides a machine-readable output containing this information. 2OS Document structure parser is a hybrid system that uses information such as character size, font style, and color to identify titles and delimit paragraphs.

It is called hybrid because it combines a rule-based approach with image recognition methods to find lists and discard unwanted images.
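
To make the rule-based side more concrete, here is a minimal Python sketch of a font-based title heuristic. It is not the 2OS implementation: the `Line` class, the `label_lines` function, the 1.2 size ratio and the eight-word limit are all assumptions chosen for illustration, and the image recognition part is not shown.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Line:
    """A text line with the layout features used by the heuristic (hypothetical)."""
    text: str
    font_size: float
    bold: bool = False


def label_lines(lines, ratio=1.2, max_title_words=8):
    """Label each line as 'title' or 'paragraph' with simple font rules."""
    # Treat the most frequent font size as the body text size.
    body_size = Counter(line.font_size for line in lines).most_common(1)[0][0]
    labels = []
    for line in lines:
        larger_than_body = line.font_size >= ratio * body_size
        short_and_bold = line.bold and len(line.text.split()) <= max_title_words
        labels.append("title" if larger_than_body or short_and_bold else "paragraph")
    return labels


lines = [
    Line("Annual Report", 18.0, bold=True),
    Line("The fund invests mainly in European equities.", 10.0),
    Line("Risk factors", 10.0, bold=True),
    Line("Investors may lose part of their capital.", 10.0),
]
labels = label_lines(lines)
print(list(zip(labels, (line.text for line in lines))))
```

Rules like these are cheap and easy to inspect, which is one reason a hybrid design might pair them with a learned component for harder cases such as lists and images.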

Use case
DocReader

2OS Document structure parser is the first step in the 2OS DocReader module. Before any type of information can be extracted, the document needs to be structured in a comprehensive way, which makes the work of the downstream algorithms easier.

How it works

Figure 3. Use case of Document Structure Parser: Documents preprocessing

 

Layout segmentation

2OS Layout segmentation splits a document into sections based on the business requirements. It uses various heuristics as well as the 2OS Document structure parser module to structure the document and identify the relevant sections. 2OS Layout segmentation can serve as a preprocessing step, for example before 2OS DocReader, to help isolate the relevant information by discarding irrelevant pages of the document.
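
The idea of discarding irrelevant pages can be sketched in a few lines of Python. This is an illustrative heuristic only, not the 2OS algorithm: the function `keep_section_pages`, the keyword matching and the input format (one detected section title per page, or `None`) are assumptions.

```python
def keep_section_pages(page_titles, wanted_keyword):
    """Return the 1-based page numbers that belong to the wanted section.

    page_titles[i] is the section title detected on page i + 1, or None
    when the page continues the previous section (assumed input format).
    """
    kept = []
    current_section = None
    for page_number, title in enumerate(page_titles, start=1):
        if title is not None:
            current_section = title  # a new section starts on this page
        if current_section and wanted_keyword.lower() in current_section.lower():
            kept.append(page_number)
    return kept


page_titles = [
    "Key Investor Information Document",
    None,
    "Prospectus",
    None,
    None,
    "Settlements",
    None,
]
kept = keep_section_pages(page_titles, "prospectus")
print(kept)
```

Only the kept pages would then be passed to the downstream module.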

Use case
Investment fund prospectuses

For fund prospectuses, especially those from large institutions like BlackRock and Vanguard, a single document can contain more than just the fund prospectus. It is often bundled with the “key investor information document” and the “settlements”. Using 2OS Layout segmentation, you can focus on the prospectus part by discarding the irrelevant sections.

 

TOC extraction

2OS TOC extraction, or Table-Of-Contents extraction, detects the headers in a document and classifies them hierarchically to build navigation bookmarks for the document. The pipeline consists of two steps: a first model detects the document’s titles, and a second one orders them into a TOC tree.
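
The second step, ordering titles into a tree, can be illustrated with a small Python sketch. It assumes the hierarchy level of each header is already known, whereas in 2OS a model determines it; the functions `build_toc` and `render` and the sample headers are hypothetical.

```python
def build_toc(headers):
    """Nest (level, title) pairs, given in reading order, into a tree.

    Level 1 is the top level. A header is attached to the closest
    preceding header with a strictly smaller level.
    """
    root = {"title": "ROOT", "level": 0, "children": []}
    stack = [root]  # path from the root to the most recent header
    for level, title in headers:
        node = {"title": title, "level": level, "children": []}
        while stack[-1]["level"] >= level:
            stack.pop()  # climb back up to a suitable parent
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]


def render(nodes, depth=0):
    """Flatten the tree into indented lines, like navigation bookmarks."""
    lines = []
    for node in nodes:
        lines.append("  " * depth + node["title"])
        lines.extend(render(node["children"], depth + 1))
    return lines


headers = [
    (1, "Strategy"),
    (2, "Climate targets"),
    (2, "Governance"),
    (1, "Metrics"),
    (3, "Scope 1 emissions"),
]
toc = build_toc(headers)
outline = render(toc)
print("\n".join(outline))
```

Note that a header whose level skips a step, like the last one here, is simply attached to the nearest available parent.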

Use case
Table-Of-Contents generation for sustainability reports

When processing and analyzing documents, being able to generate a TOC can be highly beneficial if the document does not have one. It can help you navigate the document, and it also lets you link headers with paragraphs of text, for instance to perform relation extraction tasks.

How it works

Figure 4. Use case of TOC Extraction: Documents preprocessing

 

Lexical field generation

2OS Lexical field generation extends your existing glossary using word embeddings. It uses a vectorized representation of words to find, in a set of documents, the words most similar to yours. Because it extends the user’s glossary before the classification task is performed, 2OS Lexical field generation works very well with, for example, the 2OS sentence classification module. It can also be used separately to boost downstream tasks.
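
The principle of extending a glossary with similar words can be sketched with cosine similarity over word vectors. The example below is a toy illustration, not the 2OS method: the three-dimensional vectors are invented, and the function names and the 0.9 threshold are assumptions. Real embeddings would be learned from the document set and have many more dimensions.

```python
import math

# Toy embeddings invented for this example (not real model output).
embeddings = {
    "climate": [0.9, 0.1, 0.0],
    "emissions": [0.8, 0.2, 0.1],
    "carbon": [0.85, 0.15, 0.05],
    "dividend": [0.1, 0.9, 0.2],
    "maturity": [0.0, 0.8, 0.5],
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


def extend_glossary(glossary, embeddings, threshold=0.9):
    """Add every known word that is close enough to a glossary word."""
    seeds = [word for word in glossary if word in embeddings]
    extended = set(glossary)
    for word, vector in embeddings.items():
        if word in extended:
            continue
        if any(cosine(vector, embeddings[seed]) >= threshold for seed in seeds):
            extended.add(word)
    return sorted(extended)


extended = extend_glossary(["climate"], embeddings)
print(extended)
```

With these toy vectors, the seed word "climate" pulls in "carbon" and "emissions", while the finance terms stay out because their vectors point in a different direction.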

Use case
Climate data in investment

One of the key selling points in investment right now is the environmental and climate impact of businesses. Thus, it is critical to detect such information in financial documents and reports. 

2OS Lexical field generation helped extend the keywords used for this search and thus helped recover more information than the original keywords alone.

An example of extension is presented below:

Figure 5. Example of Lexical field extension using Word Embeddings
 
How it works

Figure 6. Example of Lexical field extension using Word Embeddings