Enhance your existing AI with advanced text processing
The 2OS text processing bundle lets you improve your existing 2OS AI applications with five modules: Table detection, Document structure parser, Layout segmentation, TOC extraction, and Lexical field generation. You can also use each module separately.
Table detection
2OS Table detection finds structured data in documents.
It uses state-of-the-art computer vision architectures to automatically detect tables in the documents you provide. No training is needed: we provide a trained model that is ready to use on any type of document. If you need a tailored model for your specific application, you can train your own table detection model directly in a 2OS application, even with no machine learning knowledge.
Use case 1
Table detection is already used within 2OS table extraction as a preprocessing step to find tables. Following the process below, table extraction then detects the structure and cells of each table and extracts the content:
Figure 1. Use case 1 of Table Detection: Table Extraction Pipeline
Use case 2
ML practitioners such as data scientists and ML engineers can use table detection as a pre-annotation tool. The idea is to reduce annotation cost and speed up the process of improving your existing system.
How it works
Figure 2. Table detection pipeline
The module returns the detected tables in JSON format, with the coordinates of each table in the document.
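As a rough illustration of how such an output could be consumed, here is a minimal Python sketch. The field names (`tables`, `page`, `bbox`, `score`) and the confidence score itself are assumptions made for this example; they are not the documented 2OS schema, so adapt them to the actual output.

```python
import json

# Hypothetical output: these field names are assumptions for illustration,
# not the documented 2OS schema.
SAMPLE_OUTPUT = """
{
  "document": "report.pdf",
  "tables": [
    {"page": 1, "bbox": [50, 120, 540, 380], "score": 0.97},
    {"page": 3, "bbox": [60, 200, 500, 420], "score": 0.42}
  ]
}
"""


def confident_tables(raw_json, min_score=0.5):
    """Return (page, bbox) pairs for detections at or above min_score."""
    payload = json.loads(raw_json)
    return [
        (table["page"], tuple(table["bbox"]))
        for table in payload["tables"]
        if table["score"] >= min_score
    ]


def bbox_area(bbox):
    """Area of an (x_min, y_min, x_max, y_max) box."""
    x_min, y_min, x_max, y_max = bbox
    return (x_max - x_min) * (y_max - y_min)


tables = confident_tables(SAMPLE_OUTPUT)
for page, bbox in tables:
    print(f"page {page}: bbox={bbox}, area={bbox_area(bbox)}")
```

Filtering by score is one way to use the detections for pre-annotation: only confident boxes are proposed to annotators, who then correct or complete them.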
Document structure parser
2OS Document structure parser reads a document and structures it in a comprehensive format. It detects titles, paragraphs, and sentences, and provides a machine-readable output containing this information. It relies on features such as character size, font style, and color to identify titles and delimit paragraphs.
It is called a hybrid system because it combines a rule-based approach with image recognition methods to find lists and discard unwanted images.
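To make the rule-based half of this idea concrete, here is a minimal Python sketch of a font-size heuristic for telling titles from paragraphs. It is not the 2OS implementation: the `Line` type, the `ratio` threshold, and the eight-word limit for bold lines are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Line:
    """A text line with layout features (hypothetical input type)."""
    text: str
    font_size: float
    bold: bool = False


def label_lines(lines, ratio=1.2):
    """Label each line 'title' or 'paragraph' with simple layout rules."""
    # The median font size is taken as the body text size.
    body_size = median(line.font_size for line in lines)
    labels = []
    for line in lines:
        is_large = line.font_size >= ratio * body_size
        is_short_bold = line.bold and len(line.text.split()) <= 8
        labels.append("title" if is_large or is_short_bold else "paragraph")
    return labels


page = [
    Line("Annual Report", 20.0, bold=True),
    Line("The fund invests mainly in European equities.", 10.0),
    Line("Risk factors", 10.0, bold=True),
    Line("Investors may lose part of their capital.", 10.0),
]
labels = label_lines(page)
for line, label in zip(page, labels):
    print(f"{label:9} | {line.text}")
```

Rules like these are cheap and easy to inspect, which is one reason to pair them with image-based methods rather than replace them.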
2OS Document structure parser is the first step in the 2OS DocReader module. Before extracting any type of information, we need to structure the document in a comprehensive way. This makes the work of the downstream algorithms easier.
How it works
Figure 3. Use case of Document Structure Parser: Documents preprocessing
Layout segmentation
2OS Layout segmentation splits a document into sections based on the business requirements. It uses various heuristics as well as the 2OS Document parser module to structure the document and identify the relevant sections. 2OS Layout segmentation can serve as a preprocessing step before 2OS DocReader, for example, to help isolate the relevant information by discarding irrelevant pages of the document.
Investment fund prospectuses
For fund prospectuses, especially at large institutions such as BlackRock and Vanguard, a single document can contain more than just the fund prospectus. It often comes along with the “key investor information document” and the “settlements”. Using 2OS Layout segmentation, you can focus on the prospectus part by discarding the irrelevant sections.
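The prospectus use case can be sketched with a simple keyword heuristic in Python. This is only an illustration of page-level segmentation, not the heuristics 2OS actually uses; the marker strings, the section names, and the sample pages are assumptions.

```python
def segment_pages(pages, section_markers):
    """Assign each page to the most recently seen section marker."""
    sections = {}
    current = None
    for number, text in enumerate(pages, start=1):
        lowered = text.lower()
        # A page that contains a marker starts (or continues) that section.
        for name, marker in section_markers.items():
            if marker in lowered:
                current = name
                break
        if current is not None:
            sections.setdefault(current, []).append(number)
    return sections


# Hypothetical markers; a real system would need more robust rules.
MARKERS = {
    "kiid": "key investor information",
    "prospectus": "prospectus",
    "settlements": "settlement",
}

pages = [
    "Key Investor Information Document for Fund A",
    "Charges and past performance",
    "Prospectus: investment objectives",
    "Prospectus (continued): risk factors",
    "Settlement instructions",
]
sections = segment_pages(pages, MARKERS)
prospectus_pages = sections.get("prospectus", [])
print(prospectus_pages)
```

Downstream extraction can then run on `prospectus_pages` only and skip the rest of the file.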
TOC extraction
2OS TOC (Table-Of-Contents) extraction detects the headers in a document and classifies them hierarchically to build navigation bookmarks for the document. The pipeline consists of two steps: a first model detects the document’s titles, and a second one orders them into a TOC tree.
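The tree structure produced by the second step can be illustrated with a short Python sketch. It assumes each detected title already carries a hierarchy level (1 for top-level headers, 2 for their children, and so on); how 2OS predicts those levels is not covered here, and the function names are hypothetical.

```python
def build_toc(headers):
    """Nest (level, title) pairs, given in reading order, into a TOC tree."""
    root = {"title": None, "level": 0, "children": []}
    stack = [root]  # path from the root to the most recent header
    for level, title in headers:
        node = {"title": title, "level": level, "children": []}
        # Climb back up until the top of the stack is a valid parent.
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]


def render(nodes, depth=0):
    """Flatten the tree into indented lines, one per header."""
    lines = []
    for node in nodes:
        lines.append("  " * depth + node["title"])
        lines.extend(render(node["children"], depth + 1))
    return lines


headers = [
    (1, "Introduction"),
    (2, "Scope"),
    (2, "Method"),
    (1, "Results"),
    (2, "Emissions"),
]
toc = build_toc(headers)
print("\n".join(render(toc)))
```

Each node keeps its children, so the same structure can drive navigation bookmarks or link a header to the paragraphs that follow it.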
Table-Of-Contents generation for sustainability reports
When processing and analyzing documents, being able to generate a TOC can be highly beneficial if the document does not have one. It helps you navigate through the document, and it also lets you link headers with paragraphs of text, for instance to perform relation extraction tasks.
How it works
Figure 4. Use case of TOC Extraction: Documents preprocessing
Lexical field generation
2OS Lexical field generation extends your existing glossary using word embeddings. It uses a vectorized representation of words to find, in a set of documents, the words most similar to yours. Because it extends the user’s glossary before the classification task is performed, 2OS Lexical field generation works very well with the 2OS sentence classification module, for example. It can also be used separately to boost downstream tasks.
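The general technique, glossary expansion by embedding similarity, can be sketched in Python as follows. The three-dimensional vectors are toy values invented for the example, and the `threshold` parameter is an assumption; real embeddings would be learned from your document set and have many more dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def extend_glossary(glossary, embeddings, threshold=0.9):
    """Return new words close to any glossary term, most similar first."""
    scored = []
    for word, vector in embeddings.items():
        if word in glossary:
            continue
        # Best similarity to any glossary term that has an embedding.
        best = max(
            (cosine(vector, embeddings[t]) for t in glossary if t in embeddings),
            default=0.0,
        )
        if best >= threshold:
            scored.append((best, word))
    return [word for _, word in sorted(scored, reverse=True)]


# Toy embeddings, invented for illustration only.
EMBEDDINGS = {
    "climate": [0.9, 0.1, 0.0],
    "emissions": [0.8, 0.2, 0.1],
    "carbon": [0.85, 0.15, 0.05],
    "dividend": [0.1, 0.9, 0.2],
    "coupon": [0.0, 0.8, 0.3],
}

new_terms = extend_glossary({"climate"}, EMBEDDINGS)
print(new_terms)
```

The threshold trades recall for precision: lowering it adds more candidate terms, at the cost of more unrelated words reaching the downstream classifier.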
Climate data in investment
One of the key selling points in investment right now is the environmental and climate impact of businesses. Thus, it is critical to detect such information in financial documents and reports.
2OS Lexical field generation helped extend the keywords used for this search and thus helped recover more information than the original keywords alone.
An example of extension is presented below: