AI Builder

DocReader NER

Overview

Get the most out of your digital documents.

A large amount of critical information is stored in semi-structured or unstructured form
in PDF documents. This makes the information hard to use without
extracting it manually, which can be tedious.
The DocReader module uses state-of-the-art natural language processing and
machine learning to extract information from documents automatically, so you can focus
on what matters to you.

Features

State of the art: The neural architecture uses the latest advances in deep learning research to be fast and data-efficient.

Customizability: Users can define their own set of labels and use them to train an extraction pipeline that automatically processes new documents.

Synergy: DocReader works with other 2OS modules, especially the 2OS Annotation Tool.

Benefits

Adaptability: DocReader can learn to extract any type of information expressed in natural language.

Ease of use: DocReader can be used by anyone, with no need to write code or to have technical knowledge of deep learning.

Extract unstructured data quickly and accurately, and reduce time-consuming search and extraction tasks for your business.



How it works

To extract structured information from PDF documents, we use supervised machine learning.
This means we need to build a dataset that is used to train the machine learning algorithm in a supervised manner.
The dataset must contain example input documents together with the expected output for each of them.
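To make the idea of an input and its expected output concrete, here is a minimal sketch of one training example. It is only an illustration: the class names, the label names (INVOICE_NUMBER, SUPPLIER, DATE) and the sample sentence are hypothetical, and the format DocReader actually stores may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Span:
    start: int  # index of the first annotated character
    end: int    # index one past the last annotated character
    label: str  # user-defined label (hypothetical names in this sketch)


@dataclass(frozen=True)
class Example:
    text: str                 # the input: text taken from a document
    spans: Tuple[Span, ...]   # the expected output: labeled spans


def make_span(text: str, phrase: str, label: str) -> Span:
    """Build a span for the first occurrence of `phrase` in `text`."""
    start = text.index(phrase)
    return Span(start, start + len(phrase), label)


def expected_output(example: Example) -> List[Tuple[str, str]]:
    """Return the (label, annotated text) pairs of an example."""
    return [(s.label, example.text[s.start:s.end]) for s in example.spans]


# Hypothetical sample sentence and labels, for illustration only.
text = "Invoice 2041 was issued by Acme Corp on 12 March 2021."
example = Example(
    text,
    (
        make_span(text, "2041", "INVOICE_NUMBER"),
        make_span(text, "Acme Corp", "SUPPLIER"),
        make_span(text, "12 March 2021", "DATE"),
    ),
)

print(expected_output(example))
```

Storing character offsets rather than only the annotated strings keeps each annotation tied to one exact position in the document, which matters when the same phrase appears more than once.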

 

 

The neural network model is trained on a set of input-output pairs and learns how to perform the task. The trained model can then be applied to new documents that were not seen during training to produce new predictions.
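The train-then-predict cycle described above can be sketched with a deliberately simple stand-in. The LookupTagger class below is hypothetical and is not the neural architecture used by DocReader: it only memorizes the labeled phrases it has seen, whereas a trained neural model is expected to generalize to phrases that never appeared in the training data. The sentences and labels are invented for illustration.

```python
import re
from typing import Dict, Iterable, List, Tuple

Annotation = Tuple[str, str]              # (label, annotated phrase)
TrainingPair = Tuple[str, List[Annotation]]  # (input text, expected output)


class LookupTagger:
    """Toy stand-in for a trained model: it memorizes labeled phrases."""

    def __init__(self) -> None:
        self.phrase_to_label: Dict[str, str] = {}

    def fit(self, pairs: Iterable[TrainingPair]) -> "LookupTagger":
        # "Training" here is just remembering each annotated phrase.
        for _text, annotations in pairs:
            for label, phrase in annotations:
                self.phrase_to_label[phrase.lower()] = label
        return self

    def predict(self, text: str) -> List[Annotation]:
        # Find every memorized phrase in the new text, in reading order.
        found = []
        for phrase, label in self.phrase_to_label.items():
            for match in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
                found.append((match.start(), label, match.group(0)))
        return [(label, phrase) for _, label, phrase in sorted(found)]


# Hypothetical input-output pairs used for training.
training_pairs = [
    ("Invoice issued by Acme Corp.", [("SUPPLIER", "Acme Corp")]),
    ("Payment received from Globex.", [("CUSTOMER", "Globex")]),
]

model = LookupTagger().fit(training_pairs)

# A new document that was not part of the training data.
predictions = model.predict("Globex ordered new parts from Acme Corp.")
print(predictions)
```

The interface is the point of the sketch: one step consumes input-output pairs, and a separate step produces predictions for documents that were never annotated.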

 

This process lets users build custom models that extract the information they need.

 

The training data

Data quality is often a determining factor in the success of an information extraction system.

This is why it is crucial to follow a precise process:
– Have an annotation pipeline well suited to your task.
– Follow an efficient annotation guideline.
An example of such a process, using the 2OS Annotation Tool as the annotation system, is described below:

 

Create a dataset

Upload the documents to annotate

Start annotating!
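Because data quality weighs so heavily on the result, it can help to run a few automatic sanity checks on the annotations before training. The sketch below is one possible check and is not a feature of the 2OS Annotation Tool: the function name, the (start, end, label) span format and the label names are all assumptions made for illustration.

```python
from typing import Iterable, List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end, label), end is exclusive


def find_annotation_problems(
    text: str, spans: Iterable[Span], allowed_labels: Set[str]
) -> List[str]:
    """Report unknown labels, out-of-range spans and overlapping spans."""
    problems = []
    previous_end = 0
    for start, end, label in sorted(spans):
        if label not in allowed_labels:
            problems.append(f"unknown label {label!r}")
        if not (0 <= start < end <= len(text)):
            problems.append(f"span ({start}, {end}) is out of range")
            continue  # an invalid span is not used for the overlap check
        if start < previous_end:
            problems.append(f"span ({start}, {end}) overlaps an earlier span")
        previous_end = max(previous_end, end)
    return problems


# Hypothetical document text and label set.
text = "Acme Corp paid on 12 March 2021."
labels = {"SUPPLIER", "DATE"}

clean = [(0, 9, "SUPPLIER"), (18, 31, "DATE")]
faulty = [(0, 9, "SUPPLIER"), (5, 9, "DATE"), (40, 45, "DATE"), (18, 31, "AMOUNT")]

print(find_annotation_problems(text, clean, labels))
print(find_annotation_problems(text, faulty, labels))
```

Checks like these catch mechanical mistakes only. Whether a label was applied consistently still depends on the annotation guideline.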

 

Typical workflow

The global process can be summarized in the following diagram:

Once the training data is ready, we train the model. Once the model is trained, we can start extracting information from documents.