Turn your PDF Document into Structured Data in seconds
Get the most out of your digital documents.
A large amount of critical information is stored in semi-structured or unstructured form in PDF documents.
This makes the information hard to use without digging through the documents manually, which can be tedious.
The DocReader module uses state-of-the-art natural language processing and machine learning to automatically extract information from documents, so you can focus on what matters to you.
- The user can prepare the training set used by the neural model by annotating a few documents with the relevant information. These annotations will be used by the model as examples of what is expected in its predictions. These annotations can be created directly on a custom UI packaged within the 2OS system.
- The training dataset is then sent to the training service and used to generate a trained deep learning model, which is then deployed to make predictions on new, unseen documents.
- The training pipeline starts from the annotated documents and applies multiple processing steps in order to generate the trained model:
  - Extract the text and its layout from the PDF.
  - Segment the text into a sequence of sentences based on punctuation and layout, then tokenize each sentence into a sequence of words.
  - Align the manual annotations with the tokens in the training set.
  - Build a deep neural network for sequence labelling.
  - Train the neural network on the processed annotated samples.
  - Produce the trained model artefact.
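The tokenization and alignment steps above can be illustrated with a minimal Python sketch. It is not DocReader's actual implementation: the `Token` class, the `tokenize`, `align` and `span` helpers, the BIO tagging scheme and the sample sentence are all assumptions made for illustration, and the sketch starts from text that has already been extracted from the PDF.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    start: int  # character offset of the first character
    end: int    # character offset one past the last character


def tokenize(sentence):
    """Split a sentence into word and punctuation tokens, keeping offsets."""
    return [Token(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", sentence)]


def align(tokens, annotations):
    """Turn (start, end, label) character spans into one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, label in annotations:
        first = True
        for i, tok in enumerate(tokens):
            # A token belongs to the annotation if it lies fully inside the span.
            if tok.start >= start and tok.end <= end:
                tags[i] = ("B-" if first else "I-") + label
                first = False
    return tags


def span(sentence, value, label):
    """Hypothetical helper: locate an annotated value by its first occurrence."""
    start = sentence.index(value)
    return (start, start + len(value), label)


sentence = "Invoice 2041 issued to Acme Corp."
annotations = [span(sentence, "2041", "NUMBER"),
               span(sentence, "Acme Corp", "CLIENT")]
tokens = tokenize(sentence)
tags = align(tokens, annotations)
for tok, tag in zip(tokens, tags):
    print(f"{tok.text:8} {tag}")
```

Each token ends up with exactly one tag, which is the form a sequence labelling network is typically trained on: "B-" marks the first token of an annotated value, "I-" a continuation, and "O" everything else.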
How it works
To extract structured information from PDF documents, we use supervised machine learning.
This means we first need to build a dataset that will be used to train the machine learning algorithm.
This dataset needs to contain example input documents together with the expected output for each of them.
The neural network model is trained on these input-output pairs and learns how to perform the task. The model can then be applied to new documents that were not seen during training to produce new predictions.
This process lets users build custom models that extract the information they need.
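As a rough illustration of what an input-output pair and a prediction might look like, here is a small Python sketch. The token list, the BIO tags, the label names and the `decode` function are assumptions chosen for the example; the source does not document DocReader's real data format.

```python
def decode(tokens, tags):
    """Group BIO-tagged tokens into a {label: [values]} dictionary."""
    entities = []      # (label, [words]) pairs in reading order
    open_label = None  # label of the entity currently being extended
    for token, tag in zip(tokens, tags):
        if tag == "O":
            open_label = None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "I" and label == open_label:
            entities[-1][1].append(token)      # continue the current entity
        else:
            entities.append((label, [token]))  # start a new entity
            open_label = label
    fields = {}
    for label, words in entities:
        fields.setdefault(label, []).append(" ".join(words))
    return fields


# One hypothetical training pair: the input tokens and the expected output tags.
example_input = ["Invoice", "2041", "issued", "to", "Acme", "Corp", "."]
example_output = ["O", "B-NUMBER", "O", "O", "B-CLIENT", "I-CLIENT", "O"]

# At prediction time the model would emit tags in the same format for an
# unseen document; decoding them yields the structured data.
structured = decode(example_input, example_output)
print(structured)
```

Training consumes many such pairs. At prediction time only the tokens are given, the model supplies the tags, and a decoding step like this one turns them into fields.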
The training data
Data quality is often a determining factor in the success of an information extraction system.
This is why it is crucial to follow a precise process:
– Use an annotation pipeline well suited to your task.
– Follow an efficient annotation guideline.
An example of such a process, using the 2OS Annotation Tool as the annotation system, is described below:
Create a dataset
Upload the documents to annotate
Start annotating!
The overall process can be summarized in the following diagram:
Once the training data is ready, we train the model, and once the model is trained, we can start extracting information from documents.
- State of the art: The neural architecture uses the latest advances in deep learning research to be fast and data-efficient.
- Customizability: Users can define their own set of labels and use them to train an extraction pipeline that automatically processes new documents.
- Synergy with other 2OS modules, especially 2OS Annotation Tool.
- Adaptability: DocReader can learn to extract any type of information expressed in natural language.
- Ease of use: Our app can be used by anyone, without writing code or having any technical knowledge of deep learning.
- Extract unstructured data quickly and accurately, and reduce time-consuming search and extraction tasks for your business.