DocReader Sentence Classification
Quickly identify key sentences in documents and reduce the manual effort of analyzing legal, financial, or any other type of document.
The Sentence Classification module is a customizable machine learning service that automatically extracts sentences from digital documents.
Most companies today extract information from documents by reading them and entering the relevant information manually. This is a very slow process and highly prone to error. To overcome this, 2OS Sentence Classification reads and analyzes your documents and extracts the relevant sentences for you. You either feed your application some keywords or train your own machine learning model, with no coding or prior AI knowledge required.
With 2OS Sentence Classification, you can build your own document processing workflows directly in your application, tailored to your needs. The service is highly scalable and available, so you can process thousands of pages in a matter of hours.
Synergy: add 2OS Lexical Field Generation (see the Text Processing module for more details) to Sentence Classification to boost your extraction capabilities even further.
- Extract sentences quickly and accurately with our state-of-the-art classifier. Depending on your needs, 2OS Sentence Classification uses either a simple keyword extraction method or a sophisticated convolutional neural network to identify the relevant sentences.
- Customize your application to your needs. Whether you need to extract healthcare, financial or legal information, you can configure an application for every scenario. Train your own model and use it in your application while monitoring its performance to guarantee the best extraction.
- Language agnostic. Whether you work on English, French or German documents, 2OS Sentence Classification uses the same approach so that you keep a single workflow for everything.
- Easily integrate human reviews into your process for your sensitive workflows with the 2OS Annotation Tool.
- Reduce cost for manual tasks by only adding human reviews to critical workflows. The rest will be taken care of automatically by the application.
- Situational: if there is little to no diversity in the sentences you want to extract, you can use keyword-based extraction. If those sentences are very diverse, you can easily train a model for them without coding.
- No code or AI knowledge required: you only have to create the application in the Studio and provide your documents. That’s it. Our comprehensive user guide will help you throughout the process of building your application in the Studio.
A glance at the algorithm
2OS Sentence Classification uses regexes and Convolutional Neural Networks to find the relevant sentences:
- It starts by detecting the document’s layout and structure.
- It detects the key elements in this structure, such as titles, headers and paragraphs, and understands the relationships between them.
- It splits every page into sentences.
- Depending on the setup, it retrieves the desired sentences either by keyword matching or by classification.
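The last two steps above can be illustrated with a minimal Python sketch. It is not the 2OS implementation: layout detection is left out, the sentence splitter is a naive punctuation-based one, and the names `split_sentences`, `retrieve_sentences` and `is_relevant` are hypothetical. The `is_relevant` hook stands in for either keyword matching or a trained classifier.

```python
import re
from typing import Callable, Iterable, List

# Naive assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(page_text: str) -> List[str]:
    """Split the text of one page into sentences (punctuation-based)."""
    return [s.strip() for s in _SENTENCE_END.split(page_text) if s.strip()]


def retrieve_sentences(
    pages: Iterable[str], is_relevant: Callable[[str], bool]
) -> List[str]:
    """Split every page into sentences and keep those the selector accepts.

    `is_relevant` is a hypothetical hook: it may wrap keyword matching
    or the prediction of a trained classifier.
    """
    selected = []
    for page in pages:
        for sentence in split_sentences(page):
            if is_relevant(sentence):
                selected.append(sentence)
    return selected


pages = [
    "The fund invests in European equities. Fees are paid annually.",
    "Exposure to derivatives is limited to 10% of net assets.",
]
# Keyword-style selector used here purely for illustration.
keyword_selector = lambda s: "derivatives" in s.lower()
print(retrieve_sentences(pages, keyword_selector))
```

Keeping the selector as a plain function means the same loop serves both setups: only the function passed in changes.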
For a detailed description, please refer to our scientific paper: https://www.aclweb.org/anthology/W18-3106.pdf
How it works
Figure 2. Sentence Classification pipeline
2OS Sentence Classification enables you to leverage state-of-the-art AI to automate your information extraction tasks without prior knowledge of machine learning. It is accessible, and building an application with it is fast.
2OS Sentence Classification uses a hybrid system that gives the user state-of-the-art machine learning capabilities for difficult tasks while keeping the simplicity of a rule-based model for simpler ones. The module has two components:
- Regex-based keyword matching
- Convnet-based sentence classifier
The regex approach takes as input a glossary of words that are relevant to the user. The algorithm then retrieves all the sentences containing at least one of the words in the glossary. This is very useful when the user has a clear idea of what they are looking for. You can also add a 2OS Lexical Field Generation module to expand the glossary with new terms.
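As an illustration of glossary-based retrieval, here is a minimal Python sketch. It assumes whole-word, case-insensitive matching, which the product may or may not use, and the function names `compile_glossary` and `match_sentences` are hypothetical.

```python
import re
from typing import Iterable, List, Pattern


def compile_glossary(glossary: Iterable[str]) -> Pattern[str]:
    """Build one case-insensitive regex matching any glossary term as a whole word."""
    # Longest terms first so multi-word entries win over their prefixes.
    terms = sorted({t.strip() for t in glossary if t.strip()},
                   key=lambda t: (-len(t), t))
    if not terms:
        raise ValueError("glossary must contain at least one term")
    alternatives = "|".join(re.escape(t) for t in terms)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def match_sentences(sentences: Iterable[str], glossary: Iterable[str]) -> List[str]:
    """Return the sentences that contain at least one glossary term."""
    pattern = compile_glossary(glossary)
    return [s for s in sentences if pattern.search(s)]


glossary = ["solar", "wind", "kWh"]
sentences = [
    "The plant produced 500 kWh last month.",
    "Solar capacity doubled.",
    "The windy season was long.",        # "windy" is not the whole word "wind"
    "A new wind turbine was installed.",
]
print(match_sentences(sentences, glossary))
```

The word boundaries (`\b`) keep a term such as "wind" from matching inside "windy". Dropping them would increase recall at the cost of more false positives.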
The convnet sentence classifier uses a neural network to learn from a training set what is relevant and what is not. As a user, you will need to annotate some documents using the 2OS Annotation Tool and train your sentence classification model. Once you have a trained model, you can use that model to detect the sentences in new documents. All of these steps can be done within your 2OS application. This approach is a bit more demanding but is much more effective on nuanced and complex tasks.
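To give an intuition of what a convolutional sentence classifier computes, the sketch below implements only the forward pass of a generic design: word embeddings, a 1D convolution over word windows, ReLU, max-over-time pooling and a sigmoid output. It is written in plain Python with random, untrained weights. It is not the 2OS model and not necessarily the architecture of the paper; the class name and all parameters are hypothetical, and training is omitted.

```python
import math
import random
from typing import List


class TinyConvSentenceClassifier:
    """Forward pass only: embeddings -> 1D conv -> ReLU -> max pooling -> sigmoid."""

    def __init__(self, vocab: List[str], embed_dim: int = 8,
                 num_filters: int = 4, window: int = 3, seed: int = 0):
        rng = random.Random(seed)  # fixed seed: weights are random but reproducible
        self.window = window
        # Index 0 is reserved for padding and unknown words.
        self.index = {word: i + 1 for i, word in enumerate(vocab)}
        self.embeddings = [[rng.uniform(-0.5, 0.5) for _ in range(embed_dim)]
                           for _ in range(len(vocab) + 1)]
        # Each filter spans `window` consecutive word vectors.
        self.filters = [[rng.uniform(-0.5, 0.5) for _ in range(window * embed_dim)]
                        for _ in range(num_filters)]
        self.out_weights = [rng.uniform(-0.5, 0.5) for _ in range(num_filters)]

    def features(self, sentence: str) -> List[float]:
        """Return one max-pooled activation per convolution filter."""
        ids = [self.index.get(tok, 0) for tok in sentence.lower().split()]
        ids += [0] * max(0, self.window - len(ids))  # pad short sentences
        vectors = [self.embeddings[i] for i in ids]
        pooled = []
        for weights in self.filters:
            activations = []
            for start in range(len(vectors) - self.window + 1):
                # Flatten the window of word vectors into one patch.
                patch = [x for vec in vectors[start:start + self.window] for x in vec]
                z = sum(w * x for w, x in zip(weights, patch))
                activations.append(max(0.0, z))  # ReLU
            pooled.append(max(activations))  # max-over-time pooling
        return pooled

    def predict_proba(self, sentence: str) -> float:
        """Probability-like relevance score in (0, 1)."""
        z = sum(w * f for w, f in zip(self.out_weights, self.features(sentence)))
        return 1.0 / (1.0 + math.exp(-z))


vocab = ["the", "fund", "may", "invest", "in", "derivatives"]
model = TinyConvSentenceClassifier(vocab)
print(model.predict_proba("The fund may invest in derivatives"))
```

Max-over-time pooling is what lets the same filters handle sentences of any length: each filter contributes a single value regardless of how many windows it slides over. With untrained weights the score is meaningless; in practice the weights would be learned from the annotated sentences.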
The model behind the classifier is described in much more detail in our paper: https://www.aclweb.org/anthology/W18-3106.pdf
You can design your application so that glossary entries can be added manually in the app. In the example below, the user added entries for an energy-related task.
Figure 3. Example of how to feed data into a glossary
You can then add your documents directly to the application and start detecting sentences.
Figure 4. Example of Sentence Classification result from a given glossary
For further information, please navigate to the tutorial on sentence classification.
Investment rules in fund prospectuses
In financial regulation, investment funds are required to publish documents called prospectuses that contain all the characteristics of the fund. These characteristics include the markets in which the fund is allowed to invest, any industry restrictions, the maximum exposure to derivatives, etc. These rules are critical information for attracting investors, so the fund must comply with them, and asset managers and depository institutions are the ones who ensure that it does. They assign accountants and analysts to read the prospectuses, extract the rules and implement the extracted rules in their IT systems. With 2OS Sentence Classification, the first two steps are automated. In this case, the bank trains its own 2OS Sentence Classification model since its documents are industry specific. This reduced the team’s operational cost, since less human intervention is required.
Figure 1. Example of Sentence Classification result on the Annotation Tool