DocReader RegexExtractor

Overview

Use rule-based methods to accurately extract information that follows similar patterns from a large number of documents in no time!

RegexExtractor is an Information Extraction system based on human-defined rules. It enables efficient extraction of information that follows the same pattern.

The system takes human-defined patterns and PDF documents as input, and extracts every text sequence that matches any of the patterns.

Features

  • Our Text Processing module accepts the most widely used document format, PDF, and lets you visualize the RegexExtractor results directly on the document.

  • The labelling feature groups multiple patterns under the same label, so that the visualized results are easier to read.
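
To illustrate the idea of grouping several patterns under one label, here is a minimal Python sketch. It does not use the RegexExtractor API: the mapping `LABELLED_PATTERNS`, the function `group_matches_by_label`, the `DATE` patterns and the sample text are all hypothetical and only show the principle with the standard `re` module.

```python
import re

# Hypothetical configuration: each label groups one or more patterns.
LABELLED_PATTERNS = {
    "DATE": [
        r"\b\d{2}/\d{2}/\d{4}\b",   # e.g. 01/02/2021
        r"\b\d{4}-\d{2}-\d{2}\b",   # e.g. 2021-03-15
    ],
    "ISIN": [
        r"\b(?:IE|LU|GB|FR)[0-9][0-9A-Z]{9}\b",
    ],
}


def group_matches_by_label(text, labelled_patterns):
    """Return a dict mapping each label to the strings matched by any of its patterns."""
    grouped = {}
    for label, patterns in labelled_patterns.items():
        hits = []
        for pattern in patterns:
            hits.extend(m.group(0) for m in re.finditer(pattern, text))
        grouped[label] = hits
    return grouped


sample = "ISIN LU1735752385, published 01/02/2021, updated 2021-03-15."
grouped = group_matches_by_label(sample, LABELLED_PATTERNS)
print(grouped)
```

Both date formats end up under the single `DATE` label, which is what keeps the visualized output compact when many patterns describe the same kind of entity.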

How it works

The user configures the RegexExtractor with a pattern for each entity they want to extract, for example the ISIN. The API can then be used in a workflow that takes a document as input and outputs structured data containing all the entities the user asked for.

Figure 1. Regex-based Information Extraction Pipeline
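
The flow described above (patterns and a document in, structured data out) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes the text of each PDF page has already been extracted into a list of strings, and the names `Entity` and `run_workflow` as well as the output fields are hypothetical, not the actual RegexExtractor output schema.

```python
import json
import re
from dataclasses import asdict, dataclass


@dataclass
class Entity:
    label: str   # label of the pattern that matched
    text: str    # matched text
    page: int    # 1-based page number
    start: int   # character offset of the match within the page text
    end: int


def run_workflow(pages, patterns):
    """Apply every labelled pattern to every page and return a list of dicts.

    pages: list of page texts (assumed already extracted from a searchable PDF)
    patterns: dict mapping a label to a regular expression string
    """
    compiled = {label: re.compile(p) for label, p in patterns.items()}
    entities = []
    for page_number, page_text in enumerate(pages, start=1):
        for label, regex in compiled.items():
            for m in regex.finditer(page_text):
                entities.append(
                    Entity(label, m.group(0), page_number, m.start(), m.end())
                )
    return [asdict(e) for e in entities]


pages = ["Key Investor Information", "ISIN: LU1735752385\nCurrency: EUR"]
patterns = {"ISIN": r"\b(?:IE|LU|GB|FR)[0-9][0-9A-Z]{9}\b"}
result = run_workflow(pages, patterns)
print(json.dumps(result, indent=2))
```

Keeping the page number and character offsets next to each match is one way to let a viewer highlight the result on the document.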

Benefits

  • Easy to use: as long as you know your pattern, the RegexExtractor is simple to use. Load the PDF and your patterns in the workflow and get the results. No training data is needed.

  • Codeless solution: the Studio allows you to use the RegexExtractor in a workflow that can be created without writing a single line of code.

  • Domain-independent: it works on any searchable PDF document, whatever the business case.

  • Performance: one of the main advantages of the RegexExtractor module is its fast execution, typically faster than running a Machine Learning model.

  • Explainable: you know exactly why a pattern matched or did not.



Use Case

Let’s suppose we want to extract an ISIN code from a Key Investor Information Document (KIID). The ISIN code is a twelve-character alphanumeric string used to identify securities. We can write a simplified regular expression to identify the ISIN code in a KIID document as follows:

(?:\s)((?:IE|LU|GB|FR)[0-9][0-9A-Z]{9})(?:\s)

This matches a sequence of twelve characters delimited by whitespace on both sides: a two-letter country code (IE, LU, GB or FR), followed by a digit, followed by 9 characters that are each a digit or an uppercase letter. Only the twelve-character code itself is captured.
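
To see how the pattern behaves outside the RegexExtractor, it can be tried with Python's standard `re` module. This is a minimal sketch: the sample text is made up for illustration and is not taken from a real KIID.

```python
import re

# The simplified ISIN pattern from this section, unchanged.
ISIN_PATTERN = re.compile(r"(?:\s)((?:IE|LU|GB|FR)[0-9][0-9A-Z]{9})(?:\s)")

# Made-up text standing in for the content of a searchable PDF page.
sample_text = "Share class A\nISIN: LU1735752385 \nCurrency: EUR\n"

# findall returns the content of the single capturing group for each match.
matches = ISIN_PATTERN.findall(sample_text)
print(matches)  # ['LU1735752385']
```

Because the pattern requires a whitespace character on both sides, a code placed at the very start or end of the text, or directly followed by punctuation, is not matched. Two codes separated by a single space are not both found either, since the first match consumes that space.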

We can then use this regular expression in RegexExtractor applied to a KIID document:

Figure 2. Example of Regex-based Information Extraction

The module finds LU1735752385 as a result. This showcases the advantage of using this API: it requires no data annotation and gives immediate results. The results can be very accurate when a well-crafted pattern is applied to a data field that follows a fixed format.