AI Builder

DocReader RegexExtractor

Overview

Use rule-based methods to accurately extract information that follows similar patterns from a large number of documents in no time!

RegexExtractor is an Information Extraction system based on human-defined rules. It enables efficient extraction of information that follows the same pattern.

The system takes as input the human-defined patterns and PDF documents, and extracts all text sequences that follow any of the patterns.

Features

  • Our Text Processing module accepts the most widely used document format, PDF, and you can view the RegexExtractor results directly on the document.

  • The labelling feature groups multiple patterns under the same label, which makes the results easier to read when visualized.

Benefits

  • Easy to use: as long as you know your pattern, the RegexExtractor is extremely simple to use. Simply load the PDF and your patterns in the workflow and enjoy the results. No training data is needed.

  • Codeless solution: the Studio lets you use the RegexExtractor in a workflow that can be created with zero lines of code.

  • Domain-independent: you can use any searchable PDF document, with no restriction on the business case.

  • Performance: one of the main advantages of the RegexExtractor module is its fast execution, which is typically faster than running a Machine Learning model.

  • Explainable: you know exactly why a match succeeded or failed.



How it works

The user configures the RegexExtractor with a pattern for the entity they want to extract, for example the ISIN. This API can then be used in a workflow that takes a document as input and outputs structured data containing all the entities the user wants.

Figure 1. Regex-based Information Extraction Pipeline
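
The flow in Figure 1 can be approximated outside the Studio with a few lines of Python. This is a minimal sketch and not the RegexExtractor implementation: it assumes the text of the PDF has already been extracted into a string, and the names LABELLED_PATTERNS and extract_entities, the DATE label, and the sample text are all hypothetical. The ISIN pattern here is a word-boundary variant chosen for the sketch.

```python
import re

# Hypothetical configuration: each label groups one or more patterns.
LABELLED_PATTERNS = {
    "ISIN": [r"\b(?:IE|LU|GB|FR)[0-9][0-9A-Z]{9}\b"],
    "DATE": [r"\b\d{2}/\d{2}/\d{4}\b", r"\b\d{4}-\d{2}-\d{2}\b"],
}


def extract_entities(text, labelled_patterns):
    """Return every match of any pattern as a list of dicts, sorted by position."""
    results = []
    for label, patterns in labelled_patterns.items():
        for pattern in patterns:
            for match in re.finditer(pattern, text):
                results.append({
                    "label": label,
                    "text": match.group(0),
                    "start": match.start(),
                    "end": match.end(),
                    # Keeping the pattern shows why each match was produced.
                    "pattern": pattern,
                })
    return sorted(results, key=lambda r: r["start"])


# Made-up text standing in for the text layer of a searchable PDF.
sample_text = "Fund ISIN: LU1735752385 . Document dated 2021-02-19 (replaces 01/02/2020)."
entities = extract_entities(sample_text, LABELLED_PATTERNS)
for entity in entities:
    print(entity["label"], entity["text"])
```

Storing the label and the originating pattern with each match mirrors two points made above: several patterns can share one label, and every result can be traced back to the rule that produced it.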

Use Case

Let’s suppose we want to extract an ISIN code from a Key Investor Information Document (KIID). The ISIN code is a twelve-character alphanumeric string used to identify securities. We can write a simplified regular expression to identify the ISIN code in a KIID as follows:

(?:\s)((?:IE|LU|GB|FR)[0-9][0-9A-Z]{9})(?:\s)

This matches a sequence of characters, delimited by whitespace on both sides, that starts with a two-letter country code such as IE or FR, continues with a digit, and ends with 9 alphanumeric characters (digits or uppercase letters), for twelve characters in total.
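
To see how the pattern behaves, it can be tried with Python's standard re module. This is only an illustrative sketch: the sample strings are made up, and the engine used inside RegexExtractor may differ in detail from Python's.

```python
import re

# The simplified ISIN pattern from the text, unchanged.
ISIN_PATTERN = re.compile(r"(?:\s)((?:IE|LU|GB|FR)[0-9][0-9A-Z]{9})(?:\s)")

# Made-up sample standing in for the text layer of a KIID.
sample = "Share class A ISIN: LU1735752385 \nThis fund is managed by ..."

# group(1) is the ISIN itself, without the surrounding whitespace.
isins = [m.group(1) for m in ISIN_PATTERN.finditer(sample)]
print(isins)

# The pattern requires whitespace on both sides, so these yield no match.
at_start = ISIN_PATTERN.findall("LU1735752385 at the start of the text")
in_brackets = ISIN_PATTERN.findall("code (LU1735752385) in brackets")
print(at_start, in_brackets)
```

The last two calls show a consequence of the whitespace delimiters: a code at the very start of the text, or one wrapped in punctuation, is skipped. If such cases occur in your documents, one option is to replace the (?:\s) delimiters with word boundaries (\b).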

We can then use this regular expression in RegexExtractor applied to a KIID document:

Figure 2. Example of Regex-based Information Extraction

The module returns LU1735752385. This showcases the advantage of the API: it requires no data annotation and gives immediate results. The results can be very accurate when a well-crafted pattern is applied to a data field that follows a fixed format.