Text generation is a machine learning application widely used in both industry and academia. It ranges from simple word and sentence suggestions to generating entire documents. Our application is a generation system based on data-to-text generation,
that is, the task of generating well-formed natural language descriptions from formal representations.
The system can generate any type of document, such as reports, invoices, and contracts, across multiple industries, including finance, legal, and sports, among others.
The user enters information in a structured format through a user-friendly interface, then uses the system to generate a document that presents this information as well-structured text.
DocWriter is a tool for generating different types of documents from structured data using cutting-edge AI generation techniques.
- DocWriter generates a document from structured data and is independent of both domain and document type.
- Time-saving: when the user updates the data, the change is detected automatically and only the sections affected by it are regenerated.
- A report is generated along with the document. It contains KPIs and details on how much of the user-supplied information the generated document covers.
- An annotation tool is integrated.
- A data management solution is integrated. For example, in financial document generation, once input data has been created for a specific fund, it can be reused to generate different types of documents for that fund.
- A new document type can be added in a few minutes simply by annotating one document (or more, if a variety of sentences is expected).
- CSV input data can be imported into the data management tool.
- Input data created via 2OS Studio can be exported to a CSV file.
- Several generation strategies based on cutting-edge AI techniques are used: template-based, deep-learning-based, domain-knowledge-based, and rule-based.
- The generated sentences are grammatical because an in-house spell checker and grammar corrector are integrated into the generation algorithms.
- The generation algorithms support multiple languages, with automatic language detection.
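The time-saving behaviour listed above (regenerating only the sections affected by a data update) can be illustrated with a minimal sketch. It is not DocWriter's implementation: the function names, the `section_fields` mapping, and the sample field names are all hypothetical, and the sketch assumes that each section is known to depend on a fixed set of input fields.

```python
from typing import Dict, List, Optional, Set

Data = Dict[str, Optional[str]]


def changed_fields(old: Data, new: Data) -> Set[str]:
    """Return the names of fields whose value differs between two versions."""
    return {name for name in old.keys() | new.keys() if old.get(name) != new.get(name)}


def sections_to_regenerate(section_fields: Dict[str, Set[str]], old: Data, new: Data) -> List[str]:
    """Return the sections that use at least one changed field (hypothetical helper)."""
    changed = changed_fields(old, new)
    return [section for section, fields in section_fields.items() if fields & changed]


# Hypothetical mapping from section title to the fields that section uses.
section_fields = {
    "Overview": {"fund_name", "manager"},
    "Performance": {"fund_name", "ytd_return"},
    "Fees": {"management_fee"},
}

old_data = {"fund_name": "Alpha Fund", "manager": "J. Doe", "ytd_return": "4.2%", "management_fee": "1%"}
new_data = dict(old_data, manager="A. Smith")  # only the manager changed

stale = sections_to_regenerate(section_fields, old_data, new_data)
print(stale)
```

In this example only the "Overview" section depends on the modified field, so the other sections would be left untouched.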
How it works
To generate a document from structured input, we use a dynamic template method. This requires building a dataset of templates from a set of documents of the same type as the target document:
You annotate a set of documents of the same type as the document you want to generate, then enter your information in a field-value format, from which the system generates a new document expressing that information.
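As a rough illustration of how an annotated sentence can become a reusable template, the sketch below replaces each annotated span with a field placeholder and then fills the template with new field-value input. The annotation format (character offsets plus a field name), the helper names, and the sample sentence are assumptions made for this example, not the platform's actual data structures.

```python
from typing import List, Tuple

Annotation = Tuple[int, int, str]  # (start offset, end offset, field name)


def span(sentence: str, text: str, field: str) -> Annotation:
    """Build an annotation for the first occurrence of `text` (hypothetical helper)."""
    start = sentence.index(text)
    return (start, start + len(text), field)


def escape(text: str) -> str:
    """Escape literal braces so that str.format leaves them untouched."""
    return text.replace("{", "{{").replace("}", "}}")


def to_template(sentence: str, annotations: List[Annotation]) -> str:
    """Replace every annotated span with a {field} placeholder."""
    parts = []
    cursor = 0
    for start, end, field in sorted(annotations):
        parts.append(escape(sentence[cursor:start]))
        parts.append("{" + field + "}")
        cursor = end
    parts.append(escape(sentence[cursor:]))
    return "".join(parts)


# A sentence from an annotated document, with two annotated entities.
sentence = "Alpha Fund is managed by J. Doe."
annotations = [
    span(sentence, "Alpha Fund", "fund_name"),
    span(sentence, "J. Doe", "manager"),
]
template = to_template(sentence, annotations)

# New information entered by the user in field-value format.
new_input = {"fund_name": "Beta Fund", "manager": "A. Smith"}
generated = template.format(**new_input)
print(template)
print(generated)
```

The same template can then be reused for any input that provides values for `fund_name` and `manager`.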
The annotation process is very important, since the quality of the annotations directly affects the quality of the generated document. Before you start annotating, define your data model, that is, the set of fields (entities) you want to supply as input for generation. Follow the annotation process defined in the platform, which generally sets rules for choosing field names and defines how entities are annotated. So that the system can recognize the structure of the document, also annotate the section titles: this lets the system build a database of templates for each section and learn the structure of the document to be generated.
The generation system relies on scoring metrics and clustering methods to select the best templates dynamically according to your input data.
The most important scoring metric is coverage, which compares (1) the fields with non-empty values in the data entered by the user with (2) the fields used in a document's templates. As noted above, templates are stored by section; for each section, the templates chosen are therefore those that maximize coverage for your input.
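The text does not give the exact formula for coverage, so the following sketch uses one plausible definition as an assumption: a template is usable only if every field it mentions has a non-empty value, and its score is the share of the filled input fields that it uses. The function names, the `{field}` placeholder syntax, and the sample templates are hypothetical.

```python
import re
from typing import Dict, List, Optional, Set

PLACEHOLDER = re.compile(r"\{(\w+)\}")


def template_fields(template: str) -> Set[str]:
    """Fields referenced by a template, written as {field_name}."""
    return set(PLACEHOLDER.findall(template))


def filled_fields(data: Dict[str, str]) -> Set[str]:
    """Input fields whose value is non-empty."""
    return {name for name, value in data.items() if value and value.strip()}


def coverage(template: str, data: Dict[str, str]) -> float:
    """Assumed metric: fraction of filled input fields used by the template.

    Returns 0.0 if the template has no fields or needs a field with no value.
    """
    needed = template_fields(template)
    available = filled_fields(data)
    if not needed or not needed <= available:
        return 0.0
    return len(needed) / len(available)


def best_template(templates: List[str], data: Dict[str, str]) -> Optional[str]:
    """Pick the template with the highest coverage, or None if none is usable."""
    scored = [(coverage(t, data), t) for t in templates]
    score, template = max(scored, key=lambda pair: pair[0], default=(0.0, None))
    return template if score > 0 else None


# Hypothetical templates stored for one section of a fund document.
templates = [
    "{fund_name} is managed by {manager}.",
    "{fund_name} was launched on {launch_date} and is managed by {manager}.",
    "{fund_name} is a {asset_class} fund.",
]
data = {"fund_name": "Alpha Fund", "manager": "J. Doe", "launch_date": "", "asset_class": ""}

chosen = best_template(templates, data)
print(chosen.format(**data))
```

Here the second and third templates are rejected because they need `launch_date` or `asset_class`, which the user left empty, so the first template is selected. A production system would combine such a score with the other metrics and the clustering step mentioned above.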
Using the 2OS AI Builder, you can create a Document Generation application in which you build your dataset and then fill in the information for the document to generate:
Figure 2. Example of a Document Generation application’s UI
To make the annotation process user-friendly, we built an easy-to-use annotation tool for PDF documents. You can quickly annotate a document to create ready-to-use templates in a few easy steps.
This annotation tool can be integrated into your application, so you can perform every operation within the application without opening an external annotation tool.
Once you have annotated enough documents, simply switch to the page of the application where you enter the structured information you want to generate a document from. The page presents the list of entities as a form, which you fill out with the information for the document you would like to generate.
Once you have entered the information, click the "generate document" button and wait a few seconds for a link to a Google document where you can view the generated document and, if needed, make adjustments.