Pre-Annotation Tool


The Pre-Annotation Tool helps prepare a large annotated dataset for training a supervised machine learning model in a simple and efficient way, reducing annotation cost and time. It offers several pre-annotation methods that automatically find mentions of entities in the documents.


Suppose the user wants to prepare a labeled dataset from documents in a new field (pharma, car manufacturing, etc.). Before the documents are annotated, the 2OS app offers to pre-annotate them using one of the solutions available in the 2OS AI Builder:

  • Glossary based: The user can pre-annotate the documents using either bundled glossaries or glossaries uploaded by the user. If the user chooses this option, the app loads the glossaries and uses them to annotate the documents. The user is then presented with a set of pre-annotated documents, which reduces annotation time.
  • Regular Expressions (Regex) based: The user can pre-annotate the documents using either bundled regexes or user-defined regexes. Expressions in the documents that match the regexes are annotated.
  • Similarity Scoring based: Given a list of sentences provided by the user, the app computes similarity scores between these sentences and the sentences in the documents. Once the most similar sentences are identified, they are used to annotate the documents.
  • Machine Learning model based: Once the user has gathered between 10 and 30 annotated documents, a machine learning model can be trained on this dataset. Although this model is not ready for production use, the user can pre-annotate documents with it to speed up human annotation.
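
As an illustration of the glossary and regex options, the following is a minimal Python sketch of how such pre-annotation could be done. It is not the 2OS implementation: the Span type, the function names, the labels (DRUG, DOSE, DATE) and the sample glossary and patterns are all hypothetical, and matching is assumed to be whole-word and, for glossaries, case-insensitive.

```python
import re
from typing import Dict, List, NamedTuple


class Span(NamedTuple):
    """One pre-annotation (hypothetical structure, not the 2OS format)."""
    start: int   # character offset where the mention begins
    end: int     # character offset just past the mention
    label: str   # entity label assigned to the mention
    text: str    # matched surface text


def annotate_with_glossary(text: str, glossary: Dict[str, List[str]]) -> List[Span]:
    """Label every whole-word, case-insensitive occurrence of a glossary term."""
    spans = []
    for label, terms in glossary.items():
        for term in terms:
            # Escape the term so that it is matched literally, not as a regex.
            pattern = r"\b" + re.escape(term) + r"\b"
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                spans.append(Span(m.start(), m.end(), label, m.group()))
    return sorted(spans)


def annotate_with_regex(text: str, patterns: Dict[str, str]) -> List[Span]:
    """Label every match of each regular expression with its label."""
    spans = []
    for label, pattern in patterns.items():
        for m in re.finditer(pattern, text):
            spans.append(Span(m.start(), m.end(), label, m.group()))
    return sorted(spans)


# Hypothetical sample data for a pharma document.
text = "Take 200 mg of ibuprofen on 2021-03-15."
glossary = {"DRUG": ["ibuprofen", "aspirin"]}
patterns = {"DOSE": r"\b\d+\s?mg\b", "DATE": r"\b\d{4}-\d{2}-\d{2}\b"}

# Merge both sources and order the spans by position in the text.
spans = sorted(annotate_with_glossary(text, glossary)
               + annotate_with_regex(text, patterns))
for span in spans:
    print(span.label, repr(span.text), span.start, span.end)
```

Character offsets are kept with each span so that an annotation interface could highlight the mention in place. Overlapping matches from different sources are not resolved in this sketch; a real tool would need a policy for them.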


How it works


  • Reduce annotation team effort: Natural Language Processing experiments often require a large amount of labeled data in order to develop a supervised machine learning algorithm that performs well. Creating these resources is time-consuming and costly. The automatic pre-annotation tool speeds up the annotation process by reducing human annotation effort.
  • Ensure consistency in the labeled data: several factors, such as inter-annotator disagreement and the annotator’s experience and domain knowledge, can affect the human annotation process and, consequently, the quality of the labeled data. The automatic pre-annotation tool can label data in a consistent way.
  • Adopt the most efficient option: 2OS offers several solutions for pre-annotating documents, and the user can choose the one best suited to the task at hand. For example, a user who needs to train a classification model that identifies sentences related to a specific topic can adopt the pre-annotation option based on similarity scoring.
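
For the sentence classification example, similarity-based pre-annotation could be sketched as follows. The scoring method used by 2OS is not specified here, so this sketch assumes a simple bag-of-words cosine similarity; the function names, the SIDE_EFFECT label, the threshold value and the sample sentences are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def bag_of_words(sentence: str) -> Counter:
    """Lowercase the sentence and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two token-count vectors (0.0 if one is empty)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pre_annotate_by_similarity(
    seed_sentences: List[str],
    doc_sentences: List[str],
    label: str,
    threshold: float = 0.5,  # assumed cut-off, to be tuned per project
) -> List[Tuple[str, str, float]]:
    """Return (sentence, label, score) for document sentences close to any seed."""
    seed_vectors = [bag_of_words(s) for s in seed_sentences]
    results = []
    for sentence in doc_sentences:
        vector = bag_of_words(sentence)
        # Score each document sentence by its best match among the seeds.
        score = max((cosine_similarity(vector, sv) for sv in seed_vectors),
                    default=0.0)
        if score >= threshold:
            results.append((sentence, label, score))
    return results


# Hypothetical sample data: one seed sentence given by the user, two document sentences.
seeds = ["The drug caused severe headaches."]
doc_sentences = [
    "Patients reported severe headaches after the drug.",
    "The meeting is scheduled for Monday.",
]
results = pre_annotate_by_similarity(seeds, doc_sentences, label="SIDE_EFFECT")
for sentence, label, score in results:
    print(f"{label}  {score:.2f}  {sentence}")
```

Word-count vectors are used only to keep the sketch dependency-free. Sentence embeddings or TF-IDF weighting would be drop-in alternatives for the scoring function.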

Use Case

There can be two types of users with different needs:

  • First-time users: no trained models are available. The pre-annotation options are glossaries, regexes, or the models bundled in 2OS (for example, Named Entity Recognition for common fields such as dates and locations).

When a first-time user wants to train a Named Entity Recognition model using the pre-annotation tool, the user

  • uploads documents to the DocReader app
  • goes to the pre-annotation screen and chooses one of the options (Glossary, Regex, Similarity Scoring, Machine Learning Model)
  • chooses the glossary to use for pre-annotating the documents
  • goes to the Annotation Tool and finds the pre-annotated documents

  • Expert users: users who already have models trained with 2OS. They can use these models to pre-annotate new data for the same project or for a different one.