AI for Structured Data

Data Normalization

Overview

Before you can exploit structured data, you need to ensure its quality at several levels: syntactic errors; terms used inconsistently, including their variants, abbreviations and acronyms; heterogeneous formats (e.g. in writing dates, letters, currencies, etc.); and duplicates. With a large volume of structured data, normalizing every term manually would not only be time-consuming and costly, it would also introduce new problems such as inconsistencies in the normalized data.

We therefore implemented data normalization algorithms that automatically profile the contents of a structured dataset, measure the problems, correct the errors and normalize the contents against references. As a result, you can retrieve all variations of each term, including its abbreviations and acronyms; identify duplicates; find spelling errors; check whether a term or value is coherent with the rest of the data; and finally obtain quality-checked, normalized data.
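To make the profiling step more concrete, here is a minimal sketch in Python. It is not our production algorithm: the function names `normalization_key` and `profile` are hypothetical, and the comparison key (accents, case and punctuation removed) is an assumption chosen only for illustration. It counts exact duplicates and groups surface forms that appear to refer to the same term.

```python
from __future__ import annotations

import unicodedata
from collections import Counter, defaultdict


def normalization_key(term: str) -> str:
    """Hypothetical comparison key: drop accents, case, punctuation and extra spaces."""
    decomposed = unicodedata.normalize("NFKD", term)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in no_accents.casefold())
    return " ".join(cleaned.split())


def profile(terms: list[str]) -> dict:
    """Measure exact duplicates and group surface forms that share the same key."""
    counts = Counter(terms)
    groups = defaultdict(set)
    for term in terms:
        groups[normalization_key(term)].add(term)
    return {
        "total": len(terms),
        "distinct": len(counts),
        # Every occurrence beyond the first of an identical string is a duplicate.
        "exact_duplicates": sum(c - 1 for c in counts.values()),
        # Only keys with more than one surface form are reported as variants.
        "variant_groups": {k: sorted(v) for k, v in groups.items() if len(v) > 1},
    }


report = profile([
    "Coca-Cola", "coca cola", "Coca-Cola",
    "Société Générale", "societe generale", "Nestle",
])
print(report["exact_duplicates"])  # 1
print(report["variant_groups"])
```

A key of this kind only catches formatting variants. Abbreviations, acronyms and spelling errors need a similarity measure or a reference list on top of it.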

Features

  • Given a list of entities, detect entities with spelling errors and correct them.
  • Given a list of entities, detect duplicate entities and remove them.
  • Given a list of entities, detect variations that refer to the same entity (plural, singular, abbreviation, etc.) and standardize them.
  • During the validation step, the model suggests the top 5 candidates and leaves the choice to the user.
  • The user can also accept what the model proposes without any validation step.
  • The data can be normalized based on two types of references:
    • if an external reference exists, our model can refer to it during the normalization
    • if no reference exists, our model builds one internally by analyzing the variations of each term and their importance
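The features above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the string similarity from Python's standard `difflib` module as a stand-in for our model, treats frequency as the measure of a term's importance, and all function names (`build_internal_reference`, `suggest_candidates`, `normalize`) as well as the cutoff values are hypothetical.

```python
import difflib
from collections import Counter


def build_internal_reference(terms, cutoff=0.8):
    """Assumed strategy: the most frequent form of each cluster becomes canonical."""
    reference = []
    # Visit forms from most to least frequent, so frequent forms are kept first.
    for term, _ in Counter(terms).most_common():
        if not difflib.get_close_matches(term, reference, n=1, cutoff=cutoff):
            reference.append(term)
    return reference


def suggest_candidates(term, reference, n=5, cutoff=0.6):
    """Return up to n reference terms similar to `term`, best match first."""
    return difflib.get_close_matches(term, reference, n=n, cutoff=cutoff)


def normalize(terms, reference=None, validate=None):
    """Normalize terms against an external reference, or an internal one if none is given.

    `validate(term, candidates)` lets a user pick among the candidates;
    without it, the best candidate is accepted automatically.
    """
    if reference is None:
        reference = build_internal_reference(terms)
    normalized = []
    for term in terms:
        candidates = suggest_candidates(term, reference)
        if not candidates:
            normalized.append(term)  # no match found: keep the term unchanged
        elif validate is not None:
            normalized.append(validate(term, candidates))
        else:
            normalized.append(candidates[0])
    # Remove duplicates while preserving first-seen order.
    return list(dict.fromkeys(normalized))


terms = ["Microsoft", "Microsoft", "Microsft", "Google", "Gogle", "Google", "Microsoft"]
print(normalize(terms))  # ['Microsoft', 'Google']
```

Passing `reference=[...]` corresponds to the external-reference case, and passing a `validate` callback corresponds to the validation step in which the user chooses among the suggested candidates.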

Benefits

  • Adaptability: our app works regardless of the domain and can handle any type of structured data.
  • Interactivity: users are encouraged to take part in the process and can freely define how they expect the data to be normalized.
  • Speed: our app makes the whole process much faster than doing it manually. To give you an idea, you can get 300,000 entities normalized within a minute!
  • Including the client in the process: a top priority. Either the client has its own reference, which we can connect to our algorithms (customizable), or it has unnormalized data from which our algorithm can intelligently generate a reference to include in our glossary of reference terms. Usually, as a user, you either get to choose the data you want to normalize, or you are not involved in the process and everything is normalized automatically. With our AI Data Normalisation app, you can do both.