In the use case on text anonymisation of sensitive data, SIESTA develops a tool that allows training machine learning models on anonymised text data. In addition, a synthetic dataset will be generated to evaluate the performance of the tool on both the original and the anonymised datasets.
Current methods are mainly based on masking or suppressing the original sensitive data, on pseudonymisation techniques that partially replace sensitive data with its category, or on noising, where the sensitive data is perturbed. The proposed approach instead generates data similar to the sensitive content and substitutes it for the original, avoiding any disclosure of critical information while still allowing machine learning models to classify the text into the same category as the original, with very similar performance.
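To make the contrast concrete, the sketch below illustrates the three existing strategies mentioned above (masking/suppression, pseudonymisation by category, and synthetic substitution of similar data) on a toy sentence. Everything here is an illustrative assumption, not SIESTA's implementation: the entity spans are hard-coded (a real pipeline would detect them with a named-entity recognition model), and the surrogate pools are invented examples.

```python
import random

# Toy input; the sensitive spans (start, end, category) are assumed to
# come from an upstream NER step, not detected here.
text = "Alice Smith visited Oslo on 2023-05-04."
spans = [(0, 11, "PERSON"), (20, 24, "LOCATION"), (28, 38, "DATE")]

# Hypothetical pools of synthetic surrogates per category.
SURROGATES = {
    "PERSON": ["Maria Lopez", "John Carter"],
    "LOCATION": ["Lyon", "Porto"],
    "DATE": ["2021-11-19", "2022-02-07"],
}

def anonymise(text, spans, replace):
    """Rewrite each sensitive span right-to-left so earlier offsets stay valid."""
    out = text
    for start, end, category in sorted(spans, reverse=True):
        out = out[:start] + replace(out[start:end], category) + out[end:]
    return out

# Masking/suppression: the sensitive content is simply removed.
masked = anonymise(text, spans, lambda s, c: "***")
# Pseudonymisation: sensitive data is replaced by its category label.
pseudo = anonymise(text, spans, lambda s, c: f"[{c}]")
# Synthetic substitution: similar but non-original data replaces the sensitive spans.
synthetic = anonymise(text, spans, lambda s, c: random.choice(SURROGATES[c]))

print(masked)   # *** visited *** on ***.
print(pseudo)   # [PERSON] visited [LOCATION] on [DATE].
print(synthetic)
```

The intuition behind the synthetic variant is that the rewritten sentence keeps its grammatical shape and topical content, so a downstream classifier should assign it the same category as the original, while none of the real sensitive values survive.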