Reconstruct the Original Training Data from N-Grams.

When building language models from n-grams, we should be aware that the original training data can often be reconstructed from the resulting LM. If we are training on sensitive data, this raises real privacy concerns. With a high enough n-gram order, reconstruction becomes straightforward: consecutive n-grams overlap in n-1 tokens, so they can be stitched back together like puzzle pieces. For example, a 5-gram model usually preserves enough context to recover long stretches of the training text. With only a 3-gram model and a large corpus, the overlaps become ambiguous (many different trigrams share the same two-word prefix), and reconstruction is probably not feasible.
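To make the risk concrete, here is a minimal sketch of the stitching idea: extract all 5-grams from a toy "corpus", then greedily chain them back together by matching 4-token overlaps. The function names and the greedy strategy are my own illustration, not a reference to any particular toolkit, and a real attack would have to handle ambiguous overlaps and repeated n-grams more carefully.

```python
from collections import defaultdict

def extract_ngrams(text, n=5):
    """Collect the n-grams (as token tuples) an LM toolkit would count."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def reconstruct(ngrams):
    """Greedily stitch n-grams together via their (n-1)-token overlaps."""
    n = len(ngrams[0])
    # Index each n-gram by its (n-1)-token prefix.
    by_prefix = defaultdict(list)
    for g in ngrams:
        by_prefix[g[:-1]].append(g)
    # A starting n-gram is one whose prefix is not any other n-gram's suffix.
    suffixes = {g[1:] for g in ngrams}
    start = next(g for g in ngrams if g[:-1] not in suffixes)
    out = list(start)
    while True:
        key = tuple(out[-(n - 1):])
        nxt = by_prefix.get(key)
        if not nxt:
            break
        out.append(nxt[0][-1])       # extend by the overlapping n-gram's last token
        by_prefix[key] = nxt[1:]     # consume it so repeats cannot loop forever
    return " ".join(out)

secret = "the cat sat on the mat and then slept soundly"
print(reconstruct(extract_ngrams(secret)))  # recovers the original sentence
```

With a 3-gram model the same greedy chaining would hit many prefixes with multiple continuations, which is exactly why low orders over large corpora leak less.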

Other ways to build LMs are RNNLMs or Transformers, but we still must be careful with sensitive data, since neural models can also memorize training examples. It is my understanding that Transformers need a very large dataset to work well, and that at smaller scales RNNLMs would perform better. PyTorch probably has example scripts for building RNNLMs.
