Reconstruct the Original Training Data from N-Grams.

When building language models from n-grams, we should be aware that the original training data can often be reconstructed from the resulting LM. If we are training on sensitive data, this raises real privacy concerns. With a high enough n-gram order, reconstruction becomes straightforward: consecutive n-grams overlap in n-1 tokens, so they can be stitched back together like puzzle pieces. For example, a 5-gram model usually preserves enough context to recover long stretches of the training text. With only a 3-gram model and a large corpus, the overlaps become ambiguous (many different trigrams share the same two-word prefix), and reconstruction is probably not feasible.
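To make the risk concrete, here is a minimal sketch of the stitching idea: extract all 5-grams from a toy "corpus", then greedily chain them back together by matching 4-token overlaps. The function names and the greedy strategy are my own illustration, not a reference to any particular toolkit, and a real attack would have to handle ambiguous overlaps and repeated n-grams more carefully.

```python
from collections import defaultdict

def extract_ngrams(text, n=5):
    """Collect the n-grams (as token tuples) an LM toolkit would count."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def reconstruct(ngrams):
    """Greedily stitch n-grams together via their (n-1)-token overlaps."""
    n = len(ngrams[0])
    # Index each n-gram by its (n-1)-token prefix.
    by_prefix = defaultdict(list)
    for g in ngrams:
        by_prefix[g[:-1]].append(g)
    # A starting n-gram is one whose prefix is not any other n-gram's suffix.
    suffixes = {g[1:] for g in ngrams}
    start = next(g for g in ngrams if g[:-1] not in suffixes)
    out = list(start)
    while True:
        key = tuple(out[-(n - 1):])
        nxt = by_prefix.get(key)
        if not nxt:
            break
        out.append(nxt[0][-1])       # extend by the overlapping n-gram's last token
        by_prefix[key] = nxt[1:]     # consume it so repeats cannot loop forever
    return " ".join(out)

secret = "the cat sat on the mat and then slept soundly"
print(reconstruct(extract_ngrams(secret)))  # recovers the original sentence
```

With a 3-gram model the same greedy chaining would hit many prefixes with multiple continuations, which is exactly why low orders over large corpora leak less.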

Other ways to build LMs are RNNLMs or Transformers, but we still must be careful with sensitive data, since neural models can also memorize training examples. It is my understanding that Transformers need a very large dataset to work well, and that at smaller scales RNNLMs would perform better. PyTorch probably has example scripts for building RNNLMs.
