Reconstructing the Original Training Data from N-Grams
When building language models from n-grams, we should be aware that someone can reconstruct much of the original training data from the finished LM. If the training data is sensitive, this is a real privacy concern. The higher the n-gram order, the easier reconstruction becomes: with a 5-gram model, four-word contexts are mostly unique in the data, so the original text can probably be stitched back together just by chaining overlapping n-grams. With only a 3-gram model over a large corpus, each two-word context has many plausible continuations, so reconstruction is probably not feasible.
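To make the risk concrete, here is a minimal sketch of the attack. It assumes we can read the model's raw 5-gram counts; the `ngrams` table and the `reconstruct` helper are my own illustration, not any toolkit's API. Because long contexts are almost always unique, greedily following the most frequent continuation replays the training text verbatim:

```python
from collections import defaultdict

def reconstruct(ngrams, start, max_len=50):
    """Greedily stitch text back together from 5-gram counts.

    `ngrams` maps a 4-token context to the continuations observed in
    training; `start` is a known 4-token seed. Hypothetical helper.
    """
    out = list(start)
    for _ in range(max_len):
        context = tuple(out[-4:])
        continuations = ngrams.get(context)
        if not continuations:
            break  # context never seen in training
        # With high-order n-grams most contexts are unique, so the
        # single observed continuation is the original next token.
        out.append(max(continuations, key=continuations.get))
    return out

# Toy "training data" and the 5-gram counts a model would store.
corpus = "my social security number is 078 05 1120 thank you".split()
ngrams = defaultdict(lambda: defaultdict(int))
for i in range(len(corpus) - 4):
    ngrams[tuple(corpus[i:i + 4])][corpus[i + 4]] += 1

print(" ".join(reconstruct(ngrams, corpus[:4])))
# -> the full sentence, sensitive number included
```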
Other ways to build LMs are RNNLMs or Transformers. Neural models can also memorize and regurgitate training text, so we still must be careful with sensitive data. It is my understanding that Transformers need a very large dataset to pay off, and at smaller scales RNNLMs tend to perform better. The PyTorch examples repository includes a word-level RNN language model script (word_language_model in pytorch/examples).
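As a rough sketch of what such a model looks like, here is a minimal word-level LSTM language model in PyTorch, loosely in the spirit of the word_language_model example; the class name, layer sizes, and the random toy batch are all my own choices, not the example's actual code:

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    """Word-level LSTM language model (hypothetical minimal version)."""
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        x = self.embed(tokens)           # (batch, seq, emb)
        x, state = self.lstm(x, state)   # (batch, seq, hidden)
        return self.out(x), state        # logits over the next token

# One training step on toy data: predict each next token.
vocab_size = 1000
model = RNNLM(vocab_size)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
batch = torch.randint(vocab_size, (8, 21))   # random token ids
inputs, targets = batch[:, :-1], batch[:, 1:]
logits, _ = model(inputs)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
opt.step()
```

Note that memorization does not go away here: a network trained long enough on a small sensitive corpus can also reproduce it when sampled.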