Why replace words that are not in lexicon.txt with UNK when building the LM?

 

We have to decide what to do with UNKNOWN words. 

Many language model tools (SRILM, etc.) can automatically map unknown words to <UNK>.

If the chosen tool doesn't do this, we have to do it ourselves during data preparation.


Example: we have corpus.txt, and our lexicon contains only one word, HI.

During data preparation we need to replace every word that is not in our vocabulary with UNK.

Before: HI JACK

After: HI UNK
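
A minimal Python sketch of this replacement step, in case the LM tool does not do it for us. It assumes that each lexicon line starts with the word itself, followed by its pronunciation (the "HI hh ay" entry is made up for illustration), and that the corpus is whitespace-tokenized. The function names load_lexicon and replace_oov are hypothetical, not part of any toolkit.

```python
def load_lexicon(lines):
    """Build the vocabulary from lexicon lines.

    Assumption: the first whitespace-separated field of each line is the word.
    """
    return {line.split()[0] for line in lines if line.strip()}


def replace_oov(line, vocab, unk="UNK"):
    """Replace every token that is not in vocab with the unk symbol."""
    return " ".join(word if word in vocab else unk for word in line.split())


# Toy lexicon with a single word, mirroring the example above.
vocab = load_lexicon(["HI hh ay"])
print(replace_oov("HI JACK", vocab))  # HI UNK
```

For a real corpus, the same function would be applied to each line of corpus.txt, with the result written to a new file that is then passed to the LM tool.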


Still need to investigate the "why" part.

Do some experiments.

It would be interesting to see what happens if we keep the OOV (out-of-vocabulary) words.
