We have to decide what to do with unknown (out-of-vocabulary, OOV) words.
Many language model toolkits, such as SRILM, can automatically map unknown words to <UNK>.
If the chosen tool doesn't do this, we have to do it ourselves during data preparation.
Example: we have corpus.txt, and our lexicon contains only one word, HI.
During data preparation we need to replace every word that is not in our vocabulary with <UNK>.
Before: HI JACK
After: HI <UNK>
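
The replacement step could be sketched in Python roughly as follows. This is only a minimal sketch under a few assumptions: words are separated by whitespace, lexicon matching is exact and case-sensitive, and the unknown token is passed in as a parameter (set to <UNK> here). The names replace_oov and prepare_corpus are made up for illustration and do not come from any toolkit.

```python
def replace_oov(line, lexicon, unk_token="<UNK>"):
    """Replace every whitespace-separated word not in `lexicon` with `unk_token`."""
    return " ".join(w if w in lexicon else unk_token for w in line.split())


def prepare_corpus(in_path, out_path, lexicon, unk_token="<UNK>"):
    """Write a copy of the corpus at `in_path` with OOV words replaced, line by line."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(replace_oov(line, lexicon, unk_token) + "\n")


# The lexicon from the example above: a single word, HI.
lexicon = {"HI"}
print(replace_oov("HI JACK", lexicon))  # prints "HI <UNK>"
```

For a whole file, something like prepare_corpus("corpus.txt", "corpus.unk.txt", lexicon) would apply the same replacement to each line; the output filename is just a placeholder.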
Still to do: investigate why this replacement is needed.
Some experiments should help here.
In particular, it would be interesting to see what happens if we keep the OOV words instead of replacing them.