Each sentence must go on its own line, because the line boundaries affect the end-of-sentence (EOS) probability.
Many tools can split text into one sentence per line.
I tried the BlingFire tokenizer and liked it a lot.
I found a tutorial here: https://towardsdatascience.com/pre-processing-a-wikipedia-dump-for-nlp-model-training-a-write-up-3b9176fdf67
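To make the idea concrete, here is a minimal sketch of splitting raw text into one sentence per line. It assumes the `blingfire` Python package is installed and uses its `text_to_sentences` function, which returns the sentences separated by newlines. The helper name `one_sentence_per_line` and the regex fallback are my own illustration and are not part of BlingFire; the fallback is much cruder than the real tokenizer.

```python
import re

try:
    # BlingFire returns the input text with sentences separated by "\n".
    from blingfire import text_to_sentences
except ImportError:
    # Fallback for illustration only (an assumption, not BlingFire's algorithm):
    # split on whitespace that follows '.', '!' or '?'.
    def text_to_sentences(text):
        return "\n".join(re.split(r"(?<=[.!?])\s+", text.strip()))


def one_sentence_per_line(text):
    """Return the sentences of `text` as a list, one entry per sentence."""
    # Collapse existing line breaks so wrapped lines do not cut sentences in half.
    flat = " ".join(text.split())
    # Drop empty entries, which appear when the input is empty.
    return [s.strip() for s in text_to_sentences(flat).split("\n") if s.strip()]


if __name__ == "__main__":
    sample = "I went home. It was late!\nWas anyone there?"
    for sentence in one_sentence_per_line(sample):
        print(sentence)
```

Writing each returned sentence on its own line of the output file gives the one-sentence-per-line layout described above.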