Comparing different ways of building LMs

This experiment compares several methods of building language models for the Fisher setup, to see which one works best. Somewhat to my surprise, train_lm.sh performed the best. train_lm.sh now has a flag, --include-heldout. Without the flag, only 75% of the data is used to build the LM; with the flag, all of the data is used, so the new-version LM is slightly bigger because it is estimated on the full dataset. make_kn_lm.py produces a very large LM because it does not remove words with low counts.
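
To make the heldout point concrete, here is a minimal sketch of what the split means for the amount of training data. The file names and the exact 75/25 mechanics below are assumptions based on the description above, not the actual logic inside train_lm.sh.

```bash
# Rough sketch of the heldout split (illustrative only; NOT the real
# train_lm.sh code -- file names and the 75/25 handling are assumptions).
text=data/train_all/text
num_lines=$(wc -l < $text)
num_train=$((num_lines * 3 / 4))                   # ~75% of the data

head -n $num_train $text          > text.train     # used to estimate the LM
tail -n +$((num_train + 1)) $text > text.heldout   # held out (e.g. for tuning)

# With --include-heldout the heldout portion is no longer excluded, so the
# LM is effectively estimated on the full text (hence the slightly larger LM).
cat text.train text.heldout > text.all
```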

| Type | Command (4-gram) | LM size | Graph size (HCLG.fst) | WER (%) |
|------|------------------|---------|-----------------------|---------|
| NEW version: fisher kaldi train_lm.sh --include-heldout | `awk '{print "foo", $0}' < $file > data/train_all/text; local/fisher_train_lms.sh \|\| exit 1;` | 7M | 108M | 28.69 |
| fisher kaldi train_lm.sh | `awk '{print "foo", $0}' < $file > data/train_all/text; local/fisher_train_lms.sh \|\| exit 1;` | 6.7M | 103M | 28.71 |
| fisher kaldi train_lm.sh | just before the line `if [ -f $subdir/config.$num_configs ]; then ...` in train_lm.sh, add the line `cp $subdir/config.0 $subdir/config.$num_configs` | 6.7M | 103M | 28.70 |
| make_kn_lm.py | `python3 make_kn_lm.py -ngram-order 4 -text "data/train_all/text" -lm` | 15M | 418M | 30.77 |
| ngram-count [-ukndiscount] | `corpus=data/train_all/text; ngram-count -order 4 -kn-modify-counts-at-end -ukndiscount -gt1min 0 -gt2min 0 -gt3min 0 -gt4min 0 -text $corpus -lm lm.arpa` | 16M | 435M | 30.77 |
| ngram-count [-kndiscount -interpolate] | `cat $file > data/train_all/text; corpus=data/train_all/text; ngram-count -order 4 -kn-modify-counts-at-end -kndiscount -interpolate -gt1min 0 -gt2min 0 -gt3min 0 -gt4min 0 -text $corpus -lm lm.arpa` | 20M | 436M | 30.70 |
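
For context on the "Graph size (HCLG.fst)" column: each ARPA LM is compiled into G.fst and then into a decoding graph. In a typical Kaldi setup that step looks roughly like the sketch below; the lang and model directory names (data/lang_test, exp/tri5a) are placeholders rather than the ones used in this experiment, and the Fisher recipe has its own wrapper scripts for this.

```bash
# Compile the ARPA LM into G.fst (standard Kaldi tool); directory names
# are placeholders, not taken from the experiment above.
arpa2fst --disambig-symbol='#0' \
         --read-symbol-table=data/lang_test/words.txt \
         lm.arpa data/lang_test/G.fst

# Build the full decoding graph; its size is what the
# "Graph size (HCLG.fst)" column reports.
utils/mkgraph.sh data/lang_test exp/tri5a exp/tri5a/graph
du -h exp/tri5a/graph/HCLG.fst
```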

Advice: be very careful with the first word on each line: some scripts expect an utterance id there and some do not. Also, when computing perplexities, both texts (the LM training text and the evaluation text) must be free of ids.
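
For example, to measure perplexity with SRILM after building one of the LMs above, strip the first column (the dummy "foo" id added by the awk command in the table) from both texts first. A sketch, assuming space-separated ids and a hypothetical data/dev/text evaluation file:

```bash
# Drop the first column (utterance id / dummy "foo" token) from both texts;
# data/dev/text is a hypothetical evaluation set, not from this experiment.
cut -d' ' -f2- data/train_all/text > train.noids
cut -d' ' -f2- data/dev/text       > dev.noids

# SRILM perplexity of the 4-gram ARPA LM on the id-free evaluation text.
ngram -order 4 -lm lm.arpa -ppl dev.noids
```
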
References: https://github.com/danpovey/kaldi_lm/pull/5
