This experiment compares several ways of building a 4-gram language model (on the Fisher data) to see which one gives the best WER. Somewhat to my surprise, train_lm.sh performed best. train_lm.sh now has an --include-heldout flag: without it, only 75% of the data is used to build the LM, while with the flag all of the data is used, so the new-version LM is slightly bigger because it is built on the full dataset. make_kn_lm.py produces a much bigger LM because it does not remove words with low counts. The exact commands are listed in the table below and written out as a shell sketch after it.
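A quick way to see where the size differences come from is to compare the n-gram counts listed in the \data\ header of each ARPA file. This is only a sketch: the file names are placeholders for whichever LMs you built, and it assumes uncompressed ARPA files.

```bash
# Compare how many n-grams each LM kept; an unpruned LM (e.g. the one from
# make_kn_lm.py) has many more entries. File names below are placeholders.
for lm in lm_train_lm.arpa lm_make_kn.arpa lm_ngram_count.arpa; do
  echo "== $lm =="
  grep -m 4 '^ngram [0-9]=' "$lm"   # the "ngram N=count" lines of a 4-gram ARPA header
done
```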
Type | Command / modification for the 4-gram LM | LM size | HCLG.fst graph size | WER (%) |
---|---|---|---|---|
train_lm.sh, NEW version (Fisher Kaldi recipe) | --include-heldout, plus: awk '{print "foo", $0}' < $file > data/train_all/text; local/fisher_train_lms.sh \|\| exit 1 | 7M | 108M | 28.69 |
train_lm.sh (Fisher Kaldi recipe) | awk '{print "foo", $0}' < $file > data/train_all/text; local/fisher_train_lms.sh \|\| exit 1 | 6.7M | 103M | 28.71 |
train_lm.sh (Fisher Kaldi recipe) | in train_lm.sh, just before the line "if [ -f $subdir/config.$num_configs ]; then", add the line "cp $subdir/config.0 $subdir/config.$num_configs" | 6.7M | 103M | 28.70 |
make_kn_lm.py | python3 make_kn_lm.py -ngram-order 4 -text "data/train_all/text" -lm | 15M | 418M | 30.77 |
ngram-count [-ukndiscount] | corpus=data/train_all/text; ngram-count -order 4 -kn-modify-counts-at-end -ukndiscount -gt1min 0 -gt2min 0 -gt3min 0 -gt4min 0 -text $corpus -lm lm.arpa | 16M | 435M | 30.77 |
ngram-count [-kndiscount -interpolate] | cat $file > data/train_all/text; corpus=data/train_all/text | 20M | 436M | 30.70 |
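For readability, here is the same set of commands written out as a shell sketch rather than squeezed into table cells. The make_kn_lm.py output file name is a placeholder I added (the table does not give one), the table does not show exactly how --include-heldout is passed in the NEW-version run, and the -kndiscount/-interpolate invocation is omitted because the table does not include it.

```bash
# kaldi_lm train_lm.sh via the Fisher recipe: prepend a dummy first field
# ("foo") to each transcript line (see the advice on ids below), then run the
# recipe's LM script, which calls train_lm.sh from kaldi_lm.
# The NEW-version run additionally uses train_lm.sh's --include-heldout option.
awk '{print "foo", $0}' < $file > data/train_all/text   # $file: raw transcript text
local/fisher_train_lms.sh || exit 1

# Kaldi's make_kn_lm.py; the output path 4gram.arpa is a placeholder.
python3 make_kn_lm.py -ngram-order 4 -text data/train_all/text -lm 4gram.arpa

# SRILM ngram-count with original Kneser-Ney discounting (-ukndiscount) and
# no count cutoffs (-gtNmin 0).
corpus=data/train_all/text
ngram-count -order 4 -kn-modify-counts-at-end -ukndiscount \
  -gt1min 0 -gt2min 0 -gt3min 0 -gt4min 0 \
  -text $corpus -lm lm.arpa
```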
Advice: be very careful with the first field of each line. Some scripts expect an utterance id there (that is why "foo" is prepended in the commands above) and some expect plain text. Also, when computing perplexities, both the training text and the test text must be id-free; see the sketch below.
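As a minimal illustration of the id-handling advice, here is a sketch that strips a leading utterance-id field and then computes perplexity with SRILM's ngram tool; the file names are placeholders.

```bash
# Remove the first field (the utterance id) so the test text matches the
# id-free text the LM was built from. Paths are placeholders.
cut -d' ' -f2- data/test/text > test_noids.txt

# Perplexity of the id-free test text under whichever LM is being evaluated.
ngram -order 4 -lm lm.arpa -ppl test_noids.txt
```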
References: https://github.com/danpovey/kaldi_lm/pull/5