This experiment compares several ways of building a 4-gram language model (on the Fisher data) to see which one gives the best WER. Somewhat to my surprise, train_lm.sh performed best. train_lm.sh now has an --include-heldout flag: without it, only 75% of the data is used to build the LM, while with the flag all of the data is used, so the new-version LM is slightly bigger because it is built on the full dataset. make_kn_lm.py produces a much bigger LM because it does not remove words with low counts. The exact commands are listed in the table below and written out as a shell sketch after it.
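A quick way to see where the size differences come from is to compare the n-gram counts listed in the \data\ header of each ARPA file. This is only a sketch: the file names are placeholders for whichever LMs you built, and it assumes uncompressed ARPA files.

```bash
# Compare how many n-grams each LM kept; an unpruned LM (e.g. the one from
# make_kn_lm.py) has many more entries. File names below are placeholders.
for lm in lm_train_lm.arpa lm_make_kn.arpa lm_ngram_count.arpa; do
  echo "== $lm =="
  grep -m 4 '^ngram [0-9]=' "$lm"   # the "ngram N=count" lines of a 4-gram ARPA header
done
```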
Type | Command / modification for the 4-gram LM | LM size | HCLG.fst graph size | WER (%) |
---|---|---|---|---|
train_lm.sh, NEW version (Fisher Kaldi recipe) | --include-heldout, plus: awk '{print "foo", $0}' < $file > data/train_all/text; local/fisher_train_lms.sh \|\| exit 1 | 7M | 108M | 28.69 |
train_lm.sh (Fisher Kaldi recipe) | awk '{print "foo", $0}' < $file > data/train_all/text; local/fisher_train_lms.sh \|\| exit 1 | 6.7M | 103M | 28.71 |
train_lm.sh (Fisher Kaldi recipe) | in train_lm.sh, just before the line "if [ -f $subdir/config.$num_configs ]; then", add the line "cp $subdir/config.0 $subdir/config.$num_configs" | 6.7M | 103M | 28.70 |
make_kn_lm.py | python3 make_kn_lm.py -ngram-order 4 -text "data/train_all/text" -lm | 15M | 418M | 30.77 |
ngram-count [-ukndiscount] | corpus=data/train_all/text; ngram-count -order 4 -kn-modify-counts-at-end -ukndiscount -gt1min 0 -gt2min 0 -gt3min 0 -gt4min 0 -text $corpus -lm lm.arpa | 16M | 435M | 30.77 |
ngram-count [-kndiscount -interpolate] | cat $file > data/train_all/text; corpus=data/train_all/text | 20M | 436M | 30.70 |
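For readability, here is the same set of commands written out as a shell sketch rather than squeezed into table cells. The make_kn_lm.py output file name is a placeholder I added (the table does not give one), the table does not show exactly how --include-heldout is passed in the NEW-version run, and the -kndiscount/-interpolate invocation is omitted because the table does not include it.

```bash
# kaldi_lm train_lm.sh via the Fisher recipe: prepend a dummy first field
# ("foo") to each transcript line (see the advice on ids below), then run the
# recipe's LM script, which calls train_lm.sh from kaldi_lm.
# The NEW-version run additionally uses train_lm.sh's --include-heldout option.
awk '{print "foo", $0}' < $file > data/train_all/text   # $file: raw transcript text
local/fisher_train_lms.sh || exit 1

# Kaldi's make_kn_lm.py; the output path 4gram.arpa is a placeholder.
python3 make_kn_lm.py -ngram-order 4 -text data/train_all/text -lm 4gram.arpa

# SRILM ngram-count with original Kneser-Ney discounting (-ukndiscount) and
# no count cutoffs (-gtNmin 0).
corpus=data/train_all/text
ngram-count -order 4 -kn-modify-counts-at-end -ukndiscount \
  -gt1min 0 -gt2min 0 -gt3min 0 -gt4min 0 \
  -text $corpus -lm lm.arpa
```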
Advice: be very careful with the first field of each line. Some scripts expect an utterance id there (that is why "foo" is prepended in the commands above) and some expect plain text. Also, when computing perplexities, both the training text and the test text must be id-free; see the sketch below.
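As a minimal illustration of the id-handling advice, here is a sketch that strips a leading utterance-id field and then computes perplexity with SRILM's ngram tool; the file names are placeholders.

```bash
# Remove the first field (the utterance id) so the test text matches the
# id-free text the LM was built from. Paths are placeholders.
cut -d' ' -f2- data/test/text > test_noids.txt

# Perplexity of the id-free test text under whichever LM is being evaluated.
ngram -order 4 -lm lm.arpa -ppl test_noids.txt
```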
References: https://github.com/danpovey/kaldi_lm/pull/5