Is LibriSpeech a GMM-based model or a neural net model?

I am trying to understand the difference between the GMM-based system and the nnet (neural net) parts of the recipe.

All the recipes start with a GMM-based system and train the nnet parts later.

The main job of the GMM system is to produce good alignments for the later stages. With Kaldi's default 10 ms frame shift we get roughly 100 feature frames per second of audio. An alignment tells us, for each frame, which phone was active. Kaldi stores this as one transition-id per frame, and each transition-id maps back to a phone.
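To make the frame arithmetic concrete, here is a minimal sketch. It assumes 16 kHz audio, Kaldi's default 25 ms window and 10 ms shift, and that only complete windows are kept. The phone labels in `toy_alignment` are made up for illustration and are not taken from a real alignment.

```python
from itertools import groupby

SAMPLE_RATE = 16000      # Hz; LibriSpeech audio is sampled at 16 kHz
FRAME_LENGTH_MS = 25     # Kaldi default analysis window
FRAME_SHIFT_MS = 10      # Kaldi default frame shift


def num_frames(num_samples, sample_rate=SAMPLE_RATE):
    """Number of frames when only complete windows are kept."""
    window = sample_rate * FRAME_LENGTH_MS // 1000
    shift = sample_rate * FRAME_SHIFT_MS // 1000
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // shift


def frames_to_segments(frame_phones, shift_s=FRAME_SHIFT_MS / 1000):
    """Collapse per-frame phone labels into (phone, start_sec, end_sec)."""
    segments = []
    start = 0
    for phone, run in groupby(frame_phones):
        length = len(list(run))
        end = start + length
        segments.append((phone, round(start * shift_s, 2), round(end * shift_s, 2)))
        start = end
    return segments


# Hypothetical alignment: one phone label per 10 ms frame.
toy_alignment = ["SIL"] * 3 + ["HH"] * 4 + ["AY"] * 6 + ["SIL"] * 2

print(num_frames(SAMPLE_RATE))            # frames in one second of audio
for segment in frames_to_segments(toy_alignment):
    print(segment)
```

Under these assumptions one second of audio gives 98 complete frames, which is where the "about 100 frames per second" figure comes from.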

LibriSpeech ASR training starts by training a sequence of GMM systems: mono, tri1, tri2b and so on. The stages differ in the feature transforms they use and in which earlier alignments they were trained from.

ubuntu@ip-172-31-6-144:~/kaldi/egs/librispeech/s5/exp$ ls
chain_cleaned mono         nnet3_cleaned tri1_ali_10k tri2b_ali_10k tri3b_ali_clean_100 tri4b_ali_clean_460 tri5b_ali_960 tri6b_ali_cleaned tri6b_cleaned_ali_train_960_cleaned_sp
make_mfcc     mono_ali_5k tri1           tri2b         tri3b         tri4b               tri5b               tri6b         tri6b_cleaned     tri6b_cleaned_work
ubuntu@ip-172-31-6-144:~/kaldi/egs/librispeech/s5/exp$ cd tri6b_cleaned
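The directory names in the listing above already show how the stages hand over to each other. The sketch below assumes that a directory named `<model>_ali_<data>` holds alignments produced by `<model>` on the data set labelled `<data>`, which is the usual Kaldi naming convention. It only parses the names and does not inspect any files.

```python
# Directory names copied from the `ls exp` listing in this post.
EXP_DIRS = [
    "mono", "mono_ali_5k",
    "tri1", "tri1_ali_10k",
    "tri2b", "tri2b_ali_10k",
    "tri3b", "tri3b_ali_clean_100",
    "tri4b", "tri4b_ali_clean_460",
    "tri5b", "tri5b_ali_960",
    "tri6b", "tri6b_ali_cleaned",
    "tri6b_cleaned", "tri6b_cleaned_ali_train_960_cleaned_sp",
]


def alignment_dirs(dirs):
    """Return (model, data_label) for each `<model>_ali_<data>` directory."""
    pairs = []
    for name in dirs:
        model, sep, data = name.partition("_ali_")
        if sep:  # only alignment directories contain "_ali_"
            pairs.append((model, data))
    return pairs


for model, data in alignment_dirs(EXP_DIRS):
    print(f"{model:>14} -> alignments labelled {data}")
```

Read top to bottom, each model aligns a data set, and that alignment is what the next stage trains on.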

At the end, the recipe produces the final GMM system, stored as final.mdl in exp/tri6b_cleaned.

ubuntu@ip-172-31-6-144:~/kaldi/egs/librispeech/s5/exp/tri6b_cleaned$ ls
35.alimdl   ali.29.gz ali.51.gz ali.74.gz ali.97.gz                     final.mat   fsts.29.gz fsts.51.gz fsts.74.gz fsts.97.gz     trans.21 trans.44 trans.67 trans.9
35.mdl     ali.3.gz   ali.52.gz ali.75.gz ali.98.gz                     final.mdl   fsts.3.gz   fsts.52.gz fsts.75.gz fsts.98.gz     trans.22 trans.45 trans.68 trans.90
35.occs     ali.30.gz ali.53.gz ali.76.gz ali.99.gz                     final.occs   fsts.30.gz fsts.53.gz fsts.76.gz fsts.99.gz     trans.23 trans.46 trans.69 trans.91
ali.1.gz   ali.31.gz ali.54.gz ali.77.gz cmvn_opts                     fsts.1.gz   fsts.31.gz fsts.54.gz fsts.77.gz full.mat       trans.24 trans.47 trans.7   trans.92
ali.10.gz   ali.32.gz ali.55.gz ali.78.gz decode_dev_clean_fglarge     fsts.10.gz   fsts.32.gz fsts.55.gz fsts.78.gz graph_tgsmall trans.25 trans.48 trans.70 trans.93
ali.100.gz ali.33.gz ali.56.gz ali.79.gz decode_dev_clean_tglarge     fsts.100.gz fsts.33.gz fsts.56.gz fsts.79.gz log           trans.26 trans.49 trans.71 trans.94
ali.11.gz   ali.34.gz ali.57.gz ali.8.gz   decode_dev_clean_tgmed       fsts.11.gz   fsts.34.gz fsts.57.gz fsts.8.gz   num_jobs       trans.27 trans.5   trans.72 trans.95
ali.12.gz   ali.35.gz ali.58.gz ali.80.gz decode_dev_clean_tgsmall     fsts.12.gz   fsts.35.gz fsts.58.gz fsts.80.gz phones.txt     trans.28 trans.50 trans.73 trans.96
ali.13.gz   ali.36.gz ali.59.gz ali.81.gz decode_dev_clean_tgsmall.si   fsts.13.gz   fsts.36.gz fsts.59.gz fsts.81.gz questions.int trans.29 trans.51 trans.74 trans.97
ali.14.gz   ali.37.gz ali.6.gz   ali.82.gz decode_dev_other_fglarge     fsts.14.gz   fsts.37.gz fsts.6.gz   fsts.82.gz questions.qst trans.3   trans.52 trans.75 trans.98
ali.15.gz   ali.38.gz ali.60.gz ali.83.gz decode_dev_other_tglarge     fsts.15.gz   fsts.38.gz fsts.60.gz fsts.83.gz splice_opts   trans.30 trans.53 trans.76 trans.99
ali.16.gz   ali.39.gz ali.61.gz ali.84.gz decode_dev_other_tgmed       fsts.16.gz   fsts.39.gz fsts.61.gz fsts.84.gz trans.1       trans.31 trans.54 trans.77 tree
ali.17.gz   ali.4.gz   ali.62.gz ali.85.gz decode_dev_other_tgsmall     fsts.17.gz   fsts.4.gz   fsts.62.gz fsts.85.gz trans.10       trans.32 trans.55 trans.78
ali.18.gz   ali.40.gz ali.63.gz ali.86.gz decode_dev_other_tgsmall.si   fsts.18.gz   fsts.40.gz fsts.63.gz fsts.86.gz trans.100     trans.33 trans.56 trans.79
ali.19.gz   ali.41.gz ali.64.gz ali.87.gz decode_test_clean_fglarge     fsts.19.gz   fsts.41.gz fsts.64.gz fsts.87.gz trans.11       trans.34 trans.57 trans.8
ali.2.gz   ali.42.gz ali.65.gz ali.88.gz decode_test_clean_tglarge     fsts.2.gz   fsts.42.gz fsts.65.gz fsts.88.gz trans.12       trans.35 trans.58 trans.80
ali.20.gz   ali.43.gz ali.66.gz ali.89.gz decode_test_clean_tgmed       fsts.20.gz   fsts.43.gz fsts.66.gz fsts.89.gz trans.13       trans.36 trans.59 trans.81
ali.21.gz   ali.44.gz ali.67.gz ali.9.gz   decode_test_clean_tgsmall     fsts.21.gz   fsts.44.gz fsts.67.gz fsts.9.gz   trans.14       trans.37 trans.6   trans.82
ali.22.gz   ali.45.gz ali.68.gz ali.90.gz decode_test_clean_tgsmall.si fsts.22.gz   fsts.45.gz fsts.68.gz fsts.90.gz trans.15       trans.38 trans.60 trans.83
ali.23.gz   ali.46.gz ali.69.gz ali.91.gz decode_test_other_fglarge     fsts.23.gz   fsts.46.gz fsts.69.gz fsts.91.gz trans.16       trans.39 trans.61 trans.84
ali.24.gz   ali.47.gz ali.7.gz   ali.92.gz decode_test_other_tglarge     fsts.24.gz   fsts.47.gz fsts.7.gz   fsts.92.gz trans.17       trans.4   trans.62 trans.85
ali.25.gz   ali.48.gz ali.70.gz ali.93.gz decode_test_other_tgmed       fsts.25.gz   fsts.48.gz fsts.70.gz fsts.93.gz trans.18       trans.40 trans.63 trans.86
ali.26.gz   ali.49.gz ali.71.gz ali.94.gz decode_test_other_tgsmall     fsts.26.gz   fsts.49.gz fsts.71.gz fsts.94.gz trans.19       trans.41 trans.64 trans.87
ali.27.gz   ali.5.gz   ali.72.gz ali.95.gz decode_test_other_tgsmall.si fsts.27.gz   fsts.5.gz   fsts.72.gz fsts.95.gz trans.2       trans.42 trans.65 trans.88
ali.28.gz   ali.50.gz ali.73.gz ali.96.gz final.alimdl                 fsts.28.gz   fsts.50.gz fsts.73.gz fsts.96.gz trans.20       trans.43 trans.66 trans.89
ubuntu@ip-172-31-6-144:~/kaldi/egs/librispeech/s5/exp/tri6b_cleaned$
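The ali.N.gz files in this listing are the per-job alignment archives. One way to look inside them is Kaldi's `ali-to-phones` tool, which turns an alignment into phone ids using final.mdl. The sketch below only builds the command line. The helper name `ali_to_phones_cmd` is made up for this post, and running the command assumes the Kaldi binaries are on your PATH.

```python
import shlex


def ali_to_phones_cmd(model_dir, job=1, per_frame=True):
    """Build the argv for ali-to-phones on one job's alignment archive."""
    # Kaldi rspecifier that decompresses the archive through a pipe.
    rspecifier = f"ark:gunzip -c {model_dir}/ali.{job}.gz|"
    return [
        "ali-to-phones",
        f"--per-frame={'true' if per_frame else 'false'}",
        f"{model_dir}/final.mdl",
        rspecifier,
        "ark,t:-",  # text output on stdout: utterance id, then phone ids
    ]


cmd = ali_to_phones_cmd("exp/tri6b_cleaned", job=1)
print(shlex.join(cmd))

# To actually run it (needs Kaldi installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

The output contains integer phone ids. The phones.txt file in the same directory maps them back to phone symbols.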

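The final GMM system (final.mdl) is the starting point for training the neural net model, which can be found in exp/chain_cleaned/tdnn_1d_sp. So the recipe contains both kinds of model: the GMM system is trained first, and the neural net is trained on top of it.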

Currently, Kaldi's neural net training is implemented in C++ and CUDA. In the future Kaldi is expected to use PyTorch, and the new version is expected to include production tools for real-time speech recognition on a server.
