P6: Find data to decode using our Librispeech ASR model, dict and lm that we downloaded previously.


At this point we have the exp, dict and lm folders from the previous steps.

[ec2-user@ip-172-31-6-113 ~]$ ls
exp kaldi trash
[ec2-user@ip-172-31-6-113 ~]$

It looks like the recipe's local directory contains a script that can download the LibriSpeech dataset for you:

[ec2-user@ip-172-31-6-113 s5]$ tree -L 1 local
local
├── chain
├── data_prep.sh
├── decode_example.sh
├── download_and_untar.sh
├── download_lm.sh
├── format_data.sh
├── format_lms.sh
├── g2p
├── g2p.sh
├── lm
├── lookahead
├── nnet2
├── nnet3
├── online
├── online_pitch
├── prepare_dict.sh
├── prepare_example_data.sh
├── rnnlm
├── run_cleanup_segmentation.sh
├── run_data_cleaning.sh
├── run_nnet2_clean_100.sh
├── run_nnet2_clean_460.sh
├── run_nnet2.sh
├── run_rnnlm.sh
└── score.sh

9 directories, 16 files
[ec2-user@ip-172-31-6-113 s5]$

The file below seems to do it for us:

download_and_untar.sh
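
For reference, here is a minimal sketch of how this script could be invoked. It assumes the usual argument order of the Kaldi LibriSpeech recipe (destination directory, URL base, corpus part). The destination "$HOME/data", the RUN_DOWNLOAD switch and the download_cmd helper are hypothetical names that are not part of this walkthrough. By default the sketch only prints the command.

```shell
# Build the command line for the recipe's download script.
# $1 = destination directory (assumed), $2 = corpus part, e.g. test-clean
download_cmd() {
  echo "local/download_and_untar.sh $1 www.openslr.org/resources/12 $2"
}

cmd=$(download_cmd "$HOME/data" test-clean)
echo "$cmd"

# Execute only when explicitly requested, and only from the recipe directory (s5).
if [ "${RUN_DOWNLOAD:-0}" = 1 ] && [ -x local/download_and_untar.sh ]; then
  mkdir -p "$HOME/data"   # make sure the destination directory exists
  $cmd
fi
```

Set RUN_DOWNLOAD=1 to actually start the download. Check the usage message at the top of the script before relying on the argument order.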

For now, I decided to download the test set manually.

The plan is to decode the files in the test set. We should get good results, since the ASR model was trained on a similar dataset.

https://www.openslr.org/12/


wget https://www.openslr.org/resources/12/test-clean.tar.gz
tar -xvzf test-clean.tar.gz
[ec2-user@ip-172-31-6-113 ~]$ ls
exp kaldi trash
[ec2-user@ip-172-31-6-113 ~]$ wget https://www.openslr.org/resources/12/test-clean.tar.gz
--2021-12-12 21:25:52-- https://www.openslr.org/resources/12/test-clean.tar.gz
Resolving www.openslr.org (www.openslr.org)... 46.101.158.64
Connecting to www.openslr.org (www.openslr.org)|46.101.158.64|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://us.openslr.org/resources/12/test-clean.tar.gz [following]
--2021-12-12 21:25:53-- http://us.openslr.org/resources/12/test-clean.tar.gz
Resolving us.openslr.org (us.openslr.org)... 46.101.158.64
Connecting to us.openslr.org (us.openslr.org)|46.101.158.64|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 346663984 (331M) [application/x-gzip]
Saving to: ‘test-clean.tar.gz’

100%[===============================================================================================================================================>] 346,663,984 8.44MB/s   in 97s    

2021-12-12 21:27:30 (3.40 MB/s) - ‘test-clean.tar.gz’ saved [346663984/346663984]

[ec2-user@ip-172-31-6-113 ~]$
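
The wget log above reports a length of 346663984 bytes. Before extracting, it can be worth confirming that the file on disk has that size and that the gzip stream is intact. The check_size helper below is a hypothetical name, and the sketch assumes the archive sits in the current directory.

```shell
# check_size FILE EXPECTED_BYTES: succeed only if FILE has exactly that many bytes.
check_size() {
  actual=$(wc -c < "$1" | tr -d ' ')
  [ "$actual" = "$2" ]
}

archive="test-clean.tar.gz"
if [ -f "$archive" ]; then
  if check_size "$archive" 346663984 && gzip -t "$archive"; then
    echo "archive looks complete"
  else
    echo "archive is incomplete or corrupt; try resuming with wget -c"
  fi
fi
```

If the check fails, running wget again with -c resumes the partial download instead of starting over.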
[ec2-user@ip-172-31-6-113 ~]$ ls
exp kaldi LibriSpeech trash
[ec2-user@ip-172-31-6-113 ~]$ cd LibriSpeech/
[ec2-user@ip-172-31-6-113 LibriSpeech]$ ls
BOOKS.TXT CHAPTERS.TXT LICENSE.TXT README.TXT SPEAKERS.TXT test-clean
[ec2-user@ip-172-31-6-113 LibriSpeech]$ cd test-clean/
[ec2-user@ip-172-31-6-113 test-clean]$ ls
1089  121   1284  1580  2094  237  2830  3570  3729  4446  4970  5105  5639  61   6829  7021  7176  8224  8455  8555
1188  1221  1320  1995  2300  260  2961  3575  4077  4507  4992  5142  5683  672  6930  7127  7729  8230  8463  908
[ec2-user@ip-172-31-6-113 test-clean]$
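
The listing above shows 40 speaker directories. A quick way to summarize the layout is to count directories per level and the .flac files below them. This is a sketch that assumes the speaker/chapter/utterance layout and a find that supports -mindepth and -maxdepth (GNU and BSD find do). summarize_corpus is a hypothetical helper name.

```shell
# summarize_corpus ROOT: count speakers (depth 1), chapters (depth 2) and .flac files.
summarize_corpus() (
  root="$1"
  speakers=$(find "$root" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' ')
  chapters=$(find "$root" -mindepth 2 -maxdepth 2 -type d | wc -l | tr -d ' ')
  utts=$(find "$root" -type f -name '*.flac' | wc -l | tr -d ' ')
  echo "speakers=$speakers chapters=$chapters utterances=$utts"
)

# Run from inside the test-clean directory, as in the transcript above.
if [ -d 1089 ]; then
  summarize_corpus .
fi
```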
[ec2-user@ip-172-31-6-113 1089]$ tree 134686 
134686
├── 1089-134686-0000.flac
├── 1089-134686-0001.flac
├── 1089-134686-0002.flac
├── 1089-134686-0003.flac
├── 1089-134686-0004.flac
├── 1089-134686-0005.flac
├── 1089-134686-0006.flac
├── 1089-134686-0007.flac
├── 1089-134686-0008.flac
├── 1089-134686-0009.flac
├── 1089-134686-0010.flac
├── 1089-134686-0011.flac
├── 1089-134686-0012.flac
├── 1089-134686-0013.flac
├── 1089-134686-0014.flac
├── 1089-134686-0015.flac
├── 1089-134686-0016.flac
├── 1089-134686-0017.flac
├── 1089-134686-0018.flac
├── 1089-134686-0019.flac
├── 1089-134686-0020.flac
├── 1089-134686-0021.flac
├── 1089-134686-0022.flac
├── 1089-134686-0023.flac
├── 1089-134686-0024.flac
├── 1089-134686-0025.flac
├── 1089-134686-0026.flac
├── 1089-134686-0027.flac
├── 1089-134686-0028.flac
├── 1089-134686-0029.flac
├── 1089-134686-0030.flac
├── 1089-134686-0031.flac
├── 1089-134686-0032.flac
├── 1089-134686-0033.flac
├── 1089-134686-0034.flac
├── 1089-134686-0035.flac
├── 1089-134686-0036.flac
├── 1089-134686-0037.flac
└── 1089-134686.trans.txt

0 directories, 39 files
[ec2-user@ip-172-31-6-113 1089]$
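
Each chapter directory holds the .flac utterances plus one .trans.txt file, in which every line starts with an utterance ID followed by its transcript. Kaldi decoding needs a wav.scp (utterance ID to audio command) and, for scoring, a text file. The recipe's data_prep.sh normally produces these; the sketch below only illustrates the idea. make_kaldi_lists and the output directory name are hypothetical, paths are assumed to contain no spaces, and the flac command-line tool is assumed to be installed when the lists are used.

```shell
# make_kaldi_lists SRC OUT: write OUT/wav.scp and OUT/text for all utterances under SRC.
make_kaldi_lists() (
  src="$1"
  out="$2"
  mkdir -p "$out"
  # wav.scp: "<utt-id> flac -c -d -s <path> |" decodes FLAC to WAV on stdout.
  find "$src" -type f -name '*.flac' | while IFS= read -r f; do
    id=$(basename "$f" .flac)
    echo "$id flac -c -d -s $f |"
  done | LC_ALL=C sort -k1,1 > "$out/wav.scp"
  # text: concatenate all transcript files; lines are already "<utt-id> <words>".
  find "$src" -type f -name '*.trans.txt' -exec cat {} + | LC_ALL=C sort -k1,1 > "$out/text"
)

# Example call from the home directory used in this walkthrough.
if [ -d LibriSpeech/test-clean ]; then
  make_kaldi_lists "$PWD/LibriSpeech/test-clean" data_test_clean_sketch
fi
```

Sorting with LC_ALL=C matters because Kaldi tools expect the lists in C-locale order of the utterance IDs.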
