P4: Librispeech: download the ASR model, the i-vector extractor and 8 other files needed for the language models (LMs).

Step 1: Download the Librispeech ASR model and i-vector extractor to your computer

https://kaldi-asr.org/models/m13

This step was covered in the previous blog post.

Step 2: Download 8 files using the download_lm.sh script.

[ec2-user@ip-172-31-6-113 local]$ ls
chain             download_and_untar.sh format_lms.sh lm         nnet3         prepare_dict.sh         run_cleanup_segmentation.sh run_nnet2_clean_460.sh score.sh
data_prep.sh       download_lm.sh         g2p           lookahead online       prepare_example_data.sh run_data_cleaning.sh         run_nnet2.sh
decode_example.sh format_data.sh         g2p.sh         nnet2     online_pitch rnnlm                   run_nnet2_clean_100.sh       run_rnnlm.sh
[ec2-user@ip-172-31-6-113 local]$ nano download_lm.sh

This script actually does everything below for us. I discovered it only after doing the steps manually.

download_lm.sh

https://github.com/kaldi-asr/kaldi/blob/master/egs/librispeech/s5/local/download_lm.sh

For the script to work, the paths to the folders it needs must be correct. In Kaldi it is a good rule to run all scripts from the s5 folder, since paths passed to them (such as data/local/lm) are relative to it.

./local/download_lm.sh http://www.openslr.org/resources/11 data/local/lm
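
Since everything depends on being in the s5 folder, a small guard before running the script can save a confusing error. This is only a sketch: is_recipe_root is a hypothetical helper name of mine, and it simply assumes that the recipe root is the folder containing path.sh and local/download_lm.sh.

```shell
# Hypothetical helper (not part of Kaldi): succeed only if the given
# directory contains the files we expect in the recipe root (s5).
is_recipe_root() {
    [ -f "$1/path.sh" ] && [ -f "$1/local/download_lm.sh" ]
}

if is_recipe_root .; then
    echo "OK: running from the recipe root"
    # The actual download command would go here:
    # ./local/download_lm.sh http://www.openslr.org/resources/11 data/local/lm
else
    echo "Not in s5: cd to kaldi/egs/librispeech/s5 first" >&2
fi
```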

We can see below that we got our language models:

[ec2-user@ip-172-31-6-113 local]$ tree lm
lm
├── 3-gram.arpa.gz
├── 3-gram.pruned.1e-7.arpa.gz
├── 3-gram.pruned.3e-7.arpa.gz
├── 4-gram.arpa.gz
├── g2p-model-5
├── librispeech-lexicon.txt
├── librispeech-lm-corpus.tgz
├── librispeech-vocab.txt
├── lm_fglarge.arpa.gz -> 4-gram.arpa.gz
├── lm_tglarge.arpa.gz -> 3-gram.arpa.gz
├── lm_tgmed.arpa.gz -> 3-gram.pruned.1e-7.arpa.gz
└── lm_tgsmall.arpa.gz -> 3-gram.pruned.3e-7.arpa.gz

0 directories, 12 files
[ec2-user@ip-172-31-6-113 local]$
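
The tree output above shows four lm_*.arpa.gz names that are symlinks to the downloaded files. A quick way to confirm that none of them is dangling might look like the sketch below. check_lm_dir is a hypothetical helper of mine; the four names are taken from the listing above.

```shell
# Hypothetical helper: report any expected LM name that is missing
# or that is a symlink pointing at a file that does not exist.
check_lm_dir() {
    lm_dir=$1
    status=0
    for name in lm_tgsmall.arpa.gz lm_tgmed.arpa.gz lm_tglarge.arpa.gz lm_fglarge.arpa.gz; do
        # -e follows symlinks, so a dangling link fails this test
        if [ -e "$lm_dir/$name" ]; then
            echo "ok       $name"
        else
            echo "MISSING  $name"
            status=1
        fi
    done
    return $status
}

# Only runs if the folder exists (run from the s5 folder).
if [ -d data/local/lm ]; then
    check_lm_dir data/local/lm
fi
```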

Script output while downloading the 8 files:

[ec2-user@ip-172-31-6-113 s5]$ ./local/download_lm.sh  
Usage: ./local/download_lm.sh <base-url> <download_dir>
e.g.: ./local/download_lm.sh http://www.openslr.org/resources/11 data/local/lm
[ec2-user@ip-172-31-6-113 s5]$ ./local/download_lm.sh http://www.openslr.org/resources/11 data/local/lm
Downloading file '3-gram.arpa.gz' into 'data/local/lm'...
--2021-12-12 22:48:20-- http://www.openslr.org/resources/11/3-gram.arpa.gz
Resolving www.openslr.org (www.openslr.org)... 46.101.158.64
Connecting to www.openslr.org (www.openslr.org)|46.101.158.64|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://us.openslr.org/resources/11/3-gram.arpa.gz [following]
--2021-12-12 22:48:20-- https://us.openslr.org/resources/11/3-gram.arpa.gz
Resolving us.openslr.org (us.openslr.org)... 46.101.158.64
Connecting to us.openslr.org (us.openslr.org)|46.101.158.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 759636181 (724M) [application/x-gzip]
Saving to: ‘data/local/lm/3-gram.arpa.gz’

100%[===============================================================================================================================================>] 759,636,181 2.87MB/s   in 3m 53s

2021-12-12 22:52:14 (3.11 MB/s) - ‘data/local/lm/3-gram.arpa.gz’ saved [759636181/759636181]

Downloading file '3-gram.pruned.1e-7.arpa.gz' into 'data/local/lm'...
--2021-12-12 22:52:14-- http://www.openslr.org/resources/11/3-gram.pruned.1e-7.arpa.gz
Resolving www.openslr.org (www.openslr.org)... 46.101.158.64
Connecting to www.openslr.org (www.openslr.org)|46.101.158.64|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://us.openslr.org/resources/11/3-gram.pruned.1e-7.arpa.gz [following]
--2021-12-12 22:52:14-- https://us.openslr.org/resources/11/3-gram.pruned.1e-7.arpa.gz
Resolving us.openslr.org (us.openslr.org)... 46.101.158.64
Connecting to us.openslr.org (us.openslr.org)|46.101.158.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 34094057 (33M) [application/x-gzip]
Saving to: ‘data/local/lm/3-gram.pruned.1e-7.arpa.gz’

100%[===============================================================================================================================================>] 34,094,057  2.61MB/s   in 15s    

2021-12-12 22:52:29 (2.24 MB/s) - ‘data/local/lm/3-gram.pruned.1e-7.arpa.gz’ saved [34094057/34094057]

Downloading file '3-gram.pruned.3e-7.arpa.gz' into 'data/local/lm'...
--2021-12-12 22:52:30-- http://www.openslr.org/resources/11/3-gram.pruned.3e-7.arpa.gz
Resolving www.openslr.org (www.openslr.org)... 46.101.158.64
Connecting to www.openslr.org (www.openslr.org)|46.101.158.64|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://us.openslr.org/resources/11/3-gram.pruned.3e-7.arpa.gz [following]
--2021-12-12 22:52:30-- https://us.openslr.org/resources/11/3-gram.pruned.3e-7.arpa.gz
Resolving us.openslr.org (us.openslr.org)... 46.101.158.64
Connecting to us.openslr.org (us.openslr.org)|46.101.158.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13654242 (13M) [application/x-gzip]
Saving to: ‘data/local/lm/3-gram.pruned.3e-7.arpa.gz’

100%[===============================================================================================================================================>] 13,654,242  1.51MB/s   in 16s    

2021-12-12 22:52:46 (851 KB/s) - ‘data/local/lm/3-gram.pruned.3e-7.arpa.gz’ saved [13654242/13654242]

Downloading file '4-gram.arpa.gz' into 'data/local/lm'...
--2021-12-12 22:52:46-- http://www.openslr.org/resources/11/4-gram.arpa.gz
Resolving www.openslr.org (www.openslr.org)... 46.101.158.64
Connecting to www.openslr.org (www.openslr.org)|46.101.158.64|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://us.openslr.org/resources/11/4-gram.arpa.gz [following]
--2021-12-12 22:52:47-- https://us.openslr.org/resources/11/4-gram.arpa.gz
Resolving us.openslr.org (us.openslr.org)... 46.101.158.64
Connecting to us.openslr.org (us.openslr.org)|46.101.158.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1355172078 (1.3G) [application/x-gzip]
Saving to: ‘data/local/lm/4-gram.arpa.gz’

100%[=============================================================================================================================================>] 1,355,172,078 2.33MB/s   in 6m 24s

2021-12-12 22:59:12 (3.36 MB/s) - ‘data/local/lm/4-gram.arpa.gz’ saved [1355172078/1355172078]

Downloading file 'g2p-model-5' into 'data/local/lm'...
--2021-12-12 22:59:12-- http://www.openslr.org/resources/11/g2p-model-5
Resolving www.openslr.org (www.openslr.org)... 46.101.158.64
Connecting to www.openslr.org (www.openslr.org)|46.101.158.64|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://us.openslr.org/resources/11/g2p-model-5 [following]
--2021-12-12 22:59:12-- https://us.openslr.org/resources/11/g2p-model-5
Resolving us.openslr.org (us.openslr.org)... 46.101.158.64
Connecting to us.openslr.org (us.openslr.org)|46.101.158.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20098243 (19M)
Saving to: ‘data/local/lm/g2p-model-5’

100%[===============================================================================================================================================>] 20,098,243  3.01MB/s   in 9.2s  

2021-12-12 22:59:22 (2.09 MB/s) - ‘data/local/lm/g2p-model-5’ saved [20098243/20098243]

Downloading file 'librispeech-lm-corpus.tgz' into 'data/local/lm'...
--2021-12-12 22:59:22-- http://www.openslr.org/resources/11/librispeech-lm-corpus.tgz
Resolving www.openslr.org (www.openslr.org)... 46.101.158.64
Connecting to www.openslr.org (www.openslr.org)|46.101.158.64|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://us.openslr.org/resources/11/librispeech-lm-corpus.tgz [following]
--2021-12-12 22:59:22-- https://us.openslr.org/resources/11/librispeech-lm-corpus.tgz
Resolving us.openslr.org (us.openslr.org)... 46.101.158.64
Connecting to us.openslr.org (us.openslr.org)|46.101.158.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1803499244 (1.7G) [application/x-gzip]
Saving to: ‘data/local/lm/librispeech-lm-corpus.tgz’

100%[=============================================================================================================================================>] 1,803,499,244 3.00MB/s   in 8m 55s

2021-12-12 23:08:18 (3.22 MB/s) - ‘data/local/lm/librispeech-lm-corpus.tgz’ saved [1803499244/1803499244]

Downloading file 'librispeech-vocab.txt' into 'data/local/lm'...
--2021-12-12 23:08:18-- http://www.openslr.org/resources/11/librispeech-vocab.txt
Resolving www.openslr.org (www.openslr.org)... 46.101.158.64
Connecting to www.openslr.org (www.openslr.org)|46.101.158.64|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://us.openslr.org/resources/11/librispeech-vocab.txt [following]
--2021-12-12 23:08:18-- https://us.openslr.org/resources/11/librispeech-vocab.txt
Resolving us.openslr.org (us.openslr.org)... 46.101.158.64
Connecting to us.openslr.org (us.openslr.org)|46.101.158.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1737588 (1.7M) [text/plain]
Saving to: ‘data/local/lm/librispeech-vocab.txt’

100%[===============================================================================================================================================>] 1,737,588   1.10MB/s   in 1.5s  

2021-12-12 23:08:20 (1.10 MB/s) - ‘data/local/lm/librispeech-vocab.txt’ saved [1737588/1737588]

Downloading file 'librispeech-lexicon.txt' into 'data/local/lm'...
--2021-12-12 23:08:20-- http://www.openslr.org/resources/11/librispeech-lexicon.txt
Resolving www.openslr.org (www.openslr.org)... 46.101.158.64
Connecting to www.openslr.org (www.openslr.org)|46.101.158.64|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://us.openslr.org/resources/11/librispeech-lexicon.txt [following]
--2021-12-12 23:08:21-- https://us.openslr.org/resources/11/librispeech-lexicon.txt
Resolving us.openslr.org (us.openslr.org)... 46.101.158.64
Connecting to us.openslr.org (us.openslr.org)|46.101.158.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5627653 (5.4M) [text/plain]
Saving to: ‘data/local/lm/librispeech-lexicon.txt’

100%[===============================================================================================================================================>] 5,627,653   1.50MB/s   in 4.2s  

2021-12-12 23:08:25 (1.28 MB/s) - ‘data/local/lm/librispeech-lexicon.txt’ saved [5627653/5627653]

[ec2-user@ip-172-31-6-113 s5]$
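
Some of these downloads are large and take several minutes, so it can be worth checking that a file arrived complete. One simple check is to compare the size on disk with the "Length:" value wget printed. The sketch below assumes the byte count from my log above (1737588 for librispeech-vocab.txt); the files on the server may have changed since then. size_matches is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the file's size in bytes equals the expected value.
size_matches() {
    # wc -c pads its output with spaces on some systems, so strip them
    actual=$(wc -c < "$1" | tr -d ' ')
    [ "$actual" -eq "$2" ]
}

# Expected size copied from the "Length:" line in the wget log above.
f=data/local/lm/librispeech-vocab.txt
if [ -f "$f" ]; then
    if size_matches "$f" 1737588; then
        echo "size ok: $f"
    else
        echo "size mismatch: $f" >&2
    fi
fi
```

For the .gz files, gzip -t <file> is another option, since it tests the integrity of the compressed data without extracting it.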




STOP STOP STOP STOP STOP STOP STOP STOP STOP STOP STOP STOP STOP STOP


-------------------------------------------------------------------------------------------------------------

Ignore the steps below, as the script above already does them.

Navigate to the librispeech folder inside the kaldi folder: /home/ec2-user/kaldi/egs/librispeech/s5

[ec2-user@ip-172-31-6-113 s5]$ ls
cmd.sh conf local path.sh RESULTS rnnlm run.sh steps utils
[ec2-user@ip-172-31-6-113 s5]$

I will download the language models that others have already created, which will help us decode.

They are located here: http://www.openslr.org/11/

Download all 9 files.

wget https://www.openslr.org/resources/11/librispeech-lm-corpus.tgz
wget https://us.openslr.org/resources/11/librispeech-lm-norm.txt.gz
wget https://us.openslr.org/resources/11/librispeech-vocab.txt
wget https://us.openslr.org/resources/11/librispeech-lexicon.txt
wget https://us.openslr.org/resources/11/3-gram.arpa.gz
wget https://us.openslr.org/resources/11/3-gram.pruned.1e-7.arpa.gz
wget https://us.openslr.org/resources/11/3-gram.pruned.3e-7.arpa.gz
wget https://us.openslr.org/resources/11/4-gram.arpa.gz
wget https://www.openslr.org/resources/11/g2p-model-5

Unzip these 2 files

tar -xvzf librispeech-lm-corpus.tgz
gzip -d librispeech-lm-norm.txt.gz
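
If you only want a look inside a .gz file, you do not have to decompress it on disk first. The sketch below streams the first few lines to the terminal and leaves the archive untouched. peek_gz is a hypothetical helper name; gzip -dc and head are standard tools.

```shell
# Hypothetical helper: print the first N lines of a gzip-compressed text file
# without writing the decompressed data to disk.
peek_gz() {
    gzip -dc "$1" | head -n "$2"
}

# Only runs if the file is present in the current folder.
if [ -f librispeech-lm-norm.txt.gz ]; then
    peek_gz librispeech-lm-norm.txt.gz 5
fi
```

The same helper works on the *.arpa.gz language models, whose decompressed size is much larger than the download.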

Now we have all the files:

[ec2-user@ip-172-31-6-113 s5]$ ls
3-gram.arpa.gz              3-gram.pruned.3e-7.arpa.gz cmd.sh g2p-model-5   librispeech-lexicon.txt librispeech-lm-norm.txt local   RESULTS run.sh utils
3-gram.pruned.1e-7.arpa.gz  4-gram.arpa.gz             conf   g2p-model-5.1 librispeech-lm-corpus   librispeech-vocab.txt   path.sh rnnlm   steps
[ec2-user@ip-172-31-6-113 s5]$ ls

We must put them into specific folders, because all Kaldi scripts assume the data is in the correct folders.

[ec2-user@ip-172-31-6-113 s5]$ mkdir -p data/local
[ec2-user@ip-172-31-6-113 s5]$ cd data/local/
[ec2-user@ip-172-31-6-113 local]$ mkdir dict
[ec2-user@ip-172-31-6-113 local]$ mkdir lm
[ec2-user@ip-172-31-6-113 s5]$ mv 3-gram.arpa.gz data/local/lm/
[ec2-user@ip-172-31-6-113 s5]$ mv 3-gram.pruned.1e-7.arpa.gz data/local/lm/
[ec2-user@ip-172-31-6-113 s5]$ mv 3-gram.pruned.3e-7.arpa.gz data/local/lm/
[ec2-user@ip-172-31-6-113 s5]$ mv 4-gram.arpa.gz data/local/lm
[ec2-user@ip-172-31-6-113 s5]$ mv librispeech-lexicon.txt data/local/dict/
[ec2-user@ip-172-31-6-113 s5]$ mv librispeech-vocab.txt data/local/dict/
[ec2-user@ip-172-31-6-113 s5]$ mv g2p-model-5.1 data/local/dict/
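
The mkdir and mv commands above can also be written as one small function, run from the s5 folder. This is a sketch under my own assumptions: move_into is a hypothetical helper, and the destinations mirror my manual layout above (note that download_lm.sh itself keeps everything in data/local/lm, as the tree output earlier shows).

```shell
# Hypothetical helper: move each named file that exists into a destination
# directory, creating the directory first if needed.
move_into() {
    dest=$1
    shift
    mkdir -p "$dest"
    for f in "$@"; do
        if [ -e "$f" ]; then
            mv "$f" "$dest/"
        else
            echo "skip (not found): $f" >&2
        fi
    done
}

# Only runs if the downloaded files are in the current folder.
if [ -f 3-gram.arpa.gz ]; then
    move_into data/local/lm 3-gram.arpa.gz 3-gram.pruned.1e-7.arpa.gz \
        3-gram.pruned.3e-7.arpa.gz 4-gram.arpa.gz
    move_into data/local/dict librispeech-lexicon.txt librispeech-vocab.txt
fi
```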

data/local/dict/

[ec2-user@ip-172-31-6-113 dict]$ ls
g2p-model-5 g2p-model-5.1 librispeech-lexicon.txt librispeech-vocab.txt
[ec2-user@ip-172-31-6-113 dict]$

data/local/lm

[ec2-user@ip-172-31-6-113 lm]$ ls
3-gram.arpa.gz  3-gram.pruned.1e-7.arpa.gz  3-gram.pruned.3e-7.arpa.gz  4-gram.arpa.gz
[ec2-user@ip-172-31-6-113 lm]$

librispeech-lm-norm.txt

This is the normalized text used to build an LM; the text was converted into spoken form. We will not need it, because we downloaded language models that were already built for us. In later tutorials I will cover how to build an LM from this normalized text.

[ec2-user@ip-172-31-6-113 data]$ cd ..
[ec2-user@ip-172-31-6-113 s5]$ ls
cmd.sh conf data librispeech-lm-corpus librispeech-lm-norm.txt local path.sh RESULTS rnnlm run.sh steps utils
[ec2-user@ip-172-31-6-113 s5]$ wc -l librispeech-lm-norm.txt
40418261 librispeech-lm-norm.txt
[ec2-user@ip-172-31-6-113 s5]$ head librispeech-lm-norm.txt

A
A A
A A A
A A A A
A A A A A
A A A A A A A A A A A A A A
A A A A A A ARE THE PARTS OF THE FRAMEWORK THE DIMENSIONS OF WHICH IN FEET AND INCHES ARE GIVEN
A A A A A AH
A A A A A AH THE CRY WAS WRUNG FROM JOHNNIE
[ec2-user@ip-172-31-6-113 s5]$
