Step 1: Download the LibriSpeech ASR and i-vector extractor model to your computer
https://kaldi-asr.org/models/m13
This step was covered in the previous blog post.
Step 2: Download 8 files using the download_lm.sh script.
[ec2-user@ip-172-31-6-113 local]$ ls
chain download_and_untar.sh format_lms.sh lm nnet3 prepare_dict.sh run_cleanup_segmentation.sh run_nnet2_clean_460.sh score.sh
data_prep.sh download_lm.sh g2p lookahead online prepare_example_data.sh run_data_cleaning.sh run_nnet2.sh
decode_example.sh format_data.sh g2p.sh nnet2 online_pitch rnnlm run_nnet2_clean_100.sh run_rnnlm.sh
[ec2-user@ip-172-31-6-113 local]$ nano download_lm.sh
download_lm.sh
https://github.com/kaldi-asr/kaldi/blob/master/egs/librispeech/s5/local/download_lm.sh
For the script to work, the relative paths it uses must resolve correctly. In Kaldi it is a good rule to run all scripts from the s5 folder.
./local/download_lm.sh http://www.openslr.org/resources/11 data/local/lm
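To make the "run everything from s5" rule harder to forget, a small guard can check the current folder before the download starts. This is only a sketch: in_recipe_root is a made-up helper name, not part of Kaldi, and it assumes the recipe root can be recognised by path.sh, cmd.sh and the local folder.

```shell
# Hypothetical helper (not part of Kaldi): succeed only when the current
# directory looks like a recipe root such as egs/librispeech/s5.
in_recipe_root() {
  [ -f ./path.sh ] && [ -f ./cmd.sh ] && [ -d ./local ]
}

# Usage, commented out so the sketch has no side effects:
# in_recipe_root || { echo "Run this from the s5 folder" >&2; exit 1; }
# ./local/download_lm.sh http://www.openslr.org/resources/11 data/local/lm
```

The check is deliberately loose. It only looks for three names, so it will not catch every wrong folder.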
We can see the downloaded language models below:
[ec2-user@ip-172-31-6-113 local]$ tree lm
lm
├── 3-gram.arpa.gz
├── 3-gram.pruned.1e-7.arpa.gz
├── 3-gram.pruned.3e-7.arpa.gz
├── 4-gram.arpa.gz
├── g2p-model-5
├── librispeech-lexicon.txt
├── librispeech-lm-corpus.tgz
├── librispeech-vocab.txt
├── lm_fglarge.arpa.gz -> 4-gram.arpa.gz
├── lm_tglarge.arpa.gz -> 3-gram.arpa.gz
├── lm_tgmed.arpa.gz -> 3-gram.pruned.1e-7.arpa.gz
└── lm_tgsmall.arpa.gz -> 3-gram.pruned.3e-7.arpa.gz
0 directories, 12 files
[ec2-user@ip-172-31-6-113 local]$
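Instead of reading the tree listing by eye, the same check can be scripted. The sketch below assumes the 12 names from the listing above are the complete expected set. check_lm_dir is a made-up helper name.

```shell
# Sketch: report which expected LM files or links are missing from a directory.
# The names come from the tree listing above; check_lm_dir is a hypothetical name.
check_lm_dir() {
  dir=$1
  missing=0
  for f in 3-gram.arpa.gz 3-gram.pruned.1e-7.arpa.gz 3-gram.pruned.3e-7.arpa.gz \
           4-gram.arpa.gz g2p-model-5 librispeech-lexicon.txt \
           librispeech-lm-corpus.tgz librispeech-vocab.txt \
           lm_fglarge.arpa.gz lm_tglarge.arpa.gz lm_tgmed.arpa.gz lm_tgsmall.arpa.gz; do
    # -e follows symbolic links, so a dangling link also counts as missing.
    if [ ! -e "$dir/$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  return "$missing"
}
```

From the s5 folder it could be called as check_lm_dir data/local/lm. It prints nothing and returns 0 when everything is in place.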
Output of the script as it downloaded the 8 files:
[ec2-user@ip-172-31-6-113 s5]$ ./local/download_lm.sh
Usage: ./local/download_lm.sh <base-url> <download_dir>
e.g.: ./local/download_lm.sh http://www.openslr.org/resources/11 data/local/lm
[ec2-user@ip-172-31-6-113 s5]$ ./local/download_lm.sh http://www.openslr.org/resources/11 data/local/lm
Downloading file '3-gram.arpa.gz' into 'data/local/lm'...
--2021-12-12 22:48:20-- http://www.openslr.org/resources/11/3-gram.arpa.gz
Resolving www.openslr.org (www.openslr.org)... 46.101.158.64
Connecting to www.openslr.org (www.openslr.org)|46.101.158.64|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://us.openslr.org/resources/11/3-gram.arpa.gz [following]
--2021-12-12 22:48:20-- https://us.openslr.org/resources/11/3-gram.arpa.gz
Resolving us.openslr.org (us.openslr.org)... 46.101.158.64
Connecting to us.openslr.org (us.openslr.org)|46.101.158.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 759636181 (724M) [application/x-gzip]
Saving to: ‘data/local/lm/3-gram.arpa.gz’
100%[===============================================================================================================================================>] 759,636,181 2.87MB/s in 3m 53s
2021-12-12 22:52:14 (3.11 MB/s) - ‘data/local/lm/3-gram.arpa.gz’ saved [759636181/759636181]
Downloading file '3-gram.pruned.1e-7.arpa.gz' into 'data/local/lm'...
--2021-12-12 22:52:14-- http://www.openslr.org/resources/11/3-gram.pruned.1e-7.arpa.gz
Resolving www.openslr.org (www.openslr.org)... 46.101.158.64
Connecting to www.openslr.org (www.openslr.org)|46.101.158.64|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://us.openslr.org/resources/11/3-gram.pruned.1e-7.arpa.gz [following]
--2021-12-12 22:52:14-- https://us.openslr.org/resources/11/3-gram.pruned.1e-7.arpa.gz
Resolving us.openslr.org (us.openslr.org)... 46.101.158.64
Connecting to us.openslr.org (us.openslr.org)|46.101.158.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 34094057 (33M) [application/x-gzip]
Saving to: ‘data/local/lm/3-gram.pruned.1e-7.arpa.gz’
100%[===============================================================================================================================================>] 34,094,057 2.61MB/s in 15s
2021-12-12 22:52:29 (2.24 MB/s) - ‘data/local/lm/3-gram.pruned.1e-7.arpa.gz’ saved [34094057/34094057]
Downloading file '3-gram.pruned.3e-7.arpa.gz' into 'data/local/lm'...
--2021-12-12 22:52:30-- http://www.openslr.org/resources/11/3-gram.pruned.3e-7.arpa.gz
Resolving www.openslr.org (www.openslr.org)... 46.101.158.64
Connecting to www.openslr.org (www.openslr.org)|46.101.158.64|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://us.openslr.org/resources/11/3-gram.pruned.3e-7.arpa.gz [following]
--2021-12-12 22:52:30-- https://us.openslr.org/resources/11/3-gram.pruned.3e-7.arpa.gz
Resolving us.openslr.org (us.openslr.org)... 46.101.158.64
Connecting to us.openslr.org (us.openslr.org)|46.101.158.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13654242 (13M) [application/x-gzip]
Saving to: ‘data/local/lm/3-gram.pruned.3e-7.arpa.gz’
100%[===============================================================================================================================================>] 13,654,242 1.51MB/s in 16s
2021-12-12 22:52:46 (851 KB/s) - ‘data/local/lm/3-gram.pruned.3e-7.arpa.gz’ saved [13654242/13654242]
Downloading file '4-gram.arpa.gz' into 'data/local/lm'...
--2021-12-12 22:52:46-- http://www.openslr.org/resources/11/4-gram.arpa.gz
Resolving www.openslr.org (www.openslr.org)... 46.101.158.64
Connecting to www.openslr.org (www.openslr.org)|46.101.158.64|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://us.openslr.org/resources/11/4-gram.arpa.gz [following]
--2021-12-12 22:52:47-- https://us.openslr.org/resources/11/4-gram.arpa.gz
Resolving us.openslr.org (us.openslr.org)... 46.101.158.64
Connecting to us.openslr.org (us.openslr.org)|46.101.158.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1355172078 (1.3G) [application/x-gzip]
Saving to: ‘data/local/lm/4-gram.arpa.gz’
100%[=============================================================================================================================================>] 1,355,172,078 2.33MB/s in 6m 24s
2021-12-12 22:59:12 (3.36 MB/s) - ‘data/local/lm/4-gram.arpa.gz’ saved [1355172078/1355172078]
Downloading file 'g2p-model-5' into 'data/local/lm'...
--2021-12-12 22:59:12-- http://www.openslr.org/resources/11/g2p-model-5
Resolving www.openslr.org (www.openslr.org)... 46.101.158.64
Connecting to www.openslr.org (www.openslr.org)|46.101.158.64|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://us.openslr.org/resources/11/g2p-model-5 [following]
--2021-12-12 22:59:12-- https://us.openslr.org/resources/11/g2p-model-5
Resolving us.openslr.org (us.openslr.org)... 46.101.158.64
Connecting to us.openslr.org (us.openslr.org)|46.101.158.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20098243 (19M)
Saving to: ‘data/local/lm/g2p-model-5’
100%[===============================================================================================================================================>] 20,098,243 3.01MB/s in 9.2s
2021-12-12 22:59:22 (2.09 MB/s) - ‘data/local/lm/g2p-model-5’ saved [20098243/20098243]
Downloading file 'librispeech-lm-corpus.tgz' into 'data/local/lm'...
--2021-12-12 22:59:22-- http://www.openslr.org/resources/11/librispeech-lm-corpus.tgz
Resolving www.openslr.org (www.openslr.org)... 46.101.158.64
Connecting to www.openslr.org (www.openslr.org)|46.101.158.64|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://us.openslr.org/resources/11/librispeech-lm-corpus.tgz [following]
--2021-12-12 22:59:22-- https://us.openslr.org/resources/11/librispeech-lm-corpus.tgz
Resolving us.openslr.org (us.openslr.org)... 46.101.158.64
Connecting to us.openslr.org (us.openslr.org)|46.101.158.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1803499244 (1.7G) [application/x-gzip]
Saving to: ‘data/local/lm/librispeech-lm-corpus.tgz’
100%[=============================================================================================================================================>] 1,803,499,244 3.00MB/s in 8m 55s
2021-12-12 23:08:18 (3.22 MB/s) - ‘data/local/lm/librispeech-lm-corpus.tgz’ saved [1803499244/1803499244]
Downloading file 'librispeech-vocab.txt' into 'data/local/lm'...
--2021-12-12 23:08:18-- http://www.openslr.org/resources/11/librispeech-vocab.txt
Resolving www.openslr.org (www.openslr.org)... 46.101.158.64
Connecting to www.openslr.org (www.openslr.org)|46.101.158.64|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://us.openslr.org/resources/11/librispeech-vocab.txt [following]
--2021-12-12 23:08:18-- https://us.openslr.org/resources/11/librispeech-vocab.txt
Resolving us.openslr.org (us.openslr.org)... 46.101.158.64
Connecting to us.openslr.org (us.openslr.org)|46.101.158.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1737588 (1.7M) [text/plain]
Saving to: ‘data/local/lm/librispeech-vocab.txt’
100%[===============================================================================================================================================>] 1,737,588 1.10MB/s in 1.5s
2021-12-12 23:08:20 (1.10 MB/s) - ‘data/local/lm/librispeech-vocab.txt’ saved [1737588/1737588]
Downloading file 'librispeech-lexicon.txt' into 'data/local/lm'...
--2021-12-12 23:08:20-- http://www.openslr.org/resources/11/librispeech-lexicon.txt
Resolving www.openslr.org (www.openslr.org)... 46.101.158.64
Connecting to www.openslr.org (www.openslr.org)|46.101.158.64|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://us.openslr.org/resources/11/librispeech-lexicon.txt [following]
--2021-12-12 23:08:21-- https://us.openslr.org/resources/11/librispeech-lexicon.txt
Resolving us.openslr.org (us.openslr.org)... 46.101.158.64
Connecting to us.openslr.org (us.openslr.org)|46.101.158.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5627653 (5.4M) [text/plain]
Saving to: ‘data/local/lm/librispeech-lexicon.txt’
100%[===============================================================================================================================================>] 5,627,653 1.50MB/s in 4.2s
2021-12-12 23:08:25 (1.28 MB/s) - ‘data/local/lm/librispeech-lexicon.txt’ saved [5627653/5627653]
[ec2-user@ip-172-31-6-113 s5]$
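The downloads above add up to several gigabytes, so it can be worth confirming that the compressed files arrived intact before using them. The sketch below relies on gzip -t, which tests an archive without unpacking it. verify_gz_files is a made-up helper name, and the check says nothing about whether the contents are the right language models.

```shell
# Sketch: test every gzip-compressed download without unpacking it.
# verify_gz_files is a hypothetical name; gzip -t only checks archive integrity.
verify_gz_files() {
  dir=$1
  status=0
  for f in "$dir"/*.gz "$dir"/*.tgz; do
    [ -f "$f" ] || continue      # skip unmatched globs and dangling links
    [ -L "$f" ] && continue      # skip symbolic links so each file is tested once
    if gzip -t "$f" 2>/dev/null; then
      echo "ok: ${f##*/}"
    else
      echo "corrupt: ${f##*/}"
      status=1
    fi
  done
  return "$status"
}
```

From the s5 folder it could be called as verify_gz_files data/local/lm. Expect it to take a while on the larger files.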
STOP. Ignore the steps below, because the script above already does them.
Navigate to the librispeech folder inside the kaldi folder: /home/ec2-user/kaldi/egs/librispeech/s5
[ec2-user@ip-172-31-6-113 s5]$ ls
cmd.sh conf local path.sh RESULTS rnnlm run.sh steps utils
[ec2-user@ip-172-31-6-113 s5]$
I will download the files that others have already created to help us decode.
They are located at the URLs below. Download all 9 files.
wget https://www.openslr.org/resources/11/librispeech-lm-corpus.tgz
wget https://us.openslr.org/resources/11/librispeech-lm-norm.txt.gz
wget https://us.openslr.org/resources/11/librispeech-vocab.txt
wget https://us.openslr.org/resources/11/librispeech-lexicon.txt
wget https://us.openslr.org/resources/11/3-gram.arpa.gz
wget https://us.openslr.org/resources/11/3-gram.pruned.1e-7.arpa.gz
wget https://us.openslr.org/resources/11/3-gram.pruned.3e-7.arpa.gz
wget https://us.openslr.org/resources/11/4-gram.arpa.gz
wget https://www.openslr.org/resources/11/g2p-model-5
Unzip these 2 files
tar -xvzf librispeech-lm-corpus.tgz
gzip -d librispeech-lm-norm.txt.gz
Now we have all the files:
[ec2-user@ip-172-31-6-113 s5]$ ls
3-gram.arpa.gz 3-gram.pruned.3e-7.arpa.gz cmd.sh g2p-model-5 librispeech-lexicon.txt librispeech-lm-norm.txt local RESULTS run.sh utils
3-gram.pruned.1e-7.arpa.gz 4-gram.arpa.gz conf g2p-model-5.1 librispeech-lm-corpus librispeech-vocab.txt path.sh rnnlm steps
We must put them into specific folders, because all Kaldi scripts assume the data is in the expected locations.
[ec2-user@ip-172-31-6-113 s5]$ mkdir -p data/local
[ec2-user@ip-172-31-6-113 s5]$ cd data/local/
[ec2-user@ip-172-31-6-113 local]$ mkdir dict
[ec2-user@ip-172-31-6-113 local]$ mkdir lm
[ec2-user@ip-172-31-6-113 s5]$ mv 3-gram.arpa.gz data/local/lm/
[ec2-user@ip-172-31-6-113 s5]$ mv 3-gram.pruned.1e-7.arpa.gz data/local/lm/
[ec2-user@ip-172-31-6-113 s5]$ mv 3-gram.pruned.3e-7.arpa.gz data/local/lm/
[ec2-user@ip-172-31-6-113 s5]$ mv 4-gram.arpa.gz data/local/lm
[ec2-user@ip-172-31-6-113 s5]$ mv librispeech-lexicon.txt data/local/dict/
[ec2-user@ip-172-31-6-113 s5]$ mv librispeech-vocab.txt data/local/dict/
[ec2-user@ip-172-31-6-113 s5]$ mv g2p-model-5.1 data/local/dict/
Contents of data/local/dict/:
[ec2-user@ip-172-31-6-113 dict]$ ls
g2p-model-5 g2p-model-5.1 librispeech-lexicon.txt librispeech-vocab.txt
[ec2-user@ip-172-31-6-113 dict]$
Contents of data/local/lm:
[ec2-user@ip-172-31-6-113 lm]$ ls
3-gram.arpa.gz 3-gram.pruned.1e-7.arpa.gz 3-gram.pruned.3e-7.arpa.gz 4-gram.arpa.gz
[ec2-user@ip-172-31-6-113 lm]$
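If you do place the files by hand, the separate mv commands above can be folded into one small function. This is a sketch under the same layout as above (ARPA models into lm, lexicon, vocabulary and g2p model into dict). place_lm_files is a made-up helper name.

```shell
# Sketch: sort downloaded files into the lm/ and dict/ folders used above.
# place_lm_files is a hypothetical helper; $1 is where the files are now,
# $2 is the target such as data/local.
place_lm_files() {
  src=$1
  dst=$2
  mkdir -p "$dst/lm" "$dst/dict"
  for f in "$src"/*.arpa.gz; do
    if [ -f "$f" ]; then
      mv "$f" "$dst/lm/"
    fi
  done
  for f in librispeech-lexicon.txt librispeech-vocab.txt g2p-model-5; do
    if [ -f "$src/$f" ]; then
      mv "$src/$f" "$dst/dict/"
    fi
  done
  return 0
}
```

From the s5 folder it could be called as place_lm_files . data/local. Files that are not present are simply skipped.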
librispeech-lm-norm.txt
This is the normalized text used to build the LM; the text was converted into its spoken form. We will not need it, because we downloaded language models that were already built for us. In later tutorials I will cover how to build an LM from this normalized text.
[ec2-user@ip-172-31-6-113 data]$ cd ..
[ec2-user@ip-172-31-6-113 s5]$ ls
cmd.sh conf data librispeech-lm-corpus librispeech-lm-norm.txt local path.sh RESULTS rnnlm run.sh steps utils
[ec2-user@ip-172-31-6-113 s5]$ wc -l librispeech-lm-norm.txt
40418261 librispeech-lm-norm.txt
[ec2-user@ip-172-31-6-113 s5]$ head librispeech-lm-norm.txt
A
A A
A A A
A A A A
A A A A A
A A A A A A A A A A A A A A
A A A A A A ARE THE PARTS OF THE FRAMEWORK THE DIMENSIONS OF WHICH IN FEET AND INCHES ARE GIVEN
A A A A A AH
A A A A A AH THE CRY WAS WRUNG FROM JOHNNIE
[ec2-user@ip-172-31-6-113 s5]$
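The wc and head checks above were run on the unpacked text. A similar look is possible without unpacking the .gz file on disk, by streaming it through gzip -dc. The helper names peek_gz and count_gz_lines below are made up for this sketch.

```shell
# Sketch: look inside a .gz text file without unpacking it on disk.
# peek_gz and count_gz_lines are hypothetical helper names.
peek_gz() {
  # Print the first N lines (default 10) of a gzip-compressed text file.
  gzip -dc "$1" | head -n "${2:-10}"
}

count_gz_lines() {
  # Count lines in the decompressed stream; tr strips padding some wc builds add.
  gzip -dc "$1" | wc -l | tr -d ' '
}
```

For example, peek_gz librispeech-lm-norm.txt.gz 10 would show the first ten lines while leaving the compressed file as it is.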