I do not advise using this fix. Looking at it afterwards, it was probably just a path issue that could have been fixed more easily.
While running prepare_dict.sh, I got stuck trying to make g2p work.
(libri_env) [ec2-user@ip-172-31-6-113 s5]$ ./local/prepare_dict.sh data/lm data/g2p data/local/dict
Downloading and preparing CMUdict
Removing the pronunciation variant markers ...
Autogenerating pronunciations for the words in data/local/dict/g2p/vocab_autogen.* ...
run.pl: 4 / 4 failed, log is in data/local/dict/g2p/log/g2p.*.log
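The run.pl message only points at the log files, so the first step is to read them. A minimal sketch, assuming the log directory from the message above; show_g2p_logs is a hypothetical helper name, not part of Kaldi:

```shell
# show_g2p_logs is a hypothetical helper (not part of Kaldi).
# It prints the last few lines of every g2p.*.log in a log directory.
show_g2p_logs() {
  local log_dir="$1" lines="${2:-15}" log
  for log in "$log_dir"/g2p.*.log; do
    [ -f "$log" ] || continue        # the glob matched nothing
    echo "==> $log <=="
    tail -n "$lines" "$log"
  done
}

# Log directory taken from the run.pl message above.
show_g2p_logs data/local/dict/g2p/log
```

The last lines of each log are usually enough, since the Python traceback or the "Can't find ..." message sits right before the "# Accounting" footer.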
The problem looks like the g2p script uses python3.7, while I was using python2.7.
Solution: copying the whole /home/ec2-user/libri_env/lib/python3.7 folder to /home/ec2-user/kaldi/egs/librispeech/s5/../../../tools/sequitur-g2p/lib/ fixed the g2p error. The commands I ran:
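Since the script only seems to look for lib/python3.7/site-packages under the sequitur-g2p directory, a symlink to the virtual env might be a lighter alternative to copying the whole folder. This is an untested sketch, not what I actually ran; link_python_lib is a hypothetical helper name:

```shell
# Untested alternative to the copy: link the venv's python3.7 directory into
# the sequitur-g2p lib directory instead of duplicating it.
# link_python_lib is a hypothetical helper, not part of Kaldi.
link_python_lib() {
  local src="$1" dst_dir="$2"
  [ -d "$src" ] || { echo "no such directory: $src" >&2; return 1; }
  mkdir -p "$dst_dir"
  # -s symbolic link, -f replace an existing link, -n do not follow an old link
  ln -sfn "$src" "$dst_dir/$(basename "$src")"
}

# Example with the paths from this post (commented out so nothing changes by accident):
# link_python_lib /home/ec2-user/libri_env/lib/python3.7 \
#                 /home/ec2-user/kaldi/tools/sequitur-g2p/lib
```

A link keeps a single copy of the packages, so anything installed into the virtual env later is visible without copying again.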
[ec2-user@ip-172-31-6-113 ~]$ python3 -m venv libri_env
[ec2-user@ip-172-31-6-113 ~]$ source libri_env/bin/activate
(libri_env) [ec2-user@ip-172-31-6-113 ~]$ python3 -m pip install --upgrade pip
(libri_env) [ec2-user@ip-172-31-6-113 s5]$ pip3 install sequitur
(libri_env) [ec2-user@ip-172-31-6-113 s5]$ pip3 install numpy
(libri_env) [ec2-user@ip-172-31-6-113 s5]$ pip3 install sequitur-g2p
(libri_env) [ec2-user@ip-172-31-6-113 s5]$ pip3 install six
(libri_env) [ec2-user@ip-172-31-6-113 s5]$ cp -r /home/ec2-user/libri_env/lib/python3.7 /home/ec2-user/kaldi/egs/librispeech/s5/../../../tools/sequitur-g2p/lib/
(libri_env) [ec2-user@ip-172-31-6-113 s5]$ ./local/prepare_dict.sh data/lm data/g2p data/local/dict
Downloading and preparing CMUdict
Removing the pronunciation variant markers ...
Autogenerating pronunciations for the words in data/local/dict/g2p/vocab_autogen.* ...
134746 pronunciations autogenerated OK
Combining the CMUdict pronunciations with the autogenerated ones ...
Combined lexicon saved to 'data/local/dict/lexicon_raw_nosil.txt'
Preparing phone lists and clustering questions
2 silence phones saved to: data/local/dict/silence_phones.txt
1 optional silence saved to: data/local/dict/optional_silence.txt
39 non-silence phones saved to: data/local/dict/nonsilence_phones.txt
5 extra triphone clustering-related questions saved to: data/local/dict/extra_questions.txt
Lexicon text file saved as: data/local/dict/lexicon.txt
It worked!
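A quick sanity check of the resulting lexicon can be sketched as follows. It assumes each line is a word followed by its phones, and check_lexicon is a hypothetical helper name:

```shell
# check_lexicon is a hypothetical helper: it reports how many entries a
# lexicon has and how many lines lack a pronunciation (fewer than 2 fields).
check_lexicon() {
  local lex="$1"
  [ -s "$lex" ] || { echo "missing or empty: $lex" >&2; return 1; }
  awk 'NF < 2 { bad++ } END { printf "%d entries, %d without pronunciation\n", NR, bad }' "$lex"
}

# Path taken from the script output above.
check_lexicon data/local/dict/lexicon.txt || echo "run prepare_dict.sh first"
```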
It is helpful to see the tree structure of the data folder:
(libri_env) [ec2-user@ip-172-31-6-113 s5]$ tree -L 3 data
data
├── g2p
│   └── g2p-model-5
├── LibriSpeech
│   ├── BOOKS.TXT
│   ├── CHAPTERS.TXT
│   ├── LICENSE.TXT
│   ├── README.TXT
│   ├── SPEAKERS.TXT
│   └── test-clean
│       ├── 1089
│       ├── 1188
│       ├── 121
│       ├── 1221
│       ├── 1284
│       ├── 1320
│       ├── 1580
│       ├── 1995
│       ├── 2094
│       ├── 2300
│       ├── 237
│       ├── 260
│       ├── 2830
│       ├── 2961
│       ├── 3570
│       ├── 3575
│       ├── 3729
│       ├── 4077
│       ├── 4446
│       ├── 4507
│       ├── 4970
│       ├── 4992
│       ├── 5105
│       ├── 5142
│       ├── 5639
│       ├── 5683
│       ├── 61
│       ├── 672
│       ├── 6829
│       ├── 6930
│       ├── 7021
│       ├── 7127
│       ├── 7176
│       ├── 7729
│       ├── 8224
│       ├── 8230
│       ├── 8455
│       ├── 8463
│       ├── 8555
│       └── 908
├── lm
│   ├── 3-gram.arpa.gz
│   ├── 3-gram.pruned.1e-7.arpa.gz
│   ├── 3-gram.pruned.3e-7.arpa.gz
│   ├── 4-gram.arpa.gz
│   ├── cmudict
│   │   ├── 00README_FIRST.txt
│   │   ├── cmudict.0.6d
│   │   ├── cmudict.0.7a
│   │   ├── cmudict.0.7a.phones
│   │   ├── cmudict.0.7a.symbols
│   │   ├── README.developer
│   │   ├── README.old
│   │   ├── README.weide
│   │   ├── scripts
│   │   └── sphinxdict
│   ├── cmudict.0.7a.plain
│   ├── g2p
│   │   ├── lexicon_autogen.1.tmp
│   │   ├── lexicon_autogen.2.tmp
│   │   ├── lexicon_autogen.3.tmp
│   │   ├── lexicon_autogen.4.tmp
│   │   ├── log
│   │   ├── vocab_autogen.1
│   │   ├── vocab_autogen.2
│   │   ├── vocab_autogen.3
│   │   ├── vocab_autogen.4
│   │   └── vocab_autogen.full
│   ├── librispeech-lexicon.txt
│   ├── librispeech-lm-corpus.tgz
│   ├── librispeech-vocab.txt
│   ├── lm_fglarge.arpa.gz -> 4-gram.arpa.gz
│   ├── lm_tglarge.arpa.gz -> 3-gram.arpa.gz
│   ├── lm_tgmed.arpa.gz -> 3-gram.pruned.1e-7.arpa.gz
│   └── lm_tgsmall.arpa.gz -> 3-gram.pruned.3e-7.arpa.gz
└── local
    └── dict
        ├── cmudict
        ├── cmudict.0.7a.plain
        ├── extra_questions.txt
        ├── g2p
        ├── lexicon_autogen.txt
        ├── lexicon_raw_nosil.txt
        ├── lexicon.txt
        ├── nonsilence_phones.txt
        ├── optional_silence.txt
        ├── silence_phones.txt
        └── vocab_autogen.txt
53 directories, 44 files
(libri_env) [ec2-user@ip-172-31-6-113 s5]$
The tree structure of the data folder, one level deeper:
(libri_env) [ec2-user@ip-172-31-6-113 s5]$ tree -L 4 data
data
├── g2p
│   └── g2p-model-5
├── LibriSpeech
│   ├── BOOKS.TXT
│   ├── CHAPTERS.TXT
│   ├── LICENSE.TXT
│   ├── README.TXT
│   ├── SPEAKERS.TXT
│   └── test-clean
│       ├── 1089
│       │   ├── 134686
│       │   └── 134691
│       ├── 1188
│       │   └── 133604
│       ├── 121
│       │   ├── 121726
│       │   ├── 123852
│       │   ├── 123859
│       │   └── 127105
│       ├── 1221
│       │   ├── 135766
│       │   └── 135767
│       ├── 1284
│       │   ├── 1180
│       │   ├── 1181
│       │   └── 134647
│       ├── 1320
│       │   ├── 122612
│       │   └── 122617
│       ├── 1580
│       │   ├── 141083
│       │   └── 141084
│       ├── 1995
│       │   ├── 1826
│       │   ├── 1836
│       │   └── 1837
│       ├── 2094
│       │   └── 142345
│       ├── 2300
│       │   └── 131720
│       ├── 237
│       │   ├── 126133
│       │   ├── 134493
│       │   └── 134500
│       ├── 260
│       │   ├── 123286
│       │   ├── 123288
│       │   └── 123440
│       ├── 2830
│       │   ├── 3979
│       │   └── 3980
│       ├── 2961
│       │   ├── 960
│       │   └── 961
│       ├── 3570
│       │   ├── 5694
│       │   ├── 5695
│       │   └── 5696
│       ├── 3575
│       │   └── 170457
│       ├── 3729
│       │   └── 6852
│       ├── 4077
│       │   ├── 13751
│       │   └── 13754
│       ├── 4446
│       │   ├── 2271
│       │   ├── 2273
│       │   └── 2275
│       ├── 4507
│       │   └── 16021
│       ├── 4970
│       │   ├── 29093
│       │   └── 29095
│       ├── 4992
│       │   ├── 23283
│       │   ├── 41797
│       │   └── 41806
│       ├── 5105
│       │   ├── 28233
│       │   ├── 28240
│       │   └── 28241
│       ├── 5142
│       │   ├── 33396
│       │   ├── 36377
│       │   ├── 36586
│       │   └── 36600
│       ├── 5639
│       │   └── 40744
│       ├── 5683
│       │   ├── 32865
│       │   ├── 32866
│       │   └── 32879
│       ├── 61
│       │   ├── 70968
│       │   └── 70970
│       ├── 672
│       │   └── 122797
│       ├── 6829
│       │   ├── 68769
│       │   └── 68771
│       ├── 6930
│       │   ├── 75918
│       │   ├── 76324
│       │   └── 81414
│       ├── 7021
│       │   ├── 79730
│       │   ├── 79740
│       │   ├── 79759
│       │   └── 85628
│       ├── 7127
│       │   ├── 75946
│       │   └── 75947
│       ├── 7176
│       │   ├── 88083
│       │   └── 92135
│       ├── 7729
│       │   └── 102255
│       ├── 8224
│       │   ├── 274381
│       │   └── 274384
│       ├── 8230
│       │   └── 279154
│       ├── 8455
│       │   └── 210777
│       ├── 8463
│       │   ├── 287645
│       │   ├── 294825
│       │   └── 294828
│       ├── 8555
│       │   ├── 284447
│       │   ├── 284449
│       │   └── 292519
│       └── 908
│           ├── 157963
│           └── 31957
├── lm
│   ├── 3-gram.arpa.gz
│   ├── 3-gram.pruned.1e-7.arpa.gz
│   ├── 3-gram.pruned.3e-7.arpa.gz
│   ├── 4-gram.arpa.gz
│   ├── cmudict
│   │   ├── 00README_FIRST.txt
│   │   ├── cmudict.0.6d
│   │   ├── cmudict.0.7a
│   │   ├── cmudict.0.7a.phones
│   │   ├── cmudict.0.7a.symbols
│   │   ├── README.developer
│   │   ├── README.old
│   │   ├── README.weide
│   │   ├── scripts
│   │   │   ├── CompileDictionary.sh
│   │   │   ├── make_baseform.pl
│   │   │   ├── README.txt
│   │   │   ├── sort_cmudict.pl
│   │   │   ├── test_cmudict.pl
│   │   │   └── test_dict.pl
│   │   └── sphinxdict
│   │       ├── cmudict.0.7a_SPHINX_40
│   │       ├── cmudict_SPHINX_40
│   │       ├── README.txt
│   │       └── SphinxPhones_40
│   ├── cmudict.0.7a.plain
│   ├── g2p
│   │   ├── lexicon_autogen.1.tmp
│   │   ├── lexicon_autogen.2.tmp
│   │   ├── lexicon_autogen.3.tmp
│   │   ├── lexicon_autogen.4.tmp
│   │   ├── log
│   │   │   ├── g2p.1.log
│   │   │   ├── g2p.2.log
│   │   │   ├── g2p.3.log
│   │   │   └── g2p.4.log
│   │   ├── vocab_autogen.1
│   │   ├── vocab_autogen.2
│   │   ├── vocab_autogen.3
│   │   ├── vocab_autogen.4
│   │   └── vocab_autogen.full
│   ├── librispeech-lexicon.txt
│   ├── librispeech-lm-corpus.tgz
│   ├── librispeech-vocab.txt
│   ├── lm_fglarge.arpa.gz -> 4-gram.arpa.gz
│   ├── lm_tglarge.arpa.gz -> 3-gram.arpa.gz
│   ├── lm_tgmed.arpa.gz -> 3-gram.pruned.1e-7.arpa.gz
│   └── lm_tgsmall.arpa.gz -> 3-gram.pruned.3e-7.arpa.gz
└── local
    └── dict
        ├── cmudict
        │   ├── 00README_FIRST.txt
        │   ├── cmudict.0.6d
        │   ├── cmudict.0.7a
        │   ├── cmudict.0.7a.phones
        │   ├── cmudict.0.7a.symbols
        │   ├── README.developer
        │   ├── README.old
        │   ├── README.weide
        │   ├── scripts
        │   └── sphinxdict
        ├── cmudict.0.7a.plain
        ├── extra_questions.txt
        ├── g2p
        │   ├── lexicon_autogen.1
        │   ├── lexicon_autogen.2
        │   ├── lexicon_autogen.3
        │   ├── lexicon_autogen.4
        │   ├── log
        │   ├── vocab_autogen.1
        │   ├── vocab_autogen.2
        │   ├── vocab_autogen.3
        │   ├── vocab_autogen.4
        │   └── vocab_autogen.full
        ├── lexicon_autogen.txt
        ├── lexicon_raw_nosil.txt
        ├── lexicon.txt
        ├── nonsilence_phones.txt
        ├── optional_silence.txt
        ├── silence_phones.txt
        └── vocab_autogen.txt
143 directories, 75 files
(libri_env) [ec2-user@ip-172-31-6-113 s5]$
The various error messages I ran into are below:
# local/g2p.sh data/local/dict/g2p/vocab_autogen.1 data/local/lm data/local/dict/g2p/lexicon_autogen.1
# Started at Sun Dec 12 23:54:31 UTC 2021
#
File "/home/ec2-user/kaldi/egs/librispeech/s5/../../../tools/sequitur-g2p/g2p.py", line 141
print("unknown format in file: %s" % (line), file=stderr)
^
SyntaxError: invalid syntax
# Accounting: time=0 threads=1
# Ended (code 1) at Sun Dec 12 23:54:31 UTC 2021, elapsed time 0 seconds
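The SyntaxError above is what Python 2 reports for the Python 3 form print(..., file=stderr), so the log suggests g2p.py was being run by a Python 2 interpreter. A quick way to see which interpreters are on PATH; list_pythons is a hypothetical helper name:

```shell
# list_pythons is a hypothetical helper that shows which Python
# interpreters are on PATH and what version each one reports.
list_pythons() {
  local py
  for py in python python2 python3; do
    if command -v "$py" >/dev/null 2>&1; then
      # Python 2 prints its version on stderr, hence 2>&1
      echo "$py -> $(command -v "$py") ($("$py" --version 2>&1))"
    else
      echo "$py -> not found"
    fi
  done
}

list_pythons
```

Running it both inside and outside the virtual env shows whether activating the env actually changes what the plain python command resolves to.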
After creating a python3.7 environment and running the script again, the same problem persisted.
[ec2-user@ip-172-31-6-113 s5]$ sudo yum install numpy
[ec2-user@ip-172-31-6-113 ~]$ python3 -m venv libri_env
[ec2-user@ip-172-31-6-113 ~]$ source libri_env/bin/activate
(libri_env) [ec2-user@ip-172-31-6-113 ~]$ python3 -m pip install --upgrade pip
(libri_env) [ec2-user@ip-172-31-6-113 ~]$
Trying to mimic the sample input to the script:
#/home/ec2-user/kaldi/egs/librispeech/s5/data/g2p
"e.g.: /export/a15/vpanayotov/data/lm /export/a15/vpanayotov/data/g2p data/local/dict"
./local/prepare_dict.sh data/local/lm data/dict/g2p data/local/dict
(libri_env) [ec2-user@ip-172-31-6-113 s5]$ ./local/prepare_dict.sh data/local/lm data/g2p data/local/dict
Another error when I use python3.7:
# local/g2p.sh data/local/dict/g2p/vocab_autogen.2 /home/ec2-user/kaldi/egs/librispeech/s5/data/g2p data/local/dict/g2p/lexicon_autogen.2
# Started at Tue Dec 14 05:40:29 UTC 2021
#
Can't find '/home/ec2-user/kaldi/egs/librispeech/s5/../../../tools/sequitur-g2p/lib/python3.7/site-packages' - please fix your Sequitur installation
# Accounting: time=0 threads=1
# Ended (code 1) at Tue Dec 14 05:40:29 UTC 2021, elapsed time 0 seconds
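The message above shows where the script expects the Sequitur Python packages. A minimal sketch of checking that location by hand, assuming the layout seen in the error message (sequitur directory, then lib, then the Python name, then site-packages); expected_site_packages is a hypothetical helper name:

```shell
# expected_site_packages is a hypothetical helper. It assumes the layout
# seen in the error message: <sequitur dir>/lib/<python name>/site-packages.
expected_site_packages() {
  local sequitur_dir="$1" py_name="$2"
  echo "$sequitur_dir/lib/$py_name/site-packages"
}

# Sequitur directory as it appears in the logs of this post.
dir=$(expected_site_packages "$HOME/kaldi/tools/sequitur-g2p" python3.7)
if [ -d "$dir" ]; then
  echo "OK: $dir"
else
  echo "missing: $dir"
fi
```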
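Install sequitur into the python3.7 virtual env. (Judging by the output, this PyPI package pulls in torch and is not the Sequitur G2P tool; that one is sequitur-g2p, installed further below.)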
(libri_env) [ec2-user@ip-172-31-6-113 s5]$ pip3 install sequitur
output:
(libri_env) [ec2-user@ip-172-31-6-113 s5]$ pip3 install sequitur
Collecting sequitur
Downloading sequitur-1.2.4.tar.gz (11 kB)
Preparing metadata (setup.py) ... done
Collecting torch
Downloading torch-1.10.0-cp37-cp37m-manylinux1_x86_64.whl (881.9 MB)
|████████████████████████████████| 881.9 MB 12 kB/s
Collecting typing-extensions
Downloading typing_extensions-4.0.1-py3-none-any.whl (22 kB)
Using legacy 'setup.py install' for sequitur, since package 'wheel' is not installed.
Installing collected packages: typing-extensions, torch, sequitur
Running setup.py install for sequitur ... done
Successfully installed sequitur-1.2.4 torch-1.10.0 typing-extensions-4.0.1
(libri_env) [ec2-user@ip-172-31-6-113 s5]$
sequitur_model=$g2p_model_dir/g2p-model-5
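The line above implies that the second argument has to be a directory that contains g2p-model-5. A small pre-flight check can be sketched like this; check_g2p_model is a hypothetical helper name:

```shell
# Pre-flight check, assuming the model file name from the line above.
# check_g2p_model is a hypothetical helper, not part of Kaldi.
check_g2p_model() {
  local g2p_model_dir="$1"
  local sequitur_model="$g2p_model_dir/g2p-model-5"
  if [ -f "$sequitur_model" ]; then
    echo "found $sequitur_model"
  else
    echo "missing $sequitur_model" >&2
    return 1
  fi
}

# data/g2p is the second argument used in this post.
check_g2p_model data/g2p || echo "check the second argument"
```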
(libri_env) [ec2-user@ip-172-31-6-113 s5]$ ./local/prepare_dict.sh data/lm data/g2p data/local/dict
# local/g2p.sh data/local/dict/g2p/vocab_autogen.2 data/g2p data/local/dict/g2p/lexicon_autogen.2
# Started at Tue Dec 14 06:04:33 UTC 2021
#
Can't find '/home/ec2-user/kaldi/egs/librispeech/s5/../../../tools/sequitur-g2p/lib//site-packages' - please fix your Sequitur installation
# Accounting: time=0 threads=1
# Ended (code 1) at Tue Dec 14 06:04:33 UTC 2021, elapsed time 0 seconds
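The double slash in lib//site-packages suggests that whatever holds the Python directory name was empty in this run. A minimal sketch of how that shape arises and how a default value guards against it; the variable name py_name is an assumption for illustration, not taken from the Kaldi scripts:

```shell
# build_path mimics the path layout from the error message.
# py_name is an assumed variable name, used only for illustration.
build_path() {
  local py_name="$1"
  echo "tools/sequitur-g2p/lib/$py_name/site-packages"
}

build_path ""            # an empty name reproduces the "lib//site-packages" shape
build_path "python3.7"   # the intended shape

# Guarded variant: fall back to a default when the name is empty or unset.
build_path_guarded() {
  local py_name="${1:-python3.7}"
  echo "tools/sequitur-g2p/lib/$py_name/site-packages"
}

build_path_guarded ""
```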
(libri_env) [ec2-user@ip-172-31-6-113 s5]$ ./local/prepare_dict.sh data/lm data/g2p data/local/dict
Downloading and preparing CMUdict
Removing the pronunciation variant markers ...
Sequitur G2P not found- running /home/ec2-user/kaldi/egs/librispeech/s5/../../../tools/extra/install_sequitur.sh
~/kaldi/tools ~/kaldi/egs/librispeech/s5
extras/install_sequitur.sh : ERROR: python-devel/python-dev not installed
extras/install_sequitur.sh: we recommend that you run (our best guess):
sudo yum install python-devel
(libri_env) [ec2-user@ip-172-31-6-113 s5]$
# local/g2p.sh data/local/dict/g2p/vocab_autogen.1 data/g2p data/local/dict/g2p/lexicon_autogen.1
# Started at Tue Dec 14 06:11:13 UTC 2021
#
Traceback (most recent call last):
File "/home/ec2-user/kaldi/egs/librispeech/s5/../../../tools/sequitur-g2p/g2p.py", line 39, in <module>
import SequiturTool
File "/home/ec2-user/kaldi/tools/sequitur-g2p/SequiturTool.py", line 32, in <module>
from six.moves import cPickle as pickle
ModuleNotFoundError: No module named 'six'
# Accounting: time=0 threads=1
# Ended (code 1) at Tue Dec 14 06:11:13 UTC 2021, elapsed time 0 seconds
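The traceback means the interpreter that ran g2p.py could not import six. That interpreter may not be the one from the virtual env, so it is worth testing the import with a specific interpreter. A minimal sketch; has_module is a hypothetical helper name:

```shell
# has_module is a hypothetical helper: it succeeds only if the given
# interpreter can import the given module.
has_module() {
  local py="$1" module="$2"
  "$py" -c "import $module" >/dev/null 2>&1
}

for m in six numpy; do
  if has_module python3 "$m"; then
    echo "$m: importable"
  else
    echo "$m: NOT importable by python3"
  fi
done
```

Passing a full interpreter path instead of python3 checks one specific installation, for example the one inside the virtual env.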
Ignore the notes below; they are just various commands I ran along the way:
(libri_env) [ec2-user@ip-172-31-6-113 s5]$ pip3 install sequitur
Requirement already satisfied: sequitur in /home/ec2-user/libri_env/lib/python3.7/site-packages (1.2.4)
Requirement already satisfied: torch in /home/ec2-user/libri_env/lib/python3.7/site-packages (from sequitur) (1.10.0)
Requirement already satisfied: typing-extensions in /home/ec2-user/libri_env/lib/python3.7/site-packages (from torch->sequitur) (4.0.1)
(libri_env) [ec2-user@ip-172-31-6-113 s5]$ cd data/local/dict/g2p/
(libri_env) [ec2-user@ip-172-31-6-113 g2p]$ ls
lexicon_autogen.1.tmp lexicon_autogen.2.tmp lexicon_autogen.3.tmp lexicon_autogen.4.tmp log vocab_autogen.1 vocab_autogen.2 vocab_autogen.3 vocab_autogen.4 vocab_autogen.full
(libri_env) [ec2-user@ip-172-31-6-113 g2p]$ ls -alh
total 2.3M
drwxrwxr-x 3 ec2-user ec2-user 251 Dec 14 06:11 .
drwxrwxr-x 4 ec2-user ec2-user 58 Dec 14 05:56 ..
-rw-rw-r-- 1 ec2-user ec2-user 0 Dec 14 06:24 lexicon_autogen.1.tmp
-rw-rw-r-- 1 ec2-user ec2-user 0 Dec 14 06:24 lexicon_autogen.2.tmp
-rw-rw-r-- 1 ec2-user ec2-user 0 Dec 14 06:24 lexicon_autogen.3.tmp
-rw-rw-r-- 1 ec2-user ec2-user 0 Dec 14 06:24 lexicon_autogen.4.tmp
drwxrwxr-x 2 ec2-user ec2-user 74 Dec 14 05:56 log
-rw-rw-r-- 1 ec2-user ec2-user 296K Dec 14 06:24 vocab_autogen.1
-rw-rw-r-- 1 ec2-user ec2-user 288K Dec 14 06:24 vocab_autogen.2
-rw-rw-r-- 1 ec2-user ec2-user 291K Dec 14 06:24 vocab_autogen.3
-rw-rw-r-- 1 ec2-user ec2-user 291K Dec 14 06:24 vocab_autogen.4
-rw-rw-r-- 1 ec2-user ec2-user 1.2M Dec 14 06:24 vocab_autogen.full
(libri_env) [ec2-user@ip-172-31-6-113 g2p]$ head vocab_autogen.1
A''S
A'BODY
A'COURT
A'D
A'GHA
A'GOIN
A'LL
A'M
A'MIGHTY
A'MIGHTY'S
(libri_env) [ec2-user@ip-172-31-6-113 g2p]$ wc -l vocab_autogen.1
(libri_env) [ec2-user@ip-172-31-6-113 s5]$ pip3 install sequitur-g2p
Collecting sequitur-g2p
Downloading sequitur_g2p-1.0.1668.21-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.2 MB)
|████████████████████████████████| 1.2 MB 6.5 MB/s
Requirement already satisfied: six>=1.11.0 in /home/ec2-user/libri_env/lib/python3.7/site-packages (from sequitur-g2p) (1.16.0)
Requirement already satisfied: numpy>=1.14.2 in /home/ec2-user/libri_env/lib/python3.7/site-packages (from sequitur-g2p) (1.21.4)
Installing collected packages: sequitur-g2p
Successfully installed sequitur-g2p-1.0.1668.21
(libri_env) [ec2-user@ip-172-31-6-113 s5]$
# local/g2p.sh data/local/dict/g2p/vocab_autogen.1 data/g2p data/local/dict/g2p/lexicon_autogen.1
# Started at Tue Dec 14 06:41:22 UTC 2021
#
python3.7
vocab
data/local/dict/g2p/vocab_autogen.1
g2p
data/g2p
lex
data/local/dict/g2p/lexicon_autogen.1
Traceback (most recent call last):
File "/home/ec2-user/kaldi/egs/librispeech/s5/../../../tools/sequitur-g2p/g2p.py", line 39, in <module>
import SequiturTool
File "/home/ec2-user/kaldi/tools/sequitur-g2p/SequiturTool.py", line 32, in <module>
from six.moves import cPickle as pickle
ModuleNotFoundError: No module named 'six'
# Accounting: time=0 threads=1
# Ended (code 1) at Tue Dec 14 06:41:22 UTC 2021, elapsed time 0 seconds
/home/ec2-user/kaldi/egs/librispeech/s5/../../../tools/sequitur-g2p/lib/
(libri_env) [ec2-user@ip-172-31-6-113 s5]$ cp -r /home/ec2-user/libri_env/lib/python3.7 /home/ec2-user/kaldi/egs/librispeech/s5/../../../tools/sequitur-g2p/lib/