P7: Librispeech g2p error, strange fix.

This fix is not recommended. In hindsight, it was probably just a path issue that could have been fixed more simply.

While running prepare_dict.sh, I got stuck getting g2p to work.

(libri_env) [ec2-user@ip-172-31-6-113 s5]$ ./local/prepare_dict.sh data/lm data/g2p data/local/dict
Downloading and preparing CMUdict
Removing the pronunciation variant markers ...
Autogenerating pronunciations for the words in data/local/dict/g2p/vocab_autogen.* ...
run.pl: 4 / 4 failed, log is in data/local/dict/g2p/log/g2p.*.log
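
The run.pl summary only says that the jobs failed; the actual reason is in the per-job logs. A minimal sketch for listing the logs that did not finish with exit code 0, assuming each log ends with a "# Ended (code N)" trailer like the ones shown later in these notes (failed_logs is a made-up helper name, not part of Kaldi):

```shell
# failed_logs is a hypothetical helper, not part of Kaldi.
# It prints every g2p.*.log in the given directory that lacks an
# "# Ended (code 0)" trailer, i.e. every job that did not succeed.
failed_logs() {
  log_dir="$1"
  for f in "$log_dir"/g2p.*.log; do
    [ -e "$f" ] || continue              # the glob matched nothing
    if ! grep -q '^# Ended (code 0)' "$f"; then
      echo "$f"
    fi
  done
}

# Usage: show the last lines of each failed log.
# for f in $(failed_logs data/local/dict/g2p/log); do
#   echo "== $f"; tail -n 5 "$f"
# done
```

Looking at the tail of each log is usually enough, since the Python traceback or the "Can't find ..." message sits right above the trailer.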

The problem appears to be that g2p.py needs Python 3 (python3.7 here), while I was running it with python2.7.
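
A quick way to confirm which Python a script would pick up is to ask the interpreter for its version. A minimal sketch, where python_major is a hypothetical helper that pulls the major version out of "python --version"-style output:

```shell
# python_major is a hypothetical helper: it takes a version string such as
# "Python 3.7.10" and prints the major version number ("3").
# It prints nothing if the string does not look like a Python version.
python_major() {
  echo "$1" | sed -n 's/^Python \([0-9][0-9]*\)\..*/\1/p'
}

# Usage (python2 prints its version to stderr, hence 2>&1):
# python_major "$(python --version 2>&1)"
```

The SyntaxError on print(..., file=stderr) in the logs is what Python 2 reports when it meets Python 3 print syntax, so a result of 2 here would be consistent with that error.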

Solution: copying the whole /home/ec2-user/libri_env/lib/python3.7 folder to /home/ec2-user/kaldi/egs/librispeech/s5/../../../tools/sequitur-g2p/lib/ fixed the g2p error.

[ec2-user@ip-172-31-6-113 ~]$ python3 -m venv libri_env
[ec2-user@ip-172-31-6-113 ~]$ source libri_env/bin/activate
(libri_env) [ec2-user@ip-172-31-6-113 ~]$ python3 -m pip install --upgrade pip
(libri_env) [ec2-user@ip-172-31-6-113 s5]$ pip3 install sequitur
(libri_env) [ec2-user@ip-172-31-6-113 s5]$ pip3 install numpy
(libri_env) [ec2-user@ip-172-31-6-113 s5]$ pip3 install sequitur-g2p
(libri_env) [ec2-user@ip-172-31-6-113 s5]$ pip3 install six
(libri_env) [ec2-user@ip-172-31-6-113 s5]$ cp -r /home/ec2-user/libri_env/lib/python3.7 /home/ec2-user/kaldi/egs/librispeech/s5/../../../tools/sequitur-g2p/lib/
(libri_env) [ec2-user@ip-172-31-6-113 s5]$ ./local/prepare_dict.sh data/lm data/g2p data/local/dict
Downloading and preparing CMUdict
Removing the pronunciation variant markers ...
Autogenerating pronunciations for the words in data/local/dict/g2p/vocab_autogen.* ...
134746 pronunciations autogenerated OK
Combining the CMUdict pronunciations with the autogenerated ones ...
Combined lexicon saved to 'data/local/dict/lexicon_raw_nosil.txt'
Preparing phone lists and clustering questions
2 silence phones saved to: data/local/dict/silence_phones.txt
1 optional silence saved to: data/local/dict/optional_silence.txt
39 non-silence phones saved to: data/local/dict/nonsilence_phones.txt
5 extra triphone clustering-related questions saved to: data/local/dict/extra_questions.txt
Lexicon text file saved as: data/local/dict/lexicon.txt

It worked!
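
Copying the whole python3.7 folder duplicates the virtual environment inside the Kaldi tools tree. If the failure really was only a path issue, a symlink might achieve the same result without the copy. This is an untested sketch, and link_python_lib is a made-up helper name:

```shell
# link_python_lib is a hypothetical helper, not part of Kaldi or Sequitur.
# It links <venv_lib>/<py> to <target_lib>/<py> instead of copying it,
# so that <target_lib>/<py>/site-packages resolves to the venv's packages.
link_python_lib() {
  venv_lib="$1"     # e.g. /home/ec2-user/libri_env/lib
  target_lib="$2"   # e.g. /home/ec2-user/kaldi/tools/sequitur-g2p/lib
  py="$3"           # e.g. python3.7
  mkdir -p "$target_lib" || return 1
  # Never clobber an existing directory or link.
  if [ -e "$target_lib/$py" ] || [ -L "$target_lib/$py" ]; then
    echo "refusing to overwrite $target_lib/$py" >&2
    return 1
  fi
  ln -s "$venv_lib/$py" "$target_lib/$py"
}

# Usage, with the paths from these notes:
# link_python_lib /home/ec2-user/libri_env/lib \
#   /home/ec2-user/kaldi/tools/sequitur-g2p/lib python3.7
```

Unlike the copy, a link keeps following the virtual environment, so packages installed later with pip3 would be visible without repeating the step.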

It is helpful to see the tree structure of the data folder after a successful run:

(libri_env) [ec2-user@ip-172-31-6-113 s5]$ tree -L 3 data 
data
├── g2p
│   └── g2p-model-5
├── LibriSpeech
│   ├── BOOKS.TXT
│   ├── CHAPTERS.TXT
│   ├── LICENSE.TXT
│   ├── README.TXT
│   ├── SPEAKERS.TXT
│   └── test-clean
│       ├── 1089
│       ├── 1188
│       ├── 121
│       ├── 1221
│       ├── 1284
│       ├── 1320
│       ├── 1580
│       ├── 1995
│       ├── 2094
│       ├── 2300
│       ├── 237
│       ├── 260
│       ├── 2830
│       ├── 2961
│       ├── 3570
│       ├── 3575
│       ├── 3729
│       ├── 4077
│       ├── 4446
│       ├── 4507
│       ├── 4970
│       ├── 4992
│       ├── 5105
│       ├── 5142
│       ├── 5639
│       ├── 5683
│       ├── 61
│       ├── 672
│       ├── 6829
│       ├── 6930
│       ├── 7021
│       ├── 7127
│       ├── 7176
│       ├── 7729
│       ├── 8224
│       ├── 8230
│       ├── 8455
│       ├── 8463
│       ├── 8555
│       └── 908
├── lm
│   ├── 3-gram.arpa.gz
│   ├── 3-gram.pruned.1e-7.arpa.gz
│   ├── 3-gram.pruned.3e-7.arpa.gz
│   ├── 4-gram.arpa.gz
│   ├── cmudict
│   │   ├── 00README_FIRST.txt
│   │   ├── cmudict.0.6d
│   │   ├── cmudict.0.7a
│   │   ├── cmudict.0.7a.phones
│   │   ├── cmudict.0.7a.symbols
│   │   ├── README.developer
│   │   ├── README.old
│   │   ├── README.weide
│   │   ├── scripts
│   │   └── sphinxdict
│   ├── cmudict.0.7a.plain
│   ├── g2p
│   │   ├── lexicon_autogen.1.tmp
│   │   ├── lexicon_autogen.2.tmp
│   │   ├── lexicon_autogen.3.tmp
│   │   ├── lexicon_autogen.4.tmp
│   │   ├── log
│   │   ├── vocab_autogen.1
│   │   ├── vocab_autogen.2
│   │   ├── vocab_autogen.3
│   │   ├── vocab_autogen.4
│   │   └── vocab_autogen.full
│   ├── librispeech-lexicon.txt
│   ├── librispeech-lm-corpus.tgz
│   ├── librispeech-vocab.txt
│   ├── lm_fglarge.arpa.gz -> 4-gram.arpa.gz
│   ├── lm_tglarge.arpa.gz -> 3-gram.arpa.gz
│   ├── lm_tgmed.arpa.gz -> 3-gram.pruned.1e-7.arpa.gz
│   └── lm_tgsmall.arpa.gz -> 3-gram.pruned.3e-7.arpa.gz
└── local
  └── dict
      ├── cmudict
      ├── cmudict.0.7a.plain
      ├── extra_questions.txt
      ├── g2p
      ├── lexicon_autogen.txt
      ├── lexicon_raw_nosil.txt
      ├── lexicon.txt
      ├── nonsilence_phones.txt
      ├── optional_silence.txt
      ├── silence_phones.txt
      └── vocab_autogen.txt

53 directories, 44 files
(libri_env) [ec2-user@ip-172-31-6-113 s5]$

The same tree one level deeper:

(libri_env) [ec2-user@ip-172-31-6-113 s5]$ tree -L 4 data 
data
├── g2p
│   └── g2p-model-5
├── LibriSpeech
│   ├── BOOKS.TXT
│   ├── CHAPTERS.TXT
│   ├── LICENSE.TXT
│   ├── README.TXT
│   ├── SPEAKERS.TXT
│   └── test-clean
│       ├── 1089
│       │   ├── 134686
│       │   └── 134691
│       ├── 1188
│       │   └── 133604
│       ├── 121
│       │   ├── 121726
│       │   ├── 123852
│       │   ├── 123859
│       │   └── 127105
│       ├── 1221
│       │   ├── 135766
│       │   └── 135767
│       ├── 1284
│       │   ├── 1180
│       │   ├── 1181
│       │   └── 134647
│       ├── 1320
│       │   ├── 122612
│       │   └── 122617
│       ├── 1580
│       │   ├── 141083
│       │   └── 141084
│       ├── 1995
│       │   ├── 1826
│       │   ├── 1836
│       │   └── 1837
│       ├── 2094
│       │   └── 142345
│       ├── 2300
│       │   └── 131720
│       ├── 237
│       │   ├── 126133
│       │   ├── 134493
│       │   └── 134500
│       ├── 260
│       │   ├── 123286
│       │   ├── 123288
│       │   └── 123440
│       ├── 2830
│       │   ├── 3979
│       │   └── 3980
│       ├── 2961
│       │   ├── 960
│       │   └── 961
│       ├── 3570
│       │   ├── 5694
│       │   ├── 5695
│       │   └── 5696
│       ├── 3575
│       │   └── 170457
│       ├── 3729
│       │   └── 6852
│       ├── 4077
│       │   ├── 13751
│       │   └── 13754
│       ├── 4446
│       │   ├── 2271
│       │   ├── 2273
│       │   └── 2275
│       ├── 4507
│       │   └── 16021
│       ├── 4970
│       │   ├── 29093
│       │   └── 29095
│       ├── 4992
│       │   ├── 23283
│       │   ├── 41797
│       │   └── 41806
│       ├── 5105
│       │   ├── 28233
│       │   ├── 28240
│       │   └── 28241
│       ├── 5142
│       │   ├── 33396
│       │   ├── 36377
│       │   ├── 36586
│       │   └── 36600
│       ├── 5639
│       │   └── 40744
│       ├── 5683
│       │   ├── 32865
│       │   ├── 32866
│       │   └── 32879
│       ├── 61
│       │   ├── 70968
│       │   └── 70970
│       ├── 672
│       │   └── 122797
│       ├── 6829
│       │   ├── 68769
│       │   └── 68771
│       ├── 6930
│       │   ├── 75918
│       │   ├── 76324
│       │   └── 81414
│       ├── 7021
│       │   ├── 79730
│       │   ├── 79740
│       │   ├── 79759
│       │   └── 85628
│       ├── 7127
│       │   ├── 75946
│       │   └── 75947
│       ├── 7176
│       │   ├── 88083
│       │   └── 92135
│       ├── 7729
│       │   └── 102255
│       ├── 8224
│       │   ├── 274381
│       │   └── 274384
│       ├── 8230
│       │   └── 279154
│       ├── 8455
│       │   └── 210777
│       ├── 8463
│       │   ├── 287645
│       │   ├── 294825
│       │   └── 294828
│       ├── 8555
│       │   ├── 284447
│       │   ├── 284449
│       │   └── 292519
│       └── 908
│           ├── 157963
│           └── 31957
├── lm
│   ├── 3-gram.arpa.gz
│   ├── 3-gram.pruned.1e-7.arpa.gz
│   ├── 3-gram.pruned.3e-7.arpa.gz
│   ├── 4-gram.arpa.gz
│   ├── cmudict
│   │   ├── 00README_FIRST.txt
│   │   ├── cmudict.0.6d
│   │   ├── cmudict.0.7a
│   │   ├── cmudict.0.7a.phones
│   │   ├── cmudict.0.7a.symbols
│   │   ├── README.developer
│   │   ├── README.old
│   │   ├── README.weide
│   │   ├── scripts
│   │   │   ├── CompileDictionary.sh
│   │   │   ├── make_baseform.pl
│   │   │   ├── README.txt
│   │   │   ├── sort_cmudict.pl
│   │   │   ├── test_cmudict.pl
│   │   │   └── test_dict.pl
│   │   └── sphinxdict
│   │       ├── cmudict.0.7a_SPHINX_40
│   │       ├── cmudict_SPHINX_40
│   │       ├── README.txt
│   │       └── SphinxPhones_40
│   ├── cmudict.0.7a.plain
│   ├── g2p
│   │   ├── lexicon_autogen.1.tmp
│   │   ├── lexicon_autogen.2.tmp
│   │   ├── lexicon_autogen.3.tmp
│   │   ├── lexicon_autogen.4.tmp
│   │   ├── log
│   │   │   ├── g2p.1.log
│   │   │   ├── g2p.2.log
│   │   │   ├── g2p.3.log
│   │   │   └── g2p.4.log
│   │   ├── vocab_autogen.1
│   │   ├── vocab_autogen.2
│   │   ├── vocab_autogen.3
│   │   ├── vocab_autogen.4
│   │   └── vocab_autogen.full
│   ├── librispeech-lexicon.txt
│   ├── librispeech-lm-corpus.tgz
│   ├── librispeech-vocab.txt
│   ├── lm_fglarge.arpa.gz -> 4-gram.arpa.gz
│   ├── lm_tglarge.arpa.gz -> 3-gram.arpa.gz
│   ├── lm_tgmed.arpa.gz -> 3-gram.pruned.1e-7.arpa.gz
│   └── lm_tgsmall.arpa.gz -> 3-gram.pruned.3e-7.arpa.gz
└── local
  └── dict
      ├── cmudict
      │   ├── 00README_FIRST.txt
      │   ├── cmudict.0.6d
      │   ├── cmudict.0.7a
      │   ├── cmudict.0.7a.phones
      │   ├── cmudict.0.7a.symbols
      │   ├── README.developer
      │   ├── README.old
      │   ├── README.weide
      │   ├── scripts
      │   └── sphinxdict
      ├── cmudict.0.7a.plain
      ├── extra_questions.txt
      ├── g2p
      │   ├── lexicon_autogen.1
      │   ├── lexicon_autogen.2
      │   ├── lexicon_autogen.3
      │   ├── lexicon_autogen.4
      │   ├── log
      │   ├── vocab_autogen.1
      │   ├── vocab_autogen.2
      │   ├── vocab_autogen.3
      │   ├── vocab_autogen.4
      │   └── vocab_autogen.full
      ├── lexicon_autogen.txt
      ├── lexicon_raw_nosil.txt
      ├── lexicon.txt
      ├── nonsilence_phones.txt
      ├── optional_silence.txt
      ├── silence_phones.txt
      └── vocab_autogen.txt

143 directories, 75 files
(libri_env) [ec2-user@ip-172-31-6-113 s5]$


The various error messages I ran into are below:

# local/g2p.sh data/local/dict/g2p/vocab_autogen.1 data/local/lm data/local/dict/g2p/lexicon_autogen.1 
# Started at Sun Dec 12 23:54:31 UTC 2021
#
File "/home/ec2-user/kaldi/egs/librispeech/s5/../../../tools/sequitur-g2p/g2p.py", line 141
  print("unknown format in file: %s" % (line), file=stderr)
                                                    ^
SyntaxError: invalid syntax
# Accounting: time=0 threads=1
# Ended (code 1) at Sun Dec 12 23:54:31 UTC 2021, elapsed time 0 seconds

After creating a python3.7 environment and running the script again, the same problem persisted.

[ec2-user@ip-172-31-6-113 s5]$ sudo yum install numpy
[ec2-user@ip-172-31-6-113 ~]$ python3 -m venv libri_env
[ec2-user@ip-172-31-6-113 ~]$ source libri_env/bin/activate
(libri_env) [ec2-user@ip-172-31-6-113 ~]$ python3 -m pip install --upgrade pip
(libri_env) [ec2-user@ip-172-31-6-113 ~]$

Trying to mimic the sample invocation from the script's usage message:

#/home/ec2-user/kaldi/egs/librispeech/s5/data/g2p
"e.g.: /export/a15/vpanayotov/data/lm /export/a15/vpanayotov/data/g2p data/local/dict"
./local/prepare_dict.sh data/local/lm data/dict/g2p data/local/dict
(libri_env) [ec2-user@ip-172-31-6-113 s5]$ ./local/prepare_dict.sh data/local/lm data/g2p data/local/dict

A different error appears when I use python3.7:

# local/g2p.sh data/local/dict/g2p/vocab_autogen.2 /home/ec2-user/kaldi/egs/librispeech/s5/data/g2p data/local/dict/g2p/lexicon_autogen.2 
# Started at Tue Dec 14 05:40:29 UTC 2021
#
Can't find '/home/ec2-user/kaldi/egs/librispeech/s5/../../../tools/sequitur-g2p/lib/python3.7/site-packages' - please fix your Sequitur installation
# Accounting: time=0 threads=1
# Ended (code 1) at Tue Dec 14 05:40:29 UTC 2021, elapsed time 0 seconds
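
Judging from the path in this message, the script appears to look for lib/<interpreter name>/site-packages next to g2p.py. A minimal sketch for rebuilding that path and checking it by hand; the layout is inferred from the error text only, and expected_site_packages is a hypothetical helper:

```shell
# expected_site_packages is a hypothetical helper. It rebuilds the directory
# named in the "Can't find ..." message:
#   <directory of g2p.py>/lib/<python>/site-packages
# This layout is inferred from the error text, not from the Kaldi sources.
expected_site_packages() {
  g2p_py="$1"    # path to g2p.py
  python="$2"    # interpreter name, e.g. python3.7 (may be empty)
  echo "$(dirname "$g2p_py")/lib/$python/site-packages"
}

# Usage:
# dir=$(expected_site_packages /home/ec2-user/kaldi/tools/sequitur-g2p/g2p.py python3.7)
# [ -d "$dir" ] || echo "missing: $dir"
```

With an empty interpreter name this yields a path containing lib//site-packages, which matches the second variant of the error further down.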

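Installing sequitur into the python3.7 virtual env (note that this PyPI package pulls in torch and appears to be unrelated to Sequitur G2P; the package that matters is sequitur-g2p, installed further down):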

(libri_env) [ec2-user@ip-172-31-6-113 s5]$ pip3 install sequitur

output:

(libri_env) [ec2-user@ip-172-31-6-113 s5]$ pip3 install sequitur
Collecting sequitur
Downloading sequitur-1.2.4.tar.gz (11 kB)
Preparing metadata (setup.py) ... done
Collecting torch
Downloading torch-1.10.0-cp37-cp37m-manylinux1_x86_64.whl (881.9 MB)
    |████████████████████████████████| 881.9 MB 12 kB/s              
Collecting typing-extensions
Downloading typing_extensions-4.0.1-py3-none-any.whl (22 kB)
Using legacy 'setup.py install' for sequitur, since package 'wheel' is not installed.
Installing collected packages: typing-extensions, torch, sequitur
  Running setup.py install for sequitur ... done
Successfully installed sequitur-1.2.4 torch-1.10.0 typing-extensions-4.0.1
(libri_env) [ec2-user@ip-172-31-6-113 s5]$
sequitur_model=$g2p_model_dir/g2p-model-5
(libri_env) [ec2-user@ip-172-31-6-113 s5]$ ./local/prepare_dict.sh data/lm data/g2p data/local/dict
# local/g2p.sh data/local/dict/g2p/vocab_autogen.2 data/g2p data/local/dict/g2p/lexicon_autogen.2 
# Started at Tue Dec 14 06:04:33 UTC 2021
#
Can't find '/home/ec2-user/kaldi/egs/librispeech/s5/../../../tools/sequitur-g2p/lib//site-packages' - please fix your Sequitur installation
# Accounting: time=0 threads=1
# Ended (code 1) at Tue Dec 14 06:04:33 UTC 2021, elapsed time 0 seconds
(libri_env) [ec2-user@ip-172-31-6-113 s5]$ ./local/prepare_dict.sh data/lm data/g2p data/local/dict
Downloading and preparing CMUdict
Removing the pronunciation variant markers ...
Sequitur G2P not found- running /home/ec2-user/kaldi/egs/librispeech/s5/../../../tools/extra/install_sequitur.sh
~/kaldi/tools ~/kaldi/egs/librispeech/s5
extras/install_sequitur.sh : ERROR: python-devel/python-dev not installed
extras/install_sequitur.sh: we recommend that you run (our best guess):
sudo yum install python-devel
(libri_env) [ec2-user@ip-172-31-6-113 s5]$
# local/g2p.sh data/local/dict/g2p/vocab_autogen.1 data/g2p data/local/dict/g2p/lexicon_autogen.1 
# Started at Tue Dec 14 06:11:13 UTC 2021
#
Traceback (most recent call last):
File "/home/ec2-user/kaldi/egs/librispeech/s5/../../../tools/sequitur-g2p/g2p.py", line 39, in <module>
  import SequiturTool
File "/home/ec2-user/kaldi/tools/sequitur-g2p/SequiturTool.py", line 32, in <module>
  from six.moves import cPickle as pickle
ModuleNotFoundError: No module named 'six'
# Accounting: time=0 threads=1
# Ended (code 1) at Tue Dec 14 06:11:13 UTC 2021, elapsed time 0 seconds
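
A ModuleNotFoundError like this one says that the interpreter which actually ran g2p.py could not import six, which is not necessarily the interpreter pip3 installed into. A minimal sketch for checking a specific interpreter, where has_module is a hypothetical helper:

```shell
# has_module is a hypothetical helper: it succeeds when the given
# interpreter can import the given module, and fails otherwise
# (including when the interpreter itself cannot be found).
has_module() {
  interpreter="$1"   # e.g. python3.7
  module="$2"        # e.g. six
  "$interpreter" -c "import $module" >/dev/null 2>&1
}

# Usage:
# has_module python3.7 six || echo "six is not importable by python3.7"
# command -v python3.7    # shows which python3.7 comes first on PATH
```

Running the check with the same interpreter name the script uses (python3.7 in these logs) is the point; checking plain python3 may test a different installation.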

The notes below can be ignored; they are just various commands I ran along the way:

(libri_env) [ec2-user@ip-172-31-6-113 s5]$ pip3 install sequitur
Requirement already satisfied: sequitur in /home/ec2-user/libri_env/lib/python3.7/site-packages (1.2.4)
Requirement already satisfied: torch in /home/ec2-user/libri_env/lib/python3.7/site-packages (from sequitur) (1.10.0)
Requirement already satisfied: typing-extensions in /home/ec2-user/libri_env/lib/python3.7/site-packages (from torch->sequitur) (4.0.1)
(libri_env) [ec2-user@ip-172-31-6-113 s5]$ cd data/local/dict/g2p/
(libri_env) [ec2-user@ip-172-31-6-113 g2p]$ ls
lexicon_autogen.1.tmp lexicon_autogen.2.tmp lexicon_autogen.3.tmp lexicon_autogen.4.tmp log vocab_autogen.1 vocab_autogen.2 vocab_autogen.3 vocab_autogen.4 vocab_autogen.full
(libri_env) [ec2-user@ip-172-31-6-113 g2p]$ ls -alh
total 2.3M
drwxrwxr-x 3 ec2-user ec2-user  251 Dec 14 06:11 .
drwxrwxr-x 4 ec2-user ec2-user   58 Dec 14 05:56 ..
-rw-rw-r-- 1 ec2-user ec2-user    0 Dec 14 06:24 lexicon_autogen.1.tmp
-rw-rw-r-- 1 ec2-user ec2-user    0 Dec 14 06:24 lexicon_autogen.2.tmp
-rw-rw-r-- 1 ec2-user ec2-user    0 Dec 14 06:24 lexicon_autogen.3.tmp
-rw-rw-r-- 1 ec2-user ec2-user    0 Dec 14 06:24 lexicon_autogen.4.tmp
drwxrwxr-x 2 ec2-user ec2-user   74 Dec 14 05:56 log
-rw-rw-r-- 1 ec2-user ec2-user 296K Dec 14 06:24 vocab_autogen.1
-rw-rw-r-- 1 ec2-user ec2-user 288K Dec 14 06:24 vocab_autogen.2
-rw-rw-r-- 1 ec2-user ec2-user 291K Dec 14 06:24 vocab_autogen.3
-rw-rw-r-- 1 ec2-user ec2-user 291K Dec 14 06:24 vocab_autogen.4
-rw-rw-r-- 1 ec2-user ec2-user 1.2M Dec 14 06:24 vocab_autogen.full
(libri_env) [ec2-user@ip-172-31-6-113 g2p]$ head vocab_autogen.1
A''S
A'BODY
A'COURT
A'D
A'GHA
A'GOIN
A'LL
A'M
A'MIGHTY
A'MIGHTY'S
(libri_env) [ec2-user@ip-172-31-6-113 g2p]$ wc -l vocab_autogen.1
(libri_env) [ec2-user@ip-172-31-6-113 s5]$ pip3 install sequitur-g2p
Collecting sequitur-g2p
Downloading sequitur_g2p-1.0.1668.21-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.2 MB)
    |████████████████████████████████| 1.2 MB 6.5 MB/s            
Requirement already satisfied: six>=1.11.0 in /home/ec2-user/libri_env/lib/python3.7/site-packages (from sequitur-g2p) (1.16.0)
Requirement already satisfied: numpy>=1.14.2 in /home/ec2-user/libri_env/lib/python3.7/site-packages (from sequitur-g2p) (1.21.4)
Installing collected packages: sequitur-g2p
Successfully installed sequitur-g2p-1.0.1668.21
(libri_env) [ec2-user@ip-172-31-6-113 s5]$
# local/g2p.sh data/local/dict/g2p/vocab_autogen.1 data/g2p data/local/dict/g2p/lexicon_autogen.1 
# Started at Tue Dec 14 06:41:22 UTC 2021
#
python3.7
vocab
data/local/dict/g2p/vocab_autogen.1
g2p
data/g2p
lex
data/local/dict/g2p/lexicon_autogen.1
Traceback (most recent call last):
File "/home/ec2-user/kaldi/egs/librispeech/s5/../../../tools/sequitur-g2p/g2p.py", line 39, in <module>
  import SequiturTool
File "/home/ec2-user/kaldi/tools/sequitur-g2p/SequiturTool.py", line 32, in <module>
  from six.moves import cPickle as pickle
ModuleNotFoundError: No module named 'six'
# Accounting: time=0 threads=1
# Ended (code 1) at Tue Dec 14 06:41:22 UTC 2021, elapsed time 0 seconds
/home/ec2-user/kaldi/egs/librispeech/s5/../../../tools/sequitur-g2p/lib/

(libri_env) [ec2-user@ip-172-31-6-113 s5]$ cp -r /home/ec2-user/libri_env/lib/python3.7 /home/ec2-user/kaldi/egs/librispeech/s5/../../../tools/sequitur-g2p/lib/ 
