Because I only have one GPU, I got the error below. The original script assumes multiple GPUs are available. The fix is suggested in the error message itself.
ERROR (nnet3-chain-train[5.5.997~1-054af]:AllocateNewRegion():cu-allocator.cc:491) Failed to allocate a memory region of 12582912 bytes. Possibly this is due to sharing the GPU. Try switching the GPUs to exclusive mode (nvidia-smi -c 3) and using the option --use-gpu=wait to scripts like steps/nnet3/chain/train.py. Memory info: free:10M, used:15099M, total:15109M, free/total:0.000711461 CUDA error: 'out of memory'
The error message explains how to fix it. First, switch the GPU to exclusive-process mode by running the command below:
sudo nvidia-smi -c 3
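To check that the mode change took effect (the setting typically does not survive a reboot), a query like the one below should report Exclusive_Process for the GPU; the field names come from nvidia-smi's --query-gpu option:
nvidia-smi --query-gpu=index,compute_mode --format=csv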
Then add the line below to the steps/nnet3/chain/train.py call in local/chain/run_tdnn.sh; with the GPU in exclusive mode, this option makes each training job wait until the GPU is free instead of failing right away:
--use-gpu=wait
ubuntu@ip-172-31-6-144:~/kaldi/egs/librispeech/s5$ pico local/chain/run_tdnn.sh
The third line, --use-gpu=wait, was added to the existing command:
steps/nnet3/chain/train.py --stage $train_stage \
--cmd "$decode_cmd" \
--use-gpu=wait \
--feat.online-ivector-dir $train_ivector_dir \
--feat.cmvn-opts "--norm-means=false --norm-vars=false" \
--chain.xent-regularize $xent_regularize \
--chain.leaky-hmm-coefficient 0.1 \
--chain.l2-regularize 0.0 \
--chain.apply-deriv-weights false \
--chain.lm-opts="--num-extra-lm-states=2000" \
--egs.dir "$common_egs_dir" \
--egs.stage $get_egs_stage \
--egs.opts "--frames-overlap-per-eg 0 --constrained false" \
--egs.chunk-width $frames_per_eg \
--trainer.dropout-schedule $dropout_schedule \
--trainer.add-option="--optimization.memory-compression-level=2" \
--trainer.num-chunk-per-minibatch 64 \
--trainer.frames-per-iter 2500000 \
--trainer.num-epochs 4 \
--trainer.optimization.num-jobs-initial 3 \
--trainer.optimization.num-jobs-final 16 \
--trainer.optimization.initial-effective-lrate 0.00015 \
--trainer.optimization.final-effective-lrate 0.000015 \
--trainer.max-param-change 2.0 \
--cleanup.remove-egs $remove_egs \
--feat-dir $train_data_dir \
--tree-dir $tree_dir \
--lat-dir $lat_dir \
--dir $dir || exit 1;
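After saving the file, a quick sanity check (a minimal sketch, assuming the option was added on its own continuation line as shown above):
grep -n -- '--use-gpu=wait' local/chain/run_tdnn.sh
If grep prints the line, rerunning the script should no longer abort with the allocation error; training jobs that cannot grab the GPU will simply wait for it.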