ERROR (nnet3-chain-train[5.5.997~1-054af]: Possibly this is due to sharing the GPU.

 

Because I only have one GPU, I got the error below. The original script assumes that multiple GPUs are available.

The fix is suggested in the error message itself.



ERROR (nnet3-chain-train[5.5.997~1-054af]:AllocateNewRegion():cu-allocator.cc:491) Failed to allocate a memory region of 12582912 bytes.  Possibly this is due to sharing the GPU.  Try switching the GPUs to exclusive mode (nvidia-smi -c 3) and using the option --use-gpu=wait to scripts like steps/nnet3/chain/train.py.  Memory info: free:10M, used:15099M, total:15109M, free/total:0.000711461 CUDA error: 'out of memory'
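The Memory info field in the message shows only 10M free out of 15109M, so the GPU was already almost fully in use when this job asked for another 12582912 bytes (12 MiB). As a minimal sketch, the numbers can be pulled out of such a log line with sed. The function name gpu_mem_info is hypothetical and not part of Kaldi:

```shell
# Hypothetical helper (not part of Kaldi): read log lines on stdin and
# print the free/used/total values (in MB) from a "Memory info" field.
gpu_mem_info() {
  sed -n 's/.*Memory info: free:\([0-9]*\)M, used:\([0-9]*\)M, total:\([0-9]*\)M.*/free=\1 used=\2 total=\3/p'
}

# Example using the relevant part of the error message above.
line="Memory info: free:10M, used:15099M, total:15109M, free/total:0.000711461 CUDA error: 'out of memory'"
echo "$line" | gpu_mem_info
# prints: free=10 used=15099 total=15109
```

The same function can be fed a whole training log on stdin. Lines without a Memory info field are simply skipped.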


The error explains how to fix it:

Run the command below:

sudo nvidia-smi -c 3

Then add the line below to the steps/nnet3/chain/train.py call in local/chain/run_tdnn.sh:

--use-gpu=wait
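Before rerunning the training, it may be worth confirming that the compute mode actually changed. The sketch below assumes nvidia-smi is on the PATH. The function name show_compute_mode is hypothetical, and the guard only keeps the snippet harmless on a machine without an NVIDIA driver:

```shell
# Hypothetical helper: list each GPU with its current compute mode.
# After "sudo nvidia-smi -c 3" the compute_mode column should read
# Exclusive_Process rather than Default.
show_compute_mode() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,name,compute_mode --format=csv
  else
    echo "nvidia-smi not found"
  fi
}

show_compute_mode
```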

ubuntu@ip-172-31-6-144:~/kaldi/egs/librispeech/s5$ pico local/chain/run_tdnn.sh

The third line below (--use-gpu=wait) was added:

steps/nnet3/chain/train.py --stage $train_stage \
   --cmd "$decode_cmd" \
   --use-gpu=wait \
   --feat.online-ivector-dir $train_ivector_dir \
   --feat.cmvn-opts "--norm-means=false --norm-vars=false" \
   --chain.xent-regularize $xent_regularize \
   --chain.leaky-hmm-coefficient 0.1 \
   --chain.l2-regularize 0.0 \
   --chain.apply-deriv-weights false \
   --chain.lm-opts="--num-extra-lm-states=2000" \
   --egs.dir "$common_egs_dir" \
   --egs.stage $get_egs_stage \
   --egs.opts "--frames-overlap-per-eg 0 --constrained false" \
   --egs.chunk-width $frames_per_eg \
   --trainer.dropout-schedule $dropout_schedule \
   --trainer.add-option="--optimization.memory-compression-level=2" \
   --trainer.num-chunk-per-minibatch 64 \
   --trainer.frames-per-iter 2500000 \
   --trainer.num-epochs 4 \
   --trainer.optimization.num-jobs-initial 3 \
   --trainer.optimization.num-jobs-final 16 \
   --trainer.optimization.initial-effective-lrate 0.00015 \
   --trainer.optimization.final-effective-lrate 0.000015 \
   --trainer.max-param-change 2.0 \
   --cleanup.remove-egs $remove_egs \
   --feat-dir $train_data_dir \
   --tree-dir $tree_dir \
   --lat-dir $lat_dir \
   --dir $dir || exit 1;

