ERROR (nnet3-chain-train[5.5.997~1-054af]: Possibly this is due to sharing the GPU.

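I got the error below because I have only one GPU, while the original script assumes several. It starts multiple training jobs in parallel (note --trainer.optimization.num-jobs-initial 3 in the script), and on a single GPU those jobs share its memory until it runs out.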

The fix is offered in the error message itself.

ERROR (nnet3-chain-train[5.5.997~1-054af]:AllocateNewRegion():cu-allocator.cc:491) Failed to allocate a memory region of 12582912 bytes. Possibly this is due to sharing the GPU. Try switching the GPUs to exclusive mode (nvidia-smi -c 3) and using the option --use-gpu=wait to scripts like steps/nnet3/chain/train.py. Memory info: free:10M, used:15099M, total:15109M, free/total:0.000711461 CUDA error: 'out of memory'

The error explains how to fix it:

Run the command below to switch the GPU to exclusive mode:

sudo nvidia-smi -c 3

Then add the line below to the steps/nnet3/chain/train.py call in local/chain/run_tdnn.sh:

--use-gpu=wait
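Before editing the script, it can help to confirm that the compute mode change took effect. The sketch below assumes nvidia-smi supports the compute_mode query field and reports exclusive mode as Exclusive_Process; gpu_compute_modes and all_gpus_exclusive are hypothetical helper names, not part of Kaldi.

```shell
# Sketch: report whether every GPU is in exclusive process mode
# after running `sudo nvidia-smi -c 3`.
# gpu_compute_modes and all_gpus_exclusive are hypothetical helper names.

# Print one compute mode per GPU, one per line.
gpu_compute_modes() {
    nvidia-smi --query-gpu=compute_mode --format=csv,noheader
}

# Succeed only if the query works and every GPU reports Exclusive_Process.
all_gpus_exclusive() {
    local modes
    modes=$(gpu_compute_modes) || return 1
    [ -n "$modes" ] || return 1
    # grep -qv succeeds if any line lacks the pattern; negate that result.
    ! printf '%s\n' "$modes" | grep -qv 'Exclusive_Process'
}

# Only run the check on machines where nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    if all_gpus_exclusive; then
        echo "GPU is in exclusive process mode"
    else
        echo "GPU is not in exclusive process mode; rerun: sudo nvidia-smi -c 3"
    fi
fi
```

As far as I know, the compute mode set with nvidia-smi -c does not persist across a reboot, so the check is worth repeating after restarting the instance.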

ubuntu@ip-172-31-6-144:~/kaldi/egs/librispeech/s5$ pico local/chain/run_tdnn.sh

The third line of the train.py call, --use-gpu=wait, was added:

steps/nnet3/chain/train.py --stage $train_stage \
    --cmd "$decode_cmd" \
    --use-gpu=wait \
    --feat.online-ivector-dir $train_ivector_dir \
    --feat.cmvn-opts "--norm-means=false --norm-vars=false" \
    --chain.xent-regularize $xent_regularize \
    --chain.leaky-hmm-coefficient 0.1 \
    --chain.l2-regularize 0.0 \
    --chain.apply-deriv-weights false \
    --chain.lm-opts="--num-extra-lm-states=2000" \
    --egs.dir "$common_egs_dir" \
    --egs.stage $get_egs_stage \
    --egs.opts "--frames-overlap-per-eg 0 --constrained false" \
    --egs.chunk-width $frames_per_eg \
    --trainer.dropout-schedule $dropout_schedule \
    --trainer.add-option="--optimization.memory-compression-level=2" \
    --trainer.num-chunk-per-minibatch 64 \
    --trainer.frames-per-iter 2500000 \
    --trainer.num-epochs 4 \
    --trainer.optimization.num-jobs-initial 3 \
    --trainer.optimization.num-jobs-final 16 \
    --trainer.optimization.initial-effective-lrate 0.00015 \
    --trainer.optimization.final-effective-lrate 0.000015 \
    --trainer.max-param-change 2.0 \
    --cleanup.remove-egs $remove_egs \
    --feat-dir $train_data_dir \
    --tree-dir $tree_dir \
    --lat-dir $lat_dir \
    --dir $dir  || exit 1;
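A quick way to double-check the edit before restarting training is to search the script for the new option. This is only a sketch; has_use_gpu_wait is a hypothetical helper name, and the path assumes the command is run from the egs/librispeech/s5 directory as in the prompt above.

```shell
# Sketch: check that the script now passes --use-gpu=wait.
# has_use_gpu_wait is a hypothetical helper name.
has_use_gpu_wait() {
    # "--" stops grep from reading the pattern as an option;
    # -F matches the string literally, -q suppresses output.
    grep -qF -- '--use-gpu=wait' "$1"
}

script=local/chain/run_tdnn.sh
# Skip the check silently if the script is not in the current directory.
if [ -f "$script" ]; then
    if has_use_gpu_wait "$script"; then
        echo "$script passes --use-gpu=wait"
    else
        echo "$script is missing --use-gpu=wait"
    fi
fi
```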

Nadira Povey
