ZeroDivisionError: float division by zero #32

Open

JaniceXiong opened this issue Sep 28, 2021 · 9 comments

Comments
@JaniceXiong

I hit "ZeroDivisionError: float division by zero" when training the model with multiple GPUs. With only 1 GPU the problem disappears, but training is too slow...
The detailed traceback is below; do you have any idea about it?

/home/xjw/code/guided_summarization/src/fairseq/fairseq/optim/adam.py:179: UserWarning: This overload of add_ is deprecated:
add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
add_(Tensor other, *, Number alpha) (Triggered internally at /pytorch/torch/csrc/utils/python_arg_parser.cpp:1025.)
exp_avg.mul_(beta1).add_(1 - beta1, grad)
/home/xjw/code/guided_summarization/src/fairseq/fairseq/optim/adam.py:179: UserWarning: This overload of add_ is deprecated:
add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
add_(Tensor other, *, Number alpha) (Triggered internally at /pytorch/torch/csrc/utils/python_arg_parser.cpp:1025.)
exp_avg.mul_(beta1).add_(1 - beta1, grad)
Traceback (most recent call last):
File "/home/xjw/miniconda3/envs/gsum/bin/fairseq-train", line 33, in
sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
File "/home/xjw/code/guided_summarization/src/fairseq/fairseq_cli/train.py", line 316, in cli_main
nprocs=args.distributed_world_size,
File "/home/xjw/miniconda3/envs/gsum/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/xjw/miniconda3/envs/gsum/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/home/xjw/miniconda3/envs/gsum/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/xjw/miniconda3/envs/gsum/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/home/xjw/code/guided_summarization/src/fairseq/fairseq_cli/train.py", line 283, in distributed_main
main(args, init_distributed=True)
File "/home/xjw/code/guided_summarization/src/fairseq/fairseq_cli/train.py", line 102, in main
train(args, trainer, task, epoch_itr)
File "/home/xjw/miniconda3/envs/gsum/lib/python3.6/contextlib.py", line 52, in inner
return func(*args, **kwds)
File "/home/xjw/code/guided_summarization/src/fairseq/fairseq_cli/train.py", line 178, in train
log_output = trainer.train_step(samples)
File "/home/xjw/miniconda3/envs/gsum/lib/python3.6/contextlib.py", line 52, in inner
return func(*args, **kwds)
File "/home/xjw/code/guided_summarization/src/fairseq/fairseq/trainer.py", line 391, in train_step
logging_output = self._reduce_and_log_stats(logging_outputs, sample_size)
File "/home/xjw/code/guided_summarization/src/fairseq/fairseq/trainer.py", line 718, in _reduce_and_log_stats
self.task.reduce_metrics(logging_outputs, self.get_criterion())
File "/home/xjw/code/guided_summarization/src/fairseq/fairseq/tasks/guided_translation.py", line 307, in reduce_metrics
super().reduce_metrics(logging_outputs, criterion)
File "/home/xjw/code/guided_summarization/src/fairseq/fairseq/tasks/fairseq_task.py", line 406, in reduce_metrics
criterion.__class__.reduce_metrics(logging_outputs)
File "/home/xjw/code/guided_summarization/src/fairseq/fairseq/criterions/label_smoothed_cross_entropy.py", line 95, in reduce_metrics
metrics.log_scalar('loss', loss_sum / sample_size / math.log(2), sample_size, round=3)
ZeroDivisionError: float division by zero
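
For context, the division that fails is the one in the last frame above: reduce_metrics sums loss and sample_size over the per-worker logging_outputs and then divides. A minimal sketch of that aggregation with made-up values (the dicts below are hypothetical; only the final division matches the traceback):

import math

# Hypothetical per-worker logging outputs; in the failing run every entry
# apparently reports sample_size == 0 (or the list is empty on that worker).
logging_outputs = [{"loss": 0.0, "sample_size": 0}]

loss_sum = sum(log.get("loss", 0) for log in logging_outputs)
sample_size = sum(log.get("sample_size", 0) for log in logging_outputs)

# Same division as in label_smoothed_cross_entropy.reduce_metrics (last
# traceback frame); it raises ZeroDivisionError when sample_size is 0.
loss = loss_sum / sample_size / math.log(2)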

@zdou0830
Collaborator

Hi, could you pip install torch==1.4.0 and try again? I suspect this is a version mismatch problem.
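
If it helps to rule out an environment mismatch first, here is a quick check of the installed PyTorch/CUDA versions (standard PyTorch attributes only, nothing project-specific):

import torch

print(torch.__version__)          # installed PyTorch version
print(torch.version.cuda)         # CUDA version this PyTorch build was compiled against
print(torch.cuda.is_available())  # whether PyTorch can see the GPUs
print(torch.cuda.device_count())  # number of visible GPUs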

@JaniceXiong
Author

pip install torch==1.4.0

Thanks for the reply! I ran pip install torch==1.4.0, but the program got stuck at the very beginning. It seems that none of the 4 GPUs were being used correctly.

(two screenshots attached)

@JaniceXiong
Author

And my CUDA version is cuda-11.1.

@zdou0830
Collaborator

Hi, what is your training command?

@JaniceXiong
Author

JaniceXiong commented Sep 30, 2021

Hi, what is your training command?

The training command is below. What CUDA version are you using with PyTorch 1.4.0? I found that CUDA 11.1 is not supported by PyTorch 1.4.0.

TOTAL_NUM_UPDATES=80000
WARMUP_UPDATES=500
LR=3e-05
MAX_TOKENS=1024
DEVICES=3,4,5,6
CUDA_VISIBLE_DEVICES=$DEVICES python train.py $DATA_BIN \
    --restore-file $BART_PATH \
    --max-tokens $MAX_TOKENS \
    --max-sentences 1 \
    --task guided_translation \
    --source-lang source --target-lang target \
    --truncate-source \
    --layernorm-embedding \
    --share-all-embeddings \
    --share-decoder-input-output-embed \
    --reset-optimizer --reset-dataloader --reset-meters \
    --required-batch-size-multiple 1 \
    --arch guided_bart_large \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --dropout 0.1 --attention-dropout 0.1 \
    --weight-decay 0.01 --optimizer adam --adam-betas "(0.9, 0.999)" --adam-eps 1e-08 \
    --clip-norm 0.1 \
    --lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
    --fp16 --update-freq $UPDATE_FREQ \
    --max-epoch 10 \
    --skip-invalid-size-inputs-valid-test \
    --ddp-backend=no_c10d \
    --save-dir $SAVE_DIR \
    --save-interval-updates 2500 \
    --find-unused-parameters;

@zdou0830
Collaborator

zdou0830 commented Sep 30, 2021

maybe you can try removing the option --max-sentences 1?

@liaimi

liaimi commented Dec 9, 2021

Is this resolved? I hit the same issue, and I did not set --max-sentences to 1. Somehow the values in logging_outputs get shifted and the value for sample_size becomes 0.
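
For anyone who just needs training to keep going while debugging, a minimal workaround sketch (my own, not an upstream fix; the function name reduce_metrics_guarded is only for illustration) is to guard the division shown in the traceback so the scalar is only logged when the aggregated sample_size is positive:

import math
from fairseq import metrics  # same module the criterion in the traceback uses

def reduce_metrics_guarded(logging_outputs):
    # Aggregate the per-worker logging outputs the same way the criterion does.
    loss_sum = sum(log.get("loss", 0) for log in logging_outputs)
    sample_size = sum(log.get("sample_size", 0) for log in logging_outputs)
    # Only log when there is something to normalize by; a sample_size of 0 means
    # the per-worker outputs were empty or misaligned.
    if sample_size > 0:
        metrics.log_scalar("loss", loss_sum / sample_size / math.log(2),
                           sample_size, round=3)

This silences the crash but does not explain why logging_outputs gets shifted in multi-GPU runs; that still has to be tracked down.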

@ARDUJS

ARDUJS commented Jan 5, 2022

Is this resolved? I hit the same issue: ZeroDivisionError: float division by zero.

@nargesdel

Is this resolved? I met the same issue, ZeroDivisionError: float division by zero

Have you found any solution for this issue? Thanks.
