ZeroDivisionError: float division by zero #32
Hi, could you
And my CUDA version is cuda-11.1.
Hi, what is your training command?
The training command is like the one below. What is your CUDA version when using PyTorch 1.4.0? I found that CUDA 11.1 is not supported by PyTorch 1.4.0. TOTAL_NUM_UPDATES=80000
Maybe you can try removing the option
Is this resolved? I met the same issue, and I didn't set --max-sentences to 1. Somehow the values in logging_outputs are shifted and the value for sample_size becomes 0 (see the sketch below).
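For reference, the aggregation that blows up is roughly the following. This is a hedged sketch paraphrased from fairseq's label_smoothed_cross_entropy.py; the aggregate_loss name is hypothetical, and only the final division matches the line in the traceback further down.

```python
import math

# Minimal sketch (paraphrased, not the exact fairseq source) of the aggregation
# that fails in LabelSmoothedCrossEntropyCriterion.reduce_metrics.
def aggregate_loss(logging_outputs):
    loss_sum = sum(log.get('loss', 0) for log in logging_outputs)
    # sample_size is summed over the per-GPU logging outputs; if every entry is
    # missing 'sample_size' (e.g. the outputs got shifted across workers) or it
    # is 0, the division below raises ZeroDivisionError.
    sample_size = sum(log.get('sample_size', 0) for log in logging_outputs)
    return loss_sum / sample_size / math.log(2)

# Example of "shifted" logging outputs with no usable sample_size:
try:
    aggregate_loss([{'loss': 3.2, 'ntokens': 512}])
except ZeroDivisionError as e:
    print(e)  # float division by zero
```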
Is this resolved? I met the same issue: ZeroDivisionError: float division by zero.
Have you found any solution for this issue? Thanks.
I met the "ZeroDivisionError: float division by zero" error when training the model with multiple GPUs. With only 1 GPU the problem disappears, but training is too slow...
The detailed traceback is below; do you have any idea about it?
/home/xjw/code/guided_summarization/src/fairseq/fairseq/optim/adam.py:179: UserWarning: This overload of add_ is deprecated:
add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
add_(Tensor other, *, Number alpha) (Triggered internally at /pytorch/torch/csrc/utils/python_arg_parser.cpp:1025.)
exp_avg.mul_(beta1).add_(1 - beta1, grad)
/home/xjw/code/guided_summarization/src/fairseq/fairseq/optim/adam.py:179: UserWarning: This overload of add_ is deprecated:
add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
add_(Tensor other, *, Number alpha) (Triggered internally at /pytorch/torch/csrc/utils/python_arg_parser.cpp:1025.)
exp_avg.mul_(beta1).add_(1 - beta1, grad)
Traceback (most recent call last):
File "/home/xjw/miniconda3/envs/gsum/bin/fairseq-train", line 33, in
sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
File "/home/xjw/code/guided_summarization/src/fairseq/fairseq_cli/train.py", line 316, in cli_main
nprocs=args.distributed_world_size,
File "/home/xjw/miniconda3/envs/gsum/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/xjw/miniconda3/envs/gsum/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/home/xjw/miniconda3/envs/gsum/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/xjw/miniconda3/envs/gsum/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/home/xjw/code/guided_summarization/src/fairseq/fairseq_cli/train.py", line 283, in distributed_main
main(args, init_distributed=True)
File "/home/xjw/code/guided_summarization/src/fairseq/fairseq_cli/train.py", line 102, in main
train(args, trainer, task, epoch_itr)
File "/home/xjw/miniconda3/envs/gsum/lib/python3.6/contextlib.py", line 52, in inner
return func(*args, **kwds)
File "/home/xjw/code/guided_summarization/src/fairseq/fairseq_cli/train.py", line 178, in train
log_output = trainer.train_step(samples)
File "/home/xjw/miniconda3/envs/gsum/lib/python3.6/contextlib.py", line 52, in inner
return func(*args, **kwds)
File "/home/xjw/code/guided_summarization/src/fairseq/fairseq/trainer.py", line 391, in train_step
logging_output = self._reduce_and_log_stats(logging_outputs, sample_size)
File "/home/xjw/code/guided_summarization/src/fairseq/fairseq/trainer.py", line 718, in _reduce_and_log_stats
self.task.reduce_metrics(logging_outputs, self.get_criterion())
File "/home/xjw/code/guided_summarization/src/fairseq/fairseq/tasks/guided_translation.py", line 307, in reduce_metrics
super().reduce_metrics(logging_outputs, criterion)
File "/home/xjw/code/guided_summarization/src/fairseq/fairseq/tasks/fairseq_task.py", line 406, in reduce_metrics
criterion.__class__.reduce_metrics(logging_outputs)
File "/home/xjw/code/guided_summarization/src/fairseq/fairseq/criterions/label_smoothed_cross_entropy.py", line 95, in reduce_metrics
metrics.log_scalar('loss', loss_sum / sample_size / math.log(2), sample_size, round=3)
ZeroDivisionError: float division by zero
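A possible local workaround, given only as a hedged sketch under the assumption that the zero sample_size comes from empty or misaligned per-GPU logging outputs (this is not a fix from the maintainers), is to guard the division inside reduce_metrics in fairseq/criterions/label_smoothed_cross_entropy.py:

```python
# Hypothetical guard around the line from the traceback
# (fairseq/criterions/label_smoothed_cross_entropy.py, reduce_metrics).
# It only hides the symptom; the root cause is that sample_size sums to 0.
sample_size = sum(log.get('sample_size', 0) for log in logging_outputs)
if sample_size > 0:
    metrics.log_scalar('loss', loss_sum / sample_size / math.log(2),
                       sample_size, round=3)
else:
    # Log a placeholder instead of dividing by zero, and investigate why the
    # distributed logging_outputs arrived empty or shifted (e.g. batching flags).
    metrics.log_scalar('loss', 0.0, 0, round=3)
```

If this makes the crash go away but the logged loss looks wrong, the real problem is still upstream in how the per-GPU logging outputs are gathered (or in the batching options discussed above).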