This repository has been archived by the owner on Nov 17, 2023. It is now read-only.
Description
It is sometimes hard to predict how many samples are left in the DataLoader by the time the last batch arrives. In multi-GPU training with last_batch='keep', the last batch can contain fewer items than there are GPUs. In that case gluon.utils.split_and_load raises ValueError: Too many slices for data with shape ....
It would be great if this worked transparently. I would expect that when even_split=False is passed to split_and_load, no exception is raised: the data should be distributed so that some of the resulting arrays are empty, and the forward and backward passes on those empty arrays should be silently skipped.
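To make the requested behaviour concrete, here is a minimal pure-Python sketch of how slice boundaries could be computed so that trailing slices come out empty instead of raising. The helper name `tolerant_slice_bounds` is hypothetical and is not part of MXNet; it only illustrates the idea and does not reproduce how `split_data` currently divides a batch.

```python
def tolerant_slice_bounds(num_samples, num_slices):
    """Hypothetical helper (not an MXNet API).

    Returns one (start, stop) index pair per slice. Samples are spread as
    evenly as possible; when there are fewer samples than slices, the
    trailing slices are empty (start == stop) rather than an error.
    """
    if num_slices < 1:
        raise ValueError("num_slices must be at least 1")
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")
    base, extra = divmod(num_samples, num_slices)
    bounds = []
    start = 0
    for i in range(num_slices):
        # The first `extra` slices each take one additional sample.
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


# The failing case from this issue: 1 sample, 2 contexts.
print(tolerant_slice_bounds(1, 2))  # [(0, 1), (1, 1)]
print(tolerant_slice_bounds(3, 2))  # [(0, 2), (2, 3)]
```

With bounds like these, the second context would receive an empty array for the last batch, and the forward and backward passes for it would have to be skipped.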
Environment info (Required)
----------Python Info----------
Version : 3.6.4
Compiler : GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)
Build : ('default', 'Jan 16 2018 12:04:33')
Arch : ('64bit', '')
------------Pip Info-----------
Version : 18.0
Directory : /Users/sssokolo/anaconda3/lib/python3.6/site-packages/pip
----------MXNet Info-----------
/Users/sssokolo/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
Version : 1.5.0
Directory : /Users/sssokolo/anaconda3/lib/python3.6/site-packages/mxnet
Commit Hash : fd34dc5f847192dfd522555afdf13be1eb67b72b
----------System Info----------
Platform : Darwin-16.7.0-x86_64-i386-64bit
system : Darwin
node : 8c859074eea0
release : 16.7.0
version : Darwin Kernel Version 16.7.0: Sun Oct 28 22:30:19 PDT 2018; root:xnu-3789.73.27~1/RELEASE_X86_64
----------Hardware Info----------
machine : x86_64
processor : i386
b'machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP TSCI'
b'machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 HLE AVX2 BMI2 INVPCID RTM SMAP RDSEED ADX IPT SGX FPU_CSDS MPX CLFSOPT'
b'machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C'
b'machdep.cpu.brand_string: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz'
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0299 sec, LOAD: 0.6207 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0008 sec, LOAD: 0.1785 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0008 sec, LOAD: 0.1612 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0007 sec, LOAD: 0.1032 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0007 sec, LOAD: 0.4562 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0006 sec, LOAD: 0.0634 sec.
Package used (Python/R/Scala/Julia):
Python
Error Message:
Traceback (most recent call last):
File "/Volumes/Unix/workspace/exception_small_batch_to_split/main.py", line 25, in <module>
data = utils.split_and_load(data, context, even_split=False)
File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mxnet/gluon/utils.py", line 116, in split_and_load
slices = split_data(data, len(ctx_list), batch_axis, even_split)
File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mxnet/gluon/utils.py", line 69, in split_data
"num_slice=%d and batch_axis=%d."%(str(data.shape), num_slice, batch_axis))
ValueError: Too many slices for data with shape (1, 5). Arguments are num_slice=2 and batch_axis=0.
Minimum reproducible example
A regular, minimal multi-context training loop is enough:
import mxnet as mx
from mxnet import nd, gluon, autograd
from mxnet.gluon import utils, Trainer
from mxnet.gluon.data import ArrayDataset, DataLoader
from mxnet.gluon.loss import SoftmaxCrossEntropyLoss
context = [mx.cpu(0), mx.cpu(1)]
datasize = 3
batch_size_per_context = 1
data = nd.random.uniform(-1, 1, shape=(datasize, 5))
label = nd.random.uniform(-1, 1, shape=(datasize, 1))
dataset = ArrayDataset(data, label)
dataloader = DataLoader(dataset,
                        batch_size=len(context) * batch_size_per_context,
                        last_batch='keep')
net = gluon.nn.Dense(units=2)
net.initialize(ctx=context)
loss_fn = SoftmaxCrossEntropyLoss()
trainer = Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01})
for (data, label) in dataloader:
    data = utils.split_and_load(data, context, even_split=False)
    label = utils.split_and_load(label, context, even_split=False)
    losses = []
    for d, l in zip(data, label):
        with autograd.record():
            out = net(d)
            losses.append(loss_fn(out, l))
    for loss in losses:
        loss.backward()
    trainer.step(1)
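Until something like this is handled inside the library, one possible user-side workaround is to pass split_and_load only as many contexts as the batch has samples. The sketch below is an assumption-laden illustration: `contexts_for_batch` is a hypothetical helper, not an MXNet function, and plain strings stand in for real contexts so the snippet runs without MXNet.

```python
def contexts_for_batch(num_samples, contexts):
    """Hypothetical helper (not an MXNet API).

    Keeps at most one context per sample, so the number of slices
    requested never exceeds the number of samples in the batch.
    """
    if num_samples < 1:
        raise ValueError("batch must contain at least one sample")
    return list(contexts[:min(num_samples, len(contexts))])


# Stand-in values; in the loop above these would be mx.cpu(0), mx.cpu(1).
all_contexts = ["cpu(0)", "cpu(1)"]
print(contexts_for_batch(2, all_contexts))  # ['cpu(0)', 'cpu(1)']
print(contexts_for_batch(1, all_contexts))  # ['cpu(0)']

# Intended use inside the training loop (not executed here):
#   ctx = contexts_for_batch(data.shape[0], context)
#   data = utils.split_and_load(data, ctx, even_split=False)
#   label = utils.split_and_load(label, ctx, even_split=False)
```

This only avoids the ValueError. The contexts that are left out produce no gradient for that iteration, so whether trainer.step accepts that (for example through its ignore_stale_grad argument) should be checked against the MXNet version in use.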