Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Use dtype=int for the indices returned by TopK #11031

Closed
sxjscience opened this issue May 23, 2018 · 8 comments
Closed

Use dtype=int for the indices returned by TopK #11031

sxjscience opened this issue May 23, 2018 · 8 comments
Labels

Comments

@sxjscience
Copy link
Member

The dtype of ret_indices in "topk" is currently real_t. Should we revise it to int to avoid possible loss of precision? (https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/ordering_op-inl.h#L432)

@marcoabreu
Copy link
Contributor

@asmushetzel

@szha szha added the Operator label May 24, 2018
@asitstands
Copy link
Contributor

asitstands commented May 25, 2018

This could break existing code. Sometimes the resulting array of indices is subject to another operation. Then the result can become different if the dtype of the index array changes. Or it can make an error as some opertions can be applied to only floating point arrays.

nd.topk(nd.arange(10), k=2).mean()

Currently the result is [8.5], but the result is [8] if the dtype is integer.

nd.linalg.gemm2(nd.topk(x, k=2), y)

Currently this is allowed, but it'll cause an error if the dtype changes to integer.

multinomial has a similar but opposite issue. It returns an index array of integer type, but it would be convenient to have floating point array sometimes (PR #10970). I think that it would be better if we can make multinomial and topk consistent.

@sxjscience
Copy link
Member Author

Yes, this would be backward incompatible (So I create the issue to discuss how we should proceed). Using int32 will avoid bugs in the future when users run topk on some large tensors.

@asmushetzel
Copy link
Contributor

asmushetzel commented May 29, 2018

A few general remarks about indices in mxnet:

There is no specific convention in MXNet about what datatype to be used when encoding indices. It is basically left to the user or inferred from other data. This is generally fine and consistent. For example look at all operators that are indeed consumers of index arrays (pick, take, gather_nd):
They are all written in a way such that the datatype of the indices can be anything, as long as it can be cast to an int internally. In particular, the datatype does not have to be compliant with other inputs to such operators (i.e. take() will accept the data array to be of type double and the index array of type int/float/half_t/whatever). I think this is a really good design and we should not impose any restrictions that the incoming indices must be of integral type. Neither that every index should be integral.

The situation is slightly different for operators that actually produce index arrays (without having any index array as input). Such operators are topk, argsort, argmax, argmin, argmax_channel (hope I didn't forget one here). Currently they infer the index type (used in the output array of the operator) from the datatype used in the input. But that is weird as there is no objective correlation between data type and index type. If data input is float, why should be then the index type also a float? Moreover it can cause weird problems: If you want to get the index of the maximum element of a 10M large array of floats, then the float-type used as output would not be able to represent all potential indices (while an int would be able). To make such thing work, you currently would have to cast the entire data input to doubles, then receive indices as a double, then cast back to int. And even that may not work as the operator may not support doubles (which is the case for topk).

So my proposal would be to make a change for all index-producing operators (listed above) to allow explicit setting of the output index type by a parameter.

We may decide to keep backward compatibility by making the default behaviour match the current behaviour or choose int as the default and accepting non-backward compatibility.

But we should do something;-)

@sxjscience
Copy link
Member Author

sxjscience commented May 29, 2018 via email

@asitstands
Copy link
Contributor

I think that random.multinomial also need to be in the list. I'm positive on the introducion of the new name idx_dtype, but using the old dtype also looks not bad for the consistency, as the operators in random package use dtype to specify the data type of the output array. We need to set the backward compatible default currently, but I agree on changing it as integer in 2.0. Anyway an index is an integer by its nature.

@sxjscience
Copy link
Member Author

sxjscience commented May 29, 2018 via email

@asmushetzel
Copy link
Contributor

I think we should make it consistent from the beginning, i.e. add all affected operators in one shot.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

5 participants