[MXNET-688] Fix quantization divide by zero errors #11833
Conversation
    for j in range(num_quantized_bins):
        start = j * num_merged_bins
        if j == num_quantized_bins - 1:
-           stop = -1
+           stop = len(is_nonzeros)
This is an off-by-one error that can be caught by the quantization tests I added: indexing a numpy array with x[a:-1] excludes the last element.
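As a quick standalone illustration of that slicing behaviour (this snippet is just for demonstration, not code from this PR):

    import numpy as np

    is_nonzeros = np.array([1, 0, 1, 1, 1])

    # stop = -1 silently drops the last element of the slice ...
    print(is_nonzeros[2:-1])                 # [1 1]
    # ... while stop = len(is_nonzeros) keeps it.
    print(is_nonzeros[2:len(is_nonzeros)])   # [1 1 1]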
python/mxnet/contrib/quantization.py (outdated)
        else:
            stop = start + num_merged_bins
        norm = is_nonzeros[start:stop].sum()
        if norm != 0:
            q[start:stop] = float(quantized_bins[j]) / float(norm)
    q[sliced_nd_hist == 0] = 0
This is not representative of the quantized distribution: artificially setting these values to 0 will not correctly represent the quantized activation output.
python/mxnet/contrib/quantization.py (outdated)
        else:
            stop = start + num_merged_bins
        norm = is_nonzeros[start:stop].sum()
        if norm != 0:
-           q[start:stop] = float(quantized_bins[j]) / float(norm)
-   q[sliced_nd_hist == 0] = 0
+           q[start:stop] = float(quantized_bins[j]) / float(num_quantized_bins)
Originally this was float(norm), and that is not appropriate. Suppose you have the distribution:
[0, 0, ..., 0, 1]
If num_quantized_bins is 3, then you theoretically should get:
[0, 0, ..., 1/3, 1/3, 1/3]
instead of:
[0, 0, ..., 1, 1, 1]
To make this clearer, suppose your original distribution is:
[0, 0, 0, ..., 1/3, 1/3, 1/3]
After quantization this should be equivalent to the first distribution, but it isn't. Under the old rule you would get the same array back, [..., 1/3, 1/3, 1/3], while the first distribution would give you [..., 1, 1, 1], off by a constant multiplier.
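A small numeric sketch of this argument (the expand helper and the 3-element slices are made up for illustration; this is not the code in quantization.py):

    import numpy as np

    def expand(hist_slice, divisor):
        # Spread the total count of one quantized bin uniformly back over its
        # merged histogram bins, dividing by the chosen divisor.
        return np.full(len(hist_slice), hist_slice.sum() / divisor)

    slice_a = np.array([0.0, 0.0, 1.0])      # ..., 0, 0, 1
    slice_b = np.array([1/3, 1/3, 1/3])      # ..., 1/3, 1/3, 1/3

    # Dividing by norm (the count of nonzero bins) treats the two slices differently:
    print(expand(slice_a, np.count_nonzero(slice_a)))  # [1. 1. 1.]
    print(expand(slice_b, np.count_nonzero(slice_b)))  # [0.333... 0.333... 0.333...]

    # Dividing by 3 (num_quantized_bins in this example) expands both the same way:
    print(expand(slice_a, 3))                          # [0.333... 0.333... 0.333...]
    print(expand(slice_b, 3))                          # [0.333... 0.333... 0.333...]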
I'm not sure I understand your change here. The original implementation is following the explanation here (see page 38):
http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf
@@ -274,22 +279,21 @@ def _get_optimal_threshold(arr, num_bins=8001, num_quantized_bins=255):
    max_val = np.max(arr)
    th = max(abs(min_val), abs(max_val))

-   hist, hist_edeges = np.histogram(arr, bins=num_bins, range=(-th, th))
+   hist, hist_edges = np.histogram(arr, bins=num_bins, range=(-th, th))
edges is misspelled as edeges throughout the code.
# at one edge: [0, 0, ..., 1000]. (histogram)
# We want to make sure that the optimal threshold in this case is the max.
arr = np.array([2]*1000)
res = mx.contrib.quant._get_optimal_threshold(arr, num_quantized_bins=5)
With the previously (incorrectly) implemented _get_optimal_threshold, this call would result in a divide-by-zero error.
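Roughly, the constant input concentrates the whole histogram in the edge bin, so candidate slices around zero are empty; a hedged sketch of that mechanism (toy bin count, not the num_bins=8001 used in the real code):

    import numpy as np

    # A constant array has min == max, so th = max(|min|, |max|) lands all of
    # the mass in the very last histogram bin.
    arr = np.array([2] * 1000)
    th = max(abs(arr.min()), abs(arr.max()))
    hist, hist_edges = np.histogram(arr, bins=10, range=(-th, th))
    print(hist)          # [   0    0    0    0    0    0    0    0    0 1000]

    # A candidate slice around the center that excludes that edge bin is all
    # zeros, and normalizing or smoothing an all-zero candidate divides by zero.
    sliced = hist[3:7]
    print(sliced.sum())  # 0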
python/mxnet/contrib/quantization.py (outdated)
try:
    q = _smooth_distribution(q)
except ValueError:
    divergence[i - num_half_quantized_bins] = float("inf")
If the distribution is improper, we set the KL divergence to infinity: such a distribution could theoretically model a uniform distribution over parameters [a, b] with either bound unbounded, which means the KL divergence is infinite.
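This is consistent with how KL divergence behaves when a candidate assigns zero probability where the reference has mass; a tiny sketch (assuming scipy.stats.entropy, which I believe the calibration code uses to compute the divergence):

    from scipy import stats

    p = [0.5, 0.5]
    q = [1.0, 0.0]   # the candidate assigns zero probability where p has mass
    print(stats.entropy(p, q))   # inf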
For reference, with the new PR, on ImageNet, ResNet-152, 5 batches, entropy method:
Previously:
This is proof that there was no degradation in performance. @reminisce
        else:
            stop = start + num_merged_bins
        norm = is_nonzeros[start:stop].sum()
        if norm != 0:
            q[start:stop] = float(quantized_bins[j]) / float(norm)
-   q[sliced_nd_hist == 0] = 0
+   q[p == 0] = 0
According to the slides on page 38 (http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf), the zeroed-out bins are meant to be taken w.r.t. the reference distribution p, rather than sliced_nd_hist.
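A made-up sketch of why the two masks differ (the numbers are hypothetical; the key point is that the outlier counts clamped into p's edge bins make p nonzero where sliced_nd_hist may be zero):

    import numpy as np

    sliced_nd_hist = np.array([0., 3., 0., 2., 0.])
    p = sliced_nd_hist.copy()
    p[0] += 10.0    # left outlier count clamped into the first bin of p
    p[-1] += 4.0    # right outlier count clamped into the last bin of p

    q = np.array([1., 1., 0., 1., 1.])   # some expanded quantized candidate

    q_by_hist = q.copy(); q_by_hist[sliced_nd_hist == 0] = 0
    q_by_p = q.copy();    q_by_p[p == 0] = 0
    print(q_by_hist)   # [0. 1. 0. 1. 0.]  edge bins zeroed even though p has mass there
    print(q_by_p)      # [1. 1. 0. 1. 1.]  edge bins kept, matching the reference p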
* Fix quantization bug
* Added tests and made sure the edge case is now considered correctly without 1 off errors
* Changed back to original truncated distribution but with different kl divergence calc
* Reorder back to original format
* Reorder back to original format (again)
* Change comments
* Clarified comments
* Changed norm division
Description
The current quantization strategy for calib_mode='entropy' is to calculate the KL divergence for different thresholds and choose the best one. This assumes that the random variable is continuous and nonzero over all reals. Because we are discretizing the distribution, we smooth it over the range [-threshold, threshold]. What we were not considering is that the entire sampled distribution may fall outside the range [-threshold, threshold], in which case we end up with all zeros in the sampled candidate distribution p inside _get_optimal_threshold. I have added a check that the (possibly unnormalized) distribution is proper before attempting to smooth it; otherwise we run into a divide-by-zero error.
In most cases, activation functions and layers for classification-type problems output values symmetric around 0. This is not the case for a regressor's last layer, and there are various other examples where the activation distribution is not centered around 0; this was a major blocker for Airbnb's adoption of MXNet's quantization capabilities.
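A minimal sketch of the kind of guard described above (the helper name is hypothetical; in the PR the check effectively lives in _smooth_distribution raising ValueError and the caller treating that candidate's divergence as infinite):

    import numpy as np

    def _is_proper(p):
        # A (possibly unnormalized) discrete distribution is only usable if it
        # carries some mass; an all-zero candidate cannot be smoothed or
        # normalized without dividing by zero.
        p = np.asarray(p, dtype=np.float64)
        return p.size > 0 and p.sum() > 0

    print(_is_proper([0.0, 0.0, 0.0]))   # False -> treat this candidate's divergence as inf
    print(_is_proper([0.0, 2.0, 1.0]))   # True  -> safe to smooth and normalize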
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes