[RFC] Long Term QAT Flow #987

andrewor14 · 2024-10-01T20:54:30Z

Currently torchao QAT has two APIs, tensor subclasses and module swap. The original plan was to deprecate and eventually remove the old module swap API in favor of the tensor subclass API. However, users are starting to rely on the module API for production uses due to gaps in the tensor subclass API. In this RFC, we discuss the few long term plans for these two APIs in torchao.

API Today

We use a quantizer API today to abstract the implementation details from the user. Currently we support both tensor subclass and module swap APIs using different quantizers:

from torchao.quantization.prototype.qat import Int8DynActInt4WeightQATQuantizer 
from torchao.quantization.prototype.qat._module_swap_api import Int8DynActInt4WeightQATQuantizerModuleSwap

# tensor subclass version
qat_quantizer = Int8DynActInt4WeightQATQuantizer()
# module swap version
qat_quantizer = Int8DynActInt4WeightQATQuantizerModuleSwap()

# prepare: inserts fake quantizes but keeps everything in bf16
fake_quantized_model = qat_quantizer.prepare(model)

# train or fine-tune as before
train(fake_quantized_model)

# convert: actually quantizes the model to lower bit-widths
quantized_model = qat_quantizer.convert(fake_quantized_model)

Module Swap vs Tensor Subclass

Although tensor subclasses are generally adopted in torchao, the main gap today are (1) the lack of general distributed support, and (2) steep learning curve. For these two reasons, some users prefer the module swap flow, and have begun implementing new features in this flow, such as embedding quantization and static QAT.

To summarize the pros and cons of both approaches:

	Tensor subclass	Module swap
Consistency	✓ The rest of torchao uses tensor subclasses, including PTQ, quantized training, float8 inference, and sparsity. These all use the same quantize_ API.	✖ Diverges from PTQ flow (not necessarily a con)
Composability	✓ Better composability with other tensor subclasses such as DTensor and NestedJaggedTensor	✖ Pure module swap misses out on potential composability benefits with other tensor subclasses (no clear benefits for QAT today)
Distributed support	✖ Currently only supports FSDP2. Internal implementation of each distributed strategy is exposed to the subclass. There are problems with how tensor subclasses interact with FSDP1 and DDP. Fixing these is not a priority for the distributed team.	✓ Works with any distribution strategy, including non-PyTorch ones like FAIR FSDP
Developer experience	✖ Steep learning curve, difficult for new users to extend, confusing error messages	✓ Easy to understand and extend. Supports module-level features like range learning

We can separate tensor subclass usage into two categories:

Injection. This refers to how we insert fake quantization logic into the model. For example, tensor subclass injection means we look for nn.Linear modules and swap out the weight tensor, while module swap injection means we look for nn.Linear modules and swap out the whole module with our custom QATLinear. Today, the tensor subclass flow in torchao uses the former, while the module swap flow uses the latter.
Fake quantization implementation. This refers to how we represent fake quantization during training. We can use our custom AffineFakeQuantizedTensor to encode the desired fake quantization configurations, or we can use plain torch.Tensor.
We can combine these two in the same flow. For example, use module swap for injection and tensor subclass for data representation.

Long Term Flow

We propose to use module swap for injection and tensor subclass for implementing fake quantization in the long term. This has the following pros and cons compared to the alternatives:

✓ Single QAT flow in torchao
✓ Lower bar of entry; new users can continue to contribute features quickly
✓ Consistent with float8 training in torchao
✓ Composes well with other tensor subclasses like DTensor (e.g. cast to int8 before all-gather)
✖ Need additional work to support all distributed strategies

Note: In the short term, we will continue to use plain torch.Tensors for fake quantization due to the lack of general distributed support for tensor subclasses. The distributed strategies we should support before migrating to the long term flow include DDP and FSDP1. Additionally, we should migrate only if tensor subclass composability provides meaningful performance benefits, such as faster fake quantization through efficient int8 kernels.

The text was updated successfully, but these errors were encountered:

jerryzh168 · 2024-10-01T21:04:57Z

@andrewor14 by Data Representation I think you meant how fake quantization is implemented right (fake quantized tensor v.s. using modules? not the final quantized tensor right, might be good to clarify (renaming to something else might be better I think)

Summary: Following #987, this commit makes module swap the main QAT flow today. We remove all tensor subclass fake quantize injection logic since this is not needed in both the long term and the short term plans for QAT. In the short term, we will continue to use a full module swap flow, and only migrate to the long term flow once there is general distributed support for tensor subclasses and when tensor subclass composability provides meaningful benefits. Test Plan: python test/quantization/test_qat.py

andrewor14 · 2024-10-01T21:20:38Z

by Data Representation I think you meant how fake quantization is implemented right (fake quantized tensor v.s. using modules? not the final quantized tensor right, might be good to clarify (renaming to something else might be better I think)

Yeah, this is referring to how the data is represented during fake quantization in the training phase (after prepare but before convert), not the final quantized data (after convert). I'm open to renaming suggestions if you have any

Summary: Following #987, this commit makes module swap the main QAT flow today. We remove all tensor subclass fake quantize injection logic since this is not needed in both the long term and the short term plans for QAT. In the short term, we will continue to use a full module swap flow, and only migrate to the long term flow once there is general distributed support for tensor subclasses and when tensor subclass composability provides meaningful benefits. Test Plan: python test/quantization/test_qat.py

jerryzh168 · 2024-10-01T22:25:22Z

Yeah, this is referring to how the data is represented during fake quantization in the training phase (after prepare but before convert), not the final quantized data (after convert). I'm open to renaming suggestions if you have any

maybe just "Fake Quantization Implementation"?

gau-nernst · 2024-10-02T01:02:11Z

Just curious. What are the problems that you observe with tensor subclass + DDP? I have used this combination in my other projects and it seems to work as expected (i.e. no errors, correct results)

vkuzo · 2024-10-02T04:37:45Z

Note: In the short term, we will continue to use module swap for data representation due to the lack of general distributed support for tensor subclasses.

To clarify, not sure that "module swap for data representation makes sense". Should this say "use plain torch.Tensor for data representation"?

For FSDP1 composability, from what I understand as long as you don't have model parameter wrappers, you can still use tensor subclass for data representation, and thus get the benefits of integrating with other distributed paradigms (TP/SP) and benefits of easily using low precision gemms.

andrewor14 · 2024-10-02T14:20:34Z

Just curious. What are the problems that you observe with tensor subclass + DDP? I have used this combination in my other projects and it seems to work as expected (i.e. no errors, correct results)

That's great to know. Can you share the links?

To clarify, not sure that "module swap for data representation makes sense". Should this say "use plain torch.Tensor for data representation"?

Sounds good. For FSDP1, the issue I ran into was moving the model to a different device moves only the outer tensor but not the inner tensor, and this is fundamental to how FSDP1 assigns model.data during model initialization (this line). I think @awgu mentioned this is something they'd like to fix eventually, but it's not the top priority at the moment.

gau-nernst · 2024-10-03T00:31:07Z

@andrewor14 Train script https://github.com/gau-nernst/quantized-training/blob/main/llm_pretrain.py. DDP stuff is pretty standard, no changes. The subclasses are swapped by quantize_model(). This code is where I run some of the experiments for quantized training before opening PRs in torchao.

Summary: Following #987, this commit makes module swap the main QAT flow today. We remove all tensor subclass fake quantize injection logic since this is not needed in both the long term and the short term plans for QAT. In the short term, we will continue to use a full module swap flow, and only migrate to the long term flow once there is general distributed support for tensor subclasses and when tensor subclass composability provides meaningful benefits. Test Plan: python test/quantization/test_qat.py [ghstack-poisoned]

**Summary:** Following #987, this commit makes module swap the main QAT flow today. We remove all tensor subclass fake quantize injection logic since this is not needed in both the long term and the short term plans for QAT. In the short term, we will continue to use a full module swap flow, and only migrate to the long term flow once there is general distributed support for tensor subclasses and when tensor subclass composability provides meaningful benefits. **Test Plan:** python test/quantization/test_qat.py [ghstack-poisoned]

Summary: Following #987, this commit makes module swap the main QAT flow today. We remove all tensor subclass fake quantize injection logic since this is not needed in both the long term and the short term plans for QAT. In the short term, we will continue to use a full module swap flow, and only migrate to the long term flow once there is general distributed support for tensor subclasses and when tensor subclass composability provides meaningful benefits. Test Plan: python test/quantization/test_qat.py [ghstack-poisoned]

* Make module swap the main QAT flow again Summary: Following #987, this commit makes module swap the main QAT flow today. We remove all tensor subclass fake quantize injection logic since this is not needed in both the long term and the short term plans for QAT. In the short term, we will continue to use a full module swap flow, and only migrate to the long term flow once there is general distributed support for tensor subclasses and when tensor subclass composability provides meaningful benefits. Test Plan: python test/quantization/test_qat.py [ghstack-poisoned] * Move and rename GranularityType -> Granularity Summary: Move GranularityType to quant_primitives.py to be consistent with other similar fields like MappingType and ZeroPointDomain. Test Plan: CI [ghstack-poisoned] * Update on "Move and rename GranularityType -> Granularity" Summary: Move GranularityType to quant_primitives.py to be consistent with other similar fields like MappingType and ZeroPointDomain. Test Plan: CI [ghstack-poisoned] * Update on "Move and rename GranularityType -> Granularity" Summary: Move GranularityType to quant_primitives.py to be consistent with other similar fields like MappingType and ZeroPointDomain. Test Plan: CI [ghstack-poisoned] * Update on "Move and rename GranularityType -> Granularity" Summary: Move GranularityType to quant_primitives.py to be consistent with other similar fields like MappingType and ZeroPointDomain. Test Plan: CI [ghstack-poisoned] * Update base for Update on "Move and rename GranularityType -> Granularity" Summary: Move GranularityType to quant_primitives.py to be consistent with other similar fields like MappingType and ZeroPointDomain. Test Plan: CI [ghstack-poisoned]

Summary: Following #987, this commit makes module swap the main QAT flow today. We remove all tensor subclass fake quantize injection logic since this is not needed in both the long term and the short term plans for QAT. In the short term, we will continue to use a full module swap flow, and only migrate to the long term flow once there is general distributed support for tensor subclasses and when tensor subclass composability provides meaningful benefits. Test Plan: python test/quantization/test_qat.py [ghstack-poisoned]

* Make module swap the main QAT flow again Summary: Following #987, this commit makes module swap the main QAT flow today. We remove all tensor subclass fake quantize injection logic since this is not needed in both the long term and the short term plans for QAT. In the short term, we will continue to use a full module swap flow, and only migrate to the long term flow once there is general distributed support for tensor subclasses and when tensor subclass composability provides meaningful benefits. Test Plan: python test/quantization/test_qat.py [ghstack-poisoned] * Move and rename GranularityType -> Granularity Summary: Move GranularityType to quant_primitives.py to be consistent with other similar fields like MappingType and ZeroPointDomain. Test Plan: CI [ghstack-poisoned] * Update on "Move and rename GranularityType -> Granularity" Summary: Move GranularityType to quant_primitives.py to be consistent with other similar fields like MappingType and ZeroPointDomain. Test Plan: CI [ghstack-poisoned] * Update on "Move and rename GranularityType -> Granularity" Summary: Move GranularityType to quant_primitives.py to be consistent with other similar fields like MappingType and ZeroPointDomain. Test Plan: CI [ghstack-poisoned] * Update on "Move and rename GranularityType -> Granularity" Summary: Move GranularityType to quant_primitives.py to be consistent with other similar fields like MappingType and ZeroPointDomain. Test Plan: CI [ghstack-poisoned] * Update base for Update on "Move and rename GranularityType -> Granularity" Summary: Move GranularityType to quant_primitives.py to be consistent with other similar fields like MappingType and ZeroPointDomain. Test Plan: CI [ghstack-poisoned]

* Make module swap the main QAT flow again Summary: Following #987, this commit makes module swap the main QAT flow today. We remove all tensor subclass fake quantize injection logic since this is not needed in both the long term and the short term plans for QAT. In the short term, we will continue to use a full module swap flow, and only migrate to the long term flow once there is general distributed support for tensor subclasses and when tensor subclass composability provides meaningful benefits. Test Plan: python test/quantization/test_qat.py [ghstack-poisoned] * Add generic fake quantized linear for QAT Summary: This commit adds a generic fake quantized linear module to replace the uses of the existing more specific QAT linears. For example, `Int8DynActInt4WeightQATLinear` can be expressed as follows: ``` from torchao.quantization.prototype.qat.api import FakeQuantizeConfig from torchao.quantization.prototype.qat.linear import FakeQuantizedLinear activation_config = FakeQuantizeConfig( bit_width=8, granularity="per_token", symmetric=False, dynamic=True, ) weight_config = FakeQuantizeConfig( bit_width=4, group_size=8, symmetric=True, dynamic=True, ) fq_linear = FakeQuantizedLinear( 16, 32, False, activation_config, weight_config, ) ``` The main motivation is to provide a more flexible way to perform QAT on models with linear layers. Previously, we would have to create a new linear class every time we wish to experiment with different fake quantization settings, e.g. different group size or different bit width. Now we can express this easily using a single linear module. Test Plan: python test/quantization/test_qat.py -k test_fake_quantize_config python test/quantization/test_qat.py -k test_fake_quantized_linear_8da4w python test/quantization/test_qat.py -k test_fake_quantized_linear_4w [ghstack-poisoned] * Update base for Update on "Add generic fake quantized linear for QAT" **Summary:** This commit adds a generic fake quantized linear module to replace the uses of the existing more specific QAT linears. For example, `Int8DynActInt4WeightQATLinear` can be expressed as follows: ``` from torchao.quantization.prototype.qat.api import FakeQuantizeConfig from torchao.quantization.prototype.qat.linear import FakeQuantizedLinear activation_config = FakeQuantizeConfig( bit_width=8, granularity="per_token", symmetric=False, dynamic=True, ) weight_config = FakeQuantizeConfig( bit_width=4, group_size=8, symmetric=True, dynamic=True, ) fq_linear = FakeQuantizedLinear( 16, 32, False, activation_config, weight_config, ) ``` The main motivation is to provide a more flexible way to perform QAT on models with linear layers. Previously, we would have to create a new linear class every time we wish to experiment with different fake quantization settings, e.g. different group size or different bit width. Now we can express this easily using a single linear module. **Test Plan:** python test/quantization/test_qat.py -k test_fake_quantize_config python test/quantization/test_qat.py -k test_fake_quantized_linear_8da4w python test/quantization/test_qat.py -k test_fake_quantized_linear_4w [ghstack-poisoned] * Update base for Update on "Add generic fake quantized linear for QAT" **Summary:** This commit adds a generic fake quantized linear module to replace the uses of the existing more specific QAT linears. For example, `Int8DynActInt4WeightQATLinear` can be expressed as follows: ``` from torchao.quantization.prototype.qat.api import FakeQuantizeConfig from torchao.quantization.prototype.qat.linear import FakeQuantizedLinear activation_config = FakeQuantizeConfig( bit_width=8, granularity="per_token", symmetric=False, dynamic=True, ) weight_config = FakeQuantizeConfig( bit_width=4, group_size=8, symmetric=True, dynamic=True, ) fq_linear = FakeQuantizedLinear( 16, 32, False, activation_config, weight_config, ) ``` The main motivation is to provide a more flexible way to perform QAT on models with linear layers. Previously, we would have to create a new linear class every time we wish to experiment with different fake quantization settings, e.g. different group size or different bit width. Now we can express this easily using a single linear module. **Test Plan:** python test/quantization/test_qat.py -k test_fake_quantize_config python test/quantization/test_qat.py -k test_fake_quantized_linear_8da4w python test/quantization/test_qat.py -k test_fake_quantized_linear_4w [ghstack-poisoned] * Update base for Update on "Add generic fake quantized linear for QAT" **Summary:** This commit adds a generic fake quantized linear module to replace the uses of the existing more specific QAT linears. For example, `Int8DynActInt4WeightQATLinear` can be expressed as follows: ``` from torchao.quantization.prototype.qat.api import FakeQuantizeConfig from torchao.quantization.prototype.qat.linear import FakeQuantizedLinear activation_config = FakeQuantizeConfig( bit_width=8, granularity="per_token", symmetric=False, dynamic=True, ) weight_config = FakeQuantizeConfig( bit_width=4, group_size=8, symmetric=True, dynamic=True, ) fq_linear = FakeQuantizedLinear( 16, 32, False, activation_config, weight_config, ) ``` The main motivation is to provide a more flexible way to perform QAT on models with linear layers. Previously, we would have to create a new linear class every time we wish to experiment with different fake quantization settings, e.g. different group size or different bit width. Now we can express this easily using a single linear module. **Test Plan:** python test/quantization/test_qat.py -k test_fake_quantize_config python test/quantization/test_qat.py -k test_fake_quantized_linear_8da4w python test/quantization/test_qat.py -k test_fake_quantized_linear_4w [ghstack-poisoned] * Update base for Update on "Add generic fake quantized linear for QAT" **Summary:** This commit adds a generic fake quantized linear module to replace the uses of the existing more specific QAT linears. For example, `Int8DynActInt4WeightQATLinear` can be expressed as follows: ``` from torchao.quantization.prototype.qat.api import FakeQuantizeConfig from torchao.quantization.prototype.qat.linear import FakeQuantizedLinear activation_config = FakeQuantizeConfig( bit_width=8, granularity="per_token", symmetric=False, dynamic=True, ) weight_config = FakeQuantizeConfig( bit_width=4, group_size=8, symmetric=True, dynamic=True, ) fq_linear = FakeQuantizedLinear( 16, 32, False, activation_config, weight_config, ) ``` The main motivation is to provide a more flexible way to perform QAT on models with linear layers. Previously, we would have to create a new linear class every time we wish to experiment with different fake quantization settings, e.g. different group size or different bit width. Now we can express this easily using a single linear module. **Test Plan:** python test/quantization/test_qat.py -k test_fake_quantize_config python test/quantization/test_qat.py -k test_fake_quantized_linear_8da4w python test/quantization/test_qat.py -k test_fake_quantized_linear_4w [ghstack-poisoned] * Update base for Update on "Add generic fake quantized linear for QAT" **Summary:** This commit adds a generic fake quantized linear module to replace the uses of the existing more specific QAT linears. For example, `Int8DynActInt4WeightQATLinear` can be expressed as follows: ``` from torchao.quantization.prototype.qat.api import FakeQuantizeConfig from torchao.quantization.prototype.qat.linear import FakeQuantizedLinear activation_config = FakeQuantizeConfig( bit_width=8, granularity="per_token", symmetric=False, dynamic=True, ) weight_config = FakeQuantizeConfig( bit_width=4, group_size=8, symmetric=True, dynamic=True, ) fq_linear = FakeQuantizedLinear( 16, 32, False, activation_config, weight_config, ) ``` The main motivation is to provide a more flexible way to perform QAT on models with linear layers. Previously, we would have to create a new linear class every time we wish to experiment with different fake quantization settings, e.g. different group size or different bit width. Now we can express this easily using a single linear module. **Test Plan:** python test/quantization/test_qat.py -k test_fake_quantize_config python test/quantization/test_qat.py -k test_fake_quantized_linear_8da4w python test/quantization/test_qat.py -k test_fake_quantized_linear_4w [ghstack-poisoned] * Update base for Update on "Add generic fake quantized linear for QAT" **Summary:** This commit adds a generic fake quantized linear module to replace the uses of the existing more specific QAT linears. For example, `Int8DynActInt4WeightQATLinear` can be expressed as follows: ``` from torchao.quantization.prototype.qat.api import FakeQuantizeConfig from torchao.quantization.prototype.qat.linear import FakeQuantizedLinear activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False) weight_config = FakeQuantizeConfig(torch.int4, group_size=8) fq_linear = FakeQuantizedLinear(16, 32, False, activation_config, weight_config) ``` The main motivation is to provide a more flexible way to perform QAT on models with linear layers. Previously, we would have to create a new linear class every time we wish to experiment with different fake quantization settings, e.g. different group size or different bit width. Now we can express this easily using a single linear module. **Test Plan:** python test/quantization/test_qat.py -k test_fake_quantize_config_granularity python test/quantization/test_qat.py -k test_fake_quantize_config_granularity_error_cases python test/quantization/test_qat.py -k test_fake_quantize_config_mapping_type python test/quantization/test_qat.py -k test_fake_quantized_linear_8da4w python test/quantization/test_qat.py -k test_fake_quantized_linear_4w [ghstack-poisoned] * Update base for Update on "Add generic fake quantized linear for QAT" **Summary:** This commit adds a generic fake quantized linear module to replace the uses of the existing more specific QAT linears. For example, `Int8DynActInt4WeightQATLinear` can be expressed as follows: ``` from torchao.quantization.prototype.qat.api import FakeQuantizeConfig from torchao.quantization.prototype.qat.linear import FakeQuantizedLinear activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False) weight_config = FakeQuantizeConfig(torch.int4, group_size=8) fq_linear = FakeQuantizedLinear(16, 32, False, activation_config, weight_config) ``` The main motivation is to provide a more flexible way to perform QAT on models with linear layers. Previously, we would have to create a new linear class every time we wish to experiment with different fake quantization settings, e.g. different group size or different bit width. Now we can express this easily using a single linear module. **Test Plan:** python test/quantization/test_qat.py -k test_fake_quantize_config_granularity python test/quantization/test_qat.py -k test_fake_quantize_config_granularity_error_cases python test/quantization/test_qat.py -k test_fake_quantize_config_mapping_type python test/quantization/test_qat.py -k test_fake_quantized_linear_8da4w python test/quantization/test_qat.py -k test_fake_quantized_linear_4w [ghstack-poisoned] * Update base for Update on "Add generic fake quantized linear for QAT" **Summary:** This commit adds a generic fake quantized linear module to replace the uses of the existing more specific QAT linears. For example, `Int8DynActInt4WeightQATLinear` can be expressed as follows: ``` from torchao.quantization.prototype.qat.api import FakeQuantizeConfig from torchao.quantization.prototype.qat.linear import FakeQuantizedLinear activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False) weight_config = FakeQuantizeConfig(torch.int4, group_size=8) fq_linear = FakeQuantizedLinear(16, 32, False, activation_config, weight_config) ``` The main motivation is to provide a more flexible way to perform QAT on models with linear layers. Previously, we would have to create a new linear class every time we wish to experiment with different fake quantization settings, e.g. different group size or different bit width. Now we can express this easily using a single linear module. **Test Plan:** python test/quantization/test_qat.py -k test_fake_quantize_config_granularity python test/quantization/test_qat.py -k test_fake_quantize_config_granularity_error_cases python test/quantization/test_qat.py -k test_fake_quantize_config_mapping_type python test/quantization/test_qat.py -k test_fake_quantized_linear_8da4w python test/quantization/test_qat.py -k test_fake_quantized_linear_4w [ghstack-poisoned] * Update base for Update on "Add generic fake quantized linear for QAT" **Summary:** This commit adds a generic fake quantized linear module to replace the uses of the existing more specific QAT linears. For example, `Int8DynActInt4WeightQATLinear` can be expressed as follows: ``` from torchao.quantization.prototype.qat.api import FakeQuantizeConfig from torchao.quantization.prototype.qat.linear import FakeQuantizedLinear activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False) weight_config = FakeQuantizeConfig(torch.int4, group_size=8) fq_linear = FakeQuantizedLinear(16, 32, False, activation_config, weight_config) ``` The main motivation is to provide a more flexible way to perform QAT on models with linear layers. Previously, we would have to create a new linear class every time we wish to experiment with different fake quantization settings, e.g. different group size or different bit width. Now we can express this easily using a single linear module. **Test Plan:** python test/quantization/test_qat.py -k test_fake_quantize_config_granularity python test/quantization/test_qat.py -k test_fake_quantize_config_granularity_error_cases python test/quantization/test_qat.py -k test_fake_quantize_config_mapping_type python test/quantization/test_qat.py -k test_fake_quantized_linear_8da4w python test/quantization/test_qat.py -k test_fake_quantized_linear_4w [ghstack-poisoned] * Update base for Update on "Add generic fake quantized linear for QAT" **Summary:** This commit adds a generic fake quantized linear module to replace the uses of the existing more specific QAT linears. For example, `Int8DynActInt4WeightQATLinear` can be expressed as follows: ``` from torchao.quantization.prototype.qat.api import FakeQuantizeConfig from torchao.quantization.prototype.qat.linear import FakeQuantizedLinear activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False) weight_config = FakeQuantizeConfig(torch.int4, group_size=8) fq_linear = FakeQuantizedLinear(16, 32, False, activation_config, weight_config) ``` The main motivation is to provide a more flexible way to perform QAT on models with linear layers. Previously, we would have to create a new linear class every time we wish to experiment with different fake quantization settings, e.g. different group size or different bit width. Now we can express this easily using a single linear module. **Test Plan:** python test/quantization/test_qat.py -k test_fake_quantize_config_granularity python test/quantization/test_qat.py -k test_fake_quantize_config_granularity_error_cases python test/quantization/test_qat.py -k test_fake_quantize_config_mapping_type python test/quantization/test_qat.py -k test_fake_quantized_linear_8da4w python test/quantization/test_qat.py -k test_fake_quantized_linear_4w [ghstack-poisoned] * Update base for Update on "Add generic fake quantized linear for QAT" **Summary:** This commit adds a generic fake quantized linear module to replace the uses of the existing more specific QAT linears. For example, `Int8DynActInt4WeightQATLinear` can be expressed as follows: ``` from torchao.quantization.prototype.qat.api import FakeQuantizeConfig from torchao.quantization.prototype.qat.linear import FakeQuantizedLinear activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False) weight_config = FakeQuantizeConfig(torch.int4, group_size=8) fq_linear = FakeQuantizedLinear(16, 32, False, activation_config, weight_config) ``` The main motivation is to provide a more flexible way to perform QAT on models with linear layers. Previously, we would have to create a new linear class every time we wish to experiment with different fake quantization settings, e.g. different group size or different bit width. Now we can express this easily using a single linear module. **Test Plan:** python test/quantization/test_qat.py -k test_fake_quantize_config_granularity python test/quantization/test_qat.py -k test_fake_quantize_config_granularity_error_cases python test/quantization/test_qat.py -k test_fake_quantize_config_mapping_type python test/quantization/test_qat.py -k test_fake_quantized_linear_8da4w python test/quantization/test_qat.py -k test_fake_quantized_linear_4w [ghstack-poisoned] * Update base for Update on "Add generic fake quantized linear for QAT" **Summary:** This commit adds a generic fake quantized linear module to replace the uses of the existing more specific QAT linears. For example, `Int8DynActInt4WeightQATLinear` can be expressed as follows: ``` from torchao.quantization.prototype.qat.api import FakeQuantizeConfig from torchao.quantization.prototype.qat.linear import FakeQuantizedLinear activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False) weight_config = FakeQuantizeConfig(torch.int4, group_size=8) fq_linear = FakeQuantizedLinear(16, 32, False, activation_config, weight_config) ``` The main motivation is to provide a more flexible way to perform QAT on models with linear layers. Previously, we would have to create a new linear class every time we wish to experiment with different fake quantization settings, e.g. different group size or different bit width. Now we can express this easily using a single linear module. **Test Plan:** python test/quantization/test_qat.py -k test_fake_quantize_config_granularity python test/quantization/test_qat.py -k test_fake_quantize_config_granularity_error_cases python test/quantization/test_qat.py -k test_fake_quantize_config_mapping_type python test/quantization/test_qat.py -k test_fake_quantized_linear_8da4w python test/quantization/test_qat.py -k test_fake_quantized_linear_4w [ghstack-poisoned] * Update base for Update on "Add generic fake quantized linear for QAT" **Summary:** This commit adds a generic fake quantized linear module to replace the uses of the existing more specific QAT linears. For example, `Int8DynActInt4WeightQATLinear` can be expressed as follows: ``` from torchao.quantization.prototype.qat.api import FakeQuantizeConfig from torchao.quantization.prototype.qat.linear import FakeQuantizedLinear activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False) weight_config = FakeQuantizeConfig(torch.int4, group_size=8) fq_linear = FakeQuantizedLinear(16, 32, False, activation_config, weight_config) ``` The main motivation is to provide a more flexible way to perform QAT on models with linear layers. Previously, we would have to create a new linear class every time we wish to experiment with different fake quantization settings, e.g. different group size or different bit width. Now we can express this easily using a single linear module. **Test Plan:** python test/quantization/test_qat.py -k test_fake_quantize_config_granularity python test/quantization/test_qat.py -k test_fake_quantize_config_granularity_error_cases python test/quantization/test_qat.py -k test_fake_quantize_config_mapping_type python test/quantization/test_qat.py -k test_fake_quantized_linear_8da4w python test/quantization/test_qat.py -k test_fake_quantized_linear_4w [ghstack-poisoned] * Update base for Update on "Add generic fake quantized linear for QAT" **Summary:** This commit adds a generic fake quantized linear module to replace the uses of the existing more specific QAT linears. For example, `Int8DynActInt4WeightQATLinear` can be expressed as follows: ``` from torchao.quantization.prototype.qat.api import FakeQuantizeConfig from torchao.quantization.prototype.qat.linear import FakeQuantizedLinear activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False) weight_config = FakeQuantizeConfig(torch.int4, group_size=8) fq_linear = FakeQuantizedLinear(16, 32, False, activation_config, weight_config) ``` The main motivation is to provide a more flexible way to perform QAT on models with linear layers. Previously, we would have to create a new linear class every time we wish to experiment with different fake quantization settings, e.g. different group size or different bit width. Now we can express this easily using a single linear module. **Test Plan:** python test/quantization/test_qat.py -k test_fake_quantize_config_granularity python test/quantization/test_qat.py -k test_fake_quantize_config_granularity_error_cases python test/quantization/test_qat.py -k test_fake_quantize_config_mapping_type python test/quantization/test_qat.py -k test_fake_quantized_linear_8da4w python test/quantization/test_qat.py -k test_fake_quantized_linear_4w [ghstack-poisoned]

Summary: pytorch/ao#1091 moved QAT out of prototype in torchao. This is a BC-breaking change so torchtune also needs to update its QAT imports. Additionally, after pytorch/ao#987 we decided that QAT in torchao will use module swaps to insert fake quantizes, so there is no need to have a separate module swap quantizer, so this commit removes the `*ModuleSwapQuantizer` option. Test Plan: pytest -m integration_test tests/recipes/test_qat_distributed.py should work

* CLI: Remove unsafe access of unused args * Annotate the args conditional on subcommands in functions * Typo in generate.py

andrewor14 mentioned this issue Oct 1, 2024

[BAD VERSION] Make module swap the main QAT flow again #989

Closed

andrewor14 added the rfc label Oct 2, 2024

andrewor14 mentioned this issue Oct 4, 2024

[BAD VERSION2] Make module swap the main QAT flow again #1019

Merged

andrewor14 mentioned this issue Oct 8, 2024

Make module swap the main QAT flow again #1037

Merged

andrewor14 mentioned this issue Oct 22, 2024

Update imports after QAT was moved out of prototype pytorch/torchtune#1883

Merged

ebsmothers mentioned this issue Nov 28, 2024

Integrate INT8 mixed-precision from torchao 0.7 pytorch/torchtune#1552

Open

13 tasks

yanbing-j pushed a commit to yanbing-j/ao that referenced this issue Dec 9, 2024

CLI: Fix unsafe arg access of unused args (pytorch#987)

f0b2df7

* CLI: Remove unsafe access of unused args * Annotate the args conditional on subcommands in functions * Typo in generate.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Long Term QAT Flow #987

[RFC] Long Term QAT Flow #987

andrewor14 commented Oct 1, 2024 •

edited

Loading

jerryzh168 commented Oct 1, 2024 •

edited

Loading

andrewor14 commented Oct 1, 2024

jerryzh168 commented Oct 1, 2024

gau-nernst commented Oct 2, 2024

vkuzo commented Oct 2, 2024

andrewor14 commented Oct 2, 2024

gau-nernst commented Oct 3, 2024

[RFC] Long Term QAT Flow #987

[RFC] Long Term QAT Flow #987

Comments

andrewor14 commented Oct 1, 2024 • edited Loading

API Today

Module Swap vs Tensor Subclass

Long Term Flow

jerryzh168 commented Oct 1, 2024 • edited Loading

andrewor14 commented Oct 1, 2024

jerryzh168 commented Oct 1, 2024

gau-nernst commented Oct 2, 2024

vkuzo commented Oct 2, 2024

andrewor14 commented Oct 2, 2024

gau-nernst commented Oct 3, 2024

andrewor14 commented Oct 1, 2024 •

edited

Loading

jerryzh168 commented Oct 1, 2024 •

edited

Loading