diff --git a/docs/source/examples/e2e_flow.rst b/docs/source/examples/e2e_flow.rst
index 01806ef3da..c7b2ec1136 100644
--- a/docs/source/examples/e2e_flow.rst
+++ b/docs/source/examples/e2e_flow.rst
@@ -6,7 +6,7 @@ End-to-End Workflow with TorchTune
 
 In this tutorial, we'll walk through an end-to-end example of how you can fine-tune,
 evaluate, optionally quantize and then run generation with your favorite LLM using
-TorchTune. We'll also go over how you can use some of your favorite tools and libraries
+TorchTune. We'll also go over how you can use some popular tools and libraries
 from the community seemlessly with TorchTune.
 
 .. grid:: 2
@@ -27,16 +27,16 @@ from the community seemlessly with TorchTune.
 Overview
 --------
 
-Fine-tuning an LLM is almost never itself the end goal. Usually this is one step in a much
-larger worfklow. An example workflow might look something like this:
+Fine-tuning an LLM is usually only one step in a larger workflow. An example workflow
+might look something like this:
 
-- Download a popular model from HF Hub
+- Download a popular model from `HF Hub `_
 - Fine-tune the model using a relevant fine-tuning technique. The exact technique used
   will depend on factors such as the model, amount and nature of training data, your hardware
   setup and the end task for which the model will be used
 - Evaluate the model on some benchmarks to validate model quality
 - Run some generations to make sure the model output looks reasonable
-- Quantize the model for efficient inference followed by optionally exporting it for specific
+- Quantize the model for efficient inference followed by exporting it for specific
   environments such as inference on a mobile phone
 
 In this tutorial we'll cover each of these items and give examples of how you can do this using
@@ -84,66 +84,44 @@ and use the standard settings from the
 
 This will fine-tune our model on the `Alpaca Dataset `_
-using a ``batch_size`` of ``2`` and ``dtype`` of ``bfloat16``. With these settings the model
+using a ``batch_size=2`` and ``dtype=bfloat16``. With these settings the model
 should have a peak memory usage of ~16GB and total training time of around 2hours for each
 epoch.
 
 We'll need to make some changes to the config to make sure our recipe can
 access the right checkpoints.
 
-Let's first copy over the config to our local working director so we can make changes.
-
+Let's look for the right config for this use case by using the tune CLI.
 
 .. code-block:: bash
 
-    tune cp llama2/7B_lora_single_device ./custom_lora_config.yaml
-
+    tune ls
 
-Let's modify ``custom_lora_config.yaml`` to include the following changes.
+    RECIPE                          CONFIG
+    full_finetune_single_device     llama2/7B_full_single_device
+                                    llama2/7B_full_single_device_low_memory
+                                    mistral/7B_full
+    full_finetune_distributed       llama2/7B_full
+                                    llama2/13B_full
+                                    mistral/7B_full
+    lora_finetune_single_device     llama2/7B_lora_single_device
+                                    llama2/7B_qlora_single_device
+                                    mistral/7B_lora
+    ...
 
-.. code-block:: yaml
-
-    checkpointer:
-
-        # checkpointer to use
-        _component_: torchtune.utils.FullModelHFCheckpointer
-
-        # directory with the checkpoint files
-        # this should match the output_dir above
-        checkpoint_dir:
+For this tutorial we'll use the ``llama2/7B_lora_single_device`` config.
 
-        # checkpoint files. For the llama2-7b-hf model we have
-        # 2 .bin files
-        checkpoint_files: [
-            pytorch_model-00001-of-00002.bin,
-            pytorch_model-00002-of-00002.bin,
-        ]
-
-        # since we're starting a new training run, there's no
-        # recipe state and so set this to null
-        recipe_checkpoint: null
-
-        # dir for saving the output checkpoints. Usually set
-        # to be the same as checkpoint_dir
-        output_dir:
-
-        # model_type which specifies how to convert the state_dict
-        # into a format which TorchTune understands
-        model_type: LLAMA2
-
-    resume_from_checkpoint: False
-
-    # Make sure to update the tokenizer path to the right
-    # checkpoint directory as well
-    tokenizer:
-        _component_: torchtune.models.llama2.llama2_tokenizer
-        path: /tokenizer.model
-
-
-Once the config is updated, let's kick off training!
+The config already points to the HF Checkpointer and the right checkpoint files.
+All we need to do is update the checkpoint directory for both the model and the
+tokenizer. Let's do this using the overrides in the tune CLI while starting training!
 
 .. code-block:: bash
 
     tune run lora_finetune_single_device \
-    --config ./custom_lora_config.yaml
+    --config llama2/7B_lora_single_device \
+    checkpointer.checkpoint_dir=/tmp/Llama-2-7b-hf \
+    tokenizer.path=/tmp/Llama-2-7b-hf/tokenizer.model \
+    checkpointer.output_dir=/tmp/Llama-2-7b-hf
 
 Once training is complete, you'll see the following in the logs.
 
@@ -157,10 +135,11 @@ Once training is complete, you'll see the following in the logs.
 
     [_checkpointer.py:484] Adapter checkpoint of size 0.01 GB saved to /adapter_0.pt
 
-The "merged weights" (see the :ref:`LoRA Tutorial ` for more details)
-are split across two checkpoint files similar to the source checkpoints from the HF Hub.
-In fact the keys would be identical between these checkpoints. For more details see the
-checkpointing tutorial. We also have a third checkpoint file which is much smaller in size
+The final trained weights are merged with the original model and split across two checkpoint files
+similar to the source checkpoints from the HF Hub
+(see the :ref:`LoRA Tutorial ` for more details).
+In fact the keys will be identical between these checkpoints.
+We also have a third checkpoint file which is much smaller in size
 and contains the learnt LoRA adapter weights. For this tutorial, we'll only use the model
 checkpoints and not the adapter weights.
 
@@ -171,18 +150,37 @@ Run Evaluation using EleutherAI's Eval Harness
 
 We've fine-tuned a model. But how well does this model really do? Let's run some
 Evaluations!
 
-Evaluation is a hard problem. Instead of re-inventing the wheel, TorchTune integrates with
+Instead of re-inventing the wheel on evals, TorchTune integrates with
 EleutherAI's evaluation harness. An example of this is available through the
 ``eleuther_eval`` recipe. In this tutorial, we're going to directly use this recipe by
-modifying it's associated config ``eleuther_eval.yaml``.
+modifying its associated config ``eleuther_evaluation.yaml``.
+
+Since we plan to update all of the checkpoint files to point to our fine-tuned checkpoints,
+let's first copy over the config to our local working directory so we can make changes. This
+will be easier than overriding all of these elements through the CLI.
+
+.. code-block:: bash
+
+    tune cp eleuther_evaluation ./custom_eval_config.yaml
+
+For this tutorial we'll use the ``truthfulqa_mc2`` task from the harness.
+The Truthful QA dataset measures a model's propensity to be truthful when answering questions.
+This task measures the model's zero-shot accuracy on a question followed by one or more true
+responses and one or more false responses. Let's first run a baseline without fine-tuning.
 
-Let's first copy over the config to our local working director so we can make changes.
 
 .. code-block:: bash
 
-    tune cp eleuther_eval ./custom_eval_config.yaml
+    tune run eleuther_eval --config ./custom_eval_config.yaml
+
+    [evaluator.py:324] Running loglikelihood requests
+    [eleuther_eval.py:195] Eval completed in 121.27 seconds.
+    [eleuther_eval.py:197] truthfulqa_mc2: {'acc,none': 0.388...
 
-Let's modify ``custom_eval_config.yaml`` to include the following changes.
+The model has an accuracy of around 38.8%. Let's compare this with the fine-tuned model.
+
+
+First, we modify ``custom_eval_config.yaml`` to include the fine-tuned checkpoints.
 
 .. code-block:: yaml
 
@@ -211,45 +209,43 @@ Let's modify ``custom_eval_config.yaml`` to include the following changes.
     path: /tokenizer.model
 
 
-Once the config is updated, let's kick off evaluation! We'll use the
-``truthfulqa_mc2`` task which is also the default in the config.
+Now, let's run the recipe.
 
 .. code-block:: bash
 
     tune run eleuther_eval --config ./custom_eval_config.yaml
 
-Once evaluation is complete, you'll see the following in the logs.
+The results should look something like this.
 
 .. code-block:: bash
 
     [evaluator.py:324] Running loglikelihood requests
     [eleuther_eval.py:195] Eval completed in 121.27 seconds.
-    [eleuther_eval.py:197] truthfulqa_mc2: {'acc,none': 0.48919958543950917 ...}
+    [eleuther_eval.py:197] truthfulqa_mc2: {'acc,none': 0.489 ...
 
-So seems like our fine-tuned model gets ~48% on this task. Which is pretty good.
-An exercise for you to do is to modify the config and run this eval using the
-original model from HF. You should get somewhere around 39-40%.
+So it seems like our fine-tuned model gets ~48% on this task, which is ~10 points
+better than the baseline. Great! Seems like our fine-tuning helped.
 
 |
 
-Generation!
+Generation
 -----------
 
-We've run some evaluations and the model seems to be doing well. But does is really
+We've run some evaluations and the model seems to be doing well. But does it really
 generate meaningful text for the prompts you care about? Let's find out!
 
 For this, we'll use the `generate recipe `_
 and the associated
-`config `_.
+`config `_.
 
-Let's first copy over the config to our local working director so we can make changes.
+Let's first copy over the config to our local working directory so we can make changes.
 
 .. code-block:: bash
 
-    tune cp generate ./custom_generation_config.yaml
+    tune cp generation ./custom_generation_config.yaml
 
 Let's modify ``custom_generation_config.yaml`` to include the following changes.
 
@@ -290,9 +286,8 @@ We'll use a different prompt from the one in the config
 
 .. code-block:: bash
 
-    tune run generate \
-    --config generate \
-    prompt="What are some interesting sites to visit in th Bay Area?"
+    tune run generate --config ./custom_generation_config.yaml \
+    prompt="What are some interesting sites to visit in the Bay Area?"
 
 Once generation is complete, you'll see the following in the logs.
 
@@ -307,8 +302,8 @@ Once generation is complete, you'll see the following in the logs.
 
     [generate.py:99] Memory used: 15.72 GB
 
-Indeed, the bridge is pretty cool! Seems like our LLM knows what it's talking
-about.
+Indeed, the bridge is pretty cool! Seems like our LLM knows a little something about the
+Bay Area!
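+
+As an aside, any field in the generation config can be overridden from the command line with
+the same ``key=value`` syntax we used for the prompt above. The field names below
+(``max_new_tokens`` and ``temperature``) are assumptions for illustration only; check your copy
+of ``custom_generation_config.yaml`` for the fields it actually exposes. A sketch might look
+like this:
+
+.. code-block:: bash
+
+    # hypothetical field names -- verify them against your copied generation config
+    tune run generate --config ./custom_generation_config.yaml \
+    prompt="What are some interesting sites to visit in the Bay Area?" \
+    max_new_tokens=128 \
+    temperature=0.6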
 
 |
 
@@ -317,19 +312,20 @@ Speeding up Generation using Quantization
 
 We saw that the generation recipe took around 11.6 seconds to generate 300 tokens. One
 technique commonly used to speed up inference is quantization. TorchTune provides
-an integration with the TorchAO quantization APIs. Let's first quantize the model using
-4-bit weights-only quantization and see if this improves generation.
+an integration with the `TorchAO `_
+quantization APIs. Let's first quantize the model using 4-bit weights-only quantization
+and see if this improves generation speed.
 
 For this, we'll use the `quantization recipe `_.
 
-Let's first copy over the config to our local working director so we can make changes.
+Let's first copy over the config to our local working directory so we can make changes.
 
 .. code-block:: bash
 
-    tune cp quantize ./custom_quantization_config.yaml
+    tune cp quantization ./custom_quantization_config.yaml
 
 Let's modify ``custom_quantization_config.yaml`` to include the following changes.
 
@@ -360,7 +356,7 @@ quantization method from the config.
 
 .. code-block:: bash
 
-    tune run quantize --config quantize
+    tune run quantize --config ./custom_quantization_config.yaml
 
 Once quantization is complete, you'll see the following in the logs.
 
@@ -372,16 +368,16 @@ Once quantization is complete, you'll see the following in the logs.
 
 .. note::
 
-    Unlike the fine-tuned checkpoints, this output a single checkpoint file. This is
+    Unlike the fine-tuned checkpoints, this outputs a single checkpoint file. This is
     because our quantization APIs currently don't support any conversion across formats.
     As a result you won't be able to use these quantized models outside of TorchTune. But
     you should be able to use these with the generation and evaluation recipes within
     TorchTune. These results will help inform which quantization methods you should use
     with your favorite inference engine.
 
-Now that we have the quantized model. Let's rerun generation.
+Now that we have the quantized model, let's re-run generation.
 
-Let's modify ``custom_generation_config.yaml`` to include the following changes.
+Modify ``custom_generation_config.yaml`` to include the following changes.
 
 .. code-block:: yaml
 
@@ -418,9 +414,8 @@ We'll use a different prompt from the one in the config
 
 .. code-block:: bash
 
-    tune run generate \
-    --config generate \
-    prompt="What are some interesting sites to visit in th Bay Area?"
+    tune run generate --config ./custom_generation_config.yaml \
+    prompt="What are some interesting sites to visit in the Bay Area?"
 
 Once generation is complete, you'll see the following in the logs.
 
@@ -510,7 +505,10 @@ The output should look something like this:
 
     at WS Middle School ...
     Time for inference 5: 1.94 sec total, 103.28 tokens/sec
-    Bandwidth achieved: 1391.84 GB/
+    Bandwidth achieved: 1391.84 GB/sec
 
 And thats it! Try your own prompt!
+
+Hope this tutorial gave you some insights into how you can use TorchTune for
+your own workflows. Happy Tuning!
diff --git a/recipes/README.md b/recipes/README.md
index 6c6e524fda..fef09c9956 100644
--- a/recipes/README.md
+++ b/recipes/README.md
@@ -114,12 +114,12 @@ This will output a gguf file in the same precision which can be used for running
 
 ### Architecture Optimization
 
-TorchTune integrates with `torchao`(https://github.com/pytorch-labs/ao/) for architecture optimization techniques including quantization and sparsity. Currently only some quantization techniques are integrated, see `receipes/configs/quantize.yaml` for more details.
+TorchTune integrates with `torchao`(https://github.com/pytorch-labs/ao/) for architecture optimization techniques including quantization and sparsity. Currently only some quantization techniques are integrated, see `recipes/configs/quantization.yaml` for more details.
 
 #### Quantize
 To quantize a model (default is int4 weight only quantization):
 ```
-tune run quantize --config quantize
+tune run quantize --config quantization
 ```
 
 #### Eval
@@ -139,11 +139,11 @@ quantizer:
 
 and run the eval command:
 ```
-tune run eleuther_eval --config eleuther_eval
+tune run eleuther_eval --config eleuther_evaluation
 ```
 
 #### Generate
-Changes in `receipes/configs/generate.yaml`
+Changes in `recipes/configs/generation.yaml`
 ```
 # Model arguments
 checkpointer:
@@ -160,14 +160,14 @@ quantizer:
 
 and run generate command:
 ```
-tune run generate --config generate
+tune run generate --config generation
 ```
 
 #### GPTQ
 
 GPTQ is an algorithm to improve the accuracy of quantized model through optimizing the loss of (activation * weight) together, here are the changes that's needed to use it for int4 weight only quantization
 
-`receipes/configs/quantize.yaml`
+`recipes/configs/quantization.yaml`
 
 We'll publish doc pages for different quantizers in torchao a bit later. Please check `receipes/configs/quantized.yaml for how to use them for now.
@@ -207,7 +207,7 @@ def quantize(self, cfg: DictConfig):
 
 Run quantize
 ```
-tune run quantize --config quantize
+tune run quantize --config quantization
 ```
 
 `recipes/eleuther_eval.py`
diff --git a/recipes/configs/eleuther_eval.yaml b/recipes/configs/eleuther_evaluation.yaml
similarity index 70%
rename from recipes/configs/eleuther_eval.yaml
rename to recipes/configs/eleuther_evaluation.yaml
index f3f42f042c..01563e304d 100644
--- a/recipes/configs/eleuther_eval.yaml
+++ b/recipes/configs/eleuther_evaluation.yaml
@@ -5,24 +5,23 @@
 
 # Model Arguments
 model:
-  _component_: torchtune.models.llama2.llama2_13b
+  _component_: torchtune.models.llama2.llama2_7b
 
 checkpointer:
   _component_: torchtune.utils.FullModelHFCheckpointer
-  checkpoint_dir: /tmp/Llama-2-13b-hf
+  checkpoint_dir: /tmp/Llama-2-7b-hf
   checkpoint_files: [
-    pytorch_model-00001-of-00003.bin,
-    pytorch_model-00002-of-00003.bin,
-    pytorch_model-00003-of-00003.bin
+    pytorch_model-00001-of-00002.bin,
+    pytorch_model-00002-of-00002.bin,
   ]
   recipe_checkpoint: null
-  output_dir: /tmp/Llama-2-13b-hf
+  output_dir: /tmp/Llama-2-7b-hf
   model_type: LLAMA2
 
 # Tokenizer
 tokenizer:
   _component_: torchtune.models.llama2.llama2_tokenizer
-  path: /tmp/Llama-2-13b-hf/tokenizer.model
+  path: /tmp/Llama-2-7b-hf/tokenizer.model
 
 # Environment
 device: cuda
diff --git a/recipes/configs/generate.yaml b/recipes/configs/generation.yaml
similarity index 59%
rename from recipes/configs/generate.yaml
rename to recipes/configs/generation.yaml
index dffec77852..8b467147cb 100644
--- a/recipes/configs/generate.yaml
+++ b/recipes/configs/generation.yaml
@@ -1,17 +1,16 @@
 # Model arguments
 model:
-  _component_: torchtune.models.llama2.llama2_13b
+  _component_: torchtune.models.llama2.llama2_7b
 
 checkpointer:
   _component_: torchtune.utils.FullModelHFCheckpointer
-  checkpoint_dir: /tmp/Llama-2-13b-hf/
+  checkpoint_dir: /tmp/Llama-2-7b-hf/
   checkpoint_files: [
-    pytorch_model-00001-of-00003.bin,
-    pytorch_model-00002-of-00003.bin,
-    pytorch_model-00003-of-00003.bin
+    pytorch_model-00001-of-00002.bin,
+    pytorch_model-00002-of-00002.bin,
   ]
-  output_dir: /tmp/Llama-2-13b-hf/
+  output_dir: /tmp/Llama-2-7b-hf/
   model_type: LLAMA2
 
 device: cuda
@@ -22,7 +21,7 @@ seed: 1234
 # Tokenizer arguments
 tokenizer:
   _component_: torchtune.models.llama2.llama2_tokenizer
-  path: /tmp/Llama-2-13b-hf/tokenizer.model
+  path: /tmp/Llama-2-7b-hf/tokenizer.model
 
 # Generation arguments; defaults taken from gpt-fast
 prompt: "Hello, my name is"
diff --git a/recipes/configs/llama2/7B_lora_single_device.yaml b/recipes/configs/llama2/7B_lora_single_device.yaml
index eb5d0448e2..fbb6c7f900 100644
--- a/recipes/configs/llama2/7B_lora_single_device.yaml
+++ b/recipes/configs/llama2/7B_lora_single_device.yaml
@@ -30,21 +30,23 @@ model:
   lora_rank: 8
   lora_alpha: 16
 
+tokenizer:
+  _component_: torchtune.models.llama2.llama2_tokenizer
+  path: /tmp/Llama-2-7b-hf/tokenizer.model
+
 checkpointer:
-  _component_: torchtune.utils.FullModelMetaCheckpointer
-  checkpoint_dir: /tmp/llama2/
-  checkpoint_files: [consolidated.00.pth]
+  _component_: torchtune.utils.FullModelHFCheckpointer
+  checkpoint_dir: /tmp/Llama-2-7b-hf
+  checkpoint_files: [
+    pytorch_model-00001-of-00002.bin,
+    pytorch_model-00002-of-00002.bin
+  ]
   adapter_checkpoint: null
   recipe_checkpoint: null
-  output_dir: /tmp/llama2/
+  output_dir: /tmp/Llama-2-7b-hf
   model_type: LLAMA2
   resume_from_checkpoint: False
 
-# Tokenizer
-tokenizer:
-  _component_: torchtune.models.llama2.llama2_tokenizer
-  path: /tmp/llama2/tokenizer.model
-
 # Dataset and Sampler
 dataset:
   _component_: torchtune.datasets.alpaca_cleaned_dataset
diff --git a/recipes/configs/quantize.yaml b/recipes/configs/quantization.yaml
similarity index 88%
rename from recipes/configs/quantize.yaml
rename to recipes/configs/quantization.yaml
index 06c77a5b60..656ffa1bcf 100644
--- a/recipes/configs/quantize.yaml
+++ b/recipes/configs/quantization.yaml
@@ -34,18 +34,17 @@
 #
 # Model arguments
 model:
-  _component_: torchtune.models.llama2.llama2_13b
+  _component_: torchtune.models.llama2.llama2_7b
 
 checkpointer:
   _component_: torchtune.utils.FullModelHFCheckpointer
-  checkpoint_dir: /tmp/Llama-2-13b-hf
+  checkpoint_dir: /tmp/Llama-2-7b-hf
   checkpoint_files: [
-    pytorch_model-00001-of-00003.bin,
-    pytorch_model-00002-of-00003.bin,
-    pytorch_model-00003-of-00003.bin
+    pytorch_model-00001-of-00002.bin,
+    pytorch_model-00002-of-00002.bin,
  ]
   recipe_checkpoint: null
-  output_dir: /tmp/Llama-2-13b-hf
+  output_dir: /tmp/Llama-2-7b-hf
   model_type: LLAMA2
 
 device: cuda
diff --git a/torchtune/_recipe_registry.py b/torchtune/_recipe_registry.py
index 004cbd6f4e..84ad6e195e 100644
--- a/torchtune/_recipe_registry.py
+++ b/torchtune/_recipe_registry.py
@@ -96,7 +96,7 @@ class Recipe:
         name="generate",
         file_path="generate.py",
         configs=[
-            Config(name="generate", file_path="generate.yaml"),
+            Config(name="generation", file_path="generation.yaml"),
         ],
         supports_distributed=False,
     ),
@@ -104,7 +104,7 @@ class Recipe:
         name="eleuther_eval",
         file_path="eleuther_eval.py",
         configs=[
-            Config(name="eleuther_eval", file_path="eleuther_eval.yaml"),
+            Config(name="eleuther_evaluation", file_path="eleuther_evaluation.yaml"),
         ],
         supports_distributed=False,
     ),
@@ -120,7 +120,7 @@ class Recipe:
         name="quantize",
         file_path="quantize.py",
         configs=[
-            Config(name="quantize", file_path="quantize.yaml"),
+            Config(name="quantization", file_path="quantization.yaml"),
         ],
         supports_distributed=False,
     ),