🎯The goal is to be able to calculate the minimum GPU requirements for Training(Fine Tuning and Continued Pre Training) and Inference for any LLM along with Comparison to Self-Host these models across different GPU Cloud Platforms and Optimizations. Eventually to Calculate tokens/$ for every possible combinations of Model, Platform and Optimizations!
Note: I do realize there's always gonna be proprietary/funky techniques to boil down inference/training costs at production. This aims to be a naive attempt at estimating the cost ;)
The current formula to calculate the minimum size for inference and training is from the findings in Reducing Activation Recomputation in Large Transformer Models and simplified by Transformer Math 101
Model_size * 1.2
# according to the blog, where 20% is a good buffer to accomodate for any overheads
To provide a rough estimate of the total cost of running an LLM on a per 1000 token basis, the logic from https://cursor.sh/blog/llama-inference is followed in the calculator:
-
Retrieve User Inputs:
- The following parameters are required:
- FLOPs per Token
- FLOPs per GPU (TFLOPs)
- Number of GPUs
- Cost per Hour (USD)
- Memory Bandwidth per GPU (TB/s)
- The following parameters are required:
-
Compute Tokens per Second (Compute Bound):
- The number of tokens processed per second when compute bound is calculated using the formula:
tokens_p_sec_compute = (flops_per_gpu * num_gpus * 10**12) / (flops_per_token * 10**9)
- The number of tokens processed per second when compute bound is calculated using the formula:
-
Compute Tokens per Second (Memory Bound):
- The number of tokens processed per second when memory bound is calculated using the formula:
tokens_p_sec_memory = (memory_bandwidth_per_gpu * num_gpus * 10**12) / (flops_per_token * 10**9)
- The number of tokens processed per second when memory bound is calculated using the formula:
-
Calculate Cost per Token:
- The cost per token for both compute and memory bounds is determined using the formula:
cost_p_token_compute = cost_per_hour / (3600 * tokens_p_sec_compute) cost_p_token_memory = cost_per_hour / (3600 * tokens_p_sec_memory)
- The cost per token for both compute and memory bounds is determined using the formula:
-
Calculate Cost per 1,000 Tokens:
- The cost per 1,000 tokens for both compute and memory bounds is determined using the formula:
cost_p_1k_tokens_compute = cost_p_token_compute * 1000 cost_p_1k_tokens_memory = cost_p_token_memory * 1000
- The cost per 1,000 tokens for both compute and memory bounds is determined using the formula:
Upon inputting the necessary parameters and clicking on the "Calculate" button, the calculator will display the estimated cost per 1,000 input tokens and cost per 1,000 output tokens based on the above logic.
Training large transformer models often poses challenges in terms of GPU memory requirements. This tool estimates the memory consumption given a range of parameters, such as model size, sequence length, batch size, and parallelization settings.
-
Model Memory:
- Memory occupied by the model's weights.
- Influenced by the number of parameters and their data types (e.g., FP32 vs. FP16).
- Adjusted based on parallelism settings, like tensor and pipeline parallel degrees.
-
Gradient Memory:
- Memory needed to store gradients during backpropagation.
- Affected by the number of parameters, their data types, and the ZeRO optimizer stage.
- Reduced with pipeline parallelism.
-
Optimizer Memory:
- Memory occupied by the optimizer's states.
- For mixed-precision Adam/AdamW, three copies (parameters, momentum, variance) are stored, which means 12 bytes per parameter.
- Modified by the ZeRO optimizer stage.
-
Activation Memory:
- Memory used for storing activations during forward and backward passes.
- Calculated based on sequence length, batch size, hidden size, and the number of layers.
- Can be influenced by techniques like activation checkpointing and activation partitioning.
-
Communication Memory:
- Additional memory needed for operations like all-reduce during distributed training.
- Particularly pertinent when using ZeRO optimization.
-
Miscellaneous Memory:
- An overhead term to account for other unaccounted memory usage by deep learning frameworks, communication libraries, etc.
To get memory estimates:
- Input your desired model configurations.
- The tool will calculate memory requirements for each component and provide a total estimate.
Looking For Contributions to implement the logic and Crowdsource relevant data!
All props to https://github.com/cloud-gpus/cloud-gpus.github.io from where I stole the list of available GPUs and their pricing, https://huggingface.co/spaces/mithril-security/TCO_calculator from where I used the TCO logic and https://gist.github.com/Quentin-Anthony/f43939791a7ceb0b01a4937308317be5 for Transformer memory logic as is ;) And to Dr. Pratik Desai for the idea!