From d634846b5e4f309032b51b90c93ce6faf826ca15 Mon Sep 17 00:00:00 2001
From: Andrea Tupini
Date: Mon, 30 Jan 2023 11:40:03 -0600
Subject: [PATCH] Minor formatting fix on model_parallel docs (#16565)

---
 docs/source-pytorch/advanced/model_parallel.rst | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/source-pytorch/advanced/model_parallel.rst b/docs/source-pytorch/advanced/model_parallel.rst
index 789490e76118d..fd1610a232dda 100644
--- a/docs/source-pytorch/advanced/model_parallel.rst
+++ b/docs/source-pytorch/advanced/model_parallel.rst
@@ -48,6 +48,7 @@ When Shouldn't I use an Optimized Distributed Strategy?
 =======================================================
 
 Sharding techniques help when model sizes are fairly large; roughly 500M+ parameters is where we've seen benefits. However, in the following cases, we recommend sticking to ordinary distributed strategies
+
 * When your model is small (ResNet50 of around 80M Parameters), unless you are using unusually large batch sizes or inputs.
 * Due to high distributed communication between devices, if running on a slow network/interconnect, the training might be much slower than expected and then it's up to you to determine the tradeoff here.
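
For context on the docs section touched by this patch, the following is a minimal, illustrative sketch (not part of the patch itself) of how the guidance above might translate into a Trainer configuration: plain DDP for small models, a sharded strategy only once the model is large enough to benefit. `MyLightningModule` is a hypothetical module, and the strategy strings (`"ddp"`, `"deepspeed_stage_2"`) assume a pytorch_lightning release from around the time of this patch; check the installed version's documentation for the exact names it accepts.

```python
# Illustrative sketch (not part of the patch): pick a Trainer strategy
# following the rule of thumb in the docs section above.
import pytorch_lightning as pl


def pick_strategy(num_parameters: int) -> str:
    # Below roughly 500M parameters, ordinary DDP is usually the better choice:
    # sharding adds inter-device communication without a clear memory benefit.
    if num_parameters < 500_000_000:
        return "ddp"
    # For larger models, a sharded strategy (here DeepSpeed ZeRO stage 2 as an
    # example) spreads optimizer state and gradients across devices.
    return "deepspeed_stage_2"


model = MyLightningModule()  # hypothetical LightningModule
n_params = sum(p.numel() for p in model.parameters())

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,
    strategy=pick_strategy(n_params),
)
trainer.fit(model)
```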