Last updated: 2022.03.05
(Highly recommended) Distributed deep learning training and Debugger profiling with Amazon SageMaker – 최영준 :: AWS Innovate 2021 (30 min)
AWS re:Invent 2020: Fast training and near-linear scaling with DataParallel in Amazon SageMaker (30 min)
AWS re:Invent 2020: Train billion-parameter models with model parallelism on Amazon SageMaker (30 min)
AWS on Air 2020: AWS What’s Next ft. new libraries for distributed training on Amazon SageMaker (31 min)
Scale up Training of Your ML Models with Distributed Training on Amazon SageMaker (Sep 2019)
- Introduces the basic concepts of distributed training (15 min)
- https://www.youtube.com/watch?v=CDg55-GkIm4&list=PLhr1KZpdzukcOr_6j_zmSrvYnLUtgqsZz&index=8
- Amazon SageMaker Simplifies Training Deep Learning Models With Billions of Parameters (Dec 2020)
- New – Data Parallelism Library in Amazon SageMaker Simplifies Training on Large Datasets (Dec 2020)
- New – Profile Your Machine Learning Training Jobs With Amazon SageMaker Debugger (Dec 2020)
- How Latent Space used the Amazon SageMaker model parallelism library to push the frontiers of large-scale transformers (Mar 2020)
- Distributed Training: Train BART/T5 for Summarization using 🤗 Transformers and Amazon SageMaker (Apr 2021)
- Multi-GPU distributed deep learning training at scale with Ubuntu18 DLAMI, EFA on P3dn instances, and Amazon FSx for Lustre (Jun 2020)
- (Highly recommended) Hyundai reduces ML model training time for autonomous driving models using Amazon SageMaker (Jun 2021)
- [Highly recommended] Choose the best data source for your Amazon SageMaker training job (Feb 2022)
- Covers the data sources available to a training job (S3, FSx for Lustre, EFS, etc.): the strengths of each and how to choose among them
- https://aws.amazon.com/blogs/machine-learning/choose-the-best-data-source-for-your-amazon-sagemaker-training-job/
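The post above chooses the data source per input channel of the training job. A minimal stdlib-only sketch of the `InputDataConfig` channel shapes the `CreateTrainingJob` API expects, contrasting S3 in FastFile mode with FSx for Lustre (channel names, bucket, and the file system ID below are illustrative, not from the post):

```python
def s3_channel(name, s3_uri, input_mode="FastFile"):
    """Input channel backed by S3. FastFile mode streams objects on demand
    instead of downloading the full dataset before training starts."""
    return {
        "ChannelName": name,
        "InputMode": input_mode,  # "File", "Pipe", or "FastFile"
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


def fsx_lustre_channel(name, file_system_id, directory_path):
    """Input channel backed by FSx for Lustre: low-latency repeated reads,
    useful when many epochs revisit the same large dataset."""
    return {
        "ChannelName": name,
        "DataSource": {
            "FileSystemDataSource": {
                "FileSystemId": file_system_id,  # illustrative ID
                "FileSystemType": "FSxLustre",
                "FileSystemAccessMode": "ro",
                "DirectoryPath": directory_path,
            }
        },
    }


# InputDataConfig for a CreateTrainingJob request (illustrative values)
input_data_config = [
    s3_channel("train", "s3://my-bucket/train/"),
    fsx_lustre_channel("train-fsx", "fs-0123456789abcdef0", "/fsx/train"),
]
```

The SageMaker Python SDK wraps the same shapes as `sagemaker.inputs.TrainingInput` and `sagemaker.inputs.FileSystemInput`; the dicts above show what is sent on the wire.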
- Transfer Learning with Amazon SageMaker and FSx for Lustre (Feb 2022)
- No sample code is provided, but it gives an end-to-end overview.
- https://medium.com/@sayons/transfer-learning-with-amazon-sagemaker-and-fsx-for-lustre-378fa8977cc1
SageMaker Distributed Training
Amazon SageMaker examples (official examples)
TensorFlow
- SageMaker-Tensorflow-Step-By-Step workshop
- Includes SageMaker TF Getting Started, Horovod, and distributed data parallelism
- https://github.com/gonsoomoon-ml/SageMaker-Tensorflow-Step-By-Step
- SageMaker Debugger, Profiler, and Data Distributed Parallelism
PyTorch examples
- SageMaker-PyTorch-Step-By-Step
- Includes SageMaker PyTorch Getting Started, Horovod, and distributed data parallelism
- https://github.com/gonsoomoon-ml/SageMaker-PyTorch-Step-By-Step
- Distributed data parallelism using the Oxford-IIIT Pet dataset
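The step-by-step notebooks above follow SageMaker's script-mode convention: the training container describes the cluster layout through environment variables, and each training script derives its host rank from them. A small stdlib-only sketch (the `SM_*` variable names are the documented SageMaker ones; the fallback defaults and the simulated two-host values are illustrative):

```python
import json
import os


def cluster_info():
    """Derive (number_of_hosts, host_rank) from the SM_HOSTS and
    SM_CURRENT_HOST variables SageMaker sets inside training containers."""
    hosts = json.loads(os.environ.get("SM_HOSTS", '["algo-1"]'))
    current = os.environ.get("SM_CURRENT_HOST", "algo-1")
    return len(hosts), sorted(hosts).index(current)


def channel_dir(channel="training"):
    """Local path where SageMaker mounts an input channel's data."""
    return os.environ.get(
        f"SM_CHANNEL_{channel.upper()}", f"/opt/ml/input/data/{channel}"
    )


# Simulate a two-host job (SageMaker normally sets these values)
os.environ["SM_HOSTS"] = '["algo-1", "algo-2"]'
os.environ["SM_CURRENT_HOST"] = "algo-2"
world, rank = cluster_info()
print(world, rank)  # → 2 1
```

A data-parallel launcher (Horovod or the SageMaker distributed data parallel library) performs this bookkeeping for you; the sketch only shows where the information comes from.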
- Distributed Training with TensorFlow on AWS
- SageMaker Prep Data (Image, Tabular, Text)
- Accelerate computer vision training using GPU preprocessing with NVIDIA DALI on Amazon SageMaker