Fix GPU Discovery. #4522

Merged
merged 3 commits into from
Aug 29, 2023

Conversation

AndrewJGaut
Contributor

Recently, the image we used to run nvidia-smi for GPU discovery was deprecated. To address this, a PR was made to update the image; however, that change broke GPU discovery because the nvidia-smi output changed (it now included a header).

This PR does the following:
(1) Fix GPU discovery so that GPU workers will now be able to run
(2) Use a smaller NVIDIA/CUDA image
(3) Downgrade the CUDA version.

Regarding (2), the official NVIDIA Dockerhub page includes the following blurb at the time this PR was created:

Overview of Images
Three flavors of images are provided:

base: Includes the CUDA runtime (cudart)
runtime: Builds on the base and includes the [CUDA math libraries](https://developer.nvidia.com/gpu-accelerated-libraries), and [NCCL](https://developer.nvidia.com/nccl). A runtime image that also includes [cuDNN](https://developer.nvidia.com/cudnn) is available.
devel: Builds on the runtime and includes headers, development tools for building CUDA images. These images are particularly useful for multi-stage builds.

Thus, the base image is the smallest. Since we need only run nvidia-smi, which is supported by all three images, it is best to use the smallest possible image to minimize download and container startup times.

Regarding (3), at the time of this PR, the Stanford NLP machines use CUDA version 11.5. Using an image with CUDA 12.2 yields an error since the machines do not support that version of CUDA. Thus, I downgraded the version.
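As a rough illustration of the constraint behind (3), here is a minimal sketch of a version check. The helper names `parse_version` and `image_cuda_supported` are hypothetical and not part of this PR; the sketch assumes the simple rule that an image's CUDA major.minor version must not exceed the highest version the host supports.

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted version string such as "11.5.2" into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def image_cuda_supported(image_cuda: str, host_max_cuda: str) -> bool:
    """Return True if the image's CUDA major.minor is at most the host's.

    Assumption: only major.minor matter, so patch components are ignored.
    """
    return parse_version(image_cuda)[:2] <= parse_version(host_max_cuda)[:2]


# The situation described above: a 12.2 image on machines that support 11.5.
print(image_cuda_supported("12.2", "11.5"))
print(image_cuda_supported("11.5.2", "11.5"))
```

A check like this could run before pulling an image, so an incompatible tag is rejected early instead of failing inside the container.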

@AndrewJGaut
Contributor Author

I should note: we end up not needing any extra regex parsing (like that used here) because the headers are no longer added. I believe this is because we aren't using a devel image and, as the Dockerhub page states, only those images include "headers" (which I think may be the copyright header we were getting when running nvidia-smi with that devel image).
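For reference, a minimal sketch of what header-tolerant parsing could look like if it were ever needed again. It assumes the discovery command is something like `nvidia-smi --query-gpu=index,uuid --format=csv,noheader`; the exact command used by the worker is not stated in this PR. The function name `parse_gpu_listing` and the sample output are hypothetical.

```python
import re

# Matches lines such as "0, GPU-8f6d3c1a-..." and nothing else.
GPU_LINE = re.compile(r"^\s*(\d+)\s*,\s*(GPU-[0-9a-fA-F-]+)\s*$")


def parse_gpu_listing(output: str) -> dict:
    """Map GPU index to UUID, skipping any line that is not a GPU row."""
    gpus = {}
    for line in output.splitlines():
        match = GPU_LINE.match(line)
        if match:  # banner, copyright, and blank lines do not match
            gpus[match.group(1)] = match.group(2)
    return gpus


# Illustrative output only: a made-up banner followed by two GPU rows.
sample = """\
== SOME BANNER ==
Copyright notice line

0, GPU-8f6d3c1a-1111-2222-3333-444455556666
1, GPU-0a1b2c3d-aaaa-bbbb-cccc-ddddeeeeffff
"""
print(parse_gpu_listing(sample))
```

Matching only lines that look like GPU rows, rather than stripping a known header, keeps the parser working whether or not the image prints a banner.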

@wwwjn merged commit 528db0f into master on Aug 29, 2023
@wwwjn deleted the fix-gpu-discovery branch on August 29, 2023 01:38
@yifanmai mentioned this pull request on Aug 29, 2023