-
Notifications
You must be signed in to change notification settings - Fork 84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support kubernetes runtime for launching jobs #4316
Conversation
This reverts commit 7e5a2aa.
|
||
logger: logging.Logger = logging.getLogger(__name__) | ||
|
||
# https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.22/#create-pod-v1-core |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this?
docker.errors.ImageNotFound if the CUDA image cannot be pulled | ||
docker.errors.APIError if another server error occurs | ||
""" | ||
return {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume you're planning on adding this in a later PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. For GPU support
|
||
def get_container_stats(self, pod_name: str): | ||
# TODO: implement | ||
return {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also assuming you're saving this for a later PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes :)
def get_container_stats_with_docker_stats(self, pod_name: str): | ||
"""Returns the cpu usage and memory limit of a container using the Docker Stats API.""" | ||
# TODO: implement | ||
return 0.0, 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes :)
@@ -326,7 +326,8 @@ def has_callable_default(self): | |||
name='worker_manager_worker_checkin_frequency_seconds', | |||
help='Number of seconds to wait between check-ins for a worker of the worker manager', | |||
type=int, | |||
default=20, | |||
# TODO(Ashwin): Change this back to 20 once we get websockets working with kubernetes workers. | |||
default=5, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Percy asked to keep this at 5 regardless, right? Just so we still have frequent checkins?
@@ -445,7 +447,7 @@ def has_callable_default(self): | |||
CodalabArg( | |||
name='worker_manager_{}_tag'.format(worker_manager_type), | |||
help='Tag of worker for {} jobs'.format(worker_manager_type), | |||
default='codalab-{}'.format(worker_manager_type), | |||
default='', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why remove this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think otherwise you can't pass in an empty tag, or it will always use the default value (codalab-cpu or codalab-gpu).
@@ -42,6 +42,7 @@ def test_create(self): | |||
bundle_id = data[0]["id"] | |||
data[0]["attributes"].pop("state") | |||
data[0]["attributes"].pop("state_details") | |||
data[0]["attributes"]["metadata"].pop("staged_status", None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this to fix that test that fails intermittently? Because I've had that issue, too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think we should add this in. Because state
is nondeterministic, we pop it. And though @leilenah added staged_status
in an earlier PR, we didn't update this test, so we should also pop staged_status
(since staged_status
is linked to state
).
@@ -152,7 +168,8 @@ def start_worker_job(self) -> None: | |||
} | |||
|
|||
# Start a worker pod on the k8s cluster | |||
logger.debug('Starting worker {} with image {}'.format(worker_id, worker_image)) | |||
logger.error('Starting worker {} with image {}'.format(worker_id, worker_image)) | |||
print('starting...') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reminder to remove this
Continuation of #4090 (it's the exact same as that PR).
@AndrewJGaut can you approve?