Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support kubernetes runtime for launching jobs #4316

Merged
merged 154 commits into from
Dec 13, 2022
Merged

Support kubernetes runtime for launching jobs #4316

merged 154 commits into from
Dec 13, 2022

Conversation

epicfaace
Copy link
Member

@epicfaace epicfaace commented Nov 30, 2022

Continuation of #4090 (it's the exact same as that PR).

@AndrewJGaut can you approve?

@epicfaace epicfaace requested a review from AndrewJGaut December 8, 2022 03:52

logger: logging.Logger = logging.getLogger(__name__)

# https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.22/#create-pod-v1-core
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this?

docker.errors.ImageNotFound if the CUDA image cannot be pulled
docker.errors.APIError if another server error occurs
"""
return {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume you're planning on adding this in a later PR?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. For GPU support


def get_container_stats(self, pod_name: str):
# TODO: implement
return {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also assuming you're saving this for a later PR?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes :)

def get_container_stats_with_docker_stats(self, pod_name: str):
"""Returns the cpu usage and memory limit of a container using the Docker Stats API."""
# TODO: implement
return 0.0, 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes :)

@@ -326,7 +326,8 @@ def has_callable_default(self):
name='worker_manager_worker_checkin_frequency_seconds',
help='Number of seconds to wait between check-ins for a worker of the worker manager',
type=int,
default=20,
# TODO(Ashwin): Change this back to 20 once we get websockets working with kubernetes workers.
default=5,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Percy asked to keep this at 5 regardless, right? Just so we still have frequent checkins?

@@ -445,7 +447,7 @@ def has_callable_default(self):
CodalabArg(
name='worker_manager_{}_tag'.format(worker_manager_type),
help='Tag of worker for {} jobs'.format(worker_manager_type),
default='codalab-{}'.format(worker_manager_type),
default='',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think otherwise you can't pass in an empty tag, or it will always use the default value (codalab-cpu or codalab-gpu).

@@ -42,6 +42,7 @@ def test_create(self):
bundle_id = data[0]["id"]
data[0]["attributes"].pop("state")
data[0]["attributes"].pop("state_details")
data[0]["attributes"]["metadata"].pop("staged_status", None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this to fix that test that fails intermittently? Because I've had that issue, too

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think we should add this in. Because state is nondeterministic, we pop it. And though @leilenah added staged_status in an earlier PR, we didn't update this test, so we should also pop staged_status (since staged_status is linked to state).

@@ -152,7 +168,8 @@ def start_worker_job(self) -> None:
}

# Start a worker pod on the k8s cluster
logger.debug('Starting worker {} with image {}'.format(worker_id, worker_image))
logger.error('Starting worker {} with image {}'.format(worker_id, worker_image))
print('starting...')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reminder to remove this

@epicfaace epicfaace merged commit d351485 into master Dec 13, 2022
@epicfaace epicfaace deleted the k8s-runtime branch December 13, 2022 18:27
@AndrewJGaut AndrewJGaut mentioned this pull request Dec 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants