Support kubernetes runtime for launching jobs #4316

epicfaace · 2022-11-30T21:04:42Z

Continuation of #4090 (it's the exact same as that PR).

AndrewJGaut · 2022-12-13T05:59:52Z

codalab/worker/runtime/kubernetes_runtime.py

+
+logger: logging.Logger = logging.getLogger(__name__)
+
+# https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.22/#create-pod-v1-core


Remove this?

AndrewJGaut · 2022-12-13T06:00:29Z

codalab/worker/runtime/kubernetes_runtime.py

+            docker.errors.ImageNotFound if the CUDA image cannot be pulled
+            docker.errors.APIError if another server error occurs
+        """
+        return {}


I assume you're planning on adding this in a later PR?

Yes, for GPU support.

AndrewJGaut · 2022-12-13T06:02:57Z

codalab/worker/runtime/kubernetes_runtime.py

+
+    def get_container_stats(self, pod_name: str):
+        # TODO: implement
+        return {}


Also assuming you're saving this for a later PR?

AndrewJGaut · 2022-12-13T06:03:11Z

codalab/worker/runtime/kubernetes_runtime.py

+    def get_container_stats_with_docker_stats(self, pod_name: str):
+        """Returns the cpu usage and memory limit of a container using the Docker Stats API."""
+        # TODO: implement
+        return 0.0, 0


AndrewJGaut · 2022-12-13T06:06:06Z

codalab_service.py

@@ -326,7 +326,8 @@ def has_callable_default(self):
        name='worker_manager_worker_checkin_frequency_seconds',
        help='Number of seconds to wait between check-ins for a worker of the worker manager',
        type=int,
-        default=20,
+        # TODO(Ashwin): Change this back to 20 once we get websockets working with kubernetes workers.
+        default=5,


Percy asked to keep this at 5 regardless, right? Just so we still have frequent check-ins?

AndrewJGaut · 2022-12-13T06:07:00Z

codalab_service.py

@@ -445,7 +447,7 @@ def has_callable_default(self):
        CodalabArg(
            name='worker_manager_{}_tag'.format(worker_manager_type),
            help='Tag of worker for {} jobs'.format(worker_manager_type),
-            default='codalab-{}'.format(worker_manager_type),
+            default='',


Why remove this?

I think otherwise you can't pass in an empty tag: it will always fall back to the default value (codalab-cpu or codalab-gpu).
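
A minimal sketch of the concern, assuming the default is applied with a truthiness check (`value or default`). The `resolve_tag` helper is hypothetical and is not CodaLab's actual argument handling; it only shows why a non-empty default can make an empty tag impossible to pass.

```python
from typing import Optional


def resolve_tag(value: Optional[str], default: str) -> str:
    """Hypothetical resolver: falls back to the default for any falsy value."""
    # An empty string is falsy, so it cannot be told apart from "not provided".
    return value or default


# With a non-empty default, an explicitly empty tag is silently replaced.
print(resolve_tag('', 'codalab-cpu'))  # codalab-cpu

# With an empty default, the empty tag survives.
print(repr(resolve_tag('', '')))  # ''
```

If the real resolution logic distinguishes `None` from `''`, the non-empty default would not cause this problem; that would need to be checked against the actual code.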

AndrewJGaut · 2022-12-13T06:08:57Z

tests/unit/rest/bundles_test.py

@@ -42,6 +42,7 @@ def test_create(self):
        bundle_id = data[0]["id"]
        data[0]["attributes"].pop("state")
        data[0]["attributes"].pop("state_details")
+        data[0]["attributes"]["metadata"].pop("staged_status", None)


Is this to fix the test that fails intermittently? I've run into that issue, too.

Yes, I think we should add this in. Because state is nondeterministic, we pop it. And though @leilenah added staged_status in an earlier PR, we didn't update this test, so we should also pop staged_status (since staged_status is linked to state).
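
The pattern described here can be sketched as follows. The `strip_nondeterministic` helper and the sample data are illustrative assumptions, not code from the test suite; the point is that `pop(key, None)` removes a field when present and does nothing when it is absent.

```python
import copy


def strip_nondeterministic(attributes: dict) -> dict:
    """Return a copy of the attributes without fields that vary between runs."""
    cleaned = copy.deepcopy(attributes)
    # Passing None as the default means a missing key does not raise KeyError.
    cleaned.pop("state", None)
    cleaned.pop("state_details", None)
    cleaned.get("metadata", {}).pop("staged_status", None)
    return cleaned


# Illustrative data only; real bundle attributes contain more fields.
attributes = {
    "state": "created",
    "state_details": "",
    "metadata": {"name": "example", "staged_status": "pending"},
}
print(strip_nondeterministic(attributes))  # {'metadata': {'name': 'example'}}
```

Using the two-argument form of `pop` keeps the test stable whether or not `staged_status` is present in a given response.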

AndrewJGaut · 2022-12-13T06:10:05Z

codalab/worker_manager/kubernetes_worker_manager.py

@@ -152,7 +168,8 @@ def start_worker_job(self) -> None:
        }

        # Start a worker pod on the k8s cluster
-        logger.debug('Starting worker {} with image {}'.format(worker_id, worker_image))
+        logger.error('Starting worker {} with image {}'.format(worker_id, worker_image))
+        print('starting...')


Reminder to remove this
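
One way to see the message locally without promoting it to `logger.error` or adding `print` is to lower the logging level while debugging. This is a minimal sketch using only the standard library; the logger name and the `start_worker` function are hypothetical stand-ins for the worker manager code.

```python
import logging

# Hypothetical logger name, used here only for illustration.
logger = logging.getLogger("worker_manager_example")


def start_worker(worker_id: str, worker_image: str) -> None:
    # Routine progress messages stay at DEBUG level.
    logger.debug('Starting worker %s with image %s', worker_id, worker_image)


# While debugging, show DEBUG messages instead of changing the call site.
logging.basicConfig(level=logging.DEBUG)
start_worker('worker-1', 'example/image:latest')
```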

epicfaace added 30 commits April 30, 2022 17:18

initial change

3cfbee3

fixes, kind config

e692925

local setup

c3e3251

fix setup issue

dc36209

fix

d937e77

init

5702a73

update

f9766cd

update

384a721

Merge branch 'k8s' into k8s-runtime

18ee3c9

imagemanager, args

8dd841e

finish scaffolding

9e369d2

update

8a010a3

fix

a4f8525

fix

6edf5f4

Merge branch 'k8s' into k8s-runtime

458cfb6

fix

16c6522

CI

2bae8e4

update

fa55fa7

Add start container code

1a9f5ff

updates

61174ac

fixes

89f7937

update

09623df

fixes

96897e8

fixes

2522dc8

fix

38459b0

fixes

e5e66d0

update

bcf6f1f

Update setup-ci.sh

a59a441

Update test.yml

4ffc83c

Update setup-ci.sh

f35fa75

epicfaace added 8 commits December 7, 2022 19:32

Update slurm_batch_worker_manager_test.py

03ba7a8

don't use --ws-server arg for worker manager

86336e5

Merge branch 'k8s-runtime' of github.com:codalab/codalab-worksheets i…

9ef942a

…nto k8s-runtime

Update slurm_batch_worker_manager_test.py

d431cfb

Update main.py

d9f5421

fix

e15a67c

fmt

1c88571

Simplify setup

27d475e

epicfaace requested a review from AndrewJGaut December 8, 2022 03:52

epicfaace and others added 14 commits December 8, 2022 14:40

ws fixes, remove $(hostname -s), fixes #4332

7752da5

fix docs

602f9c7

fix logging

094ebce

fix test

463677d

test cl, logs

f331191

print container_id

ae5151d

fmt

e5addde

fix

925246b

better logging

80e4174

Update kubernetes_runtime.py

d05e15f

Update worker.py

7e5a2aa

Revert "Update worker.py"

c690769

This reverts commit 7e5a2aa.

Fix

034ae22

Merge branch 'master' into k8s-runtime

d7092d6

AndrewJGaut requested changes Dec 13, 2022

code review changes

fab9de7

epicfaace requested a review from AndrewJGaut December 13, 2022 16:33

AndrewJGaut approved these changes Dec 13, 2022

epicfaace merged commit d351485 into master Dec 13, 2022

epicfaace deleted the k8s-runtime branch December 13, 2022 18:27

AndrewJGaut mentioned this pull request Dec 21, 2022

Rc1.5.13 #4321

Merged
