Support kubernetes runtime for launching jobs #4316

Merged
merged 154 commits into from
Dec 13, 2022
Changes from 128 commits
3cfbee3
initial change
epicfaace Apr 30, 2022
e692925
fixes, kind config
epicfaace Apr 30, 2022
c3e3251
local setup
epicfaace Apr 30, 2022
dc36209
fix setup issue
epicfaace Apr 30, 2022
d937e77
fix
epicfaace Apr 30, 2022
5702a73
init
epicfaace Apr 30, 2022
f9766cd
update
epicfaace Apr 30, 2022
384a721
update
epicfaace Apr 30, 2022
18ee3c9
Merge branch 'k8s' into k8s-runtime
epicfaace Apr 30, 2022
8dd841e
imagemanager, args
epicfaace Apr 30, 2022
9e369d2
finish scaffolding
epicfaace Apr 30, 2022
8a010a3
update
epicfaace Apr 30, 2022
a4f8525
fix
epicfaace Apr 30, 2022
6edf5f4
fix
epicfaace Apr 30, 2022
458cfb6
Merge branch 'k8s' into k8s-runtime
epicfaace Apr 30, 2022
16c6522
fix
epicfaace Apr 30, 2022
2bae8e4
CI
epicfaace Apr 30, 2022
fa55fa7
update
epicfaace Apr 30, 2022
1a9f5ff
Add start container code
epicfaace Apr 30, 2022
61174ac
updates
epicfaace May 1, 2022
89f7937
fixes
epicfaace May 1, 2022
09623df
update
epicfaace May 1, 2022
96897e8
fixes
epicfaace May 1, 2022
2522dc8
fixes
epicfaace May 1, 2022
38459b0
fix
epicfaace May 1, 2022
e5e66d0
fixes
epicfaace May 1, 2022
bcf6f1f
update
epicfaace May 1, 2022
a59a441
Update setup-ci.sh
epicfaace May 1, 2022
4ffc83c
Update test.yml
epicfaace May 1, 2022
f35fa75
Update setup-ci.sh
epicfaace May 1, 2022
c8fda8e
update
epicfaace May 4, 2022
585754d
Merge branch 'k8s-runtime' of github.com:codalab/codalab-worksheets i…
epicfaace May 4, 2022
0c0e45d
Update Server-Setup.md
epicfaace May 11, 2022
a63311f
Merge branch 'master' of github.com:codalab/codalab-worksheets into k…
epicfaace May 17, 2022
04f29ac
fix kind config, move docs location
epicfaace May 31, 2022
b58e5de
Merge branch 'master' of github.com:codalab/codalab-worksheets into k…
epicfaace May 31, 2022
a644d00
k8s: organize docs, fix version, add CI setup
epicfaace May 31, 2022
17eedfe
Merge
epicfaace May 31, 2022
ca989eb
Update kubernetes_runtime.py
epicfaace Jul 7, 2022
c0b0fab
update
epicfaace Jul 7, 2022
b19196d
hardcode gpus to 1
epicfaace Jul 21, 2022
f4153be
mount cert path
epicfaace Jul 21, 2022
95c44c4
Merge branch 'master' of github.com:codalab/codalab-worksheets into k…
epicfaace Aug 10, 2022
4fcea91
updates
epicfaace Aug 10, 2022
03eb226
fix
epicfaace Aug 10, 2022
c911cf8
Undo changes
epicfaace Aug 10, 2022
7fb5602
todo
epicfaace Aug 16, 2022
2c0b4b9
fix
epicfaace Aug 17, 2022
9ba6908
fmt
epicfaace Aug 17, 2022
141b72a
Merge branch 'master' into k8s-runtime
epicfaace Oct 12, 2022
e821969
hi
epicfaace Oct 12, 2022
144365e
netcat / netcurl not supported for kubernetes.
epicfaace Oct 12, 2022
da345d1
Merge branch 'master' of github.com:codalab/codalab-worksheets into k…
epicfaace Oct 26, 2022
775e224
rename to NoOpImageManager
epicfaace Oct 26, 2022
5fd71fb
fix port
epicfaace Oct 26, 2022
d8bc86b
rm default value
epicfaace Oct 26, 2022
bace18a
Fix port issues
epicfaace Nov 2, 2022
508f9d6
fix
epicfaace Nov 2, 2022
7b56b52
Use kind instead of k8s, revert #4214
epicfaace Nov 9, 2022
724a83c
fix
epicfaace Nov 9, 2022
c5de9b7
Merge branch 'master' into k8s-runtime
epicfaace Nov 30, 2022
3b59bb3
fix
epicfaace Nov 30, 2022
0ede801
Merge branch 'k8s-runtime' of github.com:codalab/codalab-worksheets i…
epicfaace Nov 30, 2022
9f3e134
cmt out
epicfaace Nov 30, 2022
d1c6b44
fix
epicfaace Nov 30, 2022
b634d11
save kubernetes logs
epicfaace Nov 30, 2022
148a31c
update
epicfaace Nov 30, 2022
7ccaaba
fix
epicfaace Nov 30, 2022
cc72206
test
epicfaace Nov 30, 2022
e03f952
Update test.yml
epicfaace Nov 30, 2022
0e93f74
Merge branch 'master' into k8s-runtime
epicfaace Dec 1, 2022
97a0802
Merge branch 'master' of github.com:codalab/codalab-worksheets into k…
epicfaace Dec 1, 2022
b90e4b2
Load worker Docker image
epicfaace Dec 1, 2022
edb1357
Fix version
epicfaace Dec 1, 2022
544a202
Update setup-ci.sh
epicfaace Dec 1, 2022
1aadac3
Update setup-ci.sh
epicfaace Dec 1, 2022
8652a24
fix
epicfaace Dec 1, 2022
dba3266
Update setup-ci.sh
epicfaace Dec 1, 2022
a8e697d
rm bash
epicfaace Dec 2, 2022
215dbcf
Merge branch 'master' into k8s-runtime
epicfaace Dec 3, 2022
bfca65d
Merge branch 'k8s-runtime' of github.com:codalab/codalab-worksheets i…
epicfaace Dec 3, 2022
377f193
cmt
epicfaace Dec 3, 2022
51bc915
Add back in logs for kubernetes pods
AndrewJGaut Dec 4, 2022
6e0e862
Comment out logging message that was crowding the worker logs on Gith…
AndrewJGaut Dec 4, 2022
72e4e00
Suppress one more warning/exception
AndrewJGaut Dec 4, 2022
e1b1a60
quick test to see if fixes issue
AndrewJGaut Dec 4, 2022
6530d7f
Revert "quick test to see if fixes issue"
epicfaace Dec 4, 2022
e0c046a
fmt
epicfaace Dec 4, 2022
a44b31c
try
epicfaace Dec 4, 2022
09a06bc
Update ws_server.py
epicfaace Dec 5, 2022
39984c1
Update
epicfaace Dec 6, 2022
782a41c
git pull originMerge branch 'k8s-runtime' of github.com:codalab/codal…
epicfaace Dec 6, 2022
3a979c3
Merge branch 'master' of github.com:codalab/codalab-worksheets into k…
epicfaace Dec 6, 2022
b3c9b2b
last_state -> state
epicfaace Dec 6, 2022
e6980b8
fix
epicfaace Dec 6, 2022
82b024c
Fix
epicfaace Dec 6, 2022
e1f4c62
replace with $(hostname -s)
epicfaace Dec 6, 2022
15bd7ea
uncomment test
epicfaace Dec 6, 2022
93e3990
fix
epicfaace Dec 6, 2022
a4a1a31
Fix typo
epicfaace Dec 6, 2022
655eb02
Fix
epicfaace Dec 6, 2022
60a0db7
fix
epicfaace Dec 6, 2022
f420746
Update kubernetes_runtime.py
epicfaace Dec 6, 2022
17c601a
Update kubernetes_runtime.py
epicfaace Dec 6, 2022
851a05e
Update kubernetes_runtime.py
epicfaace Dec 7, 2022
0a124fa
Update kubernetes_runtime.py
epicfaace Dec 7, 2022
657e92e
Update kubernetes_runtime.py
epicfaace Dec 7, 2022
4c9fdd2
Update kubernetes_runtime.py
epicfaace Dec 7, 2022
17cdcd8
Update kubernetes_runtime.py
epicfaace Dec 7, 2022
557881d
Update kubernetes_runtime.py
epicfaace Dec 7, 2022
1ad991a
Update kubernetes_runtime.py
epicfaace Dec 7, 2022
246623f
Modify tests to use macos. This is because the default vms for macos …
AndrewJGaut Dec 7, 2022
c9bd753
Revert commit
AndrewJGaut Dec 7, 2022
f88f386
Update test.yml
epicfaace Dec 7, 2022
8d9d61b
Save *all* logs
epicfaace Dec 7, 2022
b23d0ae
faster CI
epicfaace Dec 7, 2022
ba4a85d
update
epicfaace Dec 7, 2022
270ffe5
Update kubernetes_runtime.py
epicfaace Dec 7, 2022
cca5cb2
dumps
epicfaace Dec 7, 2022
aaa68f8
fix
epicfaace Dec 7, 2022
33d72bc
Better test
epicfaace Dec 7, 2022
d4f3cd6
fix
epicfaace Dec 7, 2022
2389cde
longer timeout
epicfaace Dec 7, 2022
c9b9d2b
fix
epicfaace Dec 7, 2022
10ccb8f
Fix CPU requests
epicfaace Dec 7, 2022
03f3963
It worked! Uncomment everything
epicfaace Dec 7, 2022
0711e64
Increase checkin freq for now
epicfaace Dec 7, 2022
ba5489d
Add todo
epicfaace Dec 7, 2022
57436a6
Update Development-Setup.md
epicfaace Dec 7, 2022
1cf893e
Fix: properly pass in --ws-server
epicfaace Dec 8, 2022
2579f7b
Merge branch 'k8s-runtime' of github.com:codalab/codalab-worksheets i…
epicfaace Dec 8, 2022
03ba7a8
Update slurm_batch_worker_manager_test.py
epicfaace Dec 8, 2022
86336e5
don't use --ws-server arg for worker manager
epicfaace Dec 8, 2022
9ef942a
Merge branch 'k8s-runtime' of github.com:codalab/codalab-worksheets i…
epicfaace Dec 8, 2022
d431cfb
Update slurm_batch_worker_manager_test.py
epicfaace Dec 8, 2022
d9f5421
Update main.py
epicfaace Dec 8, 2022
e15a67c
fix
epicfaace Dec 8, 2022
1c88571
fmt
epicfaace Dec 8, 2022
27d475e
Simplify setup
epicfaace Dec 8, 2022
7752da5
ws fixes, remove $(hostname -s), fixes https://github.com/codalab/cod…
epicfaace Dec 8, 2022
602f9c7
fix docs
epicfaace Dec 8, 2022
094ebce
fix logging
epicfaace Dec 8, 2022
463677d
fix test
epicfaace Dec 8, 2022
f331191
test cl, logs
epicfaace Dec 8, 2022
ae5151d
print container_id
epicfaace Dec 8, 2022
e5addde
fmt
epicfaace Dec 8, 2022
925246b
fix
epicfaace Dec 8, 2022
80e4174
better logging
epicfaace Dec 8, 2022
d05e15f
Update kubernetes_runtime.py
epicfaace Dec 11, 2022
7e5a2aa
Update worker.py
epicfaace Dec 11, 2022
c690769
Revert "Update worker.py"
epicfaace Dec 12, 2022
034ae22
Fix
epicfaace Dec 12, 2022
d7092d6
Merge branch 'master' into k8s-runtime
AndrewJGaut Dec 13, 2022
fab9de7
code review changes
epicfaace Dec 13, 2022
20 changes: 16 additions & 4 deletions .github/workflows/test.yml
@@ -113,11 +113,16 @@ jobs:
           - search link read kill write mimic workers edit_user sharing_workers
           - resources
           - memoize
-          - copy netcat netcurl
+          - copy
+          - netcat netcurl
           - edit
           - open wopen
           - store_add
         runtime: [docker, kubernetes]
+        exclude:
+          # netcat / netcurl not supported for kubernetes.
+          - test: netcat netcurl
+            runtime: kubernetes
     steps:
       - name: Clear free space
         run: |
@@ -150,14 +155,16 @@ jobs:
           TEST: ${{ matrix.test }}
           VERSION: ${{ github.head_ref || 'master' }}
           CODALAB_LINK_MOUNTS: /tmp
-      - uses: medyagh/setup-minikube@latest
+      - uses: actions/setup-go@v3
         if: matrix.runtime == 'kubernetes'
+        with:
+          go-version: '1.18.1'
       - name: Run tests using Kubernetes runtime
         if: matrix.runtime == 'kubernetes'
         run: |
           sh ./tests/test-setup.sh
           sh ./scripts/local-k8s/setup-ci.sh
-          #python3 test_runner.py --version ${VERSION} ${TEST}
+          python3 test_runner.py --version ${VERSION} ${TEST}
         env:
           TEST: ${{ matrix.test }}
           VERSION: ${{ github.head_ref || 'master' }}
@@ -167,11 +174,16 @@ jobs:
         run: |
           mkdir /tmp/logs
           for c in $(docker ps -a --format="{{.Names}}"); do docker logs $c > /tmp/logs/$c.log 2> /tmp/logs/$c.err.log; done
+      - name: Save kubernetes logs
+        if: always() && matrix.runtime == 'kubernetes'
+        run: |
+          kubectl config use-context kind-codalab
+          kubectl cluster-info dump --output-directory /tmp/logs
       - name: Upload logs
         if: always()
         uses: actions/upload-artifact@v1
         with:
-          name: logs-test-${{ matrix.test }}
+          name: logs-test-${{ matrix.runtime }}-${{ matrix.test }}
           path: /tmp/logs
 
   test_backend_on_worker_restart:
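
The `exclude` entry added to the test matrix in this workflow drops a single test/runtime pair from the job grid. As a rough illustration of how that expansion behaves, here is a small Python sketch. The `expand_matrix` helper is hypothetical (it is not part of this PR or of GitHub Actions), and the test list is abbreviated to a few of the entries visible in the diff.

```python
from itertools import product
from typing import Dict, List


def expand_matrix(
    axes: Dict[str, List[str]], exclude: List[Dict[str, str]]
) -> List[Dict[str, str]]:
    """Build every combination of the axes, then drop excluded ones.

    A combination is dropped when all key/value pairs of some exclude
    entry match it, which mirrors how a matrix `exclude` is applied.
    """
    keys = list(axes)
    combos = [dict(zip(keys, values)) for values in product(*axes.values())]
    return [
        combo
        for combo in combos
        if not any(all(combo.get(k) == v for k, v in rule.items()) for rule in exclude)
    ]


# Abbreviated copy of the matrix from the workflow diff (assumption: only
# three of the test groups are listed here to keep the output short).
axes = {
    "test": ["copy", "netcat netcurl", "edit"],
    "runtime": ["docker", "kubernetes"],
}
exclude = [{"test": "netcat netcurl", "runtime": "kubernetes"}]

jobs = expand_matrix(axes, exclude)
for job in jobs:
    print(job["runtime"], "-", job["test"])
```

Splitting `copy netcat netcurl` into two matrix entries is what allows only the netcat/netcurl group to be skipped on Kubernetes while `copy` still runs on both runtimes.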
1 change: 1 addition & 0 deletions codalab/bin/ws_server.py
@@ -8,6 +8,7 @@
 
 logger = logging.getLogger(__name__)
 logger.setLevel(logging.WARNING)
+logging.basicConfig(format='%(asctime)s %(message)s %(pathname)s %(lineno)d')
 
 worker_to_ws: Dict[str, Any] = {}
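
To see what the new `basicConfig` format string produces, the following sketch attaches the same format to an in-memory handler. The logger name `ws_server_format_demo` and the sample message are made up for illustration; only the format string comes from the diff.

```python
import io
import logging

# Same format string as the one added to ws_server.py.
FORMAT = '%(asctime)s %(message)s %(pathname)s %(lineno)d'

# Write to an in-memory stream so the formatted record can be inspected.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(FORMAT))

demo_logger = logging.getLogger('ws_server_format_demo')  # hypothetical name
demo_logger.setLevel(logging.WARNING)
demo_logger.addHandler(handler)
demo_logger.propagate = False  # keep the demo output out of the root logger

demo_logger.warning('worker %s disconnected', 'worker-1')
line = stream.getvalue().strip()
print(line)
```

Each record ends with the source file path and line number of the logging call, which helps trace where a websocket server message originated.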

5 changes: 5 additions & 0 deletions codalab/common.py
@@ -44,6 +44,10 @@
 logger.setLevel(logging.WARNING)
 logger = logging.getLogger('apache_beam')
 logger.setLevel(logging.WARNING)
+logger = logging.getLogger('kubernetes')
+logger.setLevel(logging.WARNING)
+logger = logging.getLogger('urllib3')
+logger.setLevel(logging.ERROR)
 
 
 class IntegrityError(ValueError):
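
The logger thresholds added in this hunk quiet two chatty libraries. A compact way to express the same idea is sketched below; the `NOISY_LOGGERS` mapping is a hypothetical name, while the logger names and levels are the ones in the diff.

```python
import logging

# Logger names and thresholds taken from the hunk above.
NOISY_LOGGERS = {
    'kubernetes': logging.WARNING,
    'urllib3': logging.ERROR,
}

for name, level in NOISY_LOGGERS.items():
    logging.getLogger(name).setLevel(level)

# Child loggers with no level of their own inherit the parent's threshold.
child = logging.getLogger('urllib3.connectionpool')
child_level = child.getEffectiveLevel()
print(logging.getLevelName(child_level))
```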
@@ -408,4 +412,5 @@ class BundleRuntime(Enum):
     """
 
     DOCKER = "docker"
+    KUBERNETES = "kubernetes"
     SINGULARITY = "singularity"
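
For readers unfamiliar with value-based enums, this sketch shows how the new `KUBERNETES` member can be looked up from a command-line string. The class below is a local copy of the three members shown in the diff, not an import from `codalab.common`.

```python
from enum import Enum


class BundleRuntime(Enum):
    """Local copy of the enum members shown in the hunk above."""

    DOCKER = "docker"
    KUBERNETES = "kubernetes"
    SINGULARITY = "singularity"


# Lookup by value converts a CLI string into an enum member.
runtime = BundleRuntime("kubernetes")
is_k8s = runtime is BundleRuntime.KUBERNETES

# All values in definition order, e.g. for an argparse `choices` list.
choices = [member.value for member in BundleRuntime]
print(is_k8s, choices)
```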
41 changes: 34 additions & 7 deletions codalab/worker/main.py
@@ -24,6 +24,8 @@
 from codalab.worker.dependency_manager import DependencyManager
 from codalab.worker.docker_image_manager import DockerImageManager
 from codalab.worker.singularity_image_manager import SingularityImageManager
+from codalab.worker.noop_image_manager import NoOpImageManager
+from codalab.worker.runtime.kubernetes_runtime import KubernetesRuntime
 
 logger = logging.getLogger(__name__)

@@ -131,9 +133,13 @@ def parse_args():
     )
     parser.add_argument(
         '--bundle-runtime',
-        choices=[BundleRuntime.DOCKER.value, BundleRuntime.SINGULARITY.value,],
+        choices=[
+            BundleRuntime.DOCKER.value,
+            BundleRuntime.KUBERNETES.value,
+            BundleRuntime.SINGULARITY.value,
+        ],
         default=BundleRuntime.DOCKER.value,
-        help='The runtime through which the worker will run bundles. The options are docker (default) or singularity',
+        help='The runtime through which the worker will run bundles. The options are docker (default), kubernetes, or singularity',
     )
     parser.add_argument(
         '--idle-seconds',
@@ -201,6 +207,21 @@ def parse_args():
     parser.add_argument(
         '--preemptible', action='store_true', help='Whether the worker is preemptible.',
     )
+    parser.add_argument(
+        '--kubernetes-cluster-host',
+        type=str,
+        help='Host address of the Kubernetes cluster. Only applicable if --bundle-runtime is set to kubernetes.',
+    )
+    parser.add_argument(
+        '--kubernetes-auth-token',
+        type=str,
+        help='Kubernetes cluster authorization token. Only applicable if --bundle-runtime is set to kubernetes.',
+    )
+    parser.add_argument(
+        '--kubernetes-cert-path',
+        type=str,
+        help='Path to the SSL cert for the Kubernetes cluster. Only applicable if --bundle-runtime is set to kubernetes.',
+    )
     return parser.parse_args()
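
The three `--kubernetes-*` flags are optional at the parser level and, per their help text, only apply when `--bundle-runtime` is `kubernetes`. The sketch below reproduces just those flags and adds a hypothetical `missing_kubernetes_args` check; that helper and the example host URL are assumptions for illustration and do not appear in the PR.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Reduced parser containing only the flags relevant to this hunk."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--bundle-runtime',
        choices=['docker', 'kubernetes', 'singularity'],
        default='docker',
    )
    parser.add_argument('--kubernetes-cluster-host', type=str)
    parser.add_argument('--kubernetes-auth-token', type=str)
    parser.add_argument('--kubernetes-cert-path', type=str)
    return parser


def missing_kubernetes_args(args: argparse.Namespace) -> List[str]:
    """Hypothetical helper: list Kubernetes flags that were left unset.

    Returns an empty list for any runtime other than kubernetes.
    """
    if args.bundle_runtime != 'kubernetes':
        return []
    required = {
        '--kubernetes-cluster-host': args.kubernetes_cluster_host,
        '--kubernetes-auth-token': args.kubernetes_auth_token,
        '--kubernetes-cert-path': args.kubernetes_cert_path,
    }
    return [flag for flag, value in required.items() if value is None]


# Example invocation with a made-up cluster address.
args = build_parser().parse_args(
    ['--bundle-runtime', 'kubernetes', '--kubernetes-cluster-host', 'https://localhost:6443']
)
missing = missing_kubernetes_args(args)
print(missing)
```

A check along these lines would surface a missing token or cert path at startup instead of at the first Kubernetes API call. Whether the worker should fail fast in that case is a design decision for the maintainers.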


@@ -297,6 +318,15 @@ def main():
         # todo workers with singularity don't work because this is set to none -- handle this
         bundle_runtime_class = None
         docker_runtime = None
+    elif args.bundle_runtime == BundleRuntime.KUBERNETES.value:
+        image_manager = NoOpImageManager()
+        bundle_runtime_class = KubernetesRuntime(
+            args.work_dir,
+            args.kubernetes_auth_token,
+            args.kubernetes_cluster_host,
+            args.kubernetes_cert_path,
+        )
+        docker_runtime = None
     else:
         image_manager = DockerImageManager(
             os.path.join(args.work_dir, 'images-state.json'),
@@ -332,7 +362,7 @@ def main():
         args.checkin_frequency_seconds,
         bundle_service,
         args.shared_file_system,
-        args.tag_exclusive,
+        args.tag_exclusive if args.tag else False,
         args.group,
         ws_server=args.ws_server,
         docker_runtime=docker_runtime,
@@ -342,7 +372,7 @@ def main():
         exit_on_exception=args.exit_on_exception,
         shared_memory_size_gb=args.shared_memory_size_gb,
         preemptible=args.preemptible,
-        bundle_runtime=DockerRuntime(),
+        bundle_runtime=bundle_runtime_class,
     )
 
     # Register a signal handler to ensure safe shutdown.
@@ -394,9 +424,6 @@ def parse_gpuset_args(arg):
     """
     Parse given arg into a set of strings representing gpu UUIDs
     By default, we will try to start a Docker container with nvidia-smi to get the GPUs.
-    If we get an exception that the Docker socket does not exist, which will be the case
-    on Singularity workers, because they do not have root access, and therefore, access to
-    the Docker socket, we should try to get the GPUs with Singularity.
 
     Arguments:
         arg: comma separated string of ints, or "ALL" representing all gpus
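
The runtime selection added to `main()` in this file passes the Kubernetes settings to `KubernetesRuntime` positionally. The sketch below shows the same branching with stand-in classes and keyword arguments. `StubKubernetesRuntime`, `StubDockerRuntime`, and `select_bundle_runtime` are hypothetical names, and the real `KubernetesRuntime` constructor signature is not shown in this part of the diff, so the field names are assumptions.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Optional


@dataclass
class StubKubernetesRuntime:
    """Stand-in for KubernetesRuntime; the real constructor is not shown here."""

    work_dir: str
    auth_token: Optional[str]
    cluster_host: Optional[str]
    cert_path: Optional[str]


class StubDockerRuntime:
    """Stand-in for DockerRuntime."""


def select_bundle_runtime(args):
    """Pick a runtime object from the parsed --bundle-runtime value."""
    if args.bundle_runtime == 'kubernetes':
        # Keyword arguments keep the three Kubernetes settings from being
        # swapped by accident; the diff passes them positionally.
        return StubKubernetesRuntime(
            work_dir=args.work_dir,
            auth_token=args.kubernetes_auth_token,
            cluster_host=args.kubernetes_cluster_host,
            cert_path=args.kubernetes_cert_path,
        )
    if args.bundle_runtime == 'singularity':
        return None  # the diff leaves the runtime unset for Singularity
    return StubDockerRuntime()  # assumed default, matching the old behavior


# Example arguments; every value here is made up for the demonstration.
args = SimpleNamespace(
    bundle_runtime='kubernetes',
    work_dir='/tmp/codalab-worker',
    kubernetes_auth_token='example-token',
    kubernetes_cluster_host='https://localhost:6443',
    kubernetes_cert_path='/tmp/ca.crt',
)
runtime = select_bundle_runtime(args)
print(type(runtime).__name__, runtime.cluster_host)
```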
18 changes: 18 additions & 0 deletions codalab/worker/noop_image_manager.py
@@ -0,0 +1,18 @@
+from types import SimpleNamespace
+from codalab.worker.fsm import DependencyStage
+
+
+class NoOpImageManager:
+    """A "no-op" ImageManager. Doesn't do any downloading of images.
+    This is used by the Kubernetes runtime, because Kubernetes itself will take care of image downloading once
+    a pod is launched later.
+    """
+
+    def start(self):
+        pass
+
+    def stop(self):
+        pass
+
+    def get(self, image_spec: str):
+        return SimpleNamespace(stage=DependencyStage.READY, digest=image_spec)
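
To illustrate why a manager that always reports `READY` is enough for the Kubernetes runtime, here is a self-contained sketch of a caller checking the image state before starting a run. The `DependencyStage` class below is a stand-in with assumed values (the real one lives in `codalab.worker.fsm`), and `image_is_ready` is a hypothetical helper, not code from the worker.

```python
from types import SimpleNamespace


class DependencyStage:
    """Stand-in for codalab.worker.fsm.DependencyStage (values assumed)."""

    DOWNLOADING = 'DOWNLOADING'
    READY = 'READY'
    FAILED = 'FAILED'


class NoOpImageManager:
    """Same shape as the class in the diff, using the stand-in stage."""

    def start(self):
        pass

    def stop(self):
        pass

    def get(self, image_spec: str):
        return SimpleNamespace(stage=DependencyStage.READY, digest=image_spec)


def image_is_ready(image_manager, image_spec: str) -> bool:
    """Hypothetical caller: proceed only once the image reports READY."""
    state = image_manager.get(image_spec)
    return state.stage == DependencyStage.READY


manager = NoOpImageManager()
state = manager.get('ubuntu:20.04')
print(image_is_ready(manager, 'ubuntu:20.04'), state.digest)
```

Because `get` echoes the image spec back as the digest, a caller that passes the digest on to the runtime ends up handing Kubernetes the original image name, leaving the actual pull to the cluster.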
3 changes: 2 additions & 1 deletion codalab/worker/runtime/__init__.py
@@ -1,11 +1,12 @@
 from typing import Optional, Tuple
 import docker
+from kubernetes.client.rest import ApiException
 
 DEFAULT_RUNTIME = 'runc'  # copied from docker_utils to avoid a circular import
 
 
 # Any errors that relate to runtime API calls failing.
-RuntimeAPIError = (docker.errors.APIError,)
+RuntimeAPIError = (docker.errors.APIError, ApiException)
 
 
 class Runtime:
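
`RuntimeAPIError` is a tuple of exception classes, so a single `except RuntimeAPIError` clause catches failures from either backend. The sketch below demonstrates the pattern with stand-in exception classes so that it runs without the `docker` or `kubernetes` packages installed; the `Fake*` names and `describe_failure` are hypothetical.

```python
class FakeDockerAPIError(Exception):
    """Stand-in for docker.errors.APIError."""


class FakeKubernetesApiException(Exception):
    """Stand-in for kubernetes.client.rest.ApiException."""


# A tuple of exception classes can be used directly in an except clause.
RuntimeAPIError = (FakeDockerAPIError, FakeKubernetesApiException)


def describe_failure(call) -> str:
    """Run a runtime call and report API errors from either backend."""
    try:
        call()
    except RuntimeAPIError as e:
        return f'runtime API error: {e}'
    return 'ok'


def failing_k8s_call():
    raise FakeKubernetesApiException('pod not found')


result = describe_failure(failing_k8s_call)
print(result)
```

Code that already catches `RuntimeAPIError` picks up Kubernetes API failures without any change at the call site, which is presumably why the tuple was extended rather than adding a second exception name.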