Fix E2E Intel MPI integ tests #676

Merged: 1 commit, Jan 10, 2025

GonzaloSaez (Contributor) commented Jan 10, 2025

This should fix #675 by pinning the Intel MPI packages to 2021.13 instead of the 2021.14 version that was being installed in GitHub Actions. It is not clear to me what changed in 2021.14 to make these tests fail, but I don't think it should be a big deal. See https://github.com/GonzaloSaez/mpi-operator/actions/runs/12709381529/job/35428272102 for an E2E build run that shows the Intel MPI tests passing.

@@ -18,7 +18,7 @@ RUN apt update \
 && apt install -y --no-install-recommends \
 libstdc++-12-dev binutils procps clang \
 intel-oneapi-compiler-dpcpp-cpp \
-intel-oneapi-mpi-devel \
+intel-oneapi-mpi-devel-2021.13 \
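
A pin like the one in the hunk above can be guarded so that a future image build fails loudly if a different release series ends up installed. The following is a minimal sketch, not part of this PR: the helper names `version_matches_pin` and `check_installed_pin` are hypothetical, and the sketch assumes the installed version string starts with the pinned series (for example `2021.13` followed by `.` or `-`), which has not been verified against Intel's packages.

```shell
# Hypothetical helper, not part of the PR: returns 0 when a version string
# belongs to the pinned release series (exact match, or the series followed
# by "." or "-"). Quoting "$2" makes the dots in the pin match literally.
version_matches_pin() {
  case "$1" in
    "$2"|"$2".*|"$2"-*) return 0 ;;
    *) return 1 ;;
  esac
}

# Assumed usage inside the image build: look up the installed version with
# dpkg-query and fail if it is outside the pinned series.
# Returns 2 when the package is not installed at all.
check_installed_pin() {
  pkg="$1"
  pin="$2"
  installed="$(dpkg-query -W -f='${Version}' "$pkg" 2>/dev/null)" || return 2
  version_matches_pin "$installed" "$pin"
}
```

In a build step this might be called as `check_installed_pin intel-oneapi-mpi-devel-2021.13 2021.13 || exit 1`. The check only confirms which series is installed; it does not explain why 2021.14 fails.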
Collaborator

I'm OK with this as a temporary solution, but eventually this version will go out of support or have CVEs, so we still need to figure out what's wrong with newer versions.

WDYT @tenzen-y?

Contributor Author

I could not find an exhaustive changelog for release 2021.14, but it seems they mention changing the default IPC exchange mechanism from drmfd to pidfd. It's not clear to me whether that could be the issue.
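
One cheap thing to rule out when comparing environments is the kernel version. The sketch below is a hypothetical diagnostic, not taken from this PR. It assumes the pidfd mechanism relies on the Linux `pidfd_open` and `pidfd_getfd` system calls (added in Linux 5.3 and 5.6 respectively), which the comment above does not confirm. The helper name `kernel_at_least` is made up for illustration.

```shell
# Hypothetical diagnostic, not taken from the PR: returns 0 when a kernel
# release string (as printed by `uname -r`) is at least MAJOR.MINOR.
kernel_at_least() {
  release="$1"
  want_major="$2"
  want_minor="$3"
  major="${release%%.*}"       # text before the first dot
  rest="${release#*.}"         # text after the first dot
  minor="${rest%%[!0-9]*}"     # leading digits of the remainder
  [ "$major" -gt "$want_major" ] && return 0
  [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]
}
```

For example, `kernel_at_least "$(uname -r)" 5 6` succeeds on a kernel new enough to provide `pidfd_getfd`. A passing check is only one precondition: a container's seccomp profile or ptrace restrictions could still block these calls, and nothing here establishes that pidfd is related to the test failures.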

Member

Thank you for this great investigation!
I'm OK with this as a short-term solution. Could we open an issue to upgrade to 2021.14?

Member

I'm curious why we face this issue only in GitHub Actions, though.

@alculquicondor
Collaborator

Could you squash all commits please?

tenzen-y (Member) left a comment

/lgtm
/approve

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

@google-oss-prow google-oss-prow bot merged commit c50eb45 into kubeflow:master Jan 10, 2025
10 checks passed
Successfully merging this pull request may close these issues.

Failed IntelMPI E2E tests
3 participants