
[Bug]: Building wheels failing on self hosted actions, working on github's runners #33506

Open

damccorm opened this issue Jan 6, 2025 · 3 comments

damccorm (Contributor) commented Jan 6, 2025

What happened?

Starting Dec 10, Python wheel builds began failing with a number of segmentation faults, most likely due to something in the underlying hardware or runner image. I tried switching to GitHub-hosted runners, and that appears to work. We can use this as a workaround, but we should understand the problem and switch back to self-hosted runners to avoid being blocked on GitHub quota.

Example failure - https://github.com/apache/beam/actions/runs/12625457564

This also impacts some other workflows, which I will switch over to GitHub-hosted runners as well; we should similarly switch those back once the root cause is fixed.
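
For context, the temporary mitigation amounts to pointing the affected jobs at GitHub-hosted runners. A minimal sketch of that kind of change (the job name and self-hosted labels are illustrative, not taken from Beam's actual workflows):

```yaml
jobs:
  build_wheels:
    # Before: a self-hosted runner pool (labels are illustrative)
    # runs-on: [self-hosted, ubuntu-20.04, main]
    # Temporary mitigation: use a GitHub-hosted runner instead
    runs-on: ubuntu-latest
```

Reverting these PRs later should mostly be a matter of flipping `runs-on` back to the self-hosted labels.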

Workflows where we've seen this issue, along with the PRs used to temporarily mitigate:

We should figure out what is causing the problem and then revert all of these PRs.

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
damccorm (Contributor, Author) commented Jan 6, 2025

@claudevdm did some good investigation here. Claude, would you mind adding any investigation you've done / things you've tried?

claudevdm (Contributor) commented

Sure, here is what I found:

  • The workflow started failing consistently on Dec 10 (screenshot attached).
  • The only difference I could find since the failures started is a runner/host kernel change from Kernel Version: 6.1.100+ to Kernel Version: 6.1.112+.
  • The kernel change lines up with a release on Dec 10 (screenshot attached).
  • The ARM workflows have to do cross-compilation using QEMU.
  • cibuildwheel pulls quay.io/pypa/manylinux2014_aarch64, which uses gcc 10 to compile the Cython code.
  • It seems the new kernel release is incompatible in some way with cross-compiling under gcc 10; the setup involved is sketched below.
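
For reference, a minimal sketch of the cross-compilation setup described above: QEMU emulation is registered on the x86_64 host so cibuildwheel can run the aarch64 manylinux2014 image and compile the Cython extensions with that image's gcc 10 toolchain. Action versions, labels, and the job layout are illustrative assumptions, not Beam's actual workflow:

```yaml
jobs:
  build_wheels_aarch64:
    runs-on: [self-hosted, ubuntu-20.04]  # illustrative self-hosted label
    steps:
      - uses: actions/checkout@v4
      # Register QEMU binfmt handlers so the x86_64 host can run aarch64 containers
      - uses: docker/setup-qemu-action@v3
        with:
          platforms: arm64
      # cibuildwheel pulls quay.io/pypa/manylinux2014_aarch64 and builds the
      # wheels inside it under emulation, using the image's gcc 10 toolchain
      - uses: pypa/cibuildwheel@v2.16.5
        env:
          CIBW_ARCHS_LINUX: aarch64
          CIBW_MANYLINUX_AARCH64_IMAGE: manylinux2014
```

If the kernel change is indeed the trigger, the segfaults would be expected to show up inside the emulated container during compilation, consistent with the failures in the linked run.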

@damccorm
Copy link
Contributor Author

FYI @Amar3tto @mrshakirov @akashorabek - this should fix a few flaky workloads, but it is just a patch. It would be great to dig in on this one and try to figure out what in our self-hosted runner infrastructure is causing the flakes so that we can move these workloads back to our self-hosted runners.
