Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Platform Latest Actions -- Python tests seg-fault on Windows #2040

Open
doug-walker opened this issue Sep 18, 2024 · 12 comments
Open

Platform Latest Actions -- Python tests seg-fault on Windows #2040

doug-walker opened this issue Sep 18, 2024 · 12 comments
Labels
Build Issue Issues related to build or environment problems on any platform. good first issue Standard label for new developers to locate good issues to tackle to learn about OCIO development. help wanted Issues that the TSC has decided are worth implementing, but don't currently have the dev resources.

Comments

@doug-walker
Copy link
Collaborator

This issue is to solve the issue that is causing the GitHub Actions Platform Latest CI runs to fail due to a seg-fault in the Python 3.11 unit tests on Windows.

@doug-walker doug-walker added good first issue Standard label for new developers to locate good issues to tackle to learn about OCIO development. help wanted Issues that the TSC has decided are worth implementing, but don't currently have the dev resources. labels Sep 18, 2024
@doug-walker doug-walker changed the title Platform Latest CI -- Python tests seg-fault on Windows Platform Latest Actions -- Python tests seg-fault on Windows Sep 18, 2024
@doug-walker doug-walker added the Build Issue Issues related to build or environment problems on any platform. label Sep 18, 2024
@remia
Copy link
Collaborator

remia commented Sep 23, 2024

I believe this is the same issue that was affecting the Wheel workflow, the issue disappear when one set OCIO_PYTHON_LOAD_DLLS_FROM_PATH to 0 which seem to indicate one the path added to the Python DLL directories cause the crash. On platform latest, the crash only occurs on Release build it seems.

Through some iterative tests, I have determined that one need to explicitly exclude these two paths from being added to the DLL directories for the workflow succeed:

'C:\\Program Files\\Microsoft Service Fabric\\bin\\Fabric\\Fabric.Code'
'C:\\Program Files\\LLVM\\bin'

Dumpbin reports the following dependencies on PyOpenColorIO:

python310.dll
OpenColorIO_2_4.dll
MSVCP140.dll
VCRUNTIME140.dll
VCRUNTIME140_1.dll
api-ms-win-crt-runtime-l1-1-0.dll
api-ms-win-crt-heap-l1-1-0.dll
api-ms-win-crt-math-l1-1-0.dll
api-ms-win-crt-stdio-l1-1-0.dll
api-ms-win-crt-filesystem-l1-1-0.dll
KERNEL32.dll

When PyOpenColorIO is loaded, the following DLL are loaded (as seen from dlltracer), obviously this is on a build that's working, otherwise the segfault prevent from having any log from dlltracer (even with python -u for un-buffered outputs, which seem to indicate the crash is quite early in the import process):

LoadLibrary \Device\HarddiskVolume2\Windows\System32\kernel.appcore.dll
LoadLibrary \Device\HarddiskVolume2\hostedtoolcache\windows\Python\3.11.9\x64\DLLs\_socket.pyd
LoadLibrary \Device\HarddiskVolume2\hostedtoolcache\windows\Python\3.11.9\x64\DLLs\select.pyd
LoadLibrary \Device\HarddiskVolume3\a\OpenColorIO\OpenColorIO\_build\src\bindings\python\PyOpenColorIO\PyOpenColorIO.pyd
LoadLibrary \Device\HarddiskVolume2\Windows\System32\msvcp140.dll
LoadLibrary \Device\HarddiskVolume2\hostedtoolcache\windows\Python\3.11.9\x64\vcruntime140_1.dll
LoadLibrary \Device\HarddiskVolume3\a\OpenColorIO\OpenColorIO\_build\src\OpenColorIO\Release\OpenColorIO_2_4.dll
LoadLibrary \Device\HarddiskVolume2\Windows\System32\user32.dll
LoadLibrary \Device\HarddiskVolume2\Windows\System32\win32u.dll
LoadLibrary \Device\HarddiskVolume2\Windows\System32\gdi32.dll
LoadLibrary \Device\HarddiskVolume2\Windows\System32\gdi32full.dll
LoadLibrary \Device\HarddiskVolume2\Windows\System32\msvcp_win.dll
LoadLibrary \Device\HarddiskVolume2\Windows\System32\imm32.dll

Looking at those two folders mentioned above, I see two interesting things:

  • msvcp140.dll are present in both folders
  • api-ms-win-crt-* dll are present in the LLVM folder

I wonder if there is some kind of incompatibility going on here.

I have to admit, I find the OCIO_PYTHON_LOAD_DLLS_FROM_PATH where we add a lot of directories quite strange, I don't know how common this is for other ASWF projects or in the general Python bindings landscape. It seems quite fragile.

@doug-walker
Copy link
Collaborator Author

Regarding OCIO_PYTHON_LOAD_DLLS_FROM_PATH, that was added in this PR. Windows users of the bindings were having trouble and so Anders Langlands had proposed that we follow what OpenImageIO had done to work around the issue. So there is at least one other ASWF project doing something similar. But I agree with you Remi that it is fragile and, if it is causing other problems, I would not object to removing it and requiring Windows users to manage their path on their own.

@KelSolaar
Copy link
Contributor

I would not object to removing it and requiring Windows users to manage their path on their own.

I would probably recommend people doing so but maybe document that it needs to be done somewhere.

Thanks for digging this @remia, I was almost there on my end and those 2 paths are in the short list I had bisected.

@remia
Copy link
Collaborator

remia commented Sep 26, 2024

Maybe we could also change the default behaviour to have OCIO_PYTHON_LOAD_DLLS_FROM_PATH disabled, then, if no feedback, remove it in a future version. I also wonder if there is a clever solution where PyOpenColorIO could know some of the DLL paths needed from the compilation (in case someone doesn't have all the dependencies statically linked into OpenColorIO mostly), but this would still be fragile.

The exact cause of the crash is still unclear here, maybe something like https://github.com/mxschmitt/action-tmate could help to dig further.

@KelSolaar
Copy link
Contributor

Those dynamic DLLs might themselves have dependencies that also need to be added. As we build everything from scratch we can track and bring all the DLLs together where required and I would recommend that people do that or statically compile although I always had problems on Windows compiling the VFX Reference Platform statically.

@remia
Copy link
Collaborator

remia commented Sep 26, 2024

Yeah agree it can get rather complex. The dumpbin export I pasted above is supposed to print recursively all dependents DLLs if I understood correctly by the way, but that is only for statically linking dependencies.

@zachlewis
Copy link
Collaborator

Are you guys familiar with delvewheel?
I'm not. But I've been using repairwheel for MacOS and Linux wheels in a custom cibuildwheel step; ostensibly, it would work for Windows too (via delvewheel)...

@KelSolaar
Copy link
Contributor

I think that the issue here might be that some incompatible DLLs having the same name than some OCIO depends/it dependencies depend on are added to PATH and cause it/them to break rather than missing them. I think that the Platform build here works if there is only the OCIO path for the built lib added.

@KelSolaar
Copy link
Contributor

E.g., build with only the OCIO built lib added: https://github.com/colour-science/OpenColorIO/actions/runs/10977085917, this should left us wondering what is being tried to be loaded when the two offending paths are added? Maybe some incompatible Windows API SDK libs?

@remia
Copy link
Collaborator

remia commented Sep 26, 2024

Is there benefits using explicit repairwheel step over the default auditwheel / delocate_wheel steps done by cibuildwheel? As for the Windows delvewheel, it's a good suggestion, but not sure I understand correctly, is it going to pull every dependents DLL into the wheel? What about the size increase?

Edit: Ah no, I think I understand, it only pulls non-system libraries. This would not fix the issue here as some systems library are added to the DLL directories and take precedence over the one present in standard locations. We wouldn't need it for our own wheel either I think as we build everything statically (except for the OpenColorIO.dll which gets added to DLL search path manually in our init.py). But could be useful for people building their own PyOpenColorIO bindings under Windows.

@zachlewis
Copy link
Collaborator

I do know that on MacOS and Linux, auditwheel and delocate also patch the module rpath with the relative paths to the dependencies. Obviously, that doesn't happen with Windows, but... I'm not sure what does happen with Windows, to be honest.

Is there benefits using explicit repairwheel step over the default auditwheel / delocate_wheel steps done by cibuildwheel?

In OIIO's case, somehow, either auditwheel or delocate_wheel actually managed to mangle otherwise perfectly working wheels. At a point, I'd actually provided an empty string for the custom repair-wheel command, just to prevent cibuildwheel's default repair-wheels command from breaking the wheels!

In the end, what I'm really doing is using the repair-wheel-command as an opportunity to slim down + modify the wheels. It's a bit heavy-handed, but the path of least resistance right now to getting rid of everything that can be gotten rid of (like duplicate "versioned" dynamic libraries that are normally symlinked to each other). I'm blasting everything except the module and the CLI binary executables; and then I'm pointing repairwheel back to the build directory to scoop up only and exactly what the cpython modules and binaries need.

@remia
Copy link
Collaborator

remia commented Sep 26, 2024

Yeah I remember I had some issues as well with the various repair steps, can't remember what was the issue now (though I think building the OCIO library itself as shared instead of static was somehow helping)! As for the size, what I remember doing in OCIO is to strip the binaries / libraries produced and disable SO versioning in CMake to get rid of the symlinks.

To come back to the main discussion, I guess there is no strong objection to disabling / removing OCIO_PYTHON_LOAD_DLLS_FROM_PATH then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Build Issue Issues related to build or environment problems on any platform. good first issue Standard label for new developers to locate good issues to tackle to learn about OCIO development. help wanted Issues that the TSC has decided are worth implementing, but don't currently have the dev resources.
Projects
None yet
Development

No branches or pull requests

4 participants