Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get visibilities from shared objects to use them in filtering #89

Merged
merged 6 commits into from
Jun 5, 2024

Conversation

nicholasjng
Copy link
Contributor

This is the final step in selective auditing: Obtaining the symbol visibility info, and then deciding if this symbol is an ABI3 violation with the found visibility.

The rationale is that the non-ABI3-tagged symbols which we know can be legal are all static inline, which means they should show up as local in a symbol table.


Currently only Linux - I took the map values from this SO thread quoting man elf.

Mach-O is harder. It gives me a bunch of binary values as ints, and I'm not sure how to interpret them. The returned type for symbol is NList64, which is a Python version of this struct: https://llvm.org/doxygen/structllvm_1_1MachO_1_1nlist__64.html

According to https://en.wikipedia.org/wiki/Mach-O#LINKEDIT_Symbol_table, the visibility should be found in the symbol type value (n_type) - I tested on the dockerfile wheels mentioned in the abi3audit blog post, and got an example value of 14 (or 0b1110) for n_type, not sure what this means.

This is very much work in progress.

@nicholasjng
Copy link
Contributor Author

nicholasjng commented Apr 10, 2024

@woodruffw could you take a look? I still have to do the Windows stuff, but I'd welcome a first opinion, especially on all of the Mach-O spaghetti.

EDIT:
I get a lot of hits on this branch, because the Symbol class hashes differently now with the added visibility slot. As it stands, I don't think that the visibility should be part of the hash, or do we change the FUNCTIONS map to include the visibility in the symbol key? It's not strictly part of the SABI, but more of a linker/compiler detail. -> This is part of abi3info, maybe that needs to be updated/reverted.

@nicholasjng nicholasjng marked this pull request as ready for review April 12, 2024 13:35
@nicholasjng nicholasjng requested a review from woodruffw as a code owner April 12, 2024 13:35
@nicholasjng
Copy link
Contributor Author

I think this is ready now.

@woodruffw
Copy link
Member

woodruffw commented Apr 16, 2024 via email

@nicholasjng
Copy link
Contributor Author

nicholasjng commented Apr 16, 2024

Notes and observations:

  1. I changed the audit condition for CPython symbol detection to name.startswith(("Py", "_Py")), since that takes in symbols that do not immediately start with an underscore. Reading the Python coding standards guide, there is apparently no naming convention that distinguishes methods with no underscore after Py from those with underscore.
  2. I tested with the [email protected] CPython 3.8 wheels mentioned in the blog post, here are the results:
Wheel main filter-symbols
macosx_13_0_arm64.whl 0 version mismatches, 3 violations (_Py_INCREF, _Py_DECREF, _Py_XDECREF) 0 mismatches, 0 violations
manylinux_2_5_x86_64.manylinux1_x86_64.whl 0 mismatches, 0 violations 0 mismatches, 7 violations ("PyDockerfile_GoIOError", "PyDockerfile_NewCommand", "PyInit_dockerfile", "PyDockerfile_Py_RETURN_NONE", "PyDockerfile_GoParseError", "PyDockerfile_Command", "PyDockerfile_PyArg_ParseTuple_U"
win_amd64.whl 0 mismatches, 0 violations 0 mismatches, 0 violations

Meaning that dockerfile regressed on Linux, but improved on Mac. I'm not sure why the MacOS private symbols are ignored now, could be because of the N_EXT bit check? EDIT: These guys are local text-section symbols (shown as type "t" in nm output), so we're ignoring them by the new Mach-O filtering condition using N_EXT, and we would also not flag them with our new visibility condition.

Now, flagging the module entry point on Linux is wrong, but it seems that is marked as either weak or global in the extension (I don't have the nm output at hand).

FWIW, google_benchmark and nanobind_example (for which I wrote the C/C++ build system configs) are green on either branch.

@nicholasjng
Copy link
Contributor Author

Anything I can do to support you @woodruffw? I've been thinking about a CI setup that regression-tests some "golden" wheels against the current PR, although that presents the problem of hand-curating "good" wheels.

@woodruffw
Copy link
Member

Anything I can do to support you @woodruffw? I've been thinking about a CI setup that regression-tests some "golden" wheels against the current PR, although that presents the problem of hand-curating "good" wheels.

That would be helpful, thank you! I think the cryptography wheels would be good "golden" examples.

(I'm sorry for the delay here -- I'll try and find some time to review tomorrow.)

@nicholasjng nicholasjng force-pushed the filter-symbols-by-visibility branch from a03ea38 to 28831db Compare May 9, 2024 11:49
@nicholasjng
Copy link
Contributor Author

I added a complete test audit of the latest cryptography in 28831db.

@nicholasjng
Copy link
Contributor Author

What do you say @woodruffw? Should we just flat-out ignore all symbols starting with PyInit_?

@woodruffw
Copy link
Member

What do you say @woodruffw? Should we just flat-out ignore all symbols starting with PyInit_?

Yeah, that sounds right to me -- I think in practice a single foo.so should only have PyInit_foo, but there are probably exceptions to this so we can blanket ignore PyInit_* 🙂

@nicholasjng
Copy link
Contributor Author

Nice! Now cryptography audits are green on this branch, too.

Copy link
Member

@woodruffw woodruffw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nicholasjng, LGTM! Needs update/rebase but otherwise should be good to go.

(Sorry for the delay here -- I've been on PTO.)

This is the final step in selective auditing: Obtaining the symbol
visibility info, and then deciding if this symbol is an ABI3 violation
with the found visibility.

The rationale is that the non-ABI3-tagged symbols which we know can be
legal are all static inline, which means they should show up as local
in a symbol table.
To the best of my knowledge, taken from derivative works on the Mach-O
documentation found in Google Search.

One nice side effect of these bit hacks is that we can address the TODO
in the loop, because the external marker check is nothing more than
seeing if the first bit in N_TYPE is set.
Also simplify the startswith() check for Python symbols.
Also, Windows symbol visibilities are all assumed global, because of
the way Windows library exports work.
Downloads wheels from PyPI, caching them for local development, and
audits them across platforms. For one version only, currently v42.0.7.
These are global, but do not have predetermined names - a shared object's
entry point symbol is always named PyInit_$EXTNAME, with $EXTNAME being
the C++ extension name.

As such, we exclude them in a separate if-branch in the iteration over
all symbols. Since there should only be one entry point per extension,
we expect to find at maximum one match.
@nicholasjng nicholasjng force-pushed the filter-symbols-by-visibility branch from 2d8863d to 102cdba Compare June 5, 2024 17:13
@nicholasjng
Copy link
Contributor Author

Thanks @nicholasjng, LGTM! Needs update/rebase but otherwise should be good to go.

(Sorry for the delay here -- I've been on PTO.)

Rebased. My pleasure, hope you had a great time! Happy to see this go through, the CPython upstream bit of work for this is now also on my plate, I hope to get to it soon.

@woodruffw woodruffw merged commit e103150 into pypa:main Jun 5, 2024
11 checks passed
@woodruffw
Copy link
Member

Thanks again @nicholasjng! If you'd like a release for this, I can do one today or tomorrow. Just let me know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants