-
Notifications
You must be signed in to change notification settings - Fork 30.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
src: allow simdutf::convert_* functions to return zero #47471
src: allow simdutf::convert_* functions to return zero #47471
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we write a test that fails with the current main branch but passes with this change?
I guess the answer depends on what is found once someone is able to reproduce #47457 and find the source of the invalid UTF-8. |
Such a test would trigger a It seems that the current tests, those that stress the functions in It is possible that in issue #47457, we want the user to come to node and complain, rather than hide the issue as the current PR would do. |
Co-authored-by: Yagiz Nizipli <[email protected]>
Co-authored-by: Yagiz Nizipli <[email protected]>
Landed in 63ee335 |
PR-URL: #47471 Reviewed-By: Rich Trott <[email protected]> Reviewed-By: Yagiz Nizipli <[email protected]> Reviewed-By: Luigi Pinca <[email protected]>
PR-URL: #47471 Reviewed-By: Rich Trott <[email protected]> Reviewed-By: Yagiz Nizipli <[email protected]> Reviewed-By: Luigi Pinca <[email protected]>
PR-URL: nodejs#47471 Reviewed-By: Rich Trott <[email protected]> Reviewed-By: Yagiz Nizipli <[email protected]> Reviewed-By: Luigi Pinca <[email protected]>
When transcoding with the
simdutf
library, you first scan the input to determine the size of the output (e.g., you scan the UTF-8 input to determine the size of the UTF-16 output). In a second step, you call a transcoding function. This transcoding function normally returns how many words were written. This number of words should match the size of the output computed during the first scan.So you get three-line routines like as follow (scan, allocate, transcode):
The scan to determine the size of the output does not validate the Unicode input: the validation occurs during the transcoding. For performance purposes, it will only seek to tell you how much memory you need to allocate, counting on the transcoding step to do the validation.
When the transcoding fails, the
simdutf::convert_utf8_to_utf16
andsimdutf::convert_utf16_to_utf8
functions return zero by convention, indicating an error. So you either have a successful transcoding (from valid Unicode to valid Unicode) in which case the transcoding function returns the number of written words, which matches exactly the expected number of output words, or you get zero, indicating that the input is invalid Unicode.Currently, the
simdutf
library is used withinsrc/inspector/node_string.cc
with checks such asCHECK_EQ(expected_utf16_length, utf16_length);
. In effect, these checks are true if and only if the inputs are valid Unicode. That should almost always be the case within Node. However, @danpeixoto reports that the check fail in their case, see #47457I cannot reproduce @danpeixoto's issue. See my comments on the issue. Nevertheless, it seems warranted to make the code more robust in case we do have bad Unicode inputs.
This is what this PR does: it checks whether the transcoding functions return 0, and if it does, then it assumes that the input was invalid.
By convention, the routines return the empty string or a null, when the input was invalid. This could be changed to some other convention.