WG21 P2029: Unclear behavior for octal and hex escape sequences in Unicode character and string literals #30
Comments
Additional related wording is in [lex.ccon]p8 (http://eel.is/c++draft/lex.ccon#8):
The note suggests that |
I finally got around to filing a core issue for this as discussed in our July 15th, 2018 meeting (https://github.com/sg16-unicode/sg16-meetings/blob/master/README.md#july-25th-2018) and was informed that CWG 2333 already tracks this: Core discussed it during their August 2017 teleconference and determined that hexadecimal and octal escape sequences should not be allowed in I'm fine with this outcome. If anyone would like to argue for a different outcome, we can discuss in SG16 or take it directly to core (or EWG potentially). |
Isn't this the opposite from what we had said we would want? E.g., we wanted to have an explicit backdoor for inserting malformed sequences in unicode literals: hex and octals were the way for that backdoor, right? |
And if this is taken out, do we have any form of inserting code units manually into a unicode string literal? |
I don't think we considered the idea of making hex and octal escapes ill-formed completely. But you're right. The meeting notes don't reflect this, but I think we specifically discussed wanting the ability to create ill-formed UTF-8 sequences for testing purposes. With this resolution, I think you are right - it would not be possible to create an ill-formed string literal. One would have to construct an array instead. Let's discuss at our next meeting. Mike Miller offered to reopen the issue if we would like to discuss a different resolution. |
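As a rough sketch of the "construct an array instead" fallback mentioned above: ill-formed UTF-8 test input can be written as an explicit array of code units, with no string literal involved. The names ill_formed_utf8, is_continuation, and is_well_formed_two_byte are hypothetical helpers made up for this illustration, and the check covers only the two-byte case.

```cpp
#include <cassert>

// Ill-formed UTF-8 test input, spelled as an array rather than a literal.
// 0xC3 is a two-byte lead byte and must be followed by a continuation
// byte in the range 0x80..0xBF; 0x28 is not one.
constexpr unsigned char ill_formed_utf8[] = {0xC3, 0x28, 0x00};

// Hypothetical helper: true if b has the continuation-byte form 10xxxxxx.
constexpr bool is_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Hypothetical helper: checks only a single two-byte UTF-8 sequence
// (lead byte 0xC2..0xDF followed by one continuation byte).
constexpr bool is_well_formed_two_byte(const unsigned char* p) {
    return p[0] >= 0xC2 && p[0] <= 0xDF && is_continuation(p[1]);
}

static_assert(!is_well_formed_two_byte(ill_formed_utf8),
              "the array is intentionally ill-formed UTF-8");
```

The array form works regardless of how the escape sequence question is resolved, at the cost of being more verbose than a literal such as u8"\xC3\x28".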
I would rather we forced people to construct it as an array, if you want to do testing. It seems unfortunate, especially if people expect |
I think it promotes the idea that you can trust char{8,16,32}_t to be well formed Unicode, and that just is not the case. If you want the code point U+0080, that's spelled \u0080. We already have the problem that u8"ς" may not mean what you think it means. U+00E7 is the intent, but as we've discovered, your editor may be showing you something other than what the compiler sees. I also think u8'ς' or u8'\u00E7' is ill-formed, however. |
There's a second one, CWG 1656, that was resolved the way I think it ought to be:
|
SG16 discussed this issue at our meeting on 17-Oct-2018. We identified the following reasons for allowing hex and octal escapes in UTF string and character literals:
We polled the following: Continue to allow hex and octal escapes that indicate code unit values, requiring only that they fit into the range of the code unit type.
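A minimal sketch of what the polled direction means in practice, assuming a compiler that treats hex escapes as code unit values (which the issue description reports as existing practice): the escape only has to fit in the code unit type and does not have to be a valid code point. The variable names are made up for this illustration.

```cpp
#include <cassert>

// 0xD800 is a surrogate, not an encodable code point, but it fits in a
// 16-bit code unit, so under the polled direction this is accepted.
constexpr char16_t lone_surrogate = u'\xD800';

// 0x110000 is one past the last Unicode code point (U+10FFFF), but it
// fits in a 32-bit code unit.
constexpr char32_t beyond_unicode = U'\x110000';

static_assert(lone_surrogate == 0xD800, "escape names a code unit value");
static_assert(beyond_unicode == 0x110000, "escape names a code unit value");

// By contrast, an escape such as u'\x10000' does not fit in a 16-bit code
// unit, so it falls outside what the poll would allow. It is left as a
// comment here because it is not expected to compile.
```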
I followed up with Mike Miller. CWG will revisit CWG #2333 either in their issue processing meeting on Monday 22-Oct-2018 or in San Diego. |
I drafted a proposed resolution for this issue for the CWG issues telecon scheduled for Monday, January 7th. It can be found at http://wiki.edg.com/pub/Wg21kona2019/CoreWorkingGroup/cwg2333.html. |
I submitted P2029R0 for the pre-Belfast mailing. CWG reviewed at their January 16th, 2020 core issues processing teleconference. A revision addressing their feedback is now available on the Prague CWG wiki page at http://wiki.edg.com/pub/Wg21prague/CoreWorkingGroup/d2029r1.html |
P2029R1 has been submitted for the Prague post-meeting mailing and is planned to be discussed at the next core issues processing telecon. |
P2029R4 was adopted for C++23 at the November, 2020 virtual plenary. Closing. |
The standard is unclear regarding how octal and hex escape sequences are to be handled in UTF-8, char16_t, and char32_t character and string literals.
[lex.ccon]p3 states (http://eel.is/c++draft/lex.literal#lex.ccon-3):
This appears to state that u8'\x80' and u8"\x80" are ill-formed since 0x80, when interpreted as an ISO 10646 code point value, is not representable with a single UTF-8 code unit.
Similarly, [lex.ccon]p4 states (http://eel.is/c++draft/lex.literal#lex.ccon-4):
This appears to state that u'\xd800' and u"\xd800" are ill-formed since 0xD800, when interpreted as an ISO 10646 code point value, is not representable with a single 16-bit code unit; at least not in UTF-16, since code points in the 0xD800-0xDFFF range are not representable in a single 16-bit code unit (or in any sequence of code units; these code points are reserved as surrogate code points).
Similar wording issues are present for char32_t literals in [lex.ccon]p5 (http://eel.is/c++draft/lex.literal#lex.ccon-5).
Existing practice is that octal and hexadecimal escape sequences specify code unit values rather than code point values, and thus should be exempted from wording intended to address encoding of code point values specified by other c-char productions.
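The difference between the two interpretations can be sketched as follows, assuming a compiler that follows the existing practice described above. The helpers code_units and unit_at are hypothetical names introduced for this illustration; they are written as templates so the sketch works whether u8 literals have element type char or char8_t.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// not counting the terminating null.
template <class CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// Hypothetical helper: the i-th code unit of a UTF-8 literal as an
// unsigned value, independent of the signedness of the element type.
template <class CharT, std::size_t N>
constexpr unsigned unit_at(const CharT (&s)[N], std::size_t i) {
    return static_cast<unsigned char>(s[i]);
}

// "\x80" names a code unit value: exactly one code unit, 0x80.
static_assert(code_units(u8"\x80") == 1, "one code unit");
static_assert(unit_at(u8"\x80", 0) == 0x80, "value taken as-is");

// "\u0080" names the code point U+0080, which UTF-8 encodes as the two
// code units 0xC2 0x80.
static_assert(code_units(u8"\u0080") == 2, "two code units");
static_assert(unit_at(u8"\u0080", 0) == 0xC2, "lead byte");
static_assert(unit_at(u8"\u0080", 1) == 0x80, "continuation byte");
```

Read this way, u8"\x80" produces a string that is not well-formed UTF-8 on its own, which is the behavior the wording needs to permit explicitly.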