WG21 P2029: Unclear behavior for octal and hex escape sequences in Unicode character and string literals #30

tahonermann · 2018-06-29T03:58:22Z

The standard is unclear regarding how octal and hex escape sequences are to be handled in UTF-8, char16_t, and char32_t character and string literals.

[lex.ccon]p3 states (http://eel.is/c++draft/lex.literal#lex.ccon-3):

... The value of a UTF-8 character literal is equal to its ISO 10646 code point value, provided that the code point value is representable with a single UTF-8 code unit (that is, provided it is in the C0 Controls and Basic Latin Unicode block). If the value is not representable with a single UTF-8 code unit, the program is ill-formed. ...

This appears to state that u8'\x80' and u8"\x80" are ill-formed since 0x80, when interpreted as an ISO 10646 code point value, is not representable with a single UTF-8 code unit.

Similarly, [lex.ccon]p4 states (http://eel.is/c++draft/lex.literal#lex.ccon-4):

... The value of a char16_t character literal containing a single c-char is equal to its ISO 10646 code point value, provided that the code point value is representable with a single 16-bit code unit (that is, provided it is in the basic multi-lingual plane). If the value is not representable with a single 16-bit code unit, the program is ill-formed. ...

This appears to state that u'\xd800' and u"\xd800" are ill-formed since 0xD800, when interpreted as an ISO 10646 code point value, is not representable with a single 16-bit code unit, at least not in UTF-16. Code points in the range 0xD800-0xDFFF are reserved as surrogate code points and are not representable in a single 16-bit code unit or in any sequence of code units.
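A minimal sketch of the distinction, assuming an implementation that follows existing practice and treats hex escapes in UTF-16 string literals as code unit values. The helper names (utf16_length, is_surrogate, check_utf16_escapes) are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>

// Number of code units in a UTF-16 string literal, excluding the terminator.
template <std::size_t N>
constexpr std::size_t utf16_length(const char16_t (&)[N]) { return N - 1; }

// True for code units in the surrogate range 0xD800-0xDFFF.
constexpr bool is_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDFFF; }

inline void check_utf16_escapes() {
    // Hex escape: a single code unit holding a lone high surrogate.
    assert(utf16_length(u"\xd800") == 1);
    assert(u"\xd800"[0] == 0xD800);
    assert(is_surrogate(u"\xd800"[0]));

    // \U escape naming the code point U+10000: encoded as a surrogate pair.
    assert(utf16_length(u"\U00010000") == 2);
    assert(u"\U00010000"[0] == 0xD800);
    assert(u"\U00010000"[1] == 0xDC00);
}
```

The same 0xD800 value is a legitimate code unit (half of a pair) but not an encodable code point, which is why the reading of the escape matters.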

Similar wording issues are present for char32_t literals in [lex.ccon]p5 (http://eel.is/c++draft/lex.literal#lex.ccon-5).

Existing practice is that octal and hexadecimal escape sequences specify code unit values rather than code point values, and thus should be exempted from wording intended to address encoding of code point values specified by other c-char productions.
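As a rough illustration of that existing practice for UTF-8 literals, the sketch below compares a hex escape with a \u escape for the same numeric value. The helpers utf8_length, utf8_unit, and check_utf8_escapes are hypothetical names, and the code reads code units through unsigned char so that it does not depend on whether u8 literals have type char or char8_t.

```cpp
#include <cassert>
#include <cstddef>

// Number of code units in a string literal, excluding the terminator.
template <typename CharT, std::size_t N>
constexpr std::size_t utf8_length(const CharT (&)[N]) { return N - 1; }

// Value of the i-th code unit as an unsigned integer.
template <typename CharT, std::size_t N>
constexpr unsigned utf8_unit(const CharT (&s)[N], std::size_t i) {
    return static_cast<unsigned char>(s[i]);
}

inline void check_utf8_escapes() {
    // Hex and octal escapes: one code unit with exactly the given value.
    assert(utf8_length(u8"\x80") == 1);
    assert(utf8_unit(u8"\x80", 0) == 0x80);
    assert(utf8_unit(u8"\200", 0) == 0x80);

    // \u escape: the code point U+0080, encoded as two code units (C2 80).
    assert(utf8_length(u8"\u0080") == 2);
    assert(utf8_unit(u8"\u0080", 0) == 0xC2);
    assert(utf8_unit(u8"\u0080", 1) == 0x80);
}
```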

tahonermann · 2018-08-26T02:53:33Z

Additional related wording is in [lex.ccon]p8 (http://eel.is/c++draft/lex.ccon#8):

The escape \ooo consists of the backslash followed by one, two, or three octal digits that are taken to specify the value of the desired character. The escape \xhhh consists of the backslash followed by x followed by one or more hexadecimal digits that are taken to specify the value of the desired character. There is no limit to the number of digits in a hexadecimal sequence. A sequence of octal or hexadecimal digits is terminated by the first character that is not an octal digit or a hexadecimal digit, respectively. The value of a character literal is implementation-defined if it falls outside of the implementation-defined range defined for char (for character literals with no prefix) or wchar_t (for character literals prefixed by L). [ Note: If the value of a character literal prefixed by u, u8, or U is outside the range defined for its type, the program is ill-formed. — end note ]

The note suggests that u8'\x80' is intended to be well-formed, but being a note, it has no normative significance.

tahonermann · 2018-10-11T03:00:58Z

I finally got around to filing a core issue for this as discussed in our July 25th, 2018 meeting (https://github.com/sg16-unicode/sg16-meetings/blob/master/README.md#july-25th-2018) and was informed that CWG 2333 already tracks this:

http://www.open-std.org/jtc1/sc22/wg21/docs/cwg_active.html#2333

Core discussed it during their August 2017 teleconference and determined that hexadecimal and octal escape sequences should not be allowed in u8, u, and U literals. The issue is still pending wording.

I'm fine with this outcome. If anyone would like to argue for a different outcome, we can discuss in SG16 or take it directly to core (or EWG potentially).

ThePhD · 2018-10-11T03:04:54Z

Isn't this the opposite from what we had said we would want? E.g., we wanted to have an explicit backdoor for inserting malformed sequences in unicode literals: hex and octals were the way for that backdoor, right?

ThePhD · 2018-10-11T03:07:57Z

And if this is taken out, do we have any form of inserting code units manually into a unicode string literal?

tahonermann · 2018-10-11T03:15:58Z

I don't think we considered the idea of making hex and octal escapes ill-formed completely. But you're right. The meeting notes don't reflect this, but I think we specifically discussed wanting the ability to create ill-formed UTF-8 sequences for testing purposes.

With this resolution, I think you are right - it would not be possible to create an ill-formed string literal. One would have to construct an array instead.

Let's discuss at our next meeting. Mike Miller offered to reopen the issue if we would like to discuss a different resolution.

strega-nil · 2018-10-11T03:32:26Z

I would rather we forced people to construct it as an array, if you want to do testing. It seems unfortunate, especially if people expect u8"\x80" to give them a string which contains a single code point, and instead we give them a string which contains a single code unit.

steve-downey · 2018-10-11T20:36:24Z

I think it promotes the idea that you can trust char{8,16,32}_t to be well-formed Unicode, and that just is not the case. If you want the code point U+0080, that's spelled \u0080. We already have the problem that u8"ς" may not mean what you think it means. U+03C2 is the intent, but as we've discovered, your editor may be showing you something other than what the compiler sees.
I would want u8"\xCF\x82" to be equal to u8"\u03C2", and to be equivalent to char8_t const a[3] = {0xcf, 0x82, 0};

I also think u8'ς' or u8'\u03C2' is ill-formed, however.
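For reference, the byte pair CF 82 is the UTF-8 encoding of U+03C2 (ς), so the equivalence described in this comment can be checked against that code point. The sketch below is one way to do so; same_code_units, sigma_bytes, and check_sigma_equivalence are hypothetical names, and unsigned char stands in for char8_t so the code does not require C++20.

```cpp
#include <cassert>
#include <cstddef>

// Compares two arrays of 8-bit code units, including the terminator.
template <typename A, typename B, std::size_t N, std::size_t M>
constexpr bool same_code_units(const A (&a)[N], const B (&b)[M]) {
    static_assert(sizeof(A) == 1 && sizeof(B) == 1, "expects 8-bit code units");
    if (N != M) return false;
    for (std::size_t i = 0; i < N; ++i) {
        if (static_cast<unsigned char>(a[i]) != static_cast<unsigned char>(b[i]))
            return false;
    }
    return true;
}

// Assumption: unsigned char is used here in place of char8_t.
constexpr unsigned char sigma_bytes[3] = {0xcf, 0x82, 0};

inline void check_sigma_equivalence() {
    // Hex escapes spelling the code units match the \u escape for U+03C2...
    assert(same_code_units(u8"\xCF\x82", u8"\u03C2"));
    // ...and match the explicitly constructed array.
    assert(same_code_units(u8"\xCF\x82", sigma_bytes));
    // A different code point (U+00E7, encoded as C3 A7) does not match.
    assert(!same_code_units(u8"\xCF\x82", u8"\u00E7"));
}
```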

steve-downey · 2018-10-11T21:15:42Z

There's a second one, CWG 1656, that was resolved the way I think it ought to be:
http://www.open-std.org/jtc1/sc22/wg21/docs/cwg_active.html#1656

For example, assuming the source encoding is Latin-1, is u8"\xff" supposed to specify a three-byte string whose first two bytes are 0xc3 0xbf (the UTF-8 encoding of \u00ff) or a two-byte string whose first byte has the value 0xff? (At least some current implementations assume the latter interpretation.)

Notes from the September, 2013 meeting:

The second interpretation (that the escape sequence specifies the execution-time code unit) is intended.

tahonermann · 2018-10-19T19:55:23Z

SG16 discussed this issue at our meeting on 17-Oct-2018. We identified the following reasons for allowing hex and octal escapes in UTF string and character literals:

- So that ill-formed UTF code unit sequences can be produced for test purposes.
- So that null characters can be embedded: u8"\0".
- So that Modified UTF-8, CESU-8, and WTF-8 string literals can be created. This entails two abilities:
  - Embedding U+0000 as an overlong UTF-8 sequence: u8"\xC0\x80"
  - Embedding lone surrogate code points as individual UTF-8 code unit sequences. For example, encoding U+D800 as u8"\xED\xA0\x80". (Note that use of \u escapes specifying surrogate code points is ill-formed.)
- So that output from existing log/debug systems that output literals with non-printable characters represented with escapes can be copy/pasted into code.
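To make the test-data use case concrete, here is a hedged sketch of a UTF-8 well-formedness check applied to the literals mentioned above. It follows the byte ranges of the Unicode Standard's table of well-formed UTF-8 byte sequences; the function names (is_well_formed_utf8, literal_is_well_formed_utf8, check_test_literals) are hypothetical and not part of any proposal or library.

```cpp
#include <cassert>
#include <cstddef>

// Returns true if the n bytes at p are well-formed UTF-8: no overlong forms,
// no surrogates (U+D800-U+DFFF), nothing above U+10FFFF, no truncation.
inline bool is_well_formed_utf8(const unsigned char* p, std::size_t n) {
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = p[i];
        if (b <= 0x7F) { ++i; continue; }    // single-byte (ASCII) sequence
        std::size_t len = 0;                 // total length of this sequence
        unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the second byte
        if (b >= 0xC2 && b <= 0xDF) len = 2;
        else if (b == 0xE0) { len = 3; lo = 0xA0; }  // excludes overlong forms
        else if (b >= 0xE1 && b <= 0xEC) len = 3;
        else if (b == 0xED) { len = 3; hi = 0x9F; }  // excludes surrogates
        else if (b == 0xEE || b == 0xEF) len = 3;
        else if (b == 0xF0) { len = 4; lo = 0x90; }  // excludes overlong forms
        else if (b >= 0xF1 && b <= 0xF3) len = 4;
        else if (b == 0xF4) { len = 4; hi = 0x8F; }  // excludes > U+10FFFF
        else return false;  // 0x80-0xC1 and 0xF5-0xFF never start a sequence
        if (n - i < len) return false;       // truncated sequence
        if (p[i + 1] < lo || p[i + 1] > hi) return false;
        for (std::size_t k = 2; k < len; ++k) {
            if (p[i + k] < 0x80 || p[i + k] > 0xBF) return false;
        }
        i += len;
    }
    return true;
}

// Convenience wrapper for string literals; the terminator is not examined.
template <typename CharT, std::size_t N>
bool literal_is_well_formed_utf8(const CharT (&s)[N]) {
    static_assert(sizeof(CharT) == 1, "expects 8-bit code units");
    return is_well_formed_utf8(reinterpret_cast<const unsigned char*>(s), N - 1);
}

inline void check_test_literals() {
    assert(!literal_is_well_formed_utf8(u8"\x80"));          // lone continuation
    assert(!literal_is_well_formed_utf8(u8"\xC0\x80"));      // overlong U+0000
    assert(!literal_is_well_formed_utf8(u8"\xED\xA0\x80"));  // surrogate U+D800
    assert(literal_is_well_formed_utf8(u8"a\0b"));           // embedded null
    assert(sizeof(u8"a\0b") == 4);  // 'a', null, 'b', terminator
}
```

Each rejected literal can only be written with hex or octal escapes, which is the capability the reasons above depend on.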

We polled the following:

Continue to allow hex and octal escapes that indicate code unit values, requiring only that they fit into the range of the code unit type.

SF  F  N  A SA
 8  1  0  0  0

I followed up with Mike Miller. CWG will revisit CWG #2333 either in their issue processing meeting on Monday 22-Oct-2018 or in San Diego.

tahonermann · 2019-01-06T04:43:29Z

I drafted a proposed resolution for this issue for the CWG issues telecon scheduled for Monday, January 7th. It can be found at http://wiki.edg.com/pub/Wg21kona2019/CoreWorkingGroup/cwg2333.html.

tahonermann · 2020-02-13T16:04:28Z

I submitted P2029R0 for the pre-Prague mailing. CWG reviewed it at their January 16th, 2020 core issues processing teleconference. A revision addressing their feedback is now available on the Prague CWG wiki page at http://wiki.edg.com/pub/Wg21prague/CoreWorkingGroup/d2029r1.html

tahonermann · 2020-03-01T02:04:12Z

P2029R1 has been submitted for the Prague post-meeting mailing and is planned to be discussed at the next core issues processing telecon.

tahonermann · 2020-11-23T16:19:21Z

P2029R4 was adopted for C++23 at the November, 2020 virtual plenary. Closing.

tahonermann added the clarification and help wanted labels Jun 29, 2018

tahonermann added the paper needed label Aug 6, 2018

tahonermann added the CWG issue label and removed the help wanted and paper needed labels Nov 18, 2018

tahonermann self-assigned this Mar 1, 2020

tahonermann changed the title ~~Unclear behavior for octal and hex escape sequences in Unicode character and string literals~~ WG21 P2029: Unclear behavior for octal and hex escape sequences in Unicode character and string literals May 13, 2020

tahonermann added the WG21-tracked label Nov 23, 2020

tahonermann closed this as completed Nov 23, 2020
