WG21 P2029: Unclear behavior for octal and hex escape sequences in Unicode character and string literals #30

Closed
tahonermann opened this issue Jun 29, 2018 · 13 comments
Assignees
Labels
clarification (Something isn't clear) · CWG issue (Issue tracked by CWG) · WG21-tracked (This issue is tracked as a WG21 github issue)

Comments

@tahonermann
Member

The standard is unclear regarding how octal and hex escape sequences are to be handled in UTF-8, char16_t, and char32_t character and string literals.

[lex.ccon]p3 states (http://eel.is/c++draft/lex.literal#lex.ccon-3):

... The value of a UTF-8 character literal is equal to its ISO 10646 code point value, provided that the code point value is representable with a single UTF-8 code unit (that is, provided it is in the C0 Controls and Basic Latin Unicode block). If the value is not representable with a single UTF-8 code unit, the program is ill-formed. ...

This appears to state that u8'\x80' and u8"\x80" are ill-formed since 0x80, when interpreted as an ISO 10646 code point value, is not representable with a single UTF-8 code unit.

Similarly, [lex.ccon]p4 states (http://eel.is/c++draft/lex.literal#lex.ccon-4):

... The value of a char16_t character literal containing a single c-char is equal to its ISO 10646 code point value, provided that the code point value is representable with a single 16-bit code unit (that is, provided it is in the basic multi-lingual plane). If the value is not representable with a single 16-bit code unit, the program is ill-formed. ...

This appears to state that u'\xd800' and u"\xd800" are ill-formed since 0xD800, when interpreted as an ISO 10646 code point value, is not representable with a single 16-bit code unit, at least not in UTF-16: code points in the range 0xD800-0xDFFF are reserved as surrogate code points and cannot be represented by a single 16-bit code unit (or by any sequence of code units).

Similar wording issues are present for char32_t literals in [lex.ccon]p5 (http://eel.is/c++draft/lex.literal#lex.ccon-5).

Existing practice is that octal and hexadecimal escape sequences specify code unit values rather than code point values, and thus should be exempted from wording intended to address encoding of code point values specified by other c-char productions.
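To make the existing practice concrete, here is a minimal sketch of what current compilers are expected to accept. It assumes an implementation that treats hex escapes as code unit values; the helper names (first_utf8_unit, first_utf16_unit, check_code_unit_literals) are made up for illustration, and the casts are there only so the snippet compiles whether u8"" yields char (C++17) or char8_t (C++20).

```cpp
#include <cassert>
#include <cstddef>

// Assumption: the compiler follows existing practice and treats \x escapes
// in UTF literals as code unit values, not as code points to be encoded.

// u8"\x80": one code unit with value 0x80, plus the terminating null.
static_assert(sizeof(u8"\x80") == 2, "one code unit plus terminator");
static_assert(static_cast<unsigned char>(u8"\x80"[0]) == 0x80, "raw code unit");

// u"\xd800": one 16-bit code unit holding a lone surrogate value.
static_assert(sizeof(u"\xd800") == 2 * sizeof(char16_t), "one code unit plus terminator");
static_assert(u"\xd800"[0] == 0xD800, "raw code unit");

// Hypothetical helpers exposing the same values at run time.
inline unsigned first_utf8_unit()  { return static_cast<unsigned char>(u8"\x80"[0]); }
inline unsigned first_utf16_unit() { return u"\xd800"[0]; }

// Run-time counterpart of the static_asserts above.
inline void check_code_unit_literals() {
    assert(first_utf8_unit() == 0x80u);
    assert(first_utf16_unit() == 0xD800u);
}
```

If the literals were instead interpreted as code points, u8"\x80" would have to be rejected or encoded as two code units, and the size checks above would not hold.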

@tahonermann tahonermann added clarification Something isn't clear help wanted Extra attention is needed labels Jun 29, 2018
@tahonermann tahonermann added the paper needed A paper proposing a specific solution is needed label Aug 6, 2018
@tahonermann
Member Author

Additional related wording is in [lex.ccon]p8 (http://eel.is/c++draft/lex.ccon#8):

The escape \ooo consists of the backslash followed by one, two, or three octal digits that are taken to specify the value of the desired character. The escape \xhhh consists of the backslash followed by x followed by one or more hexadecimal digits that are taken to specify the value of the desired character. There is no limit to the number of digits in a hexadecimal sequence. A sequence of octal or hexadecimal digits is terminated by the first character that is not an octal digit or a hexadecimal digit, respectively. The value of a character literal is implementation-defined if it falls outside of the implementation-defined range defined for char (for character literals with no prefix) or wchar_t (for character literals prefixed by L). [ Note: If the value of a character literal prefixed by u, u8, or U is outside the range defined for its type, the program is ill-formed. — end note ]

The note suggests that u8'\x80' is intended to be well-formed, but being a note, it has no normative significance.

@tahonermann
Member Author

I finally got around to filing a core issue for this as discussed in our July 25th, 2018 meeting (https://github.com/sg16-unicode/sg16-meetings/blob/master/README.md#july-25th-2018) and was informed that CWG 2333 already tracks this:

Core discussed it during their August 2017 teleconference and determined that hexadecimal and octal escape sequences should not be allowed in u8, u, and U literals. The issue is still pending wording.

I'm fine with this outcome. If anyone would like to argue for a different outcome, we can discuss in SG16 or take it directly to core (or EWG potentially).

@ThePhD
Collaborator

ThePhD commented Oct 11, 2018

Isn't this the opposite from what we had said we would want? E.g., we wanted to have an explicit backdoor for inserting malformed sequences in unicode literals: hex and octals were the way for that backdoor, right?

@ThePhD
Collaborator

ThePhD commented Oct 11, 2018

And if this is taken out, do we have any form of inserting code units manually into a unicode string literal?

@tahonermann
Member Author

I don't think we considered the idea of making hex and octal escapes ill-formed altogether. But you're right. The meeting notes don't reflect this, but I think we specifically discussed wanting the ability to create ill-formed UTF-8 sequences for testing purposes.

With this resolution, I think you are right - it would not be possible to create an ill-formed string literal. One would have to construct an array instead.
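As a rough sketch of the array workaround, assuming unsigned char as a stand-in for the UTF-8 code unit type so that it compiles both before and after char8_t; the names lone_continuation, is_continuation_byte, and check_ill_formed_array are hypothetical:

```cpp
#include <cassert>
#include <cstddef>

// Assumption: unsigned char stands in for the UTF-8 code unit type.
// 0x80 is a continuation byte with no preceding lead byte, so this
// null-terminated sequence is ill-formed UTF-8.
constexpr unsigned char lone_continuation[] = {0x80, 0x00};

// Hypothetical helper: true for bytes of the form 10xxxxxx.
constexpr bool is_continuation_byte(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

static_assert(sizeof(lone_continuation) == 2, "one code unit plus terminator");
static_assert(is_continuation_byte(lone_continuation[0]), "starts with a continuation byte");

// Run-time counterpart of the static_asserts above.
inline void check_ill_formed_array() {
    assert(is_continuation_byte(lone_continuation[0]));
    assert(lone_continuation[1] == 0);
}
```

This works, but the test data no longer reads as a string, which is part of why the escape-based form is attractive.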

Let's discuss at our next meeting. Mike Miller offered to reopen the issue if we would like to discuss a different resolution.

@strega-nil

I would rather we forced people to construct it as an array, if you want to do testing. It seems unfortunate, especially if people expect u8"\x80" to give them a string which contains a single code point, and instead we give them a string which contains a single code unit.

@steve-downey
Collaborator

steve-downey commented Oct 11, 2018

I think it promotes the idea that you can trust char{8,16,32}_t to be well-formed Unicode, and that just is not the case. If you want the code point U+0080, that's spelled \u0080. We already have the problem that u8"ς" may not mean what you think it means. U+03C2 is the intent, but as we've discovered, your editor may be showing you something other than what the compiler sees.
I would want u8"\xCF\x82" to be equal to u8"\u03C2", and to be equivalent to char8_t const a[3] = {0xcf, 0x82, 0};

I also think u8'ς' or u8'\u03C2' is ill-formed, however.
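The equivalence described in this comment can be checked with a small sketch. It relies on the bytes CF 82 being the UTF-8 encoding of U+03C2 (ς), and it assumes C++14 or later for the constexpr loop. same_code_units, expected_units, and check_sigma_equivalence are illustrative names, and unsigned char is used for the array so the code does not depend on char8_t being available.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: compares two arrays of equal length code unit by
// code unit, ignoring whether the element type is char, char8_t, or
// unsigned char.
template <typename A, typename B, std::size_t N>
constexpr bool same_code_units(const A (&a)[N], const B (&b)[N]) {
    for (std::size_t i = 0; i < N; ++i) {
        if (static_cast<unsigned char>(a[i]) != static_cast<unsigned char>(b[i])) {
            return false;
        }
    }
    return true;
}

// The explicit array form, with unsigned char standing in for char8_t.
constexpr unsigned char expected_units[3] = {0xCF, 0x82, 0x00};

// Hex escapes give the code units directly; \u03C2 asks the compiler to
// encode the code point. Both should produce CF 82 00.
static_assert(same_code_units(u8"\xCF\x82", u8"\u03C2"), "escape forms agree");
static_assert(same_code_units(u8"\xCF\x82", expected_units), "literal matches array");

// Run-time counterpart of the static_asserts above.
inline void check_sigma_equivalence() {
    assert(same_code_units(u8"\u03C2", expected_units));
}
```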

@steve-downey
Collaborator

There's a second one, CWG 1656, that was resolved the way I think it ought to be:
http://www.open-std.org/jtc1/sc22/wg21/docs/cwg_active.html#1656

For example, assuming the source encoding is Latin-1, is u8"\xff" supposed to specify a three-byte string whose first two bytes are 0xc3 0xbf (the UTF-8 encoding of \u00ff) or a two-byte string whose first byte has the value 0xff? (At least some current implementations assume the latter interpretation.)

Notes from the September, 2013 meeting:

The second interpretation (that the escape sequence specifies the execution-time code unit) is intended.
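The difference between the two interpretations in CWG 1656 can be sketched as follows, assuming a compiler that implements the intended (second) interpretation. The helper names unit and check_cwg1656 are hypothetical, and the cast makes the code independent of whether u8"" uses char or char8_t.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: widen a code unit of any character type to unsigned.
template <typename C>
constexpr unsigned unit(C c) {
    return static_cast<unsigned char>(c);
}

// \xff names a code unit: a two-byte string, FF 00.
static_assert(sizeof(u8"\xff") == 2, "one code unit plus terminator");
static_assert(unit(u8"\xff"[0]) == 0xFF, "raw code unit");

// \u00ff names a code point: the compiler encodes it as C3 BF, giving a
// three-byte string, C3 BF 00.
static_assert(sizeof(u8"\u00ff") == 3, "two code units plus terminator");
static_assert(unit(u8"\u00ff"[0]) == 0xC3 && unit(u8"\u00ff"[1]) == 0xBF, "encoded code point");

// Run-time counterpart of the static_asserts above.
inline void check_cwg1656() {
    assert(unit(u8"\xff"[0]) == 0xFFu);
    assert(unit(u8"\u00ff"[0]) == 0xC3u);
    assert(unit(u8"\u00ff"[1]) == 0xBFu);
}
```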

@tahonermann
Member Author

SG16 discussed this issue at our meeting on 17-Oct-2018. We identified the following reasons for allowing hex and octal escapes in UTF string and character literals:

  • so that ill-formed UTF code unit sequences can be produced for test purposes.
  • so that null characters can be embedded: u8"\0".
  • so that Modified UTF-8, CESU-8, and WTF-8 string literals can be created. This entails two abilities:
    • Embedding U+0000 as an overlong UTF-8 sequence: u8"\xC0\x80"
    • Embedding lone surrogate code points as individual UTF-8 code unit sequences. For example, encoding U+D800 as u8"\xED\xA0\x80". (Note that use of \u escapes specifying surrogate code points is ill-formed).
  • so that output from existing log/debug systems that output literals with non-printable characters represented with escapes can be copy/pasted into code.
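A few of these use cases can be sketched directly, assuming a compiler that accepts hex and octal escapes as code unit values. The decoders below (decode2, decode3) are deliberately naive, hypothetical helpers that skip all validation; they exist only to show which code point the raw bytes would map to under Modified UTF-8, CESU-8, or WTF-8 style decoding.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, non-validating decoder for a two-byte sequence 110xxxxx 10xxxxxx.
template <typename C>
constexpr unsigned decode2(const C* s) {
    return ((static_cast<unsigned char>(s[0]) & 0x1Fu) << 6) |
           (static_cast<unsigned char>(s[1]) & 0x3Fu);
}

// Hypothetical, non-validating decoder for a three-byte sequence
// 1110xxxx 10xxxxxx 10xxxxxx.
template <typename C>
constexpr unsigned decode3(const C* s) {
    return ((static_cast<unsigned char>(s[0]) & 0x0Fu) << 12) |
           ((static_cast<unsigned char>(s[1]) & 0x3Fu) << 6) |
           (static_cast<unsigned char>(s[2]) & 0x3Fu);
}

// Embedded null: 'a', 0, 'b', plus the terminating null.
static_assert(sizeof(u8"a\0b") == 4, "embedded null is kept");

// Overlong encoding of U+0000, as used by Modified UTF-8.
static_assert(sizeof(u8"\xC0\x80") == 3, "two code units plus terminator");
static_assert(decode2(u8"\xC0\x80") == 0x0000, "overlong form of U+0000");

// Lone surrogate U+D800 written out as three code units.
static_assert(sizeof(u8"\xED\xA0\x80") == 4, "three code units plus terminator");
static_assert(decode3(u8"\xED\xA0\x80") == 0xD800, "decodes to a surrogate value");

// Run-time counterpart of the static_asserts above.
inline void check_escape_use_cases() {
    assert(decode2(u8"\xC0\x80") == 0x0000u);
    assert(decode3(u8"\xED\xA0\x80") == 0xD800u);
}
```

A conforming UTF-8 decoder would reject both sequences, which is exactly why they are useful as test inputs.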

We polled the following:

Continue to allow hex and octal escapes that indicate code unit values, requiring only that they fit into the range of the code unit type.

SF  F  N  A SA
 8  1  0  0  0

I followed up with Mike Miller. CWG will revisit CWG #2333 either in their issue processing meeting on Monday 22-Oct-2018 or in San Diego.

@tahonermann tahonermann added CWG issue Issue tracked by CWG and removed help wanted Extra attention is needed paper needed A paper proposing a specific solution is needed labels Nov 18, 2018
@tahonermann
Member Author

I drafted a proposed resolution for this issue for the CWG issues telecon scheduled for Monday, January 7th. It can be found at http://wiki.edg.com/pub/Wg21kona2019/CoreWorkingGroup/cwg2333.html.

@tahonermann
Member Author

I submitted P2029R0 for the pre-Belfast mailing. CWG reviewed it at their January 16th, 2020 core issues processing teleconference. A revision addressing their feedback is now available on the Prague CWG wiki page at http://wiki.edg.com/pub/Wg21prague/CoreWorkingGroup/d2029r1.html

@tahonermann
Member Author

P2029R1 has been submitted for the Prague post-meeting mailing and is planned to be discussed at the next core issues processing telecon.

@tahonermann tahonermann self-assigned this Mar 1, 2020
@tahonermann tahonermann changed the title Unclear behavior for octal and hex escape sequences in Unicode character and string literals WG21 P2029: Unclear behavior for octal and hex escape sequences in Unicode character and string literals May 13, 2020
@tahonermann tahonermann added the WG21-tracked This issue is tracked as a WG21 github issue label Nov 23, 2020
@tahonermann
Member Author

P2029R4 was adopted for C++23 at the November, 2020 virtual plenary. Closing.
