[lex.charset] Change "short name" to "short identifier" to match ISO 10646 #2201
source/lex.tex
Outdated
\tcode{NNNNNNNN}; the character designated by the \grammarterm{universal-character-name}
\tcode{\textbackslash uNNNN} is that character whose character short name in
ISO/IEC 10646 is \tcode{0000NNNN}. If the hexadecimal value for a
U00NNNNNN} is that character whose character short identifer in ISO/IEC 10646 is
typo: "identifer"
\tcode{\textbackslash uNNNN} is that character whose character short name in
ISO/IEC 10646 is \tcode{0000NNNN}. If the hexadecimal value for a
U00NNNNNN} is that character whose character short identifer in ISO/IEC 10646 is
\tcode{NNNNNN}; the character designated by the \grammarterm{universal-character-name}
What if I write \U00000041 in my source code? Does the character short identifier "000041" exist in Unicode? If it does exist, what about \u0041; is the character short identifier here "0041"? Why are there two identifiers naming the same thing? What about lowercase vs. uppercase hex digits? Should we refer to the hexadecimal value somehow?
Yes, the short identifier concept gives more than one identifier for each character. For the character with scalar value 0x4A, all of the following are valid short identifiers: 00004A, 004A, +00004A, +004A, U00004A, U004A, U+00004A, U+004A, 00004a, 004a, +00004a, +004a, U00004a, U004a, U+00004a, U+004a, u00004A, u004A, u+00004A, u+004A, u00004a, u004a, u+00004a, u+004a. Any of those unambiguously identifies the same character. If there were more A-F digits, there would be even more possible identifiers (there are 384 possible short identifiers for 0xAAAAA). The syntax is given with the description I quoted above, and also with the following BNF:
{ U | u } {+}(xxxx | xxxxx | xxxxxx)
where “x” represents one hexadecimal digit (0 to 9, A to F, or a to f), and with the additional requirement that the 5-digit form is not allowed to have leading zeros (so 0041 and 000041 are both valid, but 00041 isn't). I don't know why the choice was made to have this much flexibility.
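As a rough illustration of the syntax described above, the following C++ sketch checks a string against that BNF plus the no-leading-zero rule for the 5-digit form. The function name `is_short_identifier` is hypothetical, and the rules it encodes are only those stated in this comment, not an independently verified reading of ISO/IEC 10646.

```cpp
#include <cctype>
#include <string_view>

// Hypothetical helper (not part of ISO/IEC 10646 or the C++ standard).
// Checks whether `s` matches  { U | u } {+} (xxxx | xxxxx | xxxxxx)
// with the extra rule that the 5-digit form has no leading zero.
bool is_short_identifier(std::string_view s) {
    // Optional 'U' or 'u' prefix.
    if (!s.empty() && (s.front() == 'U' || s.front() == 'u')) s.remove_prefix(1);
    // Optional '+' sign.
    if (!s.empty() && s.front() == '+') s.remove_prefix(1);
    // Four to six hexadecimal digits must remain.
    if (s.size() < 4 || s.size() > 6) return false;
    for (char c : s) {
        if (!std::isxdigit(static_cast<unsigned char>(c))) return false;
    }
    // The 5-digit form must not start with a zero.
    return !(s.size() == 5 && s.front() == '0');
}
```

Under these assumptions, `is_short_identifier("U+004A")` and `is_short_identifier("00004a")` hold, while `is_short_identifier("0004A")` does not.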
Referring to the hexadecimal value may actually be a better choice; that wording is actually used in the very next sentence to forbid surrogates. If we want to do it this way, I would rewrite in the following manner.
The character designated by the universal-character-name \UNNNNNNNN is that character whose code point in ISO/IEC 10646 has the hexadecimal value NNNNNNNN; the character designated by the universal-character-name \uNNNN is that character whose code point in ISO/IEC 10646 has the hexadecimal value 0000NNNN.
(That last bit can also be "has the hexadecimal value NNNN", without the leading zeros.)
Also, for clarity, ISO 10646 defines "code point" as "value in the UCS codespace" (UCS being short for the character set specified by ISO 10646).
I originally just did s/short name/short identifier/ because that produced minimal changes, but I can rephrase it in terms of hexadecimal value as above if that's preferred.
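To make the hexadecimal-value reading concrete, here is a small compile-time check, offered as a sketch only. It assumes a C++11 or later compiler; the variable names `long_form` and `short_form` are made up for illustration.

```cpp
// Illustration only; the variable names are hypothetical.
// Both spellings carry the hexadecimal value 41, so under the
// hexadecimal-value wording they designate the same character, U+0041.
constexpr char32_t long_form  = U'\U00000041';
constexpr char32_t short_form = U'\u0041';
static_assert(long_form == short_form, "same code point");
static_assert(long_form == U'A', "U+0041 is LATIN CAPITAL LETTER A");
// The case of the hexadecimal digits does not change the value.
static_assert(U'\U0000004a' == U'\u004A', "same code point");
```

With this reading there is no need to decide which of the many short identifiers is meant; only the numeric value matters.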
Thanks for the explanation. I think the current set of changes is good, although I have a few questions that should be answered by a core issue. For example, what happens if I say \U99004141? Is that ill-formed or undefined behavior or something else? Also, I would be very much in favor of harmonizing towards U+1234 references when talking about Unicode characters.
Oh, could you please squash all commits and force-push? Thanks. And the commit message should have "[lex.charset]" in front.
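For reference on the \U99004141 question, the sketch below only shows that the value lies outside the Unicode codespace; it does not answer whether the standard makes such a universal-character-name ill-formed. The helper name `is_scalar_value` is hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper: true if `v` is a Unicode scalar value, i.e. a code
// point in 0..0x10FFFF that is not a surrogate (0xD800..0xDFFF).
constexpr bool is_scalar_value(std::uint32_t v) {
    return v <= 0x10FFFF && !(v >= 0xD800 && v <= 0xDFFF);
}

static_assert(is_scalar_value(0x41), "U+0041 is a scalar value");
static_assert(!is_scalar_value(0xD800), "surrogates are excluded");
static_assert(!is_scalar_value(0x99004141), "beyond the codespace");
```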
U00NNNNNN} is that character whose character short identifier in ISO/IEC 10646 is
\tcode{NNNNNN}; the character designated by the \grammarterm{universal-character-name}
\tcode{\textbackslash uNNNN} is that character whose character short identifier in
ISO/IEC 10646 is \tcode{NNNN}. If the hexadecimal value for a
Sorry for the driveby, but why do we say "hexadecimal value"? Why not just "value"? In which way does the value depend on a particular serialization format?
Can we keep that separate, please? This is enough of a tar pit already, and might benefit from a more wholesale rework.
I'll squash and fix the commit message.
@rmartinho, I think with the editorial change we're currently looking at, we've got a good improvement: from "undefined term" to "well-defined term".
…10646 ISO 10646 doesn't have "short name".
Squashed and fixed the commit message. I'll give that paper a thought, then.
I would like to change from NNNNNN to U+NNNNNN (for this particular wording) in this change; we're already using U+NNNN in other places, and it seems to be the more common form for unambiguously writing Unicode character short identifiers (though I don't know if ISO/IEC 10646 specifies a preference between the valid forms).
We should also agree on what typeface to use for Unicode short identifiers. In [time.duration.io]p4, we use body text font, complete with its not-especially-aesthetically-appealing plus sign with slightly unsatisfying kerning. In Table 2, we use teletype font (and no U+ prefix). http://www.unicode.org/versions/Unicode11.0.0/appA.pdf says that dropping the U+ prefix is appropriate in tables and in ranges, so what we're doing in Table 2 seems fine. It uses body text font, but has a more appealing plus sign than appears in our body font.
\tcode{NNNNNNNN}; the character designated by the \grammarterm{universal-character-name}
\tcode{\textbackslash uNNNN} is that character whose character short name in
ISO/IEC 10646 is \tcode{0000NNNN}. If the hexadecimal value for a
U00NNNNNN} is that character whose character short identifier in ISO/IEC 10646 is
This rewording has lost the specification for a universal-character-name beginning with \U01 (etc). I think we need a normative change to properly address this -- it doesn't seem right to just remove the specification for these cases, but the old specification is clearly wrong, as there is no character with the specified short identifier.
Question for CWG: what is the status of a program like:
|
Force-pushed from e3dbfe2 to 1a21a65.
Regarding @zygoloid's questions, it seems we should not require the compiler to contain a list of valid characters (in particular, since that list is updated from time to time). Thus, "x" should be syntactically valid and produce the expected number. In contrast, "y" should be ill-formed.
I have submitted P1139R0 to address the remaining issues as discussed here.
ISO 10646 doesn't have a "short name" concept (there is a "Jamo short name", but that is specific to the Hangul script and clearly not the intended meaning here). What ISO 10646 does have is a "short identifier" concept, which is clearly what is intended here. I have made minimal changes to this wording in order to use the "short identifier" concept.
For clarity, I am reproducing here the relevant text from ISO 10646.
Also note that "short identifier" is already used in [cpp.predefined], 2.4 (http://eel.is/c++draft/cpp.predefined#2.4)
Fixes #2109.