std::char_traits<char16_t>::eof() requires uint_least16_t to be larger than 16 bits (LWG#2959) #32
Comments
This was discovered during discussion on Twitter with Billy O'Neal over Microsoft's use of |
According to the Unicode FAQ, non-characters (and
|
LWG thinks it's NAD: https://cplusplus.github.io/LWG/issue1200 |
To me this sounds like the decision was made without a proper understanding of the issue: all 16-bit values are valid UTF-16 code units. The proposed wording betrays both a lack of understanding of what "permanently reserved" means (it doesn't mean it cannot be used, nor that it can be lost in interchange; only that it won't be assigned), and a misguided focus on UCS-2. I think we should bring it up again with more expertise on hand. |
Thanks for those links, Sergey. The wiki notes for the Rapperswil review of LWG1200 are available at http://wiki.edg.com/bin/view/Wg21rapperswil/LWGSubGroup1. Unfortunately, those notes don't provide any more useful information regarding their review. I agree with LWG's position that the proposed resolutions were overspecified. But Martinho is right, it looks like LWG didn't understand the concern. They appear to admit that with their "it is not clear what problem is being solved" comment. I think our concern is slightly different than lwg1200. It appears that lwg1200 is complaining about the situation in which the |
On Wed, 25 Jul 2018 at 16:39, Tom Honermann wrote:
I think our concern is slightly different than lwg1200. It appears that
lwg1200 is complaining about the situation in which the eof() value can
be held in a char_type value regardless of whether that value constitutes
a valid code unit for the encoding. Our concern is that there is no value
that eof() can return that is not a valid code unit (unless uint_least16_t
is larger than 16 bits).
Right, that's the problem I solved in GCC.
I think we should proceed with a new issue that explains this view point
but refers to lwg1200 as a related concern.
|
Thanks, Jonathan.
... by changing Just to be clear, this is an improvement over spurious lwg2959 looks like an exact match for this issue. Thanks! |
... doesn't actually solve lwg2959 because the requirements for |
But the wording already allows The defect in the standard is the implication that there is any 16-bit value that "cannot appear as a valid UTF-16 code unit". |
If I understand correctly, you're suggesting that, for |
Ah yes. I don't see how that can be solved without an ABI break to make |
I'd like to give my comment on this. Changing the type to a 32-bit unsigned integer is the only real solution. Indeed, it changes
I will now explain why the iostreams that work with
|
In theory implementations can provide ctype<char16_t> and other facets. If
the implementation doesn't, users can define their own facets and imbue a
custom locale into streams.
|
In theory, yes. In theory one could use AFAIK there are only 4 implementations that provide modern C++ implementations, GCC, Clang, MSVC and EDG. Other implementations like Intel's are based on EDG. For the first 3 I can safely say that they don't provide facets that work with char16_t and char32_t. I'd expect EDG is the same. Then there is Boost::Locale which claims that it has experimental support for I guess a Debian code search should be done to see if there is a place that uses iostreams with And, after all, GCC once did an ABI break on std::string, and that went well with the tricks employed there. |
You're not really adding useful information here.
That wouldn't work, because you don't control the code inside the standard library.
EDG is a compiler front-end, it has nothing to do with providing facets, locales, iostreams etc. and similarly Clang is a compiler. What matters is the standard library implementation (and there are only three relevant implementations of that).
Firstly, it was not a "break" because the old ABI is still present and supported, and how well it went is debatable. Secondly, that isn't an option in this case. Source: me. |
Ok then. Do we all agree that |
I think we had agreement on that before your recent comments :) See Jonathan's comment on July 26th. |
How should we proceed? Is a classic paper with number Pxxxx needed, or is there a different procedure for defect reports? |
@jwakely I believe there are four - Sony uses dinkumware's STL. |
@dimztimz There is already an active LWG issue for this so, to some degree, this is effectively in LWG's court. A paper arguing for a particular resolution could be helpful in moving things along, but attending an LWG issues processing session could also have the same effect. @ubsan, Microsoft uses Dinkumware's STL (heavily customized). |
@tahonermann not anymore - it's a fork of dinkumware's STL that hasn't been in alignment with mainline for ~three years. |
Same issue with char8_t - it should return a 32-bit int whose value is not in the range 00-10FFFF
@cor3ntin |
There is no requirement that a u8string stores valid UTF-8 data. So if an iostream function returns eof(), do I have invalid data, or did I actually reach the end of the file? There is no way for me to know, which forces me to check the state of the string. Any 8-bit value is a UTF-8 code unit and can appear in input data. Same thing for UTF-16 and UTF-32, actually. But 32 is a bit complicated; I suppose we couldn't require a 64-bit value? Am I missing something? |
But a u8string stores 8-bit code units. An 8-bit code unit cannot have the value 10FFFF.
|
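The point about code-unit width can be illustrated with a small sketch. It uses plain `char` as a stand-in so that it compiles before C++20; with a C++20 compiler the same template can be instantiated with `char8_t`. The names `count_eof_collisions` and `demo_narrow_eof` are hypothetical and exist only for this illustration. The idea is that when `int_type` is wider than the code unit, `eof()` can sit outside the range of every converted code unit.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: counts how many 8-bit code units convert (via
// to_int_type) to a value that compares equal to eof() for the given traits.
template <class CharT>
int count_eof_collisions() {
    using traits = std::char_traits<CharT>;
    int collisions = 0;
    for (unsigned v = 0; v <= 0xFFu; ++v) {
        const CharT c = static_cast<CharT>(static_cast<unsigned char>(v));
        if (traits::eq_int_type(traits::to_int_type(c), traits::eof())) {
            ++collisions;
        }
    }
    return collisions;
}

// Usage: for char, int_type is int and eof() is EOF, so no 8-bit code unit
// is mistaken for end-of-file.
inline void demo_narrow_eof() {
    assert(count_eof_collisions<char>() == 0);
}
```

The 16-bit case discussed in this issue differs precisely because `int_type` there may be no wider than the code unit itself.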
Hmm, never mind, I originally misunderstood your reply, everything is fine 🥺 |
Was there any progress on this issue in the last ISO meeting? Can NB comments be given for issues? |
No, we didn’t spend any time on this in Cologne. You can certainly submit an NB comment for it. I think the only potentially controversial aspect is whether fixing this is worth an ABI break. I don’t have a good sense of what the committee will feel about that. |
On ABI breaks, did P1863 get discussed at Belfast? I suspect "minor" reasons for breakage like this issue stand or fall with that larger question and ought to get included in a large batch of changes if done at all. |
No, it didn't. I think there is intent for it to get discussed in Prague. |
For historical reference, Unicode Corrigendum #9 (Clarification About Noncharacters) clarified that noncharacter code points are not prohibited in interchange. |
A temporary solution can be to deprecate the type alias |
Thanks, @dimztimz. Deprecating |
Per the last comment, I reached out to implementors to collect their thoughts on whether it is feasible to just fix |
The specialization for std::char_traits<char16_t> requires a member int_type defined as uint_least16_t and for the member function eof() to return "an implementation-defined constant that cannot appear as a valid UTF-16 code unit."

However, all 16 bit values are valid UTF-16 code unit values, so the only way for an implementation to be conforming would be if its uint_least16_t type is larger than 16 bits.
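The constraint described in this report can be probed against a given standard library with a short sketch. The names `traits16`, `EofCheck`, `check_char16_eof` and `int_type_is_wider_than_16_bits` are hypothetical and exist only for this illustration, and the counting argument assumes `int_type` has no padding bits. The sketch walks all 65536 code units and records whether any of them converts to `eof()`, and how many distinct `int_type` values the conversion produces.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <set>
#include <string>

using traits16 = std::char_traits<char16_t>;

// True when int_type has room for a value outside the 16-bit code unit range
// (assumption: no padding bits in int_type).
constexpr bool int_type_is_wider_than_16_bits =
    sizeof(traits16::int_type) * CHAR_BIT > 16;

struct EofCheck {
    int collisions;        // code units whose int_type value equals eof()
    std::size_t distinct;  // distinct int_type values produced by to_int_type
};

// Converts every possible 16-bit code unit with to_int_type and compares
// the result against eof().
inline EofCheck check_char16_eof() {
    EofCheck r{0, 0};
    std::set<traits16::int_type> seen;
    for (unsigned long v = 0; v <= 0xFFFFul; ++v) {
        const traits16::int_type i =
            traits16::to_int_type(static_cast<char16_t>(v));
        if (traits16::eq_int_type(i, traits16::eof())) {
            ++r.collisions;
        }
        seen.insert(i);
    }
    r.distinct = seen.size();
    assert(r.distinct <= 65536u);  // only 65536 inputs were converted
    return r;
}
```

The exact numbers depend on the implementation. By the pigeonhole principle, if `int_type` is exactly 16 bits wide, then either some code unit collides with `eof()` or two different code units map to the same `int_type` value; only a wider `int_type` avoids both.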