
std::char_traits<char16_t>::eof() requires uint_least16_t to be larger than 16 bits (LWG#2959) #32

Open · tahonermann opened this issue Jul 24, 2018 · 34 comments

Labels: bug (Something isn't working), help wanted (Extra attention is needed), paper needed (A paper proposing a specific solution is needed)

Comments

@tahonermann (Member)

The specialization std::char_traits<char16_t> is required to define the member type int_type as uint_least16_t and the member function eof() to return "an implementation-defined constant that cannot appear as a valid UTF-16 code unit."

However, all 16-bit values are valid UTF-16 code unit values, so the only way for an implementation to be conforming is for its uint_least16_t type to be larger than 16 bits.
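
A minimal sketch of the collision (my illustration, assuming a typical platform where uint_least16_t is exactly 16 bits):

```cpp
#include <string>

int main() {
    using T = std::char_traits<char16_t>;
    // int_type is specified as uint_least16_t; on typical platforms that is
    // exactly 16 bits, so every int_type value is also a valid UTF-16 code unit.
    static_assert(sizeof(T::int_type) == sizeof(char16_t),
                  "holds on typical platforms; where it fails, eof() has room");
    // Whatever value eof() returns, some valid code unit maps onto it:
    char16_t c = static_cast<char16_t>(T::eof());
    // With an identity to_int_type() this is a spurious match: a real code
    // unit compares equal to eof(). (Implementations that instead fold the
    // value in to_int_type() violate other requirements, discussed below.)
    bool spurious = T::eq_int_type(T::to_int_type(c), T::eof());
    (void)spurious;
}
```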

@tahonermann (Member, Author)

This was discovered during a discussion on Twitter with Billy O'Neal about Microsoft's use of unsigned short for wint_t and its definition of WEOF as 0xFFFF. The specification of std::char_traits<char16_t> creates the same issue he faced with Microsoft's wchar_t implementation choices.

@tahonermann (Member, Author)

According to the Unicode FAQ, noncharacters (and U+FFFF in particular) are valid in UTF-16 strings, so they cannot reliably be used as sentinels. A relevant FAQ entry:

> Because of this complicated history and confusing changes of wording in the standard over the years regarding what are now known as noncharacters, there is still considerable disagreement about their use and whether they should be considered "illegal" or "invalid" in various contexts. Particularly for implementations prior to Unicode 3.1, it should not be surprising to find legacy behavior treating U+FFFE and U+FFFF as invalid in Unicode 16-bit strings. And U+FFFF and U+10FFFF are, indeed, known to be used in various implementations as sentinels. For example, the value FFFF is used for WEOF in Windows implementations.

@cubbimew

LWG thinks it's NAD: https://cplusplus.github.io/LWG/issue1200
GCC (@jwakely specifically) thinks it's a defect: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80624

@rmartinho (Collaborator) commented Jul 25, 2018

uint_least16_t is likely to be the same size as char16_t, which may lead to surprising behavior, even if eof() is not a valid UTF-16 code unit.

To me this sounds like the decision was made without a proper understanding of the issue: all 16-bit values are valid UTF-16 code units.

The proposed wording betrays both a lack of understanding of what "permanently reserved" means (it doesn't mean it cannot be used, nor that it can be lost in interchange; only that it won't be assigned), and a misguided focus on UCS-2.

I think we should bring it up again with more expertise on hand.

@tahonermann (Member, Author)

Thanks for those links, Sergey. The wiki notes for the Rapperswil review of LWG1200 are available at http://wiki.edg.com/bin/view/Wg21rapperswil/LWGSubGroup1. Unfortunately, those notes don't provide any more useful information regarding their review.

I agree with LWG's position that the proposed resolutions were overspecified. But Martinho is right: it looks like LWG didn't understand the concern. They appear to admit as much with their "it is not clear what problem is being solved" comment.

I think our concern is slightly different from LWG1200's. LWG1200 appears to complain about the situation in which the eof() value can be held in a char_type value, regardless of whether that value constitutes a valid code unit for the encoding. Our concern is that there is no value eof() can return that is not a valid code unit (unless uint_least16_t is larger than 16 bits). I think we should proceed with a new issue that explains this viewpoint but refers to LWG1200 as a related concern.

@jwakely commented Jul 25, 2018 via email

tahonermann changed the title from "std::char_traits<char16_t>::eof() requires uint_least16_t to be larger than 16 bits" to "std::char_traits<char16_t>::eof() requires uint_least16_t to be larger than 16 bits (LWG#2959)" on Jul 25, 2018
@tahonermann (Member, Author) commented Jul 25, 2018

Thanks, Jonathan.

> Right, that's the problem I solved in GCC.

> ... by changing to_int_type() to return 0xFFFD when converting from a value that matches eof().

Just to be clear, this is an improvement over spurious eof() matches, but doesn't actually solve lwg2959.

> See https://cplusplus.github.io/LWG/lwg-active.html#2959

lwg2959 looks like an exact match for this issue. Thanks!
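
For reference, the shape of that GCC change is roughly the following. This is a paraphrase in a stand-in struct (traits_like is an illustrative name), not the actual libstdc++ source:

```cpp
#include <cstdint>

// Paraphrase of the libstdc++ workaround, not the actual source.
struct traits_like {
    using char_type = char16_t;
    using int_type  = std::uint_least16_t;

    static constexpr int_type eof() noexcept { return 0xFFFF; }

    // Fold the one char_type value that would collide with eof() to U+FFFD
    // (the replacement character) when widening to int_type:
    static constexpr int_type to_int_type(char_type c) noexcept
    { return c == eof() ? int_type(0xFFFD) : int_type(c); }
};
```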

@tahonermann (Member, Author)

> Just to be clear, this is an improvement over spurious eof() matches, but doesn't actually solve lwg2959.

... doesn't actually solve lwg2959 because the requirements for eq_int_type() in [char.traits.require] aren't satisfied. For c=0xFFFF and d=0xFFFD, eq(c,d) is false, but eq_int_type(to_int_type(c), to_int_type(d)) is true. Nor are the requirements for eof() satisfied since, for c=0xFFFD, eq_int_type(eof(), to_int_type(c)) is true.
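
The same violation, spelled out as a sketch (assuming an implementation that folds 0xFFFF to 0xFFFD in to_int_type(), as described above):

```cpp
#include <cassert>
#include <string>

int main() {
    using T = std::char_traits<char16_t>;
    char16_t c = u'\xFFFF', d = u'\xFFFD';
    // [char.traits.require]: eq_int_type(to_int_type(c), to_int_type(d))
    // must give the same answer as eq(c, d).
    assert(!T::eq(c, d));  // distinct code units, so eq(c, d) is false...
    // ...but with the folding workaround, to_int_type(c) and to_int_type(d)
    // are both 0xFFFD, so eq_int_type(to_int_type(c), to_int_type(d)) is
    // true: the requirement is violated. Likewise eq_int_type(eof(),
    // to_int_type(d)) is true even though d is a real character.
}
```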

@jwakely commented Jul 25, 2018

But the wording already allows char_traits<char16_t> to meet those requirements (by treating 0xFFFF and 0xFFFD as equal, and by returning 0xFFFD from to_int_type(0xFFFF)). No change is needed to the standard to make that true (just fixes to implementations).

The defect in the standard is the implication that there is any 16-bit value that "cannot appear as a valid UTF-16 code unit".

@tahonermann (Member, Author)

> But the wording already allows char_traits<char16_t> to meet those requirements (by treating 0xFFFF and 0xFFFD as equal, and by returning 0xFFFD from to_int_type(0xFFFF)).

If I understand correctly, you're suggesting that, for c=0xFFFF and d=0xFFFD, eq(c,d) can return true. But that would contradict [char.traits.specializations.char16_t]p2.

@jwakely commented Jul 26, 2018

Ah yes. I don't see how that can be solved without an ABI break to make int_type larger than char_type.
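
For concreteness, a sketch of what a widened specialization could look like; the struct name and the exact eof() value are illustrative, not proposed wording (member set abbreviated):

```cpp
#include <cstdint>

// Illustrative sketch only: the specialization with a widened int_type,
// which is the ABI break under discussion.
struct widened_char16_traits {
    using char_type = char16_t;
    using int_type  = std::uint_least32_t;  // strictly wider than char_type

    static constexpr int_type eof() noexcept
    { return 0xFFFFFFFFu; }                 // not a 16-bit code unit
    static constexpr int_type to_int_type(char_type c) noexcept
    { return c; }                           // lossless: no folding needed
    static constexpr char_type to_char_type(int_type i) noexcept
    { return static_cast<char_type>(i); }
    static constexpr bool eq(char_type a, char_type b) noexcept
    { return a == b; }
    static constexpr bool eq_int_type(int_type a, int_type b) noexcept
    { return a == b; }
};
```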

tahonermann added the labels paper needed, help wanted, and bug on Aug 6, 2018
@dimztimz

I'd like to offer my comment on this.

Changing the type to a 32-bit unsigned integer is the only real solution. It does change the ABI, but I can argue that the impact of this change is extremely low to zero:

1. char_traits::int_type and char_traits::eof() are used only in the context of iostreams.
2. That means char_traits<char16_t>::int_type and eof() are meant to be used only in the context of basic_istream<char16_t>, basic_ostream<char16_t>, and their derivatives.
3. Iostreams instantiated with char16_t are largely unusable, so it is highly unlikely that anyone uses them in real-world production code.
4. Since nobody uses them, any change there is highly unlikely to have an impact.

I will now explain why iostreams that work with char16_t are unusable. There are very few operations one can perform with them:

1. You CAN construct an iostream<char16_t>.
2. You CANNOT call any of the formatted I/O operations, because they all depend on locale facets, and facets for char16_t do not exist. You will get a compile-time error.
3. You CANNOT call some of the unformatted I/O operations, namely those that work with the newline character. Those depend on the ctype<char16_t> facet; they need it to call ctype_facet.widen('\n').
4. You CAN only call the unformatted I/O functions that do not depend on any locale facet. That leaves very few functions of the whole iostream; see the sketch below.
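
To make the points above concrete, a sketch (whether the commented-out lines fail at compile time, link time, or run time varies by implementation):

```cpp
#include <sstream>

int main() {
    std::basic_stringstream<char16_t> ss;  // point 1: construction works

    ss.write(u"hi", 2);                    // point 4: unformatted I/O that
    char16_t buf[2];                       // needs no locale facet works
    ss.read(buf, 2);

    // Point 2: formatted I/O requires num_get/num_put<char16_t> facets,
    // which mainstream libraries do not provide:
    // ss << 42;

    // Point 3: getline() calls ctype<char16_t>::widen('\n'), same problem:
    // char16_t line[16];
    // ss.getline(line, 16);
}
```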

@jwakely commented Oct 29, 2018 via email

@dimztimz

In theory, yes. In theory one could use char_traits::int_type all over the place. I'm saying that practically nobody uses it.

AFAIK there are only four vendors that provide modern C++ implementations: GCC, Clang, MSVC, and EDG. Other implementations, like Intel's, are based on EDG. For the first three I can safely say that they don't provide facets that work with char16_t and char32_t. I'd expect EDG is the same.

Then there is Boost.Locale, which claims experimental support for char16_t and char32_t facets that has to be enabled at configure time with a preprocessor define and is disabled by default. But even when you explicitly set that define, it won't work: later, in client code, when you use a char16_t facet, you will get a link-time error. Boost.Locale expects the std facets for char16_t to exist and derives from them. Basically, the Boost.Locale support for char16_t is not experimental but broken, and that is the reason it is disabled by default.

I guess a Debian code search should be done to see whether there is any place that uses iostreams with char16_t and char_traits::int_type.

And, after all, GCC once did an ABI break on std::string, and that went well thanks to the tricks employed there.

@jwakely commented Oct 29, 2018

You're not really adding useful information here.

> In theory one could use char_traits::int_type all over the place.

That wouldn't work, because you don't control the code inside the standard library.

> For the first three I can safely say that they don't provide facets that work with char16_t and char32_t. I'd expect EDG is the same.

EDG is a compiler front end; it has nothing to do with providing facets, locales, iostreams, etc. Similarly, Clang is a compiler. What matters is the standard library implementation (and there are only three relevant implementations of that).

> And, after all, GCC once did an ABI break on std::string, and that went well thanks to the tricks employed there.

Firstly, it was not a "break" because the old ABI is still present and supported, and how well it went is debatable. Secondly, that isn't an option in this case. Source: me.

@dimztimz

OK then. Do we all agree that char_traits<char16_t>::int_type needs to be changed to a wider type?

@tahonermann (Member, Author)

I think we had agreement on that before your recent comments :) See Jonathan's comment on July 26th.

@dimztimz

How should we proceed? Is a classic paper with a Pxxxx number needed, or is there a different procedure for defect reports?

@strega-nil

@jwakely I believe there are four: Sony uses Dinkumware's STL.

@tahonermann (Member, Author) commented Oct 29, 2018

@dimztimz There is already an active LWG issue for this so, to some degree, this is effectively in LWG's court. A paper arguing for a particular resolution could be helpful in moving things along, but attending an LWG issues processing session could also have the same effect.

@ubsan, Microsoft uses Dinkumware's STL (heavily customized).

@strega-nil

@tahonermann not anymore: it's a fork of Dinkumware's STL that hasn't been in alignment with mainline for ~three years.

@cor3ntin (Collaborator) commented Aug 8, 2019

The same issue exists with char8_t: eof() should return a 32-bit int whose value is not in the range 0x00-0x10FFFF.
This could be fixed for C++20.

@jwakely commented Aug 8, 2019

@cor3ntin char_traits<char8_t>::eof() only needs to return a value which is not a valid UTF-8 code unit. So any integer larger than char8_t will work. 0x10FFFF cannot appear as a UTF-8 code unit.

@cor3ntin (Collaborator) commented Aug 8, 2019

There is no requirement that a u8string store valid UTF-8 data. So if an iostream function returns eof(), do I have invalid data, or did I actually reach the end of the file? There is no way for me to know without checking the state of the stream. Any 8-bit value is a UTF-8 code unit and can appear in input data. The same applies to UTF-16 and UTF-32, actually. But 32 is a bit complicated; I suppose we couldn't require a 64-bit value? Am I missing something?

@jwakely commented Aug 8, 2019

But a u8string stores 8-bit code units. An 8-bit code unit cannot have the value 0x10FFFF.

eof() is compared to elements of the string, i.e. code units, not code points.
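
This can be checked directly, since char_traits<char8_t>::int_type is specified as unsigned int (a sketch; the asserted size relationship holds on typical platforms):

```cpp
#include <string>

int main() {
    using T8 = std::char_traits<char8_t>;
    // int_type here is unsigned int: on typical platforms it is wider than
    // the 8-bit char_type, so eof() can be a value (an implementation might
    // use ~0u, for example) that to_int_type() can never produce.
    static_assert(sizeof(T8::int_type) > sizeof(char8_t),
                  "holds on typical platforms");
    char8_t c = u8'a';
    // to_int_type(c) is at most 0xFF, so it can never equal an eof() chosen
    // outside the 0..0xFF range:
    bool collides = T8::eq_int_type(T8::to_int_type(c), T8::eof());
    (void)collides;  // false on conforming implementations
}
```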

@cor3ntin (Collaborator) commented Aug 8, 2019

Hmm, never mind, I originally misunderstood your reply; everything is fine 🥺

@dimztimz

Was there any progress on this issue at the last ISO meeting?

Can NB comments be submitted for issues?

@tahonermann (Member, Author)

No, we didn't spend any time on this in Cologne.

You can certainly submit an NB comment for it. I think the only potentially controversial aspect is whether fixing this is worth an ABI break. I don't have a good sense of how the committee will feel about that.

@Ixrec commented Nov 19, 2019

On ABI breaks, did P1863 get discussed at Belfast? I suspect "minor" reasons for breakage like this issue stand or fall with that larger question and ought to get included in a large batch of changes if done at all.

@tahonermann (Member, Author)

> On ABI breaks, did P1863 get discussed at Belfast?

No, it didn't. I think there is intent for it to get discussed in Prague.

@tahonermann (Member, Author)

For historical reference, Unicode Corrigendum #9 (Clarification About Noncharacters) clarified that noncharacter code points are not prohibited in interchange.

@dimztimz

A temporary solution could be to deprecate the type alias int_type and the member function eof() for the char16_t specialization.

@tahonermann (Member, Author)

Thanks, @dimztimz. Deprecating int_type would require deprecating not_eof(), to_char_type(), to_int_type(), and eq_int_type() in addition to eof(). We could do that, but it doesn't get us much closer to a solution. I think we're heading in the direction of just fixing int_type as a DR. I'm going to reach out to implementors to see if they have any additional concerns and, based on their response, will poll that direction in SG16.

@tahonermann (Member, Author)

Per the last comment, I reached out to implementors to collect their thoughts on whether it is feasible to just fix int_type as a DR. I received a couple of responses that I'll summarize in the GitHub issue tracking LWG2959.
