
std::char_traits<char16_t>::eof() requires uint_least16_t to be larger than 16 bits (LWG#2959) #32

Open · tahonermann opened this issue Jul 24, 2018 · 34 comments

Labels: bug (Something isn't working), help wanted (Extra attention is needed), paper needed (A paper proposing a specific solution is needed)

Comments

@tahonermann (Member)

The specialization std::char_traits<char16_t> is required to define the member type int_type as uint_least16_t and the member function eof() to return "an implementation-defined constant that cannot appear as a valid UTF-16 code unit."

However, all 16-bit values are valid UTF-16 code unit values, so the only way for an implementation to be conforming is for its uint_least16_t type to be larger than 16 bits.
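
A minimal sketch of the collision (my illustration, assuming a typical platform where uint_least16_t is exactly 16 bits):

```cpp
#include <string>

int main() {
    using T = std::char_traits<char16_t>;
    // int_type is specified as uint_least16_t; on typical platforms that is
    // exactly 16 bits, so every int_type value is also a valid UTF-16 code unit.
    static_assert(sizeof(T::int_type) == sizeof(char16_t),
                  "holds on typical platforms; where it fails, eof() has room");
    // Whatever value eof() returns, some valid code unit maps onto it:
    char16_t c = static_cast<char16_t>(T::eof());
    // With an identity to_int_type() this is a spurious match: a real code
    // unit compares equal to eof(). (Implementations that instead fold the
    // value in to_int_type() violate other requirements, discussed below.)
    bool spurious = T::eq_int_type(T::to_int_type(c), T::eof());
    (void)spurious;
}
```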

@tahonermann (Member, Author)

This was discovered during a discussion on Twitter with Billy O'Neal about Microsoft's use of unsigned short for wint_t and its definition of WEOF as 0xFFFF. The specification of std::char_traits<char16_t> creates the same issue he faced with Microsoft's wchar_t implementation choices.

@tahonermann (Member, Author)

According to the Unicode FAQ, noncharacters (and U+FFFF in particular) are valid in UTF-16 strings, so they cannot reliably be used as sentinels. A relevant FAQ entry:

> Because of this complicated history and confusing changes of wording in the standard over the years regarding what are now known as noncharacters, there is still considerable disagreement about their use and whether they should be considered "illegal" or "invalid" in various contexts. Particularly for implementations prior to Unicode 3.1, it should not be surprising to find legacy behavior treating U+FFFE and U+FFFF as invalid in Unicode 16-bit strings. And U+FFFF and U+10FFFF are, indeed, known to be used in various implementations as sentinels. For example, the value FFFF is used for WEOF in Windows implementations.

@cubbimew

LWG thinks it's NAD: https://cplusplus.github.io/LWG/issue1200
GCC (@jwakely specifically) thinks it's a defect: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80624

@rmartinho (Collaborator) commented Jul 25, 2018

uint_least16_t is likely to be the same size as char16_t, which may lead to surprising behavior, even if eof() is not a valid UTF-16 code unit.

To me this sounds like the decision was made without a proper understanding of the issue: all 16-bit values are valid UTF-16 code units.

The proposed wording betrays both a lack of understanding of what "permanently reserved" means (it doesn't mean it cannot be used, nor that it can be lost in interchange; only that it won't be assigned), and a misguided focus on UCS-2.

I think we should bring it up again with more expertise on hand.

@tahonermann (Member, Author)

Thanks for those links, Sergey. The wiki notes for the Rapperswil review of LWG1200 are available at http://wiki.edg.com/bin/view/Wg21rapperswil/LWGSubGroup1. Unfortunately, those notes don't provide any more useful information regarding their review.

I agree with LWG's position that the proposed resolutions were overspecified. But Martinho is right: it looks like LWG didn't understand the concern. They appear to admit as much with their "it is not clear what problem is being solved" comment.

I think our concern is slightly different from LWG1200's. LWG1200 appears to complain about the situation in which the eof() value can be held in a char_type value, regardless of whether that value constitutes a valid code unit for the encoding. Our concern is that there is no value eof() can return that is not a valid code unit (unless uint_least16_t is larger than 16 bits). I think we should proceed with a new issue that explains this viewpoint but refers to LWG1200 as a related concern.

@jwakely commented Jul 25, 2018 via email

tahonermann changed the title from "std::char_traits<char16_t>::eof() requires uint_least16_t to be larger than 16 bits" to "std::char_traits<char16_t>::eof() requires uint_least16_t to be larger than 16 bits (LWG#2959)" on Jul 25, 2018
@tahonermann (Member, Author) commented Jul 25, 2018

Thanks, Jonathan.

> Right, that's the problem I solved in GCC.

> ... by changing to_int_type() to return 0xFFFD when converting from a value that matches eof().

Just to be clear, this is an improvement over spurious eof() matches, but doesn't actually solve lwg2959.

> See https://cplusplus.github.io/LWG/lwg-active.html#2959

lwg2959 looks like an exact match for this issue. Thanks!
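
For reference, the shape of that GCC change is roughly the following. This is a paraphrase in a stand-in struct (traits_like is an illustrative name), not the actual libstdc++ source:

```cpp
#include <cstdint>

// Paraphrase of the libstdc++ workaround, not the actual source.
struct traits_like {
    using char_type = char16_t;
    using int_type  = std::uint_least16_t;

    static constexpr int_type eof() noexcept { return 0xFFFF; }

    // Fold the one char_type value that would collide with eof() to U+FFFD
    // (the replacement character) when widening to int_type:
    static constexpr int_type to_int_type(char_type c) noexcept
    { return c == eof() ? int_type(0xFFFD) : int_type(c); }
};
```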

@tahonermann (Member, Author)

> Just to be clear, this is an improvement over spurious eof() matches, but doesn't actually solve lwg2959.

... doesn't actually solve lwg2959 because the requirements for eq_int_type() in [char.traits.require] aren't satisfied. For c=0xFFFF and d=0xFFFD, eq(c,d) is false, but eq_int_type(to_int_type(c), to_int_type(d)) is true. Nor are the requirements for eof() satisfied since, for c=0xFFFD, eq_int_type(eof(), to_int_type(c)) is true.
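
The same violation, spelled out as a sketch (assuming an implementation that folds 0xFFFF to 0xFFFD in to_int_type(), as described above):

```cpp
#include <cassert>
#include <string>

int main() {
    using T = std::char_traits<char16_t>;
    char16_t c = u'\xFFFF', d = u'\xFFFD';
    // [char.traits.require]: eq_int_type(to_int_type(c), to_int_type(d))
    // must give the same answer as eq(c, d).
    assert(!T::eq(c, d));  // distinct code units, so eq(c, d) is false...
    // ...but with the folding workaround, to_int_type(c) and to_int_type(d)
    // are both 0xFFFD, so eq_int_type(to_int_type(c), to_int_type(d)) is
    // true: the requirement is violated. Likewise eq_int_type(eof(),
    // to_int_type(d)) is true even though d is a real character.
}
```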

@jwakely commented Jul 25, 2018

But the wording already allows char_traits<char16_t> to meet those requirements (by treating 0xFFFF and 0xFFFD as equal, and by returning 0xFFFD from to_int_type(0xFFFF)). No change is needed to the standard to make that true (just fixes to implementations).

The defect in the standard is the implication that there is any 16-bit value that "cannot appear as a valid UTF-16 code unit".

@tahonermann (Member, Author)

> But the wording already allows char_traits<char16_t> to meet those requirements (by treating 0xFFFF and 0xFFFD as equal, and by returning 0xFFFD from to_int_type(0xFFFF)).

If I understand correctly, you're suggesting that, for c=0xFFFF and d=0xFFFD, eq(c,d) can return true. But that would contradict [char.traits.specializations.char16_t]p2.

@jwakely commented Jul 26, 2018

Ah yes. I don't see how that can be solved without an ABI break to make int_type larger than char_type.
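
For concreteness, a sketch of what a widened specialization could look like; the struct name and the exact eof() value are illustrative, not proposed wording (member set abbreviated):

```cpp
#include <cstdint>

// Illustrative sketch only: the specialization with a widened int_type,
// which is the ABI break under discussion.
struct widened_char16_traits {
    using char_type = char16_t;
    using int_type  = std::uint_least32_t;  // strictly wider than char_type

    static constexpr int_type eof() noexcept
    { return 0xFFFFFFFFu; }                 // not a 16-bit code unit
    static constexpr int_type to_int_type(char_type c) noexcept
    { return c; }                           // lossless: no folding needed
    static constexpr char_type to_char_type(int_type i) noexcept
    { return static_cast<char_type>(i); }
    static constexpr bool eq(char_type a, char_type b) noexcept
    { return a == b; }
    static constexpr bool eq_int_type(int_type a, int_type b) noexcept
    { return a == b; }
};
```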

tahonermann added the labels paper needed, help wanted, and bug on Aug 6, 2018
@dimztimz

I'd like to offer my comment on this.

Changing the type to a 32-bit unsigned integer is the only real solution. It does change the ABI, but I can argue that the impact of this change is extremely low to zero:

1. char_traits::int_type and char_traits::eof() are used only in the context of iostreams.
2. That means char_traits<char16_t>::int_type and eof() are meant to be used only in the context of basic_istream<char16_t>, basic_ostream<char16_t>, and their derivatives.
3. Iostreams instantiated with char16_t are largely unusable, so it is highly unlikely that anyone uses them in real-world production code.
4. Since nobody uses them, any change there is highly unlikely to have an impact.

I will now explain why iostreams that work with char16_t are unusable. There are very few operations one can perform with them:

1. You CAN construct an iostream<char16_t>.
2. You CANNOT call any of the formatted I/O operations, because they all depend on locale facets, and facets for char16_t do not exist. You will get a compile-time error.
3. You CANNOT call some of the unformatted I/O operations, namely those that work with the newline character. Those depend on the ctype<char16_t> facet; they need it to call ctype_facet.widen('\n').
4. You CAN only call the unformatted I/O functions that do not depend on any locale facet. That leaves very few functions of the whole iostream; see the sketch below.
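
To make the points above concrete, a sketch (whether the commented-out lines fail at compile time, link time, or run time varies by implementation):

```cpp
#include <sstream>

int main() {
    std::basic_stringstream<char16_t> ss;  // point 1: construction works

    ss.write(u"hi", 2);                    // point 4: unformatted I/O that
    char16_t buf[2];                       // needs no locale facet works
    ss.read(buf, 2);

    // Point 2: formatted I/O requires num_get/num_put<char16_t> facets,
    // which mainstream libraries do not provide:
    // ss << 42;

    // Point 3: getline() calls ctype<char16_t>::widen('\n'), same problem:
    // char16_t line[16];
    // ss.getline(line, 16);
}
```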

@jwakely commented Oct 29, 2018 via email

@dimztimz

In theory, yes. In theory one could use char_traits::int_type all over the place. I'm saying that practically nobody uses it.

AFAIK there are only four vendors that provide modern C++ implementations: GCC, Clang, MSVC, and EDG. Other implementations, like Intel's, are based on EDG. For the first three I can safely say that they don't provide facets that work with char16_t and char32_t. I'd expect EDG is the same.

Then there is Boost.Locale, which claims experimental support for char16_t and char32_t facets that has to be enabled at configure time with a preprocessor define and is disabled by default. But even when you explicitly set that define, it won't work: later, in client code, when you use a char16_t facet, you will get a link-time error. Boost.Locale expects the std facets for char16_t to exist and derives from them. Basically, the Boost.Locale support for char16_t is not experimental but broken, and that is the reason it is disabled by default.

I guess a Debian code search should be done to see whether there is any place that uses iostreams with char16_t and char_traits::int_type.

And, after all, GCC once did an ABI break on std::string, and that went well thanks to the tricks employed there.

@jwakely commented Oct 29, 2018

You're not really adding useful information here.

> In theory one could use char_traits::int_type all over the place.

That wouldn't work, because you don't control the code inside the standard library.

> For the first three I can safely say that they don't provide facets that work with char16_t and char32_t. I'd expect EDG is the same.

EDG is a compiler front end; it has nothing to do with providing facets, locales, iostreams, etc. Similarly, Clang is a compiler. What matters is the standard library implementation (and there are only three relevant implementations of that).

> And, after all, GCC once did an ABI break on std::string, and that went well thanks to the tricks employed there.

Firstly, it was not a "break" because the old ABI is still present and supported, and how well it went is debatable. Secondly, that isn't an option in this case. Source: me.

@dimztimz

OK then. Do we all agree that char_traits<char16_t>::int_type needs to be changed to a wider type?

@tahonermann (Member, Author)

I think we had agreement on that before your recent comments :) See Jonathan's comment on July 26th.

@dimztimz

How should we proceed? Is a classic paper with a Pxxxx number needed, or is there a different procedure for defect reports?

@strega-nil

@jwakely I believe there are four: Sony uses Dinkumware's STL.

@tahonermann (Member, Author) commented Oct 29, 2018

@dimztimz There is already an active LWG issue for this so, to some degree, this is effectively in LWG's court. A paper arguing for a particular resolution could be helpful in moving things along, but attending an LWG issues processing session could also have the same effect.

@ubsan, Microsoft uses Dinkumware's STL (heavily customized).

@strega-nil

@tahonermann not anymore: it's a fork of Dinkumware's STL that hasn't been in alignment with mainline for ~three years.

@cor3ntin (Collaborator) commented Aug 8, 2019

The same issue exists with char8_t: eof() should return a 32-bit int whose value is not in the range 0x00-0x10FFFF.
This could be fixed for C++20.

@jwakely commented Aug 8, 2019

@cor3ntin char_traits<char8_t>::eof() only needs to return a value which is not a valid UTF-8 code unit. So any integer larger than char8_t will work. 0x10FFFF cannot appear as a UTF-8 code unit.

@cor3ntin (Collaborator) commented Aug 8, 2019

There is no requirement that a u8string store valid UTF-8 data. So if an iostream function returns eof(), do I have invalid data, or did I actually reach the end of the file? There is no way for me to know without checking the state of the stream. Any 8-bit value is a UTF-8 code unit and can appear in input data. The same applies to UTF-16 and UTF-32, actually. But 32 is a bit complicated; I suppose we couldn't require a 64-bit value? Am I missing something?

@jwakely commented Aug 8, 2019

But a u8string stores 8-bit code units. An 8-bit code unit cannot have the value 0x10FFFF.

eof() is compared to elements of the string, i.e. code units, not code points.
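
This can be checked directly, since char_traits<char8_t>::int_type is specified as unsigned int (a sketch; the asserted size relationship holds on typical platforms):

```cpp
#include <string>

int main() {
    using T8 = std::char_traits<char8_t>;
    // int_type here is unsigned int: on typical platforms it is wider than
    // the 8-bit char_type, so eof() can be a value (an implementation might
    // use ~0u, for example) that to_int_type() can never produce.
    static_assert(sizeof(T8::int_type) > sizeof(char8_t),
                  "holds on typical platforms");
    char8_t c = u8'a';
    // to_int_type(c) is at most 0xFF, so it can never equal an eof() chosen
    // outside the 0..0xFF range:
    bool collides = T8::eq_int_type(T8::to_int_type(c), T8::eof());
    (void)collides;  // false on conforming implementations
}
```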

@cor3ntin (Collaborator) commented Aug 8, 2019

Hmm, never mind, I originally misunderstood your reply; everything is fine 🥺

@dimztimz

Was there any progress on this issue at the last ISO meeting?

Can NB comments be submitted for issues?

@tahonermann (Member, Author)

No, we didn't spend any time on this in Cologne.

You can certainly submit an NB comment for it. I think the only potentially controversial aspect is whether fixing this is worth an ABI break. I don't have a good sense of how the committee will feel about that.

@Ixrec commented Nov 19, 2019

On ABI breaks, did P1863 get discussed at Belfast? I suspect "minor" reasons for breakage like this issue stand or fall with that larger question and ought to get included in a large batch of changes if done at all.

@tahonermann (Member, Author)

> On ABI breaks, did P1863 get discussed at Belfast?

No, it didn't. I think there is intent for it to get discussed in Prague.

@tahonermann (Member, Author)

For historical reference, Unicode Corrigendum #9 (Clarification About Noncharacters) clarified that noncharacter code points are not prohibited in interchange.

@dimztimz

A temporary solution could be to deprecate the type alias int_type and the member function eof() for the char16_t specialization.

@tahonermann (Member, Author)

Thanks, @dimztimz. Deprecating int_type would require deprecating not_eof(), to_char_type(), to_int_type(), and eq_int_type() in addition to eof(). We could do that, but it doesn't get us much closer to a solution. I think we're heading in the direction of just fixing int_type as a DR. I'm going to reach out to implementors to see if they have any additional concerns and, based on their response, will poll that direction in SG16.

@tahonermann (Member, Author)

Per the last comment, I reached out to implementors to collect their thoughts on whether it is feasible to just fix int_type as a DR. I received a couple of responses that I'll summarize in the GitHub issue tracking LWG2959.
