Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide {to,from}_{ne,le,be}_bytes functions on integers #51919

Merged
merged 2 commits into from
Aug 4, 2018

Conversation

tbu-
Copy link
Contributor

@tbu- tbu- commented Jun 29, 2018

If one doesn't view integers as containers of bytes, converting them to
bytes necessarily needs the specfication of encoding.

I think Rust is a language that wants to be explicit. The to_bytes
function is basically the opposite of that – it converts an integer into
the native byte representation, but there's no mention (in the function
name) of it being very much platform dependent. Therefore, I think it
would be better to replace that method by three methods, the explicit
to_ne_bytes ("native endian") which does the same thing and
to_{le,be}_bytes which return the little- resp. big-endian encoding.

@rust-highfive
Copy link
Collaborator

r? @bluss

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jun 29, 2018
@tbu-
Copy link
Contributor Author

tbu- commented Jun 29, 2018

I think these methods also make more sense than the to_be and to_le function from a type-system point of view: It doesn't make sense to repeatedly call to_be on an integer, but you can, the same way that it doesn't make sense to repeatedly call encode() on a string in Python 2, but you can do so there (which will raise an error for non-ASCII, but still).

@tbu-
Copy link
Contributor Author

tbu- commented Jun 29, 2018

CC #49792

@kennytm kennytm added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Jun 30, 2018
///
/// [`to_native_bytes`]: #method.to_native_bytes
#[rustc_deprecated(since = "1.29.0", reason = "method doesn't communicate
intent, use `to_native_bytes`, `to_le_bytes` or `to_be_bytes` instead")]
#[stable(feature = "int_to_from_bytes", since = "1.29.0")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this PR is accepted, we should better keep to_bytes()/from_bytes() unstable or just remove them, instead of stabilize and immediately deprecate them.

@alexreg
Copy link
Contributor

alexreg commented Jul 1, 2018

Very fair point! Three different method classes makes sense.

@felix91gr
Copy link
Contributor

felix91gr commented Jul 2, 2018

Hi! I don't know if I'm supposed to add my 2 cents here, but since I can, I will.

I've been using the to_bytes kind of transformation without regards for the endianness. This is because this is what my program is doing:

  • Receive an N-Dimensional Vec<f32>
  • Take its (non-randomly-seeded) u64 hash
  • Take the bytes of that hash
  • Feed those bytes to a pseudorandom generator as its seed.

I'm using this to make a generalized N-D Perlin (and in the future, Simplex) noise library. Since I'm using the hash's bytes as the seed for the pseudorandom generator, I didn't care about endianness when I implemented it.

But now that I think of it, specifying the endianness would allow me to have the exact same noise patterns regardless of the platform in which the library is running. So I think, even in my case, this is a better feature than the recently merged to_bytes :)

Thanks for your hard work 🙇‍♂️

@tbu-
Copy link
Contributor Author

tbu- commented Jul 5, 2018

Python implements a similar API on integers:

 |  from_bytes(...) from builtins.type
 |      int.from_bytes(bytes, byteorder, *, signed=False) -> int
 |      
 |      Return the integer represented by the given array of bytes.
 |      
 |      The bytes argument must be a bytes-like object (e.g. bytes or bytearray).
 |      
 |      The byteorder argument determines the byte order used to represent the
 |      integer.  If byteorder is 'big', the most significant byte is at the
 |      beginning of the byte array.  If byteorder is 'little', the most
 |      significant byte is at the end of the byte array.  To request the native
 |      byte order of the host system, use `sys.byteorder' as the byte order value.
 |      
 |      The signed keyword-only argument indicates whether two's complement is
 |      used to represent the integer.
 |  
 |  to_bytes(...)
 |      int.to_bytes(length, byteorder, *, signed=False) -> bytes
 |      
 |      Return an array of bytes representing an integer.
 |      
 |      The integer is represented using length bytes.  An OverflowError is
 |      raised if the integer is not representable with the given number of
 |      bytes.
 |      
 |      The byteorder argument determines the byte order used to represent the
 |      integer.  If byteorder is 'big', the most significant byte is at the
 |      beginning of the byte array.  If byteorder is 'little', the most
 |      significant byte is at the end of the byte array.  To request the native
 |      byte order of the host system, use `sys.byteorder' as the byte order value.
 |      
 |      The signed keyword-only argument determines whether two's complement is
 |      used to represent the integer.  If signed is False and a negative integer
 |      is given, an OverflowError is raised.

(Note that the byteorder argument is mandatory.)

@TimNN TimNN added S-waiting-on-team Status: Awaiting decision from the relevant subteam (see the T-<team> label). and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jul 10, 2018
@alexcrichton alexcrichton added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed I-nominated S-waiting-on-team Status: Awaiting decision from the relevant subteam (see the T-<team> label). labels Jul 12, 2018
@alexcrichton
Copy link
Member

This was discussed briefly at @rust-lang/libs triage yesterday but the conclusion was that we're going to leave comments individually here if we feel so.

I personally would not want to merge this PR. I think that leveraging the already-stable to_le and to_be methods to work in combination with to_bytes and from_bytes (already implemented) is the best way to work with these methods. Given that to_le and to_be are already stable I wouldn't feel we need to stabilize to_le_bytes or to_be_bytes. The question then to me is whether we need to include "native" in the name. I feel that the fact that these are native-endianness is implied by the lack of other information in the method name as otherwise we'd specify big or little.

@alexcrichton
Copy link
Member

@rfcbot fcp close

Let's see what others think though!

@rfcbot
Copy link

rfcbot commented Jul 12, 2018

Team member @alexcrichton has proposed to close this. The next step is review by the rest of the tagged teams:

No concerns currently listed.

Once a majority of reviewers approve (and none object), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

@rfcbot rfcbot added proposed-final-comment-period Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off. disposition-close This PR / issue is in PFCP or FCP with a disposition to close it. labels Jul 12, 2018
@tbu-
Copy link
Contributor Author

tbu- commented Jul 13, 2018

The question then to me is whether we need to include "native" in the name. I feel that the fact that these are native-endianness is implied by the lack of other information in the method name as otherwise we'd specify big or little.

As you can see in this thread, the implied native isn't clear for everyone. In fact, I think one should actively think about whether one wants a big-endian, little-endian or native byte representation for the integer. Having to_le and to_be on the integers does not achieve this, as you can just omit calls to them and never notice. If you instead have the choice between to_le_bytes, to_be_bytes, to_native_bytes, it is very clear that you have to make a choice, and it's immediately documented, too. If you see to_bytes in the wild, you wouldn't really know whether the author intended to make the code platform-specific or just forgot to add to_be or to_le.

@SimonSapin
Copy link
Contributor

This alternative was discussed in #49792 and I mildly preferred the current design in order to avoid the multiplication of API surface. This PR proposes replacing two methods with six.

I don’t really mind changing, especially since to_bytes and from_bytes were only stabilized in this cycle (1.29) so we can still remove them. However the motivation based only on principle feels a bit weak.

@tbu-, do you feel that this change would prevent bugs? What do you think of the "intermediate" option of renaming the existing methods but not adding the be/le ones? (Keep documenting them as to be combined with to_le and to_be.)

@tbu-
Copy link
Contributor Author

tbu- commented Jul 13, 2018

@tbu-, do you feel that this change would prevent bugs?

Yes, this is the intention behind this change. As you can see above, we already caught one with the PR alone. :)

What do you think of the "intermediate" option of renaming the existing methods but not adding the be/le ones?

I think that would be worse than what we currently have, as to_be().to_native_bytes() looks wrong, as we're combining a big-endian function with a function operating on native endian.

I think the optimization based on number of functions is misguided.

In C, zero-terminated "strings" share the same type, whether they're encoded as UTF-8 or just treated as bag of bytes. Rust makes the distinction because it is useful to prevent errors, it makes sense to make the user cast between bytes and UTF-8-strings, even if this results in an enormous method duplication between byte slices and strings.

As shown with the Python example, other languages consider the endianness of the integers part of the encoding process, too.

@alexcrichton
Copy link
Member

I think I'm personally just not sold on this. I see the cost of six functions, some of which duplicate the functionality of existing functions, as not worth it. Dealing with endianness is inherently tricky and I don't feel like making all the methods longer and more wordy will really reduce the trickiness, but rather just make it less ergonomic to call when you already know which one you'd like.

@tbu-
Copy link
Contributor Author

tbu- commented Jul 14, 2018

I see the cost of six functions, some of which duplicate the functionality of existing functions, as not worth it.

I understand this part.

don't feel like making all the methods longer and more wordy will really reduce the trickiness, but rather just make it less ergonomic to call when you already know which one you'd like.

I don't really understand this part. It's

to_be().to_bytes()              → to_be_bytes()            [reduction by  5 chars]
to_le().to_bytes()              → to_le_bytes()            [reduction by  5 chars]
to_bytes()                      → to_native_bytes()        [increase  by  7 chars]
i32::from_be(i32::from_bytes()) → i32::from_be_bytes()     [reduction by 10 chars]
i32::from_le(i32::from_bytes()) → i32::from_le_bytes()     [reduction by 10 chars]
i32::from_bytes()               → i32::from_native_bytes() [increase  by  7 chars]

To me, it looks like it decreases the ergonomics of doing the platform-specific thing slightly but also increases the ergonomics of doing the platform-independent thing slightly, and the important part: It brings them to the same level. Currently, doing the platform-independent thing is slightly less ergonomic.

Rust wants to encourage good code, if we want to bring ergonomics of the methods into play, then I think an even playing field between platform-independent methods and platform-dependent methods is in line with Rust's objectives.

@SimonSapin
Copy link
Contributor

I feel that "native" in a name doesn’t really work on its own. It only sort of does in "native endian", and even then only by opposition to little/big endian.

What do you all think of transmute_to_bytes and transmute_from_bytes, to express that this is a type-level change without a memory representation change?

Also the more I think about it the more I’m becoming partial to having explicit {to,from}_{le,be}_bytes conversions. The {to,from}_{le,be} methods have the merit of already being stable, but they’re kinda weird in taking and returning the same type. In fact the implementation of to_be and from_be for example is identical! I understand that they’re intended to be thought about differently, and perhaps their use is self-documenting what direction of conversion is intended, but it feels like the type system would be better suited to this (even though adding a whole series of dedicated types like struct BigEndianU32(u32); is probably not worth it).

@alexcrichton
Copy link
Member

@tbu- to me ergonomics isn't literally how many characters you type but also how easy it is to remember idioms, and remembering that there's endian conversions on integers plus byte<-> int conversions means it's easy to remember how to compose. With everything it can be confusing about knowing which one was the one implemented and there's now a choice to be made of which to do in various circumstances.

@SimonSapin I think we could try to find a better name, yeah, but I'd reserve transmute for unsafe-like operations and would prefer to not have it here on a safe method.

@marshrayms
Copy link

I am today starting on my first Rust program. I need to read a binary file format. When reading and writing files or network data, the endianness is an inherent part of the format's or protocol's definition.

It seems far more complicated to me to first read an incorrect intermediate value using host endianness and then adjust it to the correct value with bit manipulations. Byte-swapping is a completely unnecessary complication when we can just read it correctly in the first place.

I would really appreciate it if conversions of the form 'u32::from_le_bytes([u8; 4])' were made available.

@pietroalbini pietroalbini added the S-waiting-on-team Status: Awaiting decision from the relevant subteam (see the T-<team> label). label Jul 23, 2018
The old issue has already been in FCP, a new issue was opened for the
new API.
@kennytm kennytm added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-team Status: Awaiting decision from the relevant subteam (see the T-<team> label). labels Aug 4, 2018
@tbu-
Copy link
Contributor Author

tbu- commented Aug 4, 2018

@SimonSapin Done.

@SimonSapin
Copy link
Contributor

@bors r+

Thanks!

@bors
Copy link
Contributor

bors commented Aug 4, 2018

📌 Commit 0ddfae5 has been approved by SimonSapin

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Aug 4, 2018
kennytm added a commit to kennytm/rust that referenced this pull request Aug 4, 2018
…Sapin

Provide `{to,from}_{ne,le,be}_bytes` functions on integers

If one doesn't view integers as containers of bytes, converting them to
bytes necessarily needs the specfication of encoding.

I think Rust is a language that wants to be explicit. The `to_bytes`
function is basically the opposite of that – it converts an integer into
the native byte representation, but there's no mention (in the function
name) of it being very much platform dependent. Therefore, I think it
would be better to replace that method by three methods, the explicit
`to_ne_bytes` ("native endian") which does the same thing and
`to_{le,be}_bytes` which return the little- resp. big-endian encoding.
@rust-highfive

This comment has been minimized.

bors added a commit that referenced this pull request Aug 4, 2018
Rollup of 14 pull requests

Successful merges:

 - #51919 (Provide `{to,from}_{ne,le,be}_bytes` functions on integers)
 - #52940 (Align 6-week cycle check with beta promotion instead of stable release.)
 - #52968 (App-lint-cability)
 - #52969 (rustbuild: fix local_rebuild)
 - #52995 (Remove unnecessary local in await! generator)
 - #52996 (RELEASES.md: fix the `hash_map::Entry::or_default` link)
 - #53001 (privacy: Fix an ICE in `path_is_private_type`)
 - #53003 (Stabilize --color and --error-format options in rustdoc)
 - #53022 (volatile operations docs: clarify that this does not help wrt. concurrency)
 - #53024 (Specify reentrancy gurantees of `Once::call_once`)
 - #53041 (Fix invalid code css rule)
 - #53047 (Make entire row of doc search results clickable)
 - #53050 (Make left column of rustdoc search results narrower)
 - #53062 (Remove redundant field names in structs)
@bors
Copy link
Contributor

bors commented Aug 4, 2018

⌛ Testing commit 0ddfae5 with merge 215bf3a...

@bors bors merged commit 0ddfae5 into rust-lang:master Aug 4, 2018
@bors
Copy link
Contributor

bors commented Aug 4, 2018

💥 Test timed out

@bors bors added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Aug 4, 2018
SimonSapin added a commit to SimonSapin/rust that referenced this pull request Aug 14, 2018
frewsxcv added a commit to frewsxcv/rust that referenced this pull request Aug 17, 2018
`{to,from}_{ne,le,be}_bytes` for unsigned integer types

Same as rust-lang#51919 did for signed integers.

Tracking issue: rust-lang#52963
@gamozolabs
Copy link

I would really like to see this be put into a trait rather than functions.

As a trait this could be implemented by a builtin derive which would make it work for structures composed of primitive types like this is being implemented on currently.

This would allow for simple creation of packed (or C structures by ignoring padding) structures in Rust (like an IP header or binary file header) and then the automatic conversion to and from the raw bytes on the wire/file.

I have a library to do this now and I use it in many of my projects. I think the ability to be able to handle reading and writing packed structures safely from bytestreams is critical enough to a systems language to justify it being put into the core language rather than being an external library. Especially given it would be almost no code increase to the Rust base (would be a procmacro), and would have no implications to programs unless they decide to use the feature.

@BurntSushi
Copy link
Member

@gamozolabs I think it would be appropriate to publish that library on crates.io, garner adoption and prove out the API. At that point, someone could consider writing an RFC, but I think it's probably immature to jump to that right now.

cswinter pushed a commit to cswinter/LocustDB that referenced this pull request Sep 18, 2018
* Fix typos and unused lifetimes
* Remove duplicated cast
* As rust-lang/rust#51919 changes function name, use to_ne_bytes instead. 
* Remove stabled feature flag
* Update toolchain to latest version
@jonhoo
Copy link
Contributor

jonhoo commented Jan 17, 2019

@BurntSushi will this affect byteorder at all?

@BurntSushi
Copy link
Member

@jonhoo That's a pretty broad question! Nothing immediately pops out at me. byteorder should continue to work fine?

@jonhoo
Copy link
Contributor

jonhoo commented Jan 17, 2019

Hehe, that's true. I guess I was more wondering whether byteorder will transition to use these methods instead of having its own implementation?

@BurntSushi
Copy link
Member

Hehe, that's true. I guess I was more wondering whether byteorder will transition to use these methods instead of having its own implementation?

Oh! Hah. That didn't cross my mind at all because byteorder is one of the most conservative crates I maintain, and I won't increase its MSRV for something like this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.