-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
hash: add Hasher, Seed for hashing user data #28322
Comments
Related to #21195. |
It seems to me this could easily be an external package. There are already many good string hashing functions available with slightly different tradeoffs: strength, speed for short strings, speed for long strings, etc. (I have a few benchmarks here: https://github.com/dgryski/trifles/blob/master/hashbench/bench_test.go#L10 ) A "one-size-fits-all" solution is nice, but I think this ends up being a sufficiently rare it shouldn't be in the standard library. That said, packaging a copy of the runtime's string hash function into /x/ somewhere might be an idea. |
This is not the right API. All hash functions must take an initial seed so that they can be chained and produce a different hash function for different seeds. The current proposed API, with no input seed, can only be used to build bad hash tables. (To avoid engineered collisions aka hash-flooding it is important that when you grow a hash table you also completely change the hash function, which you can do by having a different initial seed for the hash inputs.) On top of that we'd also want to vary the hash calculation, even for a given input seed value, for each process, to avoid people writing code dependent on the exact hash implementation. More API design is probably needed here. It might work to copy this to an external package and simply have two of the code. On the other hand this functionality is very much exporting what's in the runtime (unexported), so that would weigh toward being in the standard library. The right answer may depend on how much API we are talking about. |
Well, adding a seed parameter is trivial, and corresponds to the underlying implementation:
|
Since Per discussion with @golang/proposal-review |
Does anyone have time to work out a concrete API here? Or should we put this on hold? |
Does the API need to be more complex that this?
|
Is the compiler intelligent enough today to avoid allocating when doing: var b []byte = ...
h := hash.String(string(b), seed) such that we can avoid the bytes version? |
Also probably want to make it clear that different processes even with the same seed may produce different hashes. |
Also should probably be uint64; uint32 is likely not good enough. It's unclear to me if the API should be limited to a single function like this or if we want to define a Hasher object that you can feed strings and other basic types into. |
A stateful hasher object that satisfies hash.Hash64 is a good idea: it makes the state parameter invisible in the common case, it presents an API consistent with the other hash/... packages, and it allows you to write non-strings to it using Fprintf. |
Change https://golang.org/cl/155118 mentions this issue: |
Somewhat related: during some recent work I had a need for a hash function for arbitrary reflect.Values that is consistent with the Go equivalence relation for those values, and I don't think it's possible to achieve in Go today. It's possible to test reflect.Values x and y for equivalence using x.Interface()==y.Interface(), but there's no way to obtain the values' hashes. I wonder if the hash API proposed in this issue should instead allow arbitrary Go values to be hashed in a manner consistent with equals? As I said, it's only somewhat related; this issue is primarily about performance, not consistency, but perhaps there's an opportunity to kill two stones with one bird. |
Hashing arbitrary values is currently implemented with code generation in the compiler when it detects that a type requires a hash function. That obviously won’t work for an arbitrary value inside an interface or reflect.Value. We’d have to simulate instead. That suggests different API to me—something like reflect.Hash that accepts a reflect.Value (and a seed?). |
Ian, Robert, Andy, and I chatted through this and sketched an API, below.
That would lead to an API something like:
Robert says he could use this for some of the generics experiments. Alan, does this make sense for you? /cc @alandonovan |
Thanks, that all makes sense. I assume Add would permit any value that is a valid map key, including pointers, and panic otherwise? |
Hasher.SetSeed would need to disallow the zero Seed value to ensure that users can't choose a specific seed. |
Is the call pattern part of the hash? That is, is Presumably What does Does |
I see good reasons for both ways:
|
For the compound key instance, the traditional solution is to include a separator in the hash calculation: |
Yes?
No? I was thinking the call sequence matters, so you can just shove all the fields from a struct into the Hasher one at a time without worrying about {Name: "abc", Value: "def"} and {Name: "abcde", Value: "f"} having the same hash.
Hadn't thought through that. Not sure. What do you think?
Yes.
Yes.
Yes. |
I think we should define that
behave identically. Similarly,
behave identically. Given that a The more I think about the generic I like the ease of use provided by including the call sequence in the hash. All of |
@alandonovan thoughts on @randall77's comments? |
Re: call pattern. It's straightforward to implement a hasher that consumes a stream of values if you have a hasher that consumes a sequence of bytes, but the reverse is not the case, so I think the call pattern should not be significant to the hash, even if that may be less convenient. Re: generic add. We would not be specifying how structs are hashed, only that structs that compare equal hash the same. This leaves room for any specific implementation, of which the runtime alg is the obvious choice. We would of course be revealing the specific values, and that means some people will write programs that behave differently when the hash function changes. Re: seed and reset. In a hash table, each table would need to create and save a new seed (ideally without an additional call to malloc), then each lookup operation would create its own local hasher (again ideally avoiding malloc), initialize it from the table's seed and hash the key. There's no call for Reset in this scenario. What's the value of Reset if NewHasher doesn't heap-allocate? |
@alandonovan I think the value of Reset is that you avoid having to manage the seed yourself. For a typical one-goroutine-at-a-time data structure, one hasher is fine, probably declared in your top-level data structure, and you just call Reset between uses. I'm inclined to agree with Keith about leaving Add(interface{}) out at the start and have only AddBytes, AddString, AddFloat64, AddInt64, AddUint64. We can add the general interface{} case later once we understand why it is necessary. |
Thinking more about this, I'd say that providing a way to perform allocation-free/non-one-shot hashing of a (potentially non-fully-in-memory) stream of data is a good enough use case to put up with the inconvenience of having to manually delimit chunks. |
Sounds good to me. As far as semantics, I think we converged on the fact that any sequence of I presume It might be nice to be able to serialize/deserialize a |
...or construct a |
Int/Uint/Float are not necessary but having to encoding/binary them just to get bytes to feed to the hash seems like a lot of effort that is not going to yield a fast hash. I agree about @randall77's clarifications and added them above. It sounds like we can accept this. |
Does anyone want to pick this up for Go 1.14? |
I'm sketching this out with @randall77 and @alandonovan. |
Re: @randall77 's comment
This implies that any []byte or string can be added by adding each individual byte separately. Perhaps there should be an |
Change https://golang.org/cl/186877 mentions this issue: |
Fixes golang#28322 R=go1.14 RELNOTE=yes Change-Id: Ic29f8b587c8c77472260836a5c3e13edaded13fa Reviewed-on: https://go-review.googlesource.com/c/go/+/186877 Reviewed-by: Alan Donovan <[email protected]>
Fixes golang#28322 R=go1.14 RELNOTE=yes Change-Id: Ic29f8b587c8c77472260836a5c3e13edaded13fa Reviewed-on: https://go-review.googlesource.com/c/go/+/186877 Reviewed-by: Alan Donovan <[email protected]>
CL 186877 added Edit: @randall77 mentioned the ability to create a Seed from an int64 above actually (#28322 (comment)), but I read that as a suggestion to add |
Be sure to document the fact that unlike other Go 1.x functions, hash behavior is expected to dramatically change in backwards incompatible ways with different Go versions. Any applications depending on consistent hashing are advised to reimplement hashing in a more stable API. |
https://tip.golang.org/pkg/hash/maphash/#Seed currently warns "Each Seed value is local to a single process and cannot be serialized or otherwise recreated in a different process." If even two different processes running the same version of Go can't use the same Seed values, then there doesn't seem to be any need to explain that two different processes running different Go versions can't either. |
Change https://golang.org/cl/217099 mentions this issue: |
Update #36878 Update #28322 Change-Id: I793c7c4dbdd23fdecd715500e90b7cc0cbe4cea5 Reviewed-on: https://go-review.googlesource.com/c/go/+/217099 Reviewed-by: Ian Lance Taylor <[email protected]>
Change https://golang.org/cl/219877 mentions this issue: |
Given that it's a package that did not exist before, was a proposal in issue #28322, got accepted and implemented for 1.14, it seems to be more than a minor change to the library. Highlight it accordingly. Also specify the results are 64-bit integers, as done in CL 219340. Updates #36878 Updates #28322 Change-Id: Idefe63d4c47a02cdcf8be8ab08c40cdb94ff2098 Reviewed-on: https://go-review.googlesource.com/c/go/+/219877 Run-TryBot: Dmitri Shuralyov <[email protected]> Reviewed-by: Toshihiro Shiino <[email protected]> Reviewed-by: Ian Lance Taylor <[email protected]> TryBot-Result: Gobot Gobot <[email protected]>
Change https://golang.org/cl/220518 mentions this issue: |
Updates golang/go#28322. Change-Id: I025b9a9de1273c9a81f0e2d00e890337ed21a59b Reviewed-on: https://go-review.googlesource.com/c/build/+/220518 Run-TryBot: Dmitri Shuralyov <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Keith Randall <[email protected]>
[See #28322 (comment) below for accepted proposal.]
Some applications need an efficient non-cryptographic hash function for strings. The Go runtime contains such a function, hashString, that is optimized to use AESENC instructions on amd64, but it is not exported. The implementation is non-trivial, so clients are loathe to copy it and will tend to use the //linkname hack instead.
In the same way that math/bits exports Go functions that use the best available implementation of a given bit-twiddling function, it would be useful for a package such as
hash
to export a decent string hash function. The implementation should include a per-process random salt to prevent users from assuming that hashes are stable across process restarts, which has been a problem for hash libraries in other languages.If we can agree where this belongs, I'll send the CL.
The text was updated successfully, but these errors were encountered: