-
Notifications
You must be signed in to change notification settings - Fork 18k
hash: standardize the hash function #70471
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Related Issues
(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.) |
(The minimal answer without Seeds is definitely the wrong one, as you explained before giving it.) |
One approach is to define type Hasher[T any] interface {
Hash(T) uint64
Equal(T, T) bool
} A generic hash table with keys of type T will take a value of type Hasher[T], and invoke the methods of that value to compute hash values and to compare elements for equality. Then we can have // MakeHasher returns a Hasher[T] that invokes the hash and equal functions.
// Note that this does not use an explicit seed.
func MakeHasher[T any](hash(T) uint64, equal(T, T) bool) Hasher[T] { ... }
// MakeSeededHasher returns a Hasher[T] that uses a random seed.
func MakeSeededHasher[T any](hash(T, uint64), equal(T, T) bool) Hasher[T] {
return &seededHasher[T]{hash: hash, equal: equal, seed: randUint64()}
}
type seededHasher[T any] struct {
hash func(T, uint64)
equal func(T, T) bool
seed uint64
}
func (sh *seededHasher[T]) Hash(v T) uint64 {
return sh.hash(v, sh.seed)
}
func (sh *seededHasher[T]) Equal(a, b T) bool {
return sh.equal(a, b)
}
func (sh *seededHasher[T]) SetSeed(seed uint64) {
sh.seed = seed
}
// MakeMapHashHasher returns a Hasher[T] that uses [maphash.Comparable].
func MakeMapHashHasher[T comparable](seed maphash.Seed) Hasher[T] {
return mapHashHasher[T]{seed}
}
type mapHashHasher[T comparable] struct {
seed maphash.Seed
}
func (mhh mapHashHasher[T]) Hash(v T) uint64 {
return maphash.Comparable(mhh.seed, v)
}
func (mhh mapHashHasher[T]) Equal(a, b T) bool {
return a == b
} |
Indeed, I was hoping to confirm Cunningham's law.
This approach doesn't support a per-table seed. That said, I still don't really understand how even a per-table seed provides a defense against flooding unless it is passed through the hash function itself. If the attacker knows the hash function and can provide all the keys in the table, then it's not enough to transform the hashes at the end, since they will all be transformed in the same way, whatever that is, preserving collisions. It seems to me that the seed would need to be threaded through the hash function so that whether two keys' hashes collide is impossible to predict from the sequence of elements that are incorporated into the hash. |
I don't think that's super important. A per-process seed is probably good enough.
Yes, it must be passed through the hash function itself. Ian's proposed code does that. |
I believe that |
That is, I was thinking in terms of passing a |
TL;DR: Of the options I present below, option 3 feels like the best balance to me. Note that a lot of this discussion is.. uh.. rehashing #69559. I worry that @ianlancetaylor 's API proposed in #70471 (comment) (or one where a hash table is parameterized by the hasher type), makes it easier to write a non-seeded hash function than a seeded one. I'd much rather an API that makes it an obvious mistake to not seed the hash function. Also, I think it's important to compare what impact different API options would have on a custom hash table type. Otherwise we're operating in a vacuum. Setting aside for a moment the type of the seed, option 1 is an obvious extension of @adonovan 's API: type HashFunc[T any] func(Seed, T) uint64 This leads to a hash table API like: type EqFunc[T any] func(T, T) bool
type HashTable[T any] struct {
seed Seed
hash HashFunc[T]
eq EqFunc[T]
...
}
func NewHashTable[T any](hash HashFunc[T], eq EqFunc[T]) *HashTable @mateusz834 proposed this here. Option 1 requires the hash table to store three words for its hasher, and calls to the hasher and equal are indirect. Option 2 is to do this same transformation, but to @ianlancetaylor 's Hasher interface: type Hasher[T any] interface {
Hash(Seed, T) uint64
Equal(T, T) bool
}
// Example hash table
type HashTable[T any, Hash Hasher[T]] struct {
hash Hash // Probably zero-sized
seed Seed
...
} Option 2 most likely stores just the seed in the hash table, which is minimal, and calls to the hasher and equal functions will be direct. However, the type implementing Hasher will almost always be an empty struct, so we're in the land of pure type metaprogramming here, which isn't wrong but sure feels weird. This definitely suffers from the lack of type-type inference. Option 3 is to make the Hasher store the seed: type Hasher[T any] interface {
Hash(T) uint64
Equal(T, T) bool
Reset(Seed)
}
// Example hash table
type HashTable[T any, Hash Hasher[T]] struct {
hash Hash // Probably just struct { Seed }
...
}
func NewHashTable[T any, Hash Hasher[T]]() *HashTable[T, Hash] {
ht := &HashTable[T, Hash]{}
seed := MakeSeed()
ht.hash.Reset(seed)
...
} This is similar to a proposal by @Merovius. Option 3, like option 2, stores just the seed in the hash table, and calls to the hasher are direct, but suffers from the lack of type-type inference. However, it avoids pure metaprogramming. It feels slightly weird that the type implementing Hasher is mutable, but that's not too unusual. Option 4 avoids mutability by introducing a hash constructor function: type Hasher[T any] interface {
Hash(T) uint64
Equal(T, T) bool
}
type NewHasher[T any, Hash Hasher[T]] func(Seed) Hash
// Example hash table
type HashTable[T any, Hash Hasher[T]] struct{
h Hash
...
}
func NewHashTable[T any, Hash Hasher[T]](newHasher NewHasher[T, Hash]) *HashTable[T, Hash] {
seed := MakeSeed()
h := newHasher(seed)
return &HashTable[T, Hash]{h: h, ...}
} This is pretty similar to option 3, but it feels like it's getting heavy on the mechanism.
As for the Seed type, this is an interesting idea, though in practice I think a lot of hash functions would either want to ultimately pass the Seed on to maphash (so it's just opaque) or will want to get transform the seed into a uint64 or some bytes or something they can use directly. You could do that with the interface by passing, e.g., 0, 1, etc, but that feels not very obvious. Alternatively, the Seed could provide a few different ways to get a concrete seed value. E.g., it could have a Uint64 method to get it as a uint64, and a Bytes method to populate an arbitrarily large []byte. These methods would be idempotent: pure functions of the Seed's hidden value. And there would be no way to construct a Seed from any of these representations. If we want to support hash functions that derive something from Seed and use that instead of using Seed opaquely, I think we need to have something like |
I don't want to touch I think the advantage of @ianlancetaylor's design is that it makes the type of seed opaque to the hash table. I think that is a good thing. It gives maximum flexibility, while staying simple in usage. If we want to make it maximally simple to use safely, we could do something like type Hasher[T any] interface{
Hash(T) uint64
Equal(T, T) bool
}
// exported type, to optionally avoid indirection.
type MapHashHasher[T comparable] struct {
Seed maphash.Seed
}
func (h *MapHashHasher[T]) Reset() { h.Seed = maphash.MakeSeed() }
func (h MapHashHasher[T]) Hash(v T) uint64 { return maphash.Comparable(h.Seed, v) }
func (h MapHashHasher[T]) Equal(a, b T) bool { return a == b }
func MakeMapHashHasher[T comparable]() MapHashHasher { return MapHashHasher[T]{maphash.MakeSeed()} }
type FuncHasher[T, Seed any] struct {
hash(Seed, T) uint64
equal(T, T) bool
seed Seed
}
func MakeFuncHasher[T, Seed any](hash func(Seed, T) uint64, equal func(T, T) bool, seed Seed) FuncHasher {
return FuncHasher[T, Seed]{hash, equal, seed}
}
func (h FuncHasher) Hash(v T) uint64 { return h.hash(h.seed, v) }
func (h FuncHasher) Equal(a, b T) bool { return h.equal(a, b) } To be fair, this still makes it easy to construct a badly seeded hasher for composite types. There are two ways to deal with that:
With this, a hashtable API can decide itself whether it wants to parameterize over I think ultimately, the |
Option 4 seems like the right answer for a hash table with custom equality: it needs Hash(T) and Equal(T, T), and the underlying Hash should absolutely be seeded, but you can just instantiate each hash table with a pre-seeded Hasher. It seems like the interface belongs more in the hash-table package than it does in package hash, which is not really about hash tables. It would be more like sort.Interface defining the interface it needs. So maybe we have the first part of the hash table package: the interface it will require for processing type T. |
This proposal has been added to the active column of the proposals project |
@Merovius , I worry that entirely taking any seed out of Hasher like in your proposed type Hasher[T any] interface{
Hash(T) uint64
Equal(T, T) bool
} makes it far too tempting for people to write bad hash functions that either aren't seeded at all, or don't use a per-instance hash. I'd much rather make it hard for people to ignore seeding.
I had another thought on this. What if we forced hashers to take a |
I'm not sure I understand this argument. Hashes are useful for lots of things, including other implementations hash tables and data types that are totally unlike hash tables. To me, this seems philosophically more aligned with |
maphash.Hash doesn't support integers and floats, only strings. Perhaps it should support all the elementary types, both for efficiency, and to avoid mistakes coercing a number-shaped peg into a string-shaped hole. |
Thanks, I had forgotten about that. |
If anything, it should probably take a |
To add to this, in the absence of clear documentation, I think it is a bit unclear where this interface should be implemented. i.e., I think some people would mistakenly implement this directly on the type you want to hash: type Foo struct { ... }
func (f Foo) Hash(f2 Foo) {}
func (f Foo) Equal(f1, f2 Foo) {} Hopefully the extra argument (why does Hash have an argument and receiver?) would clue people in. Still, if someone did this, then they would almost certainly not have a per-instance seed as that doesn't really make sense in this context at all. |
That's a great point about recursive hashing. To reiterate what I believe @Merovius is getting at: hashes are often built up recursively, mirroring the type hierarchy. If custom hashers take a type T1 struct { a string; b T2 }
type T2 struct { c string }
type H1 struct {} // T1's hasher
func (H1) Hash(seed maphash.Seed, val T1) uint64 {
var mh maphash.Hash // Turn seed into a maphash
mh.SetSeed(seed)
mh.WriteString(val.a)
// We pass the same seed because we have no other choice,
// and then fold the uint64 into our maphash.
maphash.WriteComparable(&mh, H2{}.Hash(seed, val.b))
return mh.Uint64()
}
type H2 struct {} // T2's hasher
func (H2) Hash(seed maphash.Seed, val T2) uint64 {
var mh maphash.Hash // Turn seed into a maphash
mh.SetSeed(seed)
mh.WriteString(val.c)
return mh.Uint64() // Turn maphash into a uint64
} It would be better to avoid all of this back and forth and just pass a @aclements option 2bModifying my "option 2" from above to take a type Hasher[T any] interface {
Hash(*maphash.Hash, T)
Equal(T, T) bool
}
// Example hash table
type HashTable[T any, Hash Hasher[T]] struct {
hash Hash // Probably zero-sized
seed maphash.Seed
...
} With this, the equivalent of my above code example looks like: type T1 struct { a string; b T2 }
type T2 struct { c string }
type H1 struct {} // T1's hasher
func (H1) Hash(mh *maphash.Hash, val T1) {
mh.WriteString(val.a)
H2{}.Hash(mh, val.b)
}
type H2 struct {} // T2's hasher
func (H2) Hash(mh *maphash.Hash, val T2) {
mh.WriteString(val.c)
} This seems much nicer to me! Unfortunately, today, this would force the I don't believe this is a fundamental limitation. I spent a while banging out potential solutions with @cherrymui, and we came up with a few options around either including escape information in the GC shape used for stenciling, or doing a runtime form of "conditional escape" by putting escape information into the generic dictionary. None of these solutions are easy, but they don't seem super hard, and fixing this would lift a lot of boats. I am loathe to warp basic interfaces like this to the current limitations on our tools, especially when we see a path to fixing those limitations. Though I also recognize that it's frustrating to block one thing on another. @aclements option 3bI don't think passing the type Hasher[T any] interface {
Hash(T) uint64
Equal(T, T) bool
Reset(*maphash.Hash)
} Here, @aclements option 4b
There are some variations on this, but no matter what the 152 byte @Merovius parameterized Seed optionI think @Merovius 's parameterized Seed runs into a lot of similar problems. There, it's up to the Hasher to store either a If you implement a hash table on this, all of this seed state has to persist as long as the hash table itself. Even if the Hasher stores just a I'm not positive, but I think @Merovius 's solution does avoid the escape problem. If a hash table type is parameterized over the Hasher type, then we will stencil that type definition. I think we might still consider the method calls to escape their receiver, but that would just force a one-time heap allocation of the whole hash table object, not a per-get/put heap allocation. |
I like option 2b; I don't mind waiting for the compiler to catch up before we can do it. This approach will elevate the rather obscure I agree that 3b (Reset) introduces state where it needn't exist. How immovable an obstacle is the 152-byte size of the maphash? Is a buffer smaller than 128 bytes much less efficient? Or is the point that rthash in two chunks delivers a different result from calling it in one chunk? |
I believe the smallest possible buffer is sizeof uintptr. That would still leave us with 32 bytes, and would certainly be much less efficient, though I didn't see a benchmark specifically for this. |
If we have an interface like option 2b, type Default[T comparable] struct {}
func (Default[T]) Hash(h *Hash, v T) {
WriteComparable(h, v)
}
func (Default[T]) Equal(a, b T) bool {
return a == b
} |
@adonovan is going to try writing a CL showing how |
I am thinking whether type Hasher[T,T2 any] interface {
Hash(hash *maphash.Hash, value T)
Equal(T, T2) bool
} There is a concept of Zig folks somehow managed to use To be clear i do not feel like this kind of stuff (with Adapted HashMaps) fits for Go (or at least in the std). But having a I am still unsure about this, but leaving this here. |
Interesting. Zig's hash table apparently allows the client to provide distinct hash/equal operator pairs for use by the "insert" and "get" operations. It's not clear to me whether this is a necessity of Zig's type system (const vs nonconst access to the map data), or a performance hack, perhaps allowing the client to choose a different representation for the key as it is represented within the map. Presumably it also allows you to exploit the fact that key has a bounded lifetime in Get but may escape in Insert, which is often a bottleneck in Go; maphash.Comparable causes its pointer argument to escape. This all seems excessively complex to me. |
It's a performance hack (I think you can call it a Data Oriented Design approach, ziglang/zig#8619)
True, it is really hard to reason about what is happening with such uses of hash maps. |
No change in consensus, so accepted. 🎉 The proposal is to add the following API definition to package // A Hasher is a type that implements hashing and equality for type T.
//
// A Hasher must be stateless. Hence, typically, a Hasher will be an empty struct.
type Hasher[T any] interface {
// Hash updates hash to reflect the contents of value.
//
// If two values are [Equal], they must also Hash the same.
// Specifically, if Equal(a, b) is true, then Hash(h, a) and Hash(h, b)
// must write identical streams to h.
Hash(hash *maphash.Hash, value T)
Equal(T, T) bool
} As an example, some simple hashers could look like: type Strings []string
type StringsHasher struct{} // Hasher for Strings
func (StringsHasher) Hash(mh *maphash.Hash, val Strings) {
for _, s := range val {
mh.WriteString(s)
}
}
func (StringsHasher) Equal(a, b Strings) bool {
return slices.Equal(a, b)
}
type Thing struct {
ss Strings
i int
}
type ThingHasher struct{} // Hasher for Thing
func (ThingHasher) Hash(mh *maphash.Hash, val Thing) {
StringsHasher{}.Hash(mh, val.ss)
maphash.WriteComparable(mh, val.i)
}
func (ThingHasher) Equal(a, b Thing) bool {
if a.i != b.i {
return false
}
return StringsHasher{}.Equal(a.ss, b.ss)
} A simple custom hash-based could use this interface as follows: // Example hash-based set
type Set[T any, Hash Hasher[T]] struct {
hash Hash // Probably zero-sized
seed maphash.Seed
data []T
// ...
}
func (s *Set[T, Hash]) Has(val T) bool {
var mh maphash.Hash
mh.SetSeed(s.seed)
s.hash.Hash(&mh, val)
offset := mh.Sum64()
for base := range s.data {
i := (uint64(base) + offset) % uint64(len(s.data))
// ... break if this slot is empty ...
if s.hash.Equal(val, s.data[i]) {
return true
}
}
return false
}
var s Set[Thing, ThingHasher] |
Over on #69559, @mateusz834 pointed out that we didn't specify whether the zero value of a Hasher must be usable, which leads to ambiguity when designing APIs around it because it's not clear if the caller must provide a Hasher value or if the type alone is always sufficient. I think we should specify that the zero value of a Hasher must be usable. Given that Hashers must be stateless and this will almost always (possibly always always) be an empty struct, I think this is a reasonable requirement. Requiring this means that APIs can work solely with the type of a Hasher (e.g., as a type parameter) without having to provide a mechanism for setting the Hasher's value. |
There are essentially three choices for a hash table:
The downsides of option 3, however, are that: All three of these are I think instances of what @ianlancetaylor calls "programming with types". I am reminded of C++'s STL, in which the type of string is This raises a potential solution: define, along with Hash and hash.Map[K,V,H], an abstract Map[K,V] interface, so that most users will refer to hash.Map via the abstract type and pay the additional costs, but motivated performance-sensitive clients can use Map[K,V,H] directly. I'm not particularly enthusiastic about it though: we should be driven by actual use cases, and my primary motivation is where K=types.Type and the Type has already escaped. If others have different needs, they should speak up. |
Why? You can still use the |
I wonder whether a type-alias could help here somehow: type MapStatic[K, V any, H Hasher[K]] struct{}
type Map[K, V any] = MapStatic[K, V, Hasher[K]] // uses interface type |
You're right; sorry.
It doesn't solve the escape problem: the alias is just a short way of writing Of course, you can always define a specific alias for one particular hasher of interest. In my case this is types.Type and its hash function:
but then you can't pass a |
Yes, i didn't mean to solve that with this, it is just a idea that we can have two map types without having two separate structs.
Nevertheless an interfaces solves the same problem, but now as i think about that, the type-alias approach allows us to add |
I agree that in practice the key type will have already escaped for most uses of |
Only the byte array under the slice pointer will escape, since the slice header is copied during One interesting thing is that if |
@adonovan Option 1 also eliminates dynamic calls and doesn't require you to carry around the hash as a type parameter. I think the only reason that we are predisposed against that is that it requires you to define a new type to override the hash (and as a special case, it doesn't work with predeclared types). Personally, I'm not convinced that downside outweighs the downsides of the alternative approaches. |
It's true that it doesn't require you to specify the hash as a additional type parameter, but it does require you to encode the hash in the existing type parameter K. So, for types.Type, you need to declare a wrapper type HashableType that essentially combines the concepts of K=types.Type and H=types.Hash, and then explicitly wrap/unwrap this type around every operation: type Hashable[K any] interface {
Hash(*maphash.Hash)
Equals(K) bool
}
type Map[K Hashable[K], V any] struct{}
func (m *Map[K, V]) Get(k K) (_ V) { k.Hash(nil); k.Equals(k); return }
type HashableType struct{ t types.Type }
func (ht HashableType) Hash(*maphash.Hash) { types.Hash(ht.t) }
func (x HashableType) Equals(y HashableType) bool { return types.Identical(x.t, y.t) }
func main() {
var m Map[HashableType, int]
var t types.Type
_ = m.Get(HashableType{t})
} So it's good to explore it for completeness, but I don't think it has any real advantage over option 3. |
I like that option 1 encapsulates everything that is related to keys and their comparison in a single type parameter:
A few questions came to my mind related to both option 1 & 3:
|
Also, another afterthought: While equally powerful on a technical level, the two options will likely tend to steer API design in a certain direction (for types that you control). Option 1:
Option 3:
|
@adonovan I think the critical part of that example is that For everyday interfaces (like
The advantages are IMO pretty real - in the common case. They don't work for a relatively special case like |
I was originally in favor of option 1 due to its simplicity, but am now in favor of option 3 because it avoids boilerplate for a likely very common case that hasn't been mentioned yet AFAICT: Given some library-defined data structure that requires a Option 1 would instead require users to either write boilerplate It therefore seems to me that option 3 is likely preferable. It still makes sense to also provide a |
I'm putting this back on the docket. For option 3 (using maphash.Hasher as a dynamic interface), I'm not so concerned about the key escaping. I'm concerned that the Option 1 (a type is its own hasher) has a lot of appeal. That's how lots of other languages do this. Lots of types are only going to have one reasonable definition of equality, in which case it's better to bundle it with the type than to require the user to find some second type that goes with the first type. However, it requires a "wrapper" type in three situations:
I think we've been pursuing a separate hasher type because we haven't had this until now, so it seems likely people will want custom equality for a lot of types they don't control, and because this version of this discussion was driven by hashing go/type.Type, which is both an interface type and has more than one equivalence relation. My sense is that problem 1 will fade over time, and that problems 2 and 3 will be relatively rare. Option 1 requires a different API, like what @adonovan gave above: // Hashable types provide a custom equality predicate and accompanying hash function.
type Hashable[T any] interface {
// Hash updates hash to reflect the contents of this value.
//
// If two values are [Equal], they must also Hash the same.
// Specifically, if Equal(a, b) is true, then Hash(h, a) and Hash(h, b)
// must write identical streams to h.
Hash(hash *maphash.Hash)
Equal(T) bool
} Updating my example from above, a Hashable type would look like: type Strings []string
func (ss Strings) Hash(mh *maphash.Hash) {
for _, s := range ss {
mh.WriteString(s)
}
}
func (ss Strings) Equal(b Strings) bool {
return slices.Equal(ss, b)
}
type Thing struct {
ss Strings
i int
}
func (t Thing) Hash(mh *maphash.Hash) {
t.ss.Hash(mh)
maphash.WriteComparable(mh, t.i)
}
func (t Thing) Equal(b Thing) bool {
return t.i == b.i && t.ss.Equal(b.ss)
} And an example hash-based collection would look like: // Example hash-based set
type Set[T Hashable[T]] struct {
seed maphash.Seed
data []T
// ...
}
func (s *Set[T]) Has(val T) bool {
var mh maphash.Hash
mh.SetSeed(s.seed)
val.Hash(&mh)
offset := mh.Sum64()
for base := range s.data {
i := (uint64(base) + offset) % uint64(len(s.data))
// ... break if this slot is empty ...
if val.Equal(s.data[i]) {
return true
}
}
return false
}
var s Set[Thing] |
This proposal has been added to the active column of the proposals project |
Background: Issue #69420 proposes a
types.Hash
function that provides a hash operator fortypes.Type
values that is consistent with the equivalence relation for types defined bytypes.Identical
. The only real question in that proposal is: what is to be Go's future convention for the signature of a custom hash function?The naive answer is
func(T) uint64
, where T stands fortypes.Type
. However, hash tables with only a single hash function are vulnerable to a "hash flooding" denial-of-service attack, in which an adversary provides many inputs that all hash to the same value, causing the table to degenerate into a linked list. The defense is for each process (or particular hash table) to select a different hash function at random so that the values cannot be predicted. Thehash/maphash
package embodies this approach: a random Seed value determines the hash functions (Hash.Strings
andHash.Bytes
), which depend on the seed's opaque random value. (maphash does not provide a Seed-dependent way to hash integers.)So, perhaps custom hash functions should accept a
maphash.Seed
parameter. And perhaps Seed should also provide methods for hashing integers.Proposal: In the hash package, define function types HashFunc and EqFunc that document the conventions for Go hash tables.
Here is the minimal answer, without Seeds:
@ianlancetaylor @griesemer
The text was updated successfully, but these errors were encountered: