[Runtime] Reimplement protocol conformance cache with a hash table #33487

Conversation
what are the benefits of these bespoke concurrent data structures over ones in other projects like folly?

@michaeleisel There are the usual motivations, like minimizing external dependencies and making sure we know exactly what the code does. But the major reason is to have something that can be precisely tuned to our specific needs. Taking the example of Folly, its concurrent data structures have two fatal flaws for our purposes: their capacity is fixed at creation time, and they store full key/value pairs inline in the table, which wastes space and limits key types.

No doubt they're great for use cases that fit those constraints; this just isn't one of them. Overall, the goals are:

1. Fast lookups.
2. Low memory overhead.
3. Exploiting the specific constraints of this use case (for example, entries are never deleted).

With a custom implementation, we can take maximum advantage of 3 to do the best job we can for 1 and 2.
In Folly, there's AtomicUnorderedMap, AtomicHashMap, and AtomicHashArray (that I know of). They each have different characteristics, but AtomicUnorderedMap seems like a potentially good fit based on what you've stated. It supports arbitrary key types and fast wait-free reads (I don't see a discussion of memory overhead, though). It does not support deletion. The size is immutable, but perhaps rehashing could be done very infrequently. It may not end up being a good fit, but it would be good to at least benchmark against it if we go with new hash maps, to make sure that the newer solution is indeed faster. There's also its lower-level counterpart AtomicHashArray, which is meant to be used as a building block for higher-level concurrent hash map structures. Perhaps it could be used as a building block here. That having been said, it sounds very exciting that this is getting optimized.

I'm not sure how to work around a fixed size without basically reinventing something like the scheme I'm using here. These types also use an array of key/value pairs as their hash table, which is typical, but it means you waste more space for larger keys or values. This scheme uses small indexes for the hash table, with elements stored out of line, making for much less wasted space. See Concurrent.h:560 for more discussion of how that works.
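As a rough illustration of that index-plus-out-of-line-elements layout, here is a minimal sketch with placeholder names (this is not the actual Concurrent.h code; concurrency, growth, and insertion are omitted, and `matchesHash` stands in for real key comparison):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

template <class Element>
struct IndexedHashTable {
  static constexpr uint32_t Empty = 0;  // index 0 means "empty slot"
  std::vector<uint32_t> Indices;        // the hash table: 4 bytes per slot
  std::vector<Element> Elements;        // elements stored out of line, packed

  // Open-addressed lookup: an empty slot costs 4 bytes, not sizeof(Element).
  const Element *lookup(std::size_t hash) const {
    std::size_t mask = Indices.size() - 1;  // table size is a power of two
    for (std::size_t slot = hash & mask;; slot = (slot + 1) & mask) {
      uint32_t index = Indices[slot];
      if (index == Empty)
        return nullptr;                     // probe chain ended, no match
      const Element &element = Elements[index - 1];  // indices are 1-based
      if (element.matchesHash(hash))        // placeholder key comparison
        return &element;
    }
  }
};
```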
@swift-ci please benchmark

[Benchmark bot output: performance and code size tables for -O, -Osize, and -Onone, plus -swiftlibs code size and a hardware overview. How to read the data: the tables contain differences in performance larger than 8% and differences in code size larger than 1%; performance results (not code size) sometimes contain false noise.]
AtomicUnorderedMap supports unlimited insertion (albeit slower after a point). Or rehashing could be done, either blocking or concurrently (while some secondary hash is used for backup storage): a new hash table gets created, everything gets moved into it, and then the pointers get swapped. By only dealing with rehashing, and not the rest of the implementation, it would leverage their optimizations. I just want to be sure that development of this is done with the current state-of-the-art in mind, and that at least the optimizations of Folly (as one example of battle-tested data structures) are being considered. That 92% reduction on the protocol conformance benchmark looks awesome though. It may well be faster than simple approaches using Folly.
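A sketch of the rehash-and-swap idea being described (illustrative only; this is neither Folly's code nor this PR's, and the `Table` interface here is an assumption):

```cpp
#include <atomic>

// Writers are assumed to be serialized by an external lock.
template <class Table>
void rehashAndSwap(std::atomic<Table *> &current) {
  Table *old = current.load(std::memory_order_acquire);
  Table *bigger = new Table(old->capacity() * 2);
  for (auto &entry : *old)      // move everything into the new table
    bigger->insert(entry);
  current.store(bigger, std::memory_order_release);  // swap the pointer
  // `old` cannot be deleted yet: readers may still be using it. Safely
  // deferring that reclamation is exactly what the ReaderCount scheme
  // discussed later in this thread handles.
}
```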
(s/AtomicUnorderedMap/AtomicHashMap, although the caveat there is that it doesn't support arbitrary-sized keys. Another correction: s/unlimited/18x initial size.)
I like that you flush the cache instead of mucking with a generation count. That's a nice improvement. If that proves problematic for some case, you might consider walking all entries and clearing just the negative ones. That would require benchmarking to see if it's actually faster, of course, and would increase the code size, so it's probably not worth the effort unless there's demand.
@michaeleisel I totally understand, and I appreciate the questions! Not supporting larger keys would be a showstopper. Fixed size would just add some complication, as you say. I feel like it would be much more effort than it's worth, but I certainly wouldn't oppose looking into it to see how it does in practice. This doesn't have to be the end-all be-all of the journey!

@tbkka Yeah, I spent a bunch of time trying to get the generation count mechanism working in the new world (the lack of stable addresses for entries complicated things) before realizing I could just get rid of it, and that made everything so much simpler. I'd be really surprised if this approach caused any trouble. Loading libraries dynamically is so rare, and so slow for all sorts of other reasons, that I don't see how this could matter. But that's a good idea in case I'm wrong!
@michaeleisel As we continue to explore new target platforms, adding new libraries comes with a real cost, as each such library becomes an additional requirement on any new platform. We already rely heavily on LLVM data structures in the runtime, so it seems more prudent to continue evolving those rather than pull in a new library for this purpose.

@tbkka This appears to be largely a greenfield data structure, and does not use many LLVM things. And the open-source alternatives are always important to consider carefully, if only as a starting point for high-level thinking. But anyways, Mike Ash's approach seems good.
Got through about half of the code. I have a few comments, but only one that was substantive: I couldn't find where you actually put the old indices array onto the free list to be recycled.
@@ -907,7 +934,24 @@ class ReflectionContext

auto Root = getReader().readPointer(ConformancesAddr->getResolvedAddress(),
                                    sizeof(StoredPointer));
iterateConformanceTree(Root->getResolvedAddress().getAddressData(), Call);
auto RootAddr = Root->getResolvedAddress().getAddressData();
In transitional cases like this, I generally like to name things with the new behavior and then comment the old. That makes it easier when you delete the backwards support. So I might name this "ReaderCount" and then comment below that this location was previously used for the root node of the conformance cache tree.
(Force-pushed from a954e28 to b6a4345.)
Squashed commits and applied.
Nice!
@swift-ci please test

Build failed
…ance cache to use it.

ConcurrentReadableHashMap is lock-free for readers, with writers using a lock to ensure mutual exclusion amongst each other. The intent is to eventually replace all uses of ConcurrentMap with ConcurrentReadableHashMap.

ConcurrentReadableHashMap provides relatively quick lookups by using a hash table. Readers perform an atomic increment/decrement in order to inform writers that there are active readers. The design attempts to minimize wasted memory by storing the actual elements out of line, and having the table store indices into a separate array of elements.

The protocol conformance cache now uses ConcurrentReadableHashMap, which provides faster lookups and less memory use than the previous ConcurrentMap implementation. The previous implementation cached ProtocolConformanceDescriptors and extracted the WitnessTable after the cache lookup. The new implementation directly caches the WitnessTable, removing an extra step (potentially quite a slow one) from the fast path.

The previous implementation used a generational scheme to detect when negative cache entries became obsolete due to new dynamic libraries being loaded, and updated them in place. The new implementation just clears the entire cache when libraries are loaded, greatly simplifying the code and saving the memory needed to track the current generation in each negative cache entry. This means we need to re-cache all requested conformances after loading a dynamic library, but loading libraries at runtime is rare and slow anyway.

rdar://problem/67268325
(Force-pushed from b6a4345 to ecd6d4d.)
Whoops, push changes, THEN test.

@swift-ci please test

Build failed

Build failed

@swift-ci please test Windows platform

@swift-ci please test linux platform

Build failed

@swift-ci please test Linux platform

Build failed

@swift-ci please test Linux platform
Hi @mikeash – "ConcurrentReadableHashMapTest.MultiThreaded4" from this pull request is failing semi-reliably on my Linux box when there are background compiles going on. How can I help you debug this? What's the most stressful/unforgiving box you have to test this? Have you tested this with TSAN?
For reasons not worth going into, I don't have good luck running TSAN, but I was able to hack it together for this test. I hope this helps: https://znu.io/tsan.txt
I haven't studied this design closely, but based on the TSAN output, I found the following code that (at least superficially) is a classic time-of-check versus time-of-use race:
Map->incrementReaders();
}

~Snapshot() { Map->decrementReaders(); }
If `Snapshot` is truly a snapshot, then why is the map reader lock being held until the snapshot is destroyed? Shouldn't it be dropped at the end of the constructor?
The reader count isn't really a reader lock, as writes can still make progress when it's non-zero. It just informs writers whether it's safe to free old allocations. Outstanding snapshots may be pointing to old allocations, so it's not safe to free them until all outstanding snapshots have been destroyed.
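A minimal sketch of that reader-count protocol (member names and memory orderings here are assumptions, not the exact Concurrent.h code):

```cpp
#include <atomic>
#include <cstdint>

struct MapStorage {
  // Zero means "no snapshot alive"; writers only ever test for zero.
  std::atomic<uint32_t> ReaderCount{0};

  struct Snapshot {
    MapStorage *Map;
    // Also captures the current Indices/Elements pointers (omitted here).
    explicit Snapshot(MapStorage *map) : Map(map) {
      Map->ReaderCount.fetch_add(1, std::memory_order_acquire);
    }
    Snapshot(const Snapshot &other) : Map(other.Map) {
      // Safe even though members were copied first: `other` already holds
      // the count above zero, so writers can't free anything in between.
      Map->ReaderCount.fetch_add(1, std::memory_order_relaxed);
    }
    ~Snapshot() {
      // When this drops to zero, writers may free retired allocations.
      Map->ReaderCount.fetch_sub(1, std::memory_order_release);
    }
  };
};
```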
Snapshot(const Snapshot &other)
    : Map(other.Map), Indices(other.Indices), Elements(other.Elements),
      ElementCount(other.ElementCount) {
  Map->incrementReaders();
This seems out of order. Shouldn't the reader lock be acquired BEFORE initializing copies of the Map data?
The snapshot being copied is already holding the reader count above zero, so the order doesn't matter here. Writers only care about zero and non-zero.
@@ -541,6 +546,382 @@ template <class ElemTy> struct ConcurrentReadableArray {
  }
};

using llvm::hash_value;

/// A hash table that can be queried without taking any locks. Writes are still
This comment seems out of touch with the implementation which is clearly using a reader/writer lock strategy (as opposed to a truly lockless strategy like RCU).
See previous comment about how this is not really a reader lock, but just controlling whether old allocations get freed.
/// The number of readers currently active, equal to the number of snapshot
/// objects currently alive.
std::atomic<size_t> ReaderCount;
Unless something has dramatically changed, no operating system to date supports anywhere close to four billion threads. I think this can safely be a uint32_t for a long time :-)
That's a good point. I feel weird using atomic integers smaller than a pointer, but that's just me being silly. I could shrink this and the element count to save a word.
The more I look at this code, the more I think it should just be replaced with a dramatically simplified reader/writer spin lock and a …
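For comparison, a minimal sketch of such a fully locked alternative (pairing the suggested lock with `std::shared_mutex` and `std::unordered_map` is an assumption for illustration; readers would then contend on the lock, which the lock-free read path avoids):

```cpp
#include <shared_mutex>
#include <unordered_map>

template <class Key, class Value>
class LockedMap {
  mutable std::shared_mutex Mutex;
  std::unordered_map<Key, Value> Map;

public:
  bool lookup(const Key &key, Value &out) const {
    std::shared_lock<std::shared_mutex> guard(Mutex);  // shared: many readers
    auto it = Map.find(key);
    if (it == Map.end())
      return false;
    out = it->second;
    return true;
  }

  void insert(const Key &key, const Value &value) {
    std::unique_lock<std::shared_mutex> guard(Mutex);  // exclusive: one writer
    Map[key] = value;
  }
};
```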
(Why oh why doesn't GitHub support proper threaded replies on top-level comments?)

The caller holds the write lock, so the free list can't change in here. The danger with freeing items on the free list is if an outstanding snapshot holds a reference to one of those items. Thus, if `ReaderCount` is zero, it's safe to free them.

As you say, we then have a TOCTOU problem, except that there's no way for any new readers to get a reference to anything on the free lists. Because we hold the writer lock, the elements and indices are unchanging while we do this operation. A new snapshot will copy the current indices and elements, but it can't get any of the old ones.

As far as simplifying this with a more traditional locked solution, I think this is used often enough to justify the extra complexity (especially once I replace other uses of `ConcurrentMap`).
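That argument as a sketch (illustrative names, not the PR's actual code):

```cpp
#include <atomic>
#include <cstdint>
#include <cstdlib>
#include <vector>

struct HashMapStorage {
  std::atomic<uint32_t> ReaderCount{0};
  std::vector<void *> FreeList;  // retired Indices/Elements allocations

  // Precondition: the caller holds the writer lock, so no other writer can
  // touch FreeList while this runs.
  void deallocateFreeListIfSafe() {
    if (ReaderCount.load(std::memory_order_acquire) == 0) {
      // No snapshot is alive, and any snapshot created from now on copies
      // only the *current* pointers, never old ones, so the retired
      // allocations are unreachable and safe to free.
      for (void *allocation : FreeList)
        std::free(allocation);
      FreeList.clear();
    }
    // Otherwise, leave the free list for a later writer to reclaim.
  }
};
```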
I believe the failure is fixed by this new PR: https://github.com/apple/swift/pull/33595/files
With the improvements to protocol casting performance from swiftlang#33487, the demangling overhead is hopefully manageable for most people using this flag today.
Summary

The protocol conformance cache currently uses `ConcurrentMap` from `Concurrent.h`, which is a binary tree. The large amount of pointer chasing means that it performs less than optimally on modern hardware, and there's also a great deal of memory overhead from the left/right pointers. The entries themselves are only four words, so those child pointers add 50% overhead.
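To make the 50% figure concrete: each entry carries four words of payload, and the tree adds two words of left/right child pointers per entry, so

$$\frac{2\ \text{pointer words}}{4\ \text{payload words}} = 50\%\ \text{memory overhead.}$$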
New Hash Map Type

This PR adds a new type to `Concurrent.h`, `ConcurrentReadableHashMap`, which is intended to replace `ConcurrentMap`, and changes the protocol conformance cache to use it. I plan to replace other uses of `ConcurrentMap` as well; this is just the first place I decided to target.

The design of `ConcurrentReadableHashMap` is detailed in the comments above the implementation. The executive summary is: elements are stored out of line, and the hash table holds small indices into them, so empty table space costs `n/2 * 4` bytes instead of `n/2 * sizeof(Element)` bytes. I plan to make the code adaptive so that it can use indices of 1 or 2 bytes when the number of elements is small. Currently it always uses 4 bytes per index regardless of size.
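As a worked instance of that formula, suppose half of an $n$-slot table is empty and each element is 32 bytes (the four-word entry size that appears under Measurements below). With 4-byte indices, the empty slots cost

$$\frac{n}{2} \times 4 = 2n\ \text{bytes,}$$

whereas a conventional table storing elements inline would spend

$$\frac{n}{2} \times \mathrm{sizeof}(\mathrm{Element}) = \frac{n}{2} \times 32 = 16n\ \text{bytes}$$

on the same empty slots.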
Protocol Conformance Cache Updates

The protocol conformance cache is updated to use `ConcurrentReadableHashMap`. This was not quite a drop-in replacement, for a couple of reasons.

Negative cache entries (i.e. "type `T` does not conform to protocol `P`") can become invalid when a dynamic library containing a conformance is loaded at runtime. The existing code handles this by storing a `failureGeneration` on each cache entry. The global "failure generation" is increased when dynamic libraries are loaded, so readers can tell when a negative entry is obsolete. In that case, the readers will scan the new libraries and then update the entry with either a hit or a new `failureGeneration`.

This requires updating the cache entry in place, which is inconvenient for the new cache. It also requires 8 bytes (on 64-bit systems) per cache entry, which is 25% of the total entry size. Instead of using this scheme, the updated code clears the conformance cache when dynamic libraries are loaded. This will result in redoing a bunch of work for positive (i.e. "type `T` conforms to protocol `P`") entries, but dynamic library loading is extremely rare, so this is an acceptable tradeoff.

There's also a change to what the cache actually caches. Protocol conformance checks involve a two-stage lookup:

1. Find the `ProtocolConformanceDescriptor *` for the given type and protocol we're looking for.
2. Find the `WitnessTable *` for the type and conformance descriptor.

Step 2 can be fairly expensive, at least for conditional conformances. For reasons I don't fully understand (but which I'm sure were entirely reasonable at the time it was written... I think there was an attempt to cache based on type descriptors instead of metatype pointers, which would have reduced cache size, but it's not implemented now), the existing code puts the cache on step 1. The new code moves it to step 2, so that a cached conformance check is just a single hash table lookup and nothing else.
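A sketch of that shape (illustrative only: every type and helper below is a stand-in rather than the runtime's real API, and the real cache's concurrency is ignored):

```cpp
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <utility>

// Stand-ins for the runtime's real types.
struct Metadata;
struct ProtocolDescriptor;
struct WitnessTable;

using CacheKey = std::pair<const Metadata *, const ProtocolDescriptor *>;

struct CacheKeyHash {
  std::size_t operator()(const CacheKey &key) const {
    return std::hash<const void *>()(key.first) ^
           (std::hash<const void *>()(key.second) * 31);
  }
};

// Stand-ins for the two lookup stages (dummy bodies for the sketch).
const void *findConformanceDescriptor(const Metadata *,
                                      const ProtocolDescriptor *) {
  return nullptr;
}
const WitnessTable *getWitnessTable(const Metadata *, const void *) {
  return nullptr;
}

std::unordered_map<CacheKey, const WitnessTable *, CacheKeyHash> Cache;

const WitnessTable *conformsToProtocol(const Metadata *type,
                                       const ProtocolDescriptor *proto) {
  auto it = Cache.find({type, proto});
  if (it != Cache.end())
    return it->second;  // cache hit: a single hash lookup and nothing else

  // Step 1: find the conformance descriptor for this (type, protocol).
  const void *descriptor = findConformanceDescriptor(type, proto);
  // Step 2: get the witness table, potentially expensive for conditional
  // conformances. Caching *this* result is the change described above.
  const WitnessTable *witness =
      descriptor ? getWitnessTable(type, descriptor) : nullptr;
  Cache[{type, proto}] = witness;
  return witness;
}
```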
Step 1 was implemented with the runtime call `swift_conformsToSwiftProtocol`, and step 2 was implemented with `swift_conformsToProtocol`, which wraps `swift_conformsToSwiftProtocol` and then does the witness table lookup. `swift_conformsToSwiftProtocol` was not used by anything else, and this PR removes it entirely.

`swift_conformsToSwiftProtocol` was available as an override point in `CompatibilityOverrides.def`. This PR removes it from that file. Since it's still part of the overrides ABI for older Swift runtimes, this PR puts copies of `CompatibilityOverrides.def` into the directories for the compatibility libraries for Swift 5.0 and 5.1. Overrides are version-specific, so there's no need to maintain a common layout.

Measurements
The benchmark run is below. The dedicated protocol conformance benchmark shows substantial improvements, and some nice smaller improvements show up on various other benchmarks.

For memory usage, I used a modified version of the protocol conformance benchmark to look at a program with 3,000 cached conformances, then examined the memory. With the existing code, that produces 3,000 allocations of 48 bytes each, for a total of 144kB, plus the allocator overhead of 3,000 separate allocations. With the new implementation, we end up with a 96kB allocation for the elements array and 16kB for the indices, for a total of 112kB: a pretty decent memory win. Support for two-byte indices would save another 8kB and cut this down to 104kB, so that would be a useful future enhancement.
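Checking those numbers (the 4,096-slot count is inferred here from the 16kB indices allocation at 4 bytes per index):

$$3000 \times 48\,\text{B} = 144\,\text{kB (old)} \qquad 3000 \times 32\,\text{B} + 4096 \times 4\,\text{B} = 96\,\text{kB} + 16\,\text{kB} = 112\,\text{kB (new)}$$

With 2-byte indices, the second term drops to $4096 \times 2\,\text{B} = 8\,\text{kB}$, giving the 104kB figure mentioned above.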
rdar://problem/67268325