[Runtime] Reimplement protocol conformance cache with a hash table #33487

Conversation
what are the benefits of these bespoke concurrent data structures over ones in other projects like folly?

@michaeleisel There are the usual motivations, like minimizing external dependencies and making sure we know exactly what the code does. But the major reason is to have something that can be precisely tuned to our specific needs. Taking the example of Folly, its concurrent data structures have two fatal flaws for our purposes: their capacity is fixed at creation time, and they store full key/value pairs inline in the table, which wastes space and limits key types.

No doubt they're great for use cases that fit those constraints; this just isn't one of them. Overall, the goals are:

1. Fast lookups.
2. Low memory overhead.
3. Exploiting the specific constraints of this use case (for example, entries are never deleted).

With a custom implementation, we can take maximum advantage of 3 to do the best job we can for 1 and 2.
In Folly, there's AtomicUnorderedMap, AtomicHashMap, and AtomicHashArray (that I know of). They each have different characteristics, but AtomicUnorderedMap seems like a potentially good fit based on what you've stated. It supports arbitrary key types and fast wait-free reads (I don't see a discussion of memory overhead, though). It does not support deletion. The size is immutable, but perhaps rehashing could be done very infrequently. It may not end up being a good fit, but it would be good to at least benchmark against it if we go with new hash maps, to make sure that the newer solution is indeed faster. There's also its lower-level counterpart AtomicHashArray, which is meant to be used as a building block for higher-level concurrent hash map structures. Perhaps it could be used as a building block here. That having been said, it sounds very exciting that this is getting optimized.

I'm not sure how to work around a fixed size without basically reinventing something like the scheme I'm using here. These types also use an array of key/value pairs as their hash table, which is typical, but it means you waste more space for larger keys or values. This scheme uses small indexes for the hash table, with elements stored out of line, making for much less wasted space. See Concurrent.h:560 for more discussion of how that works.
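As a rough illustration of that index-plus-out-of-line-elements layout, here is a minimal sketch with placeholder names (this is not the actual Concurrent.h code; concurrency, growth, and insertion are omitted, and `matchesHash` stands in for real key comparison):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

template <class Element>
struct IndexedHashTable {
  static constexpr uint32_t Empty = 0;  // index 0 means "empty slot"
  std::vector<uint32_t> Indices;        // the hash table: 4 bytes per slot
  std::vector<Element> Elements;        // elements stored out of line, packed

  // Open-addressed lookup: an empty slot costs 4 bytes, not sizeof(Element).
  const Element *lookup(std::size_t hash) const {
    std::size_t mask = Indices.size() - 1;  // table size is a power of two
    for (std::size_t slot = hash & mask;; slot = (slot + 1) & mask) {
      uint32_t index = Indices[slot];
      if (index == Empty)
        return nullptr;                     // probe chain ended, no match
      const Element &element = Elements[index - 1];  // indices are 1-based
      if (element.matchesHash(hash))        // placeholder key comparison
        return &element;
    }
  }
};
```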
@swift-ci please benchmark

[Benchmark bot output: performance and code size tables for -O, -Osize, and -Onone, plus -swiftlibs code size and a hardware overview. How to read the data: the tables contain differences in performance larger than 8% and differences in code size larger than 1%; performance results (not code size) sometimes contain false noise.]
AtomicUnorderedMap supports unlimited insertion (albeit slower after a point). Or rehashing could be done, either blocking or concurrently (while some secondary hash is used for backup storage): a new hash table gets created, everything gets moved into it, and then the pointers get swapped. By only dealing with rehashing, and not the rest of the implementation, it would leverage their optimizations. I just want to be sure that development of this is done with the current state-of-the-art in mind, and that at least the optimizations of Folly (as one example of battle-tested data structures) are being considered. That 92% reduction on the protocol conformance benchmark looks awesome though. It may well be faster than simple approaches using Folly.
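A sketch of the rehash-and-swap idea being described (illustrative only; this is neither Folly's code nor this PR's, and the `Table` interface here is an assumption):

```cpp
#include <atomic>

// Writers are assumed to be serialized by an external lock.
template <class Table>
void rehashAndSwap(std::atomic<Table *> &current) {
  Table *old = current.load(std::memory_order_acquire);
  Table *bigger = new Table(old->capacity() * 2);
  for (auto &entry : *old)      // move everything into the new table
    bigger->insert(entry);
  current.store(bigger, std::memory_order_release);  // swap the pointer
  // `old` cannot be deleted yet: readers may still be using it. Safely
  // deferring that reclamation is exactly what the ReaderCount scheme
  // discussed later in this thread handles.
}
```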
(s/AtomicUnorderedMap/AtomicHashMap, although the caveat there is that it doesn't support arbitrary-sized keys. Another correction: s/unlimited/18x initial size.)
I like that you flush the cache instead of mucking with a generation count. That's a nice improvement. If that proves problematic for some case, you might consider walking all entries and clearing just the negative ones. That would require benchmarking to see if it's actually faster, of course, and would increase the code size, so it's probably not worth the effort unless there's demand.
@michaeleisel I totally understand, and I appreciate the questions! Not supporting larger keys would be a showstopper. Fixed size would just add some complication, as you say. I feel like it would be much more effort than it's worth, but I certainly wouldn't oppose looking into it to see how it does in practice. This doesn't have to be the end-all be-all of the journey!

@tbkka Yeah, I spent a bunch of time trying to get the generation count mechanism working in the new world (the lack of stable addresses for entries complicated things) before realizing I could just get rid of it, and that made everything so much simpler. I'd be really surprised if this approach caused any trouble. Loading libraries dynamically is so rare, and so slow for all sorts of other reasons, that I don't see how this could matter. But that's a good idea in case I'm wrong!
@michaeleisel As we continue to explore new target platforms, adding new libraries comes with a real cost, as each such library becomes an additional requirement on any new platform. We already rely heavily on LLVM data structures in the runtime, so it seems more prudent to continue evolving those rather than pull in a new library for this purpose.

@tbkka This appears to be largely a greenfield data structure, and does not use many LLVM things. And the open-source alternatives are always important to consider carefully, if only as a starting point for high-level thinking. But anyways, Mike Ash's approach seems good.
Got through about half of the code. I have a few comments, but only one that was substantive: I couldn't find where you actually put the old indices array onto the free list to be recycled.
@@ -907,7 +934,24 @@ class ReflectionContext

auto Root = getReader().readPointer(ConformancesAddr->getResolvedAddress(),
                                    sizeof(StoredPointer));
iterateConformanceTree(Root->getResolvedAddress().getAddressData(), Call);
auto RootAddr = Root->getResolvedAddress().getAddressData();
In transitional cases like this, I generally like to name things with the new behavior and then comment the old. That makes it easier when you delete the backwards support. So I might name this "ReaderCount" and then comment below that this location was previously used for the root node of the conformance cache tree.
(Force-pushed from a954e28 to b6a4345.)
Squashed commits and applied.
Nice!
@swift-ci please test

Build failed
…ance cache to use it.

ConcurrentReadableHashMap is lock-free for readers, with writers using a lock to ensure mutual exclusion amongst each other. The intent is to eventually replace all uses of ConcurrentMap with ConcurrentReadableHashMap.

ConcurrentReadableHashMap provides relatively quick lookups by using a hash table. Readers perform an atomic increment/decrement in order to inform writers that there are active readers. The design attempts to minimize wasted memory by storing the actual elements out of line, and having the table store indices into a separate array of elements.

The protocol conformance cache now uses ConcurrentReadableHashMap, which provides faster lookups and less memory use than the previous ConcurrentMap implementation. The previous implementation cached ProtocolConformanceDescriptors and extracted the WitnessTable after the cache lookup. The new implementation directly caches the WitnessTable, removing an extra step (potentially quite a slow one) from the fast path.

The previous implementation used a generational scheme to detect when negative cache entries became obsolete due to new dynamic libraries being loaded, and updated them in place. The new implementation just clears the entire cache when libraries are loaded, greatly simplifying the code and saving the memory needed to track the current generation in each negative cache entry. This means we need to re-cache all requested conformances after loading a dynamic library, but loading libraries at runtime is rare and slow anyway.

rdar://problem/67268325
(Force-pushed from b6a4345 to ecd6d4d.)
Whoops, push changes, THEN test.

@swift-ci please test

Build failed

Build failed

@swift-ci please test Windows platform

@swift-ci please test linux platform

Build failed

@swift-ci please test Linux platform

Build failed

@swift-ci please test Linux platform
Hi @mikeash – "ConcurrentReadableHashMapTest.MultiThreaded4" from this pull request is failing semi-reliably on my Linux box when there are background compiles going on. How can I help you debug this? What's the most stressful/unforgiving box you have to test this? Have you tested this with TSAN?
For reasons not worth going into, I don't have good luck running TSAN, but I was able to hack it together for this test. I hope this helps: https://znu.io/tsan.txt
I haven't studied this design closely, but based on the TSAN output, I found the following code that (at least superficially) is a classic time-of-check versus time-of-use race:
Map->incrementReaders();
}

~Snapshot() { Map->decrementReaders(); }
If `Snapshot` is truly a snapshot, then why is the map reader lock being held until the snapshot is destroyed? Shouldn't it be dropped at the end of the constructor?
The reader count isn't really a reader lock, as writes can still make progress when it's non-zero. It just informs writers whether it's safe to free old allocations. Outstanding snapshots may be pointing to old allocations, so it's not safe to free them until all outstanding snapshots have been destroyed.
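A minimal sketch of that reader-count protocol (member names and memory orderings here are assumptions, not the exact Concurrent.h code):

```cpp
#include <atomic>
#include <cstdint>

struct MapStorage {
  // Zero means "no snapshot alive"; writers only ever test for zero.
  std::atomic<uint32_t> ReaderCount{0};

  struct Snapshot {
    MapStorage *Map;
    // Also captures the current Indices/Elements pointers (omitted here).
    explicit Snapshot(MapStorage *map) : Map(map) {
      Map->ReaderCount.fetch_add(1, std::memory_order_acquire);
    }
    Snapshot(const Snapshot &other) : Map(other.Map) {
      // Safe even though members were copied first: `other` already holds
      // the count above zero, so writers can't free anything in between.
      Map->ReaderCount.fetch_add(1, std::memory_order_relaxed);
    }
    ~Snapshot() {
      // When this drops to zero, writers may free retired allocations.
      Map->ReaderCount.fetch_sub(1, std::memory_order_release);
    }
  };
};
```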
Snapshot(const Snapshot &other)
    : Map(other.Map), Indices(other.Indices), Elements(other.Elements),
      ElementCount(other.ElementCount) {
  Map->incrementReaders();
This seems out of order. Shouldn't the reader lock be acquired BEFORE initializing copies of the Map data?
The snapshot being copied is already holding the reader count above zero, so the order doesn't matter here. Writers only care about zero and non-zero.
@@ -541,6 +546,382 @@ template <class ElemTy> struct ConcurrentReadableArray {
  }
};

using llvm::hash_value;

/// A hash table that can be queried without taking any locks. Writes are still
This comment seems out of touch with the implementation which is clearly using a reader/writer lock strategy (as opposed to a truly lockless strategy like RCU).
See previous comment about how this is not really a reader lock, but just controlling whether old allocations get freed.
/// The number of readers currently active, equal to the number of snapshot
/// objects currently alive.
std::atomic<size_t> ReaderCount;
Unless something has dramatically changed, no operating system to date supports anywhere close to four billion threads. I think this can safely be a uint32_t for a long time :-)
That's a good point. I feel weird using atomic integers smaller than a pointer, but that's just me being silly. I could shrink this and the element count to save a word.
The more I look at this code, the more I think it should just be replaced with a dramatically simplified reader/writer spin lock and a …
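For comparison, a minimal sketch of such a fully locked alternative (pairing the suggested lock with `std::shared_mutex` and `std::unordered_map` is an assumption for illustration; readers would then contend on the lock, which the lock-free read path avoids):

```cpp
#include <shared_mutex>
#include <unordered_map>

template <class Key, class Value>
class LockedMap {
  mutable std::shared_mutex Mutex;
  std::unordered_map<Key, Value> Map;

public:
  bool lookup(const Key &key, Value &out) const {
    std::shared_lock<std::shared_mutex> guard(Mutex);  // shared: many readers
    auto it = Map.find(key);
    if (it == Map.end())
      return false;
    out = it->second;
    return true;
  }

  void insert(const Key &key, const Value &value) {
    std::unique_lock<std::shared_mutex> guard(Mutex);  // exclusive: one writer
    Map[key] = value;
  }
};
```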
(Why oh why doesn't GitHub support proper threaded replies on top-level comments?)

The caller holds the write lock, so the free list can't change in here. The danger with freeing items on the free list is if an outstanding snapshot holds a reference to one of those items. Thus, if `ReaderCount` is zero, it's safe to free them.

As you say, we then have a TOCTOU problem, except that there's no way for any new readers to get a reference to anything on the free lists. Because we hold the writer lock, the elements and indices are unchanging while we do this operation. A new snapshot will copy the current indices and elements, but it can't get any of the old ones.

As far as simplifying this with a more traditional locked solution, I think this is used often enough to justify the extra complexity (especially once I replace other uses of `ConcurrentMap`).
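That argument as a sketch (illustrative names, not the PR's actual code):

```cpp
#include <atomic>
#include <cstdint>
#include <cstdlib>
#include <vector>

struct HashMapStorage {
  std::atomic<uint32_t> ReaderCount{0};
  std::vector<void *> FreeList;  // retired Indices/Elements allocations

  // Precondition: the caller holds the writer lock, so no other writer can
  // touch FreeList while this runs.
  void deallocateFreeListIfSafe() {
    if (ReaderCount.load(std::memory_order_acquire) == 0) {
      // No snapshot is alive, and any snapshot created from now on copies
      // only the *current* pointers, never old ones, so the retired
      // allocations are unreachable and safe to free.
      for (void *allocation : FreeList)
        std::free(allocation);
      FreeList.clear();
    }
    // Otherwise, leave the free list for a later writer to reclaim.
  }
};
```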
I believe the failure is fixed by this new PR: https://github.com/apple/swift/pull/33595/files
With the improvements to protocol casting performance from swiftlang#33487, the demangling overhead is hopefully manageable for most people using this flag today.
Summary

The protocol conformance cache currently uses `ConcurrentMap` from `Concurrent.h`, which is a binary tree. The large amount of pointer chasing means that it performs less than optimally on modern hardware, and there's also a great deal of memory overhead from the left/right pointers. The entries themselves are only four words, so those child pointers add 50% overhead.
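To make the 50% figure concrete: each entry carries four words of payload, and the tree adds two words of left/right child pointers per entry, so

$$\frac{2\ \text{pointer words}}{4\ \text{payload words}} = 50\%\ \text{memory overhead.}$$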
New Hash Map Type

This PR adds a new type to `Concurrent.h`, `ConcurrentReadableHashMap`, which is intended to replace `ConcurrentMap`, and changes the protocol conformance cache to use it. I plan to replace other uses of `ConcurrentMap` as well; this is just the first place I decided to target.

The design of `ConcurrentReadableHashMap` is detailed in the comments above the implementation. The executive summary is: elements are stored out of line, and the hash table holds small indices into them, so empty table space costs `n/2 * 4` bytes instead of `n/2 * sizeof(Element)` bytes. I plan to make the code adaptive so that it can use indices of 1 or 2 bytes when the number of elements is small. Currently it always uses 4 bytes per index regardless of size.
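As a worked instance of that formula, suppose half of an $n$-slot table is empty and each element is 32 bytes (the four-word entry size that appears under Measurements below). With 4-byte indices, the empty slots cost

$$\frac{n}{2} \times 4 = 2n\ \text{bytes,}$$

whereas a conventional table storing elements inline would spend

$$\frac{n}{2} \times \mathrm{sizeof}(\mathrm{Element}) = \frac{n}{2} \times 32 = 16n\ \text{bytes}$$

on the same empty slots.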
Protocol Conformance Cache Updates

The protocol conformance cache is updated to use `ConcurrentReadableHashMap`. This was not quite a drop-in replacement, for a couple of reasons.

Negative cache entries (i.e. "type `T` does not conform to protocol `P`") can become invalid when a dynamic library containing a conformance is loaded at runtime. The existing code handles this by storing a `failureGeneration` on each cache entry. The global "failure generation" is increased when dynamic libraries are loaded, so readers can tell when a negative entry is obsolete. In that case, the readers will scan the new libraries and then update the entry with either a hit or a new `failureGeneration`.

This requires updating the cache entry in place, which is inconvenient for the new cache. It also requires 8 bytes (on 64-bit systems) per cache entry, which is 25% of the total entry size. Instead of using this scheme, the updated code clears the conformance cache when dynamic libraries are loaded. This will result in redoing a bunch of work for positive (i.e. "type `T` conforms to protocol `P`") entries, but dynamic library loading is extremely rare, so this is an acceptable tradeoff.

There's also a change to what the cache actually caches. Protocol conformance checks involve a two-stage lookup:

1. Find the `ProtocolConformanceDescriptor *` for the given type and protocol we're looking for.
2. Find the `WitnessTable *` for the type and conformance descriptor.

Step 2 can be fairly expensive, at least for conditional conformances. For reasons I don't fully understand (but which I'm sure were entirely reasonable at the time it was written... I think there was an attempt to cache based on type descriptors instead of metatype pointers, which would have reduced cache size, but it's not implemented now), the existing code puts the cache on step 1. The new code moves it to step 2, so that a cached conformance check is just a single hash table lookup and nothing else.
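A sketch of that shape (illustrative only: every type and helper below is a stand-in rather than the runtime's real API, and the real cache's concurrency is ignored):

```cpp
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <utility>

// Stand-ins for the runtime's real types.
struct Metadata;
struct ProtocolDescriptor;
struct WitnessTable;

using CacheKey = std::pair<const Metadata *, const ProtocolDescriptor *>;

struct CacheKeyHash {
  std::size_t operator()(const CacheKey &key) const {
    return std::hash<const void *>()(key.first) ^
           (std::hash<const void *>()(key.second) * 31);
  }
};

// Stand-ins for the two lookup stages (dummy bodies for the sketch).
const void *findConformanceDescriptor(const Metadata *,
                                      const ProtocolDescriptor *) {
  return nullptr;
}
const WitnessTable *getWitnessTable(const Metadata *, const void *) {
  return nullptr;
}

std::unordered_map<CacheKey, const WitnessTable *, CacheKeyHash> Cache;

const WitnessTable *conformsToProtocol(const Metadata *type,
                                       const ProtocolDescriptor *proto) {
  auto it = Cache.find({type, proto});
  if (it != Cache.end())
    return it->second;  // cache hit: a single hash lookup and nothing else

  // Step 1: find the conformance descriptor for this (type, protocol).
  const void *descriptor = findConformanceDescriptor(type, proto);
  // Step 2: get the witness table, potentially expensive for conditional
  // conformances. Caching *this* result is the change described above.
  const WitnessTable *witness =
      descriptor ? getWitnessTable(type, descriptor) : nullptr;
  Cache[{type, proto}] = witness;
  return witness;
}
```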
Step 1 was implemented with the runtime call `swift_conformsToSwiftProtocol`, and step 2 was implemented with `swift_conformsToProtocol`, which wraps `swift_conformsToSwiftProtocol` and then does the witness table lookup. `swift_conformsToSwiftProtocol` was not used by anything else, and this PR removes it entirely.

`swift_conformsToSwiftProtocol` was available as an override point in `CompatibilityOverrides.def`. This PR removes it from that file. Since it's still part of the overrides ABI for older Swift runtimes, this PR puts copies of `CompatibilityOverrides.def` into the directories for the compatibility libraries for Swift 5.0 and 5.1. Overrides are version-specific, so there's no need to maintain a common layout.

Measurements
The benchmark run is below. The dedicated protocol conformance benchmark shows substantial improvements, and some nice smaller improvements show up on various other benchmarks.

For memory usage, I used a modified version of the protocol conformance benchmark to look at a program with 3,000 cached conformances, then examined the memory. With the existing code, that produces 3,000 allocations of 48 bytes each, for a total of 144kB, plus the allocator overhead of 3,000 separate allocations. With the new implementation, we end up with a 96kB allocation for the elements array and 16kB for the indices, for a total of 112kB: a pretty decent memory win. Support for two-byte indices would save another 8kB and cut this down to 104kB, so that would be a useful future enhancement.
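Checking those numbers (the 4,096-slot count is inferred here from the 16kB indices allocation at 4 bytes per index):

$$3000 \times 48\,\text{B} = 144\,\text{kB (old)} \qquad 3000 \times 32\,\text{B} + 4096 \times 4\,\text{B} = 96\,\text{kB} + 16\,\text{kB} = 112\,\text{kB (new)}$$

With 2-byte indices, the second term drops to $4096 \times 2\,\text{B} = 8\,\text{kB}$, giving the 104kB figure mentioned above.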
rdar://problem/67268325