[Runtime] Reimplement protocol conformance cache with a hash table #33487

Merged
merged 1 commit into swiftlang:master on Aug 21, 2020

Conversation

mikeash
Contributor

@mikeash mikeash commented Aug 14, 2020

Summary

The protocol conformance cache currently uses ConcurrentMap from Concurrent.h, which is a binary tree. The large amount of pointer chasing means that it performs less than optimally on modern hardware, and there's also a great deal of memory overhead from the left/right pointers. The entries themselves are only four words, so those child pointers add 50% overhead.

New Hash Map Type

This PR adds a new type to Concurrent.h, ConcurrentReadableHashMap, which is intended to replace ConcurrentMap, and changes the protocol conformance cache to use it. I plan to replace other uses of ConcurrentMap as well; this is just the first place I decided to target.

The design of ConcurrentReadableHashMap is detailed in the comments above the implementation. The executive summary is:

  1. Use a lock to coordinate writers. Readers don't lock.
  2. Keep an atomic count of readers to avoid freeing memory that's being read.
  3. Keep the supported operations to a bare minimum: insert, look up, clear all.
  4. Use an array of small integers as the actual hash table, with those integers being indices into a separate array of elements (see the sketch after this list). This minimizes wasted space due to the table's load factor: a table that's only 50% full wastes n/2 * 4 bytes instead of n/2 * sizeof(Element) bytes. I plan to make the code adaptive so that it can use indices of 1 or 2 bytes when the number of elements is small. Currently it always uses 4 bytes per index regardless of size.
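
To make item 4 concrete, here is a minimal single-threaded sketch of the layout. All names are illustrative (matchesKey in particular is an assumed Element method); this is not the actual Concurrent.h code, and it ignores all the concurrency machinery.

#include <cstddef>
#include <cstdint>
#include <vector>

template <class Element>
struct IndexedTableSketch {
  // The hash table proper: 0 means "empty slot"; any other value is
  // (index into Elements) + 1. An unused slot wastes 4 bytes rather than
  // sizeof(Element) bytes.
  std::vector<uint32_t> Indices; // size kept at a power of two, never full
  // The elements themselves, densely packed in insertion order.
  std::vector<Element> Elements;

  template <class Key>
  const Element *find(const Key &key, size_t hash) const {
    size_t mask = Indices.size() - 1;
    // Linear probing: walk from the hash bucket until we hit a match or an
    // empty slot. The table is grown before it fills, so this terminates.
    for (size_t i = hash & mask;; i = (i + 1) & mask) {
      uint32_t index = Indices[i];
      if (index == 0)
        return nullptr;
      const Element &element = Elements[index - 1];
      if (element.matchesKey(key)) // matchesKey: assumed Element method
        return &element;
    }
  }
};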

Protocol Conformance Cache Updates

The protocol conformance cache is updated to use ConcurrentReadableHashMap. This was not quite a drop-in replacement, for a couple of reasons.

Negative cache entries (i.e. "type T does not conform to protocol P") can become invalid when a dynamic library containing a new conformance is loaded at runtime. The existing code handles this by storing a failureGeneration on each cache entry. The global "failure generation" is incremented when dynamic libraries are loaded, so readers can tell when a negative entry is obsolete. In that case, the readers will scan the new libraries and then update the entry with either a hit or a new failureGeneration.

This requires updating the cache entry in place, which is inconvenient for the new cache. It also requires 8 bytes (on 64-bit systems) per cache entry, which is 25% of the total entry size. Instead of using this scheme, the updated code clears the conformance cache when dynamic libraries are loaded. This will result in redoing a bunch of work for positive (i.e. "type T conforms to protocol P") entries, but dynamic library loading is extremely rare, so this is an acceptable tradeoff.
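
For contrast, here is a hedged sketch of the generation scheme being removed. All names are illustrative, not the runtime's actual identifiers.

#include <atomic>
#include <cstdint>

// Incremented whenever a dynamic library with conformances is loaded.
static std::atomic<uint64_t> GlobalFailureGeneration{0};

struct ConformanceCacheEntrySketch {
  const void *WitnessTableOrNull; // non-null for a positive entry
  uint64_t FailureGeneration;     // 8 bytes of every entry, 25% of its size

  // A negative entry can be trusted only if no libraries have been loaded
  // since the miss was recorded; otherwise the caller rescans and updates
  // the entry in place.
  bool isMissStillValid() const {
    return FailureGeneration ==
           GlobalFailureGeneration.load(std::memory_order_acquire);
  }
};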

There's also a change to what the cache actually caches. Protocol conformance checks involve a two-stage lookup:

  1. Search conformance descriptors to find the appropriate ProtocolConformanceDescriptor * for the given type and protocol.
  2. Get the WitnessTable * for the type and conformance descriptor.

Step 2 can be fairly expensive, at least for conditional conformances. For reasons I don't fully understand (but which I'm sure were entirely reasonable at the time it was written... I think there was an attempt to cache based on type descriptors instead of metatype pointers, which would have reduced cache size, but it's not implemented now), the existing code puts the cache on step 1. The new code moves it to step 2, so that a cached conformance check is just a single hash table lookup and nothing else.
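
The shape of the change, as a hedged sketch: the four types below are real runtime types, but every function is a hypothetical stand-in for exposition, not the runtime's actual API.

struct Metadata;
struct ProtocolDescriptor;
struct ProtocolConformanceDescriptor;
struct WitnessTable;

const ProtocolConformanceDescriptor *
cachedDescriptorOrScan(const Metadata *, const ProtocolDescriptor *);
const WitnessTable *
cachedWitnessTable(const Metadata *, const ProtocolDescriptor *);
const ProtocolConformanceDescriptor *
scanConformanceRecords(const Metadata *, const ProtocolDescriptor *);
const WitnessTable *
getWitnessTableFor(const ProtocolConformanceDescriptor *, const Metadata *);
void cacheInsert(const Metadata *, const ProtocolDescriptor *,
                 const WitnessTable *);

// Old: the cache sits on step 1, so even a cache hit repeats step 2.
const WitnessTable *conformsOld(const Metadata *type,
                                const ProtocolDescriptor *protocol) {
  auto *descriptor = cachedDescriptorOrScan(type, protocol); // step 1 (cached)
  return getWitnessTableFor(descriptor, type);               // step 2, every call
}

// New: the cache sits on step 2, so a hit is a single hash-table lookup.
const WitnessTable *conformsNew(const Metadata *type,
                                const ProtocolDescriptor *protocol) {
  if (auto *table = cachedWitnessTable(type, protocol))
    return table;                                            // fast path
  auto *descriptor = scanConformanceRecords(type, protocol); // step 1
  auto *table = getWitnessTableFor(descriptor, type);        // step 2
  cacheInsert(type, protocol, table);
  return table;
}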

Step 1 was implemented with the runtime call swift_conformsToSwiftProtocol, and then step 2 was implemented with swift_conformsToProtocol which wraps swift_conformsToSwiftProtocol and then does the witness table lookup. swift_conformsToSwiftProtocol was not used by anything else, and this PR removes it entirely.

swift_conformsToSwiftProtocol was available as an override point in CompatibilityOverrides.def. This PR removes it from that file. Since it's still part of the overrides ABI for older Swift runtimes, this PR puts copies of CompatibilityOverrides.def into the directories for the Swift 5.0 and 5.1 compatibility libraries. Overrides are version-specific, so there's no need to maintain a common layout.

Measurements

The benchmark run is below. The dedicated protocol conformance benchmark shows substantial improvements, and some nice smaller improvements show up on various other benchmarks.

For memory usage, I used a modified version of the protocol conformance benchmark to look at a program with 3,000 cached conformances, then examined the memory. With the existing code, that produces 3,000 allocations of 48 bytes each, for a total of 144kB plus the allocator overhead of 3,000 separate allocations. With the new implementation, we end up with a 96kB allocation for the elements array and 16kB for the indices, for a total of 112kB, which is a pretty decent memory win. Support for two-byte indices would save another 8kB and cut this down to 104kB, so that would be a useful future enhancement.
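
Breaking that arithmetic down: the 32-byte entry size follows from the four-word entries mentioned above, and the 4,096-slot index table is an assumption consistent with the stated 16kB.

Old (binary tree):  3,000 entries × (32 B entry + 16 B child pointers) = 144,000 B ≈ 144 kB
New (hash map):     3,000 entries × 32 B                               =  96,000 B ≈  96 kB
                    4,096 index slots × 4 B                            =  16,384 B ≈  16 kB
                                                                total  ≈ 112 kB
Two-byte indices:   4,096 index slots × 2 B = 8,192 B, for a total     ≈ 104 kB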

rdar://problem/67268325

@michaeleisel

What are the benefits of these bespoke concurrent data structures over the ones in other projects like Folly?

@mikeash
Contributor Author

mikeash commented Aug 17, 2020

@michaeleisel There are the usual motivations like minimizing external dependencies and making sure we know exactly what the code does. But the major reason is to have something that can be precisely tuned to our specific needs.

Taking the example of Folly, its concurrent data structures have two fatal flaws for our purposes:

  1. Keys are only 64 bits. We need two pointers for protocol conformance keys, and will need even bigger keys for some other places I plan to use this.
  2. The maximum size must be set at creation, but we have no idea how big these caches might get in any given process.

No doubt they're great for use cases that fit those constraints; this just isn't one of them.

Overall, the goals are:

  1. Reads should be as fast as possible. Writes can be much slower and their speed almost doesn't matter, and they don't need to be concurrent with respect to each other.
  2. Minimal memory overhead. A typical Swift program has a lot of entries in this cache (and the other caches I want to use this for later) and overhead from the data structure and padding can waste significant memory.
  3. Support only a limited set of operations. We only need search, insert, and clear. In particular, we don't need to remove existing entries (aside from the bulk clear operation), which simplifies things significantly.

With a custom implementation, we can take maximum advantage of 3 to do the best job we can for 1 and 2.

@michaeleisel

In Folly, there's AtomicUnorderedMap, AtomicHashMap, and AtomicHashArray (that I know of). They each have different characteristics, but AtomicUnorderedMap seems like a potentially good fit based on what you've stated. It supports arbitrary key types and fast wait-free reads (I don't see a discussion of memory overhead, though). It does not support deletion. The size is immutable, but perhaps rehashing could be done very infrequently. It may not end up being a good fit, but it would be good to at least benchmark against it if we go with new hash maps, to make sure that the newer solution is indeed faster. There's also the lower-level AtomicHashArray, which is meant as a building block for higher-level concurrent hash map structures; perhaps it could serve that role here. That having been said, it sounds very exciting that this is getting optimized.

@mikeash
Contributor Author

mikeash commented Aug 17, 2020

I'm not sure how to work around a fixed size without basically reinventing something like the scheme I'm using here. These types also use an array of key/value pairs as their hash table, which is typical, but means you waste more space for larger keys or values. This scheme uses small indexes for the hash table, with elements stored out of line, making for much less wasted space. See Concurrent.h:560 for more discussion of how that works.

@mikeash
Contributor Author

mikeash commented Aug 17, 2020

@swift-ci please benchmark

@swift-ci
Contributor

Performance: -O

Regression OLD NEW DELTA RATIO
DictionaryBridgeToObjC_Access 871 955 +9.6% 0.91x (?)
 
Improvement OLD NEW DELTA RATIO
ProtocolConformance 725 53 -92.7% 13.68x
StringFromLongWholeSubstringGeneric 6 5 -16.7% 1.20x
ObjectiveCBridgeFromNSDictionaryAnyObjectForced 8150 7350 -9.8% 1.11x (?)
ObjectiveCBridgeToNSDictionary 18850 17050 -9.5% 1.11x (?)
MirrorDefault 2792 2584 -7.4% 1.08x (?)
RandomTree.insert.Unmanaged.fast 209 194 -7.2% 1.08x (?)
EqualSubstringSubstring 42 39 -7.1% 1.08x (?)
EqualSubstringSubstringGenericEquatable 42 39 -7.1% 1.08x
EqualSubstringString 42 39 -7.1% 1.08x (?)
Breadcrumbs.MutatedUTF16ToIdx.Mixed 444 413 -7.0% 1.08x (?)
LessSubstringSubstring 43 40 -7.0% 1.07x
EqualStringSubstring 43 40 -7.0% 1.07x (?)
LessSubstringSubstringGenericComparable 43 40 -7.0% 1.07x
Breadcrumbs.MutatedIdxToUTF16.Mixed 453 422 -6.8% 1.07x (?)
Set.subtracting.Empty.Box 15 14 -6.7% 1.07x (?)

Code size: -O

Performance: -Osize

Regression OLD NEW DELTA RATIO
StringFromLongWholeSubstring 4 5 +25.0% 0.80x
FlattenListLoop 1635 1900 +16.2% 0.86x (?)
StringWalk 2520 2720 +7.9% 0.93x (?)
StringComparison_slowerPrenormal 1520 1640 +7.9% 0.93x (?)
 
Improvement OLD NEW DELTA RATIO
ProtocolConformance 722 53 -92.7% 13.62x
DistinctClassFieldAccesses 442 323 -26.9% 1.37x (?)
StringFromLongWholeSubstringGeneric 6 5 -16.7% 1.20x
LessSubstringSubstring 43 39 -9.3% 1.10x
LessSubstringSubstringGenericComparable 43 39 -9.3% 1.10x
ObjectiveCBridgeFromNSDictionaryAnyObjectForced 8100 7350 -9.3% 1.10x (?)
MirrorDefault 2798 2556 -8.6% 1.09x (?)
ObjectiveCBridgeFromNSArrayAnyObjectForced 6860 6320 -7.9% 1.09x (?)
Breadcrumbs.MutatedUTF16ToIdx.Mixed 442 409 -7.5% 1.08x (?)
EqualStringSubstring 42 39 -7.1% 1.08x (?)
Breadcrumbs.MutatedIdxToUTF16.Mixed 454 422 -7.0% 1.08x (?)
EqualSubstringSubstring 43 40 -7.0% 1.07x (?)
EqualSubstringSubstringGenericEquatable 43 40 -7.0% 1.07x (?)
EqualSubstringString 43 40 -7.0% 1.07x (?)

Code size: -Osize

Performance: -Onone

Regression OLD NEW DELTA RATIO
ArrayOfPOD 1058 1141 +7.8% 0.93x (?)
 
Improvement OLD NEW DELTA RATIO
DataCreateEmptyArray 72950 17950 -75.4% 4.06x
ProtocolConformance 842 214 -74.6% 3.93x
DataReplaceSmallBuffer 11400 3800 -66.7% 3.00x
DataReplaceMediumBuffer 14500 5600 -61.4% 2.59x
DataCreateSmallArray 128600 54400 -57.7% 2.36x
DataCreateMediumArray 18420 8780 -52.3% 2.10x
DataReplaceLargeBuffer 33 24 -27.3% 1.37x (?)
ClosedRangeOverlapsClosedRange 28373 23671 -16.6% 1.20x (?)
RangeOverlapsRange 29670 25509 -14.0% 1.16x (?)
Breadcrumbs.IdxToUTF16Range.longASCII 382 330 -13.6% 1.16x (?)
Set.subtracting.Seq.Int25 7683 6649 -13.5% 1.16x
Set.subtracting.Seq.Int0 7325 6343 -13.4% 1.15x
RangeOverlapsClosedRange 31450 27275 -13.3% 1.15x (?)
Set.subtracting.Seq.Int50 7957 6925 -13.0% 1.15x
Set.subtracting.Seq.Int100 8594 7636 -11.1% 1.13x (?)
String.data.LargeUnicode 150 134 -10.7% 1.12x (?)
CStringShortAscii 6500 5860 -9.8% 1.11x (?)
FatCompactMap 534770 482530 -9.8% 1.11x (?)
Breadcrumbs.CopyUTF16CodeUnits.Mixed 264 239 -9.5% 1.10x (?)
RemoveWhereQuadraticString 4338 3961 -8.7% 1.10x (?)
UTF8Decode_InitFromBytes_ascii_as_ascii 589 538 -8.7% 1.09x (?)
LessSubstringSubstring 50 46 -8.0% 1.09x (?)
LessSubstringSubstringGenericComparable 50 46 -8.0% 1.09x
EqualSubstringString 51 47 -7.8% 1.09x (?)
MirrorDefault 2948 2723 -7.6% 1.08x (?)
CharIteration_russian_unicodeScalars_Backwards 289800 267920 -7.6% 1.08x (?)
CharIteration_punctuated_unicodeScalars 54520 50440 -7.5% 1.08x (?)
CharIteration_punctuatedJapanese_unicodeScalars 43080 39880 -7.4% 1.08x (?)
Breadcrumbs.MutatedIdxToUTF16.ASCII 27 25 -7.4% 1.08x (?)
StringToDataMedium 4800 4450 -7.3% 1.08x (?)
Breadcrumbs.MutatedUTF16ToIdx.Mixed 483 448 -7.2% 1.08x (?)
Breadcrumbs.MutatedIdxToUTF16.Mixed 490 456 -6.9% 1.07x (?)
CharIteration_japanese_unicodeScalars 294040 274560 -6.6% 1.07x (?)

Code size: -swiftlibs

How to read the data

The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 12-Core Intel Xeon E5
  Processor Speed: 2.7 GHz
  Number of Processors: 1
  Total Number of Cores: 12
  L2 Cache (per Core): 256 KB
  L3 Cache: 30 MB
  Memory: 64 GB

@mikeash mikeash changed the title [WIP] Reimplement protocol conformance cache with a hash table [Runtime] Reimplement protocol conformance cache with a hash table Aug 17, 2020
@michaeleisel

AtomicUnorderedMap supports unlimited insertion (albeit slower after a point). Or rehashing could be done, either blocking or concurrently (with some secondary hash used for backup storage): a new hash table gets created, everything gets moved into it, and then the pointers get swapped. By only dealing with rehashing, and not the rest of the implementation, we would leverage their optimizations.
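
A rough sketch of that grow-and-swap idea (hypothetical code with assumed Table operations, not Folly's actual API):

#include <atomic>

// Writers (already serialized elsewhere) grow the table by building a
// replacement and publishing it with a single atomic pointer store. Readers
// keep using whichever table they loaded. Safely reclaiming the old table
// is the hard part this sketch leaves out.
template <class Table>
void growAndSwap(std::atomic<Table *> &current) {
  Table *oldTable = current.load(std::memory_order_acquire);
  Table *newTable = new Table(oldTable->capacity() * 2); // assumed constructor
  for (const auto &entry : *oldTable)                    // assumed iteration
    newTable->insert(entry);                             // assumed insert
  current.store(newTable, std::memory_order_release);
  // delete oldTable only once no reader can still reference it.
}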

I just want to be sure that development of this is done with the current state-of-the-art in mind, and that at least the optimizations of folly (as one example of battle-tested data structures) are being considered. That 92% reduction on the protocol conformance benchmark looks awesome though. It may well be faster than simple approaches using folly.

@michaeleisel

(s/AtomicUnorderedMap/AtomicHashMap, although the caveat there is that it doesn't support arbitrary-sized keys. Another correction: s/unlimited/18x initial size.)

@tbkka
Contributor

tbkka commented Aug 17, 2020

I like that you flush the cache instead of mucking with a generation count. That's a nice improvement. If that proves problematic for some case, you might consider walking all entries and clearing just the negative ones. That would require benchmarking to see if it's actually faster, of course, and would increase the code size so probably not worth the effort unless there's demand.

@mikeash
Contributor Author

mikeash commented Aug 17, 2020

@michaeleisel I totally understand, and I appreciate the questions!

Not supporting larger keys would be a showstopper. Fixed size would just add some complication, as you say. I feel like it would be much more effort than it's worth, but I certainly wouldn't oppose looking into it to see how it does in practice. This doesn't have to be the end-all be-all of the journey!

@mikeash
Contributor Author

mikeash commented Aug 17, 2020

@tbkka Yeah, I spent a bunch of time trying to get the generation count mechanism working in the new world (the lack of stable addresses for entries complicated things) before realizing I could just get rid of it, and that made everything so much simpler.

I'd be really surprised if this approach caused any trouble. Loading libraries dynamically is so rare, and so slow for all sorts of other reasons, that I don't see how this could matter. But that's a good idea in case I'm wrong!

@tbkka
Contributor

tbkka commented Aug 17, 2020

@michaeleisel As we continue to explore new target platforms, adding new libraries comes with a real cost, as each such library becomes an additional requirement on any new platform. We already rely heavily on LLVM data structures in the runtime, so it seems more prudent to continue evolving those rather than pull in a new library for this purpose.

@michaeleisel

@tbkka this appears to be largely a greenfield data structure and doesn't use many LLVM facilities. And open-source alternatives are always important to consider carefully, if only as a starting point for high-level thinking. But anyway, Mike Ash's approach seems good.

@mikeash mikeash requested review from jckarter and tbkka August 18, 2020 14:12
@mikeash mikeash marked this pull request as ready for review August 18, 2020 16:22
Contributor

@tbkka tbkka left a comment


Got through about half of the code. I have a few comments, but only one that was substantive: I couldn't find where you actually put the old indices array onto the free list to be recycled.

@@ -907,7 +934,24 @@ class ReflectionContext

  auto Root = getReader().readPointer(ConformancesAddr->getResolvedAddress(),
                                      sizeof(StoredPointer));
  iterateConformanceTree(Root->getResolvedAddress().getAddressData(), Call);
  auto RootAddr = Root->getResolvedAddress().getAddressData();
Contributor


In transitional cases like this, I generally like to name things for the new behavior and then comment the old. That makes it easier when you delete the backwards support. So I might name this "ReaderCount" and then note in a comment below that this location previously held the root node of the conformance cache tree.

@mikeash mikeash force-pushed the concurrenthashmap branch 3 times, most recently from a954e28 to b6a4345 Compare August 19, 2020 20:19
@mikeash
Contributor Author

mikeash commented Aug 19, 2020

Squashed commits and applied clang-format.

Contributor

@tbkka tbkka left a comment


Nice!

@mikeash
Contributor Author

mikeash commented Aug 20, 2020

@swift-ci please test

@swift-ci
Contributor

Build failed
Swift Test Linux Platform
Git Sha - 3bd501005dcd95cd19c5f4363b5c0d6e551581fc

…ance cache to use it.

ConcurrentReadableHashMap is lock-free for readers, with writers using a lock to
ensure mutual exclusion amongst each other. The intent is to eventually replace
all uses of ConcurrentMap with ConcurrentReadableHashMap.

ConcurrentReadableHashMap provides relatively quick lookups by using a hash
table. Readers perform an atomic increment/decrement in order to inform writers
that there are active readers. The design attempts to minimize wasted memory by
storing the actual elements out of line and having the table store indices into
a separate array of elements.

The protocol conformance cache now uses ConcurrentReadableHashMap, which
provides faster lookups and less memory use than the previous ConcurrentMap
implementation. The previous implementation cached
ProtocolConformanceDescriptors and extracted the WitnessTable after the cache
lookup. The new implementation directly caches the WitnessTable, removing an
extra (potentially quite slow) step from the fast path.

The previous implementation used a generational scheme to detect when negative
cache entries became obsolete due to new dynamic libraries being loaded, and
updated them in place. The new implementation just clears the entire cache when
libraries are loaded, greatly simplifying the code and saving the memory needed
to track the current generation in each negative cache entry. This means we need
to re-cache all requested conformances after loading a dynamic library, but
loading libraries at runtime is rare and slow anyway.

rdar://problem/67268325
@mikeash mikeash force-pushed the concurrenthashmap branch from b6a4345 to ecd6d4d Compare August 20, 2020 17:05
@mikeash
Contributor Author

mikeash commented Aug 20, 2020

Whoops, push changes, THEN test.

@mikeash
Contributor Author

mikeash commented Aug 20, 2020

@swift-ci please test

@swift-ci
Contributor

Build failed
Swift Test OS X Platform
Git Sha - b6a4345d1fc2c25337b57b6efefd2134588aac16

@swift-ci
Contributor

Build failed
Swift Test Linux Platform
Git Sha - b6a4345d1fc2c25337b57b6efefd2134588aac16

@mikeash
Contributor Author

mikeash commented Aug 20, 2020

@swift-ci please test Windows platform

@mikeash
Contributor Author

mikeash commented Aug 20, 2020

@swift-ci please test linux platform

@swift-ci
Contributor

Build failed
Swift Test Linux Platform
Git Sha - ecd6d4d

@mikeash
Contributor Author

mikeash commented Aug 21, 2020

@swift-ci please test Linux platform

@swift-ci
Contributor

Build failed
Swift Test Linux Platform
Git Sha - ecd6d4d

@mikeash
Contributor Author

mikeash commented Aug 21, 2020

@swift-ci please test Linux platform

@mikeash mikeash merged commit 39d4e01 into swiftlang:master Aug 21, 2020
@davezarzycki
Contributor

Hi @mikeash – "ConcurrentReadableHashMapTest.MultiThreaded4" from this pull request is failing semi-reliably on my Linux box when there are background compiles going on. How can I help you debug this? What's the most stressful/unforgiving box you have to test this on? Have you tested this with TSAN?

FAIL: Swift-Unit :: runtime/./SwiftRuntimeTests/ConcurrentReadableHashMapTest.MultiThreaded4 (3 of 9)
******************** TEST 'Swift-Unit :: runtime/./SwiftRuntimeTests/ConcurrentReadableHashMapTest.MultiThreaded4' FAILED ********************
Note: Google Test filter = ConcurrentReadableHashMapTest.MultiThreaded4
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from ConcurrentReadableHashMapTest
[ RUN      ] ConcurrentReadableHashMapTest.MultiThreaded4
#0 0x00000000002aa444 llvm::sys::PrintStackTrace(llvm::raw_ostream&) (/home/dave/b/u/t/tools/swift/unittests/runtime/./SwiftRuntimeTests+0x2aa444)
#1 0x00000000002a8b07 llvm::sys::RunSignalHandlers() (/home/dave/b/u/t/tools/swift/unittests/runtime/./SwiftRuntimeTests+0x2a8b07)
#2 0x00000000002aab5a SignalHandler(int) (/home/dave/b/u/t/tools/swift/unittests/runtime/./SwiftRuntimeTests+0x2aab5a)
#3 0x00007ffff7f9ea90 __restore_rt (/lib64/libpthread.so.0+0x14a90)
#4 0x000000000026068a std::__1::pair<MultiThreadedValue*, std::__1::atomic<unsigned int>*> swift::ConcurrentReadableHashMap<MultiThreadedValue>::find<MultiThreadedKey>(MultiThreadedKey const&, swift::ConcurrentReadableHashMap<MultiThreadedValue>::IndexStorage*, unsigned long, MultiThreadedValue*) (/home/dave/b/u/t/tools/swift/unittests/runtime/./SwiftRuntimeTests+0x26068a)
#5 0x0000000000262e97 void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void threadedExecute<ConcurrentReadableHashMapTest_MultiThreaded4_Test::TestBody()::$_24, void threadedExecute<ConcurrentReadableHashMapTest_MultiThreaded4_Test::TestBody()::$_24>(int, ConcurrentReadableHashMapTest_MultiThreaded4_Test::TestBody()::$_24)::'lambda'()>(int, ConcurrentReadableHashMapTest_MultiThreaded4_Test::TestBody()::$_24, void threadedExecute<ConcurrentReadableHashMapTest_MultiThreaded4_Test::TestBody()::$_24>(int, ConcurrentReadableHashMapTest_MultiThreaded4_Test::TestBody()::$_24)::'lambda'())::'lambda'()> >(void*) (/home/dave/b/u/t/tools/swift/unittests/runtime/./SwiftRuntimeTests+0x262e97)
#6 0x00007ffff7f93432 start_thread (/lib64/libpthread.so.0+0x9432)
#7 0x00007ffff5c8b913 clone (/lib64/libc.so.6+0x101913)

********************

Testing Time: 1.49s
********************
Failing Tests (1):
    Swift-Unit :: runtime/./SwiftRuntimeTests/ConcurrentReadableHashMapTest.MultiThreaded4



@davezarzycki
Contributor

For reasons not worth going into, I don't have good luck running TSAN, but I was able to hack it together for this test. I hope this helps: https://znu.io/tsan.txt

@davezarzycki
Contributor

I haven't studied this design closely, but based on the TSAN output, I found the following code that (at least superficially) is a classic time-of-check versus time-of-use race:

void deallocateFreeListIfSafe() {
  if (ReaderCount.load(std::memory_order_relaxed) == 0)
    deallocateFreeList();
}

Map->incrementReaders();
}

~Snapshot() { Map->decrementReaders(); }
Contributor


If Snapshot is truly a snapshot, then why is the map reader lock being held until the snapshot is destroyed? Shouldn't it be dropped at the end of the constructor?

Contributor Author


The reader count isn't really a reader lock, as writes can still make progress when it's non-zero. It just informs writers whether it's safe to free old allocations. Outstanding snapshots may be pointing to old allocations, so it's not safe to free them until all outstanding snapshots have been destroyed.
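
In other words, something like this simplified sketch (not the actual implementation):

#include <atomic>
#include <cstddef>

struct MapSketch {
  std::atomic<size_t> ReaderCount{0};

  // A Snapshot pins the arrays it points at for its whole lifetime: while
  // any snapshot is alive, writers may still insert, but they will not free
  // superseded allocations.
  struct Snapshot {
    MapSketch *Map;
    explicit Snapshot(MapSketch *map) : Map(map) {
      Map->ReaderCount.fetch_add(1, std::memory_order_acquire);
    }
    ~Snapshot() {
      Map->ReaderCount.fetch_sub(1, std::memory_order_release);
    }
  };
};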

Snapshot(const Snapshot &other)
    : Map(other.Map), Indices(other.Indices), Elements(other.Elements),
      ElementCount(other.ElementCount) {
  Map->incrementReaders();
Contributor


This seems out of order. Shouldn't the reader lock be acquired BEFORE initializing copies of the Map data?

Contributor Author


The snapshot being copied is already holding the reader count above zero, so the order doesn't matter here. Writers only care about zero and non-zero.

@@ -541,6 +546,382 @@ template <class ElemTy> struct ConcurrentReadableArray {
}
};

using llvm::hash_value;

/// A hash table that can be queried without taking any locks. Writes are still
Contributor


This comment seems out of touch with the implementation which is clearly using a reader/writer lock strategy (as opposed to a truly lockless strategy like RCU).

Contributor Author


See previous comment about how this is not really a reader lock, but just controlling whether old allocations get freed.


/// The number of readers currently active, equal to the number of snapshot
/// objects currently alive.
std::atomic<size_t> ReaderCount;
Contributor


Unless something has dramatically changed, no operating system to date supports anywhere close to four billion threads. I think this can safely be a uint32_t for a long time :-)

Contributor Author


That's a good point. I feel weird using atomic integers smaller than a pointer, but that's just me being silly. I could shrink this and the element count to save a word.

@davezarzycki
Contributor

The more I look at this code, the more I think it should just be replaced with a dramatically simplified reader/writer spin lock and a std::vector. If the write scenario becomes a bottleneck, there are various workarounds. It really just depends on the usage patterns.

@mikeash
Contributor Author

mikeash commented Aug 22, 2020

I haven't studied this design closely, but based on the TSAN output, I found the following code that (at least superficially) is a classic time-of-check versus time-of-use race:

void deallocateFreeListIfSafe() {
  if (ReaderCount.load(std::memory_order_relaxed) == 0)
    deallocateFreeList();
}

(Why oh why doesn't GitHub support proper threaded replies on top-level comments?)

The caller holds the write lock, so the free list can't change in here. The danger with freeing items on the free list is if an outstanding snapshot holds a reference to one of those items. Thus, if ReaderCount is non-zero, we don't free anything. If it is zero, then we know that there were no outstanding snapshots at that point in time, and thus nothing with a reference to anything on the free lists.

As you say, we then have a TOCTOU problem, except that there's no way for any new readers to get a reference to anything on the free lists. Because we hold the writer lock, the elements and indices are unchanging while we do this operation. A new snapshot will copy the current indices and elements, but it can't get any of the old ones.
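
Restating that argument as comments on the check itself (annotations mine; the function body is the one quoted above):

void deallocateFreeListIfSafe() {
  // The caller holds the writer lock, so the free list and the currently
  // published arrays cannot change underneath us.
  if (ReaderCount.load(std::memory_order_relaxed) == 0)
    // Zero means no snapshot existed at the instant of the load, so nothing
    // can still reference a free-list entry. A reader arriving after the
    // load is harmless: it can only copy the current, live pointers, never
    // anything already on the free list.
    deallocateFreeList();
}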

As far as simplifying this with a more traditional locked solution, I think this is used often enough to justify the extra complexity (especially once I replace other uses of ConcurrentMap) but it would be interesting to try out a simpler version and see how well it works in practice. We had a locked std::vector storing the list of sections to search in the protocol conformance code and a few other places, which I replaced with ConcurrentReadableArray a while back. Performance aside, there were some rare and tricky deadlocks between readers and writers due to interactions with other parts of the system. I don't THINK any of that would apply here, but it has me a bit wary.

@mikeash
Contributor Author

mikeash commented Aug 22, 2020

I believe the failure is fixed by this new PR: https://github.com/apple/swift/pull/33595/files.

jckarter added a commit to jckarter/swift that referenced this pull request Aug 28, 2020
With the improvements to protocol casting performance from swiftlang#33487, the demangling overhead is
hopefully manageable for most people using this flag today.