-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Package collections: trie for search #3090
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@swift-ci please smoke test |
} | ||
} | ||
|
||
group.notify(queue: self.queue) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could also use wait?
|
||
group.notify(queue: self.queue) { | ||
self.cacheLock.withLock { | ||
self.cache[collection.identifier] = collection |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for my understanding, so this will basically update the in-memory copy with the latest doc index? I think the doc index are the Trie instance variables, so not 100% clear to me how this is used
nm, I see now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something to consider is if we should store this info inside the trie, for it would simplify de/serialization I think. I need to study SQLitePackageCollectionsStorage
some more, but the direction I am leaning towards is embed this inside SQLitePackageCollectionsStorage
rather than making them separate strategies.
Sources/PackageCollections/Search/InMemoryPackageCollectionsSearch.swift
Show resolved
Hide resolved
|
||
import PackageModel | ||
|
||
final class Trie<Document: Hashable> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be a struct to make sure its thread safe / make sure its otherwise thread-safe
} | ||
} | ||
|
||
private final class TrieNode<T: Hashable, Document: Hashable> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can try making into a struct with mutating functions that may safe the needs in locks, not sure will work but worth trying
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
super cool!
Have we considered taking advantage of more existing SQLite functionality as a way to optimize these operations? FTS3 / FTS4, for example, would likely provide better and faster full-text search than anything we'd want to build from scratch. SQLite 3.31.0 introduces generated columns, which makes it convenient to index on inserted JSON objects, or we could also destructure blobs into rows and columns ourselves. |
Motivation: Currently to support search for package collections API we read and deserialize collection blobs from SQLite then perform string matchings on individual properties in memory (e.g., `package.summary.contains("foobar")`). This can be optimized. Modifications: Introduce `InMemoryPackageCollectionsSearch`, which uses `Trie`s as underlying search indexes. `InMemoryPackageCollectionsSearch` contains the same `findPackage`, `searchPackages`, and `searchTargets` method signatures as `SQLitePackageCollectionsStorage`. This PR only intends to show `InMemoryPackageCollectionsSearch` working standalone, and the next step is to weave `InMemoryPackageCollectionsSearch` into the existing business logic. This might mean using `InMemoryPackageCollectionsSearch` instead of `SQLitePackageCollectionsStorage` for search, or combining the two. We will also need to be able to de/serialize the indexes (tries) to avoid rebuilding them each time and to reduce "warm up" time. Result: A search index for package collections API that based on performance tests, yield better timings that current search strategy.
Yes, we've considered FTS. However, given the collection format is unstable, I think we should avoid tying the database schema too closely to it. This is why in the database collections are stored as blobs rather than broken into rows and columns (or even tables). Also, since the contents of package collections change over time, we have to make sure the database gets updated accordingly, and IMO updating/deleting rows in a FTS table is not as simple as it should be (one has to deal with I agree an existing solution such as FTS would offer much more features and capabilities than something we build from scratch, but if our main concerns are search performance and maintainability, trie is a reasonable option. Plus, I don't think we've ruled out FTS completely; it's just not the best option at this time. |
Thanks for explaining your rationale. For what it's worth, I think you could still take advantage of SQLite and FTS even with an unstable JSON format by using an in-memory database or temporary virtual table. Beyond any performance benefits there, I'd also encourage you to consider the UX benefits of leveraging FTS. Features like locale-aware tokenization, stemming, case / diacritic folding, TF-IDF salience, and search operators can make a huge difference in how useful search functionality is for end-users. |
We would still use SQLite for storage, but I guess the point of contention is how/where search queries will be done. Temporary virtual table sounds useful. I will look into it. |
This is an alternate to swiftlang#3090 but is a complete solution. Motivation: Currently to support search for package collections API we read and deserialize collection blobs from SQLite then perform string matchings on individual properties in memory (e.g., `package.summary.contains("foobar")`). This can be optimized. Modifications: Use SQLite FTS--define FTS virtual tables for packages and targets, and update implementation for `findPackage`, `searchPackages`, and `searchTargets` methods of `SQLitePackageCollectionsStorage`. Without optimization, `PackageCollectionsTests.testPackageSearchPerformance` and `testTargetsSearchPerformance` take about ~400ms to run on my local machine. With FTS, `testPackageSearchPerformance` takes ~40ms and `testTargetsSearchPerformance` ~50ms. The `testSearchTargetsPerformance` in `InMemoryPackageCollectionsSearchTests` (swiftlang#3090) yields result in ~10ms, though it queries the trie directly without going through the PackageCollections API layer. Since target search is either exact or prefix match and doesn't tokenize the query, and given the good results I saw in `InMemoryPackageCollectionsSearchTests`, this implementation includes a trie on top of SQLite FTS for target search. The trie is in-memory and loads from SQLite FTS during initialization. `put`/`remove` updates both the SQLite FTS and trie. The improvement with using trie varies--`testTargetsSearchPerformance` takes between ~15-50ms to complete. `put` now takes longer to complete because of the search index updates. Result: Better search performance.
A different take: #3108 |
Closing this in favor of #3108 |
This is an alternate to swiftlang#3090 but is a complete solution. Motivation: Currently to support search for package collections API we read and deserialize collection blobs from SQLite then perform string matchings on individual properties in memory (e.g., `package.summary.contains("foobar")`). This can be optimized. Modifications: Use SQLite FTS--define FTS virtual tables for packages and targets, and update implementation for `findPackage`, `searchPackages`, and `searchTargets` methods of `SQLitePackageCollectionsStorage`. Without optimization, `PackageCollectionsTests.testPackageSearchPerformance` and `testTargetsSearchPerformance` take about ~400ms to run on my local machine. With FTS, `testPackageSearchPerformance` takes ~40ms and `testTargetsSearchPerformance` ~50ms. The `testSearchTargetsPerformance` in `InMemoryPackageCollectionsSearchTests` (swiftlang#3090) yields result in ~10ms, though it queries the trie directly without going through the PackageCollections API layer. Since target search is either exact or prefix match and doesn't tokenize the query, and given the good results I saw in `InMemoryPackageCollectionsSearchTests`, this implementation includes a trie on top of SQLite FTS for target search. The trie is in-memory and loads from SQLite FTS during initialization. `put`/`remove` updates both the SQLite FTS and trie. The improvement with using trie varies--`testTargetsSearchPerformance` takes between ~15-50ms to complete. `put` now takes longer to complete because of the search index updates. Result: Better search performance.
This is an alternate to swiftlang#3090 but is a complete solution. Motivation: Currently to support search for package collections API we read and deserialize collection blobs from SQLite then perform string matchings on individual properties in memory (e.g., `package.summary.contains("foobar")`). This can be optimized. Modifications: Use SQLite FTS--define FTS virtual tables for packages and targets, and update implementation for `findPackage`, `searchPackages`, and `searchTargets` methods of `SQLitePackageCollectionsStorage`. Without optimization, `PackageCollectionsTests.testPackageSearchPerformance` and `testTargetsSearchPerformance` take about ~400ms to run on my local machine. With FTS, `testPackageSearchPerformance` takes ~40ms and `testTargetsSearchPerformance` ~50ms. The `testSearchTargetsPerformance` in `InMemoryPackageCollectionsSearchTests` (swiftlang#3090) yields result in ~10ms, though it queries the trie directly without going through the PackageCollections API layer. Since target search is either exact or prefix match and doesn't tokenize the query, and given the good results I saw in `InMemoryPackageCollectionsSearchTests`, this implementation includes a trie on top of SQLite FTS for target search. The trie is in-memory and loads from SQLite FTS during initialization. `put`/`remove` updates both the SQLite FTS and trie. The improvement with using trie varies--`testTargetsSearchPerformance` takes between ~15-50ms to complete. `put` now takes longer to complete because of the search index updates. Result: Better search performance.
This is an alternate to swiftlang#3090 but is a complete solution. Motivation: Currently to support search for package collections API we read and deserialize collection blobs from SQLite then perform string matchings on individual properties in memory (e.g., `package.summary.contains("foobar")`). This can be optimized. Modifications: Use SQLite FTS--define FTS virtual tables for packages and targets, and update implementation for `findPackage`, `searchPackages`, and `searchTargets` methods of `SQLitePackageCollectionsStorage`. Without optimization, `PackageCollectionsTests.testPackageSearchPerformance` and `testTargetsSearchPerformance` take about ~400ms to run on my local machine. With FTS, `testPackageSearchPerformance` takes ~40ms and `testTargetsSearchPerformance` ~50ms. The `testSearchTargetsPerformance` in `InMemoryPackageCollectionsSearchTests` (swiftlang#3090) yields result in ~10ms, though it queries the trie directly without going through the PackageCollections API layer. Since target search is either exact or prefix match and doesn't tokenize the query, and given the good results I saw in `InMemoryPackageCollectionsSearchTests`, this implementation includes a trie on top of SQLite FTS for target search. The trie is in-memory and loads from SQLite FTS during initialization. `put`/`remove` updates both the SQLite FTS and trie. The improvement with using trie varies--`testTargetsSearchPerformance` takes between ~15-50ms to complete. `put` now takes longer to complete because of the search index updates. Result: Better search performance.
This is an alternate to swiftlang#3090 but is a complete solution. Motivation: Currently to support search for package collections API we read and deserialize collection blobs from SQLite then perform string matchings on individual properties in memory (e.g., `package.summary.contains("foobar")`). This can be optimized. Modifications: Use SQLite FTS--define FTS virtual tables for packages and targets, and update implementation for `findPackage`, `searchPackages`, and `searchTargets` methods of `SQLitePackageCollectionsStorage`. Without optimization, `PackageCollectionsTests.testPackageSearchPerformance` and `testTargetsSearchPerformance` take about ~400ms to run on my local machine. With FTS, `testPackageSearchPerformance` takes ~40ms and `testTargetsSearchPerformance` ~50ms. The `testSearchTargetsPerformance` in `InMemoryPackageCollectionsSearchTests` (swiftlang#3090) yields result in ~10ms, though it queries the trie directly without going through the PackageCollections API layer. Since target search is either exact or prefix match and doesn't tokenize the query, and given the good results I saw in `InMemoryPackageCollectionsSearchTests`, this implementation includes a trie on top of SQLite FTS for target search. The trie is in-memory and loads from SQLite FTS during initialization. `put`/`remove` updates both the SQLite FTS and trie. The improvement with using trie varies--`testTargetsSearchPerformance` takes between ~15-50ms to complete. `put` now takes longer to complete because of the search index updates. Result: Better search performance.
This is an alternate to swiftlang#3090 but is a complete solution. Motivation: Currently to support search for package collections API we read and deserialize collection blobs from SQLite then perform string matchings on individual properties in memory (e.g., `package.summary.contains("foobar")`). This can be optimized. Modifications: Use SQLite FTS--define FTS virtual tables for packages and targets, and update implementation for `findPackage`, `searchPackages`, and `searchTargets` methods of `SQLitePackageCollectionsStorage`. Without optimization, `PackageCollectionsTests.testPackageSearchPerformance` and `testTargetsSearchPerformance` take about ~400ms to run on my local machine. With FTS, `testPackageSearchPerformance` takes ~40ms and `testTargetsSearchPerformance` ~50ms. The `testSearchTargetsPerformance` in `InMemoryPackageCollectionsSearchTests` (swiftlang#3090) yields result in ~10ms, though it queries the trie directly without going through the PackageCollections API layer. Since target search is either exact or prefix match and doesn't tokenize the query, and given the good results I saw in `InMemoryPackageCollectionsSearchTests`, this implementation includes a trie on top of SQLite FTS for target search. The trie is in-memory and loads from SQLite FTS during initialization. `put`/`remove` updates both the SQLite FTS and trie. The improvement with using trie varies--`testTargetsSearchPerformance` takes between ~15-50ms to complete. `put` now takes longer to complete because of the search index updates. Result: Better search performance.
This is an alternate to swiftlang#3090 but is a complete solution. Motivation: Currently to support search for package collections API we read and deserialize collection blobs from SQLite then perform string matchings on individual properties in memory (e.g., `package.summary.contains("foobar")`). This can be optimized. Modifications: Use SQLite FTS--define FTS virtual tables for packages and targets, and update implementation for `findPackage`, `searchPackages`, and `searchTargets` methods of `SQLitePackageCollectionsStorage`. Without optimization, `PackageCollectionsTests.testPackageSearchPerformance` and `testTargetsSearchPerformance` take about ~400ms to run on my local machine. With FTS, `testPackageSearchPerformance` takes ~40ms and `testTargetsSearchPerformance` ~50ms. The `testSearchTargetsPerformance` in `InMemoryPackageCollectionsSearchTests` (swiftlang#3090) yields result in ~10ms, though it queries the trie directly without going through the PackageCollections API layer. Since target search is either exact or prefix match and doesn't tokenize the query, and given the good results I saw in `InMemoryPackageCollectionsSearchTests`, this implementation includes a trie on top of SQLite FTS for target search. The trie is in-memory and loads from SQLite FTS during initialization. `put`/`remove` updates both the SQLite FTS and trie. The improvement with using trie varies--`testTargetsSearchPerformance` takes between ~15-50ms to complete. `put` now takes longer to complete because of the search index updates. Result: Better search performance.
This is an alternate to swiftlang#3090 but is a complete solution. Motivation: Currently to support search for package collections API we read and deserialize collection blobs from SQLite then perform string matchings on individual properties in memory (e.g., `package.summary.contains("foobar")`). This can be optimized. Modifications: Use SQLite FTS--define FTS virtual tables for packages and targets, and update implementation for `findPackage`, `searchPackages`, and `searchTargets` methods of `SQLitePackageCollectionsStorage`. Without optimization, `PackageCollectionsTests.testPackageSearchPerformance` and `testTargetsSearchPerformance` take about ~400ms to run on my local machine. With FTS, `testPackageSearchPerformance` takes ~40ms and `testTargetsSearchPerformance` ~50ms. The `testSearchTargetsPerformance` in `InMemoryPackageCollectionsSearchTests` (swiftlang#3090) yields result in ~10ms, though it queries the trie directly without going through the PackageCollections API layer. Since target search is either exact or prefix match and doesn't tokenize the query, and given the good results I saw in `InMemoryPackageCollectionsSearchTests`, this implementation includes a trie on top of SQLite FTS for target search. The trie is in-memory and loads from SQLite FTS during initialization. `put`/`remove` updates both the SQLite FTS and trie. The improvement with using trie varies--`testTargetsSearchPerformance` takes between ~15-50ms to complete. `put` now takes longer to complete because of the search index updates. Result: Better search performance.
This is an alternate to #3090 but is a complete solution. Motivation: Currently to support search for package collections API we read and deserialize collection blobs from SQLite then perform string matchings on individual properties in memory (e.g., `package.summary.contains("foobar")`). This can be optimized. Modifications: Use SQLite FTS--define FTS virtual tables for packages and targets, and update implementation for `findPackage`, `searchPackages`, and `searchTargets` methods of `SQLitePackageCollectionsStorage`. Without optimization, `PackageCollectionsTests.testPackageSearchPerformance` and `testTargetsSearchPerformance` take about ~400ms to run on my local machine. With FTS, `testPackageSearchPerformance` takes ~40ms and `testTargetsSearchPerformance` ~50ms. The `testSearchTargetsPerformance` in `InMemoryPackageCollectionsSearchTests` (#3090) yields result in ~10ms, though it queries the trie directly without going through the PackageCollections API layer. Since target search is either exact or prefix match and doesn't tokenize the query, and given the good results I saw in `InMemoryPackageCollectionsSearchTests`, this implementation includes a trie on top of SQLite FTS for target search. The trie is in-memory and loads from SQLite FTS during initialization. `put`/`remove` updates both the SQLite FTS and trie. The improvement with using trie varies--`testTargetsSearchPerformance` takes between ~15-50ms to complete. `put` now takes longer to complete because of the search index updates. Result: Better search performance.
This is an alternate to swiftlang#3090 but is a complete solution. Motivation: Currently to support search for package collections API we read and deserialize collection blobs from SQLite then perform string matchings on individual properties in memory (e.g., `package.summary.contains("foobar")`). This can be optimized. Modifications: Use SQLite FTS--define FTS virtual tables for packages and targets, and update implementation for `findPackage`, `searchPackages`, and `searchTargets` methods of `SQLitePackageCollectionsStorage`. Without optimization, `PackageCollectionsTests.testPackageSearchPerformance` and `testTargetsSearchPerformance` take about ~400ms to run on my local machine. With FTS, `testPackageSearchPerformance` takes ~40ms and `testTargetsSearchPerformance` ~50ms. The `testSearchTargetsPerformance` in `InMemoryPackageCollectionsSearchTests` (swiftlang#3090) yields result in ~10ms, though it queries the trie directly without going through the PackageCollections API layer. Since target search is either exact or prefix match and doesn't tokenize the query, and given the good results I saw in `InMemoryPackageCollectionsSearchTests`, this implementation includes a trie on top of SQLite FTS for target search. The trie is in-memory and loads from SQLite FTS during initialization. `put`/`remove` updates both the SQLite FTS and trie. The improvement with using trie varies--`testTargetsSearchPerformance` takes between ~15-50ms to complete. `put` now takes longer to complete because of the search index updates. Result: Better search performance.
* Package collections: improve search performance This is an alternate to #3090 but is a complete solution. Motivation: Currently to support search for package collections API we read and deserialize collection blobs from SQLite then perform string matchings on individual properties in memory (e.g., `package.summary.contains("foobar")`). This can be optimized. Modifications: Use SQLite FTS--define FTS virtual tables for packages and targets, and update implementation for `findPackage`, `searchPackages`, and `searchTargets` methods of `SQLitePackageCollectionsStorage`. Without optimization, `PackageCollectionsTests.testPackageSearchPerformance` and `testTargetsSearchPerformance` take about ~400ms to run on my local machine. With FTS, `testPackageSearchPerformance` takes ~40ms and `testTargetsSearchPerformance` ~50ms. The `testSearchTargetsPerformance` in `InMemoryPackageCollectionsSearchTests` (#3090) yields result in ~10ms, though it queries the trie directly without going through the PackageCollections API layer. Since target search is either exact or prefix match and doesn't tokenize the query, and given the good results I saw in `InMemoryPackageCollectionsSearchTests`, this implementation includes a trie on top of SQLite FTS for target search. The trie is in-memory and loads from SQLite FTS during initialization. `put`/`remove` updates both the SQLite FTS and trie. The improvement with using trie varies--`testTargetsSearchPerformance` takes between ~15-50ms to complete. `put` now takes longer to complete because of the search index updates. Result: Better search performance. * Fallback to slow path, which does not use FTS, in case FTS4 is not available
Motivation:
Currently to support search for package collections API we read and deserialize collection blobs from SQLite then perform string matchings on individual properties in memory (e.g.,
package.summary.contains("foobar")
). This can be optimized.Modifications:
Introduce
InMemoryPackageCollectionsSearch
, which usesTrie
s as underlying search indexes.InMemoryPackageCollectionsSearch
contains the samefindPackage
,searchPackages
, andsearchTargets
method signatures asSQLitePackageCollectionsStorage
. This PR only intends to showInMemoryPackageCollectionsSearch
working standalone, and the next step is to weaveInMemoryPackageCollectionsSearch
into the existing business logic. This might mean usingInMemoryPackageCollectionsSearch
instead ofSQLitePackageCollectionsStorage
for search, or combining the two. We will also need to be able to de/serialize the indexes (tries) to avoid rebuilding them each time and to reduce "warm up" time.Result:
A search index for package collections API that based on performance tests, yield better timings that current search strategy.