Skip to content

Package collections: trie for search #3090

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

yim-lee
Copy link
Contributor

@yim-lee yim-lee commented Dec 5, 2020

Motivation:
Currently to support search for package collections API we read and deserialize collection blobs from SQLite then perform string matchings on individual properties in memory (e.g., package.summary.contains("foobar")). This can be optimized.

Modifications:
Introduce InMemoryPackageCollectionsSearch, which uses Tries as underlying search indexes.

InMemoryPackageCollectionsSearch contains the same findPackage, searchPackages, and searchTargets method signatures as
SQLitePackageCollectionsStorage. This PR only intends to show InMemoryPackageCollectionsSearch working standalone, and the next step is to weave InMemoryPackageCollectionsSearch into the existing business logic. This might mean using InMemoryPackageCollectionsSearch instead of SQLitePackageCollectionsStorage for search, or combining the two. We will also need to be able to de/serialize the indexes (tries) to avoid rebuilding them each time and to reduce "warm up" time.

Result:
A search index for package collections API that based on performance tests, yield better timings that current search strategy.

@yim-lee
Copy link
Contributor Author

yim-lee commented Dec 5, 2020

@swift-ci please smoke test

}
}

group.notify(queue: self.queue) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could also use wait?


group.notify(queue: self.queue) {
self.cacheLock.withLock {
self.cache[collection.identifier] = collection
Copy link
Contributor

@tomerd tomerd Dec 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for my understanding, so this will basically update the in-memory copy with the latest doc index? I think the doc index are the Trie instance variables, so not 100% clear to me how this is used

nm, I see now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something to consider is if we should store this info inside the trie, for it would simplify de/serialization I think. I need to study SQLitePackageCollectionsStorage some more, but the direction I am leaning towards is embed this inside SQLitePackageCollectionsStorage rather than making them separate strategies.


import PackageModel

final class Trie<Document: Hashable> {
Copy link
Contributor

@tomerd tomerd Dec 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be a struct to make sure its thread safe / make sure its otherwise thread-safe

}
}

private final class TrieNode<T: Hashable, Document: Hashable> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can try making into a struct with mutating functions that may safe the needs in locks, not sure will work but worth trying

Copy link
Contributor

@tomerd tomerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super cool!

@mattt
Copy link
Contributor

mattt commented Dec 7, 2020

Motivation:
Currently to support search for package collections API we read and deserialize collection blobs from SQLite then perform string matchings on individual properties in memory (e.g., package.summary.contains("foobar")). This can be optimized.

Have we considered taking advantage of more existing SQLite functionality as a way to optimize these operations?

FTS3 / FTS4, for example, would likely provide better and faster full-text search than anything we'd want to build from scratch. SQLite 3.31.0 introduces generated columns, which makes it convenient to index on inserted JSON objects, or we could also destructure blobs into rows and columns ourselves.

Motivation:
Currently to support search for package collections API we read and deserialize collection blobs from SQLite then perform string matchings on individual properties in memory (e.g., `package.summary.contains("foobar")`). This can be optimized.

Modifications:
Introduce `InMemoryPackageCollectionsSearch`, which uses `Trie`s as underlying search indexes.

`InMemoryPackageCollectionsSearch` contains the same `findPackage`, `searchPackages`, and `searchTargets` method signatures as
`SQLitePackageCollectionsStorage`. This PR only intends to show `InMemoryPackageCollectionsSearch` working standalone, and the next step is to weave `InMemoryPackageCollectionsSearch` into the existing business logic. This might mean using `InMemoryPackageCollectionsSearch` instead of `SQLitePackageCollectionsStorage` for search, or combining the two. We will also need to be able to de/serialize the indexes (tries) to avoid rebuilding them each time and to reduce "warm up" time.

Result:
A search index for package collections API that based on performance tests, yield better timings that current search strategy.
@yim-lee
Copy link
Contributor Author

yim-lee commented Dec 9, 2020

Have we considered taking advantage of more existing SQLite functionality as a way to optimize these operations?

Yes, we've considered FTS. However, given the collection format is unstable, I think we should avoid tying the database schema too closely to it. This is why in the database collections are stored as blobs rather than broken into rows and columns (or even tables). Also, since the contents of package collections change over time, we have to make sure the database gets updated accordingly, and IMO updating/deleting rows in a FTS table is not as simple as it should be (one has to deal with rowid).

I agree an existing solution such as FTS would offer much more features and capabilities than something we build from scratch, but if our main concerns are search performance and maintainability, trie is a reasonable option. Plus, I don't think we've ruled out FTS completely; it's just not the best option at this time.

@mattt
Copy link
Contributor

mattt commented Dec 10, 2020

Have we considered taking advantage of more existing SQLite functionality as a way to optimize these operations?

Yes, we've considered FTS. However, given the collection format is unstable, I think we should avoid tying the database schema too closely to it. This is why in the database collections are stored as blobs rather than broken into rows and columns (or even tables). Also, since the contents of package collections change over time, we have to make sure the database gets updated accordingly, and IMO updating/deleting rows in a FTS table is not as simple as it should be (one has to deal with rowid).

Thanks for explaining your rationale. For what it's worth, I think you could still take advantage of SQLite and FTS even with an unstable JSON format by using an in-memory database or temporary virtual table.

Beyond any performance benefits there, I'd also encourage you to consider the UX benefits of leveraging FTS. Features like locale-aware tokenization, stemming, case / diacritic folding, TF-IDF salience, and search operators can make a huge difference in how useful search functionality is for end-users.

@yim-lee
Copy link
Contributor Author

yim-lee commented Dec 10, 2020

I think you could still take advantage of SQLite and FTS even with an unstable JSON format by using an in-memory database or temporary virtual table.

We would still use SQLite for storage, but I guess the point of contention is how/where search queries will be done.

Temporary virtual table sounds useful. I will look into it.

yim-lee added a commit to yim-lee/swift-package-manager that referenced this pull request Dec 11, 2020
This is an alternate to swiftlang#3090 but is a complete solution.

Motivation:
Currently to support search for package collections API we read and deserialize collection blobs from SQLite then perform string matchings on individual properties in memory (e.g., `package.summary.contains("foobar")`). This can be optimized.

Modifications:
Use SQLite FTS--define FTS virtual tables for packages and targets, and update implementation for `findPackage`, `searchPackages`, and `searchTargets` methods of `SQLitePackageCollectionsStorage`.

Without optimization, `PackageCollectionsTests.testPackageSearchPerformance` and `testTargetsSearchPerformance` take about ~400ms to run on my local machine.

With FTS, `testPackageSearchPerformance` takes ~40ms and `testTargetsSearchPerformance` ~50ms.

The `testSearchTargetsPerformance` in `InMemoryPackageCollectionsSearchTests` (swiftlang#3090) yields result in ~10ms, though it queries the trie directly without going through the PackageCollections API layer.

Since target search is either exact or prefix match and doesn't tokenize the query, and given the good results I saw in `InMemoryPackageCollectionsSearchTests`, this implementation includes a trie on top of SQLite FTS for target search. The trie is in-memory and loads from SQLite FTS during initialization. `put`/`remove` updates both the SQLite FTS and trie. The improvement with using trie varies--`testTargetsSearchPerformance` takes between ~15-50ms to complete.

`put` now takes longer to complete because of the search index updates.

Result:
Better search performance.
@yim-lee
Copy link
Contributor Author

yim-lee commented Dec 11, 2020

A different take: #3108

@yim-lee
Copy link
Contributor Author

yim-lee commented Dec 14, 2020

Closing this in favor of #3108

@yim-lee yim-lee closed this Dec 14, 2020
yim-lee added a commit to yim-lee/swift-package-manager that referenced this pull request Dec 15, 2020
This is an alternate to swiftlang#3090 but is a complete solution.

Motivation:
Currently to support search for package collections API we read and deserialize collection blobs from SQLite then perform string matchings on individual properties in memory (e.g., `package.summary.contains("foobar")`). This can be optimized.

Modifications:
Use SQLite FTS--define FTS virtual tables for packages and targets, and update implementation for `findPackage`, `searchPackages`, and `searchTargets` methods of `SQLitePackageCollectionsStorage`.

Without optimization, `PackageCollectionsTests.testPackageSearchPerformance` and `testTargetsSearchPerformance` take about ~400ms to run on my local machine.

With FTS, `testPackageSearchPerformance` takes ~40ms and `testTargetsSearchPerformance` ~50ms.

The `testSearchTargetsPerformance` in `InMemoryPackageCollectionsSearchTests` (swiftlang#3090) yields result in ~10ms, though it queries the trie directly without going through the PackageCollections API layer.

Since target search is either exact or prefix match and doesn't tokenize the query, and given the good results I saw in `InMemoryPackageCollectionsSearchTests`, this implementation includes a trie on top of SQLite FTS for target search. The trie is in-memory and loads from SQLite FTS during initialization. `put`/`remove` updates both the SQLite FTS and trie. The improvement with using trie varies--`testTargetsSearchPerformance` takes between ~15-50ms to complete.

`put` now takes longer to complete because of the search index updates.

Result:
Better search performance.
yim-lee added a commit to yim-lee/swift-package-manager that referenced this pull request Dec 22, 2020
This is an alternate to swiftlang#3090 but is a complete solution.

Motivation:
Currently to support search for package collections API we read and deserialize collection blobs from SQLite then perform string matchings on individual properties in memory (e.g., `package.summary.contains("foobar")`). This can be optimized.

Modifications:
Use SQLite FTS--define FTS virtual tables for packages and targets, and update implementation for `findPackage`, `searchPackages`, and `searchTargets` methods of `SQLitePackageCollectionsStorage`.

Without optimization, `PackageCollectionsTests.testPackageSearchPerformance` and `testTargetsSearchPerformance` take about ~400ms to run on my local machine.

With FTS, `testPackageSearchPerformance` takes ~40ms and `testTargetsSearchPerformance` ~50ms.

The `testSearchTargetsPerformance` in `InMemoryPackageCollectionsSearchTests` (swiftlang#3090) yields result in ~10ms, though it queries the trie directly without going through the PackageCollections API layer.

Since target search is either exact or prefix match and doesn't tokenize the query, and given the good results I saw in `InMemoryPackageCollectionsSearchTests`, this implementation includes a trie on top of SQLite FTS for target search. The trie is in-memory and loads from SQLite FTS during initialization. `put`/`remove` updates both the SQLite FTS and trie. The improvement with using trie varies--`testTargetsSearchPerformance` takes between ~15-50ms to complete.

`put` now takes longer to complete because of the search index updates.

Result:
Better search performance.
yim-lee added a commit to yim-lee/swift-package-manager that referenced this pull request Jan 6, 2021
This is an alternate to swiftlang#3090 but is a complete solution.

Motivation:
Currently to support search for package collections API we read and deserialize collection blobs from SQLite then perform string matchings on individual properties in memory (e.g., `package.summary.contains("foobar")`). This can be optimized.

Modifications:
Use SQLite FTS--define FTS virtual tables for packages and targets, and update implementation for `findPackage`, `searchPackages`, and `searchTargets` methods of `SQLitePackageCollectionsStorage`.

Without optimization, `PackageCollectionsTests.testPackageSearchPerformance` and `testTargetsSearchPerformance` take about ~400ms to run on my local machine.

With FTS, `testPackageSearchPerformance` takes ~40ms and `testTargetsSearchPerformance` ~50ms.

The `testSearchTargetsPerformance` in `InMemoryPackageCollectionsSearchTests` (swiftlang#3090) yields result in ~10ms, though it queries the trie directly without going through the PackageCollections API layer.

Since target search is either exact or prefix match and doesn't tokenize the query, and given the good results I saw in `InMemoryPackageCollectionsSearchTests`, this implementation includes a trie on top of SQLite FTS for target search. The trie is in-memory and loads from SQLite FTS during initialization. `put`/`remove` updates both the SQLite FTS and trie. The improvement with using trie varies--`testTargetsSearchPerformance` takes between ~15-50ms to complete.

`put` now takes longer to complete because of the search index updates.

Result:
Better search performance.
yim-lee added a commit to yim-lee/swift-package-manager that referenced this pull request Jan 8, 2021
This is an alternate to swiftlang#3090 but is a complete solution.

Motivation:
Currently to support search for package collections API we read and deserialize collection blobs from SQLite then perform string matchings on individual properties in memory (e.g., `package.summary.contains("foobar")`). This can be optimized.

Modifications:
Use SQLite FTS--define FTS virtual tables for packages and targets, and update implementation for `findPackage`, `searchPackages`, and `searchTargets` methods of `SQLitePackageCollectionsStorage`.

Without optimization, `PackageCollectionsTests.testPackageSearchPerformance` and `testTargetsSearchPerformance` take about ~400ms to run on my local machine.

With FTS, `testPackageSearchPerformance` takes ~40ms and `testTargetsSearchPerformance` ~50ms.

The `testSearchTargetsPerformance` in `InMemoryPackageCollectionsSearchTests` (swiftlang#3090) yields result in ~10ms, though it queries the trie directly without going through the PackageCollections API layer.

Since target search is either exact or prefix match and doesn't tokenize the query, and given the good results I saw in `InMemoryPackageCollectionsSearchTests`, this implementation includes a trie on top of SQLite FTS for target search. The trie is in-memory and loads from SQLite FTS during initialization. `put`/`remove` updates both the SQLite FTS and trie. The improvement with using trie varies--`testTargetsSearchPerformance` takes between ~15-50ms to complete.

`put` now takes longer to complete because of the search index updates.

Result:
Better search performance.
yim-lee added a commit to yim-lee/swift-package-manager that referenced this pull request Jan 11, 2021
This is an alternate to swiftlang#3090 but is a complete solution.

Motivation:
Currently to support search for package collections API we read and deserialize collection blobs from SQLite then perform string matchings on individual properties in memory (e.g., `package.summary.contains("foobar")`). This can be optimized.

Modifications:
Use SQLite FTS--define FTS virtual tables for packages and targets, and update implementation for `findPackage`, `searchPackages`, and `searchTargets` methods of `SQLitePackageCollectionsStorage`.

Without optimization, `PackageCollectionsTests.testPackageSearchPerformance` and `testTargetsSearchPerformance` take about ~400ms to run on my local machine.

With FTS, `testPackageSearchPerformance` takes ~40ms and `testTargetsSearchPerformance` ~50ms.

The `testSearchTargetsPerformance` in `InMemoryPackageCollectionsSearchTests` (swiftlang#3090) yields result in ~10ms, though it queries the trie directly without going through the PackageCollections API layer.

Since target search is either exact or prefix match and doesn't tokenize the query, and given the good results I saw in `InMemoryPackageCollectionsSearchTests`, this implementation includes a trie on top of SQLite FTS for target search. The trie is in-memory and loads from SQLite FTS during initialization. `put`/`remove` updates both the SQLite FTS and trie. The improvement with using trie varies--`testTargetsSearchPerformance` takes between ~15-50ms to complete.

`put` now takes longer to complete because of the search index updates.

Result:
Better search performance.
yim-lee added a commit to yim-lee/swift-package-manager that referenced this pull request Jan 19, 2021
This is an alternate to swiftlang#3090 but is a complete solution.

Motivation:
Currently to support search for package collections API we read and deserialize collection blobs from SQLite then perform string matchings on individual properties in memory (e.g., `package.summary.contains("foobar")`). This can be optimized.

Modifications:
Use SQLite FTS--define FTS virtual tables for packages and targets, and update implementation for `findPackage`, `searchPackages`, and `searchTargets` methods of `SQLitePackageCollectionsStorage`.

Without optimization, `PackageCollectionsTests.testPackageSearchPerformance` and `testTargetsSearchPerformance` take about ~400ms to run on my local machine.

With FTS, `testPackageSearchPerformance` takes ~40ms and `testTargetsSearchPerformance` ~50ms.

The `testSearchTargetsPerformance` in `InMemoryPackageCollectionsSearchTests` (swiftlang#3090) yields result in ~10ms, though it queries the trie directly without going through the PackageCollections API layer.

Since target search is either exact or prefix match and doesn't tokenize the query, and given the good results I saw in `InMemoryPackageCollectionsSearchTests`, this implementation includes a trie on top of SQLite FTS for target search. The trie is in-memory and loads from SQLite FTS during initialization. `put`/`remove` updates both the SQLite FTS and trie. The improvement with using trie varies--`testTargetsSearchPerformance` takes between ~15-50ms to complete.

`put` now takes longer to complete because of the search index updates.

Result:
Better search performance.
yim-lee added a commit to yim-lee/swift-package-manager that referenced this pull request Jan 20, 2021
This is an alternate to swiftlang#3090 but is a complete solution.

Motivation:
Currently to support search for package collections API we read and deserialize collection blobs from SQLite then perform string matchings on individual properties in memory (e.g., `package.summary.contains("foobar")`). This can be optimized.

Modifications:
Use SQLite FTS--define FTS virtual tables for packages and targets, and update implementation for `findPackage`, `searchPackages`, and `searchTargets` methods of `SQLitePackageCollectionsStorage`.

Without optimization, `PackageCollectionsTests.testPackageSearchPerformance` and `testTargetsSearchPerformance` take about ~400ms to run on my local machine.

With FTS, `testPackageSearchPerformance` takes ~40ms and `testTargetsSearchPerformance` ~50ms.

The `testSearchTargetsPerformance` in `InMemoryPackageCollectionsSearchTests` (swiftlang#3090) yields result in ~10ms, though it queries the trie directly without going through the PackageCollections API layer.

Since target search is either exact or prefix match and doesn't tokenize the query, and given the good results I saw in `InMemoryPackageCollectionsSearchTests`, this implementation includes a trie on top of SQLite FTS for target search. The trie is in-memory and loads from SQLite FTS during initialization. `put`/`remove` updates both the SQLite FTS and trie. The improvement with using trie varies--`testTargetsSearchPerformance` takes between ~15-50ms to complete.

`put` now takes longer to complete because of the search index updates.

Result:
Better search performance.
yim-lee added a commit that referenced this pull request Jan 20, 2021
This is an alternate to #3090 but is a complete solution.

Motivation:
Currently to support search for package collections API we read and deserialize collection blobs from SQLite then perform string matchings on individual properties in memory (e.g., `package.summary.contains("foobar")`). This can be optimized.

Modifications:
Use SQLite FTS--define FTS virtual tables for packages and targets, and update implementation for `findPackage`, `searchPackages`, and `searchTargets` methods of `SQLitePackageCollectionsStorage`.

Without optimization, `PackageCollectionsTests.testPackageSearchPerformance` and `testTargetsSearchPerformance` take about ~400ms to run on my local machine.

With FTS, `testPackageSearchPerformance` takes ~40ms and `testTargetsSearchPerformance` ~50ms.

The `testSearchTargetsPerformance` in `InMemoryPackageCollectionsSearchTests` (#3090) yields result in ~10ms, though it queries the trie directly without going through the PackageCollections API layer.

Since target search is either exact or prefix match and doesn't tokenize the query, and given the good results I saw in `InMemoryPackageCollectionsSearchTests`, this implementation includes a trie on top of SQLite FTS for target search. The trie is in-memory and loads from SQLite FTS during initialization. `put`/`remove` updates both the SQLite FTS and trie. The improvement with using trie varies--`testTargetsSearchPerformance` takes between ~15-50ms to complete.

`put` now takes longer to complete because of the search index updates.

Result:
Better search performance.
yim-lee added a commit to yim-lee/swift-package-manager that referenced this pull request Jan 21, 2021
This is an alternate to swiftlang#3090 but is a complete solution.

Motivation:
Currently to support search for package collections API we read and deserialize collection blobs from SQLite then perform string matchings on individual properties in memory (e.g., `package.summary.contains("foobar")`). This can be optimized.

Modifications:
Use SQLite FTS--define FTS virtual tables for packages and targets, and update implementation for `findPackage`, `searchPackages`, and `searchTargets` methods of `SQLitePackageCollectionsStorage`.

Without optimization, `PackageCollectionsTests.testPackageSearchPerformance` and `testTargetsSearchPerformance` take about ~400ms to run on my local machine.

With FTS, `testPackageSearchPerformance` takes ~40ms and `testTargetsSearchPerformance` ~50ms.

The `testSearchTargetsPerformance` in `InMemoryPackageCollectionsSearchTests` (swiftlang#3090) yields result in ~10ms, though it queries the trie directly without going through the PackageCollections API layer.

Since target search is either exact or prefix match and doesn't tokenize the query, and given the good results I saw in `InMemoryPackageCollectionsSearchTests`, this implementation includes a trie on top of SQLite FTS for target search. The trie is in-memory and loads from SQLite FTS during initialization. `put`/`remove` updates both the SQLite FTS and trie. The improvement with using trie varies--`testTargetsSearchPerformance` takes between ~15-50ms to complete.

`put` now takes longer to complete because of the search index updates.

Result:
Better search performance.
yim-lee added a commit that referenced this pull request Jan 21, 2021
* Package collections: improve search performance

This is an alternate to #3090 but is a complete solution.

Motivation:
Currently to support search for package collections API we read and deserialize collection blobs from SQLite then perform string matchings on individual properties in memory (e.g., `package.summary.contains("foobar")`). This can be optimized.

Modifications:
Use SQLite FTS--define FTS virtual tables for packages and targets, and update implementation for `findPackage`, `searchPackages`, and `searchTargets` methods of `SQLitePackageCollectionsStorage`.

Without optimization, `PackageCollectionsTests.testPackageSearchPerformance` and `testTargetsSearchPerformance` take about ~400ms to run on my local machine.

With FTS, `testPackageSearchPerformance` takes ~40ms and `testTargetsSearchPerformance` ~50ms.

The `testSearchTargetsPerformance` in `InMemoryPackageCollectionsSearchTests` (#3090) yields result in ~10ms, though it queries the trie directly without going through the PackageCollections API layer.

Since target search is either exact or prefix match and doesn't tokenize the query, and given the good results I saw in `InMemoryPackageCollectionsSearchTests`, this implementation includes a trie on top of SQLite FTS for target search. The trie is in-memory and loads from SQLite FTS during initialization. `put`/`remove` updates both the SQLite FTS and trie. The improvement with using trie varies--`testTargetsSearchPerformance` takes between ~15-50ms to complete.

`put` now takes longer to complete because of the search index updates.

Result:
Better search performance.

* Fallback to slow path, which does not use FTS, in case FTS4 is not available
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants