Skip to content

Implement parallel processing of different kmeans clusters in vector index #17247

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
Tracked by #8967
kunga opened this issue Apr 15, 2025 · 7 comments
Open
Tracked by #8967
Assignees

Comments

@kunga
Copy link
Member

kunga commented Apr 15, 2025

Now it seems that we run shard scans sequentially:

2025-04-16T10:37:50.1744789070321773Z :BUILD_INDEX NOTICE: Starting TLocalKMeansScan TabletId: 72075186224083200 Id: 281474977825662 TabletId: 72075186224083200 PathId { OwnerId: 72075186224037897 LocalId: 68 } SeqNoGeneration: 25 SeqNoRound: 1 Settings { metric: DISTANCE_COSINE vector_type: VECTOR_TYPE_FLOAT vector_dimension: 1024 } Seed: 72075186224083200 K: 200 Upload: UPLOAD_BUILD_TO_POSTING NeedsRounds: 3 ParentFrom: 42 Child: 8401 LevelName: "/Root/testdb/alice/idx_vector/indexImplLevelTable" PostingName: "/Root/testdb/alice/idx_vector/indexImplPostingTable" EmbeddingColumn: "embedding" ParentTo: 42 row version v1744789070000/18446744073709551615
2025-04-16T10:37:50.1744789070320288Z :BUILD_INDEX NOTICE: Done TLocalKMeansScan TabletId: 72075186224083410 Id: 281474977825662 Parent: 41 Child: 8201 K: 200 Clusters: 0 State: SAMPLE Round: 0 / 3 LevelBuf size: 0 PostingBuf size: 0 UploadTable: /Root/testdb/alice/idx_vector/indexImplPostingTable UploadBuf size: 0 RetryCount: 0 Id: 281474977825662 TabletId: 72075186224083410 RequestSeqNoGeneration: 25 RequestSeqNoRound: 1 Status: DONE UploadRows: 618663 UploadBytes: 14431986 ReadRows: 3092315 ReadBytes: 12706322335
2025-04-16T10:40:13.1744789213372628Z :BUILD_INDEX NOTICE: Starting TLocalKMeansScan TabletId: 72075186224082996 Id: 281474977825662 TabletId: 72075186224082996 PathId { OwnerId: 72075186224037897 LocalId: 68 } SeqNoGeneration: 25 SeqNoRound: 1 Settings { metric: DISTANCE_COSINE vector_type: VECTOR_TYPE_FLOAT vector_dimension: 1024 } Seed: 72075186224082996 K: 200 Upload: UPLOAD_BUILD_TO_POSTING NeedsRounds: 3 ParentFrom: 44 Child: 8801 LevelName: "/Root/testdb/alice/idx_vector/indexImplLevelTable" PostingName: "/Root/testdb/alice/idx_vector/indexImplPostingTable" EmbeddingColumn: "embedding" ParentTo: 44 row version v1744789213000/18446744073709551615
2025-04-16T10:40:13.1744789213370433Z :BUILD_INDEX NOTICE: Done TLocalKMeansScan TabletId: 72075186224083200 Id: 281474977825662 Parent: 42 Child: 8401 K: 200 Clusters: 0 State: SAMPLE Round: 0 / 3 LevelBuf size: 0 PostingBuf size: 0 UploadTable: /Root/testdb/alice/idx_vector/indexImplPostingTable UploadBuf size: 0 RetryCount: 0 Id: 281474977825662 TabletId: 72075186224083200 RequestSeqNoGeneration: 25 RequestSeqNoRound: 1 Status: DONE UploadRows: 568754 UploadBytes: 13333988 ReadRows: 2842770 ReadBytes: 11680941930
2025-04-16T10:44:03.1744789443712769Z :BUILD_INDEX NOTICE: Done TLocalKMeansScan TabletId: 72075186224082996 Id: 281474977825662 Parent: 44 Child: 8801 K: 200 Clusters: 0 State: SAMPLE Round: 0 / 3 LevelBuf size: 0 PostingBuf size: 0 UploadTable: /Root/testdb/alice/idx_vector/indexImplPostingTable UploadBuf size: 0 RetryCount: 0 Id: 281474977825662 TabletId: 72075186224082996 RequestSeqNoGeneration: 25 RequestSeqNoRound: 1 Status: DONE UploadRows: 945349 UploadBytes: 21619078 ReadRows: 4725745 ReadBytes: 19418086205
2025-04-16T10:44:03.1744789443714739Z :BUILD_INDEX NOTICE: Starting TLocalKMeansScan TabletId: 72075186224083228 Id: 281474977825662 TabletId: 72075186224083228 PathId { OwnerId: 72075186224037897 LocalId: 68 } SeqNoGeneration: 25 SeqNoRound: 1 Settings { metric: DISTANCE_COSINE vector_type: VECTOR_TYPE_FLOAT vector_dimension: 1024 } Seed: 72075186224083228 K: 200 Upload: UPLOAD_BUILD_TO_POSTING NeedsRounds: 3 ParentFrom: 45 Child: 9001 LevelName: "/Root/testdb/alice/idx_vector/indexImplLevelTable" PostingName: "/Root/testdb/alice/idx_vector/indexImplPostingTable" EmbeddingColumn: "embedding" ParentTo: 45 row version v1744789443000/18446744073709551615
2025-04-16T10:45:56.1744789556696726Z :BUILD_INDEX NOTICE: Done TLocalKMeansScan TabletId: 72075186224083228 Id: 281474977825662 Parent: 45 Child: 9001 K: 200 Clusters: 0 State: SAMPLE Round: 0 / 3 LevelBuf size: 0 PostingBuf size: 0 UploadTable: /Root/testdb/alice/idx_vector/indexImplPostingTable UploadBuf size: 0 RetryCount: 0 Id: 281474977825662 TabletId: 72075186224083228 RequestSeqNoGeneration: 25 RequestSeqNoRound: 1 Status: DONE UploadRows: 456045 UploadBytes: 10854390 ReadRows: 2279225 ReadBytes: 9365335525
2025-04-16T10:45:56.1744789556698333Z :BUILD_INDEX NOTICE: Starting TLocalKMeansScan TabletId: 72075186224083058 Id: 281474977825662 TabletId: 72075186224083058 PathId { OwnerId: 72075186224037897 LocalId: 68 } SeqNoGeneration: 25 SeqNoRound: 1 Settings { metric: DISTANCE_COSINE vector_type: VECTOR_TYPE_FLOAT vector_dimension: 1024 } Seed: 72075186224083058 K: 200 Upload: UPLOAD_BUILD_TO_POSTING NeedsRounds: 3 ParentFrom: 46 Child: 9201 LevelName: "/Root/testdb/alice/idx_vector/indexImplLevelTable" PostingName: "/Root/testdb/alice/idx_vector/indexImplPostingTable" EmbeddingColumn: "embedding" ParentTo: 46 row version v1744789556000/18446744073709551615
2025-04-16T10:48:05.1744789685561006Z :BUILD_INDEX NOTICE: Starting TLocalKMeansScan TabletId: 72075186224083322 Id: 281474977825662 TabletId: 72075186224083322 PathId { OwnerId: 72075186224037897 LocalId: 68 } SeqNoGeneration: 25 SeqNoRound: 1 Settings { metric: DISTANCE_COSINE vector_type: VECTOR_TYPE_FLOAT vector_dimension: 1024 } Seed: 72075186224083322 K: 200 Upload: UPLOAD_BUILD_TO_POSTING NeedsRounds: 3 ParentFrom: 47 Child: 9401 LevelName: "/Root/testdb/alice/idx_vector/indexImplLevelTable" PostingName: "/Root/testdb/alice/idx_vector/indexImplPostingTable" EmbeddingColumn: "embedding" ParentTo: 47 row version v1744789685000/18446744073709551615
2025-04-16T10:48:05.1744789685559055Z :BUILD_INDEX NOTICE: Done TLocalKMeansScan TabletId: 72075186224083058 Id: 281474977825662 Parent: 46 Child: 9201 K: 200 Clusters: 0 State: SAMPLE Round: 0 / 3 LevelBuf size: 0 PostingBuf size: 0 UploadTable: /Root/testdb/alice/idx_vector/indexImplPostingTable UploadBuf size: 0 RetryCount: 0 Id: 281474977825662 TabletId: 72075186224083058 RequestSeqNoGeneration: 25 RequestSeqNoRound: 1 Status: DONE UploadRows: 517710 UploadBytes: 12211020 ReadRows: 2587550 ReadBytes: 10632242950
2025-04-16T10:52:03.1744789923112374Z :BUILD_INDEX NOTICE: Starting TLocalKMeansScan TabletId: 72075186224083426 Id: 281474977825662 TabletId: 72075186224083426 PathId { OwnerId: 72075186224037897 LocalId: 68 } SeqNoGeneration: 25 SeqNoRound: 1 Settings { metric: DISTANCE_COSINE vector_type: VECTOR_TYPE_FLOAT vector_dimension: 1024 } Seed: 72075186224083426 K: 200 Upload: UPLOAD_BUILD_TO_POSTING NeedsRounds: 3 ParentFrom: 48 Child: 9601 LevelName: "/Root/testdb/alice/idx_vector/indexImplLevelTable" PostingName: "/Root/testdb/alice/idx_vector/indexImplPostingTable" EmbeddingColumn: "embedding" ParentTo: 48 row version v1744789923000/18446744073709551615
2025-04-16T10:52:03.1744789923110602Z :BUILD_INDEX NOTICE: Done TLocalKMeansScan TabletId: 72075186224083322 Id: 281474977825662 Parent: 47 Child: 9401 K: 200 Clusters: 0 State: SAMPLE Round: 0 / 3 LevelBuf size: 0 PostingBuf size: 0 UploadTable: /Root/testdb/alice/idx_vector/indexImplPostingTable UploadBuf size: 0 RetryCount: 0 Id: 281474977825662 TabletId: 72075186224083322 RequestSeqNoGeneration: 25 RequestSeqNoRound: 1 Status: DONE UploadRows: 946163 UploadBytes: 21636986 ReadRows: 4729815 ReadBytes: 19434809835
2025-04-16T10:53:20.1744790000746881Z :BUILD_INDEX NOTICE: Starting TLocalKMeansScan TabletId: 72075186224083152 Id: 281474977825662 TabletId: 72075186224083152 PathId { OwnerId: 72075186224037897 LocalId: 68 } SeqNoGeneration: 25 SeqNoRound: 1 Settings { metric: DISTANCE_COSINE vector_type: VECTOR_TYPE_FLOAT vector_dimension: 1024 } Seed: 72075186224083152 K: 200 Upload: UPLOAD_BUILD_TO_POSTING NeedsRounds: 3 ParentFrom: 49 Child: 9801 LevelName: "/Root/testdb/alice/idx_vector/indexImplLevelTable" PostingName: "/Root/testdb/alice/idx_vector/indexImplPostingTable" EmbeddingColumn: "embedding" ParentTo: 49 row version v1744790000000/18446744073709551615
2025-04-16T10:53:20.1744790000745187Z :BUILD_INDEX NOTICE: Done TLocalKMeansScan TabletId: 72075186224083426 Id: 281474977825662 Parent: 48 Child: 9601 K: 200 Clusters: 0 State: SAMPLE Round: 0 / 3 LevelBuf size: 0 PostingBuf size: 0 UploadTable: /Root/testdb/alice/idx_vector/indexImplPostingTable UploadBuf size: 0 RetryCount: 0 Id: 281474977825662 TabletId: 72075186224083426 RequestSeqNoGeneration: 25 RequestSeqNoRound: 1 Status: DONE UploadRows: 312171 UploadBytes: 7689162 ReadRows: 1559855 ReadBytes: 6409444195
2025-04-16T10:56:41.1744790201623512Z :BUILD_INDEX NOTICE: Starting TLocalKMeansScan TabletId: 72075186224083390 Id: 281474977825662 TabletId: 72075186224083390 PathId { OwnerId: 72075186224037897 LocalId: 68 } SeqNoGeneration: 25 SeqNoRound: 1 Settings { metric: DISTANCE_COSINE vector_type: VECTOR_TYPE_FLOAT vector_dimension: 1024 } Seed: 72075186224083390 K: 200 Upload: UPLOAD_BUILD_TO_POSTING NeedsRounds: 3 ParentFrom: 50 Child: 10001 LevelName: "/Root/testdb/alice/idx_vector/indexImplLevelTable" PostingName: "/Root/testdb/alice/idx_vector/indexImplPostingTable" EmbeddingColumn: "embedding" ParentTo: 50 row version v1744790201000/18446744073709551615
2025-04-16T10:56:41.1744790201621157Z :BUILD_INDEX NOTICE: Done TLocalKMeansScan TabletId: 72075186224083152 Id: 281474977825662 Parent: 49 Child: 9801 K: 200 Clusters: 0 State: SAMPLE Round: 0 / 3 LevelBuf size: 0 PostingBuf size: 0 UploadTable: /Root/testdb/alice/idx_vector/indexImplPostingTable UploadBuf size: 0 RetryCount: 0 Id: 281474977825662 TabletId: 72075186224083152 RequestSeqNoGeneration: 25 RequestSeqNoRound: 1 Status: DONE UploadRows: 813621 UploadBytes: 18721062 ReadRows: 4067105 ReadBytes: 16711734445
2025-04-16T10:58:55.1744790335233102Z :BUILD_INDEX NOTICE: Done TLocalKMeansScan TabletId: 72075186224083390 Id: 281474977825662 Parent: 50 Child: 10001 K: 200 Clusters: 0 State: SAMPLE Round: 0 / 3 LevelBuf size: 0 PostingBuf size: 0 UploadTable: /Root/testdb/alice/idx_vector/indexImplPostingTable UploadBuf size: 0 RetryCount: 0 Id: 281474977825662 TabletId: 72075186224083390 RequestSeqNoGeneration: 25 RequestSeqNoRound: 1 Status: DONE UploadRows: 538834 UploadBytes: 12675748 ReadRows: 2693170 ReadBytes: 11066235530
2025-04-16T10:58:55.1744790335234346Z :BUILD_INDEX NOTICE: Starting TLocalKMeansScan TabletId: 72075186224083180 Id: 281474977825662 TabletId: 72075186224083180 PathId { OwnerId: 72075186224037897 LocalId: 68 } SeqNoGeneration: 25 SeqNoRound: 1 Settings { metric: DISTANCE_COSINE vector_type: VECTOR_TYPE_FLOAT vector_dimension: 1024 } Seed: 72075186224083180 K: 200 Upload: UPLOAD_BUILD_TO_POSTING NeedsRounds: 3 ParentFrom: 51 Child: 10201 LevelName: "/Root/testdb/alice/idx_vector/indexImplLevelTable" PostingName: "/Root/testdb/alice/idx_vector/indexImplPostingTable" EmbeddingColumn: "embedding" ParentTo: 51 row version v1744790335000/18446744073709551615

Image

It seems need to be fixed

@vitalif
Copy link
Collaborator

vitalif commented Apr 28, 2025

Not sure if it's actually broken. How to reproduce it?...
My test run after partitioning a test table into 8 partitions:
Image

@kunga
Copy link
Member Author

kunga commented Apr 28, 2025

Try this

ALTER TABLE alice
ADD INDEX idx_vector
GLOBAL USING vector_kmeans_tree
ON (embedding)
WITH (distance=cosine, vector_type="float", vector_dimension=1024, levels=2, clusters=200);

On dataset https://wiki.yandex-team.ru/kikimr/developers/datashard/vector-search/getting-stated-alice-logs/

First level build will be fine, but the next (I guess second) not

@kunga
Copy link
Member Author

kunga commented Apr 28, 2025

You probably need >1 shards of some temporary build table, not a source one to reproduce it

@vitalif
Copy link
Collaborator

vitalif commented May 15, 2025

It seems I reproduced it at least to some extent... During the last index build, TReshuffleKMeansScan always ran in parallel, and some of TLocalKMeansScan within a single node ran sequentially. But some of them didn't and also ran in parallel...

@vitalif
Copy link
Collaborator

vitalif commented May 20, 2025

Discovered the cause accidentally (:D) during debugging #18355. Different clusters are really processed sequentially on levels > 1. Only some of them are processed in parallel - specifically, only clusters which only have 1 data shard, during the "MultiLocal" kmeans phase, introduced in #11539. :-)

@kunga
Copy link
Member Author

kunga commented May 20, 2025

As I know we expect all clusters on level >1 to be in 1 data shard

@vitalif
Copy link
Collaborator

vitalif commented May 20, 2025

As I know we expect all clusters on level >1 to be in 1 data shard

It seems that the code allows >1 shard per cluster, here:
https://github.com/ydb-platform/ydb/blob/main/ydb/core/tx/schemeshard/schemeshard_build_index__progress.cpp#L860

However, in my case it skips all .Global shards except the last one on the last level for some reason and it causes #18355 :)

@vitalif vitalif changed the title Build vector index run sequentially Implement parallel processing of different kmeans clusters in vector index Jun 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants