
Commit d4fb6ba

Proof-read the first few sections
1 parent ef302eb commit d4fb6ba


2 files changed: +35 -35 lines changed


ydb/docs/en/core/contributor/datashard-distributed-txs.md

Lines changed: 32 additions & 32 deletions
@@ -1,45 +1,45 @@
# DataShard: distributed transactions

-{{ydb-short-name}} uses distributed transactions based on ideas from [Calvin](https://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf). These transactions consist of operations (not necessarily deterministic) that are preformed at a predetermined set of participants (e.g. DataShards). Executing a transaction involves preparing it on all participants and assigning a position (timestamp) in the global transaction execution order using one of coordinator tablets in the database. Each participant receives and executes an ordered stream of transactions that it is involved in. Participants may receive and execute their parts of a larger transaction with varying speed and not necessarily simultaneously. However, any particular distributed transaction has the same timestamp at all participants, and must include all effects with preceding timestamps, so when viewed as a logical clock (based on global timestamps) these distributed transaction are performed at the same point in logical time.
+{{ydb-short-name}} uses distributed transactions, which are based on ideas from Calvin (see [Calvin: Fast Distributed Transactions for Partitioned Database Systems](https://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf)). These transactions consist of a set of operations performed by a group of participants, such as DataShards. Unlike Calvin, these operations are not required to be deterministic. To execute a distributed transaction, a proposer prepares the transaction at each participant, assigns a position (or timestamp) to the transaction in the global transaction execution order using one of the coordinator tablets, and collects the transaction results. Each participant receives and processes the subset of transactions it is involved in, following a specific order. Participants may process their parts of a larger transaction at different speeds and not necessarily simultaneously. However, a distributed transaction shares the same timestamp across all participating shards and must include all changes from transactions with preceding timestamps. When timestamps are viewed as a logical clock, they form a single logical timeline, and any distributed transaction happens at a single point on that timeline.

-When transaction execution depends on the state of other participants, participants exchange data using so called ReadSets, which are persistent messages between participants, delivered at least once. ReadSets cause the transaction to have some additional phases:
+When the execution of a transaction depends on the state of other participants, the participants exchange data using so-called ReadSets. These are persistent messages between participants that are delivered at least once. The use of ReadSets causes transactions to go through additional phases:

-1. Reading phase. Participant reads, persists and sends data needed for other participants. Currently commits of KQP transactions (transaction type `TX_KIND_DATA` with a non-empty field `TDataTransaction.KqpTransaction` and type `KQP_TX_TYPE_DATA`) validate optimistic locks in this phase. Older MiniKQL transactions (transaction type `TX_KIND_DATA` with a non-empty field `TDataTransaction.MiniKQL`) used this phase to read and send arbitrary table data. Distributed TTL transaction for erasing expired rows is another example of using this phase for reads, the primary shard generates a bit mask of matching expired rows, making sure primary and index shards delete the same set of rows.
-2. Waiting phase. Participant waits until it receives all the necessary data from other participants.
-3. Execution phase. Participant uses local and remote data to decide whether to abort or execute the transaction, generating and applying effects using the transaction body. Transaction body usually has a program that uses the same data and causes all participants to arrive at the same decision.
+1. **Reading phase**: The participant reads, persists, and sends data that is needed by other participants. During this phase, commits of KQP transactions (type `TX_KIND_DATA`, which have a non-empty `TDataTransaction.KqpTransaction` field and subtype `KQP_TX_TYPE_DATA`) validate optimistic locks. Older MiniKQL transactions (type `TX_KIND_DATA`, which have a non-empty `TDataTransaction.MiniKQL` field) used this phase to read and send arbitrary table data. Another example of using the reading phase is the distributed TTL transaction for deleting expired rows: the primary shard generates a bitmask of matching expired rows, ensuring that both the primary and index shards delete the same set of rows.
+2. **Waiting phase**: The participant waits until it has received all the necessary data from the other participants.
+3. **Execution phase**: The participant uses both local and remote data to determine whether to abort or execute the transaction. If the transaction is executed, the participant generates and applies the effects specified in the transaction body. The transaction body typically includes a program that uses the same data and leads all participants to the same decision.

-Participants are allowed to execute transactions in a different order for efficiency, however it's important that this reordering cannot be observed by reads and writes from concurrent transactions. Transaction ordering based on coordinator assigned timestamps ensures strict serializable isolation. In practice, single-shard transactions do not involve coordinator tablets and shards assign a locally consistent timestamp to such transactions. Due to variations in transaction stream arrival times this weakens transaction isolation to serializable.
+Participants are allowed to execute transactions out of order for efficiency. However, it's important that this reordering cannot be observed by reads and writes from concurrent transactions. Transaction ordering based on coordinator-assigned timestamps ensures strict serializable isolation. In practice, single-shard transactions don't involve a coordinator, and shards use a locally consistent timestamp for such transactions. Due to variations in the arrival times of transaction streams, this weakens the isolation level to serializable.

-Version 24.1 of {{ydb-short-name}} added support for "volatile" distributed transaction. These transactions allow participants (including coordinator) to store transaction information in volatile memory (which is lost when shards are restarted) until transaction is executed and effects are persisted. This also allows participants to abort transactions (which is guaranteed to abort at all other participants) until the very last moment. Using volatile memory excludes persistent storage from the hot path of transaction execution and reduces latency.
+Version 24.1 of {{ydb-short-name}} added support for "volatile" distributed transactions. These transactions allow participants, including coordinators, to store transaction data in volatile memory (which is lost when the shards are restarted) until the transaction is completed and the effects are persisted. This also allows a participant to abort the transaction until the very last moment; such an abort is then guaranteed to happen at all other participants. Using volatile memory excludes persistent storage from the critical path of transaction execution, reducing latency.

-When executing user's YQL transactions {{ydb-short-name}} currently only uses distributed transaction for the final commit, which applies effects in read-write transactions. Individual queries before the commit are executed as single-shard operations, ensuring consistency using optimistic locks and global MVCC snapshots.
+When executing a user's YQL transactions, {{ydb-short-name}} currently uses distributed transactions only for the final commit of non-read-only transactions. Individual queries before the commit are executed as single-shard operations, using optimistic locks and global multi-version concurrency control (MVCC) snapshots to ensure consistency.

## Basic distributed transactions protocol

-Operations that may be executed as distributed transactions in {{ydb-short-name}} and types of supported participants are extended over time. Basic protocol for distributed transactions is however the same regardless of transaction type, with notable differences in schema changes, which have additional requirements to make sure these transactions are idempotent.
-
-Distributed transactions are orchestrated using proposer actors, examples of which are:
-
-* [TKqpDataExecutor](https://github.com/ydb-platform/ydb/blob/main/ydb/core/kqp/executer_actor/kqp_data_executer.cpp) executes DML queries, including distributed commits
-* [SchemeShard](https://github.com/ydb-platform/ydb/tree/main/ydb/core/tx/schemeshard) executes distributed transactions for schema changes
-* [TDistEraser](https://github.com/ydb-platform/ydb/blob/main/ydb/core/tx/datashard/datashard_distributed_erase.cpp) executes distributed transactions for consistently erasing rows in tables with secondary indexes that match TTL rules
-
-Distributed transactions in {{ydb-short-name}} are a bit similar to two-phase commits. Overall the proposer actor has the following phases of distributed transaction execution:
-
-0. Determining participants. The proposer actor selects specific shards (`TabletId`) that are required for transaction execution. For example, a table may consist of many shards (`DataShard` tablets with unique `TabletId` identifiers), but a particular transaction may be limited to a smaller set of shards based on affected primary keys. This shard set is fixed at the start of the transaction and cannot be extended later. Transactions that affect a single shard are called single-shard transactions and may be executed as immediate.
-1. Prepare phase. Proposer sends a special event (usually called `TEvProposeTransaction`, in the case of DataShard there is a `TEvWrite` variant), which specifies a `TxId` (a transaction identifier, unique within a particular cluster), and includes operations and parameters (transaction body). Participants validate whether it is possible to execute the specified transaction, select a range of allowed timestamps (`MinStep` and `MaxStep`), and reply with a `PREPARED` status on success.
-* For single-shard transactions the proposer usually specifies an "immediate execution" mode (`Immediate`). Shard executes such transactions as soon as possible (at an unspecified timestamp consistent with other transactions), and replies with the result instead of `PREPARED`, which causes planning phase to be skipped. Some special single-shard operations (e.g. `TEvUploadRowsRequest` which implements `BulkUpsert`) don't even have a globally unique `TxId`.
-* Persistent transaction body is persisted in the shard's local database and participant must guarantee that it will be executed when planned. In some cases (e.g. blind `UPSERT` into multiple shards) participants must also guarantee it will be executed successfully, which may conflict with certain schema changes.
-* Volatile transaction body is stored in memory and participant replies with `PREPARED` as soon as possible. Future execution (whether successful or not) is not guaranteed in any way.
-* Proposer moves to the next phase when it gathers replies from all participants.
-* It's not safe to retry sending the propose event, except for schema operations which thanks to special idempotency fields guarantee a particular transaction will execute exactly once.
-2. Planning phase. When the proposer receives `PREPARED` replies from all participants, it computes aggregated `MinStep` and `MaxStep` values and selects a coordinator used for assigning timestamp to the transaction. A `TEvTxProxy::TEvProposeTransaction` event is sent to the selected coordinator which includes `TxId` and a list of participants.
-* Transaction may only involve shards from the same database. Every shard includes its [ProcessingParams](https://github.com/ydb-platform/ydb/blob/a68faed0a7b525a750d5f566e5c3fc60424cc91e/ydb/core/protos/subdomains.proto#L31) in the reply, which has the same list of coordinators for shards in the same database.
-* Coordinator is selected based on received `ProcessingParams` because it was historically possible to execute queries without specifying a database, and the list of coordinators could only be found from participants.
-* When `TEvTxProxy::TEvProposeTransaction` event is retried (currently only for schema transactions) it is possible for the transaction to have multiple timestamps assigned. This is usually not a problem, the transaction executes at the minimum timestamp, and all other timestamps are ignored (transaction is executed and removed by that time).
-3. Execution and reply gathering phase. The proposer waits for replies from the selected coordinator and participants, gathering the overall transaction result.
-* In some cases (temporary disconnect, shard restart) the proposer may attempt to restore the connection and keep waiting for the result unless the transaction has executed and the result has been lost.
-* When it is impossible to acquire the transaction result from at least one participant due to network errors, transaction usually fails with the `UNDETERMINED` status, which signifies it's impossible to know whether the transaction executed successfully or not.
+The set of operations that can be performed as distributed transactions in {{ydb-short-name}}, as well as the types of supported participants, is extended over time. The basic protocol for distributed transactions is the same regardless of the type of transaction, with some notable differences for schema changes, which have additional requirements to make sure these transactions are idempotent.
+
+Distributed transactions are managed using proposer actors. Some examples of these are:
+
+* [TKqpDataExecutor](https://github.com/ydb-platform/ydb/blob/main/ydb/core/kqp/executer_actor/kqp_data_executer.cpp) executes DML queries, including distributed commits.
+* [SchemeShard](https://github.com/ydb-platform/ydb/tree/main/ydb/core/tx/schemeshard) executes distributed transactions for schema changes.
+* [TDistEraser](https://github.com/ydb-platform/ydb/blob/main/ydb/core/tx/datashard/datashard_distributed_erase.cpp) executes a distributed transaction to consistently erase rows in tables with secondary indexes that match time-to-live (TTL) rules.
+
+Distributed transactions in {{ydb-short-name}} are similar to two-phase commit protocols. The proposer actor goes through the following phases when executing a distributed transaction:
+
+0. **Determining participants**: The proposer actor selects specific shards (`TabletId`) that are required for transaction execution. A table may consist of many shards (`DataShard` tablets with unique `TabletId` identifiers), but a particular transaction may only affect a smaller subset of these shards based on the affected primary keys. This subset is fixed at the start of the transaction and cannot be extended later. Transactions that only affect a single shard are called "single-shard" transactions and may be processed in what is known as the "immediate execution" mode.
+1. **Prepare phase**: The proposer sends a special event, usually called `TEvProposeTransaction` (there is also a `TEvWrite` variant in DataShard), which specifies a `TxId` (a transaction identifier unique within a particular cluster) and includes the transaction body (operations and parameters). Participants validate whether the specified transaction can be executed, select a range of allowed timestamps (`MinStep` and `MaxStep`), and reply with a `PREPARED` status on success.
+* For single-shard transactions the proposer typically specifies an "immediate execution" mode (`Immediate`). The shard executes such transactions as soon as possible (at an unspecified timestamp consistent with other transactions) and replies with the result rather than `PREPARED`, which causes the planning phase to be skipped. Some special single-shard operations, such as `TEvUploadRowsRequest`, which implements `BulkUpsert`, don't even have a globally unique `TxId`.
+* The persistent transaction body is stored in the shard's local database, and the participant must ensure that it is executed when planned. In certain cases (for example, when performing a blind `UPSERT` into multiple shards), the participants must also ensure that the transaction is executed successfully, which may conflict with certain schema changes.
+* The volatile transaction body is stored in memory, and the participant replies with `PREPARED` as soon as possible. Future execution, whether successful or not, is not guaranteed in any way.
+* The proposer moves on to the next phase when it has received replies from all the participants.
+* It's not safe to re-send the propose event, except for schema operations, which, thanks to special idempotency fields, guarantee that a particular transaction will be executed exactly once.
+2. **Planning phase**: When the proposer has received `PREPARED` replies from all participants, it calculates the aggregated `MinStep` and `MaxStep` values and selects a coordinator to assign a timestamp to the transaction. A `TEvTxProxy::TEvProposeTransaction` event is sent to the selected coordinator, which includes the `TxId` and the list of participants.
+* The transaction may only involve shards from the same database. Each shard attaches its [ProcessingParams](https://github.com/ydb-platform/ydb/blob/a68faed0a7b525a750d5f566e5c3fc60424cc91e/ydb/core/protos/subdomains.proto#L31) to the reply; shards from the same database have the same list of coordinators in these params.
+* The coordinator is selected based on the received `ProcessingParams`, because historically it was possible to execute queries without specifying a database, and the list of coordinators could then only be obtained from the participants.
+* When the `TEvTxProxy::TEvProposeTransaction` event is re-sent (currently only for schema transactions), the transaction may have multiple timestamps assigned to it. This is usually not a problem: the transaction executes at the minimum timestamp, and all later timestamps are ignored (the transaction has been executed and removed by the time they arrive).
+3. **Execution phase**: The proposer waits for replies from the selected coordinator and participants, collecting the overall transaction result.
+* In some cases, such as a temporary network disconnection or a shard restart, the proposer may try to re-establish the connection and keep waiting for the result, unless the transaction has already executed and the result has been lost.
+* When it is impossible to retrieve the result of a transaction from at least one of the participants due to network issues, the transaction usually fails with an `UNDETERMINED` status, indicating that it is impossible to determine whether the transaction was successful or not.

## Prepare phase in the DataShard tablet

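The sketches below are editorial illustrations of the protocol described in the documentation changed by this commit. They are simplified models under stated assumptions, not code from the YDB repository, and every type and function name in them is hypothetical.

The first sketch models the global transaction order from the opening paragraph, assuming a transaction's position can be represented by the coordinator-assigned plan step (acting as logical time) with the `TxId` as a tie-breaker; all participants of a distributed transaction then observe it at the same position on a single logical timeline.

```cpp
#include <cstdint>
#include <tuple>

// Hypothetical model of a transaction's position on the global timeline.
struct TTxPosition {
    uint64_t Step; // coordinator-assigned plan step, acting as logical time
    uint64_t TxId; // cluster-unique transaction id, breaking ties in a step

    // Lexicographic order: all shards sort their transaction streams the
    // same way, so a distributed transaction has one global position.
    bool operator<(const TTxPosition& other) const {
        return std::tie(Step, TxId) < std::tie(other.Step, other.TxId);
    }
};
```
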
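The second sketch is a toy model of the reading, waiting, and execution phases built around ReadSets. It assumes each ReadSet carries a boolean validation result (as with optimistic lock checks). Since ReadSets are delivered at least once, duplicate deliveries must be tolerated, and since every participant evaluates the same inputs, all participants reach the same commit or abort decision.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <utility>

// Hypothetical ReadSet payload: the source shard's local check result.
struct TReadSet {
    uint64_t SourceShard;
    bool LocksValid;
};

// Tracks one transaction's waiting phase on a single participant.
class TTxDecision {
public:
    explicit TTxDecision(std::set<uint64_t> expectedSources)
        : Expected(std::move(expectedSources)) {}

    // Waiting phase: ReadSets arrive at least once, so a duplicate
    // delivery simply overwrites the same entry.
    void OnReadSet(const TReadSet& rs) {
        if (Expected.count(rs.SourceShard)) {
            Received[rs.SourceShard] = rs.LocksValid;
        }
    }

    bool ReadyToExecute() const {
        return Received.size() == Expected.size();
    }

    // Execution phase: the decision depends only on the collected data,
    // so every participant arrives at the same commit/abort outcome.
    bool ShouldCommit() const {
        for (const auto& [shard, ok] : Received) {
            if (!ok) {
                return false;
            }
        }
        return true;
    }

private:
    std::set<uint64_t> Expected;
    std::map<uint64_t, bool> Received;
};
```
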
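The third sketch covers the planning phase's `MinStep`/`MaxStep` aggregation: each `PREPARED` reply constrains the plan step to that shard's window, so the aggregate is the intersection of all windows, and the coordinator must assign a timestamp inside it. `TPreparedReply` and `AggregateStepWindow` are invented names for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical summary of one shard's PREPARED reply.
struct TPreparedReply {
    uint64_t TabletId; // participant shard that replied PREPARED
    uint64_t MinStep;  // earliest plan step acceptable to this shard
    uint64_t MaxStep;  // latest plan step before the shard forgets the tx
};

// Intersect all [MinStep, MaxStep] windows; the coordinator must plan the
// transaction at a step inside the resulting window, if one exists.
std::optional<std::pair<uint64_t, uint64_t>>
AggregateStepWindow(const std::vector<TPreparedReply>& replies) {
    uint64_t minStep = 0;
    uint64_t maxStep = UINT64_MAX;
    for (const auto& r : replies) {
        minStep = std::max(minStep, r.MinStep);
        maxStep = std::min(maxStep, r.MaxStep);
    }
    if (minStep > maxStep) {
        return std::nullopt; // no common window: the tx cannot be planned
    }
    return std::make_pair(minStep, maxStep);
}
```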