
Commit 7827174

Added the Troubleshooting section (#10888)
Co-authored-by: Ivan Blinkov <[email protected]>
1 parent 776b371 commit 7827174


69 files changed: +1049 −1 lines changed

ydb/docs/en/core/concepts/glossary.md

+25-1
@@ -101,6 +101,16 @@ Together, these mechanisms allow {{ ydb-short-name }} to provide [strict consist

The implementation of distributed transactions is covered in a separate article [{#T}](../contributor/datashard-distributed-txs.md), while below there's a list of several [related terms](#distributed-transaction-implementation).

### Interactive transactions {#interactive-transaction}

The term **interactive transactions** refers to transactions that are split into multiple queries and involve data processing by an application between these queries. For example:

1. Select some data.
1. Process the selected data in the application.
1. Update some data in the database.
1. Commit the transaction in a separate query.
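The four steps above can be sketched as follows. This is a minimal illustration of the pattern only, using Python's built-in `sqlite3` module as a stand-in for brevity; a real application would issue these queries through the YDB SDK, and the table and column names here are hypothetical.

```python
import sqlite3

# Set up a toy database (stand-in for a real {{ ydb-short-name }} table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")
conn.commit()

# 1. Select some data (first query of the transaction).
balance = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]

# 2. Process the selected data in the application.
transfer = min(balance, 30)

# 3. Update some data in the database (further queries, same transaction).
conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = 1", (transfer,))
conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = 2", (transfer,))

# 4. Commit the transaction in a separate query.
conn.commit()

print(conn.execute("SELECT balance FROM accounts ORDER BY id").fetchall())
```

Because the application holds the transaction open while it computes `transfer`, the database must keep the transaction's reads consistent across multiple round trips, which is what distinguishes interactive transactions from single-query ones.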

### Multi-version concurrency control {#mvcc}

[**Multi-version concurrency control**](https://en.wikipedia.org/wiki/Multiversion_concurrency_control) or **MVCC** is a method {{ ydb-short-name }} uses to allow multiple concurrent transactions to access the database simultaneously without interfering with each other. It is described in more detail in a separate article [{#T}](mvcc.md).
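As a rough intuition for the MVCC idea (a toy sketch, not {{ ydb-short-name }}'s actual implementation): each key keeps multiple timestamped versions, and a reader at snapshot time `T` sees the latest version committed at or before `T`, so writers never block readers.

```python
# Each key maps to a list of (commit_timestamp, value) versions,
# kept in commit order.
versions = {"key": [(10, "v1"), (20, "v2"), (30, "v3")]}

def read(key, snapshot_ts):
    """Return the value visible at snapshot_ts, or None if no version exists yet."""
    visible = [value for ts, value in versions[key] if ts <= snapshot_ts]
    return visible[-1] if visible else None

print(read("key", 25))  # a reader at T=25 sees "v2"
print(read("key", 5))   # a reader before the first commit sees nothing
```

A concurrent writer appending `(40, "v4")` would not disturb any reader whose snapshot is at `T <= 30`, which is the core property MVCC provides.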
@@ -255,6 +265,20 @@ The **actor system interconnect** or **interconnect** is the [cluster's](#cluste

A **Local** is an [actor service](#actor-service) running on each [node](#node). It directly manages the [tablets](#tablet) on its node and interacts with [Hive](#hive). It registers with Hive and receives commands to launch tablets.

#### Actor system pool {#actor-system-pool}

The **actor system pool** is a [thread pool](https://en.wikipedia.org/wiki/Thread_pool) used to run [actors](#actor). Each [node](#node) operates multiple pools to coarsely separate resources between different types of activities. A typical set of pools includes:

- **System**: A pool that handles internal operations within a {{ ydb-short-name }} node. It serves system [tablets](#tablet), [state storage](#state-storage), [distributed storage](#distributed-storage) I/O, and so on.

- **User**: A pool dedicated to user-generated load, such as running non-system tablets or queries executed by the [KQP](#kqp).

- **Batch**: A pool for tasks without strict execution deadlines, including heavy queries handled by the [KQP](#kqp) and background operations like backups, data compaction, and garbage collection.

- **IO**: A pool for tasks involving blocking operations, such as authentication or writing logs to files.

- **IC**: A pool for [interconnect](#actor-system-interconnect), responsible for system calls related to data transfers across the network, data serialization, and message splitting and merging.

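The idea of dedicating separate pools to different activity classes can be illustrated with a small Python sketch. This is an analogy only: {{ ydb-short-name }}'s actor system pools are not Python thread pools, and the pool names and worker counts below are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Separate worker pools per activity class, so a flood of low-priority
# "batch" work cannot starve latency-sensitive "user" work of threads.
pools = {
    "user": ThreadPoolExecutor(max_workers=4, thread_name_prefix="user"),
    "batch": ThreadPoolExecutor(max_workers=2, thread_name_prefix="batch"),
    "io": ThreadPoolExecutor(max_workers=8, thread_name_prefix="io"),
}

def run_in_pool(pool_name, task, *args):
    """Submit a task to the pool matching its activity type."""
    return pools[pool_name].submit(task, *args)

# A user query and a background compaction run on disjoint worker sets.
user_future = run_in_pool("user", lambda: "query result")
batch_future = run_in_pool("batch", lambda: "compaction done")

print(user_future.result(), "|", batch_future.result())

for pool in pools.values():
    pool.shutdown()
```

Measuring each pool's consumption separately (as {{ ydb-short-name }} does) then makes it easy to tell which activity class is responsible for a CPU spike.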
### Tablet implementation {#tablet-implementation}

A [**tablet**](#tablet) is an [actor](#actor) with a persistent state. It includes a set of data for which this tablet is responsible and a finite state machine through which the tablet's data (or state) changes. The tablet is a fault-tolerant entity because tablet data is stored in [Distributed storage](#distributed-storage), which survives disk and node failures. The tablet is automatically restarted on another [node](#node) if the previous one is down or overloaded. The data in the tablet changes in a consistent manner because the system infrastructure ensures that there is no more than one [tablet leader](#tablet-leader) through which changes to the tablet data are carried out.
@@ -558,7 +582,7 @@ MiniKQL is a low-level language. The system's end users only see queries in the

#### KQP {#kqp}

-**KQP** is a {{ ydb-short-name }} component responsible for the orchestration of user query execution and generating the final response.
+**KQP** or **Query Processor** is a {{ ydb-short-name }} component responsible for the orchestration of user query execution and generating the final response.

### Global schema {#global-schema}

ydb/docs/en/core/dev/index.md

+2
@@ -27,4 +27,6 @@ Main resources:

- [{#T}](../postgresql/intro.md)
- [{#T}](../reference/kafka-api/index.md)
- [{#T}](troubleshooting/index.md)

If you're interested in developing {{ ydb-short-name }} core or satellite projects, refer to the [documentation for contributors](../contributor/index.md).

ydb/docs/en/core/dev/toc_p.yaml

+5
@@ -18,6 +18,11 @@ items:
    path: primary-key/toc_p.yaml
- name: Secondary indexes
  href: secondary-indexes.md
- name: Troubleshooting
  href: troubleshooting/index.md
  include:
    mode: link
    path: troubleshooting/toc_p.yaml
- name: Query plans optimization
  href: query-plans-optimization.md
- name: Batch upload
@@ -0,0 +1,5 @@
# Troubleshooting

This section of the {{ ydb-short-name }} documentation provides guidance on troubleshooting issues related to {{ ydb-short-name }} databases and the applications that interact with them.

- [{#T}](performance/index.md)
@@ -0,0 +1,59 @@
1. Use **Diagnostics** in the [Embedded UI](../../../../../reference/embedded-ui/index.md) to analyze CPU utilization in all pools:

    1. In the [Embedded UI](../../../../../reference/embedded-ui/index.md), go to the **Databases** tab and click on the database.

    1. On the **Navigation** tab, ensure the required database is selected.

    1. Open the **Diagnostics** tab.

    1. On the **Info** tab, click the **CPU** button and see if any pools show high CPU usage.

        ![](../_assets/embedded-ui-cpu-system-pool.png)

1. Use Grafana charts to analyze CPU utilization in all pools:

    1. Open the **[CPU](../../../../../reference/observability/metrics/grafana-dashboards.md#cpu)** dashboard in Grafana.

    1. See if the following charts show any spikes:

        - **CPU by execution pool** chart

            ![](../_assets/cpu-by-pool.png)

        - **User pool - CPU by host** chart

            ![](../_assets/cpu-user-pool.png)

        - **System pool - CPU by host** chart

            ![](../_assets/cpu-system-pool.png)

        - **Batch pool - CPU by host** chart

            ![](../_assets/cpu-batch-pool.png)

        - **IC pool - CPU by host** chart

            ![](../_assets/cpu-ic-pool.png)

        - **IO pool - CPU by host** chart

            ![](../_assets/cpu-io-pool.png)

1. If the spike is in the user pool, analyze changes in the user load that might have caused the CPU bottleneck. See the following charts on the **DB overview** dashboard in Grafana:

    - **Requests** chart

        ![](../_assets/requests.png)

    - **Request size** chart

        ![](../_assets/request-size.png)

    - **Response size** chart

        ![](../_assets/response-size.png)

    Also, see all of the charts in the **Operations** section of the **DataShard** dashboard.

2. If the spike is in the batch pool, check if there are any backups running.
@@ -0,0 +1,18 @@
1. Open the **[Distributed Storage Overview](../../../../../reference/observability/metrics/grafana-dashboards.md)** dashboard in Grafana.

1. On the **DiskTimeAvailable and total Cost relation** chart, see if the **Total Cost** spikes cross the **DiskTimeAvailable** level.

    ![](../_assets/disk-time-available--disk-cost.png)

    This chart shows the estimated total bandwidth capacity of the storage system in conventional units (green) and the total usage cost (blue). When the total usage cost exceeds the total bandwidth capacity, the {{ ydb-short-name }} storage system becomes overloaded, leading to increased latencies.

1. On the **Total burst duration** chart, check for any load spikes on the storage system. This chart displays microbursts of load on the storage system, measured in microseconds.

    ![](../_assets/microbursts.png)

    {% note info %}

    This chart might show microbursts of load that are not detected by the average usage cost in the **DiskTimeAvailable and total Cost relation** chart.

    {% endnote %}
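The overload condition described above reduces to a simple comparison, sketched here in Python. This is illustrative only: the real values come from the Grafana metrics in conventional units, and the numbers below are made up.

```python
def is_overloaded(total_cost, disk_time_available):
    """The storage system is considered overloaded when the total usage
    cost exceeds the estimated bandwidth capacity (same conventional units)."""
    return total_cost > disk_time_available

# A burst whose cost crosses the DiskTimeAvailable level signals overload;
# a load within the budget does not.
print(is_overloaded(1_200_000, 1_000_000))
print(is_overloaded(800_000, 1_000_000))
```

Note that averaged charts can hide short microbursts that individually satisfy this condition, which is why the **Total burst duration** chart is checked separately.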
@@ -0,0 +1,14 @@
# CPU bottleneck

High CPU usage can lead to slow query processing and increased response times. When CPU resources are constrained, the database may have difficulty handling complex queries or large transaction volumes.

{{ ydb-short-name }} nodes primarily consume CPU resources for running [actors](../../../../concepts/glossary.md#actor). On each node, actors are executed using multiple [actor system pools](../../../../concepts/glossary.md#actor-system-pool). The resource consumption of each pool is measured separately, which helps identify which kind of activity changed its behavior.

## Diagnostics

<!-- The include is added to allow partial overrides in overlays -->
{% include notitle [#](_includes/cpu-bottleneck.md) %}

## Recommendations

Add additional [database nodes](../../../../concepts/glossary.md#database-node) to the cluster or allocate more CPU cores to the existing nodes. If that's not possible, consider distributing CPU cores between pools differently.
@@ -0,0 +1,29 @@
# Disk space

A lack of available disk space can prevent the database from storing new data, resulting in the database becoming read-only. This can also cause slowdowns as the system tries to reclaim disk space by compacting existing data more aggressively.

## Diagnostics

1. See if the **[DB overview > Storage](../../../../reference/observability/metrics/grafana-dashboards.md#dboverview)** charts in Grafana show any spikes.

1. In the [Embedded UI](../../../../reference/embedded-ui/index.md), on the **Storage** tab, analyze the list of available storage groups and nodes and their disk usage.

    {% note tip %}

    Use the **Out of Space** filter to list only the storage groups with full disks.

    {% endnote %}

    ![](_assets/storage-groups-disk-space.png)

    {% note info %}

    It is also recommended to use the [Healthcheck API](../../../../reference/ydb-sdk/health-check-api.md) to get this information.

    {% endnote %}

## Recommendations

Add more [storage groups](../../../../concepts/glossary.md#storage-group) to the database.

If the cluster doesn't have spare storage groups, configure them first. Add additional [storage nodes](../../../../concepts/glossary.md#storage-node), if necessary.
@@ -0,0 +1,58 @@
# Insufficient memory (RAM)

If [swap](https://en.wikipedia.org/wiki/Memory_paging#Unix_and_Unix-like_systems) (paging of anonymous memory) is disabled on the server running {{ ydb-short-name }}, insufficient memory activates another kernel feature called the [OOM killer](https://en.wikipedia.org/wiki/Out_of_memory), which terminates the most memory-intensive processes (often the database itself). This feature also interacts with [cgroups](https://en.wikipedia.org/wiki/Cgroups) if multiple cgroups are configured.

If swap is enabled, insufficient memory may cause the database to rely heavily on disk I/O, which is significantly slower than accessing data directly from memory.

{% note warning %}

If {{ ydb-short-name }} nodes are running on servers with swap enabled, disable it. {{ ydb-short-name }} is a distributed system, so if a node restarts due to lack of memory, the client will simply connect to another node and continue accessing data as if nothing happened. Swap would allow the query to continue on the same node but with degraded performance from increased disk I/O, which is generally less desirable.

{% endnote %}

Even though the reasons and mechanics of performance degradation due to insufficient memory might differ, the symptoms of increased latencies during query execution and data retrieval are similar in all cases.

Additionally, it may also be significant which components within the {{ ydb-short-name }} process consume memory.

## Diagnostics

1. Determine whether any {{ ydb-short-name }} nodes recently restarted for unknown reasons. Exclude cases of {{ ydb-short-name }} version upgrades and other planned maintenance. This could reveal nodes terminated by the OOM killer and restarted by `systemd`.

    1. Open the [Embedded UI](../../../../reference/embedded-ui/index.md).

    1. On the **Nodes** tab, look for nodes that have low uptime.

1. Choose a recently restarted node and log in to the server hosting it. Run the `dmesg` command to check if the kernel has recently activated the OOM killer mechanism.

    Look for lines like this:

        [ 2203.393223] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-1.scope,task=ydb,pid=1332,uid=1000
        [ 2203.393263] Out of memory: Killed process 1332 (ydb) total-vm:14219904kB, anon-rss:1771156kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:4736kB oom_score_adj:0

    Additionally, review the `ydbd` logs for relevant details.

1. Determine whether memory usage reached 100% of capacity.

    1. Open the **[DB overview](../../../../reference/observability/metrics/grafana-dashboards.md#dboverview)** dashboard in Grafana.

    1. Analyze the charts in the **Memory** section.

1. Determine whether the user load on {{ ydb-short-name }} has increased. Analyze the following charts on the **[DB overview](../../../../reference/observability/metrics/grafana-dashboards.md#dboverview)** dashboard in Grafana:

    - **Requests** chart
    - **Request size** chart
    - **Response size** chart

1. Determine whether new releases or data access changes occurred in your applications working with {{ ydb-short-name }}.

## Recommendations

Consider the following solutions for addressing insufficient memory:

- If the load on {{ ydb-short-name }} has increased due to new usage patterns or an increased query rate, try optimizing the application to reduce the load on {{ ydb-short-name }} or add more {{ ydb-short-name }} nodes.

- If the load on {{ ydb-short-name }} has not changed but nodes are still restarting, consider adding more {{ ydb-short-name }} nodes or raising the hard memory limit for the nodes. For more information about memory management in {{ ydb-short-name }}, see [{#T}](../../../../reference/configuration/index.md#memory-controller).
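The `dmesg` check from the Diagnostics section above can be scripted. Below is a hedged sketch that parses a captured sample line; on a live host you would feed it real `dmesg` or `journalctl -k` output instead.

```python
import re

# Sample kernel log line, as shown in the Diagnostics section; on a real
# host, replace this with actual dmesg output.
sample = (
    "[ 2203.393263] Out of memory: Killed process 1332 (ydb) "
    "total-vm:14219904kB, anon-rss:1771156kB, file-rss:0kB, shmem-rss:0kB"
)

# Match "Out of memory: Killed process <pid> (<name>)".
oom_pattern = re.compile(r"Out of memory: Killed process (\d+) \((\S+)\)")

match = oom_pattern.search(sample)
if match:
    pid, name = match.groups()
    print(f"OOM killer terminated {name} (pid {pid})")
```

A process name of `ydb` or `ydbd` in such a line indicates the database node itself was killed, which would explain a low-uptime node in the **Nodes** tab.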
@@ -0,0 +1,15 @@
# I/O bandwidth

A high rate of read and write operations can overwhelm the disk subsystem, leading to increased data access latencies. When the system cannot read or write data quickly enough, queries that rely on disk access will experience delays.

## Diagnostics

<!-- The include is added to allow partial overrides in overlays -->
{% include notitle [io-bandwidth](./_includes/io-bandwidth.md) %}

## Recommendations

Add more [storage groups](../../../../concepts/glossary.md#storage-group) to the database.

In cases of high microburst rates, balancing the load across storage groups might help.
@@ -0,0 +1,9 @@
items:
- name: CPU
  href: cpu-bottleneck.md
- name: Memory
  href: insufficient-memory.md
- name: I/O bandwidth
  href: io-bandwidth.md
- name: Disk space
  href: disk-space.md
