Commit 9bf7038: Initial commit

1 parent 827cf58


49 files changed: +810 -0 lines changed

ydb/docs/en/core/dev/index.md

Lines changed: 2 additions & 0 deletions

    @@ -27,4 +27,6 @@ Main resources:
     - [{#T}](../postgresql/intro.md)
     - [{#T}](../reference/kafka-api/index.md)
    +- [{#T}](troubleshooting/index.md)
    +
     If you're interested in developing {{ ydb-short-name }} core or satellite projects, refer to the [documentation for contributors](../contributor/index.md).

ydb/docs/en/core/dev/toc_p.yaml

Lines changed: 5 additions & 0 deletions

    @@ -18,6 +18,11 @@ items:
         path: primary-key/toc_p.yaml
     - name: Secondary indexes
       href: secondary-indexes.md
    +- name: Troubleshooting
    +  href: troubleshooting/index.md
    +  include:
    +    mode: link
    +    path: troubleshooting/toc_p.yaml
     - name: Query plans optimization
       href: query-plans-optimization.md
     - name: Batch upload
Lines changed: 5 additions & 0 deletions (new file)

# Troubleshooting

This section of {{ ydb-short-name }} documentation covers everything you need to know to troubleshoot {{ ydb-short-name }} issues.

- [{#T}](performance/index.md)
Lines changed: 24 additions & 0 deletions (new file)

1. Analyze CPU utilization in all pools:

    1. Open the **CPU** dashboard in Grafana.

    1. See if the following charts show any spikes:

        - **CPU by execution pool** chart
        - **User pool - CPU by host** chart
        - **System pool - CPU by host** chart
        - **Batch pool - CPU by host** chart
        - **IC pool - CPU by host** chart
        - **IO pool - CPU by host** chart

1. Analyze changes in the user load that might have caused the CPU bottleneck. See the following charts on the **DB overview** dashboard in Grafana:

    - **Requests** chart
    - **Request size** chart
    - **Response size** chart

    Also, see all of the charts in the **Operations** section of the **DataShard** dashboard. These charts show the number of rows processed per query.

1. Contact your DBA and inquire about {{ ydb-short-name }} backups.
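The per-pool check in the first step can be scripted once the same utilization numbers are exported from Grafana or its data source. A minimal sketch in Python (the pool names mirror the charts above; the 80% threshold, the data shape, and the sample values are illustrative assumptions, not {{ ydb-short-name }} metric identifiers):

```python
# Flag actor-system pools whose CPU utilization crosses a threshold.
# Utilization values are fractions in [0.0, 1.0]; the 0.8 default is an
# arbitrary starting point, not a YDB-recommended limit.

def find_overloaded_pools(utilization: dict[str, float], threshold: float = 0.8) -> list[str]:
    """Return the names of pools at or above the threshold, sorted."""
    return sorted(name for name, value in utilization.items() if value >= threshold)

if __name__ == "__main__":
    # Made-up sample: the User and Batch pools would be flagged first.
    sample = {"System": 0.35, "User": 0.92, "Batch": 0.88, "IC": 0.41, "IO": 0.12}
    print(find_overloaded_pools(sample))  # → ['Batch', 'User']
```

A pool that is consistently flagged points at the matching **... - CPU by host** chart for a per-host breakdown.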
Lines changed: 18 additions & 0 deletions (new file)

1. Open the **DB overview** dashboard in Grafana.

1. On the **Cost and DiskTimeAvailable relation** chart, see if the **Disk cost** spikes cross the **DiskTimeAvailable** level.

    ![](../_assets/disk-time-available--disk-cost.png)

    This chart shows the estimated total bandwidth capacity of the storage system in conventional units (green) and the average usage cost (blue). When the average usage cost exceeds the total bandwidth capacity, the {{ ydb-short-name }} storage system gets overloaded, which results in higher latencies.

1. On the **Burst duration, ms** chart, check for any spikes of load on the storage system. This chart shows microbursts of load on the storage system, in microseconds.

    ![](../_assets/microbursts.png)

    {% note info %}

    This chart might show load microbursts that are not detected by the average usage cost on the **Cost and DiskTimeAvailable relation** chart.

    {% endnote %}
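The condition described in the second step, the blue cost line crossing the green capacity line, can be restated as a comparison of two time series. A minimal sketch with made-up values, in Python:

```python
# Find samples where the disk cost exceeds the DiskTimeAvailable budget,
# i.e. moments when the storage system is likely overloaded.

def overloaded_points(disk_cost: list[float], disk_time_available: list[float]) -> list[int]:
    """Return indices of samples where cost exceeds the available budget."""
    return [
        i
        for i, (cost, budget) in enumerate(zip(disk_cost, disk_time_available))
        if cost > budget
    ]

if __name__ == "__main__":
    cost = [120.0, 340.0, 510.0, 280.0]     # conventional units per sample (blue line)
    budget = [400.0] * 4                    # DiskTimeAvailable level (green line)
    print(overloaded_points(cost, budget))  # → [2]
```

Persistent non-empty output over long windows suggests sustained overload; isolated indices match the short microbursts visible on the **Burst duration, ms** chart.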
Lines changed: 25 additions & 0 deletions (new file)

# CPU bottleneck

High CPU usage can lead to slow query processing and increased response times. When CPU resources are constrained, the database may have difficulty handling complex queries or large transaction volumes.

CPU resources are mainly used by the actor system. Depending on its type, each actor runs in one of the following pools:

- **System**: A pool that is designed for running quick internal operations in {{ ydb-short-name }} (it serves system tablets, state storage, distributed storage I/O, and erasure coding).

- **User**: A pool that serves the user load (user tablets, queries run in the Query Processor).

- **Batch**: A pool that serves tasks with no strict limit on the execution time, such as background operations like garbage collection and heavy queries run in the Query Processor.

- **IO**: A pool responsible for performing any tasks with blocking operations (such as authentication or writing logs to a file).

- **IC**: Interconnect; it serves the load related to inter-node communication (system calls to wait for sending and send data across the network, data serialization, as well as message splits and merges).

## Diagnostics

<!-- The include is added to allow partial overrides in overlays -->
{% include notitle [#](_includes/cpu-bottleneck.md) %}

<!-- If the spikes on these charts align, the increased latencies may be related to the higher number of rows being read from the database. In this case, the available database nodes might not be sufficient to handle the increased load. -->

## Recommendation

Add more [database nodes](../../../../concepts/glossary.md#database-node) or allocate more CPU cores to the existing nodes.
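For reference, the pool layout above is reflected in the actor system section of the {{ ydb-short-name }} static configuration. A minimal sketch, assuming the auto-configuration mode (field names and values should be verified against the cluster configuration reference for your {{ ydb-short-name }} version before applying):

```yaml
# Sketch of an actor system configuration; values are illustrative.
# In auto-config mode, per-pool thread counts are derived from cpu_count.
actor_system_config:
  use_auto_config: true   # size the System/User/Batch/IC/IO pools automatically
  node_type: COMPUTE      # assumption: a database (compute) node
  cpu_count: 16           # total cores available to the actor system
```

Allocating more cores to existing nodes, as recommended above, then amounts to raising `cpu_count` (or the per-pool thread counts in manual mode) and restarting the node.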
Lines changed: 31 additions & 0 deletions (new file)

# Disk space

A lack of available disk space can prevent the database from storing new data, resulting in the database becoming read-only. This can also cause slowdowns as the system tries to reclaim disk space by compacting existing data more aggressively.

## Diagnostics

<!-- TODO: Mention the limits metric, if it's operational -->

1. See if the **DB overview > Storage** charts in Grafana show any spikes.

1. In the [Embedded UI](../../../../reference/embedded-ui/index.md), on the **Storage** tab, analyze the list of available storage groups and nodes and their disk usage.

    {% note tip %}

    Use the **Out of Space** filter to list only the storage groups with full disks.

    {% endnote %}

    ![](_assets/storage-groups-disk-space.png)

    {% note info %}

    You can also use the [Healthcheck API](../../../../reference/ydb-sdk/health-check-api.md) in your application to get this information.

    {% endnote %}

## Recommendations

Add more [storage groups](../../../../concepts/glossary.md#storage-group) to the database.

If the cluster doesn't have spare storage groups, configure them first. Add additional [storage nodes](../../../../concepts/glossary.md#storage-node), if necessary.
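When following the recommendation above, a rough way to size the change is to estimate how many storage groups would bring average disk usage back to a comfortable level. A minimal back-of-the-envelope sketch in Python (the 70% target and the uniform-group-capacity assumption are illustrative, not {{ ydb-short-name }} rules):

```python
import math

# Estimate how many storage groups to add so that average disk usage
# drops to a target fraction, assuming uniformly sized groups.

def extra_groups_needed(used_bytes: int, groups: int, group_capacity: int, target: float = 0.7) -> int:
    """Return how many additional groups are needed to reach the target usage."""
    required = math.ceil(used_bytes / (group_capacity * target))
    return max(0, required - groups)

if __name__ == "__main__":
    # Made-up numbers: 900 units used across 10 groups of 100 units each
    # (90% average usage) needs 3 extra groups to get to roughly 70%.
    print(extra_groups_needed(used_bytes=900, groups=10, group_capacity=100))  # → 3
```

The real per-group numbers come from the **Storage** tab or the Healthcheck API; treat the result as a starting point, since data is not rebalanced perfectly evenly.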
Lines changed: 37 additions & 0 deletions (new file)

# Insufficient memory (RAM)

If [swap](https://en.wikipedia.org/wiki/Memory_paging#Unix_and_Unix-like_systems) (paging of anonymous memory) is disabled on the server running {{ ydb-short-name }}, insufficient memory activates another kernel feature called the [OOM killer](https://en.wikipedia.org/wiki/Out_of_memory), which terminates the most memory-intensive processes (often the database itself). This feature also interacts with [cgroups](https://en.wikipedia.org/wiki/Cgroups) if multiple cgroups are configured.

If swap is enabled, insufficient memory may cause the database to rely heavily on disk I/O, which is significantly slower than accessing data directly from memory. This can result in increased latencies during query execution and data retrieval.

Additionally, which components within the {{ ydb-short-name }} process consume memory may also be significant.

## Diagnostics

1. Determine whether any {{ ydb-short-name }} nodes recently restarted for unknown reasons. Exclude cases of {{ ydb-short-name }} upgrades.

    1. Open the [Embedded UI](../../../../reference/embedded-ui/index.md).

    1. On the **Nodes** tab, look for nodes that have low uptime.

1. Determine whether memory usage reached 100%.

    1. Open the **DB overview** dashboard in Grafana.

    1. Analyze the charts in the **Memory** section.

1. Determine whether the user load on {{ ydb-short-name }} has increased. Analyze the following charts on the **DB overview** dashboard in Grafana:

    - **Requests** chart
    - **Request size** chart
    - **Response size** chart

1. Determine whether new releases or data usage changes occurred in your applications.

## Recommendation

Consider the following solutions to the problem of insufficient memory:

- If the load on {{ ydb-short-name }} increased because of new usage patterns or a higher volume of queries, try to optimize the application to reduce the load on {{ ydb-short-name }} or add more {{ ydb-short-name }} nodes.

- If the load on {{ ydb-short-name }} has not changed but {{ ydb-short-name }} nodes still restart, consider adding more {{ ydb-short-name }} nodes or raising the hard memory limit available to {{ ydb-short-name }} nodes.
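The first diagnostic step has an OS-level counterpart: checking memory headroom directly on a node. A minimal sketch for Linux hosts, in Python (it only parses `/proc/meminfo`; OOM killer events themselves are recorded in the kernel log, which may require elevated privileges to read):

```python
# Report memory headroom from /proc/meminfo (Linux only).

def meminfo() -> dict[str, int]:
    """Parse /proc/meminfo into a {field: kibibytes} mapping."""
    result = {}
    with open("/proc/meminfo") as f:
        for line in f:
            name, value = line.split(":", 1)
            result[name] = int(value.strip().split()[0])  # values are reported in kB
    return result

if __name__ == "__main__":
    info = meminfo()
    total, avail = info["MemTotal"], info["MemAvailable"]
    print(f"available: {avail}/{total} kB ({100 * avail / total:.0f}%)")
```

A persistently low `MemAvailable` on a node, combined with a kernel-log record of the OOM killer terminating the {{ ydb-short-name }} process, confirms the hypothesis that the node restarted due to insufficient memory.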
Lines changed: 15 additions & 0 deletions (new file)

# I/O bandwidth

A high rate of read/write operations can overwhelm the disk subsystem, resulting in increased latencies in data access. When the system cannot read or write data quickly enough, queries that depend on disk access will experience delays.

## Diagnostics

<!-- The include is added to allow partial overrides in overlays -->
{% include notitle [io-bandwidth](./_includes/io-bandwidth.md) %}

## Recommendations

Add more [storage groups](../../../../concepts/glossary.md#storage-group) to the database.

In case of high microburst values, you can also try to balance the load across storage groups.
Lines changed: 87 additions & 0 deletions (new file)

# Troubleshooting performance issues

Addressing database performance issues often requires a holistic approach, which includes optimizing queries, properly configuring hardware resources, and ensuring that both the database and the application are well-designed. Regular monitoring and maintenance are essential for proactively identifying and resolving these issues.

## Tools to troubleshoot performance issues

Troubleshooting performance issues in {{ ydb-short-name }} involves the following tools:

- [{{ ydb-short-name }} metrics](../../../reference/observability/metrics/index.md)
- [{{ ydb-short-name }} logs](../../../devops/manual/logging.md)
- [{{ ydb-short-name }} CLI](../../../reference/ydb-cli/index.md)
- [Tracing](../../../reference/observability/tracing/setup.md)
- [Embedded UI](../../../reference/embedded-ui/index.md)
- [Query plans](../../query-plans-optimization.md)

## Classification of {{ ydb-short-name }} performance issues

Database performance issues can be classified into several categories based on their nature. This documentation section provides a high-level overview of these categories, starting with the lowest layers of the system and going all the way up to the client. Below is a separate section with the [actual performance troubleshooting instructions](#instructions).

- **Hardware infrastructure issues**.

    - **[Network issues](infrastructure/network.md)**. Insufficient bandwidth or network congestion in data centers can significantly affect {{ ydb-short-name }} performance.

    - **[Data center outages](infrastructure/dc-outage.md)**. Disruptions in data center operations that can cause service or data unavailability. These outages may result from various factors, such as power failures, natural disasters, or cyberattacks. A common fault-tolerant setup for {{ ydb-short-name }} spans three data centers or availability zones (AZs). {{ ydb-short-name }} can continue operating without interruption even if one data center and a server rack in another are lost. However, it will initiate the relocation of tablets from the offline AZ to the remaining online nodes, temporarily leading to higher query latencies. Distributed transactions involving tablets that are moving to other nodes might experience increased latencies.

    - **[Data center maintenance and drills](infrastructure/dc-drills.md)**. Planned maintenance or drills (exercises conducted to prepare personnel for potential emergencies or outages) can also affect query performance. Depending on the maintenance scope or drill scenario, some {{ ydb-short-name }} servers might become unavailable, which leads to the same impact as an outage.

    - **[Server hardware issues](infrastructure/hardware.md)**. Malfunctioning CPUs, memory modules, and network cards, until replaced, significantly impact database performance, up to the total unavailability of the affected server.

- **Insufficient resources**. These issues refer to situations when the workload demands more physical resources (such as CPU, memory, disk space, and network bandwidth) than allocated to a database.

    - **[CPU bottlenecks](hardware/cpu-bottleneck.md)**. High CPU usage can result in slow query processing and increased response times. When CPU resources are limited, the database may struggle to handle complex queries or large transaction loads.

    - **[Insufficient disk space](hardware/disk-space.md)**. A lack of available disk space can prevent the database from storing new data, resulting in the database becoming read-only. This can also cause slowdowns as the system tries to reclaim disk space by compacting existing data more aggressively.

    - **[Insufficient memory (RAM)](hardware/insufficient-memory.md)**. If swap is disabled, insufficient memory can trigger the [OOM killer](https://en.wikipedia.org/wiki/Out_of_memory), which terminates the most memory-hungry processes (for servers running databases, that's often the database itself). If swap is enabled, insufficient memory can cause the database to rely heavily on disk I/O for its operations, which is significantly slower than accessing data from memory. This can lead to increased latencies in query execution and data retrieval.

    - **[Insufficient disk I/O bandwidth](hardware/io-bandwidth.md)**. High read/write operations can overwhelm disk subsystems, leading to increased latencies in data access. When the system cannot read or write data quickly enough, queries that require disk access will be delayed.

- **OS issues**.

    - **Hardware resource allocation issues**. Suboptimal allocation of resources, for example, poorly configured control groups (cgroups), may result in insufficient resources for {{ ydb-short-name }} and increase query latencies even though physical hardware resources are still available on the database server.

    - **[System clock drift](system/system-clock-drift.md)**. If the system clocks on the {{ ydb-short-name }} servers start to drift apart, it will lead to increased distributed transaction latencies. In severe cases, {{ ydb-short-name }} might even refuse to process distributed transactions and return errors.

    - Other processes running on the same nodes as YDB, such as antiviruses, observability agents, etc.

    - Kernel misconfiguration.

- **YDB-related issues**.

    - **[Rolling restart](system/ydb-updates.md)**. {{ ydb-short-name }} is a distributed system that supports rolling restarts, when database administrators update {{ ydb-short-name }} nodes one by one. This helps keep the {{ ydb-short-name }} cluster up and running during the update process or certain {{ ydb-short-name }} configuration changes. However, when a {{ ydb-short-name }} node is being restarted, Hive moves the tablets that run on this node to other nodes, and that may lead to increased latencies for queries processed by the moving tablets.

    - Actor system pools misconfiguration.

    - SDK usage issues (maybe worth being a separate category).

- **Schema design issues**. These issues stem from inefficient decisions made during the creation of tables and indexes. They can significantly impact query performance.

- **Client application issues**. These issues refer to database queries executing slower than expected because of their inefficient design.

## Instructions {#instructions}

To troubleshoot {{ ydb-short-name }} performance issues, treat each potential cause as a hypothesis. Systematically review the list of hypotheses and verify whether they apply to your situation. The documentation for each cause provides a description, guidance on how to check diagnostics, and recommendations on what to do if the hypothesis is confirmed.

If any known changes occurred in the system around the time the performance issues first appeared, investigate those first. Otherwise, follow this recommended order for evaluating potential root causes. This order is loosely based on the descending frequency of their occurrence on large production {{ ydb-short-name }} clusters.

1. [Overloaded shards](schemas/overloaded-shards.md)
1. [Excessive tablet splits and merges](schemas/splits-merges.md)
1. [Frequent tablet moves between nodes](system/tablets-moved.md)
1. Insufficient hardware resources:

    - [Disk I/O bandwidth](hardware/io-bandwidth.md)
    - [Disk space](hardware/disk-space.md)
    - [Insufficient CPU](hardware/cpu-bottleneck.md)

1. [Hardware issues](infrastructure/hardware.md) and [data center outages](infrastructure/dc-outage.md)
1. [Network issues](infrastructure/network.md)
1. [{{ ydb-short-name }} updates](system/ydb-updates.md)
1. [System clock drift](system/system-clock-drift.md)
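One OS-level hypothesis above, system clock drift, can be spot-checked by comparing a host's clock against an NTP server. A minimal SNTP sketch in Python (the server name is a placeholder, and production hosts normally rely on chrony or systemd-timesyncd instead; this is only a quick probe):

```python
import socket
import struct
import time

NTP_EPOCH_OFFSET = 2208988800  # seconds between the NTP epoch (1900) and the Unix epoch (1970)

def transmit_seconds(packet: bytes) -> int:
    """Extract the whole-seconds part of the transmit timestamp from an NTP response."""
    return struct.unpack("!I", packet[40:44])[0]

def ntp_offset(server: str = "pool.ntp.org", timeout: float = 2.0) -> float:
    """Return the approximate local clock offset in seconds relative to the NTP server."""
    request = b"\x1b" + 47 * b"\x00"  # LI=0, VN=3, Mode=3 (client request)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(request, (server, 123))
        response, _ = sock.recvfrom(512)
    return time.time() - (transmit_seconds(response) - NTP_EPOCH_OFFSET)

if __name__ == "__main__":
    print(f"approximate clock offset: {ntp_offset():+.1f} s")
```

Run it on several nodes; offsets that differ between nodes by seconds point to clock drift as a plausible root cause.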
Lines changed: 9 additions & 0 deletions (new file)

To determine if one of the data centers of the {{ ydb-short-name }} cluster is not available, follow these steps:

1. Open the [Embedded UI](../../../../../reference/embedded-ui/index.md).

1. On the **Nodes** tab, analyze the [health indicators](../../../../../reference/embedded-ui/ydb-monitoring.md#colored_indicator) in the **Host** and **DC** columns.

    ![](../_assets/cluster-nodes.png)

If all of the nodes in one of the DCs (data centers) are not available, this data center is most likely offline.
Lines changed: 11 additions & 0 deletions (new file)

To diagnose network issues, use the healthcheck in the [Embedded UI](../../../../../reference/embedded-ui/index.md):

1. In the [Embedded UI](../../../../../reference/embedded-ui/index.md), go to the **Databases** tab and click on the database.

1. On the **Navigation** tab, ensure the required database is selected.

1. Open the **Diagnostics** tab.

1. On the **Network** tab, select the **With problems** filter.

    ![](../_assets/diagnostics-network.png)
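Outside the UI, basic inter-node connectivity can also be probed from any cluster host. A minimal sketch in Python (the hostname is a placeholder, and 19001 is only the conventional default interconnect port; verify the actual port in your cluster configuration):

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Placeholder peer node; replace with a real host from your cluster.
    print(can_connect("ydb-node-2.example.com", 19001))
```

A reachable host with an unreachable interconnect port suggests a firewall or configuration problem rather than a link failure.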
Lines changed: 11 additions & 0 deletions (new file)

# Data center maintenance and drills

Planned maintenance or drills (exercises conducted to prepare personnel for potential emergencies or outages) can also affect query performance. Depending on the maintenance scope or drill scenario, some {{ ydb-short-name }} nodes might become unavailable, which leads to the same impact as an [outage](./dc-outage.md).

## Diagnostics

Check the planned maintenance and drill schedules to see whether their timelines match the observed performance issues. If they don't, follow the [data center outage diagnostics](dc-outage.md) instead.

## Recommendations

Contact the person responsible for the current maintenance or drill and discuss whether the performance impact is severe enough to finish or cancel it early, if possible.
Lines changed: 14 additions & 0 deletions (new file)

# Data center outages

Data center outages are disruptions in data center operations that can cause service or data unavailability, although {{ ydb-short-name }} has means to withstand them. Various factors, such as power failures, natural disasters, or cyberattacks, may cause these outages. A common fault-tolerant setup for {{ ydb-short-name }} spans three data centers or availability zones (AZs). {{ ydb-short-name }} can maintain uninterrupted operation even if one data center and a server rack in another are lost. However, it will initiate the relocation of tablets from the offline AZ to the remaining online nodes, temporarily leading to higher query latencies. Distributed transactions involving tablets that are moving to other nodes might experience increased latencies.

## Diagnostics

<!-- The include is added to allow partial overrides in overlays -->
{% include notitle [dc-outage](_includes/dc-outage.md) %}

## Recommendations

Contact the party responsible for the affected data center to resolve the underlying issue. If you are part of a larger organization, this could be an in-house team managing low-level infrastructure. Otherwise, contact the support service of your cloud or hosting provider.

Meanwhile, check the data center's status page, if it has one.
