Commit fdf1d2c

changed description
1 parent a7fc167 commit fdf1d2c

2 files changed

+396
-110
lines changed

ydb/docs/en/core/reference/ydb-sdk/health-check-api.md

+197-55
@@ -52,7 +52,7 @@ To initiate the check, call the `SelfCheck` method from `NYdb::NMonitoring` name
5252
}
5353
```
5454

55-
This is a short messages each about a single problem. All parameters will affect the amount of information the service returns for the specified database.
55+
The result is a set of short messages, each about a single issue. The extra parameters affect the amount of information the service returns for the specified database.
5656

5757
The complete list of extra parameters is presented below:
5858

@@ -90,7 +90,7 @@ message SelfCheckResult {
9090
}
9191
```
9292

93-
The shortest HealthCheck response looks like [this](#examples) . It is returned if there is nothing wrong with the database.
93+
The shortest `HealthCheck` response looks like [this](#examples). It is returned if there is nothing wrong with the database.
9494

9595
If any issues are detected, the `issue_log` field will contain descriptions of the issues with the following structure:
9696

@@ -157,59 +157,201 @@ Status (severity) of the current issue:
157157

158158
## Possible issues {#issues}
159159

160-
| Message | Description |
161-
|:----|:----|
162-
| **DATABASE** ||
163-
| `Database has multiple issues`</br>`Database has compute issues`</br>`Database has storage issues` | These issues depend solely on the underlying `COMPUTE` and `STORAGE` layers. This is the most general status of a database. |
164-
| **STORAGE** ||
165-
| `There are no storage pools` | Storage pools aren't configured. |
166-
| `Storage degraded`</br>`Storage has no redundancy`</br>`Storage failed` | These issues depend solely on the underlying `STORAGE_POOLS` layer. |
167-
| `System tablet BSC didn't provide information` | Storage diagnostics will be generated alternatively. |
168-
| `Storage usage over 75%` <br>`Storage usage over 85%` <br>`Storage usage over 90%` | Some data needs to be removed, or the database needs to be reconfigured with additional disk space. |
169-
| **STORAGE_POOL** ||
170-
| `Pool degraded` <br>`Pool has no redundancy` <br>`Pool failed` | These issues depend solely on the underlying `STORAGE_GROUP` layer. |
171-
| **STORAGE_GROUP** ||
172-
| `Group has no vslots` | This case is not expected; it is an internal issue. |
173-
| `Group degraded` | A number of disks allowed in the group are not available. |
174-
| `Group has no redundancy` | A storage group lost its redundancy. Аnother failure of vdisk may lead to the loss of the group. |
175-
| `Group failed` | A storage group lost its integrity. Data is not available |
176-
||`HealthCheck` checks various parameters (fault tolerance mode, number of failed disks, disk status, etc.) and, depending on them, sets the appropriate status and displays a message. |
177-
| **VDISK** ||
178-
| `System tablet BSC didn't provide known status` | This case is not expected; it is an internal issue. |
179-
| `VDisk is not available` | the disk is not operational at all. |
180-
| `VDisk is being initialized` | initialization in process. |
181-
| `Replication in progress` | the disk accepts queries, but not all the data was replicated. |
182-
| `VDisk have space issue` | These issues depend solely on the underlying `PDISK` layer. |
183-
| **PDISK** ||
184-
| `Unknown PDisk state` | `HealthCheck` the system can't parse pdisk state. |
185-
| `PDisk state is ...` | Indicates state of physical disk. |
186-
| `Available size is less than 12%` <br>`Available size is less than 9%` <br>`Available size is less than 6%` | Free space on the physical disk is running out. |
187-
| `PDisk is not available` | A physical disk is not available. |
188-
| **STORAGE_NODE** ||
189-
| `Storage node is not available` | A node with disks is not available. |
190-
| **COMPUTE** ||
191-
| `There are no compute nodes` | The database has no nodes to start the tablets. </br>Unable to determine `COMPUTE_NODE` issues below. |
192-
| `Compute has issues with system tablets` | These issues depend solely on the underlying `SYSTEM_TABLET` layer. |
193-
| `Some nodes are restarting too often` | These issues depend solely on the underlying `NODE_UPTIME` layer. |
194-
| `Compute is overloaded` | These issues depend solely on the underlying `COMPUTE_POOL` layer. |
195-
| `Compute quota usage` | These issues depend solely on the underlying `COMPUTE_QUOTA` layer. |
196-
| `Compute has issues with tablets`| These issues depend solely on the underlying `TABLET` layer. |
197-
| **COMPUTE_QUOTA** ||
198-
| `Paths quota usage is over than 90%` <br>`Paths quota usage is over than 99%` <br>`Paths quota exhausted` </br>`Shards quota usage is over than 90%` </br>`Shards quota usage is over than 99%` </br>`Shards quota exhausted` | Quotas exhausted |
199-
| **SYSTEM_TABLET** ||
200-
| `System tablet is unresponsive ` <br>`System tablet response time over 1000ms` <br>`System tablet response time over 5000ms`| The system tablet is not responding or it takes too long to respond. |
201-
| **TABLET** ||
202-
| `Tablets are restarting too often` | Tablets are restarting too often. |
203-
| `Tablets/Followers are dead` | Tablets are not running (probably cannot be started). |
204-
| **LOAD_AVERAGE** ||
205-
| `LoadAverage above 100%` | ([Load](https://en.wikipedia.org/wiki/Load_(computing))) A physical host is overloaded . </br> This indicates that the system is working at or beyond its capacity, potentially due to a high number of processes waiting for I/O operations. </br></br> Load Information: </br> Source: </br>`/proc/loadavg` </br> Logical Cores Information </br></br>The number of logical cores: </br>Primary Source: </br>`/sys/fs/cgroup/cpu.max` </br></br>Fallback Source: </br>`/sys/fs/cgroup/cpu/cpu.cfs_quota_us` </br> `/sys/fs/cgroup/cpu/cpu.cfs_period_us` </br>The number of cores is calculated by dividing the quota by the period (quota / period)
206-
| **COMPUTE_POOL** ||
207-
| `Pool usage is over than 90%` <br>`Pool usage is over than 95%` <br>`Pool usage is over than 99%` | One of the pools' CPUs is overloaded. |
208-
| **NODE_UPTIME** ||
209-
| `The number of node restarts has increased` | The number of node restarts has exceeded the threshold. By default, 10 restarts per hour |
210-
| `Node is restarting too often` | The number of node restarts has exceeded the threshold. By default, 30 restarts per hour |
211-
| **NODES_TIME_DIFFERENCE** ||
212-
| `Node is ... ms behind peer [id]` <br>`Node is ... ms ahead of peer [id]` | Time drift on nodes might lead to potential issues with coordinating distributed transactions. This issus starts to appear from 5 ms |
160+
### DATABASE
161+
162+
#### Database has multiple issues, Database has compute issues, Database has storage issues
163+
164+
**Description:** These issues depend solely on the underlying `COMPUTE` and `STORAGE` layers. This is the most general status of a database.
165+
166+
### STORAGE
167+
168+
#### There are no storage pools
169+
170+
**Description:** Information about storage pools is unavailable. Most likely, storage pools aren't configured.
171+
172+
#### Storage degraded, Storage has no redundancy, Storage failed
173+
174+
**Description:** These issues depend solely on the underlying `STORAGE_POOLS` layer.
175+
176+
#### System tablet BSC didn't provide information
177+
178+
**Description:** Storage diagnostics will be generated in an alternative way.
179+
180+
#### Storage usage over 75%, Storage usage over 85%, Storage usage over 90%
181+
182+
**Description:** Some data needs to be removed, or the database needs to be reconfigured with additional disk space.
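
The three thresholds above can be read as a ladder where the most severe matching message wins. The following is a minimal Python sketch of that reading, not the `HealthCheck` implementation: the function name `storage_usage_issue`, its byte-count parameters, and the strict "greater than" comparison are assumptions made for illustration.

```python
from typing import Optional


def storage_usage_issue(used_bytes: int, total_bytes: int) -> Optional[str]:
    """Return the most severe storage usage message, or None below 75%.

    Hypothetical helper: only the 75/85/90 thresholds and the message
    texts come from the documentation above.
    """
    usage_percent = 100.0 * used_bytes / total_bytes
    # Check the highest threshold first so the most severe message wins.
    for limit in (90, 85, 75):
        if usage_percent > limit:
            return f"Storage usage over {limit}%"
    return None
```

For example, a database using 800 of 1000 bytes would map to `Storage usage over 75%` under these assumptions.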
183+
184+
### STORAGE_POOL
185+
186+
#### Pool degraded, Pool has no redundancy, Pool failed
187+
188+
**Description:** These issues depend solely on the underlying `STORAGE_GROUP` layer.
189+
190+
### STORAGE_GROUP
191+
192+
#### Group has no vslots
193+
194+
**Description:** This case is not expected; it is an internal issue.
195+
196+
#### Group degraded
197+
198+
**Description:** A number of disks allowed in the group are not available.
199+
**Logic of work:** `HealthCheck` checks various parameters (fault tolerance mode, number of failed disks, disk status, etc.) and sets the appropriate status for the group accordingly.
200+
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, set the `Groups` and `Degraded` filters, and use the known group `id` to check the availability of nodes and disks on the nodes.
201+
202+
#### Group has no redundancy
203+
204+
**Description:** A storage group lost its redundancy. Another VDisk failure may lead to the loss of the group.
205+
**Logic of work:** `HealthCheck` checks various parameters (fault tolerance mode, number of failed disks, disk status, etc.) and sets the appropriate status for the group accordingly.
206+
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, set the `Groups` and `Degraded` filters, and use the known group `id` to check the availability of nodes and disks on the nodes.
207+
208+
#### Group failed
209+
210+
**Description:** A storage group lost its integrity. Data is not available.
211+
**Logic of work:** `HealthCheck` checks various parameters (fault tolerance mode, number of failed disks, disk status, etc.) and sets the appropriate status for the group accordingly.
212+
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, set the `Groups` and `Degraded` filters, and use the known group `id` to check the availability of nodes and disks on the nodes.
213+
214+
### VDISK
215+
216+
#### System tablet BSC didn't provide known status
217+
218+
**Description:** This case is not expected; it is an internal issue.
219+
220+
#### VDisk is not available
221+
222+
**Description:** The disk is not operational at all.
223+
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, and set the `Groups` and `Degraded` filters. The group `id` can be found through the related `STORAGE_GROUP` issue. Hover over the relevant `vdisk` to identify the node with the problem. Check the availability of nodes and disks on the nodes.
224+
225+
#### VDisk is being initialized
226+
227+
**Description:** The disk is being initialized.
228+
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, and set the `Groups` and `Degraded` filters. The group `id` can be found through the related `STORAGE_GROUP` issue. Hover over the relevant `vdisk` to identify the node with the problem. Check the availability of nodes and disks on the nodes.
229+
230+
#### Replication in progress
231+
232+
**Description:** The disk accepts queries, but not all the data was replicated.
233+
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, and set the `Groups` and `Degraded` filters. The group `id` can be found through the related `STORAGE_GROUP` issue. Hover over the relevant `vdisk` to identify the node with the problem. Check the availability of nodes and disks on the nodes.
234+
235+
#### VDisk have space issue
236+
237+
**Description:** These issues depend solely on the underlying `PDISK` layer.
238+
239+
### PDISK
240+
241+
#### Unknown PDisk state
242+
243+
**Description:** `HealthCheck` can't parse the PDisk state.
244+
245+
#### PDisk state is ...
246+
247+
**Description:** Indicates the state of the physical disk.
248+
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, set the `Nodes` and `Degraded` filters, and use the known node `id` and `pdisk` to check the availability of nodes and disks on the nodes.
249+
250+
#### Available size is less than 12%, Available size is less than 9%, Available size is less than 6%
251+
252+
**Description:** Free space on the physical disk is running out.
253+
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, set the `Nodes` and `Out of Space` filters, and use the known node `id` and `pdisk` to check the available space.
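
The free-space messages run in the opposite direction from the usage thresholds: the less space is left, the more severe the message. A rough Python sketch under stated assumptions follows; `pdisk_space_issue` and its parameters are hypothetical names, and only the 12/9/6 limits and message texts are taken from the headings above.

```python
from typing import Optional


def pdisk_space_issue(available_bytes: int, total_bytes: int) -> Optional[str]:
    """Return the most severe free-space message, or None at 12% or more.

    Hypothetical helper for illustration; not the HealthCheck code.
    """
    available_percent = 100.0 * available_bytes / total_bytes
    # Check the smallest limit first so the most severe message wins.
    for limit in (6, 9, 12):
        if available_percent < limit:
            return f"Available size is less than {limit}%"
    return None
```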
254+
255+
#### PDisk is not available
256+
257+
**Description:** A physical disk is not available.
258+
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, set the `Nodes` and `Degraded` filters, and use the known node `id` and `pdisk` to check the availability of nodes and disks on the nodes.
259+
260+
### STORAGE_NODE
261+
#### Storage node is not available
262+
**Description:** A node with disks is not available.
263+
264+
### COMPUTE
265+
266+
#### There are no compute nodes
267+
268+
**Description:** The database has no nodes to start the tablets. Unable to determine `COMPUTE_NODE` issues below.
269+
270+
#### Compute has issues with system tablets
271+
272+
**Description:** These issues depend solely on the underlying `SYSTEM_TABLET` layer.
273+
274+
#### Some nodes are restarting too often
275+
276+
**Description:** These issues depend solely on the underlying `NODE_UPTIME` layer.
277+
278+
#### Compute is overloaded
279+
280+
**Description:** These issues depend solely on the underlying `COMPUTE_POOL` layer.
281+
282+
#### Compute quota usage
283+
284+
**Description:** These issues depend solely on the underlying `COMPUTE_QUOTA` layer.
285+
286+
#### Compute has issues with tablets
287+
288+
**Description:** These issues depend solely on the underlying `TABLET` layer.
289+
290+
### COMPUTE_QUOTA
291+
292+
#### Paths quota usage is over than 90%, Paths quota usage is over than 99%, Paths quota exhausted, Shards quota usage is over than 90%, Shards quota usage is over than 99%, Shards quota exhausted
293+
294+
**Description:** The paths or shards quota is close to being exhausted or is already exhausted.
295+
**Actions:** Check the number of objects (tables, topics) in the database and delete any unnecessary ones.
296+
297+
### SYSTEM_TABLET
298+
299+
#### System tablet is unresponsive, System tablet response time over 1000ms, System tablet response time over 5000ms
300+
301+
**Description:** The system tablet is not responding or it takes too long to respond.
302+
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), go to the `Storage` tab and set the `Nodes` filter. Check the `Uptime` and status of the nodes. If the `Uptime` is short, review the logs to determine the reasons for the node restarts.
303+
304+
### TABLET
305+
306+
#### Tablets are restarting too often
307+
308+
**Description:** Tablets are restarting too often.
309+
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), go to the `Nodes` tab. Check the `Uptime` and status of the nodes. If the `Uptime` is short, review the logs to determine the reasons for the node restarts.
310+
311+
#### Tablets/Followers are dead
312+
313+
**Description:** Tablets are not running (probably cannot be started).
314+
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), go to the `Nodes` tab. Check the `Uptime` and status of the nodes. If the `Uptime` is short, review the logs to determine the reasons for the node restarts.
315+
316+
### LOAD_AVERAGE
317+
318+
#### LoadAverage above 100%
319+
320+
**Description:** A physical host is overloaded (see [Load](https://en.wikipedia.org/wiki/Load_(computing))). This indicates that the system is working at or beyond its capacity, potentially due to a high number of processes waiting for I/O operations.
321+
**Logic of work:**
322+
Load Information:
323+
Source: `/proc/loadavg`
324+
Logical cores information. The number of logical cores is taken from:
325+
Primary Source: `/sys/fs/cgroup/cpu.max`
326+
Fallback Source: `/sys/fs/cgroup/cpu/cpu.cfs_quota_us`, `/sys/fs/cgroup/cpu/cpu.cfs_period_us`.
327+
The number of cores is calculated by dividing the quota by the period (quota / period).
328+
**Actions:** Check the CPU load on the nodes.
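
The core-count calculation described above can be sketched in Python as follows. This is an illustration under assumptions rather than the actual `HealthCheck` code: the function names are hypothetical, the functions take file contents as strings instead of reading the files, and the use of the first (1-minute) value from `/proc/loadavg` is an assumption, since the text does not say which of the averages is compared.

```python
from typing import Optional


def cores_from_cpu_max(cpu_max_text: str) -> Optional[float]:
    """Parse the contents of /sys/fs/cgroup/cpu.max: '<quota> <period>'."""
    parts = cpu_max_text.split()
    if len(parts) != 2 or parts[0] == "max":
        return None  # "max" means no quota is set
    quota, period = int(parts[0]), int(parts[1])
    return quota / period if period > 0 else None


def cores_from_cfs(quota_text: str, period_text: str) -> Optional[float]:
    """Fallback: contents of cpu.cfs_quota_us and cpu.cfs_period_us."""
    quota, period = int(quota_text), int(period_text)
    if quota <= 0 or period <= 0:  # a quota of -1 means "unlimited"
        return None
    return quota / period


def load_percent(loadavg_text: str, cores: float) -> float:
    """Relate the first value in /proc/loadavg to the number of cores.

    Assumption: the 1-minute average is the one that matters here.
    """
    one_minute = float(loadavg_text.split()[0])
    return 100.0 * one_minute / cores
```

With a quota of `400000` and a period of `100000`, the sketch yields 4 cores; a load average of 6.00 on such a host would then correspond to 150%, which is above the 100% mark named in the issue message.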
329+
330+
### COMPUTE_POOL
331+
332+
#### Pool usage is over than 90%, Pool usage is over than 95%, Pool usage is over than 99%
333+
334+
**Description:** One of the CPU pools is overloaded.
335+
**Actions:** Add cores to the configuration of the actor system for the corresponding CPU pool.
336+
337+
### NODE_UPTIME
338+
339+
#### The number of node restarts has increased
340+
341+
**Description:** The number of node restarts has exceeded the threshold. By default, 10 restarts per hour.
342+
**Actions:** Check the logs to determine the reasons for the process restart.
343+
344+
#### Node is restarting too often
345+
346+
**Description:** The number of node restarts has exceeded the threshold. By default, 30 restarts per hour.
347+
**Actions:** Check the logs to determine the reasons for the process restart.
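
The two `NODE_UPTIME` messages differ only in the restart threshold. A small Python sketch of that mapping is shown below; the function name `restart_issue`, its parameter names, and the strict "greater than" comparison are assumptions, while the default values of 10 and 30 restarts per hour come from the descriptions above.

```python
from typing import Optional


def restart_issue(restarts_per_hour: int,
                  increased_threshold: int = 10,
                  too_often_threshold: int = 30) -> Optional[str]:
    """Map a restart count to one of the NODE_UPTIME messages.

    Hypothetical helper; whether the real check uses > or >= is not
    stated in the documentation, so > is an assumption.
    """
    if restarts_per_hour > too_often_threshold:
        return "Node is restarting too often"
    if restarts_per_hour > increased_threshold:
        return "The number of node restarts has increased"
    return None
```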
348+
349+
### NODES_TIME_DIFFERENCE
350+
351+
#### Node is ... ms behind peer [id], Node is ... ms ahead of peer [id]
352+
353+
**Description:** Time drift on nodes might lead to potential issues with coordinating distributed transactions. The issue is reported starting from a difference of 5 ms.
354+
**Actions:** Check for discrepancies in system time between the nodes listed in the alert, and verify the operation of the time synchronization process.
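
To illustrate how the 5 ms threshold and the two message variants relate, here is a hedged Python sketch. The function `time_difference_issue`, the sign convention (positive means the node's clock is ahead of the peer), and the exact rendering of the peer identifier are all assumptions; only the threshold and the general message wording come from the text above.

```python
from typing import Optional


def time_difference_issue(diff_ms: float, peer_id: int,
                          threshold_ms: float = 5.0) -> Optional[str]:
    """Build a NODES_TIME_DIFFERENCE-style message for one peer.

    Assumed convention: diff_ms > 0 means this node is ahead of the peer,
    diff_ms < 0 means it is behind. Not the actual HealthCheck logic.
    """
    if abs(diff_ms) < threshold_ms:
        return None  # below the threshold, nothing is reported
    direction = "ahead of" if diff_ms > 0 else "behind"
    return f"Node is {abs(diff_ms):g} ms {direction} peer {peer_id}"
```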
213355

214356
## Examples {#examples}
215357
