ydb/docs/en/core/reference/ydb-sdk/health-check-api.md
+197-55
@@ -52,7 +52,7 @@ To initiate the check, call the `SelfCheck` method from `NYdb::NMonitoring` name
52
52
}
53
53
```
54
54
55
-
This is a short messages each about a single problem. All parameters will affect the amount of information the service returns for the specified database.
55
+
These are short messages, each about a single issue. All parameters affect the amount of information the service returns for the specified database.
56
56
57
57
The complete list of extra parameters is presented below:
58
58
@@ -90,7 +90,7 @@ message SelfCheckResult {
90
90
}
91
91
```
92
92
93
-
The shortest HealthCheck response looks like [this](#examples) . It is returned if there is nothing wrong with the database.
93
+
The shortest `HealthCheck` response looks like [this](#examples). It is returned if there is nothing wrong with the database.
94
94
95
95
If any issues are detected, the `issue_log` field will contain descriptions of the issues with the following structure:
96
96
@@ -157,59 +157,201 @@ Status (severity) of the current issue:
157
157
158
158
## Possible issues {#issues}
159
159
160
-
| Message | Description |
161
-
|:----|:----|
162
-
|**DATABASE**||
163
-
|`Database has multiple issues`</br>`Database has compute issues`</br>`Database has storage issues`| These issues depend solely on the underlying `COMPUTE` and `STORAGE` layers. This is the most general status of a database. |
164
-
|**STORAGE**||
165
-
|`There are no storage pools`| Storage pools aren't configured. |
166
-
|`Storage degraded`</br>`Storage has no redundancy`</br>`Storage failed`| These issues depend solely on the underlying `STORAGE_POOLS` layer. |
167
-
|`System tablet BSC didn't provide information`| Storage diagnostics will be generated alternatively. |
168
-
|`Storage usage over 75%` <br>`Storage usage over 85%` <br>`Storage usage over 90%`| Some data needs to be removed, or the database needs to be reconfigured with additional disk space. |
169
-
|**STORAGE_POOL**||
170
-
|`Pool degraded` <br>`Pool has no redundancy` <br>`Pool failed`| These issues depend solely on the underlying `STORAGE_GROUP` layer. |
171
-
|**STORAGE_GROUP**||
172
-
|`Group has no vslots`| This case is not expected; it is an internal issue. |
173
-
|`Group degraded`| A number of disks allowed in the group are not available. |
174
-
|`Group has no redundancy`| A storage group lost its redundancy. Аnother failure of vdisk may lead to the loss of the group. |
175
-
|`Group failed`| A storage group lost its integrity. Data is not available |
176
-
||`HealthCheck` checks various parameters (fault tolerance mode, number of failed disks, disk status, etc.) and, depending on them, sets the appropriate status and displays a message. |
177
-
|**VDISK**||
178
-
|`System tablet BSC didn't provide known status`| This case is not expected; it is an internal issue. |
179
-
|`VDisk is not available`| the disk is not operational at all. |
180
-
|`VDisk is being initialized`| initialization in process. |
181
-
|`Replication in progress`| the disk accepts queries, but not all the data was replicated. |
182
-
|`VDisk have space issue`| These issues depend solely on the underlying `PDISK` layer. |
183
-
|**PDISK**||
184
-
|`Unknown PDisk state`|`HealthCheck` the system can't parse pdisk state. |
185
-
|`PDisk state is ...`| Indicates state of physical disk. |
186
-
|`Available size is less than 12%` <br>`Available size is less than 9%` <br>`Available size is less than 6%`| Free space on the physical disk is running out. |
187
-
|`PDisk is not available`| A physical disk is not available. |
188
-
|**STORAGE_NODE**||
189
-
|`Storage node is not available`| A node with disks is not available. |
190
-
|**COMPUTE**||
191
-
|`There are no compute nodes`| The database has no nodes to start the tablets. </br>Unable to determine `COMPUTE_NODE` issues below. |
192
-
|`Compute has issues with system tablets`| These issues depend solely on the underlying `SYSTEM_TABLET` layer. |
193
-
|`Some nodes are restarting too often`| These issues depend solely on the underlying `NODE_UPTIME` layer. |
194
-
|`Compute is overloaded`| These issues depend solely on the underlying `COMPUTE_POOL` layer. |
195
-
|`Compute quota usage`| These issues depend solely on the underlying `COMPUTE_QUOTA` layer. |
196
-
|`Compute has issues with tablets`| These issues depend solely on the underlying `TABLET` layer. |
197
-
|**COMPUTE_QUOTA**||
198
-
|`Paths quota usage is over than 90%` <br>`Paths quota usage is over than 99%` <br>`Paths quota exhausted` </br>`Shards quota usage is over than 90%` </br>`Shards quota usage is over than 99%` </br>`Shards quota exhausted`| Quotas exhausted |
199
-
|**SYSTEM_TABLET**||
200
-
|`System tablet is unresponsive ` <br>`System tablet response time over 1000ms` <br>`System tablet response time over 5000ms`| The system tablet is not responding or it takes too long to respond. |
201
-
|**TABLET**||
202
-
|`Tablets are restarting too often`| Tablets are restarting too often. |
203
-
|`Tablets/Followers are dead`| Tablets are not running (probably cannot be started). |
204
-
|**LOAD_AVERAGE**||
205
-
| `LoadAverage above 100%` | ([Load](https://en.wikipedia.org/wiki/Load_(computing))) A physical host is overloaded . </br> This indicates that the system is working at or beyond its capacity, potentially due to a high number of processes waiting for I/O operations. </br></br> Load Information: </br> Source: </br>`/proc/loadavg` </br> Logical Cores Information </br></br>The number of logical cores: </br>Primary Source: </br>`/sys/fs/cgroup/cpu.max` </br></br>Fallback Source: </br>`/sys/fs/cgroup/cpu/cpu.cfs_quota_us` </br> `/sys/fs/cgroup/cpu/cpu.cfs_period_us` </br>The number of cores is calculated by dividing the quota by the period (quota / period)
206
-
|**COMPUTE_POOL**||
207
-
|`Pool usage is over than 90%` <br>`Pool usage is over than 95%` <br>`Pool usage is over than 99%`| One of the pools' CPUs is overloaded. |
208
-
|**NODE_UPTIME**||
209
-
|`The number of node restarts has increased`| The number of node restarts has exceeded the threshold. By default, 10 restarts per hour |
210
-
|`Node is restarting too often`| The number of node restarts has exceeded the threshold. By default, 30 restarts per hour |
211
-
|**NODES_TIME_DIFFERENCE**||
212
-
|`Node is ... ms behind peer [id]` <br>`Node is ... ms ahead of peer [id]`| Time drift on nodes might lead to potential issues with coordinating distributed transactions. This issus starts to appear from 5 ms |
160
+
### DATABASE
161
+
162
+
#### Database has multiple issues, Database has compute issues, Database has storage issues
163
+
164
+
**Description:** These issues depend solely on the underlying `COMPUTE` and `STORAGE` layers. This is the most general status of a database.
165
+
166
+
### STORAGE
167
+
168
+
#### There are no storage pools
169
+
170
+
**Description:** Information about storage pools is unavailable. Most likely, storage pools aren't configured.
171
+
172
+
#### Storage degraded, Storage has no redundancy, Storage failed
173
+
174
+
**Description:** These issues depend solely on the underlying `STORAGE_POOLS` layer.
175
+
176
+
#### System tablet BSC didn't provide information
177
+
178
+
**Description:** Storage diagnostics will be generated in an alternative way.
179
+
180
+
#### Storage usage over 75%, Storage usage over 85%, Storage usage over 90%
181
+
182
+
**Description:** Some data needs to be removed, or the database needs to be reconfigured with additional disk space.
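The three usage messages above form a simple threshold ladder. A minimal sketch of such a ladder follows; the helper name `StorageUsageMessage` is hypothetical, and whether the real check uses a strict or non-strict comparison is an assumption.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical helper: maps a storage usage percentage to one of the
// messages listed above. Strict ">" comparisons are an assumption.
std::optional<std::string> StorageUsageMessage(double usedPercent) {
    assert(usedPercent >= 0.0 && usedPercent <= 100.0);
    // Check the highest threshold first so the most severe message wins.
    if (usedPercent > 90.0) return "Storage usage over 90%";
    if (usedPercent > 85.0) return "Storage usage over 85%";
    if (usedPercent > 75.0) return "Storage usage over 75%";
    return std::nullopt;  // Below every threshold: nothing to report.
}
```

For example, `StorageUsageMessage(87.5)` yields `Storage usage over 85%`, while a value of 50 yields no message.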
183
+
184
+
### STORAGE_POOL
185
+
186
+
#### Pool degraded, Pool has no redundancy, Pool failed
187
+
188
+
**Description:** These issues depend solely on the underlying `STORAGE_GROUP` layer.
189
+
190
+
### STORAGE_GROUP
191
+
192
+
#### Group has no vslots
193
+
194
+
**Description:** This case is not expected; it is an internal issue.
195
+
196
+
#### Group degraded
197
+
198
+
**Description:** A number of disks allowed in the group are not available.
199
+
**Logic of work:** `HealthCheck` checks various parameters (fault tolerance mode, number of failed disks, disk status, etc.) and sets the appropriate status for the group accordingly.
200
+
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, set the `Groups` and `Degraded` filters, and use the known group `id` to check the availability of nodes and disks on the nodes.
201
+
202
+
#### Group has no redundancy
203
+
204
+
**Description:** A storage group lost its redundancy. Another VDisk failure may lead to the loss of the group.
205
+
**Logic of work:** `HealthCheck` checks various parameters (fault tolerance mode, number of failed disks, disk status, etc.) and sets the appropriate status for the group accordingly.
206
+
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, set the `Groups` and `Degraded` filters, and use the known group `id` to check the availability of nodes and disks on the nodes.
207
+
208
+
#### Group failed
209
+
210
+
**Description:** A storage group lost its integrity. Data is not available.
211
+
**Logic of work:** `HealthCheck` checks various parameters (fault tolerance mode, number of failed disks, disk status, etc.) and sets the appropriate status for the group accordingly.
212
+
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, set the `Groups` and `Degraded` filters, and use the known group `id` to check the availability of nodes and disks on the nodes.
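The three group messages can be read as stages of the same comparison between failed disks and what the fault tolerance mode can absorb. The sketch below is a deliberately simplified illustration of that reading, not the actual `HealthCheck` logic: the function `GroupStatusMessage` and both of its inputs are hypothetical, and the real check also takes disk status and other parameters into account.

```cpp
#include <cassert>
#include <string>

// Simplified, hypothetical model of the group status decision.
// toleratedFailures: how many VDisk failures the fault tolerance mode absorbs.
// failedDisks: how many VDisks of the group are currently unavailable.
std::string GroupStatusMessage(int toleratedFailures, int failedDisks) {
    assert(toleratedFailures >= 0 && failedDisks >= 0);
    if (failedDisks == 0) return "";  // Healthy group: nothing to report.
    if (failedDisks > toleratedFailures) return "Group failed";
    if (failedDisks == toleratedFailures) return "Group has no redundancy";
    return "Group degraded";  // Some disks failed, but a margin remains.
}
```

Under these assumptions, a group that tolerates two failures reports `Group degraded` after the first failure, `Group has no redundancy` after the second, and `Group failed` after the third.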
213
+
214
+
### VDISK
215
+
216
+
#### System tablet BSC didn't provide known status
217
+
218
+
**Description:** This case is not expected; it is an internal issue.
219
+
220
+
#### VDisk is not available
221
+
222
+
**Description:** The disk is not operational at all.
223
+
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, and set the `Groups` and `Degraded` filters. The group `id` can be found through the related `STORAGE_GROUP` issue. Hover over the relevant `vdisk` to identify the node with the problem. Check the availability of nodes and disks on the nodes.
224
+
225
+
#### VDisk is being initialized
226
+
227
+
**Description:** Initialization is in progress.
228
+
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, and set the `Groups` and `Degraded` filters. The group `id` can be found through the related `STORAGE_GROUP` issue. Hover over the relevant `vdisk` to identify the node with the problem. Check the availability of nodes and disks on the nodes.
229
+
230
+
#### Replication in progress
231
+
232
+
**Description:** The disk accepts queries, but not all the data has been replicated yet.
233
+
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, and set the `Groups` and `Degraded` filters. The group `id` can be found through the related `STORAGE_GROUP` issue. Hover over the relevant `vdisk` to identify the node with the problem. Check the availability of nodes and disks on the nodes.
234
+
235
+
#### VDisk have space issue
236
+
237
+
**Description:** These issues depend solely on the underlying `PDISK` layer.
238
+
239
+
### PDISK
240
+
241
+
#### Unknown PDisk state
242
+
243
+
**Description:** `HealthCheck` can't parse the PDisk state.
244
+
245
+
#### PDisk state is ...
246
+
247
+
**Description:** Indicates the state of the physical disk.
248
+
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, set the `Nodes` and `Degraded` filters, and use the known node `id` and `pdisk` to check the availability of nodes and disks on the nodes.
249
+
250
+
#### Available size is less than 12%, Available size is less than 9%, Available size is less than 6%
251
+
252
+
**Description:** Free space on the physical disk is running out.
253
+
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, set the `Nodes` and `Out of Space` filters, and use the known node `id` and `pdisk` to check the available space.
254
+
255
+
#### PDisk is not available
256
+
257
+
**Description:** A physical disk is not available.
258
+
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, set the `Nodes` and `Degraded` filters, and use the known node `id` and `pdisk` to check the availability of nodes and disks on the nodes.
259
+
260
+
### STORAGE_NODE
261
+
#### Storage node is not available
262
+
**Description:** A node with disks is not available.
263
+
264
+
### COMPUTE
265
+
266
+
#### There are no compute nodes
267
+
268
+
**Description:** The database has no nodes on which to start tablets. The `COMPUTE_NODE` issues below cannot be determined.
269
+
270
+
#### Compute has issues with system tablets
271
+
272
+
**Description:** These issues depend solely on the underlying `SYSTEM_TABLET` layer.
273
+
274
+
#### Some nodes are restarting too often
275
+
276
+
**Description:** These issues depend solely on the underlying `NODE_UPTIME` layer.
277
+
278
+
#### Compute is overloaded
279
+
280
+
**Description:** These issues depend solely on the underlying `COMPUTE_POOL` layer.
281
+
282
+
#### Compute quota usage
283
+
284
+
**Description:** These issues depend solely on the underlying `COMPUTE_QUOTA` layer.
285
+
286
+
#### Compute has issues with tablets
287
+
288
+
**Description:** These issues depend solely on the underlying `TABLET` layer.
289
+
290
+
### COMPUTE_QUOTA
291
+
292
+
#### Paths quota usage is over than 90%, Paths quota usage is over than 99%, Paths quota exhausted, Shards quota usage is over than 90%, Shards quota usage is over than 99%, Shards quota exhausted
293
+
294
+
**Description:** The paths or shards quota is almost or completely exhausted.
295
+
**Actions:** Check the number of objects (tables, topics) in the database and delete any unnecessary ones.
296
+
297
+
### SYSTEM_TABLET
298
+
299
+
#### System tablet is unresponsive, System tablet response time over 1000ms, System tablet response time over 5000ms
300
+
301
+
**Description:** The system tablet is not responding or it takes too long to respond.
302
+
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), go to the `Storage` tab and set the `Nodes` filter. Check the `Uptime` and status of the nodes. If the `Uptime` is short, review the logs to determine the reasons for the node restarts.
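The three system tablet messages distinguish a missing response from a slow one. A minimal sketch of that distinction follows, assuming a hypothetical helper `SystemTabletMessage` that receives an empty value when the tablet did not answer at all; the strict comparisons are also an assumption.

```cpp
#include <cassert>
#include <chrono>
#include <optional>
#include <string>

// Hypothetical helper: responseTime is empty when no response arrived.
std::string SystemTabletMessage(
        std::optional<std::chrono::milliseconds> responseTime) {
    using namespace std::chrono_literals;
    assert(!responseTime || responseTime->count() >= 0);
    if (!responseTime) return "System tablet is unresponsive";
    // Check the larger threshold first so the more severe message wins.
    if (*responseTime > 5000ms) return "System tablet response time over 5000ms";
    if (*responseTime > 1000ms) return "System tablet response time over 1000ms";
    return "";  // Fast enough: nothing to report.
}
```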
303
+
304
+
### TABLET
305
+
306
+
#### Tablets are restarting too often
307
+
308
+
**Description:** Tablets are restarting too often.
309
+
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), go to the `Nodes` tab. Check the `Uptime` and status of the nodes. If the `Uptime` is short, review the logs to determine the reasons for the node restarts.
310
+
311
+
#### Tablets/Followers are dead
312
+
313
+
**Description:** Tablets are not running (probably cannot be started).
314
+
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), go to the `Nodes` tab. Check the `Uptime` and status of the nodes. If the `Uptime` is short, review the logs to determine the reasons for the node restarts.
315
+
316
+
### LOAD_AVERAGE
317
+
318
+
#### LoadAverage above 100%
319
+
320
+
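**Description:** ([Load](https://en.wikipedia.org/wiki/Load_(computing))) A physical host is overloaded. This indicates that the system is working at or beyond its capacity, potentially due to a high number of processes waiting for I/O operations. The load is read from `/proc/loadavg`. The number of logical cores is read from `/sys/fs/cgroup/cpu.max`, with `/sys/fs/cgroup/cpu/cpu.cfs_quota_us` and `/sys/fs/cgroup/cpu/cpu.cfs_period_us` as the fallback source.
The number of cores is calculated by dividing the quota by the period (quota / period).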
328
+
**Actions:** Check the CPU load on the nodes.
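The quota / period calculation can be sketched as follows for the cgroup v2 `cpu.max` file, whose content has the form `<quota> <period>` with `max` standing for "no limit". The function names are hypothetical, and treating "above 100%" as a load average greater than the number of cores is an assumption about how the check compares the two values.

```cpp
#include <cassert>
#include <optional>
#include <sstream>
#include <stdexcept>
#include <string>

// Parses the text of a cgroup v2 cpu.max file ("<quota> <period>").
// Returns quota / period, or nullopt when the quota is "max" (unlimited)
// or the text is malformed.
std::optional<double> CoresFromCpuMax(const std::string& cpuMax) {
    std::istringstream in(cpuMax);
    std::string quota;
    double period = 0.0;
    if (!(in >> quota >> period) || quota == "max" || period <= 0.0) {
        return std::nullopt;
    }
    try {
        return std::stod(quota) / period;
    } catch (const std::exception&) {
        return std::nullopt;  // Quota was not a number.
    }
}

// Assumption: the issue fires when the load average exceeds the core count.
bool IsLoadAverageAbove100(double loadAverage, double cores) {
    assert(cores > 0.0);
    return loadAverage / cores > 1.0;
}
```

For a container limited to `400000 100000`, this gives four cores, so a load average of 5 would count as above 100% under the stated assumption.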
329
+
330
+
### COMPUTE_POOL
331
+
332
+
#### Pool usage is over than 90%, Pool usage is over than 95%, Pool usage is over than 99%
333
+
334
+
**Description:** One of the CPU pools is overloaded.
335
+
**Actions:** Add cores to the configuration of the actor system for the corresponding CPU pool.
336
+
337
+
### NODE_UPTIME
338
+
339
+
#### The number of node restarts has increased
340
+
341
+
**Description:** The number of node restarts has exceeded the threshold, which is 10 restarts per hour by default.
342
+
**Actions:** Check the logs to determine the reasons for the process restart.
343
+
344
+
#### Node is restarting too often
345
+
346
+
**Description:** The number of node restarts has exceeded the threshold, which is 30 restarts per hour by default.
347
+
**Actions:** Check the logs to determine the reasons for the process restart.
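The two `NODE_UPTIME` messages differ only in the restart threshold. A minimal sketch, assuming a hypothetical `NodeRestartMessage` helper and a `RestartThresholds` struct that carries the documented defaults of 10 and 30 restarts per hour:

```cpp
#include <cassert>
#include <string>

// Hypothetical configuration holder; the defaults come from the text above.
struct RestartThresholds {
    int increased = 10;  // Restarts per hour for the milder message.
    int tooOften = 30;   // Restarts per hour for the more severe message.
};

// Returns the message for a node given its restart count over the last hour.
std::string NodeRestartMessage(int restartsPerHour,
                               RestartThresholds thresholds = RestartThresholds{}) {
    assert(restartsPerHour >= 0);
    // "Exceeded the threshold" is read as a strict comparison.
    if (restartsPerHour > thresholds.tooOften) {
        return "Node is restarting too often";
    }
    if (restartsPerHour > thresholds.increased) {
        return "The number of node restarts has increased";
    }
    return "";  // Within limits: nothing to report.
}
```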
348
+
349
+
### NODES_TIME_DIFFERENCE
350
+
351
+
#### Node is ... ms behind peer [id], Node is ... ms ahead of peer [id]
352
+
353
+
**Description:** Time drift between nodes might lead to issues with coordinating distributed transactions. This issue is reported when the difference reaches 5 ms.
354
+
**Actions:** Check for discrepancies in system time between the nodes listed in the alert, and verify the operation of the time synchronization process.
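As an illustration of how the two message variants relate to the sign of the clock difference, here is a minimal sketch. The helper `TimeDifferenceMessage`, the sign convention (peer clock minus local clock), and the literal square brackets around the peer id are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical formatter. diffMs is the peer clock minus the local clock,
// so a positive value means the local node is behind the peer.
std::string TimeDifferenceMessage(long long diffMs, int peerId) {
    assert(peerId >= 0);
    constexpr long long kThresholdMs = 5;  // Reported starting from 5 ms.
    const long long magnitude = std::llabs(diffMs);
    if (magnitude < kThresholdMs) return "";  // Drift too small to report.
    const std::string direction =
        diffMs > 0 ? " ms behind peer [" : " ms ahead of peer [";
    return "Node is " + std::to_string(magnitude) + direction +
           std::to_string(peerId) + "]";
}
```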