ydb/docs/en/core/reference/ydb-sdk/health-check-api.md
#### Issues hierarchy {#issues-hierarchy}
Issues can be arranged hierarchically using the `id` and `reason` fields, which help visualize how issues in different modules affect the overall system state. All issues are arranged in a hierarchy where higher levels can depend on nested levels:
Each issue has a nesting `level`. The higher the `level`, the deeper the issue is within the hierarchy. Issues with the same `type` always have the same `level`, and they can be represented hierarchically.
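As a minimal illustration of this hierarchy, the sketch below arranges already-parsed issues into a tree. It assumes each issue has been deserialized into a dict whose `id` and `reason` keys match the `IssueLog` fields described here, with `reason` listing the ids of the nested issues an issue depends on; the sample issue ids and messages are hypothetical.

```python
# Sketch: build the issue hierarchy from parsed IssueLog-like entries.
# Assumes each entry is a dict with an "id" and an optional "reason"
# list holding the ids of the nested issues it depends on.

def build_issue_tree(issue_log):
    by_id = {issue["id"]: dict(issue, children=[]) for issue in issue_log}
    referenced = set()
    for issue in by_id.values():
        for child_id in issue.get("reason", []):
            if child_id in by_id:
                issue["children"].append(by_id[child_id])
                referenced.add(child_id)
    # Top-level issues are the ones no other issue lists as a reason.
    return [issue for issue in by_id.values() if issue["id"] not in referenced]

issues = [
    {"id": "Y-1", "level": 1, "message": "Database has storage issues", "reason": ["Y-2"]},
    {"id": "Y-2", "level": 2, "message": "Storage degraded", "reason": []},
]
roots = build_issue_tree(issues)
print(roots[0]["message"], "->", roots[0]["children"][0]["message"])
```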
The most general status of the database. It can have the following values:
| Value | Description |
|:----|:----|
|`GOOD`| No issues were detected. |
|`DEGRADED`| Degradation of at least one of the database systems was detected. |
|`MAINTENANCE_REQUIRED`| A serious issue was detected; maintenance is required. |
|`EMERGENCY`| A serious issue was detected, with complete or partial database unavailability. |

#### Issue status {#issue-status}
The status (severity) of the current issue:
| Value | Description |
|:----|:----|
|`GREY`|Unable to determine the status (an issue with the self-diagnostic subsystem). |
|`GREEN`| No issues detected. |
|`BLUE`| Temporary minor degradation that does not affect database availability; the system is expected to return to `GREEN`. |
|`YELLOW`| A minor issue with no risks to availability. It is recommended to continue monitoring the issue. |
|`ORANGE`| A serious issue, a step away from losing availability. Maintenance may be required. |
|`RED`| A component is faulty or unavailable. |
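When consuming these statuses programmatically, it helps to give them a total order so the worst status in a response can be picked out, for example when alerting. A small sketch; the numeric ranking, and in particular the choice to rank `GREY` above `RED`, is a local policy decision, not part of the API:

```python
# Illustrative severity ranking for the issue statuses above.
# Ranking GREY as most severe is a policy choice: an undetermined
# status means the self-diagnostic subsystem itself has a problem.
SEVERITY = {"GREEN": 0, "BLUE": 1, "YELLOW": 2, "ORANGE": 3, "RED": 4, "GREY": 5}

def worst_status(statuses):
    """Return the most severe status out of an iterable of status names."""
    return max(statuses, key=lambda s: SEVERITY.get(s, 0))

print(worst_status(["GREEN", "BLUE", "YELLOW"]))  # prints YELLOW
```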
#### Database has multiple issues, Database has compute issues, Database has storage issues
**Description:** These issues depend solely on the underlying `COMPUTE` and `STORAGE` layers. This represents the most general status of a database.
### STORAGE
#### There are no storage pools
**Description:** Information about storage pools is unavailable. Most likely, the storage pools are not configured.
#### Storage degraded, Storage has no redundancy, Storage failed
**Description:** These issues depend solely on the underlying `STORAGE_POOLS` layer.
#### System tablet BSC didn't provide information
**Description:** Storage diagnostics will be generated using an alternative method.
#### Storage usage over 75%, Storage usage over 85%, Storage usage over 90%
#### Group has no vslots
**Description:** This situation is not expected; it is an internal issue.
#### Group degraded
**Description:** A number of disks allowed in the group are not available.
**Logic of work:** `HealthCheck` checks various parameters (fault tolerance mode, number of failed disks, disk status, etc.) and sets the appropriate status for the group accordingly.
**Actions:** In [Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, apply the `Groups` and `Degraded` filters, and use the known group `id` to check the availability of nodes and disks on the nodes.
#### Group has no redundancy
**Description:** A storage group has lost its redundancy. Another VDisk failure could result in the loss of the group.
**Logic of work:** `HealthCheck` monitors various parameters (fault tolerance mode, number of failed disks, disk status, etc.) and sets the appropriate status for the group based on these parameters.
**Actions:** In [Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, apply the `Groups` and `Degraded` filters, and use the known group `id` to check the availability of nodes and disks on those nodes.
#### Group failed
**Description:** A storage group has lost its integrity, and data is no longer available. `HealthCheck` evaluates various parameters (fault tolerance mode, number of failed disks, disk status, etc.) and determines the appropriate status, displaying a message accordingly.
**Logic of work:** `HealthCheck` monitors various parameters (fault tolerance mode, number of failed disks, disk status, etc.) and sets the appropriate status for the group accordingly.
**Actions:** In [Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, apply the `Groups` and `Degraded` filters, and use the known group `id` to check the availability of nodes and disks on those nodes.
### VDISK
#### System tablet BSC did not provide known status
**Description:** This situation is not expected; it is an internal issue.
#### VDisk is not available
**Description:** The disk is not operational.
**Actions:** In [Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, and apply the `Groups` and `Degraded` filters. The group `id` can be found through the related `STORAGE_GROUP` issue. Hover over the relevant VDisk to identify the node with the problem and check the availability of nodes and disks on those nodes.
#### VDisk is being initialized
**Description:** The disk is in the process of initialization.
**Actions:** In [Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, and apply the `Groups` and `Degraded` filters. The group `id` can be found through the related `STORAGE_GROUP` issue. Hover over the relevant VDisk to identify the node with the problem and check the availability of nodes and disks on those nodes.
#### Replication in progress
**Description:** The disk is accepting queries, but not all data has been replicated.
**Actions:** In [Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, and apply the `Groups` and `Degraded` filters. The group `id` can be found through the related `STORAGE_GROUP` issue. Hover over the relevant VDisk to identify the node with the problem and check the availability of nodes and disks on those nodes.
#### VDisk have space issue
#### PDisk state is ...
**Description:** Indicates the state of a physical disk.
**Actions:** In [Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, set the `Nodes` and `Degraded` filters, and use the known node id and PDisk to check the availability of nodes and disks on the nodes.
#### Available size is less than 12%, Available size is less than 9%, Available size is less than 6%
**Description:** Free space on the physical disk is running out.
**Actions:** In [Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, set the `Nodes` and `Out of Space` filters, and use the known node and PDisk identifiers to check the available space.
#### PDisk is not available
**Description:** A physical disk is not available.
**Actions:** In [Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, set the `Nodes` and `Degraded` filters, and use the known node and PDisk identifiers to check the availability of nodes and disks on the nodes.
### STORAGE_NODE
#### Storage node is not available
**Description:** A storage node is not available.
### COMPUTE
#### There are no compute nodes
**Description:** The database has no nodes available to start the tablets. Unable to determine `COMPUTE_NODE` issues below.
#### Compute has issues with system tablets
#### Paths quota usage is over than 90%, Paths quota usage is over than 99%, Paths quota exhausted, Shards quota usage is over than 90%, Shards quota usage is over than 99%, Shards quota exhausted
**Description:** Quotas are exhausted.
**Actions:** Check the number of objects (tables, topics) in the database and delete any unnecessary ones.
### SYSTEM_TABLET
#### System tablet is unresponsive, System tablet response time over 1000ms, System tablet response time over 5000ms
**Description:** The system tablet is either not responding or takes too long to respond.
**Actions:** In [Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the `Storage` tab and apply the `Nodes` filter. Check the `Uptime` and the nodes' statuses. If the `Uptime` is short, review the logs to determine the reasons for the node restarts.
### TABLET
#### Tablets are restarting too often
**Description:** Tablets are restarting too frequently.
**Actions:** In [Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the `Nodes` tab. Check the `Uptime` and the nodes' statuses. If the `Uptime` is short, review the logs to determine the reasons for the node restarts.
#### Tablets/Followers are dead
**Description:** Tablets are not running (likely cannot be started).
**Actions:** In [Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the `Nodes` tab. Check the `Uptime` and the nodes' statuses. If the `Uptime` is short, review the logs to determine the reasons for the node restarts.
### LOAD_AVERAGE
#### LoadAverage above 100%
**Description:** A physical host is overloaded, meaning the system is operating at or beyond its capacity, potentially due to a high number of processes waiting for I/O operations. For more information on load, see [Load (computing)](https://en.wikipedia.org/wiki/Load_(computing)).
**Logic of work:**

Load information:

- Source: `/proc/loadavg`
- The first of the three numbers is used: the average load over the last 1 minute.

The number of cores is calculated by dividing the quota by the period (quota / period).
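The calculation above can be sketched as follows. The sample `/proc/loadavg` line and the quota and period values are illustrative:

```python
# Sketch of the LoadAverage calculation described above: take the
# 1-minute load average (the first field of /proc/loadavg) and compare
# it with the core count derived from the CPU quota divided by the period.

def load_average_percent(loadavg_line, quota_us, period_us):
    one_minute = float(loadavg_line.split()[0])  # first of the three numbers
    cores = quota_us / period_us                 # e.g. 400000 / 100000 = 4 cores
    return 100.0 * one_minute / cores

# A 1-minute load of 4.5 against a 4-core quota is above 100%.
print(load_average_percent("4.50 3.80 3.10 2/1024 4242", 400_000, 100_000))
```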
**Actions:** Check the CPU load on the nodes.
### COMPUTE_POOL
#### Pool usage is over than 90%, Pool usage is over than 95%, Pool usage is over than 99%
**Description:** One of the pools' CPUs is overloaded.
**Actions:** Add cores to the configuration of the actor system for the corresponding CPU pool.
### NODE_UPTIME
#### The number of node restarts has increased
**Description:** The number of node restarts has exceeded the threshold. By default, this is set to 10 restarts per hour.
**Actions:** Check the logs to determine the reasons for the process restarts.
#### Node is restarting too often
**Description:** The number of node restarts has exceeded the threshold. By default, this is set to 30 restarts per hour.
**Actions:** Check the logs to determine the reasons for the process restarts.
### NODES_TIME_DIFFERENCE
#### Node is ... ms behind peer [id], Node is ... ms ahead of peer [id]
**Description:** Time drift on nodes might lead to potential issues with coordinating distributed transactions. This issue starts to appear when the time difference is 5 ms or more.
**Actions:** Check for discrepancies in system time between the nodes listed in the alert, and verify the operation of the time synchronization process.
## Examples {#examples}
The shortest `HealthCheck` response looks like this. It is returned if there is nothing wrong with the database:
```json
{
"self_check_result": "GOOD"
}
```

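A consumer might branch on the overall result of such a response like this. This is a hedged sketch: only `GOOD` appears verbatim in the example above, and the other result names (`DEGRADED`, `MAINTENANCE_REQUIRED`, `EMERGENCY`) as well as the alerting policy are assumptions about the self-check result enum:

```python
import json

# Sketch: decide whether a HealthCheck response needs attention.
# The set of actionable results is an assumption for illustration;
# only "GOOD" is shown verbatim in the example above.
ACTIONABLE = {"DEGRADED", "MAINTENANCE_REQUIRED", "EMERGENCY"}

def needs_attention(raw_response):
    response = json.loads(raw_response)
    return response.get("self_check_result") in ACTIONABLE

print(needs_attention('{"self_check_result": "GOOD"}'))  # prints False
```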
#### Verbose example {#example-verbose}
`GOOD` response with `verbose` parameter:
```json
{
"self_check_result": "GOOD"
}
```