ydb/docs/en/core/reference/ydb-sdk/health-check-api.md (+269 −16)
@@ -46,29 +46,33 @@ Each issue has a nesting `level` - the higher the `level`, the deeper the issue is
|`issue_log.id`| A unique problem ID within this response. |
|`issue_log.status`| Status (severity) of the current problem. <br/>It can take one of the following values:<ul><li>`RED`: A component is faulty or unavailable.</li><li>`ORANGE`: A serious problem, we are one step away from losing availability. Maintenance may be required.</li><li>`YELLOW`: A minor problem, no risks to availability. We recommend you continue monitoring the problem.</li><li>`BLUE`: Temporary minor degradation that does not affect database availability. The system is expected to switch to `GREEN`.</li><li>`GREEN`: No problems were detected.</li><li>`GREY`: Failed to determine the status (a problem with the self-diagnostic mechanism).</li></ul> |
48
48
|`issue_log.message`| Text that describes the problem. |
|`issue_log.location`| Location of the problem. This can be a physical location or an execution context. |
|`issue_log.reason`| A set of elements, each describing a problem in the system at a certain level. |
|`issue_log.type`| Problem category (by subsystem). Each type belongs to a specific level and is linked to the others through a strict hierarchy (as shown in the picture above). |
|`issue_log.level`|The depth of problem nesting. |
|`database_status`| If the request settings contain the `verbose` parameter, the `database_status` field is filled. <br/>It provides a summary of the overall health of the database, helping to quickly assess whether there are any serious problems at a high level. [Example](#example-verbose). |
|`location`| Contains information about the host where the `HealthCheck` service was called. |
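The nested structure described above can be traversed on the client side. The sketch below assumes a JSON-like response where `reason` holds the IDs of nested issues; the field names come from the table above, but the exact response layout here is an illustrative assumption, not the SDK's wire format:

```python
# Minimal sketch: walking HealthCheck issues depth-first by nesting level.
# Field names (id, status, message, reason, level) come from the table above;
# representing issues as plain dicts is an assumption for illustration.

def walk_issues(issue_log):
    """Yield (level, status, message) tuples, following `reason` links."""
    by_id = {issue["id"]: issue for issue in issue_log}
    roots = [i for i in issue_log if i.get("level") == 1]

    def visit(issue):
        yield issue["level"], issue["status"], issue["message"]
        for child_id in issue.get("reason", []):
            if child_id in by_id:
                yield from visit(by_id[child_id])

    for root in roots:
        yield from visit(root)

issues = [
    {"id": "1", "status": "YELLOW", "message": "Database has storage issues",
     "level": 1, "reason": ["2"]},
    {"id": "2", "status": "YELLOW", "message": "Storage degraded",
     "level": 2, "reason": []},
]
for level, status, message in walk_issues(issues):
    print(f"{'  ' * (level - 1)}[{status}] {message}")
```

Indenting by `level` reproduces the hierarchy that the `reason` links encode.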
## Call parameters {#call-parameters}
The full list of extra parameters is presented below:
| `ReturnVerboseStatus` | `bool` | Determines whether the `database_status` response field is filled. Default: `false`. |
| `MinimumStatus` | `EStatusFlag` | The minimum severity status that will appear in the response. Less severe issues will be discarded. By default, all issues will be listed. |
| `MaximumLevel` | `int32` | The maximum depth of issues in the response. Issues at deeper levels will be discarded. By default, all issues will be listed. |
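The effect of `MinimumStatus` and `MaximumLevel` can be sketched as a filter over the issue list. The severity ordering below follows the status table earlier in this document; applying the filter client-side like this is an illustration of the semantics, not the SDK implementation:

```python
# Sketch of MinimumStatus / MaximumLevel filtering semantics.
# The relative severity ordering follows the status descriptions above;
# the numeric values themselves are an assumption for illustration.

SEVERITY = {"GREY": 0, "GREEN": 1, "BLUE": 2, "YELLOW": 3, "ORANGE": 4, "RED": 5}

def filter_issues(issues, minimum_status=None, maximum_level=None):
    """Keep issues at least as severe as minimum_status, no deeper than maximum_level."""
    kept = []
    for issue in issues:
        if minimum_status and SEVERITY[issue["status"]] < SEVERITY[minimum_status]:
            continue  # less severe than requested -> discarded
        if maximum_level is not None and issue["level"] > maximum_level:
            continue  # deeper than requested -> discarded
        kept.append(issue)
    return kept

sample = [
    {"status": "RED", "level": 1},
    {"status": "YELLOW", "level": 2},
    {"status": "GREEN", "level": 3},
]
print(filter_issues(sample, minimum_status="YELLOW", maximum_level=2))
```

With both parameters unset, every issue passes through, matching the documented defaults.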
## Possible problems {#problems}
@@ -80,9 +84,9 @@ struct TSelfCheckSettings : public TOperationRequestSettings<TSelfCheckSettings>
| `There are no storage pools` | Storage pools aren't configured. |
| `Storage degraded` <br>`Storage has no redundancy` <br>`Storage failed` | These issues depend solely on the underlying `STORAGE_POOLS` layer. |
| `System tablet BSC didn't provide information` | Storage diagnostics will be generated in an alternative way. |
| `Storage usage over 75%` <br>`Storage usage over 85%` <br>`Storage usage over 90%` | Need to increase disk space. |
| **STORAGE_POOL** ||
| `Pool degraded` <br>`Pool has no redundancy` <br>`Pool failed` | These issues depend solely on the underlying `STORAGE_GROUP` layer. |
| **STORAGE_GROUP** ||
| `Group has no vslots` | This case is not expected; it is an internal problem. |
| `Group degraded` | The number of disks allowed in the group is not available. |
@@ -97,8 +101,8 @@ struct TSelfCheckSettings : public TOperationRequestSettings<TSelfCheckSettings>
| `VDisk have space issue` | These issues depend solely on the underlying `PDISK` layer. |
| **PDISK** ||
| `Unknown PDisk state` | The `HealthCheck` system can't parse the PDisk state. |
| `PDisk is inactive` <br>`PDisk state is FAULTY` <br>`PDisk state is BROKEN` <br>`PDisk state is TO_BE_REMOVED` | Indicates problems with a physical disk. |
| `Available size is less than 12%` <br>`Available size is less than 9%` <br>`Available size is less than 6%` | Free space on the physical disk is running out. |
| `PDisk is not available` | A physical disk is not available. |
| **STORAGE_NODE** ||
| `Storage node is not available` | A node with disks is not available. |
@@ -110,17 +114,17 @@ struct TSelfCheckSettings : public TOperationRequestSettings<TSelfCheckSettings>
| `Compute quota usage` | These issues depend solely on the underlying `COMPUTE_QUOTA` layer. |
| `Compute has issues with tablets`| These issues depend solely on the underlying `TABLET` layer. |
| **COMPUTE_QUOTA** ||
| `Paths quota usage is over than 90%` <br>`Paths quota usage is over than 99%` <br>`Paths quota exhausted` <br>`Shards quota usage is over than 90%` <br>`Shards quota usage is over than 99%` <br>`Shards quota exhausted` | Quotas exhausted |
| **COMPUTE_NODE** | *There are no specific issues on this layer.* |
| **SYSTEM_TABLET** ||
| `System tablet is unresponsive` <br>`System tablet response time over 1000ms` <br>`System tablet response time over 5000ms` | The system tablet is not responding or it takes too long to respond. |
| **TABLET** ||
| `Tablets are restarting too often` | Tablets are restarting too often. |
| `Tablets/Followers are dead` | Tablets are not running (probably cannot be started). |
| **LOAD_AVERAGE** ||
| `LoadAverage above 100%` | A physical host is overloaded. <br>The `HealthCheck` tool monitors system load by comparing the current workload, expressed in running and waiting processes (load), to the total number of logical cores on the host (cores). For example, if a system has 8 logical cores and the current load value is 16, the load is considered to be 200%. <br>`HealthCheck` only checks whether the load exceeds the number of cores (load > cores) and reports based on this condition. This indicates that the system is working at or beyond its capacity, potentially due to a high number of processes waiting for I/O operations. <br><br>Load information source: <br>`/proc/loadavg` <br><br>Logical core count sources: <br>Primary: `/sys/fs/cgroup/cpu.max` <br>Fallback: `/sys/fs/cgroup/cpu/cpu.cfs_quota_us` and `/sys/fs/cgroup/cpu/cpu.cfs_period_us` <br>The number of cores is calculated by dividing the quota by the period (quota / period). |
| **COMPUTE_POOL** ||
| `Pool usage is over than 90%` <br>`Pool usage is over than 95%` <br>`Pool usage is over than 99%` | One of the pools' CPUs is overloaded. |
| **NODE_UPTIME** ||
| `The number of node restarts has increased` | The number of node restarts has exceeded the threshold. By default, 10 restarts per hour. |
| `Node is restarting too often` | The number of node restarts has exceeded the threshold. By default, 30 restarts per hour. |
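The `LoadAverage above 100%` computation described in the table above can be sketched as follows. The formulas (load percentage, cores = quota / period, report when load > cores) come from the table; packaging them as these particular functions is an illustrative assumption, not the actual `HealthCheck` implementation:

```python
# Sketch of the LoadAverage check: load comes from /proc/loadavg, and the
# logical core count is derived from the cgroup CPU quota and period, as
# described in the table above.

def cores_from_cgroup(quota_us, period_us, fallback_cores):
    """cores = quota / period; a non-positive quota means no cgroup limit."""
    if quota_us is None or quota_us <= 0:
        return fallback_cores
    return quota_us / period_us

def load_percent(load, cores):
    """E.g. a load of 16 on 8 logical cores is considered 200%."""
    return 100.0 * load / cores

def is_overloaded(load, cores):
    """HealthCheck reports only when the load exceeds the core count."""
    return load > cores
```

In a real check the inputs would be read from the files named in the table (`/proc/loadavg`, `/sys/fs/cgroup/cpu.max`, or the `cpu.cfs_quota_us` / `cpu.cfs_period_us` pair); the functions above isolate the arithmetic so it is easy to verify.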
@@ -136,7 +140,256 @@ The shortest `HealthCheck` response looks like this. It is returned if there is