ydb/docs/en/core/reference/ydb-sdk/health-check-api.md
+215-14
@@ -19,13 +19,6 @@ message SelfCheckResult {
19
19
}
20
20
```
21
21
22
-
The shortest `HealthCheck` response looks like this. It is returned if there is nothing wrong with the database
23
-
```protobuf
24
-
SelfCheckResult {
25
-
self_check_result: GOOD
26
-
}
27
-
```
28
-
29
22
If any issues are detected, the `issue_log` field will contain descriptions of the problems with the following structure:
30
23
```protobuf
31
24
message IssueLog {
@@ -84,17 +77,17 @@ struct TSelfCheckSettings : public TOperationRequestSettings<TSelfCheckSettings>
84
77
| **DATABASE** ||
85
78
| `Database has multiple issues`</br>`Database has compute issues`</br>`Database has storage issues` | These issues depend solely on the underlying `COMPUTE` and `STORAGE` layers. This is the most general status of the database. |
86
79
| **STORAGE** ||
87
-
| `There are no storage pools` | Unable to determine `STORAGE_POOLS` issues below. |
80
+
| `There are no storage pools` | Storage pools aren't configured. |
88
81
| `Storage degraded`</br>`Storage has no redundancy`</br>`Storage failed` | These issues depend solely on the underlying `STORAGE_POOLS` layer. |
89
82
| `System tablet BSC didn't provide information` | Storage diagnostics will be generated in an alternative way. |
90
83
| `Storage usage over 75%/85%/90%` | Need to increase disk space. |
91
84
| **STORAGE_POOL** ||
92
85
| `Pool degraded/has no redundancy/failed` | These issues depend solely on the underlying `STORAGE_GROUP` layer. |
93
86
| **STORAGE_GROUP** ||
94
-
| `Group has no vslots` ||
87
+
| `Group has no vslots` | This case is not expected; it indicates an internal problem. |
95
88
| `Group degraded` | The number of disks allowed in the group is not available. |
96
-
| `Group has no redundancy` | A storage group lost its redundancy. |
97
-
| `Group failed` | A storage group lost its integrity. |
89
+
| `Group has no redundancy` | A storage group lost its redundancy. Another VDisk failure may lead to the loss of the group. |
90
+
| `Group failed` | A storage group lost its integrity. Data is not available. |
98
91
||`HealthCheck` checks various parameters (fault tolerance mode, number of failed disks, disk status, etc.) and, depending on this, sets the appropriate status and displays a message. |
99
92
| **VDISK** ||
100
93
| `System tablet BSC didn't provide known status` | This case is not expected; it indicates an internal problem. |
@@ -129,6 +122,214 @@ struct TSelfCheckSettings : public TOperationRequestSettings<TSelfCheckSettings>
129
122
| **COMPUTE_POOL** ||
130
123
| `Pool usage is over than 90/95/99%` | One of the pools' CPUs is overloaded. |
131
124
| **NODE_UPTIME** ||
132
-
| `Node is restarting too often/The number of node restarts has increased` | The number of node restarts has exceeded the threshold. |
133
-
| **NODES_SYNC** ||
134
-
| `The nodes have a time difference of ... ms` | Time drift on nodes might lead to potential issues with coordinating distributed transactions. |
125
+
| `The number of node restarts has increased` | The number of node restarts has exceeded the threshold. The default threshold is 10 restarts per hour. |
126
+
| `Node is restarting too often` | The number of node restarts has exceeded the threshold. The default threshold is 30 restarts per hour. |
127
+
| **NODES_TIME_DIFFERENCE** ||
128
+
| `The nodes have a time difference of ... ms` | Time drift on nodes might lead to potential issues with coordinating distributed transactions. This message appears once the difference reaches 5 ms. |
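
The two `NODE_UPTIME` messages can be read as two thresholds on the hourly restart count. The following is a minimal Python sketch of that reading, assuming the default values from the table (10 and 30 restarts per hour) and a strict "greater than" comparison. The function and constant names are hypothetical and are not part of YDB.

```python
from typing import Optional

# Defaults taken from the table above; the names are hypothetical.
INCREASED_THRESHOLD = 10   # restarts per hour
TOO_OFTEN_THRESHOLD = 30   # restarts per hour


def node_uptime_message(restarts_per_hour: int) -> Optional[str]:
    """Return the NODE_UPTIME message for a restart count, or None if no issue.

    Assumption: a threshold is "exceeded" only when the count is strictly greater.
    """
    if restarts_per_hour > TOO_OFTEN_THRESHOLD:
        return "Node is restarting too often"
    if restarts_per_hour > INCREASED_THRESHOLD:
        return "The number of node restarts has increased"
    return None
```

The higher threshold is checked first so that a node restarting more than 30 times per hour gets the more severe message rather than the milder one.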
129
+
130
+
131
+
## Examples {#examples}
132
+
The shortest `HealthCheck` response looks like this. It is returned if there is nothing wrong with the database:
133
+
```json
134
+
{
135
+
"self_check_result": "GOOD"
136
+
}
137
+
```
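
A client usually checks `self_check_result` first and reads `issue_log` only when it is present. The Python sketch below works on an already parsed JSON response and prints the issues as a tree, assuming each entry points to its causes through the `reason` list of issue ids, as in the `EMERGENCY` response in this section. The helper name `format_tree` is hypothetical and not part of any YDB SDK, and `SAMPLE` is a trimmed copy of that response with the `reason` list of the first issue shortened.

```python
import json

# Trimmed copy of the EMERGENCY example; `location` fields are omitted.
SAMPLE = json.loads("""
{
  "self_check_result": "EMERGENCY",
  "issue_log": [
    {"id": "RED-27c3-70fb", "status": "RED", "type": "DATABASE", "level": 1,
     "message": "Database has multiple issues",
     "reason": ["RED-27c3-4e47"]},
    {"id": "RED-27c3-4e47", "status": "RED", "type": "COMPUTE", "level": 2,
     "message": "Compute has issues with system tablets",
     "reason": ["RED-27c3-c138-BSController"]},
    {"id": "RED-27c3-c138-BSController", "status": "RED",
     "type": "SYSTEM_TABLET", "level": 3,
     "message": "System tablet is unresponsive"}
  ]
}
""")


def format_tree(response):
    """Return indented lines, one per issue, following `reason` links."""
    # A response without `issue_log` (for example, status GOOD) yields no lines.
    by_id = {issue["id"]: issue for issue in response.get("issue_log", [])}
    # Ids that some other issue names as a cause; the rest are tree roots.
    referenced = {ref for issue in by_id.values() for ref in issue.get("reason", [])}
    lines = []

    def visit(issue_id, depth, path):
        indent = "  " * depth
        issue = by_id.get(issue_id)
        # Unknown ids and ids already on the current path are not expanded.
        if issue is None or issue_id in path:
            lines.append(f"{indent}{issue_id} (not expanded)")
            return
        lines.append(f"{indent}{issue['status']} {issue['type']}: {issue['message']}")
        for ref in issue.get("reason", []):
            visit(ref, depth + 1, path | {issue_id})

    for issue_id in by_id:
        if issue_id not in referenced:
            visit(issue_id, 0, frozenset())
    return lines


if __name__ == "__main__":
    print(SAMPLE["self_check_result"])
    print("\n".join(format_tree(SAMPLE)))
```

Indexing the log by `id` keeps the walk independent of the order of entries in `issue_log`, and the path check guards against a malformed log in which issues reference each other in a cycle.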
138
+
139
+
Response with `EMERGENCY` status:
140
+
```json
141
+
{
142
+
"self_check_result": "EMERGENCY",
143
+
"issue_log": [
144
+
{
145
+
"id": "RED-27c3-70fb",
146
+
"status": "RED",
147
+
"message": "Database has multiple issues",
148
+
"location": {
149
+
"database": {
150
+
"name": "/slice"
151
+
}
152
+
},
153
+
"reason": [
154
+
"RED-27c3-4e47",
155
+
"RED-27c3-53b5",
156
+
"YELLOW-27c3-5321"
157
+
],
158
+
"type": "DATABASE",
159
+
"level": 1
160
+
},
161
+
{
162
+
"id": "RED-27c3-4e47",
163
+
"status": "RED",
164
+
"message": "Compute has issues with system tablets",
165
+
"location": {
166
+
"database": {
167
+
"name": "/slice"
168
+
}
169
+
},
170
+
"reason": [
171
+
"RED-27c3-c138-BSController"
172
+
],
173
+
"type": "COMPUTE",
174
+
"level": 2
175
+
},
176
+
{
177
+
"id": "RED-27c3-c138-BSController",
178
+
"status": "RED",
179
+
"message": "System tablet is unresponsive",
180
+
"location": {
181
+
"compute": {
182
+
"tablet": {
183
+
"type": "BSController",
184
+
"id": [
185
+
"72057594037989391"
186
+
]
187
+
}
188
+
},
189
+
"database": {
190
+
"name": "/slice"
191
+
}
192
+
},
193
+
"type": "SYSTEM_TABLET",
194
+
"level": 3
195
+
},
196
+
{
197
+
"id": "RED-27c3-53b5",
198
+
"status": "RED",
199
+
"message": "System tablet BSC didn't provide information",