Commit e9a1b94

doc update
1 parent 63b7382 commit e9a1b94

4 files changed: +431 −28 lines changed
ydb/docs/en/core/reference/ydb-sdk/health-check-api.md

+215-14
Original file line numberDiff line numberDiff line change
@@ -19,13 +19,6 @@ message SelfCheckResult {
 }
 ```
 
-The shortest `HealthCheck` response looks like this. It is returned if there is nothing wrong with the database
-```protobuf
-SelfCheckResult {
-    self_check_result: GOOD
-}
-```
-
 If any issues are detected, the `issue_log` field will contain descriptions of the problems with the following structure:
 ```protobuf
 message IssueLog {
@@ -84,17 +77,17 @@ struct TSelfCheckSettings : public TOperationRequestSettings<TSelfCheckSettings>
 | **DATABASE** ||
 | `Database has multiple issues`</br>`Database has compute issues`</br>`Database has storage issues` | These issues depend solely on the underlying `COMPUTE` and `STORAGE` layers. This is the most general status of the database. |
 | **STORAGE** ||
-| `There are no storage pools` | Unable to determine `STORAGE_POOLS` issues below. |
+| `There are no storage pools` | Storage pools aren't configured. |
 | `Storage degraded`</br>`Storage has no redundancy`</br>`Storage failed` | These issues depend solely on the underlying `STORAGE_POOLS` layer. |
 | `System tablet BSC didn't provide information` | Storage diagnostics will be generated in an alternative way. |
 | `Storage usage over 75%/85%/90%` | Need to increase disk space. |
 | **STORAGE_POOL** ||
 | `Pool degraded/has no redundancy/failed` | These issues depend solely on the underlying `STORAGE_GROUP` layer. |
 | **STORAGE_GROUP** ||
-| `Group has no vslots` ||
+| `Group has no vslots` | This case is not expected; it indicates an internal problem. |
 | `Group degraded` | The number of disks allowed in the group is not available. |
-| `Group has no redundancy` | A storage group lost its redundancy. |
-| `Group failed` | A storage group lost its integrity. |
+| `Group has no redundancy` | A storage group lost its redundancy. Another failure of a vdisk may lead to the loss of the group. |
+| `Group failed` | A storage group lost its integrity. Data is not available. |
 ||`HealthCheck` checks various parameters (fault tolerance mode, number of failed disks, disk status, etc.) and, depending on this, sets the appropriate status and displays a message. |
 | **VDISK** ||
 | `System tablet BSC didn't provide known status` | This case is not expected; it indicates an internal problem. |
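The `Storage usage over 75%/85%/90%` row above defines three escalating thresholds. As an illustration only, the tiers can be sketched as a small classifier; the mapping of thresholds to the `YELLOW`/`ORANGE`/`RED` status names is an assumption for this sketch, not something this table states, and `storage_usage_status` is a hypothetical helper, not part of YDB:

```python
def storage_usage_status(usage_percent: float) -> str:
    """Map storage usage to escalating status tiers.

    Thresholds (75/85/90%) come from the issue table; the status
    names are assumed for illustration.
    """
    if usage_percent >= 90:
        return "RED"
    if usage_percent >= 85:
        return "ORANGE"
    if usage_percent >= 75:
        return "YELLOW"
    return "GREEN"
```

For example, 80% usage would fall into the first tier (`YELLOW`), while anything at or above 90% lands in the most severe tier.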
@@ -129,6 +122,214 @@ struct TSelfCheckSettings : public TOperationRequestSettings<TSelfCheckSettings>
 | **COMPUTE_POOL** ||
 | `Pool usage is over 90/95/99%` | One of the pools' CPUs is overloaded. |
 | **NODE_UPTIME** ||
-| `Node is restarting too often/The number of node restarts has increased` | The number of node restarts has exceeded the threshold. |
-| **NODES_SYNC** ||
-| `The nodes have a time difference of ... ms` | Time drift on nodes might lead to potential issues with coordinating distributed transactions. |
+| `The number of node restarts has increased` | The number of node restarts has exceeded the threshold (by default, 10 restarts per hour). |
+| `Node is restarting too often` | The number of node restarts has exceeded the threshold (by default, 30 restarts per hour). |
+| **NODES_TIME_DIFFERENCE** ||
+| `The nodes have a time difference of ... ms` | Time drift on nodes might lead to potential issues with coordinating distributed transactions. This message appears starting from a 5 ms difference. |
+
+## Examples {#examples}
+
+The shortest `HealthCheck` response looks like this. It is returned if there is nothing wrong with the database:
+
+```json
+{
+  "self_check_result": "GOOD"
+}
+```
+
+Response with the `EMERGENCY` status:
+
+```json
+{
+  "self_check_result": "EMERGENCY",
+  "issue_log": [
+    {
+      "id": "RED-27c3-70fb",
+      "status": "RED",
+      "message": "Database has multiple issues",
+      "location": {
+        "database": {
+          "name": "/slice"
+        }
+      },
+      "reason": [
+        "RED-27c3-4e47",
+        "RED-27c3-53b5",
+        "YELLOW-27c3-5321"
+      ],
+      "type": "DATABASE",
+      "level": 1
+    },
+    {
+      "id": "RED-27c3-4e47",
+      "status": "RED",
+      "message": "Compute has issues with system tablets",
+      "location": {
+        "database": {
+          "name": "/slice"
+        }
+      },
+      "reason": [
+        "RED-27c3-c138-BSController"
+      ],
+      "type": "COMPUTE",
+      "level": 2
+    },
+    {
+      "id": "RED-27c3-c138-BSController",
+      "status": "RED",
+      "message": "System tablet is unresponsive",
+      "location": {
+        "compute": {
+          "tablet": {
+            "type": "BSController",
+            "id": [
+              "72057594037989391"
+            ]
+          }
+        },
+        "database": {
+          "name": "/slice"
+        }
+      },
+      "type": "SYSTEM_TABLET",
+      "level": 3
+    },
+    {
+      "id": "RED-27c3-53b5",
+      "status": "RED",
+      "message": "System tablet BSC didn't provide information",
+      "location": {
+        "database": {
+          "name": "/slice"
+        }
+      },
+      "type": "STORAGE",
+      "level": 2
+    },
+    {
+      "id": "YELLOW-27c3-5321",
+      "status": "YELLOW",
+      "message": "Storage degraded",
+      "location": {
+        "database": {
+          "name": "/slice"
+        }
+      },
+      "reason": [
+        "YELLOW-27c3-595f-8d1d"
+      ],
+      "type": "STORAGE",
+      "level": 2
+    },
+    {
+      "id": "YELLOW-27c3-595f-8d1d",
+      "status": "YELLOW",
+      "message": "Pool degraded",
+      "location": {
+        "storage": {
+          "pool": {
+            "name": "static"
+          }
+        },
+        "database": {
+          "name": "/slice"
+        }
+      },
+      "reason": [
+        "YELLOW-27c3-ef3e-0"
+      ],
+      "type": "STORAGE_POOL",
+      "level": 3
+    },
+    {
+      "id": "RED-84d8-3-3-1",
+      "status": "RED",
+      "message": "PDisk is not available",
+      "location": {
+        "storage": {
+          "node": {
+            "id": 3,
+            "host": "man0-0026.ydb-dev.nemax.nebiuscloud.net",
+            "port": 19001
+          },
+          "pool": {
+            "group": {
+              "vdisk": {
+                "pdisk": [
+                  {
+                    "id": "3-1",
+                    "path": "/dev/disk/by-partlabel/NVMEKIKIMR01"
+                  }
+                ]
+              }
+            }
+          }
+        }
+      },
+      "type": "PDISK",
+      "level": 6
+    },
+    {
+      "id": "RED-27c3-4847-3-0-1-0-2-0",
+      "status": "RED",
+      "message": "VDisk is not available",
+      "location": {
+        "storage": {
+          "node": {
+            "id": 3,
+            "host": "man0-0026.ydb-dev.nemax.nebiuscloud.net",
+            "port": 19001
+          },
+          "pool": {
+            "name": "static",
+            "group": {
+              "vdisk": {
+                "id": [
+                  "0-1-0-2-0"
+                ]
+              }
+            }
+          }
+        },
+        "database": {
+          "name": "/slice"
+        }
+      },
+      "reason": [
+        "RED-84d8-3-3-1"
+      ],
+      "type": "VDISK",
+      "level": 5
+    },
+    {
+      "id": "YELLOW-27c3-ef3e-0",
+      "status": "YELLOW",
+      "message": "Group degraded",
+      "location": {
+        "storage": {
+          "pool": {
+            "name": "static",
+            "group": {
+              "id": [
+                "0"
+              ]
+            }
+          }
+        },
+        "database": {
+          "name": "/slice"
+        }
+      },
+      "reason": [
+        "RED-27c3-4847-3-0-1-0-2-0"
+      ],
+      "type": "STORAGE_GROUP",
+      "level": 4
+    }
+  ],
+  "location": {
+    "id": 5,
+    "host": "man0-0028.ydb-dev.nemax.nebiuscloud.net",
+    "port": 19001
+  }
+}
+```
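In the `EMERGENCY` example above, issues form a hierarchy: each entry's `reason` array references the `id`s of the lower-level issues that caused it, and `level` gives its depth. A minimal sketch (standard-library Python only, not part of the YDB SDK) of walking such a response and printing the issue tree from its root causes downward, using a trimmed two-issue response for brevity:

```python
import json

# A trimmed HealthCheck-style response with the fields used here
# (`self_check_result`, `issue_log`, `id`, `reason`, `status`, `type`).
response = json.loads("""
{
  "self_check_result": "EMERGENCY",
  "issue_log": [
    {"id": "RED-27c3-70fb", "status": "RED",
     "message": "Database has multiple issues",
     "reason": ["RED-27c3-53b5"], "type": "DATABASE", "level": 1},
    {"id": "RED-27c3-53b5", "status": "RED",
     "message": "System tablet BSC didn't provide information",
     "type": "STORAGE", "level": 2}
  ]
}
""")

# Index issues by id, and collect every id that appears as someone's reason.
issues = {issue["id"]: issue for issue in response.get("issue_log", [])}
referenced = {rid for issue in issues.values() for rid in issue.get("reason", [])}

def print_tree(issue_id: str, indent: int = 0) -> None:
    """Print an issue and, recursively, the issues it blames via `reason`."""
    issue = issues[issue_id]
    print("  " * indent + f'{issue["status"]} {issue["type"]}: {issue["message"]}')
    for rid in issue.get("reason", []):
        print_tree(rid, indent + 1)

print(response["self_check_result"])
# Root issues are the ones no other issue references as a reason.
for root_id in issues:
    if root_id not in referenced:
        print_tree(root_id)
```

For the trimmed response this prints `EMERGENCY`, then the `DATABASE` issue with its `STORAGE` cause indented beneath it; the same walk applies unchanged to the full example above.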