
Commit 783826f

added examples
1 parent e9a1b94 commit 783826f

File tree

2 files changed: +537 -32 lines changed

ydb/docs/en/core/reference/ydb-sdk/health-check-api.md

+269 -16
@@ -46,29 +46,33 @@ Each issue has a nesting `level` - the higher the `level`, the deeper the issue is
| `issue_log.id` | A unique problem ID within this response. |
| `issue_log.status` | Status (severity) of the current problem. <br/>It can take one of the following values:<ul><li>`RED`: A component is faulty or unavailable.</li><li>`ORANGE`: A serious problem, we are one step away from losing availability. Maintenance may be required.</li><li>`YELLOW`: A minor problem, no risks to availability. We recommend you continue monitoring the problem.</li><li>`BLUE`: Temporary minor degradation that does not affect database availability. The system is expected to switch to `GREEN`.</li><li>`GREEN`: No problems were detected.</li><li>`GREY`: Failed to determine the status (a problem with the self-diagnostic mechanism).</li></ul> |
| `issue_log.message` | Text that describes the problem. |
| `issue_log.location` | Location of the problem. This can be a physical location or an execution context. |
| `issue_log.reason` | A set of elements, each of which describes a problem in the system at a certain level (see the traversal sketch after this table). |
| `issue_log.type` | Problem category (by subsystem). Each type belongs to a specific level and is linked to other types through a rigid hierarchy (as shown in the picture above). |
| `issue_log.level` | The depth of problem nesting. |
| `database_status` | If the request settings contain the `verbose` parameter, the `database_status` field is filled. <br/>It provides a summary of the overall health of the database and is used to quickly assess, at a high level, whether there are any serious problems. [Example](#example-verbose). |
| `location` | Contains information about the host where the `HealthCheck` service was called. |
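
Because `issue_log.reason` links issues together by `id`, the `issue_log` array can be walked as a tree. Below is a minimal C++ sketch of such a walk; it assumes the `Ydb::Monitoring::SelfCheckResult` protobuf generated from the SDK's monitoring proto, with accessor names following the fields in the table above (the helper names are illustrative):

```c++
// Illustrative traversal of the HealthCheck issue hierarchy.
// Include the header generated from the monitoring proto (path depends on your build).
#include <iostream>
#include <string>
#include <unordered_map>
#include <unordered_set>

using Ydb::Monitoring::IssueLog;
using Ydb::Monitoring::SelfCheckResult;

// Recursively print an issue, then the deeper issues it references via `reason`.
void PrintIssue(const IssueLog& issue,
                const std::unordered_map<std::string, const IssueLog*>& byId,
                int indent = 0) {
    std::cout << std::string(indent, ' ') << issue.type() << ": "
              << issue.message() << "\n";
    for (const auto& reasonId : issue.reason()) {
        if (auto it = byId.find(reasonId); it != byId.end()) {
            PrintIssue(*it->second, byId, indent + 2);
        }
    }
}

// Roots are issues that no other issue lists as a reason.
void PrintIssueTree(const SelfCheckResult& result) {
    std::unordered_map<std::string, const IssueLog*> byId;
    std::unordered_set<std::string> referenced;
    for (const auto& issue : result.issue_log()) {
        byId[issue.id()] = &issue;
        for (const auto& reasonId : issue.reason()) {
            referenced.insert(reasonId);
        }
    }
    for (const auto& issue : result.issue_log()) {
        if (!referenced.count(issue.id())) {
            PrintIssue(issue, byId);
        }
    }
}
```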

## Call parameters {#call-parameters}

The full list of extra parameters is presented below; a usage sketch follows the parameter table.

{% list tabs %}

- C++

  ```c++
  struct TSelfCheckSettings : public TOperationRequestSettings<TSelfCheckSettings> {
      FLUENT_SETTING_OPTIONAL(bool, ReturnVerboseStatus);
      FLUENT_SETTING_OPTIONAL(EStatusFlag, MinimumStatus);
      FLUENT_SETTING_OPTIONAL(ui32, MaximumLevel);
  };
  ```

{% endlist %}

| Parameter | Type | Description |
|:----|:----|:----|
| `ReturnVerboseStatus` | `bool` | Determines whether the `database_status` response field is filled (see above). Default: `false`. |
| `MinimumStatus` | `EStatusFlag` | The minimum severity status to include in the response. Less severe issues are discarded. By default, all issues are listed. |
| `MaximumLevel` | `ui32` | The maximum depth of issues to include in the response. Issues at deeper levels are discarded. By default, all issues are listed. |
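
To see how these parameters combine in practice, here is a hedged sketch of a call. The fluent setter names follow from the `FLUENT_SETTING_OPTIONAL` declarations above; the monitoring client object and the exact `EStatusFlag` value spelling are assumptions, not taken from this page:

```c++
// Request a verbose self-check, keep only YELLOW-or-worse issues,
// and drop issues nested deeper than three levels.
auto settings = NYdb::NMonitoring::TSelfCheckSettings()
    .ReturnVerboseStatus(true)   // fills `database_status` in the response
    .MinimumStatus(NYdb::NMonitoring::EStatusFlag::Yellow)   // assumed enum spelling
    .MaximumLevel(3);

// `client` is assumed to be the SDK's monitoring client created from a TDriver.
auto result = client.SelfCheck(settings).GetValueSync();
```

All three settings are optional; omitting them returns every issue at every level, without the `database_status` summary.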

## Possible problems {#problems}

@@ -80,9 +84,9 @@ struct TSelfCheckSettings : public TOperationRequestSettings<TSelfCheckSettings>
| `There are no storage pools` | Storage pools aren't configured. |
| `Storage degraded`<br/>`Storage has no redundancy`<br/>`Storage failed` | These issues depend solely on the underlying `STORAGE_POOLS` layer. |
| `System tablet BSC didn't provide information` | Storage diagnostics will be generated in an alternative way. |
| `Storage usage over 75%`<br/>`Storage usage over 85%`<br/>`Storage usage over 90%` | Need to increase disk space. |
| **STORAGE_POOL** ||
| `Pool degraded`<br/>`Pool has no redundancy`<br/>`Pool failed` | These issues depend solely on the underlying `STORAGE_GROUP` layer. |
| **STORAGE_GROUP** ||
| `Group has no vslots` | This case is not expected; it indicates an internal problem. |
| `Group degraded` | A number of disks allowed in the group are not available. |
@@ -97,8 +101,8 @@ struct TSelfCheckSettings : public TOperationRequestSettings<TSelfCheckSettings>
| `VDisk have space issue` | These issues depend solely on the underlying `PDISK` layer. |
| **PDISK** ||
| `Unknown PDisk state` | The `HealthCheck` system can't parse the PDisk state. |
| `PDisk is inactive`<br/>`PDisk state is FAULTY`<br/>`PDisk state is BROKEN`<br/>`PDisk state is TO_BE_REMOVED` | Indicates problems with a physical disk. |
| `Available size is less than 12%`<br/>`Available size is less than 9%`<br/>`Available size is less than 6%` | Free space on the physical disk is running out. |
| `PDisk is not available` | A physical disk is not available. |
| **STORAGE_NODE** ||
| `Storage node is not available` | A node with disks is not available. |
@@ -110,17 +114,17 @@ struct TSelfCheckSettings : public TOperationRequestSettings<TSelfCheckSettings>
| `Compute quota usage` | These issues depend solely on the underlying `COMPUTE_QUOTA` layer. |
| `Compute has issues with tablets` | These issues depend solely on the underlying `TABLET` layer. |
| **COMPUTE_QUOTA** ||
| `Paths quota usage is over than 90%`<br/>`Paths quota usage is over than 99%`<br/>`Paths quota exhausted`<br/>`Shards quota usage is over than 90%`<br/>`Shards quota usage is over than 99%`<br/>`Shards quota exhausted` | Quotas exhausted. |
| **COMPUTE_NODE** | *There are no specific issues on this layer.* |
| **SYSTEM_TABLET** ||
| `System tablet is unresponsive`<br/>`System tablet response time over 1000ms`<br/>`System tablet response time over 5000ms` | The system tablet is not responding or takes too long to respond. |
| **TABLET** ||
| `Tablets are restarting too often` | Tablets are restarting too often. |
| `Tablets/Followers are dead` | Tablets are not running (probably cannot be started). |
| **LOAD_AVERAGE** ||
| `LoadAverage above 100%` | A physical host is overloaded. <br/>`HealthCheck` monitors system load by comparing the number of running and waiting processes (load) with the total number of logical cores on the host (cores). For example, if a system has 8 logical cores and the current load value is 16, the load is 200%. <br/>`HealthCheck` only checks whether the load exceeds the number of cores (load > cores); this indicates that the system is working at or beyond its capacity, potentially due to a high number of processes waiting for I/O operations. <br/><br/>Load source: `/proc/loadavg`. <br/>Logical cores: primary source `/sys/fs/cgroup/cpu.max`; fallback sources `/sys/fs/cgroup/cpu/cpu.cfs_quota_us` and `/sys/fs/cgroup/cpu/cpu.cfs_period_us` (cores = quota / period). <br/>A sketch of this calculation follows the table. |
| **COMPUTE_POOL** ||
| `Pool usage is over than 90%`<br/>`Pool usage is over than 95%`<br/>`Pool usage is over than 99%` | One of the pools' CPUs is overloaded. |
| **NODE_UPTIME** ||
| `The number of node restarts has increased` | The number of node restarts has exceeded the threshold (by default, 10 restarts per hour). |
| `Node is restarting too often` | The number of node restarts has exceeded the threshold (by default, 30 restarts per hour). |
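
The `LOAD_AVERAGE` check described above boils down to two reads and a division. A minimal self-contained sketch of that arithmetic (assuming cgroup v2; this is an illustration, not the `HealthCheck` implementation):

```c++
#include <fstream>
#include <iostream>
#include <string>

int main() {
    // The 1-minute load average is the first field of /proc/loadavg.
    double load = 0;
    std::ifstream("/proc/loadavg") >> load;

    // Logical cores: /sys/fs/cgroup/cpu.max holds "<quota> <period>"
    // ("max <period>" when no quota is set).
    std::ifstream cpuMax("/sys/fs/cgroup/cpu.max");
    std::string quota;
    double period = 0;
    if (!(cpuMax >> quota >> period) || quota == "max" || period <= 0) {
        return 1; // would fall back to the cgroup v1 quota/period files instead
    }
    double cores = std::stod(quota) / period; // cores = quota / period

    // HealthCheck reports the host as overloaded when load > cores.
    std::cout << "load = " << (100.0 * load / cores) << "% of capacity\n";
    if (load > cores) {
        std::cout << "LoadAverage above 100%\n";
    }
    return 0;
}
```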
@@ -136,7 +140,256 @@ The shortest `HealthCheck` response looks like this. It is returned if there is
}
```

#### Verbose example {#example-verbose}

`GOOD` response with the `verbose` parameter:

```json
{
  "self_check_result": "GOOD",
  "database_status": [
    {
      "name": "/amy/db",
      "overall": "GREEN",
      "storage": {
        "overall": "GREEN",
        "pools": [
          {
            "id": "/amy/db:ssdencrypted",
            "overall": "GREEN",
            "groups": [
              {
                "id": "2181038132",
                "overall": "GREEN",
                "vdisks": [
                  {
                    "id": "9-1-1010",
                    "overall": "GREEN",
                    "pdisk": {
                      "id": "9-1",
                      "overall": "GREEN"
                    }
                  },
                  {
                    "id": "11-1004-1009",
                    "overall": "GREEN",
                    "pdisk": {
                      "id": "11-1004",
                      "overall": "GREEN"
                    }
                  },
                  {
                    "id": "10-1003-1011",
                    "overall": "GREEN",
                    "pdisk": {
                      "id": "10-1003",
                      "overall": "GREEN"
                    }
                  },
                  {
                    "id": "8-1005-1010",
                    "overall": "GREEN",
                    "pdisk": {
                      "id": "8-1005",
                      "overall": "GREEN"
                    }
                  },
                  {
                    "id": "7-1-1008",
                    "overall": "GREEN",
                    "pdisk": {
                      "id": "7-1",
                      "overall": "GREEN"
                    }
                  },
                  {
                    "id": "6-1-1007",
                    "overall": "GREEN",
                    "pdisk": {
                      "id": "6-1",
                      "overall": "GREEN"
                    }
                  },
                  {
                    "id": "4-1005-1010",
                    "overall": "GREEN",
                    "pdisk": {
                      "id": "4-1005",
                      "overall": "GREEN"
                    }
                  },
                  {
                    "id": "2-1003-1013",
                    "overall": "GREEN",
                    "pdisk": {
                      "id": "2-1003",
                      "overall": "GREEN"
                    }
                  },
                  {
                    "id": "1-1-1008",
                    "overall": "GREEN",
                    "pdisk": {
                      "id": "1-1",
                      "overall": "GREEN"
                    }
                  }
                ]
              }
            ]
          }
        ]
      },
      "compute": {
        "overall": "GREEN",
        "nodes": [
          {
            "id": "50073",
            "overall": "GREEN",
            "pools": [
              {
                "overall": "GREEN",
                "name": "System",
                "usage": 0.000405479
              },
              {
                "overall": "GREEN",
                "name": "User",
                "usage": 0.00265229
              },
              {
                "overall": "GREEN",
                "name": "Batch",
                "usage": 0.000347933
              },
              {
                "overall": "GREEN",
                "name": "IO",
                "usage": 0.000312022
              },
              {
                "overall": "GREEN",
                "name": "IC",
                "usage": 0.000945925
              }
            ],
            "load": {
              "overall": "GREEN",
              "load": 0.2,
              "cores": 4
            }
          },
          {
            "id": "50074",
            "overall": "GREEN",
            "pools": [
              {
                "overall": "GREEN",
                "name": "System",
                "usage": 0.000619053
              },
              {
                "overall": "GREEN",
                "name": "User",
                "usage": 0.00463859
              },
              {
                "overall": "GREEN",
                "name": "Batch",
                "usage": 0.000596071
              },
              {
                "overall": "GREEN",
                "name": "IO",
                "usage": 0.0006241
              },
              {
                "overall": "GREEN",
                "name": "IC",
                "usage": 0.00218465
              }
            ],
            "load": {
              "overall": "GREEN",
              "load": 0.08,
              "cores": 4
            }
          },
          {
            "id": "50075",
            "overall": "GREEN",
            "pools": [
              {
                "overall": "GREEN",
                "name": "System",
                "usage": 0.000579126
              },
              {
                "overall": "GREEN",
                "name": "User",
                "usage": 0.00344293
              },
              {
                "overall": "GREEN",
                "name": "Batch",
                "usage": 0.000592347
              },
              {
                "overall": "GREEN",
                "name": "IO",
                "usage": 0.000525747
              },
              {
                "overall": "GREEN",
                "name": "IC",
                "usage": 0.00174265
              }
            ],
            "load": {
              "overall": "GREEN",
              "load": 0.26,
              "cores": 4
            }
          }
        ],
        "tablets": [
          {
            "overall": "GREEN",
            "type": "SchemeShard",
            "state": "GOOD",
            "count": 1
          },
          {
            "overall": "GREEN",
            "type": "SysViewProcessor",
            "state": "GOOD",
            "count": 1
          },
          {
            "overall": "GREEN",
            "type": "Coordinator",
            "state": "GOOD",
            "count": 3
          },
          {
            "overall": "GREEN",
            "type": "Mediator",
            "state": "GOOD",
            "count": 3
          },
          {
            "overall": "GREEN",
            "type": "Hive",
            "state": "GOOD",
            "count": 1
          }
        ]
      }
    }
  ]
}
```
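
For a response like the verbose example above, a quick health scan can walk `database_status` directly and surface anything that is not `GREEN`. A hedged C++ sketch, assuming the same generated protobuf types as before (accessor names mirror the JSON field names):

```c++
#include <iostream>

// Illustrative only: report databases and storage pools whose overall
// status in the verbose summary is not GREEN.
void ReportDegraded(const Ydb::Monitoring::SelfCheckResult& result) {
    for (const auto& db : result.database_status()) {
        if (db.overall() != Ydb::Monitoring::StatusFlag::GREEN) {
            std::cout << db.name() << ": overall status "
                      << Ydb::Monitoring::StatusFlag::Status_Name(db.overall())
                      << "\n";
        }
        for (const auto& pool : db.storage().pools()) {
            if (pool.overall() != Ydb::Monitoring::StatusFlag::GREEN) {
                std::cout << "  storage pool " << pool.id() << ": "
                          << Ydb::Monitoring::StatusFlag::Status_Name(pool.overall())
                          << "\n";
            }
        }
    }
}
```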

#### Emergency example {#example-emergency}

Response with `EMERGENCY` status:

```json
{
  "self_check_result": "EMERGENCY",
