ydb/docs/en/core/reference/ydb-sdk/health-check-api.md (+36 -19)
@@ -7,10 +7,26 @@ description: "The article will tell you how to initiate the check using the Heal
 {{ ydb-short-name }} has a built-in self-diagnostic system, which can be used to get a brief report on the database status and information about existing problems.

-To initiate the check, call the `SelfCheck` method from `Ydb.Monitoring`. You must also pass the name of the checked DB as usual.
+To initiate the check, call the `SelfCheck` method from the SDK's `Ydb.Monitoring` service. You must also pass the name of the database being checked, as usual.
+
+{% list tabs %}
+
+- C++
+
+  App code snippet for creating a client:
+
+  ```cpp
+  auto client = NYdb::NMonitoring::TMonitoringClient(driver);
+  ```
+
+  Calling the `SelfCheck` method:
+
+  ```cpp
+  auto settings = TSelfCheckSettings();
+  settings.ReturnVerboseStatus(true);
+  auto result = client.SelfCheck(settings).GetValueSync();
+  ```
+
+{% endlist %}
 
 ## Response Structure {#response-structure}
 
-Calling the method will return the following structure:
+For the full response structure, see the [ydb_monitoring.proto](https://github.com/ydb-platform/ydb/public/api/protos/ydb_monitoring.proto) file in the {{ ydb-short-name }} Git repository.
+
+Calling the `SelfCheck` method will return the following message:
 
 ```protobuf
 message SelfCheckResult {
@@ -44,6 +60,8 @@ Each issue has a nesting `level` - the higher the `level`, the deeper the issue is
 |`self_check_result`| enum field which contains the DB check result:<ul><li>`GOOD`: No problems were detected.</li><li>`DEGRADED`: Degradation of one of the database systems was detected, but the database is still functioning (for example, allowable disk loss).</li><li>`MAINTENANCE_REQUIRED`: Significant degradation was detected, there is a risk of availability loss, and human maintenance is required.</li><li>`EMERGENCY`: A serious problem was detected in the database, with complete or partial loss of availability.</li></ul> |
 |`issue_log`| This is a set of elements, each of which describes a problem in the system at a certain level. |
 |`issue_log.id`| A unique problem ID within this response. |
 |`issue_log.status`| Status (severity) of the current problem. <br/>It can take one of the following values:<ul><li>`RED`: A component is faulty or unavailable.</li><li>`ORANGE`: A serious problem, we are one step away from losing availability. Maintenance may be required.</li><li>`YELLOW`: A minor problem, no risks to availability. We recommend you continue monitoring the problem.</li><li>`BLUE`: Temporary minor degradation that does not affect database availability. The system is expected to switch to `GREEN`.</li><li>`GREEN`: No problems were detected.</li><li>`GREY`: Failed to determine the status (a problem with the self-diagnostic mechanism).</li></ul> |
 |`issue_log.message`| Text that describes the problem. |
 |`issue_log.location`| Location of the problem. This can be a physical location or an execution context. |
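The `issue_log` entries form a tree: in ydb_monitoring.proto each issue also carries a `reason` field listing the ids of the deeper-level issues that caused it. As a rough illustration of walking that tree, here is a minimal sketch using simplified stand-in structs; these are not the generated protobuf or SDK types, and the field and helper names are illustrative:

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Simplified stand-in for an issue_log entry (illustrative, not the real
// protobuf type): `reason` holds the ids of the deeper issues causing this one.
struct IssueLog {
    std::string id;
    std::string status;   // RED, ORANGE, YELLOW, BLUE, GREEN, or GREY
    std::string message;
    std::vector<std::string> reason;
    unsigned level = 0;
};

// Root issues are those that no other issue lists as a reason; they are the
// top-level problems to report first when rendering the response.
std::vector<const IssueLog*> RootIssues(const std::vector<IssueLog>& issues) {
    std::unordered_set<std::string> referenced;
    for (const auto& issue : issues)
        for (const auto& id : issue.reason)
            referenced.insert(id);
    std::vector<const IssueLog*> roots;
    for (const auto& issue : issues)
        if (referenced.find(issue.id) == referenced.end())
            roots.push_back(&issue);
    return roots;
}
```

With a two-issue response such as `Database has storage issues` (reason: a deeper `Storage failed` issue), only the database-level entry is a root.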
@@ -59,13 +77,13 @@ The whole list of extra parameters is presented below:
@@ -79,20 +97,20 @@ struct TSelfCheckSettings : public TOperationRequestSettings<TSelfCheckSettings>
 | Message | Description |
 |:----|:----|
 | **DATABASE** ||
-| `Database has multiple issues`</br>`Database has compute issues`</br>`Database has storage issues` | These issues depend solely on the underlying `COMPUTE` and `STORAGE` layers. This is the most general status of the database. |
+| `Database has multiple issues`</br>`Database has compute issues`</br>`Database has storage issues` | These issues depend solely on the underlying `COMPUTE` and `STORAGE` layers. This is the most general status of a database. |
 | **STORAGE** ||
 | `There are no storage pools` | Storage pools aren't configured. |
 | `Storage degraded`</br>`Storage has no redundancy`</br>`Storage failed` | These issues depend solely on the underlying `STORAGE_POOLS` layer. |
-| `System tablet BSC didn't provide information` | Storage diagnostics will be generated with alternative way. |
-| `Storage usage over 75%` <br>`Storage usage over 85%` <br>`Storage usage over 90%` | Need to increase disk space. |
+| `System tablet BSC didn't provide information` | Storage diagnostics will be generated using an alternative method. |
+| `Storage usage over 75%` <br>`Storage usage over 85%` <br>`Storage usage over 90%` | Some data needs to be removed, or the database needs to be reconfigured with additional disk space. |
 | **STORAGE_POOL** ||
 | `Pool degraded` <br>`Pool has no redundancy` <br>`Pool failed` | These issues depend solely on the underlying `STORAGE_GROUP` layer. |
 | **STORAGE_GROUP** ||
 | `Group has no vslots` | This case is not expected; it is an internal problem. |
-| `Group degraded` | The number of disks allowed in the group is not available. |
+| `Group degraded` | A number of disks allowed in the group are not available. |
 | `Group has no redundancy` | A storage group lost its redundancy. Another VDisk failure may lead to the loss of the group. |
 | `Group failed` | A storage group lost its integrity. Data is not available. |
-||`HealthCheck` checks various parameters (fault tolerance mode, number of failed disks, disk status, etc.) and, depending on this, sets the appropriate status and displays a message. |
+||`HealthCheck` checks various parameters (fault tolerance mode, number of failed disks, disk status, etc.) and, depending on them, sets the appropriate status and displays a message. |
 | **VDISK** ||
 | `System tablet BSC didn't provide known status` | This case is not expected; it is an internal problem. |
 | `VDisk is not available` | The disk is not operational at all. |
@@ -101,7 +119,7 @@ struct TSelfCheckSettings : public TOperationRequestSettings<TSelfCheckSettings>
 | `VDisk have space issue` | These issues depend solely on the underlying `PDISK` layer. |
 | **PDISK** ||
 | `Unknown PDisk state` | `HealthCheck` can't parse the PDisk state. |
-| `PDisk is inactive` <br>`PDisk state is FAULTY` <br>`PDisk state is BROKEN` <br>`PDisk state is TO_BE_REMOVED` | Indicates problems with a physical disk. |
+| `PDisk state is ...` | Indicates the state of a physical disk. |
 | `Available size is less than 12%` <br>`Available size is less than 9%` <br>`Available size is less than 6%` | Free space on the physical disk is running out. |
 | `PDisk is not available` | A physical disk is not available. |
 | **STORAGE_NODE** ||
@@ -114,22 +132,21 @@ struct TSelfCheckSettings : public TOperationRequestSettings<TSelfCheckSettings>
 | `Compute quota usage` | These issues depend solely on the underlying `COMPUTE_QUOTA` layer. |
 | `Compute has issues with tablets` | These issues depend solely on the underlying `TABLET` layer. |
 | **COMPUTE_QUOTA** ||
-| `Paths quota usage is over than 90%` <br>`Paths quota usage is over than 99%` <br>`Paths quota exhausted` </br>`Shards quota usage is over than 90%` </br>`Shards quota usage is over than 99%` </br>`Shards quota exhausted` |Quotas exhausted|
-| **COMPUTE_NODE** | *There is no specific issues on this layer.* |
+| `Paths quota usage is over than 90%` <br>`Paths quota usage is over than 99%` <br>`Paths quota exhausted` </br>`Shards quota usage is over than 90%` </br>`Shards quota usage is over than 99%` </br>`Shards quota exhausted` | Quotas exhausted |
 | **SYSTEM_TABLET** ||
 | `System tablet is unresponsive` <br>`System tablet response time over 1000ms` <br>`System tablet response time over 5000ms` | The system tablet is not responding or takes too long to respond. |
 | **TABLET** ||
 | `Tablets are restarting too often` | Tablets are restarting too often. |
 | `Tablets/Followers are dead` | Tablets are not running (probably cannot be started). |
 | **LOAD_AVERAGE** ||
-| `LoadAverage above 100%` | A physical host is overloaded. </br> The `Healthcheck` tool monitors system load by evaluating the current workload in terms of running and waiting processes (load) and comparing it to the total number of logical cores on the host (cores). For example, if a system has 8 logical cores and the current load value is 16, the load is considered to be 200%. </br> `Healthcheck` only checks if the load exceeds the number of cores (load > cores) and reports based on this condition. This indicates that the system is working at or beyond its capacity, potentially due to a high number of processes waiting for I/O operations. </br></br> Load Information: </br> Source: </br>`/proc/loadavg` </br> Logical Cores Information </br></br>The number of logical cores: </br>Primary Source: </br>`/sys/fs/cgroup/cpu.max` </br></br>Fallback Source: </br>`/sys/fs/cgroup/cpu/cpu.cfs_quota_us` </br> `/sys/fs/cgroup/cpu/cpu.cfs_period_us` </br>The number of cores is calculated by dividing the quota by the period (quota / period)
+| `LoadAverage above 100%` | A physical host is overloaded ([load](https://en.wikipedia.org/wiki/Load_(computing))). </br> This indicates that the system is working at or beyond its capacity, potentially due to a high number of processes waiting for I/O operations. </br></br> Load information source: </br>`/proc/loadavg` </br></br>The number of logical cores: </br>Primary source: </br>`/sys/fs/cgroup/cpu.max` </br></br>Fallback source: </br>`/sys/fs/cgroup/cpu/cpu.cfs_quota_us` </br> `/sys/fs/cgroup/cpu/cpu.cfs_period_us` </br>The number of cores is calculated by dividing the quota by the period (quota / period). |
 | **COMPUTE_POOL** ||
 | `Pool usage is over than 90%` <br>`Pool usage is over than 95%` <br>`Pool usage is over than 99%` | One of the CPU pools is overloaded. |
 | **NODE_UPTIME** ||
 | `The number of node restarts has increased` | The number of node restarts has exceeded the threshold. By default, 10 restarts per hour. |
 | `Node is restarting too often` | The number of node restarts has exceeded the threshold. By default, 30 restarts per hour. |
 | **NODES_TIME_DIFFERENCE** ||
-| `The nodes have a time difference of ... ms` | Time drift on nodes might lead to potential issues with coordinating distributed transactions. This message starts to appear from 5 ms |
+| `The nodes have a time difference of ... ms` | Time drift on nodes might lead to potential issues with coordinating distributed transactions. This message starts to appear from a 5 ms difference. |
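The `LOAD_AVERAGE` check above compares the load value read from `/proc/loadavg` against the logical core count derived from cgroup files (quota / period). A minimal sketch of that arithmetic, with illustrative helper names (the real check lives inside `HealthCheck`; reading the files themselves is omitted):

```cpp
// Logical cores from the cgroup v1 fallback source:
// cpu.cfs_quota_us divided by cpu.cfs_period_us. A non-positive quota
// means "no limit"; the caller should then fall back to the hardware
// core count (not modeled here).
double CoresFromCgroupQuota(long quotaUs, long periodUs) {
    if (quotaUs <= 0 || periodUs <= 0)
        return 0.0;
    return static_cast<double>(quotaUs) / periodUs;
}

// Load as a percentage of available cores.
double LoadPercent(double loadAvg, double cores) {
    return cores > 0.0 ? loadAvg / cores * 100.0 : 0.0;
}

// `LoadAverage above 100%` fires when load exceeds the core count.
bool IsOverloaded(double loadAvg, double cores) {
    return loadAvg > cores;
}
```

For example, per the earlier revision of this table, a host with 8 logical cores and a load of 16 comes out at 200% and triggers the issue.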
ydb/docs/ru/core/reference/ydb-sdk/health-check-api.md (+27 -11)
@@ -7,8 +7,25 @@ description: "The article will tell you how to initiate
 {{ ydb-short-name }} has a built-in self-diagnostic system, which can be used to get a brief report on the database status and information about existing problems.
 
-To initiate the check, call the `SelfCheck` method from the `Ydb.Monitoring` service. You must also pass the name of the database being checked in the standard way.
+To initiate the check, call the `SelfCheck` method from the YDB `Ydb.Monitoring` service. You must also pass the name of the database being checked in the standard way.
 
+{% list tabs %}
+
+- C++
+
+  App code snippet for creating a client:
+
+  ```cpp
+  auto client = NYdb::NMonitoring::TMonitoringClient(driver);
+  ```
+
+  Calling the `SelfCheck` method:
+
+  ```cpp
+  auto settings = TSelfCheckSettings();
+  settings.ReturnVerboseStatus(true);
+  auto result = client.SelfCheck(settings).GetValueSync();
+  ```
+
+{% endlist %}
+
+## Response Structure {#response-structure}
+
+For the full response structure, see the [ydb_monitoring.proto](https://github.com/ydb-platform/ydb/public/api/protos/ydb_monitoring.proto) file in the {{ ydb-short-name }} Git repository.
 Calling this method will return the following structure:
 
 ```protobuf
@@ -59,12 +76,12 @@ message IssueLog {
 The full list of extra parameters is presented below:
@@ -101,7 +118,7 @@ struct TSelfCheckSettings : public TOperationRequestSettings<TSelfCheckSettings>
 | `VDisk have space issue` | Depends on the underlying `PDISK` layer. |
 | **PDISK** ||
 | `Unknown PDisk state` | `HealthCheck` can't parse the PDisk state. An internal error. |
-| `PDisk is inactive` <br>`PDisk state is FAULTY` <br>`PDisk state is BROKEN` <br>`PDisk state is TO_BE_REMOVED` | Reports problems with a physical disk. |
+| `PDisk state is ...` | Reports the state of a physical disk. |
 | `Available size is less than 12%` <br>`Available size is less than 9%` <br>`Available size is less than 6%` | Free space on the physical disk is running out. |
 | `PDisk is not available` | The physical disk is missing. |
 | **STORAGE_NODE** ||
@@ -115,21 +132,20 @@ struct TSelfCheckSettings : public TOperationRequestSettings<TSelfCheckSettings>
 | `Compute has issues with tablets` | Depends on the underlying `TABLET` layer. |
 | **COMPUTE_QUOTA** ||
 | `Paths quota usage is over than 90%` <br>`Paths quota usage is over than 99%` <br>`Paths quota exhausted` </br>`Shards quota usage is over than 90%` <br>`Shards quota usage is over than 99%` <br>`Shards quota exhausted` | Quotas are exhausted. |
-| **COMPUTE_NODE** | *No messages at this level.* |
 | **SYSTEM_TABLET** ||
 | `System tablet is unresponsive` <br>`System tablet response time over 1000ms` <br>`System tablet response time over 5000ms` | The system tablet is not responding or takes too long to respond. |
 | **TABLET** ||
 | `Tablets are restarting too often` | Tablets are restarting too often. |
 | `Tablets are dead` <br>`Followers are dead` | Tablets are not running (or cannot be started). |
 | **LOAD_AVERAGE** ||
-| `LoadAverage above 100%` | A physical host is overloaded. </br>The `Healthcheck` service monitors system load by evaluating it in terms of running and waiting processes (load) and comparing it to the total number of logical cores on the host (cores). For example, if a system has 8 logical cores and the current load is 16, the load is considered to be 200%. </br>`Healthcheck` only checks whether the load exceeds the number of cores (load > cores) and reports a warning based on that. This indicates that the system is working at its limit, most likely due to a large number of processes waiting for I/O operations. </br></br>Load information: </br>Source: </br>`/proc/loadavg` </br>Logical cores information </br></br>The number of logical cores: </br>Primary source: </br>`/sys/fs/cgroup/cpu.max` </br></br>Fallback source: </br>`/sys/fs/cgroup/cpu/cpu.cfs_quota_us` </br>`/sys/fs/cgroup/cpu/cpu.cfs_period_us` </br>The number of cores is calculated by dividing the quota by the period (quota / period) |
+| `LoadAverage above 100%` | A physical host is overloaded ([load](https://en.wikipedia.org/wiki/Load_(computing))). </br>This indicates that the system is working at its limit, most likely due to a large number of processes waiting for I/O operations. </br></br>Load information: </br>Source: </br>`/proc/loadavg` </br>Logical cores information </br></br>The number of logical cores: </br>Primary source: </br>`/sys/fs/cgroup/cpu.max` </br></br>Fallback source: </br>`/sys/fs/cgroup/cpu/cpu.cfs_quota_us` </br>`/sys/fs/cgroup/cpu/cpu.cfs_period_us` </br>The number of cores is calculated by dividing the quota by the period (quota / period). |
 | **COMPUTE_POOL** ||
 | `Pool usage is over than 90%` <br>`Pool usage is over than 95%` <br>`Pool usage is over than 99%` | One of the CPU pools is overloaded. |
 | **NODE_UPTIME** ||
 | `The number of node restarts has increased` | The number of node restarts has exceeded the threshold. By default, 10 restarts per hour. |
 | `Node is restarting too often` | Nodes are restarting too often. By default, 30 restarts per hour. |
 | **NODES_TIME_DIFFERENCE** ||
-| `The nodes have a time difference of ... ms` | Time drift on nodes, which may lead to issues with coordinating distributed transactions. |
+| `The nodes have a time difference of ... ms` | Time drift on nodes, which may lead to issues with coordinating distributed transactions. This message starts to appear from a 5 ms difference. |