Commit 64f66af

Apply suggestions from code review
1 parent 6c82d1f commit 64f66af

2 files changed

+161
-119
lines changed

ydb/docs/en/core/reference/ydb-sdk/health-check-api.md

+90-53
Original file line number | Diff line number | Diff line change
@@ -123,17 +123,17 @@ message IssueLog {
123123

124124
#### Issues hierarchy {#issues-hierarchy}
125125

126-
Issues can be arranged hierarchically with `id` and `reason` fields, which help to visualize how issues in a separate module affect the state of the system as a whole. All issues are arranged in a hierarchy where higher levels can depend on nested levels:
126+
Issues can be arranged hierarchically using the `id` and `reason` fields, which help visualize how issues in different modules affect the overall system state. All issues are arranged in a hierarchy where higher levels can depend on nested levels:
127127

128128
![cards_hierarchy](./_assets/hc_cards_hierarchy.png)
129129

130-
Each issue has a nesting `level`. The higher the `level`, the deeper the issue is in the hierarchy. Issues with the same `type` always have the same `level`, and they can be represented as a hierarchy.
130+
Each issue has a nesting `level`. The higher the `level`, the deeper the issue is within the hierarchy. Issues with the same `type` always have the same `level`, and they can be represented hierarchically.
131131

132132
![issues_hierarchy](./_assets/hc_types_hierarchy.png)
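
To illustrate how the `id` and `reason` fields link issues, the following Python sketch rebuilds the tree from an issue log. It assumes that each entry's `reason` holds the `id` values of the nested issues it depends on. The sample entries are simplified and hypothetical, as are the helper names `build_tree` and `render`.

```python
from typing import Dict, List, Tuple

# Hypothetical, simplified issue log; real responses carry more fields.
SAMPLE_ISSUE_LOG = [
    {"id": "0", "type": "DATABASE", "message": "Database has storage issues", "reason": ["1"]},
    {"id": "1", "type": "STORAGE", "message": "Storage degraded", "reason": ["2"]},
    {"id": "2", "type": "STORAGE_GROUP", "message": "Group degraded", "reason": []},
]


def build_tree(issue_log: List[dict]) -> Tuple[Dict[str, dict], List[str]]:
    """Index issues by id and find the roots (issues nobody lists as a reason)."""
    by_id = {issue["id"]: issue for issue in issue_log}
    referenced = {r for issue in issue_log for r in issue.get("reason", [])}
    roots = [issue["id"] for issue in issue_log if issue["id"] not in referenced]
    return by_id, roots


def render(by_id: Dict[str, dict], issue_id: str, depth: int = 0) -> List[str]:
    """Return indented lines for an issue and the issues it depends on."""
    issue = by_id[issue_id]
    lines = ["  " * depth + f'{issue["type"]}: {issue["message"]}']
    for child_id in issue.get("reason", []):
        if child_id in by_id:  # skip references to issues missing from the log
            lines.extend(render(by_id, child_id, depth + 1))
    return lines


by_id, roots = build_tree(SAMPLE_ISSUE_LOG)
for root in roots:
    print("\n".join(render(by_id, root)))
```

Root issues are those that no other issue names in its `reason` list, so the traversal starts from the most general level and descends to the nested ones.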
133133

134134
#### Database check result {#selfcheck-result}
135135

136-
The most general statuses of the database, which can have the following values:
136+
The most general status of the database. It can have the following values:
137137

138138
| Value | Description |
139139
|:----|:----|
@@ -144,14 +144,14 @@ The most general statuses of the database, which can have the following values:
144144

145145
#### Issue status {#issue-status}
146146

147-
Status (severity) of the current issue:
147+
The status (severity) of the current issue:
148148

149149
| Value | Description |
150150
|:----|:----|
151-
| `GREY` | Failed to determine the status (an issue with the self-diagnostic subsystem). |
152-
| `GREEN` | No issues were detected. |
153-
| `BLUE` | Temporary minor degradation that does not affect database availability; the system is expected to switch to `GREEN`. |
154-
| `YELLOW` | A minor issue, no risks to availability. It is recommended to continue monitoring the issue. |
151+
| `GREY` | Unable to determine the status (an issue with the self-diagnostic subsystem). |
152+
| `GREEN` | No issues detected. |
153+
| `BLUE` | Temporary minor degradation that does not affect database availability; the system is expected to return to `GREEN`. |
154+
| `YELLOW` | A minor issue with no risks to availability. It is recommended to continue monitoring the issue. |
155155
| `ORANGE` | A serious issue, a step away from losing availability. Maintenance may be required. |
156156
| `RED` | A component is faulty or unavailable. |
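
A client that aggregates several issues may want to pick the most severe status. The Python sketch below is one possible way to do that. It assumes the severity order follows the table rows from `GREEN` to `RED`, and it treats `GREY` as "unknown" rather than as a severity level. Both choices, and the helper name `worst_status`, are assumptions for illustration only.

```python
from typing import Iterable

# Assumed ordering: follows the table above, least to most severe.
SEVERITY = {"GREEN": 0, "BLUE": 1, "YELLOW": 2, "ORANGE": 3, "RED": 4}


def worst_status(statuses: Iterable[str]) -> str:
    """Return the most severe known status, or GREY if none can be ranked."""
    known = [status for status in statuses if status in SEVERITY]
    if not known:
        return "GREY"
    return max(known, key=lambda status: SEVERITY[status])


print(worst_status(["GREEN", "YELLOW", "BLUE"]))
```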
157157

@@ -161,21 +161,21 @@ Status (severity) of the current issue:
161161

162162
#### Database has multiple issues, Database has compute issues, Database has storage issues
163163

164-
**Description:** These issues depend solely on the underlying `COMPUTE` and `STORAGE` layers. This is the most general status of a database.
164+
**Description:** These issues depend solely on the underlying `COMPUTE` and `STORAGE` layers. This represents the most general status of a database.
165165

166166
### STORAGE
167167

168168
#### There are no storage pools
169169

170-
**Description:** Information about storage pools is unavailable. Most likely, storage pools aren't configured.
170+
**Description:** Information about storage pools is unavailable. Most likely, the storage pools are not configured.
171171

172172
#### Storage degraded, Storage has no redundancy, Storage failed
173173

174174
**Description:** These issues depend solely on the underlying `STORAGE_POOLS` layer.
175175

176176
#### System tablet BSC didn't provide information
177177

178-
**Description:** Storage diagnostics will be generated alternatively.
178+
**Description:** Storage diagnostics will be generated using an alternative method.
179179

180180
#### Storage usage over 75%, Storage usage over 85%, Storage usage over 90%
181181

@@ -191,46 +191,57 @@ Status (severity) of the current issue:
191191

192192
#### Group has no vslots
193193

194-
**Description:** This case is not expected; it is an internal issue.
194+
**Description:** This situation is not expected; it is an internal issue.
195195

196196
#### Group degraded
197197

198-
**Description:** A number of disks allowed in the group are not available.operations.
198+
**Description:** A number of disks allowed in the group are not available.
199+
199200
**Logic of work:** `HealthCheck` checks various parameters (fault tolerance mode, number of failed disks, disk status, etc.) and sets the appropriate status for the group accordingly.
200-
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, set the `Groups` and `Degraded` filters, and use the known group `id` to check the availability of nodes and disks on the nodes.
201+
202+
**Actions:** In [Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, apply the `Groups` and `Degraded` filters, and use the known group `id` to check the availability of nodes and disks on the nodes.
201203

202204
#### Group has no redundancy
203205

204-
**Description:** A storage group lost its redundancy. Another failure of vdisk may lead to the loss of the group.operations.
205-
**Logic of work:** `HealthCheck` checks various parameters (fault tolerance mode, number of failed disks, disk status, etc.) and sets the appropriate status for the group accordingly.
206-
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, set the `Groups` and `Degraded` filters, and use the known group `id` to check the availability of nodes and disks on the nodes.
206+
**Description:** A storage group has lost its redundancy. Another VDisk failure could result in the loss of the group.
207+
208+
**Logic of work:** `HealthCheck` monitors various parameters (fault tolerance mode, number of failed disks, disk status, etc.) and sets the appropriate status for the group based on these parameters.
209+
210+
**Actions:** In [Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, apply the `Groups` and `Degraded` filters, and use the known group `id` to check the availability of nodes and disks on those nodes.
207211

208212
#### Group failed
209213

210-
**Description:** A storage group lost its integrity. Data is not available. `HealthCheck` checks various parameters (fault tolerance mode, number of failed disks, disk status, etc.) and, depending on them, sets the appropriate status and displays a message.operations.
211-
**Logic of work:** `HealthCheck` checks various parameters (fault tolerance mode, number of failed disks, disk status, etc.) and sets the appropriate status for the group accordingly.
212-
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, set the `Groups` and `Degraded` filters, and use the known group `id` to check the availability of nodes and disks on the nodes.
214+
**Description:** A storage group has lost its integrity, and data is no longer available. `HealthCheck` evaluates various parameters (fault tolerance mode, number of failed disks, disk status, etc.) and determines the appropriate status, displaying a message accordingly.
215+
216+
**Logic of work:** `HealthCheck` monitors various parameters (fault tolerance mode, number of failed disks, disk status, etc.) and sets the appropriate status for the group accordingly.
217+
218+
**Actions:** In [Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, apply the `Groups` and `Degraded` filters, and use the known group `id` to check the availability of nodes and disks on those nodes.
213219

214220
### VDISK
215221

216-
#### System tablet BSC didn't provide known status
222+
#### System tablet BSC did not provide known status
217223

218-
**Description:** This case is not expected; it is an internal issue.
224+
**Description:** This situation is not expected; it is an internal issue.
219225

220226
#### VDisk is not available
221227

222-
**Description:** The disk is not operational at all.
223-
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, and set the `Groups` and `Degraded` filters. The group `id` can be found through the related `STORAGE_GROUP` issue. Hover over the relevant `vdisk` to identify the node with the problem. Check the availability of nodes and disks on the nodes.
228+
**Description:** The disk is not operational.
229+
230+
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, and apply the `Groups` and `Degraded` filters. The group `id` can be found through the related `STORAGE_GROUP` issue. Hover over the relevant VDisk to identify the node with the problem and check the availability of nodes and disks on those nodes.
224231

225232
#### VDisk is being initialized
226233

227-
**Description:** Initialization in process.
228-
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, and set the `Groups` and `Degraded` filters. The group `id` can be found through the related `STORAGE_GROUP` issue. Hover over the relevant `vdisk` to identify the node with the problem. Check the availability of nodes and disks on the nodes.
234+
**Description:** The disk is in the process of initialization.
235+
236+
**Actions:** In [Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, and apply the `Groups` and `Degraded` filters. The group `id` can be found through the related `STORAGE_GROUP` issue. Hover over the relevant VDisk to identify the node with the problem and check the availability of nodes and disks on those nodes.
229237

230238
#### Replication in progress
231239

232-
**Description:** The disk accepts queries, but not all the data was replicated.
233-
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, and set the `Groups` and `Degraded` filters. The group `id` can be found through the related `STORAGE_GROUP` issue. Hover over the relevant `vdisk` to identify the node with the problem. Check the availability of nodes and disks on the nodes.
240+
#### Replication in progress
241+
242+
**Description:** The disk is accepting queries, but not all data has been replicated.
243+
244+
**Actions:** In [Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, and apply the `Groups` and `Degraded` filters. The group `id` can be found through the related `STORAGE_GROUP` issue. Hover over the relevant VDisk to identify the node with the problem and check the availability of nodes and disks on those nodes.
234245

235246
#### VDisk have space issue
236247

@@ -245,27 +256,31 @@ Status (severity) of the current issue:
245256
#### PDisk state is ...
246257

247258
**Description:** Indicates the state of the physical disk.
248-
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, set the `Nodes` and `Degraded` filters, and use the known node `id` and `pdisk` to check the availability of nodes and disks on the nodes.
259+
260+
**Actions:** In [Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, set the `Nodes` and `Degraded` filters, and use the known node id and PDisk to check the availability of nodes and disks on the nodes.
249261

250262
#### Available size is less than 12%, Available size is less than 9%, Available size is less than 6%
251263

252264
**Description:** Free space on the physical disk is running out.
253-
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, set the `Nodes` and `Out of Space` filters, and use the known node `id` and `pdisk` to check the available space.
265+
**Actions:** In [Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, set the `Nodes` and `Out of Space` filters, and use the known node and PDisk identifiers to check the available space.
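
The three messages in the heading above correspond to progressively lower amounts of free space. The Python sketch below shows how a monitoring script might map a free-space percentage to the matching message. The function name `pdisk_space_issue` is hypothetical, and the sketch does not claim to reproduce the `HealthCheck` implementation.

```python
from typing import Optional

# Thresholds taken from the issue messages, checked from most to least critical.
THRESHOLDS_PERCENT = (6, 9, 12)


def pdisk_space_issue(available_percent: float) -> Optional[str]:
    """Return the message for the tightest threshold crossed, or None."""
    for threshold in THRESHOLDS_PERCENT:
        if available_percent < threshold:
            return f"Available size is less than {threshold}%"
    return None


print(pdisk_space_issue(10.0))
```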
254266

255267
#### PDisk is not available
256268

257269
**Description:** A physical disk is not available.
258-
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, set the `Nodes` and `Degraded` filters, and use the known node `id` and `pdisk` to check the availability of nodes and disks on the nodes.
270+
271+
**Actions:** In [Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the database page, select the `Storage` tab, set the `Nodes` and `Degraded` filters, and use the known node and PDisk identifiers to check the availability of nodes and disks on the nodes.
259272

260273
### STORAGE_NODE
274+
261275
#### Storage node is not available
276+
262277
**Description:** A storage node is not available.
263278

264279
### COMPUTE
265280

266281
#### There are no compute nodes
267282

268-
**Description:** The database has no nodes to start the tablets. Unable to determine `COMPUTE_NODE` issues below.
283+
**Description:** The database has no nodes available to start the tablets. Unable to determine `COMPUTE_NODE` issues below.
269284

270285
#### Compute has issues with system tablets
271286

@@ -291,72 +306,92 @@ Status (severity) of the current issue:
291306

292307
#### Paths quota usage is over than 90%, Paths quota usage is over than 99%, Paths quota exhausted, Shards quota usage is over than 90%, Shards quota usage is over than 99%, Shards quota exhausted
293308

294-
**Description:** Quotas exhausted.
309+
**Description:** Quotas are exhausted.
310+
295311
**Actions:** Check the number of objects (tables, topics) in the database and delete any unnecessary ones.
296312

297313
### SYSTEM_TABLET
298314

299315
#### System tablet is unresponsive, System tablet response time over 1000ms, System tablet response time over 5000ms
300316

301-
**Description:** The system tablet is not responding or takes too long to respond.
302-
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), go to the `Storage` tab and set the `Nodes` filter. Check the `Uptime` and status of the nodes. If the `Uptime` is short, review the logs to determine the reasons for the node restarts.
317+
**Description:** The system tablet is either not responding or takes too long to respond.
318+
319+
**Actions:** In [Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the `Storage` tab and apply the `Nodes` filter. Check the `Uptime` and the nodes' statuses. If the `Uptime` is short, review the logs to determine the reasons for the node restarts.
303320

304321
### TABLET
305322

306323
#### Tablets are restarting too often
307324

308-
**Description:** Tablets are restarting too often.
309-
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), go to the `Nodes` tab. Check the `Uptime` and status of the nodes. If the `Uptime` is short, review the logs to determine the reasons for the node restarts.
325+
**Description:** Tablets are restarting too frequently.
326+
327+
**Actions:** In [Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the `Nodes` tab. Check the `Uptime` and the nodes' statuses. If the `Uptime` is short, review the logs to determine the reasons for the node restarts.
310328

311329
#### Tablets/Followers are dead
312330

313-
**Description:** Tablets are not running (probably cannot be started).
314-
**Actions:** In [YDB Embedded UI](../embedded-ui/ydb-monitoring.md), go to the `Nodes` tab. Check the `Uptime` and status of the nodes. If the `Uptime` is short, review the logs to determine the reasons for the node restarts.
331+
**Description:** Tablets are not running (likely cannot be started).
332+
333+
**Actions:** In [Embedded UI](../embedded-ui/ydb-monitoring.md), navigate to the `Nodes` tab. Check the `Uptime` and the nodes' statuses. If the `Uptime` is short, review the logs to determine the reasons for the node restarts.
334+
335+
### LOAD_AVERAGE
336+
337+
#### LoadAverage above 100%
315338

316339
### LOAD_AVERAGE
317340

318341
#### LoadAverage above 100%
319342

320-
**Description:** ([Load](https://en.wikipedia.org/wiki/Load_(computing)).) A physical host is overloaded. This indicates that the system is working at or beyond its capacity, potentially due to a high number of processes waiting for I/O operations.
343+
**Description:** A physical host is overloaded, meaning the system is operating at or beyond its capacity, potentially due to a high number of processes waiting for I/O operations. For more information on load, see [Load (computing)](https://en.wikipedia.org/wiki/Load_(computing)).
344+
321345
**Logic of work:**
322-
Load Information:
323-
Source: `/proc/loadavg`
324-
We use the first number of the three — the average load over the last 1 minute.
325-
Logical Cores Information:
326-
Primary Source: `/sys/fs/cgroup/cpu.max`
327-
Fallback Source: `/sys/fs/cgroup/cpu/cpu.cfs_quota_us`, `/sys/fs/cgroup/cpu/cpu.cfs_period_us`.
328-
The number of cores is calculated by dividing the quota by the period (quota / period).
346+
347+
- Load Information:
348+
349+
- Source: `/proc/loadavg`
350+
- The first number of the three represents the average load over the last 1 minute.
351+
352+
- Logical Cores Information:
353+
354+
- Primary Source: `/sys/fs/cgroup/cpu.max`
355+
- Fallback Sources: `/sys/fs/cgroup/cpu/cpu.cfs_quota_us`, `/sys/fs/cgroup/cpu/cpu.cfs_period_us`
356+
-
357+
The number of cores is calculated by dividing the quota by the period (quota / period).
358+
329359
**Actions:** Check the CPU load on the nodes.
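
The calculation described above can be sketched in Python as follows. The parsing functions take file contents as strings, so they can be tried without access to a real host. Two points are assumptions and are not stated in the text: a `max` quota in `cpu.max` (or a non-positive cgroup v1 quota) is treated as "no limit", and the percentage is taken as the 1-minute load divided by the number of cores. All function names are hypothetical.

```python
from typing import Optional


def parse_loadavg(text: str) -> float:
    """Return the 1-minute load average: the first field of /proc/loadavg."""
    return float(text.split()[0])


def cores_from_cpu_max(text: str) -> Optional[float]:
    """Parse /sys/fs/cgroup/cpu.max, which holds '<quota> <period>'."""
    quota, period = text.split()
    if quota == "max":  # assumption: 'max' means no CPU limit is configured
        return None
    return int(quota) / int(period)


def cores_from_cfs(quota_us: str, period_us: str) -> Optional[float]:
    """Fallback: divide cpu.cfs_quota_us by cpu.cfs_period_us."""
    quota = int(quota_us)
    if quota <= 0:  # assumption: a non-positive quota means no limit
        return None
    return quota / int(period_us)


def load_percent(load_1m: float, cores: float) -> float:
    """Assumed formula: load relative to the number of logical cores."""
    return 100.0 * load_1m / cores


load = parse_loadavg("4.50 3.10 2.00 2/512 12345")
cores = cores_from_cpu_max("200000 100000")
if cores is not None:
    print(f"{load_percent(load, cores):.0f}%")
```

With a quota of 200000 and a period of 100000, the container is limited to 2 cores, so a 1-minute load of 4.50 would be well above 100% under this formula.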
330360

331361
### COMPUTE_POOL
332362

333363
#### Pool usage is over than 90%, Pool usage is over than 95%, Pool usage is over than 99%
334364

335365
**Description:** One of the pools' CPUs is overloaded.
366+
336367
**Actions:** Add cores to the configuration of the actor system for the corresponding CPU pool.
337368

338369
### NODE_UPTIME
339370

340371
#### The number of node restarts has increased
341372

342-
**Description:** The number of node restarts has exceeded the threshold. By default, 10 restarts per hour.
343-
**Actions:** Check the logs to determine the reasons for the process restart.
373+
**Description:** The number of node restarts has exceeded the threshold. By default, this is set to 10 restarts per hour.
374+
375+
**Actions:** Check the logs to determine the reasons for the process restarts.
344376

345377
#### Node is restarting too often
346378

347-
**Description:** The number of node restarts has exceeded the threshold. By default, 30 restarts per hour.
348-
**Actions:** Check the logs to determine the reasons for the process restart.
379+
**Description:** The number of node restarts has exceeded the threshold. By default, this is set to 30 restarts per hour.
380+
381+
**Actions:** Check the logs to determine the reasons for the process restarts.
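
The two `NODE_UPTIME` issues differ only in their default thresholds. The Python sketch below maps an hourly restart count to the corresponding message. It assumes that "exceeded" means strictly greater than the threshold; the function name `node_uptime_issue` and its parameters are hypothetical.

```python
from typing import Optional


def node_uptime_issue(
    restarts_per_hour: int,
    increased_threshold: int = 10,  # default from the description above
    too_often_threshold: int = 30,  # default from the description above
) -> Optional[str]:
    """Return the matching issue message, or None if below both thresholds."""
    # Assumption: a threshold is "exceeded" only when the count is strictly greater.
    if restarts_per_hour > too_often_threshold:
        return "Node is restarting too often"
    if restarts_per_hour > increased_threshold:
        return "The number of node restarts has increased"
    return None


print(node_uptime_issue(11))
```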
349382

350383
### NODES_TIME_DIFFERENCE
351384

352385
#### Node is ... ms behind peer [id], Node is ... ms ahead of peer [id]
353386

354-
**Description:** Time drift on nodes might lead to potential issues with coordinating distributed transactions. This issue starts to appear from 5 ms.
387+
**Description:** Time drift on nodes might lead to potential issues with coordinating distributed transactions. This issue starts to appear when the time difference is 5 ms or more.
388+
355389
**Actions:** Check for discrepancies in system time between the nodes listed in the alert, and verify the operation of the time synchronization process.
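
As a rough illustration of the 5 ms threshold, the Python sketch below turns a measured clock difference into one of the two messages. The sign convention (positive means the node is ahead of its peer), the exact message formatting, and the function name `time_difference_issue` are assumptions made for this example.

```python
from typing import Optional


def time_difference_issue(
    diff_ms: float, peer_id: int, threshold_ms: float = 5.0
) -> Optional[str]:
    """Return a message when the absolute difference reaches the threshold."""
    if abs(diff_ms) < threshold_ms:
        return None
    # Assumption: a positive difference means this node is ahead of the peer.
    direction = "ahead of" if diff_ms > 0 else "behind"
    return f"Node is {abs(diff_ms):g} ms {direction} peer {peer_id}"


print(time_difference_issue(7, peer_id=2))
```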
356390

357391
## Examples {#examples}
358392

359-
The shortest `HealthCheck` response looks like this. It is returned if there is nothing wrong with the database
393+
The shortest `HealthCheck` response looks like this. It is returned if there is nothing wrong with the database:
394+
360395
```json
361396
{
362397
"self_check_result": "GOOD"
@@ -366,6 +401,7 @@ The shortest `HealthCheck` response looks like this. It is returned if there is
366401
#### Verbose example {#example-verbose}
367402

368403
`GOOD` response with `verbose` parameter:
404+
369405
```json
370406
{
371407
"self_check_result": "GOOD",
@@ -615,6 +651,7 @@ The shortest `HealthCheck` response looks like this. It is returned if there is
615651
#### Emergency example {#example-emergency}
616652

617653
Response with `EMERGENCY` status:
654+
618655
```json
619656
{
620657
"self_check_result": "EMERGENCY",
