Commit a650533

config-linux: add support for rsvd hugetlb cgroup
The previous non-rsvd max/limit_in_bytes does not account for reserved huge page memory, making it possible for a process to reserve all the huge page memory without being able to allocate it (due to hugetlb cgroup page fault accounting restrictions).

In practice this makes it possible to successfully mmap more huge page memory than allowed via the cgroup settings, but when using the memory the process will get a SIGBUS and crash. This is bad for applications that mmap at startup: the mmap succeeds, but the program crashes when it starts to use the memory. For example, postgres does this by default.

This patch updates and clarifies `LinuxResources.HugepageLimits` and `LinuxHugepageLimit` by defaulting the configuration to the rsvd hugetlb cgroup (when supported) and falling back to page fault accounting if not supported.

Fixes #1050

Signed-off-by: Kailun Qin <[email protected]>
1 parent 8961758 commit a650533
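
The fallback described in the commit message could be sketched roughly as follows. This is only an illustration of the idea, not code from the commit or from any runtime: `limitFile` and `fileExists` are hypothetical helpers, and the cgroup v1 directory passed in `main` is an assumption taken from the paths quoted in the spec text.

```go
package main

import (
	"fmt"
	"os"
	"path"
)

// fileExists reports whether p can be stat'ed. Hypothetical helper.
func fileExists(p string) bool {
	_, err := os.Stat(p)
	return err == nil
}

// limitFile picks the hugetlb control file a runtime might write a limit to.
// It prefers the reservation ("rsvd") file and falls back to the page fault
// accounting file when the rsvd file is not present. Hypothetical helper;
// the exists callback is injected so the choice can be exercised without a
// real cgroup filesystem.
func limitFile(cgroupDir, pageSize string, exists func(string) bool) string {
	rsvd := path.Join(cgroupDir, "hugetlb."+pageSize+".rsvd.limit_in_bytes")
	if exists(rsvd) {
		return rsvd
	}
	return path.Join(cgroupDir, "hugetlb."+pageSize+".limit_in_bytes")
}

func main() {
	// The directory is an assumption (cgroup v1 layout); the result depends
	// on whether the running kernel exposes the rsvd control file.
	fmt.Println(limitFile("/sys/fs/cgroup/hugetlb", "2MB", fileExists))
}
```

Probing for the file rather than checking a kernel version keeps the sketch independent of how a given distribution backports the feature.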

2 files changed, +25 -10 lines changed

config-linux.md

+19-5
@@ -389,17 +389,31 @@ The following parameters can be specified to set up the controller:
 
 ### <a name="configLinuxHugePageLimits" />Huge page limits
 
-**`hugepageLimits`** (array of objects, OPTIONAL) represents the `hugetlb` controller which allows to limit the
-HugeTLB usage per control group and enforces the controller limit during page fault.
+**`hugepageLimits`** (array of objects, OPTIONAL) represents the `hugetlb` controller which allows limiting the HugeTLB reservations (if supported) or usage (page fault).
+By default, if supported by the kernel, `hugepageLimits` defines the hugepage sizes and limits for HugeTLB controller
+reservation accounting, which allows limiting the HugeTLB reservations per control group and enforces the controller
+limit at reservation time and at the fault of HugeTLB memory for which no reservation exists.
+Otherwise, if not supported by the kernel, this should fall back to page fault accounting, which allows users to limit
+the HugeTLB usage (page fault) per control group and enforces the limit during page fault.
+
+Note that reservation limits are superior to page fault limits, since reservation limits are enforced at reservation
+time (on mmap or shmget), and never cause the application to get a SIGBUS signal if the memory was reserved beforehand.
+This allows for easier fallback to alternatives such as non-HugeTLB memory. In the case of page fault
+accounting, it's very hard to avoid processes getting SIGBUS since the sysadmin needs to know precisely the HugeTLB usage
+of all the tasks in the system and make sure there are enough pages to satisfy all requests. Avoiding tasks getting
+SIGBUS on overcommitted systems is practically impossible with page fault accounting.
+
 For more information, see the kernel cgroups documentation about [HugeTLB][cgroup-v1-hugetlb].
 
 Each entry has the following structure:
 
-* **`pageSize`** *(string, REQUIRED)* - hugepage size
+* **`pageSize`** *(string, REQUIRED)* - hugepage size.
 The value has the format `<size><unit-prefix>B` (64KB, 2MB, 1GB), and must match the `<hugepagesize>` of the
-corresponding control file found in `/sys/fs/cgroup/hugetlb/hugetlb.<hugepagesize>.limit_in_bytes`.
+corresponding control file found in `/sys/fs/cgroup/hugetlb/hugetlb.<hugepagesize>.rsvd.limit_in_bytes` (if
+hugetlb_cgroup reservation is supported) or `/sys/fs/cgroup/hugetlb/hugetlb.<hugepagesize>.limit_in_bytes` (if not
+supported).
 Values of `<unit-prefix>` are intended to be parsed using base 1024 ("1KB" = 1024, "1MB" = 1048576, etc).
-* **`limit`** *(uint64, REQUIRED)* - limit in bytes of *hugepagesize* HugeTLB usage
+* **`limit`** *(uint64, REQUIRED)* - limit in bytes of *hugepagesize* HugeTLB reservations (if supported) or usage.
 
 #### Example
 
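
The base-1024 rule for `pageSize` stated in the spec text above can be illustrated with a small parser. This is a minimal sketch, not part of the commit: `parsePageSize` is a hypothetical helper, it only accepts the three unit prefixes that appear in the spec's examples (KB, MB, GB), and it does not guard against overflow.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePageSize converts a "<size><unit-prefix>B" string into bytes using
// base 1024, as the spec text requires. Hypothetical helper; only KB, MB
// and GB are recognized here.
func parsePageSize(s string) (uint64, error) {
	units := map[string]uint64{"KB": 1 << 10, "MB": 1 << 20, "GB": 1 << 30}
	for suffix, mult := range units {
		if strings.HasSuffix(s, suffix) {
			n, err := strconv.ParseUint(strings.TrimSuffix(s, suffix), 10, 64)
			if err != nil {
				return 0, fmt.Errorf("invalid page size %q: %w", s, err)
			}
			return n * mult, nil
		}
	}
	return 0, fmt.Errorf("invalid page size %q: unknown unit", s)
}

func main() {
	for _, s := range []string{"64KB", "2MB", "1GB"} {
		n, err := parsePageSize(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(s, "=", n)
	}
	// Prints:
	// 64KB = 65536
	// 2MB = 2097152
	// 1GB = 1073741824
}
```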
specs-go/config.go

+6-5
@@ -233,12 +233,13 @@ type POSIXRlimit struct {
 	Soft uint64 `json:"soft"`
 }
 
-// LinuxHugepageLimit structure corresponds to limiting kernel hugepages
+// LinuxHugepageLimit structure corresponds to limiting kernel hugepages.
+// Default to reservation limits if supported. Otherwise fall back to page fault limits.
 type LinuxHugepageLimit struct {
-	// Pagesize is the hugepage size
-	// Format: "<size><unit-prefix>B' (e.g. 64KB, 2MB, 1GB, etc.)
+	// Pagesize is the hugepage size.
+	// Format: "<size><unit-prefix>B" (e.g. 64KB, 2MB, 1GB, etc.).
 	Pagesize string `json:"pageSize"`
-	// Limit is the limit of "hugepagesize" hugetlb usage
+	// Limit is the limit of "hugepagesize" hugetlb reservations (if supported) or usage.
 	Limit uint64 `json:"limit"`
 }
 
@@ -364,7 +365,7 @@ type LinuxResources struct {
 	Pids *LinuxPids `json:"pids,omitempty"`
 	// BlockIO restriction configuration
 	BlockIO *LinuxBlockIO `json:"blockIO,omitempty"`
-	// Hugetlb limit (in bytes)
+	// Hugetlb limits (in bytes). Default to reservation limits if supported.
 	HugepageLimits []LinuxHugepageLimit `json:"hugepageLimits,omitempty"`
 	// Network restriction configuration
 	Network *LinuxNetwork `json:"network,omitempty"`
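
To show how the struct tags in this diff map onto the `hugepageLimits` JSON, here is a minimal sketch. The `LinuxHugepageLimit` type is copied from the diff so the example is self-contained; the `resources` wrapper and `encode` function are hypothetical stand-ins, and the limit value (100 pages of 2MB) is made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// LinuxHugepageLimit mirrors the struct shown in the diff.
type LinuxHugepageLimit struct {
	// Pagesize is the hugepage size, e.g. "2MB".
	Pagesize string `json:"pageSize"`
	// Limit is the limit in bytes of reservations (if supported) or usage.
	Limit uint64 `json:"limit"`
}

// resources is a hypothetical, trimmed-down stand-in for LinuxResources
// that keeps only the HugepageLimits field.
type resources struct {
	HugepageLimits []LinuxHugepageLimit `json:"hugepageLimits,omitempty"`
}

// encode marshals the given limits the way the field tags dictate.
func encode(limits []LinuxHugepageLimit) (string, error) {
	b, err := json.Marshal(resources{HugepageLimits: limits})
	return string(b), err
}

func main() {
	out, err := encode([]LinuxHugepageLimit{{Pagesize: "2MB", Limit: 209715200}})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
	// Prints: {"hugepageLimits":[{"pageSize":"2MB","limit":209715200}]}
}
```

The JSON carries no indication of which accounting mode applies; per the comments in the diff, that choice is left to the runtime based on kernel support.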
