config-linux: add support for rsvd hugetlb cgroup

kailun-qin · kailun-qin · commit a65053392045 · 2021-08-06T13:31:00.000-04:00
The previous non-rsvd max/limit_in_bytes does not account for reserved huge page memory, so a process can reserve all the huge page memory without being able to allocate it (due to hugetlb cgroup page fault accounting restrictions). In practice, a process can successfully mmap more huge page memory than the cgroup settings allow, but it gets a SIGBUS and crashes when it uses the memory. This is bad for applications that mmap at startup: the mmap succeeds, but the program crashes once it starts using the memory. For example, postgres does this by default.

This patch updates and clarifies `LinuxResources.HugepageLimits` and `LinuxHugepageLimit` by defaulting the configuration to the rsvd hugetlb cgroup (when supported) and falling back to page fault accounting if it is not supported.

Fixes #1050

Signed-off-by: Kailun Qin <kailun.qin@intel.com>
diff --git a/config-linux.md b/config-linux.md
@@ -389,17 +389,31 @@ The following parameters can be specified to set up the controller:
 
 ### <a name="configLinuxHugePageLimits" />Huge page limits
 
-**`hugepageLimits`** (array of objects, OPTIONAL) represents the `hugetlb` controller which allows to limit the
-HugeTLB usage per control group and enforces the controller limit during page fault.
+**`hugepageLimits`** (array of objects, OPTIONAL) represents the `hugetlb` controller, which limits HugeTLB reservations (if supported) or usage (page fault).
+By default, if supported by the kernel, `hugepageLimits` defines the hugepage sizes and limits for HugeTLB controller
+reservation accounting, which limits the HugeTLB reservations per control group and enforces the controller
+limit at reservation time and at the fault of HugeTLB memory for which no reservation exists.
+Otherwise, if not supported by the kernel, this should fall back to page fault accounting, which lets users limit
+the HugeTLB usage (page fault) per control group and enforces the limit during page fault.
+
+Note that reservation limits are superior to page fault limits, since reservation limits are enforced at reservation
+time (on mmap or shmget) and never cause the application to get a SIGBUS signal if the memory was reserved beforehand.
+This allows for easier fallback to alternatives such as non-HugeTLB memory. In the case of page fault
+accounting, it's very hard to avoid processes getting SIGBUS, since the sysadmin needs to know precisely the HugeTLB usage
+of all the tasks in the system and make sure there are enough pages to satisfy all requests. Avoiding tasks getting
+SIGBUS on overcommitted systems is practically impossible with page fault accounting.
+
 For more information, see the kernel cgroups documentation about [HugeTLB][cgroup-v1-hugetlb].
 
 Each entry has the following structure:
 
-* **`pageSize`** *(string, REQUIRED)* - hugepage size
+* **`pageSize`** *(string, REQUIRED)* - hugepage size.
     The value has the format `<size><unit-prefix>B` (64KB, 2MB, 1GB), and must match the `<hugepagesize>` of the
-    corresponding control file found in `/sys/fs/cgroup/hugetlb/hugetlb.<hugepagesize>.limit_in_bytes`.
+    corresponding control file found in `/sys/fs/cgroup/hugetlb/hugetlb.<hugepagesize>.rsvd.limit_in_bytes` (if
+    hugetlb_cgroup reservation is supported) or `/sys/fs/cgroup/hugetlb/hugetlb.<hugepagesize>.limit_in_bytes` (if not
+    supported).
     Values of `<unit-prefix>` are intended to be parsed using base 1024 ("1KB" = 1024, "1MB" = 1048576, etc).
-* **`limit`** *(uint64, REQUIRED)* - limit in bytes of *hugepagesize* HugeTLB usage
+* **`limit`** *(uint64, REQUIRED)* - limit in bytes of *hugepagesize* HugeTLB reservations (if supported) or usage.
 
 #### Example
 
diff --git a/specs-go/config.go b/specs-go/config.go
@@ -233,12 +233,13 @@ type POSIXRlimit struct {
 	Soft uint64 `json:"soft"`
 }
 
-// LinuxHugepageLimit structure corresponds to limiting kernel hugepages
+// LinuxHugepageLimit structure corresponds to limiting kernel hugepages.
+// Defaults to reservation limits if supported; otherwise falls back to page fault limits.
 type LinuxHugepageLimit struct {
-	// Pagesize is the hugepage size
-	// Format: "<size><unit-prefix>B' (e.g. 64KB, 2MB, 1GB, etc.)
+	// Pagesize is the hugepage size.
+	// Format: "<size><unit-prefix>B" (e.g. 64KB, 2MB, 1GB, etc.).
 	Pagesize string `json:"pageSize"`
-	// Limit is the limit of "hugepagesize" hugetlb usage
+	// Limit is the limit of "hugepagesize" hugetlb reservations (if supported) or usage.
 	Limit uint64 `json:"limit"`
 }
 
@@ -364,7 +365,7 @@ type LinuxResources struct {
 	Pids *LinuxPids `json:"pids,omitempty"`
 	// BlockIO restriction configuration
 	BlockIO *LinuxBlockIO `json:"blockIO,omitempty"`
-	// Hugetlb limit (in bytes)
+	// Hugetlb limits (in bytes). Default to reservation limits if supported.
 	HugepageLimits []LinuxHugepageLimit `json:"hugepageLimits,omitempty"`
 	// Network restriction configuration
 	Network *LinuxNetwork `json:"network,omitempty"`