Currently the controller depended on reading the CR status;
update the controller so that it does not depend on the CR status.
Signed-off-by: parth-gr <[email protected]>
 2) With every run calculate different device class highest osd filled percentage.
-
+
 OsdPercentage:

 ssd 69%
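As an illustration of the per-device-class calculation described above, here is a minimal sketch. The input shape (a list of `(device_class, used_percent)` pairs) and the function name `highest_osd_percentage` are assumptions made for this example and are not taken from the design.

```python
def highest_osd_percentage(samples):
    """Return the highest used percentage seen for each device class.

    `samples` is an iterable of (device_class, used_percent) pairs; this
    input shape is an assumption made for the sketch.
    """
    highest = {}
    for device_class, used_percent in samples:
        # Keep only the largest value observed for this device class.
        if used_percent > highest.get(device_class, float("-inf")):
            highest[device_class] = used_percent
    return highest


# Hypothetical per-OSD readings for one run.
samples = [("ssd", 61.0), ("ssd", 69.0), ("hdd", 40.5)]
osd_percentage = highest_osd_percentage(samples)
print(osd_percentage)
```

Only the highest value per device class is kept, since that is the value compared against the scaling threshold.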
@@ -72,36 +72,54 @@ Controller:

 1) Create a named controller that watches for [channel generic event](https://book-v1.book.kubebuilder.io/beyond_basics/controller_watches) per device class.

-2) If the expansion is not in progress set the status `phase` to `NotStarted`
+2) Set the status `phase` to `NotStarted` if no expansion has been triggered (status.phase == "").
+
+3) Check if the expansion is in progress:
+
+   1) Query actualOsdCount and actualOsdSize from Prometheus.
+
+   2) Calculate the desiredOsdCount and desiredOsdSize from the Storagecluster.
+
+   3) If (actualOsdSize < desiredOsdSize || actualOsdCount < desiredOsdCount), the expansion is in progress.

-3) If an expansion is in progress (expectedOsdSize != startOsdSize || expectedOsdCount != startOsdCount), check the progress and then requeue every 1 minute until the expansion is completed successfully (jump to step 11).
+4) If the expansion is in progress, check the progress and then requeue every 1 minute until the expansion is completed successfully (jump to step 13).

-4) If the LSO storageclass is detected in the storageClassDeviceSet, raise a warning and do not reconcile further.
+5) If no expansion is in progress:

-5) If the highest osd percentage reported in the sync map is more than osdScalingThresholdPercent (70%), the osd is approaching nearfull and scaling is needed.
+   1) If status.phase == "InProgress", move the status phase to `Succeeded`.

-6) If scaling is needed, calculate the `expectedOsdSize` and `expectedOsdCount`.
+   2) Proceed with further steps.
+
+6) If the LSO storageclass is detected in the storageClassDeviceSet, raise a warning and do not reconcile further.
+
+7) If the highest osd percentage reported in the sync map is more than osdScalingThresholdPercent (70%), the osd is approaching nearfull and scaling is needed.
+
+8) If scaling is needed, calculate the `expectedOsdSize` and `expectedOsdCount`.

    1) If the osd size is less than maxOsdSize (default: 8TiB), do vertical scaling by doubling the size of each osd for that device class.

    2) If the osd size is equal to maxOsdSize (default: 8TiB), do horizontal scaling by adding 1 osd of maxOsdSize (default: 8TiB) on each `storageDeviceSet`.

-7) Calculate the `expectedStorageCapacity` based on expected size and count.
+9) Calculate the `expectedStorageCapacity` based on expected size and count.

-8) Check if the `storageCapacityLimit` > `expectedStorageCapacity`.
+10) Check if the `storageCapacityLimit` > `expectedStorageCapacity`.

    1) If yes, update `phase` to `InProgress` and `lastExpansionStartTime` to `current-time` on the `StorageAutoScaling` CR that needs scaling.

    2) If no, don't reconcile further and set `storageCapacityLimitReached` to true in the status.

-9) Update the status, set the `expectedOsdSize` and `expectedOsdCount` to reflect the new expected value and also set `startOsdSize` and `startOsdCount` with current storagecluster values.
+11) Update the status, set the `expectedOsdSize` and `expectedOsdCount` to reflect the new expected value and also set `startOsdSize` and `startOsdCount` with current storagecluster values.

-10) Scale by patching the Storagecluster, with all the needed device set updates applied at the same time.
+12) Scale by patching the Storagecluster, with all the needed device set updates applied at the same time.

-11) Verify and Alert:
+13) Verify and Alert:

    1) Verify on the Storagecluster whether the new osds are added or scaled in size, for all the device sets.

+      1) For vertical scaling, query the osd size from Prometheus and match it with storagecluster.spec..size.
+
+      2) For horizontal scaling, query the osd count from Prometheus and match it with storagecluster.spec..count.
+
    2) If the scaling is successful, update the status of the `StorageAutoScaling` CR with `lastExpansionCompletionTime` and `phase`, and also the osd count and size.

    3) If the auto scale is not completed, requeue every 1 min, and change the phase to `Failed` if scaling has not `Succeeded` within the timeoutSeconds (default: 1800) interval.
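The in-progress check (step 3) and the scaling decision (steps 7 to 10) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the controller's implementation: `OsdState`, `expansion_in_progress`, and `plan_scaling` are hypothetical names, sizes are in bytes, and the osd count is treated as a single total per device class.

```python
from dataclasses import dataclass

TIB = 1024 ** 4


@dataclass(frozen=True)
class OsdState:
    size: int   # bytes per osd (assumed unit)
    count: int  # number of osds in the device class (assumed meaning)


def expansion_in_progress(actual: OsdState, desired: OsdState) -> bool:
    # Step 3: Prometheus values still lag behind the Storagecluster spec.
    return actual.size < desired.size or actual.count < desired.count


def plan_scaling(current, highest_pct, device_sets, capacity_limit,
                 threshold_pct=70, max_osd_size=8 * TIB):
    """Return (expected_state, limit_reached).

    expected_state is None when no scaling is needed or the limit is reached.
    """
    if highest_pct <= threshold_pct:
        return None, False  # step 7: below the threshold, nothing to do
    if current.size < max_osd_size:
        # Step 8.1: vertical scaling doubles the osd size.
        expected = OsdState(current.size * 2, current.count)
    else:
        # Step 8.2: horizontal scaling adds one osd per device set.
        # The design only describes the case size == max_osd_size.
        expected = OsdState(current.size, current.count + device_sets)
    expected_capacity = expected.size * expected.count  # step 9
    if capacity_limit > expected_capacity:              # step 10
        return expected, False
    return None, True  # corresponds to storageCapacityLimitReached = true


print(plan_scaling(OsdState(2 * TIB, 3), 75, 1, 100 * TIB))
```

Because the in-progress check compares only Prometheus readings with the Storagecluster spec, it does not need any value stored in the CR status.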
@@ -113,19 +131,31 @@ Controller:
 Based on the above algorithm, there are two conditions in which in-progress is set. Elaborating on those conditions:

 1) If scaling is just started:
+
    1) Set `phase` to `InProgress`.
-   2) Verify if the scaling is successful.
+
+   2) Verify if the scaling is successful.
+
    3) If the scaling is successful, set the `phase` to `Succeeded`.
+
    4) Alert the user if the phase changes to `Succeeded`; alerting will be implemented with ocs-metrics-exporter.
+
    5) If the scaling is not yet completed, requeue every 1 min; this leads to the 2nd case.

 2) If the scaling has already started and is being requeued:
+
    1) Now the requeue will happen every 1 min.
+
    2) At the start of reconcile, check that `startOsdSize` and `expectedOsdSize` are not equal, and similarly for the osd count.
+
    3) As another validation, compare the storagecluster spec with the Prometheus response.
+
    4) Requeue as long as the scaling is in progress.
+
    5) If the scaling has been in progress for more than the timeoutSeconds (default: 1800) interval, set the phase to `Failed`.
+
    6) Alert the user if the phase changes to `Failed`; alerting will be implemented with ocs-metrics-exporter.
+
    7) If there is a failure alert, provide a mitigation guide for the user.
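The phase handling described in the two cases above can be sketched as a small transition function. This is an illustration only: `next_phase`, its parameters, and the choice to stop requeueing once the phase is `Failed` are assumptions made for the example, and times are plain seconds.

```python
REQUEUE_SECONDS = 60  # "requeue every 1 min"


def next_phase(phase, in_progress, started_at, now, timeout_seconds=1800):
    """Return (new_phase, requeue_after_seconds or None).

    `in_progress` is the result of comparing Prometheus values with the
    Storagecluster spec; `started_at` mirrors lastExpansionStartTime.
    """
    if in_progress:
        # Expansion still running: fail it once the timeout has elapsed.
        if started_at is not None and now - started_at > timeout_seconds:
            return "Failed", None
        return "InProgress", REQUEUE_SECONDS
    if phase == "InProgress":
        # Nothing left to expand, so the earlier expansion has finished.
        return "Succeeded", None
    if phase == "":
        return "NotStarted", None
    return phase, None


print(next_phase("InProgress", True, started_at=0, now=120))
```

Keeping the transition in a pure function makes the timeout and requeue behavior easy to check without a running cluster.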