[Regression] Infinite reconciliation for unschedulable AppWrappers #618
Comments
@sutaakar Thanks for reporting the issue. Do we know why requests and limits are different for the above AW?
I have adjusted them to be able to reproduce the issue (my cluster has a bit less than 4 CPUs).
OK, can you please try with requests equal to limits and see if the issue still reproduces?
You mean for the memory? For CPU, the requests and limits are already equal. Edit: With the same memory in request and limit, the result is the same.
@sutaakar Thanks, can you please share a snippet of the logs here?
Here is the complete MCAD log of the first several seconds:
Thanks, the fix would be to improve the back-off policy inside MCAD. In this scenario, as we have only one AW in the system, we are attempting to dispatch it repeatedly within a short time window.
Describe the Bug
When I create an AppWrapper whose custompodresources CPU requests and limits are larger than the available cluster CPU, the MCAD controller gets stuck in an infinite reconciliation loop: it starts to reconcile the AppWrapper every couple of milliseconds, completely cluttering the MCAD log.
Created AppWrapper (taken from CodeFlare operator e2e test suite and adjusted requests and limits):
The behavior was observed in MCAD 1.34.0. It is a regression from 1.33.0, where I could not reproduce it.
Codeflare Stack Component Versions
Please specify the component versions in which you have encountered this bug.
Codeflare SDK: N/A
MCAD: v1.34.0
Instascale: N/A
Codeflare Operator: latest version from main branch, running locally
Other: Tested on OpenShift CRC
Steps to Reproduce the Bug
What Have You Already Tried to Debug the Issue?
I compared the MCAD logs for 1.33.0 and 1.34.0 to see the difference in log size and content. No other debugging was done.
Expected Behavior
Reconciliation for unschedulable AppWrappers should respect and keep retry intervals, like in MCAD 1.33.0.
Screenshots, Console Output, Logs, etc.
N/A
Affected Releases
v1.34.0
Additional Context