[Regression] Infinite reconciliation for unschedulable AppWrappers #618

Closed
sutaakar opened this issue Aug 30, 2023 · 7 comments

Comments

@sutaakar
Contributor

Describe the Bug

When I create an AppWrapper with custompodresources CPU requests and limits larger than the available cluster CPU, the MCAD controller gets stuck in an infinite reconciliation loop: it starts to reconcile the AppWrapper every couple of milliseconds, completely cluttering the MCAD log.

Created AppWrapper (taken from the CodeFlare operator e2e test suite, with adjusted requests and limits):

apiVersion: mcad.ibm.com/v1beta1
kind: AppWrapper
metadata:
  name: mnist
spec:
  resources:
    GenericItems:
      - allocated: 0
        custompodresources:
          - limits:
              cpu: '4'
              memory: 1G
            replicas: 1
            requests:
              cpu: '4'
              memory: 512Mi
        generictemplate:
          apiVersion: batch/v1
          kind: Job
          metadata:
            name: mnist
            namespace: test-ns-xqlv6
          spec:
            completions: 1
            parallelism: 1
            template:
              metadata:
                creationTimestamp: null
              spec:
                containers:
                  - command:
                      - /bin/sh
                      - '-c'
                      - >-
                        pip install -r /test/requirements.txt && torchrun
                        /test/mnist.py
                    image: 'pytorch/pytorch:1.11.0-cuda11.3-cudnn8-runtime'
                    name: job
                    resources: {}
                    volumeMounts:
                      - mountPath: /test
                        name: test
                restartPolicy: Never
                volumes:
                  - configMap:
                      name: mnist-mcad
                    name: test
          status: {}
        priority: 0
        priorityslope: 0
        replicas: 1
  schedulingSpec:
    dispatchDuration: {}
    requeuing:
      growthType: exponential
      maxNumRequeuings: 0
      maxTimeInSeconds: 0
      numRequeuings: 0
      timeInSeconds: 300
  service:
    spec: {}

The behavior was observed in MCAD 1.34.0; it is a regression from 1.33.0, as I could not reproduce it there.

Codeflare Stack Component Versions

Please specify the component versions in which you have encountered this bug.

Codeflare SDK: N/A
MCAD: v1.34.0
Instascale: N/A
Codeflare Operator: latest version from main branch, running locally
Other: Tested on OpenShift CRC

Steps to Reproduce the Bug

  1. Deploy the latest CodeFlare operator and MCAD (e.g. using OLM)
  2. Check the cluster node CPU capacity
  3. Adjust the sample AppWrapper to request more CPU than the cluster node provides
  4. Create the AppWrapper on the cluster
  5. Check the MCAD logs

What Have You Already Tried to Debug the Issue?

I have compared the MCAD logs for 1.33.0 and 1.34.0 to see the difference in log size and content. No other debugging.

Expected Behavior

Reconciliation for unschedulable AppWrappers should respect the configured retry intervals, as it does in MCAD 1.33.0.
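
As an illustration of those intervals, here is a minimal sketch (in Go) of the exponential schedule implied by the schedulingSpec.requeuing block above (growthType: exponential, timeInSeconds: 300). The requeueDelay function and its doubling/capping rules are assumptions for illustration, not MCAD's actual implementation:

package main

import (
    "fmt"
    "time"
)

// requeueDelay returns the wait before retry n (starting at 0), doubling the
// base interval each time and capping at maxSeconds when maxSeconds > 0
// (maxTimeInSeconds: 0 in the spec is read here as "no cap").
func requeueDelay(baseSeconds, maxSeconds, n int) time.Duration {
    d := baseSeconds << n // exponential growth: base * 2^n
    if maxSeconds > 0 && d > maxSeconds {
        d = maxSeconds
    }
    return time.Duration(d) * time.Second
}

func main() {
    // With timeInSeconds: 300 the first retries land minutes apart,
    // not milliseconds apart as observed in 1.34.0.
    for n := 0; n < 4; n++ {
        fmt.Printf("retry %d after %v\n", n, requeueDelay(300, 0, n))
    }
}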

Screenshots, Console Output, Logs, etc.

N/A

Affected Releases

v1.34.0

Additional Context

Add as applicable and when known:

  • OS: Linux
  • OS Version: Fedora 38
  • Browser (UI issues): N/A
  • Browser Version (UI issues): N/A
  • Cloud: on-premise
  • Kubernetes: OpenShift CRC; also observed in KinD
  • OpenShift or K8s version: OCP 4.13

@asm582
Member

asm582 commented Aug 30, 2023

@sutaakar Thanks for reporting the issue. Do we know why the requests and limits are different for the above AW:

          - limits:
              cpu: '4'
              memory: 1G
            replicas: 1
            requests:
              cpu: '4'
              memory: 512Mi

@sutaakar
Contributor Author

I have adjusted them to be able to reproduce the issue (my cluster has a bit less than 4 CPUs).

@asm582
Member

asm582 commented Aug 30, 2023

OK, can you please try setting requests equal to limits and see if the issue still reproduces?

@sutaakar
Contributor Author

sutaakar commented Aug 30, 2023

You mean for the memory? For CPU the requests and limits are already equal.

Edit: With the same memory in requests and limits the result is the same.

@asm582
Member

asm582 commented Aug 30, 2023

@sutaakar Thanks, can you please share a snippet of the logs here?

@sutaakar
Contributor Author

Here is the complete MCAD log from the first several seconds:
log.txt

@asm582
Member

asm582 commented Aug 31, 2023

Thanks, the fix would be to improve the back-off policy inside MCAD. In this scenario, since we have only one AW in the system, we are attempting a dispatch in a short time window.
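
For context, a rough sketch of the shape such a back-off could take, assuming a controller-runtime style reconciler: return a growing RequeueAfter for consecutive unschedulable dispatch attempts instead of requeuing immediately. The appWrapperBackoff type, its failures map, and tryDispatch are hypothetical placeholders; only ctrl.Result and RequeueAfter are real controller-runtime API, and MCAD's actual dispatch loop is structured differently:

package sketch

import (
    "context"
    "time"

    ctrl "sigs.k8s.io/controller-runtime"
)

const (
    baseDelay = 5 * time.Second // first retry after an unschedulable dispatch
    maxDelay  = 5 * time.Minute // cap so the delay does not grow without bound
)

// appWrapperBackoff is a hypothetical reconciler fragment used only to
// illustrate the back-off idea.
type appWrapperBackoff struct {
    failures map[string]int // consecutive unschedulable attempts per AppWrapper
}

func newAppWrapperBackoff() *appWrapperBackoff {
    return &appWrapperBackoff{failures: map[string]int{}}
}

func (r *appWrapperBackoff) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    if r.tryDispatch(ctx, req) {
        delete(r.failures, req.String()) // dispatched: reset the back-off
        return ctrl.Result{}, nil
    }

    // Unschedulable: double the delay on every consecutive failure instead of
    // requeuing immediately (the immediate requeue is what floods the log).
    n := r.failures[req.String()]
    r.failures[req.String()] = n + 1

    delay := baseDelay << n
    if delay <= 0 || delay > maxDelay {
        delay = maxDelay
    }
    return ctrl.Result{RequeueAfter: delay}, nil
}

// tryDispatch stands in for the real capacity/quota check and returns true on
// a successful dispatch.
func (r *appWrapperBackoff) tryDispatch(ctx context.Context, req ctrl.Request) bool {
    return false
}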
