[Regression] Infinite reconciliation for unschedulable AppWrappers #618

Closed
sutaakar opened this issue Aug 30, 2023 · 7 comments

Comments

@sutaakar
Contributor

Describe the Bug

When I create an AppWrapper with custompodresources CPU requests and limits larger than the available cluster CPU, the MCAD controller gets stuck in an infinite reconciliation loop: it starts to reconcile the AppWrapper every couple of milliseconds, completely cluttering the MCAD log.

Created AppWrapper (taken from the CodeFlare operator e2e test suite, with adjusted requests and limits):

apiVersion: mcad.ibm.com/v1beta1
kind: AppWrapper
metadata:
  name: mnist
spec:
  resources:
    GenericItems:
      - allocated: 0
        custompodresources:
          - limits:
              cpu: '4'
              memory: 1G
            replicas: 1
            requests:
              cpu: '4'
              memory: 512Mi
        generictemplate:
          apiVersion: batch/v1
          kind: Job
          metadata:
            name: mnist
            namespace: test-ns-xqlv6
          spec:
            completions: 1
            parallelism: 1
            template:
              metadata:
                creationTimestamp: null
              spec:
                containers:
                  - command:
                      - /bin/sh
                      - '-c'
                      - >-
                        pip install -r /test/requirements.txt && torchrun
                        /test/mnist.py
                    image: 'pytorch/pytorch:1.11.0-cuda11.3-cudnn8-runtime'
                    name: job
                    resources: {}
                    volumeMounts:
                      - mountPath: /test
                        name: test
                restartPolicy: Never
                volumes:
                  - configMap:
                      name: mnist-mcad
                    name: test
          status: {}
        priority: 0
        priorityslope: 0
        replicas: 1
  schedulingSpec:
    dispatchDuration: {}
    requeuing:
      growthType: exponential
      maxNumRequeuings: 0
      maxTimeInSeconds: 0
      numRequeuings: 0
      timeInSeconds: 300
  service:
    spec: {}

The behavior was observed in MCAD 1.34.0; it is a regression from 1.33.0, as I could not reproduce it there.

Codeflare Stack Component Versions

Please specify the component versions in which you have encountered this bug.

Codeflare SDK: N/A
MCAD: v1.34.0
Instascale: N/A
Codeflare Operator: latest version from main branch, running locally
Other: Tested on OpenShift CRC

Steps to Reproduce the Bug

  1. Deploy the latest CodeFlare operator and MCAD (e.g. using OLM)
  2. Check the cluster node CPU capacity
  3. Adjust the sample AppWrapper to request more CPU than the cluster node provides
  4. Create the AppWrapper on the cluster
  5. Check the MCAD logs

What Have You Already Tried to Debug the Issue?

I have compared the MCAD logs for 1.33.0 and 1.34.0 to see the difference in log size and content. No other debugging.

Expected Behavior

Reconciliation for unschedulable AppWrappers should respect the configured retry intervals, as it does in MCAD 1.33.0.
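
As an illustration of those intervals, here is a minimal sketch (in Go) of the exponential schedule implied by the schedulingSpec.requeuing block above (growthType: exponential, timeInSeconds: 300). The requeueDelay function and its doubling/capping rules are assumptions for illustration, not MCAD's actual implementation:

package main

import (
    "fmt"
    "time"
)

// requeueDelay returns the wait before retry n (starting at 0), doubling the
// base interval each time and capping at maxSeconds when maxSeconds > 0
// (maxTimeInSeconds: 0 in the spec is read here as "no cap").
func requeueDelay(baseSeconds, maxSeconds, n int) time.Duration {
    d := baseSeconds << n // exponential growth: base * 2^n
    if maxSeconds > 0 && d > maxSeconds {
        d = maxSeconds
    }
    return time.Duration(d) * time.Second
}

func main() {
    // With timeInSeconds: 300 the first retries land minutes apart,
    // not milliseconds apart as observed in 1.34.0.
    for n := 0; n < 4; n++ {
        fmt.Printf("retry %d after %v\n", n, requeueDelay(300, 0, n))
    }
}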

Screenshots, Console Output, Logs, etc.

N/A

Affected Releases

v1.34.0

Additional Context

Add as applicable and when known:

  • OS: Linux
  • OS Version: Fedora 38
  • Browser (UI issues): N/A
  • Browser Version (UI issues): N/A
  • Cloud: on-premise
  • Kubernetes: OpenShift CRC; also observed in KinD
  • OpenShift or K8s version: OCP 4.13

@asm582
Member

asm582 commented Aug 30, 2023

@sutaakar Thanks for reporting the issue. Do we know why the requests and limits are different for the above AW:

          - limits:
              cpu: '4'
              memory: 1G
            replicas: 1
            requests:
              cpu: '4'
              memory: 512Mi

@sutaakar
Contributor Author

I have adjusted them to be able to reproduce the issue (my cluster has a bit less than 4 CPUs).

@asm582
Member

asm582 commented Aug 30, 2023

OK, can you please try setting requests equal to limits and see if the issue still reproduces?

@sutaakar
Contributor Author

sutaakar commented Aug 30, 2023

You mean for the memory? For CPU the requests and limits are already equal.

Edit: With the same memory in requests and limits the result is the same.

@asm582
Member

asm582 commented Aug 30, 2023

@sutaakar Thanks, can you please share a snippet of the logs here?

@sutaakar
Contributor Author

Here is the complete MCAD log from the first several seconds:
log.txt

@asm582
Member

asm582 commented Aug 31, 2023

Thanks, the fix would be to improve the back-off policy inside MCAD. In this scenario, since we have only one AW in the system, we are attempting a dispatch in a short time window.
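
For context, a rough sketch of the shape such a back-off could take, assuming a controller-runtime style reconciler: return a growing RequeueAfter for consecutive unschedulable dispatch attempts instead of requeuing immediately. The appWrapperBackoff type, its failures map, and tryDispatch are hypothetical placeholders; only ctrl.Result and RequeueAfter are real controller-runtime API, and MCAD's actual dispatch loop is structured differently:

package sketch

import (
    "context"
    "time"

    ctrl "sigs.k8s.io/controller-runtime"
)

const (
    baseDelay = 5 * time.Second // first retry after an unschedulable dispatch
    maxDelay  = 5 * time.Minute // cap so the delay does not grow without bound
)

// appWrapperBackoff is a hypothetical reconciler fragment used only to
// illustrate the back-off idea.
type appWrapperBackoff struct {
    failures map[string]int // consecutive unschedulable attempts per AppWrapper
}

func newAppWrapperBackoff() *appWrapperBackoff {
    return &appWrapperBackoff{failures: map[string]int{}}
}

func (r *appWrapperBackoff) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    if r.tryDispatch(ctx, req) {
        delete(r.failures, req.String()) // dispatched: reset the back-off
        return ctrl.Result{}, nil
    }

    // Unschedulable: double the delay on every consecutive failure instead of
    // requeuing immediately (the immediate requeue is what floods the log).
    n := r.failures[req.String()]
    r.failures[req.String()] = n + 1

    delay := baseDelay << n
    if delay <= 0 || delay > maxDelay {
        delay = maxDelay
    }
    return ctrl.Result{RequeueAfter: delay}, nil
}

// tryDispatch stands in for the real capacity/quota check and returns true on
// a successful dispatch.
func (r *appWrapperBackoff) tryDispatch(ctx context.Context, req ctrl.Request) bool {
    return false
}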
