Skip to content

Critical error in FlowLog code #1931

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
mattzech opened this issue Oct 26, 2023 · 27 comments
Closed

Critical error in FlowLog code #1931

mattzech opened this issue Oct 26, 2023 · 27 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. service/ec2 Indicates issues or PRs that are related to ec2-controller.

Comments

@mattzech
Copy link

Describe the bug
Any time a FlowLog. resource is created, the controller crashes with the following error message:

panic: runtime error: index out of range [0] with length 0 [recovered]
        panic: runtime error: index out of range [0] with length 0

goroutine 1860 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119 +0x1e5
panic({0x2061360?, 0xc0007ac1f8?})
        /usr/local/go/src/runtime/panic.go:914 +0x21f
github.com/aws-controllers-k8s/ec2-controller/pkg/resource/flow_log.(*resourceManager).sdkCreate(0xc001625ce0, {0x26a14a0, 0xc000ffa330}, 0xc001620120)
        /github.com/aws-controllers-k8s/ec2-controller/pkg/resource/flow_log/sdk.go:251 +0x594
github.com/aws-controllers-k8s/ec2-controller/pkg/resource/flow_log.(*resourceManager).Create(0xc000ffa300?, {0x26a14a0?, 0xc000ffa330?}, {0x26a99d8?, 0xc001620120})
        /github.com/aws-controllers-k8s/ec2-controller/pkg/resource/flow_log/manager.go:127 +0x4b
github.com/aws-controllers-k8s/runtime/pkg/runtime.(*resourceReconciler).createResource(0xc001124d00, {0x26a14a0, 0xc000ffa330}, {0x26a9970, 0xc001625ce0}, {0x26a99d8, 0xc001620120})
        /go/pkg/mod/github.com/aws-controllers-k8s/[email protected]/pkg/runtime/reconciler.go:417 +0x33f
github.com/aws-controllers-k8s/runtime/pkg/runtime.(*resourceReconciler).Sync(0xc001124d00, {0x26a14a0, 0xc000ffa330}, {0x26a9970, 0xc001625ce0}, {0x26a99d8, 0xc001620120})
        /go/pkg/mod/github.com/aws-controllers-k8s/[email protected]/pkg/runtime/reconciler.go:275 +0x645
github.com/aws-controllers-k8s/runtime/pkg/runtime.(*resourceReconciler).reconcile(0xc001124d00, {0x26a14a0, 0xc000ffa330}, {0x26a9970, 0xc001625ce0}, {0x26a99d8, 0xc001620120})
        /go/pkg/mod/github.com/aws-controllers-k8s/[email protected]/pkg/runtime/reconciler.go:215 +0x1cd
github.com/aws-controllers-k8s/runtime/pkg/runtime.(*resourceReconciler).Reconcile(0xc001124d00, {0x26a14a0, 0xc000a20db0}, {{{0xc000594210?, 0x0?}, {0xc000594200?, 0x4105a5?}}})
        /go/pkg/mod/github.com/aws-controllers-k8s/[email protected]/pkg/runtime/reconciler.go:186 +0x75b
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x26a14a0?, {0x26a14a0?, 0xc000a20db0?}, {{{0xc000594210?, 0x1ca8ee0?}, {0xc000594200?, 0x0?}}})
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:122 +0xb7
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0011f8820, {0x26a14d8, 0xc000910280}, {0x1f297c0?, 0xc0000a2440?})
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:323 +0x345
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0011f8820, {0x26a14d8, 0xc000910280})
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:274 +0x1c9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235 +0x79
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 1216
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:231 +0x565

I can propose a change in the ec2-controller repository pointing out the problem

Steps to reproduce
Create a FlowLog with basic spec

apiVersion: ec2.services.k8s.aws/v1alpha1
kind: FlowLog
metadata:
  name: flowlog-test
  namespace: test
spec:
  logDestination: "arn:aws:s3:::vpc-flowlogs-test"
  logDestinationType: s3
  resourceID: vpc-03b59daf7264a79c7
  resourceType: VPC
  trafficType: ALL

Expected outcome
A FlowLog created successfully

Environment
Openshift 4.13.12

  • Kubernetes version 1.26
  • Using EKS (yes/no), if so version? No
  • AWS service targeted (S3, RDS, etc.) E2
@a-hilaly a-hilaly added kind/bug Categorizes issue or PR as related to a bug. service/ec2 Indicates issues or PRs that are related to ec2-controller. labels Oct 26, 2023
@a-hilaly
Copy link
Member

I don't know why is the ec2 API returning an empty FlowLogs array.. Do you see any errors returned by the create step? Also could you confirm whether the resource is getting created or not?

cc-ing @nnbu since he worked on this feature.

@mattzech
Copy link
Author

Thanks for looking into this @a-hilaly @nnbu

  1. No other errors in the create step. I will include all the logs below.
2023-10-26T17:04:10.829Z        DEBUG   ackrt   <<<< rm.sdkCreate       {"account": "*********", "role": "arn:aws:iam::*********:role/ack-controllers", "region": "eu-west-1", "kind": "FlowLog", "namespace": "netgen-test", "name": "vpc-flowlog-test", "is_adopted": false, "generation": 1}
2023-10-26T17:04:10.829Z        DEBUG   ackrt   <<< r.createResource    {"account": "*********", "role": "arn:aws:iam::*********:role/ack-controllers", "region": "eu-west-1", "kind": "FlowLog", "namespace": "netgen-test", "name": "vpc-flowlog-test", "is_adopted": false, "generation": 1}
2023-10-26T17:04:10.829Z        DEBUG   ackrt   >>> r.ensureConditions  {"account": "*********", "role": "arn:aws:iam::*********:role/ack-controllers", "region": "eu-west-1", "kind": "FlowLog", "namespace": "netgen-test", "name": "vpc-flowlog-test", "is_adopted": false, "generation": 1}
2023-10-26T17:04:10.829Z        DEBUG   ackrt   >>>> rm.IsSynced        {"account": "*********", "role": "arn:aws:iam::*********:role/ack-controllers", "region": "eu-west-1", "kind": "FlowLog", "namespace": "netgen-test", "name": "vpc-flowlog-test", "is_adopted": false, "generation": 1}
2023-10-26T17:04:10.829Z        DEBUG   ackrt   <<<< rm.IsSynced        {"account": "*********", "role": "arn:aws:iam::*********:role/ack-controllers", "region": "eu-west-1", "kind": "FlowLog", "namespace": "netgen-test", "name": "vpc-flowlog-test", "is_adopted": false, "generation": 1}
2023-10-26T17:04:10.829Z        DEBUG   ackrt   <<< r.ensureConditions  {"account": "*********", "role": "arn:aws:iam::*********:role/ack-controllers", "region": "eu-west-1", "kind": "FlowLog", "namespace": "netgen-test", "name": "vpc-flowlog-test", "is_adopted": false, "generation": 1}
2023-10-26T17:04:10.829Z        DEBUG   ackrt   << r.Sync       {"account": "*********", "role": "arn:aws:iam::*********:role/ack-controllers", "region": "eu-west-1", "kind": "FlowLog", "namespace": "netgen-test", "name": "vpc-flowlog-test", "is_adopted": false, "generation": 1, "error": "resource not found"}
2023-10-26T17:04:10.829Z        INFO    Observed a panic in reconciler: runtime error: index out of range [0] with length 0     {"controller": "flowlog", "controllerGroup": "ec2.services.k8s.aws", "controllerKind": "FlowLog", "FlowLog": {"name":"vpc-flowlog-test","namespace":"netgen-test"}, "namespace": "netgen-test", "name": "vpc-flowlog-test", "reconcileID": "46854c76-b0fd-4eb0-b7ab-52b4451034c6"}
panic: runtime error: index out of range [0] with length 0 [recovered]
        panic: runtime error: index out of range [0] with length 0

goroutine 1821 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119 +0x1e5
panic({0x2061360?, 0xc000907fc8?})
        /usr/local/go/src/runtime/panic.go:914 +0x21f
github.com/aws-controllers-k8s/ec2-controller/pkg/resource/flow_log.(*resourceManager).sdkCreate(0xc0002a14a0, {0x26a14a0, 0xc000fff200}, 0xc0014a84c8)
        /github.com/aws-controllers-k8s/ec2-controller/pkg/resource/flow_log/sdk.go:251 +0x594
  1. Confirming that resource does not get created in AWS
    Thank you

@a-hilaly
Copy link
Member

Could you change the log level to debug @mattzech ? I believe the API might be returning an error.

@a-hilaly
Copy link
Member

But still very weird that we reach L251 with 0 elements in the FlowLogs array

@nnbu
Copy link

nnbu commented Oct 26, 2023

I will check further. But if there is not error at L212 , then resource should have been created.

Could you also try executing corresponding CLI command to create flowlog and see if you get the flowlog in the response
aws ec2 create-flow-logs \ --resource-type VPC \ --resource-ids vpc-00112233344556677 \ --traffic-type ALL \ --log-destination-type s3 \ --log-destination arn:aws:s3:::flow-log-bucket/my-custom-flow-logs

(Of course, edit this with the parameters that you provided)

(Taken from https://docs.aws.amazon.com/cli/latest/reference/ec2/create-flow-logs.html )

@nnbu
Copy link

nnbu commented Oct 26, 2023

I am assuming VPC and logDestination exist

@mattzech
Copy link
Author

Hi @nnbu yes it is very strange that is not hitting the error from L212.
I ran the command and it produced the desired output

{
    "ClientToken": "123456789NYzyP0I4Zd9kSRul7eqDlBXPkCGsCFZw=",
    "FlowLogIds": [
        "fl-0001112223334445555"
    ],
    "Unsuccessful": []
}

@a-hilaly
Copy link
Member

Is the issue reproducible @mattzech ? i'm wondering whether we hit an API hiccup. Also looks like the response has a Unsuccessful field... which might mean that the API doesn't return an error but a 200 + Unsuccessful when things go wrong...
@nnbu Should we check for Unsuccessful size after the error check in L212? and return an error if any..

@mattzech
Copy link
Author

Yes the error is reproducible. Looking at the logs I sent earlier, Debug mode seems to be on already. I tested the command with a nonexistent bucket to see how the Unsuccessful field populates:

{
    "ClientToken": "123456789P0I4Zd9kSRul7eqDlBXPkCGsCFZw=",
    "FlowLogIds": [],
    "Unsuccessful": [
        {
            "Error": {
                "Code": "400",
                "Message": "LogDestination: bad-bucket-test does not exist"
            },
            "ResourceId": "vpc-000112233445566"
        }
    ]
}

@mattzech
Copy link
Author

If the error reporting could print the Unsuccessful[0].Error.Message that could give some more insight on what is happening. I will run some tests to see if it could be a permissions issue

@a-hilaly
Copy link
Member

What's the HTTP code sent with this response? I assume 200?

{
    "ClientToken": "123456789P0I4Zd9kSRul7eqDlBXPkCGsCFZw=",
    "FlowLogIds": [],
    "Unsuccessful": [
        {
            "Error": {
                "Code": "400",
                "Message": "LogDestination: bad-bucket-test does not exist"
            },
            "ResourceId": "vpc-000112233445566"
        }
    ]
}

@a-hilaly
Copy link
Member

This makes more sense now. We should rely on Unsuccessful field to determine whether a resource was created successfully or not. cc @nnbu

@mattzech
Copy link
Author

What's the HTTP code sent with this response? I assume 200?

{
    "ClientToken": "123456789P0I4Zd9kSRul7eqDlBXPkCGsCFZw=",
    "FlowLogIds": [],
    "Unsuccessful": [
        {
            "Error": {
                "Code": "400",
                "Message": "LogDestination: bad-bucket-test does not exist"
            },
            "ResourceId": "vpc-000112233445566"
        }
    ]
}

This one returned 200. I tried with incorrect permissions and that request returned an error with a 254 Response Code

@mattzech
Copy link
Author

resp, err = rm.sdkapi.CreateFlowLogsWithContext(ctx, input)
	rm.metrics.RecordAPICall("CREATE", "CreateFlowLogs", err)
	if err != nil {
		return nil, err
	}
        if resp.Unsuccessful[0] != nil {
                 return nil, <error message from response>
        }

I think checking both err and Unsuccessful makes sense

@nnbu
Copy link

nnbu commented Oct 26, 2023

If Unsuccessful then we go to L245 . So, it is worth checking if Unsuccessful condition happened, and if so, don't go to L251

@nnbu
Copy link

nnbu commented Oct 26, 2023

@mattzech Is your LogDestination non existent?

@nnbu
Copy link

nnbu commented Oct 26, 2023

resp, err = rm.sdkapi.CreateFlowLogsWithContext(ctx, input)
	rm.metrics.RecordAPICall("CREATE", "CreateFlowLogs", err)
	if err != nil {
		return nil, err
	}
        if resp.Unsuccessful[0] != nil {
                 return nil, <error message from response>
        }

I think checking both err and Unsuccessful makes sense

I dont think, returning early is correct approach here. We should just skip L251 in case Unsuccessful is non nil.

@mattzech
Copy link
Author

@mattzech Is your LogDestination non existent?

It exists, I was able to successfully execute the create-flow-logs command using the same information from the yaml above. I was just testing the command with an incorrect bucket to see how it returned.

@nnbu
Copy link

nnbu commented Oct 26, 2023

ok. In that case, even after avoiding this crash, your problem would not be solved. Because the resource is not getting created anyway.

@mattzech
Copy link
Author

If the error message showed the message inside the Unsuccessful field, I would at least know more about the problem. As it stands, I have no idea what is causing this

@mattzech
Copy link
Author

Additionally, the panic is causing the controller pod to crash and restart every time it gets to this point. So avoiding the index out of bounds error would be beneficial to allow the reconciler to continue

@a-hilaly
Copy link
Member

I dont think, returning early is correct approach here. We should just skip L251 in case Unsuccessful is non nil.

@nnbu doesn't this mean that resource creation encountered an issue and that we should retry or set a terminal condition?

@a-hilaly
Copy link
Member

Or, maybe, we could return a re-queue error to force recreation after few seconds?

@ack-bot
Copy link
Collaborator

ack-bot commented Apr 24, 2024

Issues go stale after 180d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 60d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Provide feedback via https://github.com/aws-controllers-k8s/community.
/lifecycle stale

@ack-prow ack-prow bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 24, 2024
@ack-bot
Copy link
Collaborator

ack-bot commented Jun 23, 2024

Stale issues rot after 60d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 60d of inactivity.
If this issue is safe to close now please do so with /close.
Provide feedback via https://github.com/aws-controllers-k8s/community.
/lifecycle rotten

@ack-prow ack-prow bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 23, 2024
@ack-bot
Copy link
Collaborator

ack-bot commented Aug 22, 2024

Rotten issues close after 60d of inactivity.
Reopen the issue with /reopen.
Provide feedback via https://github.com/aws-controllers-k8s/community.
/close

@ack-prow ack-prow bot closed this as completed Aug 22, 2024
Copy link

ack-prow bot commented Aug 22, 2024

@ack-bot: Closing this issue.

In response to this:

Rotten issues close after 60d of inactivity.
Reopen the issue with /reopen.
Provide feedback via https://github.com/aws-controllers-k8s/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

ack-prow bot pushed a commit to aws-controllers-k8s/ec2-controller that referenced this issue Sep 20, 2024
Issue #, if available: [#1931](aws-controllers-k8s/community#1931)

Description of changes:
While creating FlowLog, if logDestnation does not exist then CreateFlowLogsWithContext does not return the error. However, Unsuccessful field in the response is set. FlowLog is not created in aws. Due to this, controller crashes while accessing the flow log id from the response (resp.FlowLogIds[0]).
Fixing the issue by checking if FlowLogIds has valid length before accessing it.


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. service/ec2 Indicates issues or PRs that are related to ec2-controller.
Projects
None yet
Development

No branches or pull requests

4 participants