diff --git a/.github/workflows/ci.yaml b/.github/workflows/ci.yaml new file mode 100644 index 0000000..953372c --- /dev/null +++ b/.github/workflows/ci.yaml @@ -0,0 +1,40 @@ +name: CI + +on: + pull_request: + types: [ opened, edited ] + push: + +jobs: + ci: + runs-on: ubuntu-latest + steps: + - name: Checkout + uses: actions/checkout@v4 + with: + fetch-depth: 0 + + - name: Setup Go + uses: actions/setup-go@v4 + with: + go-version: "1.23.1" + cache: true + + - name: Check formatting + run: | + cd gitpod-network-check + test -z "$(gofmt -l .)" + + - name: Run tests + run: | + cd gitpod-network-check + go test -v ./... + + - name: Build with GoReleaser + uses: goreleaser/goreleaser-action@v3 + with: + distribution: goreleaser + version: v2.8.2 + args: build --snapshot --clean + env: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} \ No newline at end of file diff --git a/.gitpod.yml b/.gitpod.yml index fee6fb6..2a2d375 100644 --- a/.gitpod.yml +++ b/.gitpod.yml @@ -1,4 +1 @@ -image: gitpod/workspace-full - -checkoutLocation: enterprise-deployment-toolkit -workspaceLocation: enterprise-deployment-toolkit/enterprise-deployment-toolkit.code-workspace +image: gitpod/workspace-full \ No newline at end of file diff --git a/.goreleaser.yaml b/.goreleaser.yaml index 1eb3053..131a5f3 100644 --- a/.goreleaser.yaml +++ b/.goreleaser.yaml @@ -14,6 +14,10 @@ builds: ignore: - goos: windows goarch: arm64 + flags: + - -a + ldflags: + - -s -w -extldflags=-static binary: gitpod-network-check archives: diff --git a/enterprise-deployment-toolkit.code-workspace b/enterprise-deployment-toolkit.code-workspace index 2d4c4ee..d1f6711 100644 --- a/enterprise-deployment-toolkit.code-workspace +++ b/enterprise-deployment-toolkit.code-workspace @@ -20,7 +20,6 @@ }, "go.lintTool": "golangci-lint", "gopls": { - "allowModfileModifications": true }, } } diff --git a/gitpod-network-check/.gitignore b/gitpod-network-check/.gitignore new file mode 100644 index 0000000..0a5c0fe --- /dev/null +++ 
b/gitpod-network-check/.gitignore @@ -0,0 +1,2 @@ +gitpod-network-check +*.zip \ No newline at end of file diff --git a/gitpod-network-check/Makefile b/gitpod-network-check/Makefile new file mode 100644 index 0000000..3730157 --- /dev/null +++ b/gitpod-network-check/Makefile @@ -0,0 +1,4 @@ +.PHONY: build + +build: + GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build -a -ldflags="-s -w -extldflags=-static" -o gitpod-network-check main.go \ No newline at end of file diff --git a/gitpod-network-check/README.md b/gitpod-network-check/README.md index d5c59bb..d5e79b8 100644 --- a/gitpod-network-check/README.md +++ b/gitpod-network-check/README.md @@ -34,27 +34,41 @@ A CLI to check if your network setup is suitable for the installation of Gitpod. 1. Preparation - To run a diagnosis of the network that you want to use for Gitpod, the CLI command needs to know the subnets you have chosen to be used as the `Main` subnets and the `Pod` subnets. You can read more about the distinction here in [our docs](https://www.gitpod.io/docs/enterprise/getting-started/networking#2-subnet-separation). The CLI expects to read the IDs of these subnets in a configuration file. By default it tries to read it from a file name `gitpod-network-check.yaml` in your current directory, but you can override this behavior by using the `--config` flag of the CLI. + To run a diagnosis of the network that you want to use for Gitpod, the CLI command needs to know the subnets you have chosen to be used as the `Main` subnets. You can read more about those here in [our docs](https://www.gitpod.io/docs/enterprise/getting-started/networking#2-subnet-separation). The CLI expects to read the IDs of these subnets in a configuration file. By default it tries to read it from a file named `gitpod-network-check.yaml` in your current directory, but you can override this behavior by using the `--config` flag of the CLI. 
For the sake of simplicity, let us create a file `gitpod-network-check.yaml` in the current directory and populate it with the subnet IDs and AWS region as shown below: ```yaml log-level: debug # Options: debug, info, warning, error region: eu-central-1 main-subnets: subnet-0554e84f033a64c56, subnet-08584621e7754e505, subnet-094c6fd68aea493b7 - pod-subnets: subnet-028d11dce93b8eefc, subnet-04ec8257d95c434b7,subnet-00a83550ce709f39c https-hosts: accounts.google.com, github.com - instance-ami: # put your custom ami id here if you want to use it, otherwise it will using latest ubuntu AMI from aws - api-endpoint: # optional, put your API endpoint regional sub-domain here to test connectivity, like when the execute-api vpc endpoint is not in the same account as Gitpod + api-endpoint: # optional, put your API endpoint regional sub-domain here to test connectivity, like when the execute-api vpc endpoint is not in the same account as Gitpod + + ## EC2 runner + #instance-ami: # put your custom AMI ID here if you want to use it, otherwise it will use the latest Ubuntu AMI from AWS + + ## Lambda runner + # lambda-role-arn: arn:aws:iam::123456789012:role/MyExistingLambdaRole # Optional: Use existing IAM Role for Lambda mode + # lambda-sg-id: sg-0123456789abcdef0 # Optional: Use existing Security Group for Lambda mode ``` - note: if using a custom AMI, please ensure the [SSM agent](https://docs.aws.amazon.com/systems-manager/latest/userguide/manually-install-ssm-agent-linux.html) and [curl](https://curl.se/) are both installed. We rely on SSM's [SendCommand](https://docs.aws.amazon.com/code-library/latest/ug/ssm_example_ssm_SendCommand_section.html) to test HTTPS connectivity. + **Note:** The `lambda-role-arn` and `lambda-sg-id` fields correspond to the `--lambda-role-arn` and `--lambda-sg-id` command-line flags, respectively. Setting them in the config file or via environment variables (e.g., `NTCHK_LAMBDA_ROLE_ARN`) achieves the same result. 
+ + **EC2 Mode Note:** If using a custom AMI (`instance-ami`), please ensure the [SSM agent](https://docs.aws.amazon.com/systems-manager/latest/userguide/manually-install-ssm-agent-linux.html) and [curl](https://curl.se/) are both installed. We rely on SSM's [SendCommand](https://docs.aws.amazon.com/code-library/latest/ug/ssm_example_ssm_SendCommand_section.html) to test HTTPS connectivity in EC2 mode. 2. Run the network diagnosis - To start the diagnosis, the the command: `./gitpod-network-check diagnose` + The tool supports different runners for executing the checks, specified by the `--runner` flag (`ec2`, `lambda`, `local`). + + **Using EC2 Runner (Default):** + + This mode launches temporary EC2 instances in your specified subnets to perform the network checks. This most closely simulates the environment where Gitpod components will run. + + To start the diagnosis using the EC2 runner: `./gitpod-network-check diagnose --runner ec2` (or simply `./gitpod-network-check diagnose` as EC2 is the default). ```console - ./gitpod-network-check diagnose + # Example output for EC2 runner + ./gitpod-network-check diagnose --runner ec2 INFO[0000] ℹ️ Running with region `eu-central-1`, main subnet `[subnet-0ed211f14362b224f subnet-041703e62a05d2024]`, pod subnet `[subnet-075c44edead3b062f subnet-06eb311c6b92e0f29]`, hosts `[accounts.google.com https://github.com]`, ami ``, and API endpoint `` INFO[0000] ✅ Main Subnets are valid INFO[0000] ✅ Pod Subnets are valid @@ -116,22 +130,51 @@ A CLI to check if your network setup is suitable for the installation of Gitpod. INFO[0306] ✅ Security group 'sg-00d4a66a7840ebd67' deleted ``` + **Using Lambda Runner:** + + This mode uses AWS Lambda functions deployed into your specified subnets to perform the network checks. It avoids the need to launch full EC2 instances but has its own prerequisites. 
+ + * **Prerequisites for Lambda Mode:** + * **IAM Permissions:** The AWS credentials used to run `gitpod-network-check` need permissions to manage Lambda functions, IAM roles, security groups, and CloudWatch Logs. Specifically, it needs to perform actions like: `lambda:CreateFunction`, `lambda:GetFunction`, `lambda:DeleteFunction`, `lambda:InvokeFunction`, `iam:CreateRole`, `iam:GetRole`, `iam:DeleteRole`, `iam:AttachRolePolicy`, `iam:DetachRolePolicy`, `ec2:CreateSecurityGroup`, `ec2:DescribeSecurityGroups`, `ec2:DeleteSecurityGroup`, `ec2:AuthorizeSecurityGroupEgress`, `ec2:DescribeSubnets`, `logs:DeleteLogGroup`. + * **Network Connectivity:** Lambda functions running within a VPC need a route to the internet or required AWS service endpoints. This typically requires a **NAT Gateway** in your VPC or **VPC Endpoints** for all necessary services (e.g., STS, CloudWatch Logs, ECR, S3, DynamoDB, and any target HTTPS hosts). Without proper outbound connectivity, the Lambda checks will fail. + + * **Running Lambda Runner:** + To start the diagnosis using the Lambda runner: + ```bash + ./gitpod-network-check diagnose --runner lambda + ``` + + * **Using Existing Resources (Lambda Runner):** + If you have pre-existing IAM roles or Security Groups you want the Lambda functions to use, you can specify them using flags. This will prevent the tool from creating or deleting these specific resources. + ```bash + ./gitpod-network-check diagnose --runner lambda \ + --lambda-role-arn arn:aws:iam::123456789012:role/MyExistingLambdaRole \ + --lambda-sg-id sg-0123456789abcdef0 + ``` + + * **Example Output (Lambda Runner):** + The output will be similar to EC2 runner but will show Lambda function creation/invocation instead of EC2 instance management. + + **Using Local Runner:** + + This mode runs the checks directly from the machine where you execute the CLI. 
It's useful for basic outbound connectivity tests but **does not** accurately reflect the network environment within your AWS subnets. + + To start the diagnosis using the local runner: `./gitpod-network-check diagnose --runner local` + 3. Clean up after network diagnosis - Dianosis is designed to do clean-up before it finishes. However, if the process terminates unexpectedly, you may clean-up AWS resources it creates like so: + The `diagnose` command is designed to clean up the AWS resources it creates (EC2 instances, Lambda functions, IAM roles, Security Groups, CloudWatch Log groups) before it finishes. However, if the process terminates unexpectedly, you can manually trigger cleanup using the `clean` command. This command respects the `--runner` flag to clean up resources specific to that runner. - ```console - ./gitpod-network-check clean - INFO[0000] ✅ Main Subnets are valid - INFO[0000] ✅ Pod Subnets are valid - INFO[0000] ✅ Instances terminated - INFO[0000] Cleaning up: Waiting for 2 minutes so network interfaces are deleted - INFO[0121] ✅ Role 'GitpodNetworkCheck' deleted - INFO[0121] ✅ Instance profile deleted - INFO[0122] ✅ Security group 'sg-0a6119dcb6a564fc1' deleted - INFO[0122] ✅ Security group 'sg-07373362953212e54' deleted + ```bash + # Clean up resources potentially left by the EC2 runner + ./gitpod-network-check clean --runner ec2 + + # Clean up resources potentially left by the Lambda runner + ./gitpod-network-check clean --runner lambda ``` + **Note:** The `clean` command will *not* delete IAM roles or Security Groups if they were provided using the `--lambda-role-arn` or `--lambda-sg-id` flags during the `diagnose` run. + ## FAQ If the EC2 instances are timing out, or you cannot connect to them with Session Manager, be sure to add the following policies. 
diff --git a/gitpod-network-check/cmd/checks.go b/gitpod-network-check/cmd/checks.go index c0c6dae..b4ddce1 100644 --- a/gitpod-network-check/cmd/checks.go +++ b/gitpod-network-check/cmd/checks.go @@ -1,726 +1,111 @@ package cmd import ( - "context" - "encoding/base64" - "errors" "fmt" - "net" - "net/url" + "maps" "slices" - "sort" - "strings" - "time" - "github.com/aws/aws-sdk-go-v2/aws" - "github.com/aws/aws-sdk-go-v2/service/ec2" - "github.com/aws/aws-sdk-go-v2/service/ec2/types" - "github.com/aws/aws-sdk-go-v2/service/iam" - iam_types "github.com/aws/aws-sdk-go-v2/service/iam/types" - "github.com/aws/aws-sdk-go-v2/service/ssm" - "github.com/aws/smithy-go" log "github.com/sirupsen/logrus" "github.com/spf13/cobra" - "golang.org/x/sync/errgroup" - "k8s.io/apimachinery/pkg/util/wait" -) - -var checkCommand = &cobra.Command{ // nolint:gochecknoglobals - PersistentPreRunE: validateSubnets, - Use: "diagnose", - Short: "Runs the network check diagnosis", - SilenceUsage: false, - RunE: func(cmd *cobra.Command, args []string) error { - cfg, err := initAwsConfig(cmd.Context(), networkConfig.AwsRegion) - if err != nil { - return err - } - - ec2Client := ec2.NewFromConfig(cfg) - ssmClient := ssm.NewFromConfig(cfg) - iamClient := iam.NewFromConfig(cfg) - defer cleanup(cmd.Context(), ec2Client, iamClient) - err = checkSMPrerequisites(cmd.Context(), ec2Client) - if err != nil { - return fmt.Errorf("❌ failed to check prerequisites: %v", err) - } + "github.com/gitpod-io/enterprise-deployment-toolkit/gitpod-network-check/pkg/checks" + testrunner "github.com/gitpod-io/enterprise-deployment-toolkit/gitpod-network-check/pkg/runner" +) - role, err := createIAMRoleAndAttachPolicy(cmd.Context(), iamClient) - if err != nil { - return fmt.Errorf("❌ error creating IAM role and attaching policy: %v", err) - } - Roles = append(Roles, *role.RoleName) - log.Info("✅ IAM role created and policy attached") +var skipCleanup bool - instanceProfile, err := 
createInstanceProfileAndAttachRole(cmd.Context(), iamClient, *role.RoleName) - if err != nil { - return fmt.Errorf("❌ failed to create instance profile: %v", err) - } - InstanceProfile = aws.ToString(instanceProfile.InstanceProfileName) +func init() { + checkCommand.Flags().BoolVar(&skipCleanup, "skip-cleanup", false, "Skip the cleanup (default: false). Useful for debugging purposes.") + NetworkCheckCmd.AddCommand(checkCommand) +} - allSubnets := slices.Concat(networkConfig.MainSubnets, networkConfig.PodSubnets) - slices.Sort(allSubnets) - distinctSubnets := slices.Compact(allSubnets) - if len(distinctSubnets) < len(allSubnets) { - log.Infof("ℹ️  Found duplicate subnets. We'll test each subnet '%v' only once.", distinctSubnets) - } +var checkCommand = &cobra.Command{ // nolint:gochecknoglobals + PreRunE: validateArguments, + Use: "diagnose", + Short: "Runs the network check diagnosis", + SilenceUsage: false, + RunE: func(cmd *cobra.Command, args []string) error { + ctx := cmd.Context() - log.Info("ℹ️  Launching EC2 instances in Main subnets") - mainInstanceIds, err := launchInstances(cmd.Context(), ec2Client, networkConfig.MainSubnets, instanceProfile.Arn) + runner, err := testrunner.NewRunner(ctx, Flags.RunnerType, &NetworkConfig) if err != nil { - return err + return fmt.Errorf("❌ failed to create test runner: %v", err) } - log.Infof("ℹ️  Main EC2 instances: %v", mainInstanceIds) - InstanceIds = append(InstanceIds, mainInstanceIds...) - log.Info("ℹ️  Launching EC2 instances in a Pod subnets") - podInstanceIds, err := launchInstances(cmd.Context(), ec2Client, networkConfig.PodSubnets, instanceProfile.Arn) - if err != nil { - return err - } - log.Infof("ℹ️  Pod EC2 instances: %v", podInstanceIds) - InstanceIds = append(InstanceIds, podInstanceIds...) 
+ defer (func() { + if skipCleanup { + log.Info("⚠️ Skipping cleanup, because --skip-cleanup flag is set.") + return + } - log.Info("ℹ️ Waiting for EC2 instances to become Running (times out in 5 minutes)") - runningWaiter := ec2.NewInstanceRunningWaiter(ec2Client, func(irwo *ec2.InstanceRunningWaiterOptions) { - irwo.MaxDelay = 15 * time.Second - irwo.MinDelay = 5 * time.Second - }) - err = runningWaiter.Wait(cmd.Context(), &ec2.DescribeInstancesInput{InstanceIds: InstanceIds}, *aws.Duration(5 * time.Minute)) - if err != nil { - return fmt.Errorf("❌ Nodes never got Running: %v", err) - } - log.Info("✅ EC2 instances are now Running.") - log.Info("ℹ️ Waiting for EC2 instances to become Healthy (times out in 5 minutes)") - waitstatusOK := ec2.NewInstanceStatusOkWaiter(ec2Client, func(isow *ec2.InstanceStatusOkWaiterOptions) { - isow.MaxDelay = 15 * time.Second - isow.MinDelay = 5 * time.Second - }) - err = waitstatusOK.Wait(cmd.Context(), &ec2.DescribeInstanceStatusInput{InstanceIds: InstanceIds}, *aws.Duration(5 * time.Minute)) - if err != nil { - return fmt.Errorf("❌ Nodes never got Healthy: %v", err) - } - log.Info("✅ EC2 Instances are now healthy/Ok") + // Ensure runner was actually assigned before trying to clean up + if runner == nil { + log.Info("ℹ️ No runner initialized, skipping cleanup.") + return + } + log.Infof("ℹ️ Running cleanup") + terr := runner.Cleanup(ctx) + if terr != nil { + log.Errorf("❌ failed to cleanup: %v", terr) + } + log.Infof("✅ Cleanup done") + })() - log.Infof("ℹ️ Connecting to SSM...") - err = ensureSessionManagerIsUp(cmd.Context(), ssmClient) + // Prepare + err = runner.Prepare(ctx) if err != nil { - return fmt.Errorf("❌ could not connect to SSM: %w", err) + return fmt.Errorf("❌ failed to prepare: %v", err) } - log.Infof("ℹ️ Checking if the required AWS Services can be reached from the ec2 instances in the pod subnet") - serviceEndpoints := map[string]string{ - "SSM": fmt.Sprintf("https://ssm.%s.amazonaws.com", 
networkConfig.AwsRegion), - "SSMmessages": fmt.Sprintf("https://ssmmessages.%s.amazonaws.com", networkConfig.AwsRegion), - "Autoscaling": fmt.Sprintf("https://autoscaling.%s.amazonaws.com", networkConfig.AwsRegion), - "CloudFormation": fmt.Sprintf("https://cloudformation.%s.amazonaws.com", networkConfig.AwsRegion), - "EC2": fmt.Sprintf("https://ec2.%s.amazonaws.com", networkConfig.AwsRegion), - "EC2messages": fmt.Sprintf("https://ec2messages.%s.amazonaws.com", networkConfig.AwsRegion), - "EKS": fmt.Sprintf("https://eks.%s.amazonaws.com", networkConfig.AwsRegion), - "Elastic LoadBalancing": fmt.Sprintf("https://elasticloadbalancing.%s.amazonaws.com", networkConfig.AwsRegion), - "Kinesis Firehose": fmt.Sprintf("https://firehose.%s.amazonaws.com", networkConfig.AwsRegion), - "KMS": fmt.Sprintf("https://kms.%s.amazonaws.com", networkConfig.AwsRegion), - "CloudWatch": fmt.Sprintf("https://logs.%s.amazonaws.com", networkConfig.AwsRegion), - "SecretsManager": fmt.Sprintf("https://secretsmanager.%s.amazonaws.com", networkConfig.AwsRegion), - "Sts": fmt.Sprintf("https://sts.%s.amazonaws.com", networkConfig.AwsRegion), - "ECR Api": fmt.Sprintf("https://api.ecr.%s.amazonaws.com", networkConfig.AwsRegion), - "ECR": fmt.Sprintf("https://869456089606.dkr.ecr.%s.amazonaws.com", networkConfig.AwsRegion), - } - checkServicesAvailability(cmd.Context(), ssmClient, InstanceIds, serviceEndpoints) + for _, testset := range Flags.SelectedTestsets { + log.Infof("ℹ️ Running testset: %s", testset) - log.Infof("ℹ️ Checking if certain AWS Services can be reached from ec2 instances in the main subnet") - serviceEndpointsForMain := map[string]string{ - "S3": fmt.Sprintf("https://s3.%s.amazonaws.com", networkConfig.AwsRegion), - "DynamoDB": fmt.Sprintf("https://dynamodb.%s.amazonaws.com", networkConfig.AwsRegion), - } - if networkConfig.ApiEndpoint != "" { - serviceEndpointsForMain["ExecuteAPI"] = fmt.Sprintf("https://%s.execute-api.%s.amazonaws.com", networkConfig.ApiEndpoint, 
networkConfig.AwsRegion) - } - checkServicesAvailability(cmd.Context(), ssmClient, mainInstanceIds, serviceEndpointsForMain) + ts := checks.TestSets[checks.TestsetName(testset)] + serviceEndpoints, subnetType := ts(&NetworkConfig) + subnets := Filter(NetworkConfig.GetAllSubnets(), func(subnet checks.Subnet) bool { + return subnet.Type == subnetType + }) - httpHosts := map[string]string{} - for _, v := range networkConfig.HttpsHosts { - host := strings.TrimSpace(v) - parsedUrl, err := url.Parse(host) + testResult, err := runner.TestService(ctx, subnets, serviceEndpoints) if err != nil { - log.Warnf("🚧 Invalid Host: %s, skipping due to error: %v", host, err) - continue + log.Errorf("❌ failed to run testset %s: %v", testset, err) + break } - if parsedUrl.Scheme == "" { - httpHosts[host] = fmt.Sprintf("https://%s", host) - } else if parsedUrl.Scheme == "https" { - httpHosts[host] = parsedUrl.Host + if !testResult { + log.Errorf("❌ Testset %s failed", testset) } else { - log.Warnf("🚧 Unsupported scheme: %s, skipping test for %s", parsedUrl.Scheme, host) - continue + log.Infof("✅ Testset %s passed", testset) } } - if len(httpHosts) > 0 { - log.Infof("ℹ️ Checking if hosts can be reached with HTTPS from ec2 instances in the main subnets") - } - checkServicesAvailability(cmd.Context(), ssmClient, mainInstanceIds, httpHosts) return nil }, } -type vpcEndpointsMap struct { - Endpoint string - PrivateDnsName string - PrivateDnsRequired bool -} - -// the ssm-agent requires that ec2messages, ssm and ssmmessages are available -// we check the endpoints here so that if we cannot send commands to the ec2 instance -// in a private setup we know why -func checkSMPrerequisites(ctx context.Context, ec2Client *ec2.Client) error { - log.Infof("ℹ️ Checking prerequisites") - vpcEndpoints := []vpcEndpointsMap{ - { - Endpoint: fmt.Sprintf("com.amazonaws.%s.ec2messages", networkConfig.AwsRegion), - PrivateDnsName: fmt.Sprintf("ec2messages.%s.amazonaws.com", networkConfig.AwsRegion), - 
PrivateDnsRequired: false, - }, - { - Endpoint: fmt.Sprintf("com.amazonaws.%s.ssm", networkConfig.AwsRegion), - PrivateDnsName: fmt.Sprintf("ssm.%s.amazonaws.com", networkConfig.AwsRegion), - PrivateDnsRequired: false, - }, - { - Endpoint: fmt.Sprintf("com.amazonaws.%s.ssmmessages", networkConfig.AwsRegion), - PrivateDnsName: fmt.Sprintf("ssmmessages.%s.amazonaws.com", networkConfig.AwsRegion), - PrivateDnsRequired: false, - }, - { - Endpoint: fmt.Sprintf("com.amazonaws.%s.execute-api", networkConfig.AwsRegion), - PrivateDnsName: fmt.Sprintf("execute-api.%s.amazonaws.com", networkConfig.AwsRegion), - PrivateDnsRequired: true, - }, - } - - var prereqErrs []string - for _, endpoint := range vpcEndpoints { - response, err := ec2Client.DescribeVpcEndpoints(ctx, &ec2.DescribeVpcEndpointsInput{ - Filters: []types.Filter{ - { - Name: aws.String("service-name"), - Values: []string{endpoint.Endpoint}, - }, - }, - }) - - if err != nil { - return err - } - - if len(response.VpcEndpoints) == 0 { - if strings.Contains(endpoint.Endpoint, "execute-api") && networkConfig.ApiEndpoint != "" { - log.Infof("ℹ️ 'api-endpoint' parameter exists, deferring connectivity test for execute-api VPC endpoint until testing main subnet connectivity") - continue - } else if strings.Contains(endpoint.Endpoint, "execute-api") && networkConfig.ApiEndpoint == "" { - errMsg := "Add a VPC endpoint for execute-api in this account or use the 'api-endpoint' parameter to specify a centralized one in another account, and test again" - log.Errorf("❌ %s", errMsg) - prereqErrs = append(prereqErrs, errMsg) - continue - } - _, err := TestServiceConnectivity(ctx, endpoint.PrivateDnsName, 5*time.Second) - if err != nil { - errMsg := fmt.Sprintf("Service %s connectivity test failed: %v\n", endpoint.PrivateDnsName, err) - log.Error("❌ %w", errMsg) - prereqErrs = append(prereqErrs, errMsg) +func validateArguments(cmd *cobra.Command, args []string) error { + // Validate testsets if specified + if 
len(Flags.SelectedTestsets) > 0 { + for _, testset := range Flags.SelectedTestsets { + if _, exists := checks.TestSets[checks.TestsetName(testset)]; !exists { + return fmt.Errorf("Invalid testset: %s. Available testsets: %v", + testset, + slices.Collect(maps.Keys(checks.TestSets))) } - log.Infof("✅ Service %s has connectivity", endpoint.PrivateDnsName) - } else { - for _, e := range response.VpcEndpoints { - if e.PrivateDnsEnabled != nil && !*e.PrivateDnsEnabled && endpoint.PrivateDnsRequired { - errMsg := fmt.Sprintf("VPC endpoint '%s' has private DNS disabled, it must be enabled", *e.VpcEndpointId) - log.Errorf("❌ %s", errMsg) - prereqErrs = append(prereqErrs, errMsg) - } - } - log.Infof("✅ VPC endpoint %s is configured", endpoint.Endpoint) - } - } - - if len(prereqErrs) > 0 { - return fmt.Errorf("%s", strings.Join(prereqErrs, "; ")) - } - return nil -} - -func ensureSessionManagerIsUp(ctx context.Context, ssmClient *ssm.Client) error { - err := wait.PollUntilContextTimeout(ctx, 500*time.Millisecond, 2*time.Minute, true, func(ctx context.Context) (done bool, err error) { - _, err = sendCommand(ctx, ssmClient, "echo ssm") - if err != nil { - return false, nil } - - return true, nil - }) - - if err != nil { - return fmt.Errorf("❌ could not establish connection with SSM: %w", err) - } - - return nil -} - -func checkServicesAvailability(ctx context.Context, ssmClient *ssm.Client, instanceIds []string, serviceEndpoints map[string]string) { - services := make([]string, 0, len(serviceEndpoints)) - for service := range serviceEndpoints { - services = append(services, service) - } - sort.Strings(services) - - for _, service := range services { - err := isServiceAvailable(ctx, ssmClient, instanceIds, serviceEndpoints[service]) - if err != nil { - log.Warnf("❌ %s is not available (%s)", service, serviceEndpoints[service]) - log.Info(err) - } else { - log.Infof("✅ %s is available", service) - } - } -} - -func isServiceAvailable(ctx context.Context, ssmSvc *ssm.Client, 
instanceIds []string, serviceUrl string) error { - commandId, err := sendServiceRequest(ctx, ssmSvc, serviceUrl) - if err != nil { - return fmt.Errorf("❌ Failed to run the command in instances: %v", err) - } - - g, ctx := errgroup.WithContext(context.Background()) - for _, instanceId := range instanceIds { - id := instanceId // Local variable for the closure - g.Go(func() error { - return fetchResultsForInstance(ctx, ssmSvc, id, commandId) - }) - } - if err := g.Wait(); err != nil { - return fmt.Errorf("❌ Error fetching command results: %v", err) - } - - return nil -} - -func validateSubnets(cmd *cobra.Command, args []string) error { - if len(networkConfig.MainSubnets) < 1 { - return fmt.Errorf("❌ At least one Main subnet needs to be specified: %v", networkConfig.MainSubnets) - } - log.Info("✅ Main Subnets are valid") - if len(networkConfig.PodSubnets) < 1 { - return fmt.Errorf("❌ At least one Pod subnet needs to be specified: %v", networkConfig.PodSubnets) - } - - log.Info("✅ Pod Subnets are valid") - - return nil -} - -func launchInstances(ctx context.Context, ec2Client *ec2.Client, subnets []string, profileArn *string) ([]string, error) { - var instanceIds []string - for _, subnet := range subnets { - if _, ok := Subnets[subnet]; ok { - log.Warnf("An EC2 instance was already created for subnet '%v', skipping", subnet) - continue - } - secGroup, err := createSecurityGroups(ctx, ec2Client, subnet) - if err != nil { - return nil, fmt.Errorf("❌ failed to create security group for subnet '%v': %v", subnet, err) - } - SecurityGroups = append(SecurityGroups, secGroup) - - instanceType, err := getPreferredInstanceType(ctx, ec2Client) - if err != nil { - return nil, fmt.Errorf("❌ failed to get preferred instance type: %v", err) - } - log.Infof("ℹ️ Instance type %s shall be used", instanceType) - - instanceId, err := launchInstanceInSubnet(ctx, ec2Client, subnet, secGroup, profileArn, instanceType) - if err != nil { - return nil, fmt.Errorf("❌ Failed to launch instances 
in subnet %s: %v", subnet, err) - } - - instanceIds = append(instanceIds, instanceId) - if Subnets == nil { - Subnets = make(map[string]bool) - } - Subnets[subnet] = true - } - - return instanceIds, nil -} - -func launchInstanceInSubnet(ctx context.Context, ec2Client *ec2.Client, subnetID, secGroupId string, instanceProfileName *string, instanceType types.InstanceType) (string, error) { - amiId := "" - if networkConfig.InstanceAMI != "" { - customAMIId, err := findCustomAMI(ctx, ec2Client, networkConfig.InstanceAMI) - if err != nil { - return "", err - } - amiId = customAMIId } else { - regionalAMI, err := findUbuntuAMI(ctx, ec2Client) - if err != nil { - return "", err - } - amiId = regionalAMI + log.Info("ℹ️ No testsets specified, running no testsets") } - // Specify the user data script to install the SSM Agent - userData := `#!/bin/bash - sudo systemctl enable snap.amazon-ssm-agent.amazon-ssm-agent.service - sudo systemctl restart snap.amazon-ssm-agent.amazon-ssm-agent.service - ` - - // Encode user data in base64 - userDataEncoded := base64.StdEncoding.EncodeToString([]byte(userData)) - - input := &ec2.RunInstancesInput{ - ImageId: aws.String(amiId), // Example AMI ID, replace with an actual one - InstanceType: instanceType, - MaxCount: aws.Int32(1), - MinCount: aws.Int32(1), - UserData: &userDataEncoded, - SecurityGroupIds: []string{secGroupId}, - SubnetId: aws.String(subnetID), - IamInstanceProfile: &types.IamInstanceProfileSpecification{ - Arn: instanceProfileName, - }, - TagSpecifications: []types.TagSpecification{ - { - ResourceType: types.ResourceTypeInstance, - Tags: []types.Tag{ - { - Key: aws.String("gitpod.io/network-check"), - Value: aws.String("true"), - }, - }, - }, - }, - } - - var result *ec2.RunInstancesOutput - err := wait.PollUntilContextTimeout(ctx, 500*time.Millisecond, 10*time.Second, false, func(ctx context.Context) (done bool, err error) { - result, err = ec2Client.RunInstances(ctx, input) - - if err != nil { - if 
strings.Contains(err.Error(), "Invalid IAM Instance Profile ARN") { - return false, nil - } - - return false, err - } - - return true, nil - }) - - if err != nil { - return "", err - } - - if len(result.Instances) == 0 { - return "", fmt.Errorf("instances didn't get created") - } - - return aws.ToString(result.Instances[0].InstanceId), nil -} - -func findCustomAMI(ctx context.Context, client *ec2.Client, amiId string) (string, error) { - input := &ec2.DescribeImagesInput{ - ImageIds: []string{amiId}, - } - - result, err := client.DescribeImages(ctx, input) - if err != nil { - return "", err - } - if len(result.Images) > 0 { - return *result.Images[0].ImageId, nil - } - - return "", fmt.Errorf("no custom AMI found") -} - -// findUbuntuAMI searches for the latest Ubuntu AMI in the region of the EC2 client. -func findUbuntuAMI(ctx context.Context, client *ec2.Client) (string, error) { - // You may want to update these filters based on your specific requirements - input := &ec2.DescribeImagesInput{ - Filters: []types.Filter{ - { - Name: aws.String("name"), - Values: []string{"ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*"}, - }, - { - Name: aws.String("virtualization-type"), - Values: []string{"hvm"}, - }, - }, - Owners: []string{"099720109477"}, // Canonical's owner ID - } - - result, err := client.DescribeImages(ctx, input) - if err != nil { - return "", err - } - - // Sort the AMIs by creation date - sort.Slice(result.Images, func(i, j int) bool { - return *result.Images[i].CreationDate > *result.Images[j].CreationDate - }) - - if len(result.Images) > 0 { - return *result.Images[0].ImageId, nil - } - - return "", fmt.Errorf("no Ubuntu AMIs found") -} - -// sendServiceRequest sends a command to an EC2 instance and returns the command ID -func sendServiceRequest(ctx context.Context, svc *ssm.Client, serviceUrl string) (string, error) { - return sendCommand(ctx, svc, fmt.Sprintf("curl -m 15 -I %v", serviceUrl)) -} - -func sendCommand(ctx context.Context, svc 
*ssm.Client, command string) (string, error) { - networkTestingCommands := []string{ - command, - } - - result, err := svc.SendCommand(ctx, &ssm.SendCommandInput{ - InstanceIds: InstanceIds, - DocumentName: aws.String("AWS-RunShellScript"), - Parameters: map[string][]string{ - "commands": networkTestingCommands, - }, - }) - if err != nil { - return "", fmt.Errorf("error sending command: %v", err) - } - - return *result.Command.CommandId, nil -} - -func fetchResultsForInstance(ctx context.Context, svc *ssm.Client, instanceId, commandId string) error { - return wait.PollUntilContextTimeout(ctx, 500*time.Millisecond, 30*time.Second, false, func(ctx context.Context) (done bool, err error) { - // Check command invocation status - invocationResult, err := svc.GetCommandInvocation(ctx, &ssm.GetCommandInvocationInput{ - CommandId: aws.String(commandId), - InstanceId: aws.String(instanceId), - }) - - var apiErr smithy.APIError - if errors.As(err, &apiErr) && apiErr.ErrorCode() == "InvocationDoesNotExist" { - return false, nil - } - - if err != nil { - log.Errorf("❌ Error getting command invocation for instance %s: %v", instanceId, err) - return false, fmt.Errorf("error getting command invocation for instance %s: %v", instanceId, err) - } - - if *invocationResult.StatusDetails == "Pending" || *invocationResult.StatusDetails == "InProgress" { - log.Debugf("⏳ Instance %s is %s for command %s", instanceId, *invocationResult.StatusDetails, commandId) - return false, nil - } - - if *invocationResult.StatusDetails == "Success" { - log.Debugf("✅ Instance %s command output:\n%s\n", instanceId, *invocationResult.StandardOutputContent) - return true, nil - } else { - log.Errorf("❌ Instance %s command with status %s not successful:\n%s\n", instanceId, *invocationResult.StatusDetails, *invocationResult.StandardErrorContent) - return false, fmt.Errorf("instance %s command failed: %s", instanceId, *invocationResult.StandardErrorContent) - } - }) -} - -func createSecurityGroups(ctx 
context.Context, svc *ec2.Client, subnetID string) (string, error) { - // Describe the subnet to find the VPC ID - describeSubnetsInput := &ec2.DescribeSubnetsInput{ - SubnetIds: []string{subnetID}, - } - - describeSubnetsOutput, err := svc.DescribeSubnets(ctx, describeSubnetsInput) - if err != nil { - return "", fmt.Errorf("failed to describe subnet: %v", err) - } - - if len(describeSubnetsOutput.Subnets) == 0 { - return "", fmt.Errorf("no subnets found with ID: %s", subnetID) - } - - vpcID := describeSubnetsOutput.Subnets[0].VpcId - - // Create the security group - createSGInput := &ec2.CreateSecurityGroupInput{ - Description: aws.String("EC2 security group allowing all HTTPS outgoing traffic"), - GroupName: aws.String(fmt.Sprintf("EC2-security-group-nc-%s", subnetID)), - VpcId: vpcID, - TagSpecifications: []types.TagSpecification{ - { - ResourceType: types.ResourceTypeSecurityGroup, - Tags: []types.Tag{ - { - Key: aws.String("gitpod.io/network-check"), - Value: aws.String("true"), - }, - }, - }, - }, - } - - createSGOutput, err := svc.CreateSecurityGroup(ctx, createSGInput) - if err != nil { - log.Fatalf("Failed to create security group: %v", err) - } - - sgID := createSGOutput.GroupId - log.Infof("ℹ️ Created security group with ID: %s", *sgID) - - // Authorize HTTPS outbound traffic - authorizeEgressInput := &ec2.AuthorizeSecurityGroupEgressInput{ - GroupId: sgID, - IpPermissions: []types.IpPermission{ - { - IpProtocol: aws.String("tcp"), - FromPort: aws.Int32(443), - ToPort: aws.Int32(443), - IpRanges: []types.IpRange{ - { - CidrIp: aws.String("0.0.0.0/0"), - Description: aws.String("Allow all outbound HTTPS traffic"), - }, - }, - }, - }, - } - - _, err = svc.AuthorizeSecurityGroupEgress(ctx, authorizeEgressInput) - if err != nil { - log.Fatalf("Failed to authorize security group egress: %v", err) - } - - return *sgID, nil -} - -func createIAMRoleAndAttachPolicy(ctx context.Context, svc *iam.Client) (*iam_types.Role, error) { - // Define the trust relationship 
- trustPolicy := `{ - "Version": "2012-10-17", - "Statement": [{ - "Effect": "Allow", - "Principal": {"Service": "ec2.amazonaws.com"}, - "Action": "sts:AssumeRole" - }] - }` - - // Create the role - createRoleOutput, err := svc.CreateRole(ctx, &iam.CreateRoleInput{ - RoleName: aws.String(gitpodRoleName), - AssumeRolePolicyDocument: aws.String(trustPolicy), - Tags: networkCheckTag, - }) - if err != nil { - return nil, fmt.Errorf("creating IAM role: %w", err) - } - - // Attach the policy - _, err = svc.AttachRolePolicy(ctx, &iam.AttachRolePolicyInput{ - RoleName: aws.String(gitpodRoleName), - PolicyArn: aws.String("arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"), - }) - if err != nil { - return nil, fmt.Errorf("attaching policy to role: %w", err) - } - - return createRoleOutput.Role, nil -} - -func createInstanceProfileAndAttachRole(ctx context.Context, svc *iam.Client, roleName string) (*iam_types.InstanceProfile, error) { - // Create instance profile - instanceProfileOutput, err := svc.CreateInstanceProfile(ctx, &iam.CreateInstanceProfileInput{ - InstanceProfileName: aws.String(gitpodInstanceProfile), - Tags: networkCheckTag, - }) - if err != nil { - return nil, fmt.Errorf("creating instance profile: %w", err) - } - - // Add role to instance profile - _, err = svc.AddRoleToInstanceProfile(ctx, &iam.AddRoleToInstanceProfileInput{ - InstanceProfileName: aws.String(gitpodInstanceProfile), - RoleName: aws.String(roleName), - }) - if err != nil { - return nil, fmt.Errorf("adding role to instance profile: %w", err) - } - - return instanceProfileOutput.InstanceProfile, nil + return nil } -func getPreferredInstanceType(ctx context.Context, svc *ec2.Client) (types.InstanceType, error) { - instanceTypes := []types.InstanceType{ - types.InstanceTypeT2Micro, - types.InstanceTypeT3aMicro, - types.InstanceTypeT3Micro, - } - for _, instanceType := range instanceTypes { - exists, err := instanceTypeExists(ctx, svc, instanceType) - if err != nil { - return "", err - } - if 
exists { - return instanceType, nil +func Filter[T comparable](slice []T, f func(T) bool) []T { + var result []T + for _, v := range slice { + if f(v) { + result = append(result, v) } } - return "", fmt.Errorf("no preferred instance type available in region: %s", networkConfig.AwsRegion) -} - -func instanceTypeExists(ctx context.Context, svc *ec2.Client, instanceType types.InstanceType) (bool, error) { - input := &ec2.DescribeInstanceTypeOfferingsInput{ - Filters: []types.Filter{ - { - Name: aws.String("instance-type"), - Values: []string{string(instanceType)}, - }, - }, - LocationType: types.LocationTypeRegion, - } - - resp, err := svc.DescribeInstanceTypeOfferings(ctx, input) - if err != nil { - return false, err - } - - return len(resp.InstanceTypeOfferings) > 0, nil -} - -// ConnectivityTestResult represents the results of DNS and network connectivity tests -type ConnectivityTestResult struct { - IPAddresses []string -} - -// TestServiceConnectivity tests both DNS resolution and TCP connectivity given a hostname -func TestServiceConnectivity(ctx context.Context, hostname string, timeout time.Duration) (*ConnectivityTestResult, error) { - result := &ConnectivityTestResult{} - - ips, err := net.DefaultResolver.LookupIPAddr(ctx, hostname) - if err != nil { - return result, fmt.Errorf("DNS resolution failed: %w", err) - } - for _, ip := range ips { - result.IPAddresses = append(result.IPAddresses, ip.String()) - } - if len(result.IPAddresses) == 0 { - return result, fmt.Errorf("no IP addresses found for hostname: %s", hostname) - } - dialer := net.Dialer{Timeout: timeout} - conn, err := dialer.DialContext(ctx, "tcp", fmt.Sprintf("%s:443", result.IPAddresses[0])) - if err != nil { - return result, fmt.Errorf("TCP connection failed: %w", err) - } - defer conn.Close() - - return result, nil + return result } diff --git a/gitpod-network-check/cmd/cleanup.go b/gitpod-network-check/cmd/cleanup.go index c6b04cb..f2ba1e8 100644 --- a/gitpod-network-check/cmd/cleanup.go +++ 
b/gitpod-network-check/cmd/cleanup.go @@ -1,26 +1,37 @@ package cmd import ( - "github.com/aws/aws-sdk-go-v2/service/ec2" - "github.com/aws/aws-sdk-go-v2/service/iam" + "fmt" + + log "github.com/sirupsen/logrus" "github.com/spf13/cobra" + + "github.com/gitpod-io/enterprise-deployment-toolkit/gitpod-network-check/pkg/runner" ) var cleanCommand = &cobra.Command{ // nolint:gochecknoglobals - PersistentPreRunE: validateSubnets, - Use: "clean", - Short: "Explicitly cleans up after the network check diagnosis", - SilenceUsage: false, + Use: "clean", + Short: "Explicitly cleans up after the network check diagnosis", + SilenceUsage: false, RunE: func(cmd *cobra.Command, args []string) error { - cfg, err := initAwsConfig(cmd.Context(), networkConfig.AwsRegion) + ctx := cmd.Context() + + log.Infof("ℹ️ Running cleanup") + runner, err := runner.LoadRunnerFromTags(ctx, Flags.RunnerType, &NetworkConfig) if err != nil { - return err + return fmt.Errorf("❌ failed to create test runner: %v", err) } - ec2Client := ec2.NewFromConfig(cfg) - iamClient := iam.NewFromConfig(cfg) + err = runner.Cleanup(ctx) + if err != nil { + return fmt.Errorf("❌ failed to cleanup: %v", err) + } + log.Infof("✅ Cleanup done") - cleanup(cmd.Context(), ec2Client, iamClient) return nil }, } + +func init() { + NetworkCheckCmd.AddCommand(cleanCommand) +} diff --git a/gitpod-network-check/cmd/common.go b/gitpod-network-check/cmd/common.go deleted file mode 100644 index 2f1bf0f..0000000 --- a/gitpod-network-check/cmd/common.go +++ /dev/null @@ -1,191 +0,0 @@ -package cmd - -import ( - "context" - "time" - - "github.com/aws/aws-sdk-go-v2/aws" - "github.com/aws/aws-sdk-go-v2/config" - "github.com/aws/aws-sdk-go-v2/service/ec2" - "github.com/aws/aws-sdk-go-v2/service/ec2/types" - "github.com/aws/aws-sdk-go-v2/service/iam" - iam_types "github.com/aws/aws-sdk-go-v2/service/iam/types" - log "github.com/sirupsen/logrus" -) - -// this will be useful when we are cleaning up things at the end -var ( - InstanceIds []string 
- SecurityGroups []string - Roles []string - InstanceProfile string - Subnets map[string]bool -) - -const gitpodRoleName = "GitpodNetworkCheck" -const gitpodInstanceProfile = "GitpodNetworkCheck" - -var networkCheckTag = []iam_types.Tag{ - { - Key: aws.String("gitpod.io/network-check"), - Value: aws.String("true"), - }, -} - -func initAwsConfig(ctx context.Context, region string) (aws.Config, error) { - return config.LoadDefaultConfig(ctx, config.WithRegion(region)) -} - -func cleanup(ctx context.Context, svc *ec2.Client, iamsvc *iam.Client) { - if len(InstanceIds) == 0 { - instances, err := svc.DescribeInstances(ctx, &ec2.DescribeInstancesInput{ - Filters: []types.Filter{ - { - Name: aws.String("tag:gitpod.io/network-check"), - Values: []string{"true"}, - }, - { - Name: aws.String("instance-state-name"), - Values: []string{"pending", "running", "shutting-down", "stopping", "stopped"}, - }, - }, - }) - if err != nil { - log.WithError(err).Error("Failed to list instances, please cleanup instances manually") - } else if len(instances.Reservations) == 0 { - log.Info("No instances found.") - } - - if instances != nil { - for _, r := range instances.Reservations { - for _, i := range r.Instances { - InstanceIds = append(InstanceIds, *i.InstanceId) - } - } - } - } - - if len(InstanceIds) > 0 { - log.Info("ℹ️ Terminating EC2 instances") - _, err := svc.TerminateInstances(ctx, &ec2.TerminateInstancesInput{ - InstanceIds: InstanceIds, - }) - if err != nil { - log.WithError(err).WithField("instanceIds", InstanceIds).Warnf("Failed to cleanup instances, please cleanup manually") - } - - terminateWaiter := ec2.NewInstanceTerminatedWaiter(svc, func(itwo *ec2.InstanceTerminatedWaiterOptions) { - itwo.MaxDelay = 15 * time.Second - itwo.MinDelay = 5 * time.Second - }) - log.Info("ℹ️ Waiting for EC2 instances to Terminate (times out in 5 minutes)") - err = terminateWaiter.Wait(ctx, &ec2.DescribeInstancesInput{InstanceIds: InstanceIds}, *aws.Duration(5 * time.Minute)) - if err != nil 
{ - log.WithError(err).Warn("Failed to wait for instances to terminate") - log.Warn("ℹ️ Waiting 2 minutes so network interfaces are deleted") - time.Sleep(2 * time.Minute) - } else { - log.Info("✅ Instances terminated") - } - } - - if len(Roles) == 0 { - paginator := iam.NewListInstanceProfilesPaginator(iamsvc, &iam.ListInstanceProfilesInput{}) - for paginator.HasMorePages() { - output, err := paginator.NextPage(ctx) - if err != nil { - log.WithError(err).Warn("Failed to list roles, please cleanup manually") - break - } - - for _, ip := range output.InstanceProfiles { - if *ip.InstanceProfileName == gitpodInstanceProfile { - { - InstanceProfile = *ip.InstanceProfileName - if len(ip.Roles) > 0 { - for _, role := range ip.Roles { - Roles = append(Roles, *role.RoleName) - } - } - } - } - } - } - if len(Roles) == 0 { - log.Info("No roles found.") - } - } - - if len(Roles) > 0 { - for _, role := range Roles { - _, err := iamsvc.DetachRolePolicy(ctx, &iam.DetachRolePolicyInput{PolicyArn: aws.String("arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"), RoleName: aws.String(role)}) - if err != nil { - log.WithError(err).WithField("rolename", role).Warnf("Failed to cleanup role, please cleanup manually") - } - - _, err = iamsvc.RemoveRoleFromInstanceProfile(ctx, &iam.RemoveRoleFromInstanceProfileInput{ - RoleName: aws.String(role), - InstanceProfileName: aws.String(InstanceProfile), - }) - if err != nil { - log.WithError(err).WithField("roleName", role).WithField("profileName", InstanceProfile).Warnf("Failed to remove role from instance profile") - } - - _, err = iamsvc.DeleteRole(ctx, &iam.DeleteRoleInput{RoleName: aws.String(role)}) - if err != nil { - log.WithError(err).WithField("rolename", role).Warnf("Failed to cleanup role, please cleanup manaullay") - continue - } - - log.Infof("✅ Role '%v' deleted", role) - } - - _, err := iamsvc.DeleteInstanceProfile(ctx, &iam.DeleteInstanceProfileInput{ - InstanceProfileName: aws.String(InstanceProfile), - }) - - if err != nil 
{ - log.WithError(err).WithField("instanceProfile", InstanceProfile).Warnf("Failed to clean up instance profile, please cleanup manually") - } - - log.Info("✅ Instance profile deleted") - } - - if len(SecurityGroups) == 0 { - securityGroups, err := svc.DescribeSecurityGroups(ctx, &ec2.DescribeSecurityGroupsInput{ - Filters: []types.Filter{ - { - Name: aws.String("tag:gitpod.io/network-check"), - Values: []string{"true"}, - }, - }, - }) - - if err != nil { - log.WithError(err).Error("Failed to list security groups, please cleanup manually") - } else if len(securityGroups.SecurityGroups) == 0 { - log.Info("No security groups found.") - } - - if securityGroups != nil { - for _, sg := range securityGroups.SecurityGroups { - SecurityGroups = append(SecurityGroups, *sg.GroupId) - } - } - } - - if len(SecurityGroups) > 0 { - for _, sg := range SecurityGroups { - deleteSGInput := &ec2.DeleteSecurityGroupInput{ - GroupId: aws.String(sg), - } - - _, err := svc.DeleteSecurityGroup(ctx, deleteSGInput) - if err != nil { - log.WithError(err).WithField("securityGroup", sg).Warnf("Failed to clean up security group, please cleanup manually") - continue - } - log.Infof("✅ Security group '%v' deleted", sg) - } - } -} diff --git a/gitpod-network-check/cmd/lambda_handler.go b/gitpod-network-check/cmd/lambda_handler.go new file mode 100644 index 0000000..6110fa9 --- /dev/null +++ b/gitpod-network-check/cmd/lambda_handler.go @@ -0,0 +1,45 @@ +package cmd + +import ( + "fmt" + "os" + + "github.com/aws/aws-lambda-go/lambda" + log "github.com/sirupsen/logrus" + "github.com/spf13/cobra" + + "github.com/gitpod-io/enterprise-deployment-toolkit/gitpod-network-check/pkg/runner" +) + +// lambdaHandlerCmd is the Cobra command invoked when the binary is run with the "lambda-handler" argument. +// This happens inside the AWS Lambda environment via the bootstrap script. 
+var lambdaHandlerCmd = &cobra.Command{ + Use: "lambda-handler", + Short: "Internal command used by AWS Lambda runtime to execute network checks", + Hidden: true, // Hide this command from user help output + PersistentPreRun: func(cmd *cobra.Command, args []string) { + // override parent, as we don't care about the config or other flags when run by lambda + // Ensure logs go to stderr (Lambda standard) + log.SetOutput(os.Stderr) + // Optionally set log level from env var if needed, e.g., os.Getenv("LOG_LEVEL") + // Consider setting a default level appropriate for Lambda execution. + log.SetLevel(log.InfoLevel) // Example: Set a default level + }, + RunE: func(cmd *cobra.Command, args []string) error { + // The aws-lambda-go library takes over execution when lambda.Start is called. + // It handles reading events, invoking the handler, and writing responses. + log.Info("Lambda Handler: Starting AWS Lambda handler loop.") + lambda.Start(runner.HandleLambdaEvent) + // lambda.Start blocks and never returns unless there's a critical error during initialization + log.Error("Lambda Handler: lambda.Start returned unexpectedly (should not happen)") + return fmt.Errorf("lambda.Start returned unexpectedly") + }, + // Disable flag parsing for this internal command as input comes from Lambda event payload + DisableFlagParsing: true, +} + +func init() { + // Register the hidden lambda handler command + // It's invoked by the Lambda runtime via the bootstrap script + NetworkCheckCmd.AddCommand(lambdaHandlerCmd) +} diff --git a/gitpod-network-check/cmd/root.go b/gitpod-network-check/cmd/root.go index f433da6..ecebb0c 100644 --- a/gitpod-network-check/cmd/root.go +++ b/gitpod-network-check/cmd/root.go @@ -11,41 +11,80 @@ import ( "github.com/spf13/pflag" "github.com/spf13/viper" + + "github.com/gitpod-io/enterprise-deployment-toolkit/gitpod-network-check/pkg/checks" + "github.com/gitpod-io/enterprise-deployment-toolkit/gitpod-network-check/pkg/runner" ) -type NetworkConfig struct { 
- LogLevel string - CfgFile string - AwsRegion string - Destroy bool - Cleanup bool - - MainSubnets []string - PodSubnets []string - HttpsHosts []string - InstanceAMI string - ApiEndpoint string -} +// NetworkConfig holds the application configuration, populated from flags/config file +var NetworkConfig = checks.NetworkConfig{LogLevel: "INFO"} + +// Flags holds parsed flag values +var Flags = struct { + // Variable to store the testsets flag value + SelectedTestsets []string -var networkConfig = NetworkConfig{LogLevel: "INFO"} + // Variable to store the runner flag value + RunnerTypeStr string -var networkCheckCmd = &cobra.Command{ // nolint:gochecknoglobals - PersistentPreRunE: configLogger, + RunnerType runner.RunnerType +}{} + +// NetworkCheckCmd is the root command for the application +var NetworkCheckCmd = &cobra.Command{ // nolint:gochecknoglobals + PersistentPreRunE: preRunE, Use: "gitpod-network-check", Short: "CLI to check if your network is setup correctly to deploy Gitpod", SilenceUsage: false, } -func configLogger(cmd *cobra.Command, args []string) error { - lvl, err := log.ParseLevel(networkConfig.LogLevel) +func preRunE(cmd *cobra.Command, args []string) error { + // setup logger + lvl, err := log.ParseLevel(NetworkConfig.LogLevel) if err != nil { - log.WithField("log-level", networkConfig.CfgFile).Fatal("incorrect log level") - - return fmt.Errorf("incorrect log level") + return fmt.Errorf("❌ incorrect log level: %v", err) } log.SetLevel(lvl) - log.WithField("log-level", networkConfig.CfgFile).Debug("log level configured") + log.WithField("log-level", NetworkConfig.CfgFile).Debug("log level configured") + + // Log the effective configuration after setup and binding (Moved from init) + log.Infof("ℹ️ Running with region `%s`, main subnet `%v`, pod subnet `%v`, hosts `%v`, ami `%v`, and API endpoint `%v`", NetworkConfig.AwsRegion, NetworkConfig.MainSubnets, NetworkConfig.PodSubnets, NetworkConfig.HttpsHosts, NetworkConfig.InstanceAMI, 
NetworkConfig.ApiEndpoint) + + // validate the config + err = validateSubnetsConfig(cmd, args) + if err != nil { + return fmt.Errorf("❌ incorrect subnets: %v", err) + } + + err = validateRunnerFlag(cmd, args) + if err != nil { + return fmt.Errorf("❌ incorrect runner: %v", err) // Update error message context + } + + return nil +} + +func validateSubnetsConfig(cmd *cobra.Command, args []string) error { + if len(NetworkConfig.MainSubnets) < 1 { + return fmt.Errorf("At least one Main subnet needs to be specified: %v", NetworkConfig.MainSubnets) + } + log.Info("✅ Main Subnets are valid") + if len(NetworkConfig.PodSubnets) < 1 { + return fmt.Errorf("At least one Pod subnet needs to be specified: %v", NetworkConfig.PodSubnets) + } + log.Info("✅ Pod Subnets are valid") + + return nil +} + +func validateRunnerFlag(cmd *cobra.Command, args []string) error { + // Validate runnerType + runnerType, err := runner.ValidateRunnerType(Flags.RunnerTypeStr) + if err != nil { + return err + } + Flags.RunnerType = runnerType return nil } @@ -80,27 +119,34 @@ func bindFlags(cmd *cobra.Command, v *viper.Viper) { func init() { v := readConfigFile() - networkCheckCmd.PersistentFlags().StringVar(&networkConfig.CfgFile, "log-level", + NetworkCheckCmd.PersistentFlags().StringVar(&NetworkConfig.LogLevel, "log-level", "info", "set log level verbosity (options: debug, info, error, warning)") - networkCheckCmd.PersistentFlags().StringVar(&networkConfig.CfgFile, "config", "", "config file "+ + NetworkCheckCmd.PersistentFlags().StringVar(&NetworkConfig.CfgFile, "config", "", "config file "+ "(default is ./gitpod-network-check.yaml)") - networkCheckCmd.PersistentFlags().StringVar(&networkConfig.AwsRegion, "region", "eu-central-1", "AWS Region to create the cell in") - networkCheckCmd.PersistentFlags().StringSliceVar(&networkConfig.MainSubnets, "main-subnets", []string{}, "List of main subnets") - networkCheckCmd.PersistentFlags().StringSliceVar(&networkConfig.PodSubnets, "pod-subnets", []string{}, 
"List of pod subnets") - networkCheckCmd.PersistentFlags().StringSliceVar(&networkConfig.HttpsHosts, "https-hosts", []string{}, "Hosts to test for outbound HTTPS connectivity") - networkCheckCmd.PersistentFlags().StringVar(&networkConfig.InstanceAMI, "instance-ami", "", "Custom ec2 instance AMI id, if not set will use latest ubuntu") - networkCheckCmd.PersistentFlags().StringVar(&networkConfig.ApiEndpoint, "api-endpoint", "", "The Gitpod Enterprise control plane's regional API endpoint subdomain") - bindFlags(networkCheckCmd, v) - log.Infof("ℹ️ Running with region `%s`, main subnet `%v`, pod subnet `%v`, hosts `%v`, ami `%v`, and API endpoint `%v`", networkConfig.AwsRegion, networkConfig.MainSubnets, networkConfig.PodSubnets, networkConfig.HttpsHosts, networkConfig.InstanceAMI, networkConfig.ApiEndpoint) + NetworkCheckCmd.PersistentFlags().StringVar(&NetworkConfig.AwsRegion, "region", "eu-central-1", "AWS Region to create the cell in") + NetworkCheckCmd.PersistentFlags().StringSliceVar(&NetworkConfig.MainSubnets, "main-subnets", []string{}, "List of main subnets") + NetworkCheckCmd.PersistentFlags().StringSliceVar(&NetworkConfig.PodSubnets, "pod-subnets", []string{}, "List of pod subnets") + NetworkCheckCmd.PersistentFlags().StringSliceVar(&NetworkConfig.HttpsHosts, "https-hosts", []string{}, "Hosts to test for outbound HTTPS connectivity") + NetworkCheckCmd.PersistentFlags().StringVar(&NetworkConfig.InstanceAMI, "instance-ami", "", "Custom ec2 instance AMI id, if not set will use latest ubuntu") + NetworkCheckCmd.PersistentFlags().StringVar(&NetworkConfig.ApiEndpoint, "api-endpoint", "", "The Gitpod Enterprise control plane's regional API endpoint subdomain") + testsetOptions := []string{string(checks.TestsetNameAwsServicesApp), string(checks.TestSetNameAwsServicesSubstrate), string(checks.TestSetNameHttpsHosts)} + NetworkCheckCmd.PersistentFlags().StringSliceVar(&Flags.SelectedTestsets, "testsets", testsetOptions, fmt.Sprintf("List of testsets to run (options: 
%v)", testsetOptions)) + // Rename flag, variable, and update help text + NetworkCheckCmd.PersistentFlags().StringVar(&Flags.RunnerTypeStr, "runner", string(runner.RunnerTypeEC2), fmt.Sprintf("Specify the runner for executing tests (default: %s, options: %s, %s, %s)", runner.RunnerTypeEC2, runner.RunnerTypeEC2, runner.RunnerTypeLambda, runner.RunnerTypeLocal)) + // Lambda-specific flags + NetworkCheckCmd.PersistentFlags().StringVar(&NetworkConfig.LambdaRoleArn, "lambda-role-arn", "", "ARN of an existing IAM role to use for Lambda execution (overrides automatic creation/deletion)") + NetworkCheckCmd.PersistentFlags().StringVar(&NetworkConfig.LambdaSecurityGroupID, "lambda-sg-id", "", "ID of an existing Security Group to use for Lambda execution (overrides automatic creation/deletion)") + + bindFlags(NetworkCheckCmd, v) } func readConfigFile() *viper.Viper { v := viper.New() - if networkConfig.CfgFile != "" { + if NetworkConfig.CfgFile != "" { // Use config file from the flag. - v.SetConfigFile(networkConfig.CfgFile) + v.SetConfigFile(NetworkConfig.CfgFile) } else { // Find current directory. 
currentDir := path.Dir("") @@ -131,8 +177,7 @@ func readConfigFile() *viper.Viper { return v } +// Execute runs the root command func Execute() error { - networkCheckCmd.AddCommand(checkCommand) - networkCheckCmd.AddCommand(cleanCommand) - return networkCheckCmd.Execute() + return NetworkCheckCmd.Execute() } diff --git a/gitpod-network-check/gitpod-network-check.yaml b/gitpod-network-check/gitpod-network-check.yaml index 1dd76b4..3099cd9 100644 --- a/gitpod-network-check/gitpod-network-check.yaml +++ b/gitpod-network-check/gitpod-network-check.yaml @@ -1,9 +1,9 @@ log-level: debug # Options: debug, info, warning, error region: eu-central-1 -main-subnets: subnet-0ed211f14362b224f, subnet-041703e62a05d2024 -pod-subnets: subnet-075c44edead3b062f, subnet-06eb311c6b92e0f29 +main-subnets: subnet-017c6a80f4879d851, subnet-0215744d52cd1c01f +pod-subnets: subnet-00a118009d1d572a5, subnet-062288af00ba50d86 https-hosts: accounts.google.com, https://github.com # put your custom ami id here if you want to use it, otherwise it will using latest ubuntu AMI from aws -instance-ami: +#instance-ami: # optional, put your API endpoint regional sub-domain here to test connectivity, like when the execute-api vpc endpoint is not in the same account as Gitpod -api-endpoint: \ No newline at end of file +#api-endpoint: \ No newline at end of file diff --git a/gitpod-network-check/go.mod b/gitpod-network-check/go.mod index 58b2e51..d154b56 100644 --- a/gitpod-network-check/go.mod +++ b/gitpod-network-check/go.mod @@ -1,37 +1,40 @@ -module "github.com/gitpod-io/enterprise-deployment-toolkit/gitpod-network-check" +module github.com/gitpod-io/enterprise-deployment-toolkit/gitpod-network-check -go 1.22.0 - -toolchain go1.22.1 +go 1.23.1 require ( - github.com/aws/aws-sdk-go-v2 v1.25.2 + github.com/aws/aws-lambda-go v1.47.0 + github.com/aws/aws-sdk-go-v2 v1.36.3 github.com/aws/aws-sdk-go-v2/config v1.27.6 + github.com/aws/aws-sdk-go-v2/service/cloudwatchlogs v1.47.1 
github.com/aws/aws-sdk-go-v2/service/ec2 v1.149.4 github.com/aws/aws-sdk-go-v2/service/iam v1.31.1 + github.com/aws/aws-sdk-go-v2/service/lambda v1.71.0 github.com/aws/aws-sdk-go-v2/service/ssm v1.49.1 + github.com/aws/smithy-go v1.22.2 + github.com/google/go-cmp v0.6.0 github.com/sirupsen/logrus v1.9.3 github.com/spf13/cobra v1.8.0 github.com/spf13/pflag v1.0.5 github.com/spf13/viper v1.18.2 golang.org/x/sync v0.5.0 + k8s.io/apimachinery v0.30.0 ) require ( + github.com/aws/aws-sdk-go-v2/aws/protocol/eventstream v1.6.10 // indirect github.com/aws/aws-sdk-go-v2/credentials v1.17.6 // indirect github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.15.2 // indirect - github.com/aws/aws-sdk-go-v2/internal/configsources v1.3.2 // indirect - github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.6.2 // indirect + github.com/aws/aws-sdk-go-v2/internal/configsources v1.3.34 // indirect + github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.6.34 // indirect github.com/aws/aws-sdk-go-v2/internal/ini v1.8.0 // indirect github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.11.1 // indirect github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.11.4 // indirect github.com/aws/aws-sdk-go-v2/service/sso v1.20.1 // indirect github.com/aws/aws-sdk-go-v2/service/ssooidc v1.23.1 // indirect github.com/aws/aws-sdk-go-v2/service/sts v1.28.3 // indirect - github.com/aws/smithy-go v1.20.1 // indirect github.com/fsnotify/fsnotify v1.7.0 // indirect github.com/go-logr/logr v1.4.1 // indirect - github.com/google/go-cmp v0.6.0 // indirect github.com/hashicorp/hcl v1.0.0 // indirect github.com/inconshreveable/mousetrap v1.1.0 // indirect github.com/jmespath/go-jmespath v0.4.0 // indirect @@ -52,9 +55,7 @@ require ( golang.org/x/text v0.14.0 // indirect gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c // indirect gopkg.in/ini.v1 v1.67.0 // indirect - gopkg.in/yaml.v2 v2.4.0 // indirect gopkg.in/yaml.v3 v3.0.1 // indirect - k8s.io/apimachinery v0.30.0 // indirect 
k8s.io/klog/v2 v2.120.1 // indirect k8s.io/utils v0.0.0-20230726121419-3b25d923346b // indirect ) diff --git a/gitpod-network-check/go.sum b/gitpod-network-check/go.sum index bebdaa7..f35bdb2 100644 --- a/gitpod-network-check/go.sum +++ b/gitpod-network-check/go.sum @@ -1,17 +1,23 @@ -github.com/aws/aws-sdk-go-v2 v1.25.2 h1:/uiG1avJRgLGiQM9X3qJM8+Qa6KRGK5rRPuXE0HUM+w= -github.com/aws/aws-sdk-go-v2 v1.25.2/go.mod h1:Evoc5AsmtveRt1komDwIsjHFyrP5tDuF1D1U+6z6pNo= +github.com/aws/aws-lambda-go v1.47.0 h1:0H8s0vumYx/YKs4sE7YM0ktwL2eWse+kfopsRI1sXVI= +github.com/aws/aws-lambda-go v1.47.0/go.mod h1:dpMpZgvWx5vuQJfBt0zqBha60q7Dd7RfgJv23DymV8A= +github.com/aws/aws-sdk-go-v2 v1.36.3 h1:mJoei2CxPutQVxaATCzDUjcZEjVRdpsiiXi2o38yqWM= +github.com/aws/aws-sdk-go-v2 v1.36.3/go.mod h1:LLXuLpgzEbD766Z5ECcRmi8AzSwfZItDtmABVkRLGzg= +github.com/aws/aws-sdk-go-v2/aws/protocol/eventstream v1.6.10 h1:zAybnyUQXIZ5mok5Jqwlf58/TFE7uvd3IAsa1aF9cXs= +github.com/aws/aws-sdk-go-v2/aws/protocol/eventstream v1.6.10/go.mod h1:qqvMj6gHLR/EXWZw4ZbqlPbQUyenf4h82UQUlKc+l14= github.com/aws/aws-sdk-go-v2/config v1.27.6 h1:WmoH1aPrxwcqAZTTnETjKr+fuvqzKd4hRrKxQUiuKP4= github.com/aws/aws-sdk-go-v2/config v1.27.6/go.mod h1:W9RZFF2pL+OhnUSZsQS/eDMWD8v+R+yWgjj3nSlrXVU= github.com/aws/aws-sdk-go-v2/credentials v1.17.6 h1:akhj/nSC6SEx3OmiYGG/7mAyXMem9ZNVVf+DXkikcTk= github.com/aws/aws-sdk-go-v2/credentials v1.17.6/go.mod h1:chJZuJ7TkW4kiMwmldOJOEueBoSkUb4ynZS1d9dhygo= github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.15.2 h1:AK0J8iYBFeUk2Ax7O8YpLtFsfhdOByh2QIkHmigpRYk= github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.15.2/go.mod h1:iRlGzMix0SExQEviAyptRWRGdYNo3+ufW/lCzvKVTUc= -github.com/aws/aws-sdk-go-v2/internal/configsources v1.3.2 h1:bNo4LagzUKbjdxE0tIcR9pMzLR2U/Tgie1Hq1HQ3iH8= -github.com/aws/aws-sdk-go-v2/internal/configsources v1.3.2/go.mod h1:wRQv0nN6v9wDXuWThpovGQjqF1HFdcgWjporw14lS8k= -github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.6.2 h1:EtOU5jsPdIQNP+6Q2C5e3d65NKT1PeCiQk+9OdzO12Q= 
-github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.6.2/go.mod h1:tyF5sKccmDz0Bv4NrstEr+/9YkSPJHrcO7UsUKf7pWM= +github.com/aws/aws-sdk-go-v2/internal/configsources v1.3.34 h1:ZK5jHhnrioRkUNOc+hOgQKlUL5JeC3S6JgLxtQ+Rm0Q= +github.com/aws/aws-sdk-go-v2/internal/configsources v1.3.34/go.mod h1:p4VfIceZokChbA9FzMbRGz5OV+lekcVtHlPKEO0gSZY= +github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.6.34 h1:SZwFm17ZUNNg5Np0ioo/gq8Mn6u9w19Mri8DnJ15Jf0= +github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.6.34/go.mod h1:dFZsC0BLo346mvKQLWmoJxT+Sjp+qcVR1tRVHQGOH9Q= github.com/aws/aws-sdk-go-v2/internal/ini v1.8.0 h1:hT8rVHwugYE2lEfdFE0QWVo81lF7jMrYJVDWI+f+VxU= github.com/aws/aws-sdk-go-v2/internal/ini v1.8.0/go.mod h1:8tu/lYfQfFe6IGnaOdrpVgEL2IrrDOf6/m9RQum4NkY= +github.com/aws/aws-sdk-go-v2/service/cloudwatchlogs v1.47.1 h1:IKznEkCo7L8VHkQ3tC1e50F1eudenoQ7BTHJhMOswtE= +github.com/aws/aws-sdk-go-v2/service/cloudwatchlogs v1.47.1/go.mod h1:uo14VBn5cNk/BPGTPz3kyLBxgpgOObgO8lmz+H7Z4Ck= github.com/aws/aws-sdk-go-v2/service/ec2 v1.149.4 h1:CGKvG+I3Qxly4jkKm3OCEeVNqcH6xUJ0NCGoqjytoGY= github.com/aws/aws-sdk-go-v2/service/ec2 v1.149.4/go.mod h1:XvYGmTpdybgh+aNRfm+XbnaJdjWXxPXvRPlp7YpTs1A= github.com/aws/aws-sdk-go-v2/service/iam v1.31.1 h1:3l4/wmvUjTbGfk/YJBkKub4cVbDdvJ9YMOQmopXc2T8= @@ -20,6 +26,8 @@ github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.11.1 h1:EyBZibR github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.11.1/go.mod h1:JKpmtYhhPs7D97NL/ltqz7yCkERFW5dOlHyVl66ZYF8= github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.11.4 h1:jRiWxyuVO8PlkN72wDMVn/haVH4SDCBkUt0Lf/dxd7s= github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.11.4/go.mod h1:Ru7vg1iQ7cR4i7SZ/JTLYN9kaXtbL69UdgG0OQWQxW0= +github.com/aws/aws-sdk-go-v2/service/lambda v1.71.0 h1:8PjrcaqDZKar6ivI8c6vwNADOURebrRZQms3SxggRgU= +github.com/aws/aws-sdk-go-v2/service/lambda v1.71.0/go.mod h1:c27kk10S36lBYgbG1jR3opn4OAS5Y/4wjJa1GiHK/X4= 
github.com/aws/aws-sdk-go-v2/service/ssm v1.49.1 h1:MeYuN4Ld4FWVJb9ZiOJkon7/foj0Zm2GTDorSaInHj4= github.com/aws/aws-sdk-go-v2/service/ssm v1.49.1/go.mod h1:TM0pqkfTRMVtsMlPnOivUmrZSIANsLbq9FTm4oJPcPQ= github.com/aws/aws-sdk-go-v2/service/sso v1.20.1 h1:utEGkfdQ4L6YW/ietH7111ZYglLJvS+sLriHJ1NBJEQ= @@ -28,8 +36,8 @@ github.com/aws/aws-sdk-go-v2/service/ssooidc v1.23.1 h1:9/GylMS45hGGFCcMrUZDVayQ github.com/aws/aws-sdk-go-v2/service/ssooidc v1.23.1/go.mod h1:YjAPFn4kGFqKC54VsHs5fn5B6d+PCY2tziEa3U/GB5Y= github.com/aws/aws-sdk-go-v2/service/sts v1.28.3 h1:TkiFkSVX990ryWIMBCT4kPqZEgThQe1xPU/AQXavtvU= github.com/aws/aws-sdk-go-v2/service/sts v1.28.3/go.mod h1:xYNauIUqSuvzlPVb3VB5no/n48YGhmlInD3Uh0Co8Zc= -github.com/aws/smithy-go v1.20.1 h1:4SZlSlMr36UEqC7XOyRVb27XMeZubNcBNN+9IgEPIQw= -github.com/aws/smithy-go v1.20.1/go.mod h1:krry+ya/rV9RDcV/Q16kpu6ypI4K2czasz0NC3qS14E= +github.com/aws/smithy-go v1.22.2 h1:6D9hW43xKFrRx/tXXfAlIZc4JI+yQe6snnWcQyxSyLQ= +github.com/aws/smithy-go v1.22.2/go.mod h1:irrKGvNn1InZwb2d7fkIRNucdfwR8R+Ts3wxYa/cJHg= github.com/cpuguy83/go-md2man/v2 v2.0.3/go.mod h1:tgQtvFlXSQOSOSIRvRPT7W67SCa46tRHOmNcaadrF8o= github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38= github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38= @@ -108,8 +116,6 @@ golang.org/x/exp v0.0.0-20231110203233-9a3e6036ecaa/go.mod h1:zk2irFbV9DP96SEBUU golang.org/x/sync v0.5.0 h1:60k92dhOjHxJkrqnwsfl8KuaHbn/5dl0lUPUklKo3qE= golang.org/x/sync v0.5.0/go.mod h1:Czt+wKu1gCyEFDUtn0jG5QVvpJ6rzVqr5aXyt9drQfk= golang.org/x/sys v0.0.0-20220715151400-c0bba94af5f8/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= -golang.org/x/sys v0.15.0 h1:h48lPFYpsTvQJZF4EKyI4aLHaev3CxivZmv7yZig9pc= -golang.org/x/sys v0.15.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA= golang.org/x/sys v0.18.0 h1:DBdB3niSjOA/O0blCZBqDefyWNYveAYMNF1Wum0DYQ4= golang.org/x/sys v0.18.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA= 
golang.org/x/text v0.14.0 h1:ScX5w1eTa3QqT8oi6+ziP7dTV1S2+ALU0bI+0zXKWiQ= diff --git a/gitpod-network-check/pkg/checks/types.go b/gitpod-network-check/pkg/checks/types.go new file mode 100644 index 0000000..9c19722 --- /dev/null +++ b/gitpod-network-check/pkg/checks/types.go @@ -0,0 +1,138 @@ +package checks + +import ( + "fmt" + "net/url" + "strings" + + log "github.com/sirupsen/logrus" +) + +type NetworkConfig struct { + LogLevel string + CfgFile string + AwsRegion string + Destroy bool + Cleanup bool + + MainSubnets []string + PodSubnets []string + HttpsHosts []string + InstanceAMI string + ApiEndpoint string + + // Lambda-specific configuration + LambdaRoleArn string + LambdaSecurityGroupID string +} + +func (nc *NetworkConfig) GetAllSubnets() []Subnet { + var subnets []Subnet + for _, subnet := range nc.MainSubnets { + subnets = append(subnets, Subnet{SubnetID: subnet, Type: SubnetTypeMain}) + } + for _, subnet := range nc.PodSubnets { + subnets = append(subnets, Subnet{SubnetID: subnet, Type: SubnetTypePod}) + } + return subnets +} + +type TestsetName string + +const ( + TestsetNameAwsServicesApp TestsetName = "aws-services-app" + TestSetNameAwsServicesSubstrate TestsetName = "aws-services-substrate" + TestSetNameHttpsHosts TestsetName = "https-hosts" +) + +type SubnetType string + +const ( + SubnetTypeMain SubnetType = "main" + SubnetTypePod SubnetType = "pod" +) + +type Subnet struct { + SubnetID string + Type SubnetType +} + +func (s Subnet) String() string { + return fmt.Sprintf("%s (%s)", s.SubnetID, s.Type) +} + +func SubnetsFromIDs(subnets []string, typ SubnetType) []Subnet { + var result []Subnet + for _, subnetID := range subnets { + result = append(result, Subnet{SubnetID: subnetID, Type: typ}) + } + return result +} + +type Subnets []Subnet + +func (sns Subnets) String() string { + var result []string + for _, subnet := range sns { + result = append(result, subnet.String()) + } + return strings.Join(result, ", ") +} + +// TODO(gpl) We should 
re-consider the assignment of the subnet type to a test-set. For BYON, it's actually only all from main only. +type TestSet func(networkConfig *NetworkConfig) (endpoints map[string]string, subnetType SubnetType) + +var TestSets = map[TestsetName]TestSet{ + TestsetNameAwsServicesApp: func(networkConfig *NetworkConfig) (map[string]string, SubnetType) { + return map[string]string{ + "SSM": fmt.Sprintf("https://ssm.%s.amazonaws.com", networkConfig.AwsRegion), + "SSMmessages": fmt.Sprintf("https://ssmmessages.%s.amazonaws.com", networkConfig.AwsRegion), + "Autoscaling": fmt.Sprintf("https://autoscaling.%s.amazonaws.com", networkConfig.AwsRegion), + "CloudFormation": fmt.Sprintf("https://cloudformation.%s.amazonaws.com", networkConfig.AwsRegion), + "EC2": fmt.Sprintf("https://ec2.%s.amazonaws.com", networkConfig.AwsRegion), + "EC2messages": fmt.Sprintf("https://ec2messages.%s.amazonaws.com", networkConfig.AwsRegion), + "EKS": fmt.Sprintf("https://eks.%s.amazonaws.com", networkConfig.AwsRegion), + "Elastic LoadBalancing": fmt.Sprintf("https://elasticloadbalancing.%s.amazonaws.com", networkConfig.AwsRegion), + "Kinesis Firehose": fmt.Sprintf("https://firehose.%s.amazonaws.com", networkConfig.AwsRegion), + "KMS": fmt.Sprintf("https://kms.%s.amazonaws.com", networkConfig.AwsRegion), + "CloudWatch": fmt.Sprintf("https://logs.%s.amazonaws.com", networkConfig.AwsRegion), + "SecretsManager": fmt.Sprintf("https://secretsmanager.%s.amazonaws.com", networkConfig.AwsRegion), + "Sts": fmt.Sprintf("https://sts.%s.amazonaws.com", networkConfig.AwsRegion), + "ECR Api": fmt.Sprintf("https://api.ecr.%s.amazonaws.com", networkConfig.AwsRegion), + "ECR": fmt.Sprintf("https://869456089606.dkr.ecr.%s.amazonaws.com", networkConfig.AwsRegion), + }, SubnetTypeMain + }, + TestSetNameAwsServicesSubstrate: func(networkConfig *NetworkConfig) (map[string]string, SubnetType) { + endpoints := map[string]string{ + "S3": fmt.Sprintf("https://s3.%s.amazonaws.com", networkConfig.AwsRegion), + "DynamoDB": 
fmt.Sprintf("https://dynamodb.%s.amazonaws.com", networkConfig.AwsRegion), + } + if networkConfig.ApiEndpoint != "" { + endpoints["ExecuteAPI"] = fmt.Sprintf("https://%s.execute-api.%s.amazonaws.com", networkConfig.ApiEndpoint, networkConfig.AwsRegion) + } else { + log.Warnf("🚧 No execute-api endpoint provided, skipping test") + } + return endpoints, SubnetTypeMain + }, + TestSetNameHttpsHosts: func(networkConfig *NetworkConfig) (map[string]string, SubnetType) { + endpoints := map[string]string{} + for _, v := range networkConfig.HttpsHosts { + host := strings.TrimSpace(v) + parsedUrl, err := url.Parse(host) + if err != nil { + log.Warnf("🚧 Invalid Host: %s, skipping due to error: %v", host, err) + continue + } + + if parsedUrl.Scheme == "" { + endpoints[host] = fmt.Sprintf("https://%s", host) + } else if parsedUrl.Scheme == "https" { + endpoints[host] = parsedUrl.Host + } else { + log.Warnf("🚧 Unsupported scheme: %s, skipping test for %s", parsedUrl.Scheme, host) + continue + } + } + return endpoints, SubnetTypeMain + }, +} diff --git a/gitpod-network-check/pkg/lambda_types/types.go b/gitpod-network-check/pkg/lambda_types/types.go new file mode 100644 index 0000000..7f730ef --- /dev/null +++ b/gitpod-network-check/pkg/lambda_types/types.go @@ -0,0 +1,17 @@ +package lambda_types + +// CheckRequest defines the input structure for the Lambda function. +type CheckRequest struct { + Endpoints map[string]string `json:"endpoints"` // Map of service name -> URL +} + +// CheckResult defines the result for a single endpoint check. +type CheckResult struct { + Success bool `json:"success"` + Error string `json:"error,omitempty"` +} + +// CheckResponse defines the output structure for the Lambda function. 
+type CheckResponse struct { + Results map[string]CheckResult `json:"results"` // Map of service name -> CheckResult +} diff --git a/gitpod-network-check/pkg/runner/common.go b/gitpod-network-check/pkg/runner/common.go new file mode 100644 index 0000000..9139de1 --- /dev/null +++ b/gitpod-network-check/pkg/runner/common.go @@ -0,0 +1,112 @@ +package runner + +import ( + "context" + "fmt" + "maps" + "slices" + + "github.com/aws/aws-sdk-go-v2/aws" + "github.com/aws/aws-sdk-go-v2/config" + ec2_types "github.com/aws/aws-sdk-go-v2/service/ec2/types" + iam_types "github.com/aws/aws-sdk-go-v2/service/iam/types" + + "github.com/gitpod-io/enterprise-deployment-toolkit/gitpod-network-check/pkg/checks" +) + +type RunnerType string + +const ( + RunnerTypeEC2 RunnerType = "ec2" + RunnerTypeLambda RunnerType = "lambda" + RunnerTypeLocal RunnerType = "local" +) + +var validRunnerType = map[string]bool{ + string(RunnerTypeLambda): true, + string(RunnerTypeEC2): true, + string(RunnerTypeLocal): true, +} + +func ValidateRunnerType(runnerStr string) (RunnerType, error) { + if _, ok := validRunnerType[runnerStr]; ok { + return RunnerType(runnerStr), nil + } + return "", fmt.Errorf("invalid runner: %s, must be one of: %v", runnerStr, slices.Collect(maps.Keys(validRunnerType))) +} + +type TestRunner interface { + Prepare(ctx context.Context) error + TestService(ctx context.Context, subnets []checks.Subnet, serviceEndpoints map[string]string) (bool, error) + Cleanup(ctx context.Context) error +} + +func NewRunner(ctx context.Context, mode RunnerType, config *checks.NetworkConfig) (TestRunner, error) { + switch mode { + case RunnerTypeEC2: + return NewEC2TestRunner(ctx, config) + case RunnerTypeLocal: + return NewLocalTestRunner(), nil + case RunnerTypeLambda: + return NewLambdaTestRunner(ctx, config) + default: + // Update error message + return nil, fmt.Errorf("invalid runner: %s, must be one of: %v", mode, slices.Collect(maps.Keys(validRunnerType))) + } +} + +// 
Creates a new TestRunner instance, loading existing resources from the AWS account by known name/tags. +// This is useful for cleaning up left-over resources from previous runs. +func LoadRunnerFromTags(ctx context.Context, mode RunnerType, networkConfig *checks.NetworkConfig) (TestRunner, error) { + switch mode { + case RunnerTypeEC2: + return LoadEC2RunnerFromTags(ctx, networkConfig) + case RunnerTypeLambda: + return LoadLambdaRunnerFromTags(ctx, networkConfig) // Call the new function + case RunnerTypeLocal: + // Local mode does not require any AWS resources, so we can just return a new instance. + return NewLocalTestRunner(), nil + default: + // Update error message + return nil, fmt.Errorf("invalid runner: %s, must be one of: %v", mode, slices.Collect(maps.Keys(validRunnerType))) + } +} + +// AWS stuff +func initAwsConfig(ctx context.Context, region string) (aws.Config, error) { + return config.LoadDefaultConfig(ctx, config.WithRegion(region)) +} + +const ( + // NetworkCheckTagKey is the tag key used to identify network check resources + // in AWS. + NetworkCheckTagKey = "gitpod.io/network-check" + // NetworkCheckTagValue is the tag value used to identify network check resources + // in AWS. 
+ NetworkCheckTagValue = "true" +) + +var NetworkCheckTags = map[string]string{ + NetworkCheckTagKey: NetworkCheckTagValue, +} + +var NetworkCheckIamTags = []iam_types.Tag{ + { + Key: aws.String(NetworkCheckTagKey), + Value: aws.String(NetworkCheckTagValue), + }, +} + +var NetworkCheckEC2Tags = []ec2_types.Tag{ + { + Key: aws.String(NetworkCheckTagKey), + Value: aws.String(NetworkCheckTagValue), + }, +} + +var NetworkCheckTagsFilter = []ec2_types.Filter{ + { + Name: aws.String(fmt.Sprintf("tag:%s", NetworkCheckTagKey)), + Values: []string{NetworkCheckTagValue}, + }, +} diff --git a/gitpod-network-check/pkg/runner/ec2-runner.go b/gitpod-network-check/pkg/runner/ec2-runner.go new file mode 100644 index 0000000..da3f441 --- /dev/null +++ b/gitpod-network-check/pkg/runner/ec2-runner.go @@ -0,0 +1,867 @@ +package runner + +import ( + "context" + "encoding/base64" + "errors" + "fmt" + "maps" + "net" + "slices" + "sort" + "strings" + "time" + + "github.com/aws/aws-sdk-go-v2/aws" + "github.com/aws/aws-sdk-go-v2/service/ec2" + "github.com/aws/aws-sdk-go-v2/service/ec2/types" + "github.com/aws/aws-sdk-go-v2/service/iam" + iam_types "github.com/aws/aws-sdk-go-v2/service/iam/types" + "github.com/aws/aws-sdk-go-v2/service/ssm" + "github.com/aws/smithy-go" + log "github.com/sirupsen/logrus" + "golang.org/x/sync/errgroup" + "k8s.io/apimachinery/pkg/util/wait" + + "github.com/gitpod-io/enterprise-deployment-toolkit/gitpod-network-check/pkg/checks" +) + +const gitpodRoleName = "GitpodNetworkCheck" +const gitpodInstanceProfile = "GitpodNetworkCheck" + +type EC2TestRunner struct { + networkConfig *checks.NetworkConfig + + ec2Client *ec2.Client + ssmClient *ssm.Client + iamClient *iam.Client + + roles []string + securityGroups []string + instanceProfile *iam_types.InstanceProfile + instanceIds map[string]string +} + +func NewEC2TestRunner(ctx context.Context, networkConfig *checks.NetworkConfig) (*EC2TestRunner, error) { + cfg, err := initAwsConfig(ctx, networkConfig.AwsRegion) + if 
err != nil { + return nil, err + } + + ec2Client := ec2.NewFromConfig(cfg) + ssmClient := ssm.NewFromConfig(cfg) + iamClient := iam.NewFromConfig(cfg) + + return &EC2TestRunner{ + networkConfig: networkConfig, + + ec2Client: ec2Client, + ssmClient: ssmClient, + iamClient: iamClient, + + roles: []string{}, + securityGroups: []string{}, + instanceIds: make(map[string]string), + }, nil +} + +// create IAM role, attach policy, instance profile and attach role +func (r *EC2TestRunner) Prepare(ctx context.Context) error { + err := checkSMPrerequisites(ctx, r.networkConfig, r.ec2Client) + if err != nil { + return fmt.Errorf("failed to check prerequisites: %v", err) + } + + // Prepare EC2 instance creation + role, err := createIAMRoleAndAttachPolicy(ctx, r.iamClient) + if err != nil { + return fmt.Errorf("error creating IAM role and attaching policy: %v", err) + } + r.roles = append(r.roles, *role.RoleName) + log.Info("✅ IAM role created and policy attached") + + instanceProfile, err := createInstanceProfileAndAttachRole(ctx, r.iamClient, *role.RoleName) + if err != nil { + return fmt.Errorf("failed to create instance profile: %v", err) + } + r.instanceProfile = instanceProfile + + // Lazy initialization of the EC2 instances + subnets := r.networkConfig.GetAllSubnets() + for _, subnet := range subnets { + _, err := r.ensureEC2Instance(ctx, subnet) + if err != nil { + return err + } + } + log.Infof("✅ EC2 instances launched for subnets: %s", checks.Subnets(subnets).String()) + + return nil +} + +// the ssm-agent requires that ec2messages, ssm and ssmmessages are available +// we check the endpoints here so that if we cannot send commands to the ec2 instance +// in a private setup we know why +func checkSMPrerequisites(ctx context.Context, networkConfig *checks.NetworkConfig, ec2Client *ec2.Client) error { + type vpcEndpointsMap struct { + Endpoint string + PrivateDnsName string + PrivateDnsRequired bool + } + + log.Infof("ℹ️ Checking prerequisites") + vpcEndpoints := 
[]vpcEndpointsMap{ + { + Endpoint: fmt.Sprintf("com.amazonaws.%s.ec2messages", networkConfig.AwsRegion), + PrivateDnsName: fmt.Sprintf("ec2messages.%s.amazonaws.com", networkConfig.AwsRegion), + PrivateDnsRequired: false, + }, + { + Endpoint: fmt.Sprintf("com.amazonaws.%s.ssm", networkConfig.AwsRegion), + PrivateDnsName: fmt.Sprintf("ssm.%s.amazonaws.com", networkConfig.AwsRegion), + PrivateDnsRequired: false, + }, + { + Endpoint: fmt.Sprintf("com.amazonaws.%s.ssmmessages", networkConfig.AwsRegion), + PrivateDnsName: fmt.Sprintf("ssmmessages.%s.amazonaws.com", networkConfig.AwsRegion), + PrivateDnsRequired: false, + }, + { + Endpoint: fmt.Sprintf("com.amazonaws.%s.execute-api", networkConfig.AwsRegion), + PrivateDnsName: fmt.Sprintf("execute-api.%s.amazonaws.com", networkConfig.AwsRegion), + PrivateDnsRequired: true, + }, + } + + var prereqErrs []string + for _, endpoint := range vpcEndpoints { + response, err := ec2Client.DescribeVpcEndpoints(ctx, &ec2.DescribeVpcEndpointsInput{ + Filters: []types.Filter{ + { + Name: aws.String("service-name"), + Values: []string{endpoint.Endpoint}, + }, + }, + }) + + if err != nil { + return err + } + + if len(response.VpcEndpoints) == 0 { + if strings.Contains(endpoint.Endpoint, "execute-api") && networkConfig.ApiEndpoint != "" { + log.Infof("ℹ️ 'api-endpoint' parameter exists, deferring connectivity test for execute-api VPC endpoint until testing main subnet connectivity") + continue + } else if strings.Contains(endpoint.Endpoint, "execute-api") && networkConfig.ApiEndpoint == "" { + errMsg := "Add a VPC endpoint for execute-api in this account or use the 'api-endpoint' parameter to specify a centralized one in another account, and test again" + log.Errorf("❌ %s", errMsg) + prereqErrs = append(prereqErrs, errMsg) + continue + } + _, err := TestServiceConnectivity(ctx, endpoint.PrivateDnsName, 5*time.Second) + if err != nil { + errMsg := fmt.Sprintf("Service %s connectivity test failed: %v\n", endpoint.PrivateDnsName, err) + 
log.Errorf("❌ %s", errMsg) + prereqErrs = append(prereqErrs, errMsg) + } + log.Infof("✅ Service %s has connectivity", endpoint.PrivateDnsName) + } else { + for _, e := range response.VpcEndpoints { + if e.PrivateDnsEnabled != nil && !*e.PrivateDnsEnabled && endpoint.PrivateDnsRequired { + errMsg := fmt.Sprintf("VPC endpoint '%s' has private DNS disabled, it must be enabled", *e.VpcEndpointId) + log.Errorf("❌ %s", errMsg) + prereqErrs = append(prereqErrs, errMsg) + } + } + log.Infof("✅ VPC endpoint %s is configured", endpoint.Endpoint) + } + } + + if len(prereqErrs) > 0 { + return fmt.Errorf("%s", strings.Join(prereqErrs, "; ")) + } + return nil +} + +func (r *EC2TestRunner) ensureEC2Instance(ctx context.Context, subnet checks.Subnet) (string, error) { + launchInstance := func(ctx context.Context, subnet checks.Subnet) (string, error) { + log.Infof("ℹ️ Launching EC2 instance in subnet: %s", subnet.String()) + secGroup, err := createSecurityGroups(ctx, r.ec2Client, subnet.SubnetID) + if err != nil { + return "", fmt.Errorf("failed to create security group for subnet '%v': %v", subnet, err) + } + r.securityGroups = append(r.securityGroups, secGroup) + + instanceType, err := getPreferredInstanceType(ctx, r.ec2Client, r.networkConfig) + if err != nil { + return "", fmt.Errorf("failed to get preferred instance type: %v", err) + } + log.Infof("ℹ️ Instance type %s shall be used", instanceType) + + instanceId, err := launchInstanceInSubnet(ctx, r.ec2Client, subnet.SubnetID, secGroup, r.instanceProfile.Arn, instanceType, r.networkConfig.InstanceAMI) + if err != nil { + return "", fmt.Errorf("Failed to launch instances in subnet %s: %v", subnet, err) + } + return instanceId, nil + } + + if existingInstanceId, exists := r.instanceIds[subnet.SubnetID]; exists { + log.Infof("ℹ️ Instance %s already exists in subnet %s, skipping launch", existingInstanceId, subnet.String()) + return existingInstanceId, nil + } + + instanceId, err := launchInstance(ctx, subnet) + if err != nil { + 
return "", fmt.Errorf("failed to launch instance in subnet %s: %v", subnet.String(), err) + } + r.instanceIds[subnet.SubnetID] = instanceId + log.Infof("ℹ️ Launched instance %s in subnet %s", instanceId, subnet.String()) + + return instanceId, nil +} + +func (r *EC2TestRunner) TestService(ctx context.Context, subnets []checks.Subnet, serviceEndpoints map[string]string) (bool, error) { + // Make sure we have one instance per subnet + instanceIds := []string{} + for _, subnet := range subnets { + instanceId, err := r.ensureEC2Instance(ctx, subnet) + if err != nil { + return false, err + } + instanceIds = append(instanceIds, instanceId) + } + + err := r.checkAllInstancesAvailable(ctx, instanceIds) + if err != nil { + return false, err + } + + // Actually test the service + testResult := r.checkServicesAvailability(ctx, instanceIds, serviceEndpoints) + return testResult, nil +} + +func (r *EC2TestRunner) checkAllInstancesAvailable(ctx context.Context, instanceIds []string) error { + // Wait until all instances are running + log.WithField("instanceIds", instanceIds).Info("ℹ️ Waiting for EC2 instances to become Running (times out in 5 minutes)") + runningWaiter := ec2.NewInstanceRunningWaiter(r.ec2Client, func(irwo *ec2.InstanceRunningWaiterOptions) { + irwo.MaxDelay = 15 * time.Second + irwo.MinDelay = 5 * time.Second + irwo.LogWaitAttempts = true + }) + err := runningWaiter.Wait(ctx, &ec2.DescribeInstancesInput{InstanceIds: instanceIds}, *aws.Duration(5 * time.Minute)) + if err != nil { + return fmt.Errorf("Nodes never got Running: %v", err) + } + + log.Info("✅ EC2 instances are now Running.") + log.Info("ℹ️ Waiting for EC2 instances to become Healthy (times out in 5 minutes)") + waitstatusOK := ec2.NewInstanceStatusOkWaiter(r.ec2Client, func(isow *ec2.InstanceStatusOkWaiterOptions) { + isow.MaxDelay = 15 * time.Second + isow.MinDelay = 5 * time.Second + }) + err = waitstatusOK.Wait(ctx, &ec2.DescribeInstanceStatusInput{InstanceIds: instanceIds}, *aws.Duration(5 * 
time.Minute)) + if err != nil { + return fmt.Errorf("Nodes never got Healthy: %v", err) + } + log.Info("✅ EC2 Instances are now healthy/Ok") + + log.Infof("ℹ️ Connecting to SSM...") + err = ensureSessionManagerIsUp(ctx, r.ssmClient, instanceIds) + if err != nil { + return fmt.Errorf("could not connect to SSM: %w", err) + } + log.Infof("✅ SSM is up and running") + + return nil +} + +func (r *EC2TestRunner) checkServicesAvailability(ctx context.Context, instanceIds []string, serviceEndpoints map[string]string) bool { + services := make([]string, 0, len(serviceEndpoints)) + for service := range serviceEndpoints { + services = append(services, service) + } + sort.Strings(services) + + result := true + for _, service := range services { + err := r.isServiceAvailable(ctx, instanceIds, serviceEndpoints[service]) + if err != nil { + log.Warnf("❌ %s is not available (%s)", service, serviceEndpoints[service]) + log.Info(err) + result = false + } else { + log.Infof("✅ %s is available", service) + } + } + return result +} + +func (r *EC2TestRunner) isServiceAvailable(ctx context.Context, instanceIds []string, serviceUrl string) error { + commandId, err := sendServiceRequest(ctx, r.ssmClient, instanceIds, serviceUrl) + if err != nil { + return fmt.Errorf("Failed to run the command in instances: %v", err) + } + + g, ctx := errgroup.WithContext(context.Background()) + for _, instanceId := range instanceIds { + id := instanceId // Local variable for the closure + g.Go(func() error { + return fetchResultsForInstance(ctx, r.ssmClient, id, commandId) + }) + } + if err := g.Wait(); err != nil { + return fmt.Errorf("Error fetching command results: %v", err) + } + + return nil +} + +func launchInstanceInSubnet(ctx context.Context, ec2Client *ec2.Client, subnetID, secGroupId string, instanceProfileName *string, instanceType types.InstanceType, instanceAMI string) (string, error) { + amiId := "" + if instanceAMI != "" { + customAMIId, err := findCustomAMI(ctx, ec2Client, instanceAMI) + if 
err != nil { + return "", err + } + amiId = customAMIId + } else { + regionalAMI, err := findUbuntuAMI(ctx, ec2Client) + if err != nil { + return "", err + } + amiId = regionalAMI + } + + // Specify the user data script to install the SSM Agent + userData := `#!/bin/bash + sudo systemctl enable snap.amazon-ssm-agent.amazon-ssm-agent.service + sudo systemctl restart snap.amazon-ssm-agent.amazon-ssm-agent.service + ` + + // Encode user data in base64 + userDataEncoded := base64.StdEncoding.EncodeToString([]byte(userData)) + + input := &ec2.RunInstancesInput{ + ImageId: aws.String(amiId), // Example AMI ID, replace with an actual one + InstanceType: instanceType, + MaxCount: aws.Int32(1), + MinCount: aws.Int32(1), + UserData: &userDataEncoded, + SecurityGroupIds: []string{secGroupId}, + SubnetId: aws.String(subnetID), + IamInstanceProfile: &types.IamInstanceProfileSpecification{ + Arn: instanceProfileName, + }, + TagSpecifications: []types.TagSpecification{ + { + ResourceType: types.ResourceTypeInstance, + Tags: NetworkCheckEC2Tags, + }, + }, + } + + var result *ec2.RunInstancesOutput + err := wait.PollUntilContextTimeout(ctx, 500*time.Millisecond, 10*time.Second, false, func(ctx context.Context) (done bool, err error) { + result, err = ec2Client.RunInstances(ctx, input) + + if err != nil { + if strings.Contains(err.Error(), "Invalid IAM Instance Profile ARN") { + return false, nil + } + + return false, err + } + + return true, nil + }) + + if err != nil { + return "", err + } + + if len(result.Instances) == 0 { + return "", fmt.Errorf("instances didn't get created") + } + + return aws.ToString(result.Instances[0].InstanceId), nil +} + +func findCustomAMI(ctx context.Context, client *ec2.Client, amiId string) (string, error) { + input := &ec2.DescribeImagesInput{ + ImageIds: []string{amiId}, + } + + result, err := client.DescribeImages(ctx, input) + if err != nil { + return "", err + } + if len(result.Images) > 0 { + return *result.Images[0].ImageId, nil + } + + 
return "", fmt.Errorf("no custom AMI found") +} + +// findUbuntuAMI searches for the latest Ubuntu AMI in the region of the EC2 client. +func findUbuntuAMI(ctx context.Context, client *ec2.Client) (string, error) { + // You may want to update these filters based on your specific requirements + input := &ec2.DescribeImagesInput{ + Filters: []types.Filter{ + { + Name: aws.String("name"), + Values: []string{"ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*"}, + }, + { + Name: aws.String("virtualization-type"), + Values: []string{"hvm"}, + }, + }, + Owners: []string{"099720109477"}, // Canonical's owner ID + } + + result, err := client.DescribeImages(ctx, input) + if err != nil { + return "", err + } + + // Sort the AMIs by creation date + sort.Slice(result.Images, func(i, j int) bool { + return *result.Images[i].CreationDate > *result.Images[j].CreationDate + }) + + if len(result.Images) > 0 { + return *result.Images[0].ImageId, nil + } + + return "", fmt.Errorf("no Ubuntu AMIs found") +} + +func ensureSessionManagerIsUp(ctx context.Context, ssmClient *ssm.Client, instanceIds []string) error { + err := wait.PollUntilContextTimeout(ctx, 500*time.Millisecond, 2*time.Minute, true, func(ctx context.Context) (done bool, err error) { + _, err = sendCommand(ctx, ssmClient, instanceIds, "echo ssm") + if err != nil { + return false, nil + } + + return true, nil + }) + + if err != nil { + return fmt.Errorf("could not establish connection with SSM: %w", err) + } + + return nil +} + +// Creates a new EC2TestRunner instance, loading existing resources from the AWS account by known name/tags +func LoadEC2RunnerFromTags(ctx context.Context, networkConfig *checks.NetworkConfig) (*EC2TestRunner, error) { + runner, err := NewEC2TestRunner(ctx, networkConfig) + if err != nil { + return nil, fmt.Errorf("failed to create EC2TestRunner: %v", err) + } + svc := runner.ec2Client + + // load instanceIds + instances, err := svc.DescribeInstances(ctx, &ec2.DescribeInstancesInput{ + 
Filters: append(NetworkCheckTagsFilter, types.Filter{ + Name: aws.String("instance-state-name"), + Values: []string{"pending", "running", "shutting-down", "stopping", "stopped"}, + }, + ), + }) + if err != nil { + log.WithError(err).Error("Failed to list instances, please cleanup instances manually") + } else if len(instances.Reservations) == 0 { + log.Info("No instances found.") + } + if instances != nil { + for _, r := range instances.Reservations { + for _, i := range r.Instances { + runner.instanceIds[*i.SubnetId] = *i.InstanceId + } + } + } + + // load roles + paginator := iam.NewListInstanceProfilesPaginator(runner.iamClient, &iam.ListInstanceProfilesInput{}) + for paginator.HasMorePages() { + output, err := paginator.NextPage(ctx) + if err != nil { + log.WithError(err).Warn("Failed to list roles, please cleanup manually") + break + } + + for _, ip := range output.InstanceProfiles { + if *ip.InstanceProfileName == gitpodInstanceProfile { + { + runner.instanceProfile = &ip + if len(ip.Roles) > 0 { + for _, role := range ip.Roles { + runner.roles = append(runner.roles, *role.RoleName) + } + } + } + + } + } + } + if len(runner.roles) == 0 { + log.Info("No roles found.") + } + + // load security groups + securityGroups, err := svc.DescribeSecurityGroups(ctx, &ec2.DescribeSecurityGroupsInput{ + Filters: NetworkCheckTagsFilter, + }) + + if err != nil { + log.WithError(err).Error("Failed to list security groups, please cleanup manually") + } else if len(securityGroups.SecurityGroups) == 0 { + log.Info("No security groups found.") + } + + if securityGroups != nil { + for _, sg := range securityGroups.SecurityGroups { + runner.securityGroups = append(runner.securityGroups, *sg.GroupId) + } + } + + return runner, nil +} + +func (r *EC2TestRunner) Cleanup(ctx context.Context) error { + // delete instances + instanceIds := slices.Collect(maps.Values(r.instanceIds)) + if len(instanceIds) != 0 { + log.Info("ℹ️ Terminating EC2 instances") + _, err := 
r.ec2Client.TerminateInstances(ctx, &ec2.TerminateInstancesInput{ + InstanceIds: instanceIds, + }) + if err != nil { + log.WithError(err).WithField("instanceIds", instanceIds).Warnf("Failed to cleanup instances, please cleanup manually") + } + + terminateWaiter := ec2.NewInstanceTerminatedWaiter(r.ec2Client, func(itwo *ec2.InstanceTerminatedWaiterOptions) { + itwo.MaxDelay = 15 * time.Second + itwo.MinDelay = 5 * time.Second + }) + log.Info("ℹ️ Waiting for EC2 instances to Terminate (times out in 5 minutes)") + err = terminateWaiter.Wait(ctx, &ec2.DescribeInstancesInput{InstanceIds: instanceIds}, *aws.Duration(5 * time.Minute)) + if err != nil { + log.WithError(err).Warn("Failed to wait for instances to terminate") + log.Warn("ℹ️ Waiting 2 minutes so network interfaces are deleted") + time.Sleep(2 * time.Minute) + } else { + log.Info("✅ Instances terminated") + } + } + + // delete roles + instanceProfileName := "" + if r.instanceProfile != nil { + instanceProfileName = *r.instanceProfile.InstanceProfileName + } + + if instanceProfileName != "" { + log.WithField("instanceProfileName", instanceProfileName).Info("ℹ️ Deleting instance profile...") + for _, role := range r.roles { + _, err := r.iamClient.DetachRolePolicy(ctx, &iam.DetachRolePolicyInput{PolicyArn: aws.String("arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"), RoleName: aws.String(role)}) + if err != nil && errorCode(err) != "NoSuchEntity" { + log.WithError(err).WithField("rolename", role).Warnf("Failed to cleanup role, please cleanup manually") + } + + _, err = r.iamClient.RemoveRoleFromInstanceProfile(ctx, &iam.RemoveRoleFromInstanceProfileInput{ + RoleName: aws.String(role), + InstanceProfileName: aws.String(instanceProfileName), + }) + if err != nil { + log.WithError(err).WithField("roleName", role).WithField("instanceProfileName", instanceProfileName).Warnf("Failed to remove role from instance profile") + } + + _, err = r.iamClient.DeleteRole(ctx, &iam.DeleteRoleInput{RoleName: 
aws.String(role)}) + if err != nil && errorCode(err) != "NoSuchEntity" { + log.WithError(err).WithField("rolename", role).Warnf("Failed to cleanup role, please cleanup manually") + continue + } + + log.Infof("✅ Role '%v' deleted", role) + } + + _, err := r.iamClient.DeleteInstanceProfile(ctx, &iam.DeleteInstanceProfileInput{ + InstanceProfileName: aws.String(instanceProfileName), + }) + + if err != nil && errorCode(err) != "NoSuchEntity" { + log.WithError(err).WithField("instanceProfileName", instanceProfileName).Warnf("Failed to clean up instance profile, please cleanup manually") + } else { + log.WithField("instanceProfileName", instanceProfileName).Info("✅ Instance profile deleted") + } + } + + // delete security groups + for _, sg := range r.securityGroups { + deleteSGInput := &ec2.DeleteSecurityGroupInput{ + GroupId: aws.String(sg), + } + + _, err := r.ec2Client.DeleteSecurityGroup(ctx, deleteSGInput) + if err != nil { + log.WithError(err).WithField("securityGroup", sg).Warnf("Failed to clean up security group, please cleanup manually") + continue + } + log.Infof("✅ Security group '%v' deleted", sg) + } + + return nil +} + +func errorCode(err error) string { + var apiErr smithy.APIError + if errors.As(err, &apiErr) { + return apiErr.ErrorCode() + } + return "" +} + +// sendServiceRequest sends a command to an EC2 instance and returns the command ID +func sendServiceRequest(ctx context.Context, svc *ssm.Client, instanceIds []string, serviceUrl string) (string, error) { + return sendCommand(ctx, svc, instanceIds, fmt.Sprintf("curl -m 15 -I %v", serviceUrl)) +} + +func sendCommand(ctx context.Context, svc *ssm.Client, instanceIds []string, command string) (string, error) { + networkTestingCommands := []string{ + command, + } + + result, err := svc.SendCommand(ctx, &ssm.SendCommandInput{ + InstanceIds: instanceIds, + DocumentName: aws.String("AWS-RunShellScript"), + Parameters: map[string][]string{ + "commands": networkTestingCommands, + }, + }) + if err != nil { 
+ return "", fmt.Errorf("error sending command: %v", err) + } + + return *result.Command.CommandId, nil +} + +func fetchResultsForInstance(ctx context.Context, svc *ssm.Client, instanceId, commandId string) error { + return wait.PollUntilContextTimeout(ctx, 500*time.Millisecond, 30*time.Second, false, func(ctx context.Context) (done bool, err error) { + // Check command invocation status + invocationResult, err := svc.GetCommandInvocation(ctx, &ssm.GetCommandInvocationInput{ + CommandId: aws.String(commandId), + InstanceId: aws.String(instanceId), + }) + + var apiErr smithy.APIError + if errors.As(err, &apiErr) && apiErr.ErrorCode() == "InvocationDoesNotExist" { + return false, nil + } + + if err != nil { + log.Errorf("❌ Error getting command invocation for instance %s: %v", instanceId, err) + return false, fmt.Errorf("error getting command invocation for instance %s: %v", instanceId, err) + } + + if *invocationResult.StatusDetails == "Pending" || *invocationResult.StatusDetails == "InProgress" { + log.Debugf("⏳ Instance %s is %s for command %s", instanceId, *invocationResult.StatusDetails, commandId) + return false, nil + } + + if *invocationResult.StatusDetails == "Success" { + log.Debugf("✅ Instance %s command output:\n%s\n", instanceId, *invocationResult.StandardOutputContent) + return true, nil + } else { + log.Errorf("❌ Instance %s command with status %s not successful:\n%s\n", instanceId, *invocationResult.StatusDetails, *invocationResult.StandardErrorContent) + return false, fmt.Errorf("instance %s command failed: %s", instanceId, *invocationResult.StandardErrorContent) + } + }) +} + +func createSecurityGroups(ctx context.Context, svc *ec2.Client, subnetID string) (string, error) { + // Describe the subnet to find the VPC ID + describeSubnetsInput := &ec2.DescribeSubnetsInput{ + SubnetIds: []string{subnetID}, + } + + describeSubnetsOutput, err := svc.DescribeSubnets(ctx, describeSubnetsInput) + if err != nil { + return "", fmt.Errorf("failed to describe 
subnet: %v", err) + } + + if len(describeSubnetsOutput.Subnets) == 0 { + return "", fmt.Errorf("no subnets found with ID: %s", subnetID) + } + + vpcID := describeSubnetsOutput.Subnets[0].VpcId + + // Create the security group + createSGInput := &ec2.CreateSecurityGroupInput{ + Description: aws.String("EC2 security group allowing all HTTPS outgoing traffic"), + GroupName: aws.String(fmt.Sprintf("EC2-security-group-nc-%s", subnetID)), + VpcId: vpcID, + TagSpecifications: []types.TagSpecification{ + { + ResourceType: types.ResourceTypeSecurityGroup, + Tags: NetworkCheckEC2Tags, + }, + }, + } + + createSGOutput, err := svc.CreateSecurityGroup(ctx, createSGInput) + if err != nil { + return "", fmt.Errorf("failed to create security group: %w", err) + } + + sgID := createSGOutput.GroupId + log.Infof("ℹ️ Created security group with ID: %s", *sgID) + + // Authorize HTTPS outbound traffic + authorizeEgressInput := &ec2.AuthorizeSecurityGroupEgressInput{ + GroupId: sgID, + IpPermissions: []types.IpPermission{ + { + IpProtocol: aws.String("tcp"), + FromPort: aws.Int32(443), + ToPort: aws.Int32(443), + IpRanges: []types.IpRange{ + { + CidrIp: aws.String("0.0.0.0/0"), + Description: aws.String("Allow all outbound HTTPS traffic"), + }, + }, + }, + }, + } + + _, err = svc.AuthorizeSecurityGroupEgress(ctx, authorizeEgressInput) + if err != nil { + return "", fmt.Errorf("failed to authorize security group egress: %w", err) + } + + return *sgID, nil +} + +func createIAMRoleAndAttachPolicy(ctx context.Context, svc *iam.Client) (*iam_types.Role, error) { + // Define the trust relationship + trustPolicy := `{ + "Version": "2012-10-17", + "Statement": [{ + "Effect": "Allow", + "Principal": {"Service": "ec2.amazonaws.com"}, + "Action": "sts:AssumeRole" + }] + }` + + // Create the role + createRoleOutput, err := svc.CreateRole(ctx, &iam.CreateRoleInput{ + RoleName: aws.String(gitpodRoleName), + AssumeRolePolicyDocument: aws.String(trustPolicy), + Tags: NetworkCheckIamTags, + }) + if err != nil { + return nil,
fmt.Errorf("creating IAM role: %w", err) + } + + // Attach the policy + _, err = svc.AttachRolePolicy(ctx, &iam.AttachRolePolicyInput{ + RoleName: aws.String(gitpodRoleName), + PolicyArn: aws.String("arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"), + }) + if err != nil { + return nil, fmt.Errorf("attaching policy to role: %w", err) + } + + return createRoleOutput.Role, nil +} + +func createInstanceProfileAndAttachRole(ctx context.Context, svc *iam.Client, roleName string) (*iam_types.InstanceProfile, error) { + // Create instance profile + instanceProfileOutput, err := svc.CreateInstanceProfile(ctx, &iam.CreateInstanceProfileInput{ + InstanceProfileName: aws.String(gitpodInstanceProfile), + Tags: NetworkCheckIamTags, + }) + if err != nil { + return nil, fmt.Errorf("creating instance profile: %w", err) + } + + // Add role to instance profile + _, err = svc.AddRoleToInstanceProfile(ctx, &iam.AddRoleToInstanceProfileInput{ + InstanceProfileName: aws.String(gitpodInstanceProfile), + RoleName: aws.String(roleName), + }) + if err != nil { + return nil, fmt.Errorf("adding role to instance profile: %w", err) + } + + return instanceProfileOutput.InstanceProfile, nil +} + +func getPreferredInstanceType(ctx context.Context, svc *ec2.Client, networkConfig *checks.NetworkConfig) (types.InstanceType, error) { + instanceTypes := []types.InstanceType{ + types.InstanceTypeT2Micro, + types.InstanceTypeT3aMicro, + types.InstanceTypeT3Micro, + } + for _, instanceType := range instanceTypes { + exists, err := instanceTypeExists(ctx, svc, instanceType) + if err != nil { + return "", err + } + if exists { + return instanceType, nil + } + } + return "", fmt.Errorf("no preferred instance type available in region: %s", networkConfig.AwsRegion) +} + +func instanceTypeExists(ctx context.Context, svc *ec2.Client, instanceType types.InstanceType) (bool, error) { + input := &ec2.DescribeInstanceTypeOfferingsInput{ + Filters: []types.Filter{ + { + Name: aws.String("instance-type"), + 
Values: []string{string(instanceType)}, + }, + }, + LocationType: types.LocationTypeRegion, + } + + resp, err := svc.DescribeInstanceTypeOfferings(ctx, input) + if err != nil { + return false, err + } + + return len(resp.InstanceTypeOfferings) > 0, nil +} + +// ConnectivityTestResult represents the results of DNS and network connectivity tests +type ConnectivityTestResult struct { + IPAddresses []string +} + +// TestServiceConnectivity tests both DNS resolution and TCP connectivity given a hostname +func TestServiceConnectivity(ctx context.Context, hostname string, timeout time.Duration) (*ConnectivityTestResult, error) { + result := &ConnectivityTestResult{} + + ips, err := net.DefaultResolver.LookupIPAddr(ctx, hostname) + if err != nil { + return result, fmt.Errorf("DNS resolution failed: %w", err) + } + for _, ip := range ips { + result.IPAddresses = append(result.IPAddresses, ip.String()) + } + if len(result.IPAddresses) == 0 { + return result, fmt.Errorf("no IP addresses found for hostname: %s", hostname) + } + dialer := net.Dialer{Timeout: timeout} + conn, err := dialer.DialContext(ctx, "tcp", net.JoinHostPort(result.IPAddresses[0], "443")) // JoinHostPort brackets IPv6 literals + if err != nil { + return result, fmt.Errorf("TCP connection failed: %w", err) + } + defer conn.Close() + + return result, nil +} diff --git a/gitpod-network-check/pkg/runner/lambda-runner.go b/gitpod-network-check/pkg/runner/lambda-runner.go new file mode 100644 index 0000000..95e1211 --- /dev/null +++ b/gitpod-network-check/pkg/runner/lambda-runner.go @@ -0,0 +1,1455 @@ +package runner + +import ( + "archive/zip" + "context" + "encoding/json" + "errors" + "fmt" + "io" + "net/url" // Added for decoding policy document + "os" + "path/filepath" + "strings" // Added import for string manipulation + "sync" // Added for mutex + "time" + + "github.com/aws/aws-sdk-go-v2/aws" + "github.com/aws/aws-sdk-go-v2/service/cloudwatchlogs" // Added import for log group cleanup + cwltypes
"github.com/aws/aws-sdk-go-v2/service/cloudwatchlogs/types" // Added import for log group cleanup types + "github.com/aws/aws-sdk-go-v2/service/ec2" // Added import + ec2types "github.com/aws/aws-sdk-go-v2/service/ec2/types" // Added import with alias + "github.com/aws/aws-sdk-go-v2/service/iam" + "github.com/aws/aws-sdk-go-v2/service/iam/types" + "github.com/aws/aws-sdk-go-v2/service/lambda" + lambdatypes "github.com/aws/aws-sdk-go-v2/service/lambda/types" // Added import with alias + smithy "github.com/aws/smithy-go" // Added import for API error handling + "golang.org/x/sync/errgroup" // Added for parallel execution + + log "github.com/sirupsen/logrus" + + "github.com/gitpod-io/enterprise-deployment-toolkit/gitpod-network-check/pkg/checks" + "github.com/gitpod-io/enterprise-deployment-toolkit/gitpod-network-check/pkg/lambda_types" // Import shared types +) + +// LambdaTestRunner implements the TestRunner interface using AWS Lambda. +type LambdaTestRunner struct { + awsConfig aws.Config + lambdaClient *lambda.Client + iamClient *iam.Client + ec2Client *ec2.Client // Added EC2 client + cloudwatchlogsClient *cloudwatchlogs.Client // Added CloudWatch Logs client + config *checks.NetworkConfig + + // State managed by Prepare/Cleanup + roleArn *string + securityGroupID *string + functionArns map[string]string // Map subnetID -> function ARN + funcMapMutex sync.Mutex // Mutex to protect functionArns map + codeZipPath string + runID string // Unique ID for this run, used for tagging + tags map[string]string +} + +const ( + lambdaFunctionNamePrefix = "gitpod-network-check-" + lambdaRoleName = "GitpodNetworkCheckLambdaRole" + lambdaSecurityGroupName = "gitpod-network-check-lambda-sg" +) + +// NewLambdaTestRunner creates a new LambdaTestRunner. 
+func NewLambdaTestRunner(ctx context.Context, config *checks.NetworkConfig) (*LambdaTestRunner, error) { + log.Info("Initializing Lambda test runner...") + awsCfg, err := initAwsConfig(ctx, config.AwsRegion) + if err != nil { + return nil, fmt.Errorf("failed to load AWS config: %w", err) + } + + return &LambdaTestRunner{ + awsConfig: awsCfg, + lambdaClient: lambda.NewFromConfig(awsCfg), + iamClient: iam.NewFromConfig(awsCfg), + ec2Client: ec2.NewFromConfig(awsCfg), // Initialize EC2 client + cloudwatchlogsClient: cloudwatchlogs.NewFromConfig(awsCfg), // Initialize CloudWatch Logs client + config: config, + functionArns: make(map[string]string), + runID: fmt.Sprintf("%d", time.Now().Unix()), // Use seconds for shorter runID + tags: NetworkCheckTags, + }, nil +} + +// Prepare sets up the necessary AWS resources (IAM role, Security Group, Lambda functions). +// It returns an error if any step fails, relying on the caller to invoke Cleanup. +func (r *LambdaTestRunner) Prepare(ctx context.Context) (err error) { // Named return err for easier deferred cleanup + log.Info("Lambda Runner: Prepare phase starting...") + var createdRole bool + var createdSG bool + + // Note: Cleanup on error is now handled by the caller invoking the Cleanup method. + // The named return 'err' ensures that any error encountered below is returned. + + // 1. Get or Create IAM Role + roleArn, createdRole, err := r.getOrCreateLambdaRole(ctx) // Modified to return creation status + if err != nil { + return fmt.Errorf("failed to get or create IAM role: %w", err) // Error is captured by named return + } + r.roleArn = roleArn + log.Infof("✅ Using IAM Role ARN: %s (Created: %t)", *r.roleArn, createdRole) + + // 2. 
Package Lambda Code + var zipPath string + zipPath, err = r.packageLambdaCode(ctx) + if err != nil { + return fmt.Errorf("failed to package lambda code: %w", err) // Error captured by named return + } + r.codeZipPath = zipPath + log.Infof("✅ Packaged Lambda code to: %s", r.codeZipPath) + // Defer cleanup of the zip file + defer func() { + if r.codeZipPath != "" { + log.Debugf("Removing temporary zip file: %s", r.codeZipPath) + _ = os.Remove(r.codeZipPath) // Best effort removal + } + }() + + // 3. Get or Create Security Group + // We need the VPC ID first. Assume all subnets are in the same VPC. + var vpcID *string + vpcID, err = r.getVpcIDFromSubnets(ctx) + if err != nil { + return fmt.Errorf("failed to determine VPC ID from subnets: %w", err) // Error captured by named return + } + var sgID *string + sgID, createdSG, err = r.getOrCreateSecurityGroup(ctx, vpcID) // Modified to return creation status + if err != nil { + return fmt.Errorf("failed to get or create security group: %w", err) // Error captured by named return + } + r.securityGroupID = sgID + log.Infof("✅ Using Security Group ID: %s (Created: %t)", *r.securityGroupID, createdSG) + + // 4. 
Deploy Lambda Function(s) + log.Info("Deploying Lambda functions...") + targetSubnets := r.config.GetAllSubnets() // Get all configured subnets + if len(targetSubnets) == 0 { + err = fmt.Errorf("no subnets configured for Lambda deployment") // Assign to named return + return err + } + + var zipContent []byte + zipContent, err = os.ReadFile(r.codeZipPath) + if err != nil { + err = fmt.Errorf("failed to read packaged lambda code zip %s: %w", r.codeZipPath, err) // Assign to named return + return err + } + + // Deploy one function per unique subnet ID in parallel + var eg errgroup.Group + uniqueSubnets := make(map[string]checks.Subnet) // Store unique cleaned subnet IDs and original struct + + log.Debugf("Identifying unique subnets for deployment...") + for _, subnet := range targetSubnets { + // Keep only characters valid in a subnet ID (lowercase letters, digits, '-') + var sb strings.Builder + for _, ch := range subnet.SubnetID { + if (ch >= 'a' && ch <= 'z') || (ch >= '0' && ch <= '9') || ch == '-' { + sb.WriteRune(ch) + } + } + cleanSubnetID := sb.String() + + // Basic validation after cleaning + if !strings.HasPrefix(cleanSubnetID, "subnet-") || len(cleanSubnetID) < 8 { // Basic sanity check + log.Warnf("Invalid subnet ID format after cleaning: '%s' (Original: '%s').
Skipping.", cleanSubnetID, subnet.SubnetID) + // Don't immediately return error, just skip this one + continue + } + + if _, exists := uniqueSubnets[cleanSubnetID]; !exists { + log.Debugf("Adding unique subnet %s to deployment list (Original: %s)", cleanSubnetID, subnet.SubnetID) + uniqueSubnets[cleanSubnetID] = subnet // Store original struct with cleaned ID as key + } else { + log.Debugf("Subnet %s already in deployment list (Original: %s)", cleanSubnetID, subnet.SubnetID) + } + } + + if len(uniqueSubnets) == 0 { + // This might happen if all input subnet IDs were invalid after cleaning + return fmt.Errorf("no valid subnets found to deploy Lambda functions into after cleaning input") + } + + log.Infof("Starting parallel deployment for %d unique subnets...", len(uniqueSubnets)) + + for cleanSubnetID := range uniqueSubnets { + // Capture loop variable for goroutine + currentCleanSubnetID := cleanSubnetID + + eg.Go(func() error { + functionName := fmt.Sprintf("%s%s-%s", lambdaFunctionNamePrefix, currentCleanSubnetID, r.runID) + // Check function name length again + if len(functionName) > 64 { + log.Errorf("❌ Generated function name '%s' is too long (%d chars > 64) for subnet ID '%s'. 
Skipping.", functionName, len(functionName), currentCleanSubnetID) + // Return error from goroutine + return fmt.Errorf("generated function name too long: %s", functionName) + } + + log.Infof("Deploying Lambda function '%s' for subnet %s", functionName, currentCleanSubnetID) + + createInput := &lambda.CreateFunctionInput{ + FunctionName: aws.String(functionName), + Role: r.roleArn, + Code: &lambdatypes.FunctionCode{ZipFile: zipContent}, + Handler: aws.String("bootstrap"), // The name of our script + Runtime: lambdatypes.RuntimeProvidedal2, // Use the provided runtime + Description: aws.String(fmt.Sprintf("Gitpod Network Check function for subnet %s (RunID: %s)", currentCleanSubnetID, r.runID)), + Timeout: aws.Int32(30), // 30 seconds timeout, adjust as needed + MemorySize: aws.Int32(256), // 256 MB (Lambda's minimum is 128 MB) + Publish: true, // Publish the first version + VpcConfig: &lambdatypes.VpcConfig{ + SubnetIds: []string{currentCleanSubnetID}, // Use captured loop variable + SecurityGroupIds: []string{*r.securityGroupID}, + }, + Tags: NetworkCheckTags, // Use exported var + // Architectures field might not be needed for provided.al2, but keeping x86_64 is safe. + Architectures: []lambdatypes.Architecture{lambdatypes.ArchitectureX8664}, + } + + // Use a goroutine-local err: assigning to Prepare's named return from several goroutines would be a data race + createOutput, err := r.lambdaClient.CreateFunction(ctx, createInput) + if err != nil { + // Use currentCleanSubnetID in error message + return fmt.Errorf("failed to create lambda function %s for subnet %s: %w", functionName, currentCleanSubnetID, err) + } + log.Infof("Lambda function %s created with ARN: %s. 
Waiting for it to become active...", functionName, *createOutput.FunctionArn) + + // Wait for the function to become active (moved to helper) + waitErr := r.waitForLambdaActive(ctx, createOutput.FunctionArn) + if waitErr != nil { + log.Errorf("❌ Error waiting for Lambda function %s (Subnet: %s) to become active: %v", *createOutput.FunctionArn, currentCleanSubnetID, waitErr) + // Return error from goroutine + return fmt.Errorf("error waiting for lambda %s to become active: %w", *createOutput.FunctionArn, waitErr) + } + + // Store the ARN safely + r.funcMapMutex.Lock() + r.functionArns[currentCleanSubnetID] = *createOutput.FunctionArn // Use cleaned subnet ID as map key + r.funcMapMutex.Unlock() + + return nil // Goroutine finished successfully + }) + } + + // Wait for all goroutines to finish and collect the first error + if err = eg.Wait(); err != nil { + return fmt.Errorf("one or more errors occurred during parallel Lambda deployment: %w", err) + } + + log.Info("Lambda Runner: Prepare phase completed successfully.") + return nil +} + +// waitForLambdaActive polls the Lambda function until it becomes active or times out. +func (r *LambdaTestRunner) waitForLambdaActive(ctx context.Context, functionArn *string) error { + const maxWaitTime = 2 * time.Minute // Reduced timeout slightly as multiple waits run in parallel + const pollInterval = 5 * time.Second + startTime := time.Now() + + for { + getFuncInput := &lambda.GetFunctionInput{ + FunctionName: functionArn, + } + getFuncOutput, err := r.lambdaClient.GetFunction(ctx, getFuncInput) + if err != nil { + // If the function is not found immediately after creation, it might be an eventual consistency issue. Retry. 
+ log.Warnf("Error getting function %s status (will retry): %v", *functionArn, err) + } else if getFuncOutput.Configuration != nil && getFuncOutput.Configuration.State == lambdatypes.StateActive { + log.Infof("✅ Lambda function %s is now active.", *functionArn) + return nil // Function is active + } else if getFuncOutput.Configuration != nil && (getFuncOutput.Configuration.State == lambdatypes.StateFailed || getFuncOutput.Configuration.State == lambdatypes.StateInactive) { + // Handle terminal failure states + stateReason := "Unknown reason" + if getFuncOutput.Configuration.StateReason != nil { + stateReason = *getFuncOutput.Configuration.StateReason + } + log.Errorf("❌ Lambda function %s entered terminal state %s (%s).", *functionArn, getFuncOutput.Configuration.State, stateReason) + return fmt.Errorf("lambda function %s failed to become active, entered state %s: %s", *functionArn, getFuncOutput.Configuration.State, stateReason) + } else { + // Still pending or other state, continue waiting + currentState := "Unknown" + if getFuncOutput.Configuration != nil && getFuncOutput.Configuration.State != "" { + currentState = string(getFuncOutput.Configuration.State) + } + log.Infof("Lambda function %s state is %s, waiting...", *functionArn, currentState) + } + + if time.Since(startTime) > maxWaitTime { + log.Errorf("❌ Timed out waiting for Lambda function %s to become active after %v.", *functionArn, maxWaitTime) + return fmt.Errorf("timed out waiting for lambda function %s to become active", *functionArn) + } + + // Check context cancellation + select { + case <-ctx.Done(): + log.Warnf("Context cancelled while waiting for Lambda function %s to become active.", *functionArn) + return ctx.Err() + case <-time.After(pollInterval): + // Continue loop + } + } +} + +// packageLambdaCode finds the current executable, creates a bootstrap script, and zips them. 
+func (r *LambdaTestRunner) packageLambdaCode(ctx context.Context) (string, error) { + zipFileName := fmt.Sprintf("lambda-gpnwc-%s.zip", r.runID) + bootstrapScriptName := "bootstrap" + executableName := "gitpod-network-check" // Name of the binary inside the zip + + // Find the path of the currently running executable + exePath, err := os.Executable() + if err != nil { + return "", fmt.Errorf("failed to get current executable path: %w", err) + } + log.Infof("Using current executable for Lambda package: %s", exePath) + + // Create a temporary directory for staging files + tempDir, err := os.MkdirTemp("", "lambda-pkg-") + if err != nil { + return "", fmt.Errorf("failed to create temp dir for packaging: %w", err) + } + defer os.RemoveAll(tempDir) // Clean up temp dir afterwards + log.Debugf("Created temporary staging directory: %s", tempDir) + + // Define bootstrap script content + bootstrapContent := fmt.Sprintf(`#!/bin/sh +set -e +echo "Bootstrap: Running %s lambda-handler" >&2 +./%s lambda-handler +`, executableName, executableName) + + // Write bootstrap script to temp dir + bootstrapPath := filepath.Join(tempDir, bootstrapScriptName) + err = os.WriteFile(bootstrapPath, []byte(bootstrapContent), 0755) // rwxr-xr-x permissions + if err != nil { + return "", fmt.Errorf("failed to write bootstrap script %s: %w", bootstrapPath, err) + } + log.Infof("Created bootstrap script: %s", bootstrapPath) + + // Copy executable to temp dir with the target name + destExePath := filepath.Join(tempDir, executableName) + log.Debugf("Copying executable from %s to %s", exePath, destExePath) + sourceFile, err := os.Open(exePath) + if err != nil { + return "", fmt.Errorf("failed to open source executable %s: %w", exePath, err) + } + defer sourceFile.Close() + + destFile, err := os.OpenFile(destExePath, os.O_RDWR|os.O_CREATE|os.O_TRUNC, 0755) // rwxr-xr-x + if err != nil { + return "", fmt.Errorf("failed to create destination executable %s: %w", destExePath, err) + } + defer 
destFile.Close() + + _, err = io.Copy(destFile, sourceFile) + if err != nil { + return "", fmt.Errorf("failed to copy executable: %w", err) + } + log.Infof("Copied executable to: %s", destExePath) + + // Create the zip archive + finalZipPath := filepath.Join(".", zipFileName) // Place final zip in CWD + log.Infof("Creating zip archive: %s", finalZipPath) + zipFile, err := os.Create(finalZipPath) + if err != nil { + return "", fmt.Errorf("failed to create zip file %s: %w", finalZipPath, err) + } + defer zipFile.Close() + + zipWriter := zip.NewWriter(zipFile) + defer zipWriter.Close() + + // Add files from tempDir to zip + filesToZip := []string{bootstrapScriptName, executableName} + for _, filename := range filesToZip { + filePath := filepath.Join(tempDir, filename) + log.Debugf("Adding %s to zip archive", filePath) + + fileToZip, err := os.Open(filePath) + if err != nil { + return "", fmt.Errorf("failed to open file %s for zipping: %w", filePath, err) + } + defer fileToZip.Close() + + info, err := fileToZip.Stat() + if err != nil { + return "", fmt.Errorf("failed to stat file %s: %w", filePath, err) + } + + header, err := zip.FileInfoHeader(info) + if err != nil { + return "", fmt.Errorf("failed to create zip header for %s: %w", filename, err) + } + // Use base name (filename) in zip archive's root + header.Name = filename + header.Method = zip.Deflate // Use compression + + writer, err := zipWriter.CreateHeader(header) + if err != nil { + return "", fmt.Errorf("failed to create zip writer for %s: %w", filename, err) + } + + _, err = io.Copy(writer, fileToZip) + if err != nil { + return "", fmt.Errorf("failed to copy file %s to zip: %w", filename, err) + } + } + + log.Info("Lambda code zipped successfully.") + return finalZipPath, nil +} + +// getOrCreateLambdaRole finds or creates the necessary IAM role for the Lambda function, +// respecting the LambdaRoleArn config if provided. 
+// Returns the Role ARN, a boolean indicating if the role was created in this call, and an error. +func (r *LambdaTestRunner) getOrCreateLambdaRole(ctx context.Context) (*string, bool, error) { + // Check if a specific Role ARN is provided in the config + if r.config.LambdaRoleArn != "" { + roleArnString := r.config.LambdaRoleArn + log.Infof("Using pre-configured Lambda IAM Role ARN: %s", roleArnString) + // Extract role name from ARN for GetRole/UpdateAssumeRolePolicy calls + arnParts := strings.Split(roleArnString, "/") + if len(arnParts) < 2 { + return nil, false, fmt.Errorf("invalid pre-configured role ARN format: %s", roleArnString) + } + roleName := arnParts[len(arnParts)-1] + log.Debugf("Extracted role name from provided ARN: %s", roleName) + + // Validate the role exists and check/update its trust policy + getRoleInput := &iam.GetRoleInput{RoleName: aws.String(roleName)} + getRoleOutput, err := r.iamClient.GetRole(ctx, getRoleInput) + if err != nil { + var nsee *types.NoSuchEntityException + if errors.As(err, &nsee) { + return nil, false, fmt.Errorf("pre-configured IAM role %s (ARN: %s) not found: %w", roleName, roleArnString, err) + } + return nil, false, fmt.Errorf("failed to get pre-configured IAM role %s: %w", roleName, err) + } + + // Check and potentially update the trust policy + policyUpdated, err := r.ensureLambdaTrustPolicy(ctx, getRoleOutput.Role) + if err != nil { + return nil, false, fmt.Errorf("failed to ensure trust policy for pre-configured role %s: %w", roleName, err) + } + if policyUpdated { + log.Infof("Updated trust policy for pre-configured role %s. 
Adding delay for propagation...", roleName) + time.Sleep(10 * time.Second) // Delay after updating policy + } + + return aws.String(roleArnString), false, nil // Not created by us, but possibly updated + } + + // No specific ARN provided, proceed with get-or-create logic for the managed role + roleName := lambdaRoleName // New declaration; the earlier roleName was scoped to the if block above + log.Infof("Checking for managed IAM role: %s", roleName) + + // Try to get the role first + getRoleInput := &iam.GetRoleInput{ + RoleName: aws.String(roleName), + } + getRoleOutput, err := r.iamClient.GetRole(ctx, getRoleInput) + if err == nil { + roleArn := getRoleOutput.Role.Arn + log.Infof("Found existing managed IAM role: %s", *roleArn) + + // Check and potentially update the trust policy for the existing managed role + policyUpdated, updateErr := r.ensureLambdaTrustPolicy(ctx, getRoleOutput.Role) + if updateErr != nil { + // Log the error but don't fail: the role may still be usable, or the caller may lack permission to update it. + log.WithError(updateErr).Warnf("Failed to ensure trust policy for existing managed role %s. Proceeding cautiously.", roleName) + } else if policyUpdated { + log.Infof("Updated trust policy for existing managed role %s. Adding delay for propagation...", roleName) + time.Sleep(10 * time.Second) // Delay after updating policy + } + + // TODO: Optionally verify/update tags or policies on existing role? 
+ return roleArn, false, nil // Found, not created now, but possibly updated + } + + // Handle specific error: NoSuchEntityException means we need to create it + var nsee *types.NoSuchEntityException + if !errors.As(err, &nsee) { + return nil, false, fmt.Errorf("failed to get IAM role %s: %w", roleName, err) + } + + // Role doesn't exist, create it + log.Infof("IAM role %s not found, creating...", roleName) + + assumeRolePolicy := map[string]interface{}{ + "Version": "2012-10-17", + "Statement": []map[string]interface{}{ + { + "Effect": "Allow", + "Principal": map[string]string{ + "Service": "lambda.amazonaws.com", + }, + "Action": "sts:AssumeRole", + }, + }, + } + assumeRolePolicyBytes, _ := json.Marshal(assumeRolePolicy) // Error handling omitted for brevity + + createRoleInput := &iam.CreateRoleInput{ + RoleName: aws.String(roleName), + AssumeRolePolicyDocument: aws.String(string(assumeRolePolicyBytes)), + Description: aws.String("Role for Gitpod Network Check Lambda functions"), + Tags: NetworkCheckIamTags, + } + + createRoleOutput, err := r.iamClient.CreateRole(ctx, createRoleInput) + if err != nil { + return nil, false, fmt.Errorf("failed to create IAM role %s: %w", roleName, err) + } + log.Infof("Created IAM role: %s", *createRoleOutput.Role.Arn) + + // Attach required managed policies + policies := []string{ + "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole", + "arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole", + } + for _, policyArn := range policies { + log.Infof("Attaching policy %s to role %s", policyArn, roleName) + attachPolicyInput := &iam.AttachRolePolicyInput{ + RoleName: aws.String(roleName), + PolicyArn: aws.String(policyArn), + } + _, err := r.iamClient.AttachRolePolicy(ctx, attachPolicyInput) + if err != nil { + // Don't attempt cleanup here, caller invoking Cleanup() is responsible + log.Warnf("Failed to attach policy %s: %v. 
Role %s might be left in an incomplete state.", policyArn, err, roleName) + return nil, true, fmt.Errorf("failed to attach policy %s to role %s: %w", policyArn, roleName, err) // Created but failed config + } + } + + log.Infof("Successfully created and configured IAM role %s", roleName) + + // Add delay after creating role for IAM propagation + log.Info("Adding delay after IAM role creation for propagation...") + time.Sleep(10 * time.Second) + + return createRoleOutput.Role.Arn, true, nil // Created successfully +} + +// ensureLambdaTrustPolicy checks if the role's trust policy allows lambda.amazonaws.com +// and updates it if necessary. Returns true if the policy was updated. +func (r *LambdaTestRunner) ensureLambdaTrustPolicy(ctx context.Context, role *types.Role) (bool, error) { + if role == nil || role.AssumeRolePolicyDocument == nil { + return false, fmt.Errorf("role or AssumeRolePolicyDocument is nil") + } + + // AWS policy documents are URL encoded + decodedPolicy, err := url.QueryUnescape(*role.AssumeRolePolicyDocument) + if err != nil { + return false, fmt.Errorf("failed to decode assume role policy document: %w", err) + } + + var policyDoc map[string]interface{} + err = json.Unmarshal([]byte(decodedPolicy), &policyDoc) + if err != nil { + return false, fmt.Errorf("failed to unmarshal assume role policy document: %w", err) + } + + statements, ok := policyDoc["Statement"].([]interface{}) + if !ok { + return false, fmt.Errorf("invalid policy document structure: 'Statement' is not an array") + } + + lambdaPrincipalFound := false + for _, stmtInterface := range statements { + stmt, ok := stmtInterface.(map[string]interface{}) + if !ok { + continue // Skip invalid statements + } + + // Check Effect is Allow + if effect, ok := stmt["Effect"].(string); !ok || effect != "Allow" { + continue + } + + // Check Action contains sts:AssumeRole + actionFound := false + switch action := stmt["Action"].(type) { + case string: + if action == "sts:AssumeRole" || action == 
"*" { + actionFound = true + } + case []interface{}: + for _, act := range action { + if actStr, ok := act.(string); ok && (actStr == "sts:AssumeRole" || actStr == "*") { + actionFound = true + break + } + } + } + if !actionFound { + continue + } + + // Check Principal contains lambda.amazonaws.com + principal, ok := stmt["Principal"].(map[string]interface{}) + if !ok { + continue + } + service, ok := principal["Service"] + if !ok { + continue + } + + switch srv := service.(type) { + case string: + if srv == "lambda.amazonaws.com" { + lambdaPrincipalFound = true + break // Found it in this statement + } + case []interface{}: + for _, s := range srv { + if sStr, ok := s.(string); ok && sStr == "lambda.amazonaws.com" { + lambdaPrincipalFound = true + break // Found it in the list + } + } + } + if lambdaPrincipalFound { + break // Found it in the statements array + } + } + + if lambdaPrincipalFound { + log.Debugf("Role %s already has lambda.amazonaws.com in its trust policy.", *role.RoleName) + return false, nil // Already has the correct trust policy + } + + // Lambda principal not found, need to update the policy + log.Warnf("Role %s is missing 'lambda.amazonaws.com' in its trust policy. Attempting to update.", *role.RoleName) + + // Construct the new policy document - simplest approach is to overwrite with the standard one + // A more robust approach would merge, but this is likely sufficient for this tool's managed role. + newAssumeRolePolicy := map[string]interface{}{ + "Version": "2012-10-17", + "Statement": []map[string]interface{}{ + { + "Effect": "Allow", + "Principal": map[string]string{ + "Service": "lambda.amazonaws.com", + }, + "Action": "sts:AssumeRole", + }, + // Add other principals if they existed? For now, just ensure Lambda is there. + // If the original policy had other services, this will remove them. + // Consider fetching, merging, and then updating if preserving others is critical. 
+ }, + } + newAssumeRolePolicyBytes, _ := json.Marshal(newAssumeRolePolicy) + + updateInput := &iam.UpdateAssumeRolePolicyInput{ + RoleName: role.RoleName, + PolicyDocument: aws.String(string(newAssumeRolePolicyBytes)), + } + + _, err = r.iamClient.UpdateAssumeRolePolicy(ctx, updateInput) + if err != nil { + return false, fmt.Errorf("failed to update assume role policy for role %s: %w", *role.RoleName, err) + } + + log.Infof("Successfully updated trust policy for role %s to include lambda.amazonaws.com.", *role.RoleName) + return true, nil // Policy was updated +} + +// getVpcIDFromSubnets determines the VPC ID from the configured subnets. +// Assumes all subnets belong to the same VPC. +func (r *LambdaTestRunner) getVpcIDFromSubnets(ctx context.Context) (*string, error) { + allSubnetIDs := r.config.GetAllSubnets() + if len(allSubnetIDs) == 0 { + return nil, fmt.Errorf("no subnets configured, cannot determine VPC ID") + } + + // Describe the first subnet to get the VPC ID + firstSubnetID := allSubnetIDs[0].SubnetID + describeInput := &ec2.DescribeSubnetsInput{ + SubnetIds: []string{firstSubnetID}, + } + describeOutput, err := r.ec2Client.DescribeSubnets(ctx, describeInput) + if err != nil { + return nil, fmt.Errorf("failed to describe subnet %s: %w", firstSubnetID, err) + } + if len(describeOutput.Subnets) == 0 { + return nil, fmt.Errorf("subnet %s not found", firstSubnetID) + } + vpcID := describeOutput.Subnets[0].VpcId + log.Debugf("Determined VPC ID: %s from subnet %s", *vpcID, firstSubnetID) + return vpcID, nil +} + +// getOrCreateSecurityGroup finds or creates the necessary Security Group for the Lambda function, +// respecting the LambdaSecurityGroupID config if provided. +// Returns the SG ID, a boolean indicating if the SG was created in this call, and an error. 
+func (r *LambdaTestRunner) getOrCreateSecurityGroup(ctx context.Context, vpcID *string) (*string, bool, error) { + // Check if a specific Security Group ID is provided in the config + if r.config.LambdaSecurityGroupID != "" { + log.Infof("Using pre-configured Lambda Security Group ID: %s", r.config.LambdaSecurityGroupID) + // Optionally, perform a DescribeSecurityGroups call to validate the ID exists and is accessible? + // For now, assume the provided ID is valid. + return aws.String(r.config.LambdaSecurityGroupID), false, nil // Not created by us + } + + // No specific SG ID provided, proceed with get-or-create logic + sgName := lambdaSecurityGroupName + log.Infof("Checking for managed Security Group: %s in VPC %s", sgName, *vpcID) + + // Try to find the SG by name and tag + describeInput := &ec2.DescribeSecurityGroupsInput{ + Filters: append(NetworkCheckTagsFilter, + ec2types.Filter{Name: aws.String("vpc-id"), Values: []string{*vpcID}}, + ec2types.Filter{Name: aws.String("group-name"), Values: []string{sgName}}, + ), + } + + describeOutput, err := r.ec2Client.DescribeSecurityGroups(ctx, describeInput) + if err != nil { + // Handle potential AWS errors if needed, otherwise assume it doesn't exist or other issue + log.Warnf("Could not describe security groups (maybe transient error or SG doesn't exist): %v", err) + } + + if describeOutput != nil && len(describeOutput.SecurityGroups) > 0 { + // Found existing managed SG + sgID := describeOutput.SecurityGroups[0].GroupId + log.Infof("Found existing managed Security Group: %s", *sgID) + // Ensure the necessary egress rules exist even if we found the SG + ipv4Rule := ec2types.IpPermission{ + IpProtocol: aws.String("-1"), + IpRanges: []ec2types.IpRange{{CidrIp: aws.String("0.0.0.0/0")}}, + } + if err := ensureSecurityGroupEgressRule(ctx, r.ec2Client, sgID, ipv4Rule); err != nil { + log.WithError(err).Warnf("Failed to ensure IPv4 egress rule for existing SG %s", *sgID) + // Continue, but log the warning + } + ipv6Rule 
:= ec2types.IpPermission{ + IpProtocol: aws.String("-1"), + Ipv6Ranges: []ec2types.Ipv6Range{{CidrIpv6: aws.String("::/0")}}, + } + if err := ensureSecurityGroupEgressRule(ctx, r.ec2Client, sgID, ipv6Rule); err != nil { + log.WithError(err).Warnf("Failed to ensure IPv6 egress rule for existing SG %s", *sgID) + // Continue, but log the warning + } + return sgID, false, nil // Found, not created now + } + + // Security Group doesn't exist, create it + log.Infof("Security Group %s not found, creating...", sgName) + + tagSpec := []ec2types.TagSpecification{ + {ResourceType: ec2types.ResourceTypeSecurityGroup, Tags: NetworkCheckEC2Tags}, + } + + createInput := &ec2.CreateSecurityGroupInput{ + GroupName: aws.String(sgName), + Description: aws.String("Security Group for Gitpod Network Check Lambda"), + VpcId: vpcID, + TagSpecifications: tagSpec, + } + + createOutput, err := r.ec2Client.CreateSecurityGroup(ctx, createInput) + if err != nil { + return nil, false, fmt.Errorf("failed to create security group %s: %w", sgName, err) + } + sgID := createOutput.GroupId + log.Infof("Created Security Group: %s", *sgID) + + // Ensure the necessary egress rules exist after creation + ipv4Rule := ec2types.IpPermission{ + IpProtocol: aws.String("-1"), + IpRanges: []ec2types.IpRange{{CidrIp: aws.String("0.0.0.0/0")}}, + } + if err := ensureSecurityGroupEgressRule(ctx, r.ec2Client, sgID, ipv4Rule); err != nil { + log.WithError(err).Errorf("Failed to ensure IPv4 egress rule for newly created SG %s", *sgID) + // Return error as this is critical for a new SG + return nil, true, fmt.Errorf("failed to ensure IPv4 egress rule for security group %s: %w", *sgID, err) // Created but failed config + } + + ipv6Rule := ec2types.IpPermission{ + IpProtocol: aws.String("-1"), + Ipv6Ranges: []ec2types.Ipv6Range{{CidrIpv6: aws.String("::/0")}}, + } + if err := ensureSecurityGroupEgressRule(ctx, r.ec2Client, sgID, ipv6Rule); err != nil { + log.WithError(err).Errorf("Failed to ensure IPv6 egress rule for 
newly created SG %s", *sgID) + // Return error as this is critical for a new SG + return nil, true, fmt.Errorf("failed to ensure IPv6 egress rule for security group %s: %w", *sgID, err) // Created but failed config + } + + log.Infof("Successfully created and configured Security Group %s", *sgID) + return sgID, true, nil // Created successfully +} + +// ensureSecurityGroupEgressRule checks if a specific egress rule exists and adds it if not. +func ensureSecurityGroupEgressRule(ctx context.Context, ec2Client *ec2.Client, sgID *string, rule ec2types.IpPermission) error { + log.Debugf("Ensuring egress rule for SG %s: Proto=%s, IPv4=%v, IPv6=%v", + *sgID, aws.ToString(rule.IpProtocol), rule.IpRanges, rule.Ipv6Ranges) + + // Describe the security group to check existing rules + describeInput := &ec2.DescribeSecurityGroupsInput{ + GroupIds: []string{*sgID}, + } + describeOutput, err := ec2Client.DescribeSecurityGroups(ctx, describeInput) + if err != nil { + // Log warning but don't necessarily fail the whole operation, maybe transient? + log.WithError(err).Warnf("Could not describe security group %s to check egress rules", *sgID) + // Proceed with caution - attempt to add the rule anyway? Or return error? + // Let's return error to be safer, as we can't verify. 
+ return fmt.Errorf("failed to describe security group %s to verify egress rule: %w", *sgID, err) + } + + if len(describeOutput.SecurityGroups) == 0 { + return fmt.Errorf("security group %s not found during egress rule check", *sgID) + } + sg := describeOutput.SecurityGroups[0] + + // Check if the rule already exists + ruleExists := false + for _, existingRule := range sg.IpPermissionsEgress { + if aws.ToString(existingRule.IpProtocol) == aws.ToString(rule.IpProtocol) && + ipRangesMatch(existingRule.IpRanges, rule.IpRanges) && + ipv6RangesMatch(existingRule.Ipv6Ranges, rule.Ipv6Ranges) { + ruleExists = true + break + } + } + + if ruleExists { + log.Debugf("Egress rule already exists for SG %s: Proto=%s, IPv4=%v, IPv6=%v", + *sgID, aws.ToString(rule.IpProtocol), rule.IpRanges, rule.Ipv6Ranges) + return nil // Rule exists, nothing to do + } + + // Rule doesn't exist, add it + log.Infof("Authorizing missing egress rule for SG %s: Proto=%s, IPv4=%v, IPv6=%v", + *sgID, aws.ToString(rule.IpProtocol), rule.IpRanges, rule.Ipv6Ranges) + authInput := &ec2.AuthorizeSecurityGroupEgressInput{ + GroupId: sgID, + IpPermissions: []ec2types.IpPermission{rule}, + } + _, err = ec2Client.AuthorizeSecurityGroupEgress(ctx, authInput) + if err != nil { + // Check for duplicate error specifically, although the check above should prevent it + var apiErr smithy.APIError + if errors.As(err, &apiErr) && apiErr.ErrorCode() == "InvalidPermission.Duplicate" { + log.Warnf("Attempted to add duplicate egress rule for SG %s despite check (potential race condition?): %v", *sgID, err) + return nil // Treat as success if it's just a duplicate error + } + log.Errorf("Failed to authorize egress rule for SG %s: %v", *sgID, err) + return fmt.Errorf("failed to authorize egress rule for security group %s: %w", *sgID, err) + } + + log.Infof("Successfully authorized egress rule for SG %s", *sgID) + return nil +} + +// Helper to compare IP ranges (order doesn't matter) +func ipRangesMatch(a, b 
[]ec2types.IpRange) bool { + if len(a) != len(b) { + return false + } + mapA := make(map[string]struct{}, len(a)) + for _, r := range a { + if r.CidrIp != nil { + mapA[*r.CidrIp] = struct{}{} + } + } + for _, r := range b { + if r.CidrIp == nil { // If b has a nil entry, it can't match + return false + } + if _, ok := mapA[*r.CidrIp]; !ok { + return false + } + } + // Ensure the counts match exactly (handles cases where a has duplicates) + return len(mapA) == len(b) +} + +// Helper to compare IPv6 ranges (order doesn't matter) +func ipv6RangesMatch(a, b []ec2types.Ipv6Range) bool { + if len(a) != len(b) { + return false + } + mapA := make(map[string]struct{}, len(a)) + for _, r := range a { + if r.CidrIpv6 != nil { + mapA[*r.CidrIpv6] = struct{}{} + } + } + for _, r := range b { + if r.CidrIpv6 == nil { // If b has a nil entry, it can't match + return false + } + if _, ok := mapA[*r.CidrIpv6]; !ok { + return false + } + } + // Ensure the counts match exactly + return len(mapA) == len(b) +} + +// TestService runs the network checks by invoking the Lambda function(s). 
+func (r *LambdaTestRunner) TestService(ctx context.Context, subnets []checks.Subnet, serviceEndpoints map[string]string) (bool, error) { + log.Infof("Lambda Runner: TestService phase starting for %d subnets and %d endpoints.", len(subnets), len(serviceEndpoints)) + + if len(r.functionArns) == 0 { + return false, fmt.Errorf("no lambda functions seem to be prepared (functionArns map is empty)") + } + if len(subnets) == 0 { + log.Warn("No target subnets provided for this test set, skipping invocation.") + return true, nil // No subnets means nothing to test here + } + if len(serviceEndpoints) == 0 { + log.Warn("No service endpoints provided for this test set, skipping invocation.") + return true, nil // No endpoints means nothing to test here + } + + overallSuccess := true + + // Prepare the request payload once + requestPayload := lambda_types.CheckRequest{Endpoints: serviceEndpoints} // Use shared type + payloadBytes, err := json.Marshal(requestPayload) + if err != nil { + return false, fmt.Errorf("failed to marshal lambda request payload: %w", err) + } + + // Invoke Lambda for each unique target subnet + invokedSubnets := make(map[string]bool) + for _, subnet := range subnets { + // Clean the subnet ID *before* using it for lookup, consistent with Prepare phase + var sb strings.Builder + for _, r := range subnet.SubnetID { + if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') || r == '-' { + sb.WriteRune(r) + } + } + cleanSubnetID := sb.String() + + // Basic validation after cleaning - should match validation in Prepare + if !strings.HasPrefix(cleanSubnetID, "subnet-") || len(cleanSubnetID) < 8 { + log.Errorf("❌ Invalid subnet ID format during lookup: '%s' (Original: '%s'). 
Skipping.", cleanSubnetID, subnet.SubnetID) + overallSuccess = false // Mark as failure if subnet ID is invalid + continue + } + + if _, exists := invokedSubnets[cleanSubnetID]; exists { + log.Debugf("Skipping already invoked subnet: %s", cleanSubnetID) + continue + } + + functionArn, ok := r.functionArns[cleanSubnetID] // Use cleaned subnet ID for lookup + if !ok { + log.Errorf("❌ No prepared Lambda function found for subnet %s. Skipping.", cleanSubnetID) + overallSuccess = false + invokedSubnets[cleanSubnetID] = true // Mark as invoked (even though failed) to avoid re-attempting + continue // Skip this subnet if no function was prepared for it + } + + log.Infof("🚀 Invoking Lambda function %s for subnet %s", functionArn, cleanSubnetID) + + invokeInput := &lambda.InvokeInput{ + FunctionName: aws.String(functionArn), + Payload: payloadBytes, + InvocationType: lambdatypes.InvocationTypeRequestResponse, // Synchronous invocation + LogType: lambdatypes.LogTypeTail, // Get logs in response + } + + invokeOutput, err := r.lambdaClient.Invoke(ctx, invokeInput) + if err != nil { + log.Errorf("❌ Failed to invoke Lambda function %s for subnet %s: %v", functionArn, cleanSubnetID, err) // Use cleanSubnetID in log + overallSuccess = false + invokedSubnets[cleanSubnetID] = true // Mark as invoked even if failed + continue + } + + // Log Lambda execution logs if available + if invokeOutput.LogResult != nil { + log.Tracef("Lambda logs for %s (Subnet: %s):\n%s", functionArn, cleanSubnetID, *invokeOutput.LogResult) // Use cleanSubnetID in log + } + + if invokeOutput.FunctionError != nil { + log.Errorf("❌ Lambda function %s for subnet %s executed with error: %s", functionArn, cleanSubnetID, *invokeOutput.FunctionError) // Use cleanSubnetID in log + overallSuccess = false + invokedSubnets[cleanSubnetID] = true + continue + } + + // Process the response payload + var responsePayload lambda_types.CheckResponse // Use shared type + err = json.Unmarshal(invokeOutput.Payload, 
&responsePayload) + if err != nil { + log.Errorf("❌ Failed to unmarshal response payload from Lambda %s for subnet %s: %v", functionArn, cleanSubnetID, err) // Use cleanSubnetID in log + log.Debugf("Raw payload: %s", string(invokeOutput.Payload)) + overallSuccess = false + invokedSubnets[cleanSubnetID] = true + continue + } + + log.Infof("📋 Results from Lambda in subnet %s:", cleanSubnetID) // Use cleanSubnetID in log + subnetSuccess := true + for endpointName, result := range responsePayload.Results { + if result.Success { + log.Infof(" ✅ %s: OK", endpointName) + } else { + log.Errorf(" ❌ %s: FAILED (%s)", endpointName, result.Error) + subnetSuccess = false + } + } + + if !subnetSuccess { + overallSuccess = false + } + invokedSubnets[cleanSubnetID] = true + } + + log.Info("Lambda Runner: TestService phase finished.") + return overallSuccess, nil +} + +// Cleanup removes the AWS resources created during Prepare. +func (r *LambdaTestRunner) Cleanup(ctx context.Context) error { + log.Info("Lambda Runner: Cleanup phase starting...") + var cleanupErrors []error + deletedFunctionNames := make(map[string]string) // Store function name -> ARN for log group deletion + + // 1. 
Delete Lambda Functions + if len(r.functionArns) > 0 { + log.Infof("Deleting %d Lambda function(s)...", len(r.functionArns)) + for subnetID, functionArn := range r.functionArns { + // Extract function name from ARN for log group deletion later + // ARN format: arn:aws:lambda:region:account-id:function:function-name + functionName := getFunctionNameFromARN(functionArn) + if functionName == "" { + log.Warnf("Could not extract function name from ARN %s, skipping log group cleanup for this function.", functionArn) + } else { + deletedFunctionNames[functionName] = functionArn // Store for later use + } + + log.Debugf("Deleting Lambda function %s (Name: %s, Subnet: %s)", functionArn, functionName, subnetID) + deleteInput := &lambda.DeleteFunctionInput{ + FunctionName: aws.String(functionArn), + } + _, err := r.lambdaClient.DeleteFunction(ctx, deleteInput) + if err != nil { + // Check if it's already gone + var rnfe *lambdatypes.ResourceNotFoundException + if errors.As(err, &rnfe) { + log.Warnf("Lambda function %s not found, likely already deleted.", functionArn) + } else { + log.Errorf("❌ Failed to delete Lambda function %s: %v", functionArn, err) + cleanupErrors = append(cleanupErrors, fmt.Errorf("failed to delete lambda %s: %w", functionArn, err)) + } + } else { + log.Infof("✅ Deleted Lambda function %s (Name: %s)", functionArn, functionName) + } + } + } else { + log.Info("No Lambda functions recorded to delete.") + } + + // 2. Find and Delete Network Interfaces associated with the managed Security Group + if r.securityGroupID != nil && r.config.LambdaSecurityGroupID == "" { + sgID := *r.securityGroupID + log.Infof("Searching for Network Interfaces attached to managed Security Group %s...", sgID) + + enis, err := r.findNetworkInterfacesForSecurityGroup(ctx, sgID) + if err != nil { + log.WithError(err).Errorf("❌ Failed to find network interfaces for SG %s. 
Skipping ENI cleanup.", sgID) + } else if len(enis) > 0 { + log.Infof("Found %d Network Interface(s) associated with SG %s. Attempting detachment and deletion...", len(enis), sgID) + for eniID, eni := range enis { + log.Infof("Processing ENI '%s' (status: %s) with attachment ID '%s' (status: %s)...", eniID, eni.eniStatus, eni.attachmentID, eni.attachmentStatus) + if strings.HasPrefix(eni.attachmentID, "ela-attach-") { + log.Infof("Leaving attachment ID '%s' as-is, as it will be automatically removed once the lambda is gone", eni.attachmentID) + continue + } + if strings.HasPrefix(eni.description, "AWS Lambda") { + log.Infof("Leaving ENI '%s' as-is, as it is an AWS Lambda ENI that is cleaned up automatically", eniID) + continue + } + + attachmentID := eni.attachmentID + if attachmentID != "" { + detachInput := &ec2.DetachNetworkInterfaceInput{ + AttachmentId: aws.String(attachmentID), + Force: aws.Bool(true), + } + _, detachErr := r.ec2Client.DetachNetworkInterface(ctx, detachInput) + if detachErr != nil { + var apiErr smithy.APIError + if errors.As(detachErr, &apiErr) && apiErr.ErrorCode() == "InvalidAttachmentID.NotFound" { + log.Infof("ENI %s already detached.", eniID) + } else { + log.WithError(detachErr).Warnf("Failed to detach ENI '%s' with attachment ID '%s'", eniID, attachmentID) + cleanupErrors = append(cleanupErrors, fmt.Errorf("failed to detach ENI '%s' with attachment ID '%s': %w", eniID, attachmentID, detachErr)) + } + } + log.Infof("Detachment requested for ENI '%s' with attachment ID '%s'", eniID, attachmentID) + } + + // Attempt deletion with retries + deleteErr := r.deleteNetworkInterfaceWithRetry(ctx, eniID) + if deleteErr != nil { + log.WithError(deleteErr).Errorf("❌ Failed to delete ENI %s after retries.", eniID) + cleanupErrors = append(cleanupErrors, fmt.Errorf("failed to delete ENI %s: %w", eniID, deleteErr)) + // Continue to try deleting other ENIs + } else { + log.Infof("✅ Deleted Network Interface %s", eniID) + } + } + } else { + log.Infof("No
Network Interfaces found associated with SG %s.", sgID) + } + } else { + log.Info("Skipping Network Interface cleanup as Security Group was user-provided or not found.") + } + + // 3. Delete Security Group (only if managed by this tool, i.e., not provided via config) + // Now attempt SG deletion *after* ENI cleanup attempt + // if r.securityGroupID != nil && r.config.LambdaSecurityGroupID == "" { + // sgID := *r.securityGroupID + // log.Infof("Deleting managed Security Group %s...", sgID) + // deleteSGInput := &ec2.DeleteSecurityGroupInput{ + // GroupId: r.securityGroupID, + // } + // _, err := r.ec2Client.DeleteSecurityGroup(ctx, deleteSGInput) + // if err != nil { + // var apiErr smithy.APIError + // if errors.As(err, &apiErr) && apiErr.ErrorCode() == "InvalidGroup.NotFound" { + // log.Warnf("Security Group %s not found, likely already deleted.", sgID) + // } else { + // log.Errorf("❌ Failed to delete Security Group %s: %v", sgID, err) + // cleanupErrors = append(cleanupErrors, fmt.Errorf("failed to delete security group %s: %w", sgID, err)) + // } + // } else { + // log.Infof("✅ Deleted Security Group %s", sgID) + // } + // } else + if r.config.LambdaSecurityGroupID != "" { // Check if SG was provided via config + log.Infof("Skipping deletion of user-provided Security Group: %s", r.config.LambdaSecurityGroupID) + } else { + log.Info("Not deleting the created Security Group as it's garbage collected by AWS.") + } + + // 4.
Delete IAM Role (only if managed by this tool, i.e., not provided via config) + if r.roleArn != nil && r.config.LambdaRoleArn == "" { + roleName := lambdaRoleName // Assuming we always use the same name for managed roles + log.Infof("Detaching policies and deleting managed IAM role %s...", roleName) + + // Detach policies first (only necessary if we created the role) + policies := []string{ + "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole", + "arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole", + } + for _, policyArn := range policies { + log.Debugf("Detaching policy %s from role %s", policyArn, roleName) + detachInput := &iam.DetachRolePolicyInput{ + RoleName: aws.String(roleName), + PolicyArn: aws.String(policyArn), + } + _, err := r.iamClient.DetachRolePolicy(ctx, detachInput) + if err != nil { + // Log error but continue trying to delete role + log.Warnf("Failed to detach policy %s from role %s: %v", policyArn, roleName, err) + // Don't add to cleanupErrors here, as role deletion might still succeed or fail for other reasons + } + } + + // Delete the role + deleteInput := &iam.DeleteRoleInput{ + RoleName: aws.String(roleName), + } + _, err := r.iamClient.DeleteRole(ctx, deleteInput) + if err != nil { + var nsee *types.NoSuchEntityException + if errors.As(err, &nsee) { + log.Warnf("IAM role %s not found, likely already deleted.", roleName) + } else { + log.Errorf("❌ Failed to delete IAM role %s: %v", roleName, err) + cleanupErrors = append(cleanupErrors, fmt.Errorf("failed to delete iam role %s: %w", roleName, err)) + } + } else { + log.Infof("✅ Deleted IAM role %s", roleName) + } + } else if r.config.LambdaRoleArn != "" { // Check if Role was provided via config + log.Infof("Skipping deletion of user-provided IAM Role: %s", r.config.LambdaRoleArn) + } else { // Neither managed nor provided (or r.roleArn was nil initially) + log.Info("No managed IAM Role ARN recorded to delete.") + } + + // 5.
Delete CloudWatch Log Groups (always delete these as they are tied to the specific function run) + if len(deletedFunctionNames) > 0 { + log.Infof("Deleting %d CloudWatch Log Group(s)...", len(deletedFunctionNames)) + for functionName, functionArn := range deletedFunctionNames { + logGroupName := fmt.Sprintf("/aws/lambda/%s", functionName) + log.Debugf("Deleting CloudWatch Log Group %s (for function %s)", logGroupName, functionArn) + + deleteLogGroupInput := &cloudwatchlogs.DeleteLogGroupInput{ + LogGroupName: aws.String(logGroupName), + } + _, err := r.cloudwatchlogsClient.DeleteLogGroup(ctx, deleteLogGroupInput) + if err != nil { + var rnfe *cwltypes.ResourceNotFoundException // Use aliased type + if errors.As(err, &rnfe) { + log.Warnf("CloudWatch Log Group %s not found, likely already deleted or never created.", logGroupName) + } else { + log.Errorf("❌ Failed to delete CloudWatch Log Group %s: %v", logGroupName, err) + cleanupErrors = append(cleanupErrors, fmt.Errorf("failed to delete log group %s: %w", logGroupName, err)) + } + } else { + log.Infof("✅ Deleted CloudWatch Log Group %s", logGroupName) + } + } + } else { + log.Info("No Lambda function names recorded to attempt log group deletion.") + } + + if len(cleanupErrors) > 0 { + log.Error("Lambda Runner: Cleanup phase completed with errors.") + // Combine errors? For now, just return the first one or a generic error. 
+ return fmt.Errorf("cleanup failed with %d error(s): %w", len(cleanupErrors), cleanupErrors[0]) + } + + log.Info("Lambda Runner: Cleanup phase completed successfully.") + return nil +} + +// Helper function to extract function name from ARN +// Example ARN: arn:aws:lambda:us-west-2:123456789012:function:my-function +func getFunctionNameFromARN(arn string) string { + parts := strings.Split(arn, ":") + if len(parts) >= 7 && parts[5] == "function" { + // parts[6] is the bare function name; a version/alias suffix such as + // my-function:1 or my-function:$LATEST ends up in parts[7] after the split. + return parts[6] + } + return "" +} + +type networkInterface struct { + eniID string + eniStatus ec2types.NetworkInterfaceStatus + description string + attachmentID string + attachmentStatus ec2types.AttachmentStatus +} + +// findNetworkInterfacesForSecurityGroup finds ENIs and their Attachment IDs associated with a specific security group. +// Returns a map from ENI ID to the networkInterface details collected for that ENI. +func (r *LambdaTestRunner) findNetworkInterfacesForSecurityGroup(ctx context.Context, sgID string) (map[string]*networkInterface, error) { + eniAttachments := make(map[string]*networkInterface) + input := &ec2.DescribeNetworkInterfacesInput{ + Filters: []ec2types.Filter{ + { + Name: aws.String("group-id"), + Values: []string{sgID}, + }, + }, + } + + paginator := ec2.NewDescribeNetworkInterfacesPaginator(r.ec2Client, input) + for paginator.HasMorePages() { + page, err := paginator.NextPage(ctx) + if err != nil { + return nil, fmt.Errorf("failed to describe network interfaces for SG %s: %w", sgID, err) + } + + for _, eni := range page.NetworkInterfaces { + if eni.NetworkInterfaceId == nil { + log.Warnf("Found ENI for SG %s, but it has no interfaceId.
Cannot detach automatically.", sgID) + continue + } + + eniID := *eni.NetworkInterfaceId + nif := networkInterface{ + eniID: eniID, + description: aws.ToString(eni.Description), // nil-safe: Description may be unset + eniStatus: eni.Status, + } + if eni.Attachment != nil && eni.Attachment.AttachmentId != nil { + nif.attachmentStatus = eni.Attachment.Status + nif.attachmentID = *eni.Attachment.AttachmentId + } + eniAttachments[eniID] = &nif + } + } + return eniAttachments, nil +} + +// deleteNetworkInterfaceWithRetry attempts to delete an ENI, retrying every 20 seconds for up to 3 minutes. +func (r *LambdaTestRunner) deleteNetworkInterfaceWithRetry(ctx context.Context, eniID string) error { + maxDuration := 3 * time.Minute + maxWaitDuration := time.NewTimer(maxDuration) + baseDelay := 20 * time.Second + var lastErr error + + // for ;; { + // enis, err := r.ec2Client.DescribeNetworkInterfaces(ctx, &ec2.DescribeNetworkInterfacesInput{ + // NetworkInterfaceIds: []string{eniID}, + // }) + // if err != nil { + // log.WithError(err).Warnf("Failed to describe ENI %s before deletion attempt.", eniID) + // return fmt.Errorf("failed to describe ENI %s: %w", eniID, err) + // } + // if len(enis.NetworkInterfaces) == 0 { + // log.Warnf("ENI %s not found during describe before deletion attempt.", eniID) + // return nil + // } + // eni := enis.NetworkInterfaces[0] + // if eni.Status == ec2types.NetworkInterfaceStatusDetaching { + // log.Infof("Found ENI %s with status %s before deletion attempt.", eniID, eni.Status) + // } + // } + +loop: + for attempt := 1; ; attempt++ { + log.Debugf("Attempt %d to delete ENI %s...", attempt, eniID) + _, err := r.ec2Client.DeleteNetworkInterface(ctx, &ec2.DeleteNetworkInterfaceInput{ + NetworkInterfaceId: aws.String(eniID), + }) + if err == nil { + log.Debugf("Successfully deleted ENI %s on attempt %d.", eniID, attempt) + return nil // Success + } + + lastErr = err + log.WithError(err).Warnf("Attempt %d failed to delete ENI %s.", attempt, eniID) + + // Check if it's already deleted + var apiErr smithy.APIError + if errors.As(err,
&apiErr) && apiErr.ErrorCode() == "InvalidNetworkInterfaceID.NotFound" { + log.Infof("ENI %s not found during delete attempt %d, assuming already deleted.", eniID, attempt) + return nil // Treat as success + } + + // Wait before retrying + select { + case <-time.After(baseDelay): + // Continue loop + case <-maxWaitDuration.C: + log.Warnf("Timed out while waiting to retry ENI %s deletion.", eniID) + break loop + case <-ctx.Done(): + log.Warnf("Context cancelled while waiting to retry ENI %s deletion.", eniID) + return ctx.Err() + } + } + + return fmt.Errorf("failed to delete ENI %s within %s: %w", eniID, maxDuration, lastErr) +} + +// LoadLambdaRunnerFromTags creates a new LambdaTestRunner instance by discovering existing +// AWS resources based on known names and the standard network check tag. +func LoadLambdaRunnerFromTags(ctx context.Context, networkConfig *checks.NetworkConfig) (*LambdaTestRunner, error) { + runner, err := NewLambdaTestRunner(ctx, networkConfig) + if err != nil { + return nil, fmt.Errorf("failed to create base LambdaTestRunner: %w", err) + } + + log.Info("Attempting to load existing Lambda runner resources from tags...") + + // Discover Lambda Functions by tag + log.Debugf("Searching for Lambda functions with tag %s=%s", NetworkCheckTagKey, NetworkCheckTagValue) // Use exported constants + listFuncPaginator := lambda.NewListFunctionsPaginator(runner.lambdaClient, &lambda.ListFunctionsInput{}) + foundFunctions := 0 + for listFuncPaginator.HasMorePages() { + page, err := listFuncPaginator.NextPage(ctx) + if err != nil { + log.WithError(err).Warn("Failed to list Lambda functions page, discovery might be incomplete.") + break // Stop processing on error, but continue with what we have + } + for _, function := range page.Functions { + tagsOutput, err := runner.lambdaClient.ListTags(ctx, &lambda.ListTagsInput{Resource: function.FunctionArn}) + if err != nil { + log.WithError(err).Warnf("Failed to list tags for function %s, skipping.",
*function.FunctionArn) + continue + } + if val, ok := tagsOutput.Tags[NetworkCheckTagKey]; ok && val == NetworkCheckTagValue { // Use exported constants + log.Debugf("Found tagged Lambda function: %s", *function.FunctionArn) + // We don't know the original subnet ID here, store by ARN for cleanup + runner.functionArns[*function.FunctionArn] = *function.FunctionArn + foundFunctions++ + } + } + } + if foundFunctions > 0 { + log.Infof("Discovered %d existing Lambda function(s) tagged for cleanup.", foundFunctions) + } else { + log.Info("No existing Lambda functions found with the network check tag.") + } + + // Discover IAM Role by name (and optionally check tag) + log.Debugf("Checking for managed IAM role: %s", lambdaRoleName) + getRoleInput := &iam.GetRoleInput{RoleName: aws.String(lambdaRoleName)} + getRoleOutput, err := runner.iamClient.GetRole(ctx, getRoleInput) + if err == nil { + // Verify tag - GetRole doesn't return tags directly, need ListRoleTags + tagsOutput, tagErr := runner.iamClient.ListRoleTags(ctx, &iam.ListRoleTagsInput{RoleName: aws.String(lambdaRoleName)}) + hasTag := false + if tagErr == nil { + for _, tag := range tagsOutput.Tags { + if aws.ToString(tag.Key) == NetworkCheckTagKey && aws.ToString(tag.Value) == NetworkCheckTagValue { // Use exported constants + hasTag = true + break + } + } + } else { + log.WithError(tagErr).Warnf("Could not list tags for role %s", lambdaRoleName) + } + + if hasTag { + log.Infof("Discovered existing managed IAM role: %s", *getRoleOutput.Role.Arn) + runner.roleArn = getRoleOutput.Role.Arn + } else { + log.Warnf("Found IAM role named %s, but it doesn't have the expected tag (%s=%s). 
It will not be managed/cleaned up.", lambdaRoleName, NetworkCheckTagKey, NetworkCheckTagValue) // Use exported constants + } + } else { + var nsee *types.NoSuchEntityException + if !errors.As(err, &nsee) { + log.WithError(err).Warnf("Failed to get IAM role %s", lambdaRoleName) + } else { + log.Info("No existing managed IAM role found.") + } + } + + // Discover Security Group by name and tag + log.Debugf("Checking for managed Security Group: %s", lambdaSecurityGroupName) + // We need a VPC ID to search for the SG. If subnets are configured, use the first one. + // If no subnets are configured (e.g., pure cleanup run), we might not be able to find the SG reliably by name alone across VPCs. + // For cleanup, maybe we should list *all* SGs with the tag? Or require VPC context? + // Let's assume for now cleanup usually runs with the same config, so we can get VPC ID. + vpcID, err := runner.getVpcIDFromSubnets(ctx) // Re-use existing helper + if err != nil { + log.WithError(err).Warn("Could not determine VPC ID from config, Security Group discovery might be limited.") + // Potentially list SGs across all VPCs with the tag? Riskier. For now, skip SG discovery if VPC is unknown. 
+ } else { + describeSGInput := &ec2.DescribeSecurityGroupsInput{ + Filters: append(NetworkCheckTagsFilter, // Use exported var + ec2types.Filter{Name: aws.String("vpc-id"), Values: []string{*vpcID}}, + ec2types.Filter{Name: aws.String("group-name"), Values: []string{lambdaSecurityGroupName}}, + ), + } + describeSGOutput, err := runner.ec2Client.DescribeSecurityGroups(ctx, describeSGInput) + if err != nil { + log.WithError(err).Warn("Could not describe security groups (maybe transient error or SG doesn't exist)") + } else if len(describeSGOutput.SecurityGroups) > 0 { + // Found the managed SG + sgID := describeSGOutput.SecurityGroups[0].GroupId + log.Infof("Discovered existing managed Security Group: %s", *sgID) + runner.securityGroupID = sgID + } else { + log.Info("No existing managed Security Group found.") + } + } + + log.Info("Lambda runner resource discovery complete.") + return runner, nil +} diff --git a/gitpod-network-check/pkg/runner/lambda_handler.go b/gitpod-network-check/pkg/runner/lambda_handler.go new file mode 100644 index 0000000..f1a6f97 --- /dev/null +++ b/gitpod-network-check/pkg/runner/lambda_handler.go @@ -0,0 +1,73 @@ +package runner + +import ( + "context" + "fmt" + "net/http" + "net/url" + "time" + + "github.com/gitpod-io/enterprise-deployment-toolkit/gitpod-network-check/pkg/lambda_types" + log "github.com/sirupsen/logrus" +) + +// HandleLambdaEvent handles the Lambda event by checking each requested endpoint over HTTP. +// It is invoked by the aws-lambda-go library.
+func HandleLambdaEvent(ctx context.Context, request lambda_types.CheckRequest) (lambda_types.CheckResponse, error) {
+	log.Infof("Lambda Handler: Received check request for %d endpoints.", len(request.Endpoints))
+
+	response := lambda_types.CheckResponse{
+		Results: make(map[string]lambda_types.CheckResult),
+	}
+
+	client := &http.Client{
+		Timeout: 10 * time.Second, // Consider making this configurable if needed
+	}
+
+	// Perform checks
+	for name, targetUrlStr := range request.Endpoints {
+
+		targetUrl, err := url.Parse(targetUrlStr)
+		if err != nil {
+			response.Results[name] = lambda_types.CheckResult{Success: false, Error: fmt.Sprintf("invalid URL: %v", err)}
+			log.Warnf(" -> Failed URL parsing for '%s': %v", targetUrlStr, err)
+			continue
+		}
+		if targetUrl.Scheme == "" {
+			// Default to HTTPS if no scheme is provided
+			targetUrl.Scheme = "https"
+		}
+
+		log.Debugf("Lambda Handler: Checking endpoint: %s (%s)", name, targetUrl.String())
+
+		// Use the context provided by the Lambda runtime
+		log := log.WithField("endpoint", targetUrl.String())
+
+		req, err := http.NewRequestWithContext(ctx, "GET", targetUrl.String(), nil)
+		if err != nil {
+			response.Results[name] = lambda_types.CheckResult{Success: false, Error: fmt.Sprintf("failed to create request: %v", err)}
+			log.Warnf(" -> Failed (request creation): %v", err)
+			continue
+		}
+
+		resp, err := client.Do(req)
+		if err != nil {
+			response.Results[name] = lambda_types.CheckResult{Success: false, Error: fmt.Sprintf("HTTP request failed: %v", err)}
+			log.Warnf(" -> Failed (HTTP request): %v", err)
+		} else {
+			resp.Body.Close() // Ensure body is closed
+			if resp.StatusCode >= 200 && resp.StatusCode < 300 {
+				response.Results[name] = lambda_types.CheckResult{Success: true}
+				log.Debugf(" -> Success (Status: %d)", resp.StatusCode)
+			} else {
+				response.Results[name] = lambda_types.CheckResult{Success: false, Error: fmt.Sprintf("unexpected status code: %d", resp.StatusCode)}
+				log.Warnf(" -> Failed (Status: %d)", resp.StatusCode)
+			}
+		}
+	}
+
+	log.Info("Lambda Handler: Check processing complete.")
+	// The lambda library handles marshalling the response and deals with errors.
+	// We return the response struct and nil error if processing logic itself didn't fail critically.
+	return response, nil
+}
diff --git a/gitpod-network-check/pkg/runner/lambda_handler_test.go b/gitpod-network-check/pkg/runner/lambda_handler_test.go
new file mode 100644
index 0000000..122ecd7
--- /dev/null
+++ b/gitpod-network-check/pkg/runner/lambda_handler_test.go
@@ -0,0 +1,146 @@
+package runner
+
+import (
+	"context"
+	"fmt"
+	"net/http"
+	"net/http/httptest"
+	"testing"
+
+	"github.com/gitpod-io/enterprise-deployment-toolkit/gitpod-network-check/pkg/lambda_types"
+	cmp "github.com/google/go-cmp/cmp" // Added for deep comparison
+	log "github.com/sirupsen/logrus"
+)
+
+// TestHandleLambdaEvent tests the core logic within the Lambda handler function.
+func TestHandleLambdaEvent(t *testing.T) {
+
+	tests := []struct {
+		name         string
+		request      lambda_types.CheckRequest
+		expectedResp lambda_types.CheckResponse
+	}{
+		{
+			name: "successful http check",
+			request: lambda_types.CheckRequest{
+				Endpoints: map[string]string{
+					"example_http": "http://example.com", // Use http to avoid cert issues in test
+				},
+			},
+			expectedResp: lambda_types.CheckResponse{
+				Results: map[string]lambda_types.CheckResult{
+					"example_http": {Success: true},
+				},
+			},
+		},
+		{
+			name: "successful https check",
+			request: lambda_types.CheckRequest{
+				Endpoints: map[string]string{
+					"example_https": "https://example.com",
+				},
+			},
+			expectedResp: lambda_types.CheckResponse{
+				Results: map[string]lambda_types.CheckResult{
+					"example_https": {Success: true},
+				},
+			},
+		},
+		{
+			name: "failed http check - 404",
+			// Assuming httpbin gives a 404 for this path
+			request: lambda_types.CheckRequest{
+				Endpoints: map[string]string{
+					"httpbin_404": "http://httpbin.org/status/404",
+				},
+			},
+			expectedResp: lambda_types.CheckResponse{
+				Results: map[string]lambda_types.CheckResult{
+					"httpbin_404": {Success: false, Error: "unexpected status code: 404"},
+				},
+			},
+		},
+		{
+			name: "failed http check - connection refused",
+			// Use a port likely not open on localhost
+			request: lambda_types.CheckRequest{
+				Endpoints: map[string]string{
+					"localhost_conn_refused": "http://127.0.0.1:1",
+				},
+			},
+			expectedResp: lambda_types.CheckResponse{
+				Results: map[string]lambda_types.CheckResult{
+					// Error message might vary slightly depending on OS/network stack - adjust if needed after running
+					"localhost_conn_refused": {Success: false, Error: "HTTP request failed: Get \"http://127.0.0.1:1\": dial tcp 127.0.0.1:1: connect: connection refused"},
+				},
+			},
+		},
+		{
+			name: "multiple endpoints - mix success and failure",
+			request: lambda_types.CheckRequest{
+				Endpoints: map[string]string{
+					"example_ok":   "https://example.com",
+					"httpbin_404":  "http://httpbin.org/status/404",
+					"conn_refused": "http://127.0.0.1:1",
+				},
+			},
+			expectedResp: lambda_types.CheckResponse{
+				Results: map[string]lambda_types.CheckResult{
+					"example_ok":  {Success: true},
+					"httpbin_404": {Success: false, Error: "unexpected status code: 404"},
+					// Error message might vary slightly depending on OS/network stack - adjust if needed after running
+					"conn_refused": {Success: false, Error: "HTTP request failed: Get \"http://127.0.0.1:1\": dial tcp 127.0.0.1:1: connect: connection refused"},
+				},
+			},
+		},
+	}
+
+	// Setup mock HTTP server for reliable testing of external URLs
+	mockServer := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+		// Check the path to determine the response, as Host will be the mock server's address
+		if r.URL.Path == "/status/404" {
+			w.WriteHeader(http.StatusNotFound)
+			fmt.Fprintln(w, "Not Found")
+		} else if r.URL.Path == "/" { // Assume requests to the root are for the "success" case
+			w.WriteHeader(http.StatusOK)
+			fmt.Fprintln(w, "OK")
+		} else {
+			// Log unexpected paths to help debug test failures
+			log.Warnf("Mock server received request for unexpected path: %s", r.URL.Path)
+			w.WriteHeader(http.StatusInternalServerError)
+			fmt.Fprintln(w, "Mock server default - unexpected path")
+		}
+	}))
+	defer mockServer.Close()
+
+	// Replace external URLs in test cases with mock server URL
+	// This makes tests faster and more reliable (no external network dependency)
+	for i := range tests {
+		newEndpoints := make(map[string]string)
+		for name, url := range tests[i].request.Endpoints {
+			switch url {
+			case "http://example.com", "https://example.com":
+				newEndpoints[name] = mockServer.URL
+			case "http://httpbin.org/status/404":
+				newEndpoints[name] = mockServer.URL + "/status/404"
+			default:
+				newEndpoints[name] = url // Keep internal/localhost URLs as is
+			}
+		}
+		tests[i].request.Endpoints = newEndpoints
+	}
+
+	for _, tt := range tests {
+		tt := tt // Capture range variable
+		t.Run(tt.name, func(t *testing.T) {
+			actualResp, err := HandleLambdaEvent(context.Background(), tt.request)
+			if err != nil {
+				t.Errorf("handleLambdaEvent returned an unexpected error: %v", err)
+			}
+
+			if diff := cmp.Diff(tt.expectedResp, actualResp); diff != "" {
+				t.Errorf("handleLambdaEvent response mismatch (-want +got):\n%s", diff)
+			}
+		})
+	}
+}
diff --git a/gitpod-network-check/pkg/runner/local-runner.go b/gitpod-network-check/pkg/runner/local-runner.go
new file mode 100644
index 0000000..231e73a
--- /dev/null
+++ b/gitpod-network-check/pkg/runner/local-runner.go
@@ -0,0 +1,105 @@
+package runner
+
+import (
+	"context"
+	"fmt"
+	"net/http"
+	"net/url"
+	"time"
+
+	log "github.com/sirupsen/logrus"
+
+	"github.com/gitpod-io/enterprise-deployment-toolkit/gitpod-network-check/pkg/checks"
+)
+
+// LocalTestRunner executes network checks directly from the local machine.
+type LocalTestRunner struct {
+	// No fields needed for local runner currently
+}
+
+// NewLocalTestRunner creates a new instance of LocalTestRunner.
+func NewLocalTestRunner() *LocalTestRunner {
+	log.Info("ℹ️ Using local test runner")
+	return &LocalTestRunner{}
+}
+
+// Prepare performs any setup required for the local runner. Currently a no-op.
+func (r *LocalTestRunner) Prepare(ctx context.Context) error {
+	log.Debug("Local runner Prepare: No preparation needed.")
+	return nil // No setup needed for local execution
+}
+
+// TestService runs connectivity tests to the specified service endpoints from the local machine.
+// The subnets parameter is ignored in local mode.
+func (r *LocalTestRunner) TestService(ctx context.Context, subnets []checks.Subnet, serviceEndpoints map[string]string) (bool, error) {
+	log.Debugf("Local runner TestService: Ignoring subnets (%d provided)", len(subnets))
+	overallSuccess := true
+
+	httpClient := &http.Client{
+		Timeout: 15 * time.Second, // Sensible default timeout
+		Transport: &http.Transport{
+			// Consider adding proxy support if needed later
+			// Proxy: http.ProxyFromEnvironment,
+			DisableKeepAlives: true, // Avoid reusing connections for distinct tests
+		},
+	}
+
+	for name, endpointURL := range serviceEndpoints {
+		log.Infof("ℹ️ Testing connectivity to %s (%s) locally...", name, endpointURL)
+
+		// Ensure URL includes scheme
+		parsedURL, err := url.Parse(endpointURL)
+		if err != nil {
+			log.Errorf("❌ Failed to parse URL for %s (%s): %v", name, endpointURL, err)
+			overallSuccess = false
+			continue
+		}
+		if parsedURL.Scheme == "" {
+			// Default to HTTPS if no scheme is provided
+			parsedURL.Scheme = "https"
+			endpointURL = parsedURL.String()
+			log.Debugf("Assuming HTTPS for %s: %s", name, endpointURL)
+		}
+
+		req, err := http.NewRequestWithContext(ctx, http.MethodHead, endpointURL, nil)
+		if err != nil {
+			log.Errorf("❌ Failed to create request for %s (%s): %v", name, endpointURL, err)
+			overallSuccess = false
+			continue
+		}
+
+		// Add a user-agent?
+		// req.Header.Set("User-Agent", "gitpod-network-check/local")
+
+		resp, err := httpClient.Do(req)
+		if err != nil {
+			log.Errorf("❌ Failed to connect to %s (%s): %v", name, endpointURL, err)
+			overallSuccess = false
+			continue
+		}
+		resp.Body.Close() // Ensure body is closed even if not read
+
+		// A HEAD request is treated as successful for any status code below 500.
+		// Some services return 403 Forbidden (or another 4xx) for HEAD but are still reachable,
+		// so only 5xx responses are reported as failures.
+		if resp.StatusCode >= 500 {
+			log.Errorf("❌ Connection test failed for %s (%s): Received status code %d", name, endpointURL, resp.StatusCode)
+			overallSuccess = false
+		} else {
+			log.Infof("✅ Successfully connected to %s (%s) - Status: %d", name, endpointURL, resp.StatusCode)
+		}
+	}
+
+	if !overallSuccess {
+		return false, fmt.Errorf("one or more local connectivity tests failed")
+	}
+
+	log.Info("✅ All local connectivity tests passed.")
+	return true, nil
+}
+
+// Cleanup performs any teardown required for the local runner. Currently a no-op.
+func (r *LocalTestRunner) Cleanup(ctx context.Context) error {
+	log.Debug("Local runner Cleanup: No cleanup needed.")
+	return nil // No cleanup needed for local execution
+}
diff --git a/memory-bank/activeContext.md b/memory-bank/activeContext.md
new file mode 100644
index 0000000..546bb8a
--- /dev/null
+++ b/memory-bank/activeContext.md
@@ -0,0 +1,33 @@
+# Active Context: gitpod-network-check (2025-04-03)
+
+**Current Focus:**
+
+Completed enhancements for the `lambda` execution mode based on initial review and `progress.md` TODOs.
+
+**Recent Changes:**
+
+* **Lambda Mode Enhancements:**
+    * Implemented CloudWatch Log Group deletion in `LambdaTestRunner.Cleanup`.
+    * Added wait/retry logic for Lambda function active state in `LambdaTestRunner.Prepare`.
+    * Added flags (`--lambda-role-arn`, `--lambda-sg-id`) and corresponding config fields (`LambdaRoleArn`, `LambdaSecurityGroupID`) to allow using existing AWS resources.
+    * Updated `LambdaTestRunner` `Prepare` and `Cleanup` methods to respect the new flags (skip creation/deletion of provided resources).
+    * Removed ad-hoc cleanup logic from `LambdaTestRunner.Prepare` and related functions, relying on the caller to invoke `Cleanup`.
+    * Added `cloudwatchlogs` dependency via `go get`.
+    * Aligned resource tagging in `LambdaTestRunner` with `EC2TestRunner` (`gitpod.io/network-check: true`).
+    * Implemented `LoadLambdaRunnerFromTags` function to discover existing Lambda resources for cleanup.
+    * Updated `LoadRunnerFromTags` in `common.go` to dispatch to `LoadLambdaRunnerFromTags`.
+    * Updated tag variables in `common.go` to be exported and updated references in both `lambda-runner.go` and `ec2-runner.go`.
+    * **Fixed `InvalidPermission.Duplicate` error:** Modified `getOrCreateSecurityGroup` in `lambda-runner.go` to check for existing default egress rules (IPv4/IPv6 allow-all) before attempting to add them, making the process idempotent. Added helper `ensureSecurityGroupEgressRule`.
+    * **Fixed Lambda function name length error:** Modified `NewLambdaTestRunner` in `lambda-runner.go` to use `time.Now().Unix()` (seconds) instead of `time.Now().UnixNano()` for the `runID` to keep function names under the 64-character limit.
+    * **Fixed IAM role trust policy error:** Modified `getOrCreateLambdaRole` in `lambda-runner.go` to check and update the assume role policy for existing roles (managed or user-provided) to ensure `lambda.amazonaws.com` is trusted. Added helper `ensureLambdaTrustPolicy` and a delay for IAM propagation.
+    * **Fixed invalid subnet ID/function name format error:** Added more robust cleaning logic in the `Prepare` function's deployment loop in `lambda-runner.go` to remove extraneous characters (spaces, brackets) from subnet IDs before using them.
+* **Documentation:**
+    * Updated `gitpod-network-check/README.md` with details on `lambda` mode prerequisites, usage, and new flags.
+* **Memory Bank:**
+    * Updated `memory-bank/progress.md` to reflect completed enhancements and remaining tasks.
+    * Updated this file (`memory-bank/activeContext.md`).
+
+**Next Steps:**
+
+* Perform testing of the `lambda` mode in a real AWS environment.
+* Consider if more sophisticated error handling/rollback in `Prepare` is necessary based on testing results.
diff --git a/memory-bank/productContext.md b/memory-bank/productContext.md
new file mode 100644
index 0000000..492b195
--- /dev/null
+++ b/memory-bank/productContext.md
@@ -0,0 +1,25 @@
+# Product Context: gitpod-network-check
+
+**Problem Solved:**
+
+Deploying Gitpod requires specific network configurations to allow communication between its components and various external services (AWS APIs, container registries, identity providers, etc.). Misconfigurations are common and can be difficult to diagnose, leading to deployment failures or runtime issues. This tool aims to proactively identify these network connectivity problems before or during Gitpod installation/updates.
+
+**How it Should Work:**
+
+The tool executes predefined sets of network connectivity tests (`TestSets`) targeting specific endpoints required by Gitpod. These tests are run from environments that simulate where Gitpod components would run (e.g., within specific AWS subnets).
+
+* **Modes:** The tool supports different execution modes:
+    * `ec2`: (Existing) Launches temporary EC2 instances in specified subnets to run tests. Requires AWS credentials and permissions.
+    * `lambda`: (Implemented; pending testing in a real AWS environment) Uses AWS Lambda functions for testing.
+    * `local`: (Current Task) Runs tests directly from the machine executing the CLI using standard Go libraries. Useful for basic outbound checks or when AWS resources aren't desired/available.
+* **Test Sets:** Groups of related checks (e.g., connectivity to core AWS services from pod subnets).
+* **Configuration:** Network details (subnets, region) and test parameters (hosts) are provided via CLI flags or a configuration file.
+* **Output:** Logs detailed information about each check, clearly indicating success or failure.
+* **Cleanup:** Automatically removes any temporary resources created during the `ec2` mode run.
+
+**User Experience Goals:**
+
+* **Simplicity:** Easy to run with sensible defaults.
+* **Clarity:** Provide clear pass/fail results and informative error messages.
+* **Flexibility:** Allow users to select specific test sets and execution modes.
+* **Reliability:** Accurately reflect the network connectivity status relevant to Gitpod.
diff --git a/memory-bank/progress.md b/memory-bank/progress.md
new file mode 100644
index 0000000..cd76a30
--- /dev/null
+++ b/memory-bank/progress.md
@@ -0,0 +1,47 @@
+# Progress: gitpod-network-check (2025-04-03)
+
+**What Works:**
+
+* Core CLI structure using Cobra.
+* Configuration loading via Viper (flags, file).
+* `diagnose` command framework.
+* `TestRunner` interface defined.
+* `ec2` mode:
+    * Launches EC2 instances in specified subnets.
+    * Uses SSM to run check scripts on instances.
+    * Performs basic connectivity checks.
+    * `cleanup` command removes EC2 resources.
+* `local` mode:
+    * Runs checks directly from the CLI host using Go's `net/http`.
+* `lambda` mode:
+    * `LambdaTestRunner` implemented (`Prepare`, `TestService`, `Cleanup`).
+    * Internal `lambda-handler` subcommand created (`cmd/lambda_handler.go`) to perform checks inside Lambda, using shared types (`pkg/lambda_types`).
+    * `Prepare` handles IAM role, SG creation, packaging the *main binary* with a `bootstrap` script, and Lambda deployment per subnet using `provided.al2` runtime.
+    * `TestService` invokes Lambdas per subnet and aggregates JSON results.
+    * `Prepare` handles IAM role, SG creation (or uses existing ones via flags/config), packaging the main binary with a `bootstrap` script, Lambda deployment per subnet using `provided.al2` runtime, and waits for functions to become active. Includes basic deferred cleanup on error.
+    * `TestService` invokes Lambdas per subnet and aggregates JSON results.
+    * `Cleanup` handles Lambda function, CloudWatch Log Group deletion, and deletes managed SG/IAM role (skips deletion if user-provided).
+    * Integrated into `diagnose` (via `runner.NewRunner`) and `cleanup` commands.
+    * Flags (`--lambda-role-arn`, `--lambda-sg-id`) and config options added.
+    * Help text updated.
+    * README documentation updated for Lambda mode prerequisites and usage.
+    * Aligned resource tagging (`gitpod.io/network-check: true`) with EC2 mode.
+    * Removed ad-hoc cleanup logic from `Prepare`.
+    * Added `LoadLambdaRunnerFromTags` to discover existing resources for cleanup.
+    * Integrated `LoadLambdaRunnerFromTags` into the `cleanup` command via `LoadRunnerFromTags`.
+    * Removed separate Lambda handler code (`lambda/checker/`) and cleaned dependencies.
+
+**What's Left to Build:**
+
+* **`lambda` mode enhancements:**
+    * Testing in a real AWS environment.
+    * Consider more sophisticated rollback logic in `Prepare` if needed beyond basic deferred cleanup.
+
+**Current Status:**
+
+* Implementation of `lambda` mode enhancements (Log Group cleanup, readiness wait, existing resource flags, documentation, aligned tagging) and cleanup refactoring completed.
+* Ready for testing.
+
+**Known Issues:**
+
+* Error handling during resource creation in `Prepare` relies solely on the caller invoking `Cleanup`. Complex partial failures might leave orphaned resources if `Cleanup` is not called or fails itself.
diff --git a/memory-bank/projectbrief.md b/memory-bank/projectbrief.md
new file mode 100644
index 0000000..b824e62
--- /dev/null
+++ b/memory-bank/projectbrief.md
@@ -0,0 +1,13 @@
+# Project Brief: gitpod-network-check
+
+**Core Purpose:**
+
+`gitpod-network-check` is a command-line interface (CLI) tool designed to diagnose network connectivity issues relevant to deploying and running Gitpod Self-Hosted or Gitpod Dedicated instances.
+
+**Key Goals:**
+
+* Provide a reliable way for administrators and support engineers to verify network prerequisites for Gitpod.
+* Test connectivity from relevant network segments (e.g., pod subnets, main subnets) to required external services (AWS APIs, container registries, etc.) and internal components.
+* Support different testing backends (e.g., EC2 instances, potentially Lambda, local execution) to suit various environments and testing needs.
+* Offer clear, actionable output indicating success or failure for specific checks.
+* Manage any temporary infrastructure created for testing (e.g., EC2 instances, security groups).
diff --git a/memory-bank/systemPatterns.md b/memory-bank/systemPatterns.md
new file mode 100644
index 0000000..3eac442
--- /dev/null
+++ b/memory-bank/systemPatterns.md
@@ -0,0 +1,47 @@
+# System Patterns: gitpod-network-check
+
+**Core Architecture:**
+
+* **CLI Application:** Built using Go and the Cobra library for command structure and flag parsing.
+* **Configuration:** Uses Viper for managing configuration from files (e.g., `gitpod-network-check.yaml`) and environment variables, layered with CLI flags.
+* **Modular Test Execution:** Employs a `TestRunner` interface (`pkg/runner/common.go`) to abstract the environment where network tests are executed. This allows plugging in different backends (EC2, Local, potentially Lambda).
+* **Test Definitions:** Test logic is grouped into `TestSets` (`pkg/checks/`), which are functions returning endpoints and subnet types to test.
+
+**Key Patterns:**
+
+* **Strategy Pattern:** The `TestRunner` interface and its implementations (`EC2TestRunner`, `LocalTestRunner`) exemplify the Strategy pattern, allowing the test execution strategy to be selected at runtime (`--mode` flag).
+* **Dependency Injection (Implicit):** The `NetworkConfig` struct is populated from configuration sources and passed down to components that need it (like the `EC2TestRunner`).
+* **Resource Management (EC2):** The `EC2TestRunner` handles the lifecycle (Prepare, Cleanup) of temporary AWS resources (Instances, Roles, Security Groups) needed for testing. The `LocalTestRunner` requires no external resource management.
+* **Command Pattern (Cobra):** Cobra organizes CLI functionality into distinct `Command` objects (`diagnose`, `cleanup`).
+
+**Component Relationships:**
+
+```mermaid
+graph TD
+    CLI[gitpod-network-check CLI] --> RootCmd[cmd/root.go];
+    RootCmd -- loads config --> Config[NetworkConfig];
+    RootCmd -- registers --> DiagnoseCmd[cmd/checks.go];
+    RootCmd -- registers --> CleanupCmd[cmd/cleanup.go];
+
+    DiagnoseCmd -- uses --> Config;
+    DiagnoseCmd -- selects based on mode --> RunnerInterface[pkg/runner/common.go#TestRunner];
+    RunnerInterface -- implemented by --> EC2Runner[pkg/runner/ec2-runner.go];
+    RunnerInterface -- implemented by --> LocalRunner[pkg/runner/local-runner.go];
+
+    DiagnoseCmd -- uses --> TestSets[pkg/checks/];
+    TestSets -- define --> Endpoints;
+    TestSets -- define --> SubnetTypes;
+
+    EC2Runner -- uses --> AWS_SDK[AWS SDK (EC2, IAM, SSM)];
+    EC2Runner -- manages --> AWSResources[EC2 Instances, SG, IAM Roles];
+    LocalRunner -- uses --> GoStdLib[Go net/http];
+
+    CleanupCmd -- uses --> Config;
+    CleanupCmd -- loads --> EC2Runner;
+    EC2Runner -- Cleanup --> AWS_SDK;
+
+    CLI -- entrypoint --> main.go;
+    main.go -- calls --> RootCmd.Execute;
+
+    style EC2Runner fill:#f9f,stroke:#333,stroke-width:2px;
+    style LocalRunner fill:#ccf,stroke:#333,stroke-width:2px;
diff --git a/memory-bank/techContext.md b/memory-bank/techContext.md
new file mode 100644
index 0000000..4c4ef79
--- /dev/null
+++ b/memory-bank/techContext.md
@@ -0,0 +1,32 @@
+# Tech Context: gitpod-network-check
+
+**Core Technologies:**
+
+* **Language:** Go (Golang)
+* **CLI Framework:** Cobra (`github.com/spf13/cobra`)
+* **Configuration Management:** Viper (`github.com/spf13/viper`)
+* **Logging:** Logrus (`github.com/sirupsen/logrus`)
+* **AWS Interaction (EC2 Mode):** AWS SDK for Go v2 (`github.com/aws/aws-sdk-go-v2`)
+    * Services used: EC2, IAM, SSM
+* **HTTP Requests (Local Mode):** Go Standard Library (`net/http`)
+
+**Development Setup:**
+
+* Standard Go development environment (`go build`, `go test`, etc.).
+* Dependencies managed via Go Modules (`go.mod`, `go.sum`).
+* Likely developed within a containerized environment like Gitpod or Dev Containers for consistency.
+
+**Technical Constraints:**
+
+* **EC2 Mode:** Requires valid AWS credentials with sufficient permissions to create/manage EC2 instances, IAM roles/profiles, and security groups, and to use SSM. Assumes network connectivity for the AWS SDK itself.
+* **Local Mode:** Relies on the network connectivity of the machine running the CLI. May not accurately reflect connectivity from within specific AWS subnets if run externally.
+* **Go Version:** Compatibility depends on the Go version specified in `go.mod`.
+
+**Key Dependencies:**
+
+* `github.com/spf13/cobra`: CLI framework
+* `github.com/spf13/viper`: Configuration
+* `github.com/sirupsen/logrus`: Logging
+* `github.com/aws/aws-sdk-go-v2/*`: AWS SDK components
+* `golang.org/x/sync/errgroup`: Concurrency management
+* `k8s.io/apimachinery/pkg/util/wait`: Polling/waiting utilities (used in EC2 runner)
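To illustrate the Local Mode entries above, here is a minimal sketch of a HEAD-based reachability check built only on `net/http`. It applies the same below-500 rule as `local-runner.go` in this change. The `isReachableStatus` and `reachable` names and the in-process `httptest` server are assumptions for illustration, not part of the tool.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// isReachableStatus applies the leniency rule used by the local runner:
// any status code below 500 counts as reachable.
func isReachableStatus(code int) bool {
	return code < 500
}

// reachable sends a HEAD request and reports whether the endpoint answered
// with a status that isReachableStatus accepts, along with the status code.
func reachable(ctx context.Context, client *http.Client, endpoint string) (bool, int, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodHead, endpoint, nil)
	if err != nil {
		return false, 0, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return false, 0, err
	}
	resp.Body.Close() // The body of a HEAD response is empty, but close it anyway.
	return isReachableStatus(resp.StatusCode), resp.StatusCode, nil
}

func main() {
	// An in-process server stands in for real endpoints so the sketch runs offline.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		switch r.URL.Path {
		case "/forbidden":
			w.WriteHeader(http.StatusForbidden)
		case "/broken":
			w.WriteHeader(http.StatusBadGateway)
		default:
			w.WriteHeader(http.StatusOK)
		}
	}))
	defer srv.Close()

	client := &http.Client{Timeout: 5 * time.Second}
	for _, path := range []string{"/", "/forbidden", "/broken"} {
		ok, status, err := reachable(context.Background(), client, srv.URL+path)
		fmt.Println(path, ok, status, err)
	}
}
```

As the Local Mode constraint notes, a result from this kind of check describes the machine running it, not the subnets Gitpod would be deployed into.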