Skip to content

Prevent leaking seclist ingress and egress rules when updating a service #238

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Aug 30, 2018

Conversation

templecloud
Copy link
Contributor

This commit fixes a particular type of seclist rule leak in the CCM. When a services port or health check port is mutated then the old/original 'single-range' port rules created in the 'CCM ingress' and 'K8sWorker egress' seclists (or equivalent) are not deleted. This can be replicated consistently by running the 'Canary' e2e tests that performs a mutation test on a service. This fix also includes small optimisation to only create seclist rules for a loadbalancer after it has been successfully created.

This commit includes the following:

  1. Two small conservative fixes to clean up the leaking ingress/egress rules when a service port is mutated. This required a few small alterations to pass up existing port information to the 'getEgressRules()' functions.

  2. Some additional unit tests to check the above fixes.

  3. Added an OCI client to the e2e test framework and added a simple sanity check to count the number seclist rules for a port before and after mutation. To keep things simple this test only runs if configured to do so. Configuring the tests consists of specifying the OCID of the target seclists to check as environment variables.

This commit does not include any refactoring or tidying up of these existing seclist rule management functionality. I think this could be harmonised a little and made more clear; then surrounded with further tests. In particular the getLoadBalancerEgressRules() and getNodeIngressRules() are related but work differently. I rolled back such changes for this commit as there was other ongoing work on seclists.

Anyway, please check carefully; this is my first foray into this stuff and needed a fun rebase to get the new changes to logging etc.

Makefile Outdated
@@ -29,7 +29,7 @@ SRC_DIRS := cmd pkg # directories which hold app source (not vendored)

# Allows overriding where the CCM should look for the cloud provider config
# when running via make run-dev.
CLOUD_PROVIDER_CFG ?= $$(pwd)/cloud-provider.yaml
CLOUDCONFIG ?= $$(pwd)/cloud-provider.yaml
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to remember why I wanted to change this. Can change back.

--kubeconfig="${KUBECONFIG}" \
--cloud-config="${CLOUDCONFIG}" \
--delete-namespace=false
-- --kubeconfig=${KUBECONFIG} --cloud-config=${CLOUDCONFIG} --delete-namespace=true
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agh, the change in the Makefile above might have been to match this usage in the tests. I can still roll it back.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems irrelevant to this PR so would prefer it to be reversed to simplify review.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rolled this back, but reckon we should have kept it. No-one will fix it in a single PR and in any other PR the same argument can be made! Anyway, should be reverted now!

@@ -235,9 +235,9 @@ func buildConfigurationProvider(logger *zap.Logger, config *Config) (common.Conf
return cp, nil
}

// newRateLimiter builds and returns a struct containing read and write
// NewRateLimiter builds and returns a struct containing read and write
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I needed this when creating the new OCI client for the e2e tests. It seemed wrong to cut and paste the code. This function seems to be a 'constructor', so I made it public.

@@ -293,7 +287,18 @@ func (cp *CloudProvider) createLoadBalancer(ctx context.Context, spec *LBSpec) (
}

logger.With("loadBalancerID", *lb.Id).Info("Load balancer created")
return loadBalancerToStatus(lb)
status, err := loadBalancerToStatus(lb)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved adding seclists to after we have created the loadbalancer. If creation fails for some reason the rules will not be added. Could there be unforseen issues with this?

@@ -152,12 +152,20 @@ func (s *baseSecurityListManager) updateLoadBalancerRules(ctx context.Context, l

logger := s.logger.With("securityListID", *secList.Id)

lbEgressRules := getLoadBalancerEgressRules(logger, secList.EgressSecurityRules, nodeSubnets, ports.BackendPort, s.serviceLister)
lbEgressRules = getLoadBalancerEgressRules(logger, lbEgressRules, nodeSubnets, ports.HealthCheckerPort, s.serviceLister)
// -1 denotes nil ports.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to thread though the 'current' backEnd and health ports in-order to carefully delete the leaking rules. As I mentioned at some point I would like to tidy up this double invocation to getLoadBalancerEgressRules() and make it work more like getNodeIngressRules().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it just safe to use the zero value (i.e. 0) here as port 0 is reserved? Generally in-band errors / special values are a bit of a no-no (intertwined with the "make the zero value useful" philosophy).

@@ -362,6 +378,23 @@ func getNodeIngressRules(
ingressRules := []core.IngressSecurityRule{}

for _, rule := range rules {
// Remove (do not re-add) any rule that represents the old case when
// mutating a single ranged backend port or health check port.
if rule.TcpOptions != nil && rule.TcpOptions.DestinationPortRange != nil &&
Copy link
Contributor Author

@templecloud templecloud Aug 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the first part of the fix. I wanted to be on the safe side so made it very restrictive. It should only target the 'single port rule mutation' described.

@@ -531,8 +564,21 @@ func getLoadBalancerEgressRules(

egressRules := []core.EgressSecurityRule{}
for _, rule := range rules {
// Remove (do not re-add) any rule that represents the old case when mutating a single ranged port.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the second part of the fix. I wanted to be on the safe side so made it very restrictive. It should only target the 'single port rule mutation' described.

@@ -195,6 +195,37 @@ func TestGetNodeIngressRules(t *testing.T) {
makeIngressSecurityRule("1", 80),
makeIngressSecurityRule("2", 80),
},
}, {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unit test for health check port mutation. I chose to put more realistic strings in. IMHO it makes these tests easier to read/understand, especially if we added further input rules in. I can make it like the other though.

expected: []core.EgressSecurityRule{
makeEgressSecurityRule("existing", 9000),
makeEgressSecurityRule("1", 80),
makeEgressSecurityRule("2", 80),
},
}, {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unit test for service port mutation. I chose to put more realistic strings in. IMHO it makes these tests easier to read/understand, especially if we added further input rules in. I can make it like the other though.

@@ -64,13 +69,20 @@ type Framework struct {
ClientSet clientset.Interface
InternalClientset internalclientset.Interface

CloudProviderConfig *oci.Config // If specified, the CloudProviderConfig. This provides information on the configuration of the test cluster.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Configurable OCI client to run some seclist sanity checks.

}

// createOCIClient creates an OCI client derived from the CCM's cloud provider config file.
func createOCIClient(cloudProviderConfig *oci.Config) (client.Interface, error) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not see the point in adding configuration files. We grab the auth info from the cloud config.

@@ -0,0 +1,143 @@
/*
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A set of utility functions that count the number of ingress/egress port rules in the specified seclists. Used by the test check the right number of rules are mutated and all the old rules are deleted.

@@ -238,7 +246,7 @@ func (s *defaultSecurityListManager) Delete(ctx context.Context, lbSubnets []*co
noSubnets := []*core.Subnet{}
noSourceCIDRs := []string{}

err := s.updateLoadBalancerRules(ctx, lbSubnets, noSubnets, noSourceCIDRs, ports)
err := s.updateLoadBalancerRules(ctx, lbSubnets, noSubnets, noSourceCIDRs, &ports, ports)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chose sending in port twice over nil. I think this makes sense.

@@ -146,6 +146,7 @@ def file_extension(filename):
"test/e2e/load_balancer.go",
"test/e2e/framework/cleanup.go",
"test/e2e/framework/framework.go",
"test/e2e/framework/seclist_utils.go",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seclist_utils => seclist_util to be consistent with test/e2e/framework/service_util.go

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was matching actually file called 'network_utils'. I made all util files singular.

@@ -143,7 +143,7 @@ func (s *baseSecurityListManager) updateBackendRules(ctx context.Context, lbSubn

// updateLoadBalancerRules handles updating the ingress and egress rules for the load balance subnets.
// If the listener is nil, then only egress rules from the load balancer to the backend subnets will be checked.
func (s *baseSecurityListManager) updateLoadBalancerRules(ctx context.Context, lbSubnets []*core.Subnet, nodeSubnets []*core.Subnet, sourceCIDRs []string, ports portSpec) error {
func (s *baseSecurityListManager) updateLoadBalancerRules(ctx context.Context, lbSubnets []*core.Subnet, nodeSubnets []*core.Subnet, sourceCIDRs []string, actualPorts *portSpec, desiredPorts portSpec) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have actualPorts and desiredPorts here but one is a pointer, the other isn't. Is there any reason?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeagh, I agree - darn fuggly init?

To cut a long story short I think it derives from another bit of fuggly + typo on master (here)[https://github.com/oracle/oci-cloud-controller-manager/blob/e11429bc410ec62e52ba0aa504b777677bd3f908/pkg/oci/load_balancer_security_lists.go#L75].

I didn't fix it up on this run because I was trying to keep the changes small and close to the problem I am attempting to fix - as requested. It is really just cosmetic. I am happy to fix it up in this commit or as a subsequent one. Up to you. Please advise.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logically makes sense. Update is used both for creation and modification and therefore there are call sites that don't have any "actual" (existing) ports associated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just curious in case it was a typo. If there are call sites that require nil values then we can leave it as is.

@@ -95,7 +107,10 @@ func NewFramework(baseName string, client clientset.Interface) *Framework {
BaseName: baseName,
ClientSet: client,
}

// Dev/CI only configuration. The seclist for CCM load-balancer routes.
f.CCMSecListID = os.Getenv("CCM_SECLIST_ID")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both these values will default to an empty string if the env var is not set. Would that cause logical problems when passed to framework.CountSecListTunnelSinglePortRules ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A good thing to query, but, hopefully it is ok. I hope it is the case that if these seclists values are not set then the test calculations will collapse into 0 - so they will reported a s valid. I tested it manually and it seemed to work ok. We could make it explicit, but, it would make the tests more verbose to read.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than environment variables I'd prefer we used flags as we're already using them relatively extensively for configuration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// mutating a single ranged backend port or health check port.
if rule.TcpOptions != nil && rule.TcpOptions.DestinationPortRange != nil &&
*rule.TcpOptions.DestinationPortRange.Min == *rule.TcpOptions.DestinationPortRange.Max &&
*rule.TcpOptions.DestinationPortRange.Min != desiredPorts.BackendPort && *rule.TcpOptions.DestinationPortRange.Max != desiredPorts.BackendPort &&
Copy link
Member

@owainlewis owainlewis Aug 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This large logical conjunction looks quite prone to subtle mistakes and I think is duplicated again below. Would be worth factoring this into a smaller helper function and unit testing.

Copy link
Contributor Author

@templecloud templecloud Aug 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeagh, I mentioned this in the JIRA and I started something like this on the 'other branch'. However, I thought we were meant to be making this fix with a few changes possible and then iterating to improve other bits? I am happy to have a go at improving this now, but, was trying to avoid a 'large commit'. Do you want me to refactor this code in this branch along with the fix attempt?

@templecloud
Copy link
Contributor Author

Thanks for the review OL! You have suggested several good changes that I was originally going to raise a JIRA for and do as a separate bit work. Can you please confirm you want me to do it here and now? Then I will get cracking!

@@ -152,12 +152,20 @@ func (s *baseSecurityListManager) updateLoadBalancerRules(ctx context.Context, l

logger := s.logger.With("securityListID", *secList.Id)

lbEgressRules := getLoadBalancerEgressRules(logger, secList.EgressSecurityRules, nodeSubnets, ports.BackendPort, s.serviceLister)
lbEgressRules = getLoadBalancerEgressRules(logger, lbEgressRules, nodeSubnets, ports.HealthCheckerPort, s.serviceLister)
// -1 denotes nil ports.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it just safe to use the zero value (i.e. 0) here as port 0 is reserved? Generally in-band errors / special values are a bit of a no-no (intertwined with the "make the zero value useful" philosophy).

@@ -319,6 +324,7 @@ func (cp *CloudProvider) EnsureLoadBalancer(ctx context.Context, clusterName str
}
subnets := []string{cp.config.LoadBalancer.Subnet1, cp.config.LoadBalancer.Subnet2}
spec, err := NewLBSpec(service, nodes, subnets, ssl, cp.securityListManagerFactory)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

\n

@@ -319,6 +324,7 @@ func (cp *CloudProvider) EnsureLoadBalancer(ctx context.Context, clusterName str
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Newline from below under here IMHO 😎

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Classic Prydie! :)

@@ -143,7 +143,7 @@ func (s *baseSecurityListManager) updateBackendRules(ctx context.Context, lbSubn

// updateLoadBalancerRules handles updating the ingress and egress rules for the load balance subnets.
// If the listener is nil, then only egress rules from the load balancer to the backend subnets will be checked.
func (s *baseSecurityListManager) updateLoadBalancerRules(ctx context.Context, lbSubnets []*core.Subnet, nodeSubnets []*core.Subnet, sourceCIDRs []string, ports portSpec) error {
func (s *baseSecurityListManager) updateLoadBalancerRules(ctx context.Context, lbSubnets []*core.Subnet, nodeSubnets []*core.Subnet, sourceCIDRs []string, actualPorts *portSpec, desiredPorts portSpec) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logically makes sense. Update is used both for creation and modification and therefore there are call sites that don't have any "actual" (existing) ports associated.

func createCloudProviderConfig(cloudConfigFile string) (*oci.Config, error) {
file, err := os.Open(cloudConfigFile)
if err != nil {
return nil, errors.Wrap(err, fmt.Sprintf("Couldn't open cloud provider configuration: %s.", cloudConfigFile))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

errors.Wrapf

defer file.Close()
cloudProviderConfig, err := oci.ReadConfig(file)
if err != nil {
return nil, errors.Wrap(err, fmt.Sprintf("Couldn't create cloud provider configuration: %s.", cloudConfigFile))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

errors.Wrapf

rateLimiter := oci.NewRateLimiter(logger.Sugar(), cloudProviderConfig.RateLimiter)
ociClient, err := client.New(logger.Sugar(), ociClientConfig, &rateLimiter)
if err != nil {
return nil, errors.Wrap(err, fmt.Sprintf("Couldn't create oci client from configuration: %s.", cloudConfigFile))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

errors.Wrapf

@@ -226,3 +251,30 @@ func (f *Framework) AfterEach() {
Failf(strings.Join(messages, ","))
}
}

// createCloudProviderConfig unmarshalls the CCM's cloud provider config from the specified location so it can be used for testing.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally doc strings are wrapped at 80 chars.

@@ -95,7 +107,10 @@ func NewFramework(baseName string, client clientset.Interface) *Framework {
BaseName: baseName,
ClientSet: client,
}

// Dev/CI only configuration. The seclist for CCM load-balancer routes.
f.CCMSecListID = os.Getenv("CCM_SECLIST_ID")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than environment variables I'd prefer we used flags as we're already using them relatively extensively for configuration.


// CountSecListTunnelSinglePortRules counts the number of 'single port'
// (non-ranged) egress/ingress rules for the specified list and port.
func CountSecListTunnelSinglePortRules(oci client.Interface, egressSecListID, ingressSecListID string, port int) (int, int) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/CountSecListTunnelSinglePortRules/CountSinglePortSecListRules/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought this might come up! Just wanted to mention I preferred environment variables here as I only intended these tests to be executed by our CCM ci pipeline (where we manage the seclist ids). Not by the canary tests etc where we don't. I felt threading them through was a bit noisy. Still if you want flags for consistency ill do it.

@templecloud templecloud force-pushed the trjl/port-change-seclist-leak-fix branch from 391e9c2 to aab7c20 Compare August 30, 2018 10:24
@templecloud
Copy link
Contributor Author

Thanks for the review guys. Made all changes. Also added some additional e2e configuration to the wercker yaml after chatting with @prydie. Squashed all commits.

Copy link
Member

@kristenjacobs kristenjacobs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In essence, this boils down to a couple of bits of logic that say if a rule matches our old port, and not the new one, then we assume that this is no longer required thus it can be deleted. Sounds good to me.

@prydie
Copy link
Contributor

prydie commented Aug 30, 2018

🎉

@prydie prydie merged commit d94b7a7 into master Aug 30, 2018
@templecloud templecloud deleted the trjl/port-change-seclist-leak-fix branch September 14, 2018 13:18
ayushverma14 pushed a commit to ayushverma14/oci-cloud-controller-manager that referenced this pull request Jul 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants