CDK refactoring support #705

Open · wants to merge 30 commits into base: main
Conversation

@otaviomacedo (Contributor) commented Feb 19, 2025

This is a request for comments about CDK Refactoring Support. See #162 for
additional details.

APIs are signed off by @rix0rrr.


By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache-2.0 license

@otaviomacedo otaviomacedo marked this pull request as ready for review February 24, 2025 13:56
state nor in the desired state.

In particular, the logical ID won't match the CDK construct path, stored in the
resource's metadata. This has consequences for the CloudFormation console, which
Member

This has consequences for the CloudFormation console, which will show a Tree view that is not consistent with the Flat view.

Is this the only user-facing impact? If so, can the user rerun the CDK commands and expect success? If not, do we have a workaround to fix a CDK app in this state?

Contributor Author

can user rerun the cdk commands and expect success?

Yes, in the next run, there will be nothing to refactor, so it goes straight to deployment. If the deployment succeeds, the consistency is restored.

is this the only user-facing impact?

This is the most obvious one. But there is a more esoteric case I can think of:

  1. Developer renames a resource from "A" to "B", and also adds some new ones.
  2. Refactor succeeds.
  3. Deployment fails and the stack is rolled back.
  4. Before it has a chance to revert the refactor, the CLI gets interrupted.
  5. Developer decides to abandon that change, and reverts the code to the
    previous state (before the change described in step 1). Now, in the code, the
    resource is called "A", but in the deployed stack, it is called "B".
  6. Some time later, some other developer, unaware of this discrepancy, makes a
    change to the content of resource "A", believing this will lead to an update.
  7. The CDK CLI won't detect this as a rename (because of the content change),
    and will thus proceed to deployment, leading to the replacement of resource "B".

Contributor

Won't this also affect `cdk diff`? It will show metadata changes.

Contributor Author

Good point. But that works to our advantage, then, doesn't it? It's a warning to developers that it's probably not safe to deploy. But, of course, they can still ignore it.

Your example was exactly what came to mind for me as a risk here.

Step 1 occurring may sound like a super-niche case, but in practice this kind of thing happening on a large team with CD pipelines is a fairly regular occurrence. It's concerning when the outcome is potentially catastrophic.

How about the CLI being smarter about detecting the current CloudFormation status? Something like: if refactoring is turned on and the CLI sees that the current stack status is `UPDATE_ROLLBACK_*`, then check whether a refactor rollback is needed before continuing with the new deploy?

Another option would be something like the refactor plan being stored as state somewhere (the CDK asset bucket?), and this state only gets deleted if the stack update completes, or the stack update fails and the refactor rollback completes. If the deploy gets interrupted, on the next deploy the CDK would detect the refactor state and complete the rollback before continuing with the next deploy.
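That second option could be sketched roughly as follows. All names here are hypothetical (nothing in this sketch is existing CDK CLI code), and the in-memory map stands in for the asset bucket:

```typescript
// Hypothetical sketch of "persist the refactor plan, finish the rollback
// on the next deploy". In a real implementation the store would be the
// CDK asset bucket, not an in-memory map.
interface PendingRefactor {
  stackName: string;
  mapping: Record<string, string>; // old logical ID -> new logical ID
}

const store = new Map<string, PendingRefactor>();

function beginRefactor(pending: PendingRefactor): void {
  // Written just before the refactor is executed.
  store.set(pending.stackName, pending);
}

function completeRun(stackName: string): void {
  // Called when the deploy succeeds, or when the refactor rollback
  // completes after a failed deploy.
  store.delete(stackName);
}

function nextDeploy(stackName: string): string[] {
  const steps: string[] = [];
  if (store.has(stackName)) {
    // Leftover state: a previous run was interrupted before the refactor
    // was either confirmed or rolled back. Finish the rollback first.
    steps.push('rollback-refactor');
    completeRun(stackName);
  }
  steps.push('deploy');
  return steps;
}
```

On an uninterrupted run, `completeRun` clears the state at the end, so the next deploy proceeds normally; only an interrupted run leaves state behind and triggers the rollback step.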

Is refactoring itself atomic, or could some resources be refactored and others not? Could such a state be recovered from?

Contributor Author

@bdoyle0182 that's a good suggestion. Thanks!

@SimonCMoore yes, refactoring itself is atomic.

It's worth noting that there are at least two cases in which the mapping
produced by the CLI may not be what you want:

- Ambiguity: you may find yourself in the very unlikely situation in which two
Contributor

For some resources like associations which barely have any properties of their own, this will be quite likely 😉.

Also, are we going to handle cascading property changes? I.e., one of my properties references another resource which is itself refactored/moved.

Contributor Author

For some resources like associations which barely have any properties of their own, this will be quite likely 😉.

The identity (digest) of a resource is based not only on its properties but also on the identities of its dependencies. Different association resources will have different dependencies, and therefore different identities.

Also, are we going to handle cascading property changes? I.e., one of my properties references another resource which is itself refactored/moved.

There is no cascading effect. The digest doesn't take into account the resource location, only its identity.
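The digest idea can be illustrated with a small sketch (hypothetical code, not the actual CLI implementation): a resource's digest hashes its own properties together with the digests of its dependencies, so two otherwise property-less association resources still get distinct identities, and since the logical ID is never hashed, moving a resource does not cascade into its dependents' digests.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical sketch: identity (digest) = hash of a resource's own
// properties plus the digests of everything it depends on.
interface Resource {
  properties: Record<string, unknown>;
  dependencies: string[]; // logical IDs of referenced resources
}

function digest(
  id: string,
  resources: Record<string, Resource>,
  memo: Map<string, string> = new Map(),
): string {
  const cached = memo.get(id);
  if (cached !== undefined) return cached;
  const res = resources[id];
  // Sort so the digest does not depend on dependency ordering.
  const depDigests = res.dependencies.map((d) => digest(d, resources, memo)).sort();
  const h = createHash('sha256')
    .update(JSON.stringify(res.properties))
    .update(depDigests.join('|'))
    .digest('hex');
  memo.set(id, h);
  return h;
}
```

Two association resources with empty properties but different dependencies produce different digests; renaming a dependency changes nothing, because only identities, never locations, are hashed.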

Comment on lines 310 to 311
affect the resources themselves. The worst that can happen is you ending up with
incorrect resource IDs in CloudFormation. The second thing is that refactors can
Contributor

You say this like it's not a big deal, but even though the refactor itself won't be too bad, the next deployment can mess things up pretty badly if you're not paying attention.

@otaviomacedo (Contributor Author) commented Mar 19, 2025

Ok, two questions then:

  1. Can you elaborate? What is the mess up scenario you have in mind?
  2. What should we do? Always fail in ambiguous cases? (I'm open to considering that)

Contributor

What is the mess up scenario you have in mind?

I don't know yet 😆. But I do know that (for a super-contrived example) if you have a SecretsBucket and a PublicBucket in your code, and for some reason the refactoring mixed up the bucket references so that the PublicBucket logicalID now points to mystack-secretsbuckcv7sd732, and your next deployment applies a "world readable" policy to the PublicBucket, because that seems okay ... you wouldn't be happy.

What should we do?

I don't know if we can necessarily do something. I like the rollback solution. All I'm saying is this paragraph is trivializing the potential problem by saying "ah, no biggie, the worst that can happen is you mix up the mapping".

Well yes, mixing up the mapping itself isn't a problem, it's what happens after you mix up the mapping! And it's hard to say what the consequences of that might be.

Comment on lines +519 to +520
Since the CloudFormation API expects not only the mappings, but also the
templates in their final states, we need to compute those as well. This is done
Contributor

Is that not the output of a regular synth already?

Contributor Author

Not exactly. The output of a synth may contain additions and deletions as well.

Contributor

I don't follow.

  • A synth by itself returns a desired state.
  • Additions and deletions are operations, which are the result of detect_changes(state0, state1).

What do you mean when you say that synth returns additions and deletions?

Contributor Author

Yes, I mean to say that the diff between the two states may contain additions and deletions (also updates). The refactoring API only accepts a state whose diff with the current state is empty, up to refactors. This is why we can't use what synth generated.
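A sketch of what "empty up to refactors" could mean in practice (a hypothetical helper, not the actual CLI code): the template sent to the refactor API is the deployed template with only logical-ID moves applied, so its diff against the current state contains no additions, deletions, or updates. A real implementation would also have to rewrite `Ref`/`Fn::GetAtt` references to the moved IDs, which this sketch omits.

```typescript
// Hypothetical sketch: build the template to send to the refactor API
// by applying only logical-ID moves to the *deployed* template, rather
// than using the synthesized template (which may additionally contain
// additions, deletions and updates).
type Template = Record<string, unknown>;

function applyRefactorOnly(
  deployed: Template,
  mapping: Record<string, string>, // old logical ID -> new logical ID
): Template {
  const result: Template = {};
  for (const [oldId, body] of Object.entries(deployed)) {
    result[mapping[oldId] ?? oldId] = body; // same body, possibly a new ID
  }
  return result;
}
```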

2. Call the refactor API to move the bucket from `Producer1` to `Producer2`,
preserving the old output in `Producer1` while creating a new output in
`Producer2`, that references the bucket.
3. Update the `Fn::ImportValue` in `Consumer` to point to the new output in
Contributor

So 3 deployments, right?

Contributor Author

Yes.

enough". To achieve this, we need a few things:

- **A distance function**. To generalize the notion of strict equality to a more
flexible notion of similarity, we need to come up with a distance function,
Contributor

I'm wondering (out loud) whether this value should be a member of [0, 1, 2, ...], or a member of [0..1].

I.e., is the size of the difference absolute, or relative to the size of the objects we are comparing?

Contributor Author

Yeah, I don't know. I'm guessing this will be some variation of an edit distance, and therefore will have no upper bound.
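For intuition on the absolute-vs-relative question above, here is a toy metric (purely illustrative, not a proposed implementation): counting differing property keys gives an unbounded edit-distance-like value, and normalizing by the number of keys compared squeezes it into [0..1].

```typescript
// Toy distance between two resources: the number of property keys whose
// values differ (a crude, unbounded edit-distance variant).
function propertyDistance(
  a: Record<string, unknown>,
  b: Record<string, unknown>,
): number {
  const keys = new Set([...Object.keys(a), ...Object.keys(b)]);
  let d = 0;
  for (const k of keys) {
    if (JSON.stringify(a[k]) !== JSON.stringify(b[k])) d++;
  }
  return d;
}

// A [0..1] alternative: normalize by the number of keys compared, so the
// value is relative to the size of the objects.
function relativeDistance(
  a: Record<string, unknown>,
  b: Record<string, unknown>,
): number {
  const keys = new Set([...Object.keys(a), ...Object.keys(b)]);
  return keys.size === 0 ? 0 : propertyDistance(a, b) / keys.size;
}
```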

decide whether two resources are similar enough to be candidates for
refactoring. It will depend on the distance function we choose and also on
experimentation.
- **Graph isomorphism**. Since we will not have a digest anymore to quickly find
Contributor

All very true, but I think we can probably wing it with a little user feedback and be mostly fine.

Take objects of the same type, order them by distance, ask the user to pick the mapping. Bob's your uncle 😎

@otaviomacedo otaviomacedo marked this pull request as ready for review March 19, 2025 13:13
Comment on lines +130 to +135
- If you pass any filters to the `deploy` command, the refactor will work on
those stacks plus any other stacks the refactor touches. For example, if you
choose to only deploy stack A, and a resource was moved from stack A to stack
B, the refactor will involve both stacks. But if there was a rename in, let's
say, stack C, it will not be refactored. The same set of filters is available
for the `refactor` command.
Contributor

There's also an `--exclusive` flag. This might need defining.


? What do you want to do? (Use arrow keys)
❯ Execute the refactor and deploy
Deploy without refactoring (will cause resource replacement)
Contributor Author

This is what the `--export-mapping` (or `--record-mapping`, etc.) flag will do.


For both `deploy` and `refactor`:

- `--record-resource-mapping=<FILE>`: writes the mapping to a file. The file can

I am unclear whether, when I use this, it only writes the mapping file out, OR it does the remapping and also writes a file out (to apply elsewhere).

Even though all the resources involved in this scenario would remain unchanged,
we have decided to err on the side of caution and not perform the refactor. If,
as the result of computing a refactor, the CLI detects such a case of ambiguity,
it will fail with an error message explaining the situation.

I am unclear whether we could output a potential mapping file with `--record-resource-mapping`, edit it, and then use that to do the correct mapping, or whether that will also stop before creating the output file. (Maybe the output file should have some warnings that the ambiguity needs human review.)

I assume the ambiguity can be resolved by supplying the mapping file (however obtained), please make that clear.

For both `deploy` and `refactor`:

- `--record-resource-mapping=<FILE>`: writes the mapping to a file. The file can
be used later to apply the same refactors in other environments.
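For illustration, a recorded mapping file could look something like this (a hypothetical structure; the RFC does not specify the format, and the account/region and logical IDs below are made up):

```json
{
  "environments": [
    {
      "account": "123456789012",
      "region": "us-east-1",
      "resources": {
        "ProducerStack.OldBucketLogicalId": "ProducerStack.NewBucketLogicalId"
      }
    }
  ]
}
```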

I was under the impression that resource mappings are environment-specific. So how can a resource mapping be applied to other environments? Do you have an example of how you envision the file's JSON structure?

Contributor

That's an extremely good point, thanks for catching this!

People can use CDK in 2 styles, and we don't tell them to prefer one over the other:

  • Style 1: define a single copy of a single stage in your app, with environment-agnostic stacks. Do 5 cdk deploys with different configs to deploy to 5 different environments.
  • Style 2: define a single app with all 5 stages in there, each specialized for a different environment. In principle you could deploy to 5 environments with a single cdk deploy, or you could do cdk deploy StackA && cdk deploy StackB && cdk deploy ....

@otaviomacedo, what does the record/apply workflow look like in both styles? Does it make a difference which one they use? How does Style 2 with recording work in—let's say—a homegrown Jenkins pipeline?

Also: what does the applier need to apply? Do they need the CDK source again? At the same source revision? Or do they just need a copy of the .json file and the CLI and that's it?

Contributor Author

It depends on the state of each environment, more than the style used. Examples:

  • What you want to refactor is the same across all environments (despite other things being different): then you can generate one file and use it everywhere.
  • Different environments have different logical IDs involved in the refactor: then you'll need to generate one file for each environment, and configure your CI/CD to use the right file for the right environment.

This is why I'm moving away from the "mapping files as the main mechanism" idea. It's just too complicated to deal with all the variations, and different states at different points in time.

consuming a mapping file, or interactively, when the CLI computes the refactor
to be made, and the user is asked whether to proceed. If you want to use this
feature in a CI/CD pipeline, so that the refactors are applied automatically,
you must commit the mapping file along with your code and use it on each
Member

Will the mapping file be used in the next refactoring, or is it generated without considering the history? What happens if a customer refactors the same resource twice?

Comment on lines 14 to 20
AWS CloudFormation identifies resources by their logical IDs. As a consequence,
if you change the logical ID of a resource after it has been deployed,
CloudFormation will create a new resource, with the new logical ID, and possibly
delete the old one. For stateful resources, this may cause interruption of
service or data loss, or both.

Historically, we have advised developers to avoid changing logical IDs. But this
Contributor

💅 Extreme nit: this needs a connecting sentence and should clarify that it is about construct IDs. All of us know this so we skip over it, but for the clarity of this document it would help if we were more precise:

AWS CloudFormation identifies resources by their logical IDs.
Logical IDs are derived from construct IDs.
As a consequence, if you change the ~~logical~~ construct ID of a resource after it has been deployed,
CloudFormation will create a new resource, with the new logical ID, and possibly
delete the old one. For stateful resources, this may cause interruption of
service or data loss, or both.

Historically, we have advised developers to avoid changing ~~logical~~ construct IDs. But this

(etc)

└─ Function

Even though none of the resources have changed, their paths have (from
`MyStack/Bucket/Resource` to `Web/Website/Origin/Resource` etc.) Since the CDK
Contributor

💅 I like the example! But you are now mentioning the L1s, and in the discussion above you hid the L1s. This is expecting a deep understanding of CDK internals from the user, or they will be left confused.

Suggested change:
- `MyStack/Bucket/Resource` to `Web/Website/Origin/Resource` etc.) Since the CDK
+ `MyStack/Bucket` to `Web/Website/Origin` etc.) Since the CDK

Will do, it doesn't change the point of the story and doesn't rely on being extremely familiar with internals.

Comment on lines +116 to +119
? What do you want to do? (Use arrow keys)
❯ Execute the refactor and deploy
Deploy without refactoring (will cause resource replacement)
Quit
Contributor

Wait, I didn't get this. I write --refactoring-action=refactor, and then I get asked a question if I want to refactor?

I would expect the dialog above to pop up always, and passing a CLI flag implies my answer before I run the command?

`cdk refactor --resource-mapping=file.json` on every protected environment in
advance (i.e., before your changes get deployed to those environments). When
you import a mapping, the CLI won't try to detect refactors.
2. The `--resource-mapping` option is also available for the `deploy` command.
Contributor

For the deploy command I think something like --refactor-resource-mapping might make sense, to make clear that this is for the refactoring feature.

Or can we just call it a "mapping", and use:

cdk refactor --mapping=file.json
cdk refactor --record-mapping=file.json
cdk deploy --refactor-mapping=file.json

?

For `refactor` only:

- `--dry-run`: prints the mapping to the console, but does not apply it.
Contributor

Does this mean you need both --record-resource-mapping and --dry-run to prepare a refactoring for another team to execute?


time the logical ID changes.

To solve this, CDK constructs can be automatically excluded by calling the new
method `Stack.skipRefactoring(constructToBeSkipped)`. By calling this method,
Contributor Author

@rix0rrr is this a good way to do it?

Contributor

I think it will be some form of metadata.

`Annotations.of(constructToBeSkipped).add(SKIP_REFACTORING, true);`

Or something like that? Could be abstracted via a Stack method, sure.
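A rough sketch of how such a metadata-backed `skipRefactoring` could behave (illustrative names only; `SKIP_REFACTORING`, the metadata shape, and the helper functions are assumptions, not the real CDK API):

```typescript
// Hypothetical sketch of Stack.skipRefactoring backed by construct
// metadata. None of these names are the real CDK API.
const SKIP_REFACTORING = 'aws:cdk:skip-refactoring';

interface ConstructLike {
  metadata: Record<string, unknown>;
}

function skipRefactoring(construct: ConstructLike): void {
  // The CLI would later read this flag from the synthesized tree and
  // exclude the construct's resources from refactor detection.
  construct.metadata[SKIP_REFACTORING] = true;
}

function isSkipped(construct: ConstructLike): boolean {
  return construct.metadata[SKIP_REFACTORING] === true;
}
```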

@rix0rrr rix0rrr added the pr/do-not-merge Let mergify know not to auto merge label Apr 4, 2025
@rix0rrr previously approved these changes Apr 4, 2025
@rix0rrr (Contributor) left a comment

Final comments

files as a starting point to edit the mapping, combine multiple mappings into
one, split mappings into multiple files, etc.

#### Skip file
Contributor

Isn't this better in one file together with the refactorings?

Contributor Author

That was my initial idea. But that file would be used to:

  1. Override the result of a computed mapping (the skip part).
  2. Preclude mapping computation altogether (the refactor part).

These two functions are mutually exclusive. It would be confusing to have one file with both.

@mergify mergify bot dismissed rix0rrr’s stale review April 4, 2025 15:22

Pull request has been modified.

@otaviomacedo otaviomacedo removed the pr/do-not-merge Let mergify know not to auto merge label Apr 4, 2025