
[bridge] Logs stuck workspace instances to validate fix #12902


Merged: 1 commit from lau/ws-pending into main on Sep 14, 2022

Conversation

@laushinka (Contributor) commented Sep 13, 2022

Description

This PR checks and logs if a workspace instance that ws-manager does not know about is in the pending, creating, or initializing phase, or if it has been in stopping for longer than 10 seconds (a guard in case of a race condition).
We will monitor the logs and validate whether we can deploy the planned fix here.
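
For reference, the condition described above corresponds roughly to the following sketch. This is not the merged code; the types and the durationLongerThanSeconds helper are simplified stand-ins for what ws-manager-bridge actually uses.

    // Sketch only: mirrors the condition described in this PR, with simplified types.
    type WorkspacePhase = "pending" | "creating" | "initializing" | "running" | "stopping" | "stopped";

    interface WorkspaceInstanceLike {
        creationTime: string;
        stoppingTime?: string;
        status: { phase: WorkspacePhase };
    }

    // Stand-in for the helper used in the diff excerpts below.
    function durationLongerThanSeconds(since: number, seconds: number, now: number = Date.now()): boolean {
        return now - since > seconds * 1000;
    }

    function shouldLogAsStuck(instance: WorkspaceInstanceLike): boolean {
        const phase = instance.status.phase;
        return (
            phase === "pending" ||
            phase === "creating" ||
            phase === "initializing" ||
            // guard against races: only treat "stopping" as stuck after ~10 seconds
            (phase === "stopping" &&
                instance.stoppingTime !== undefined &&
                durationLongerThanSeconds(Date.parse(instance.stoppingTime), 10))
        );
    }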

Related Issue(s)

Relates #11397
Relates #12955

How to test

Release Notes

NONE

Documentation

Werft options:

  • /werft with-preview

@werft-gitpod-dev-com

started the job as gitpod-build-lau-ws-pending.2 because the annotations in the pull request description changed
(with .werft/ from main)

@laushinka (Contributor, Author) commented Sep 13, 2022

/werft run

👍 started the job as gitpod-build-lau-ws-pending.4
(with .werft/ from main)

@laushinka laushinka marked this pull request as ready for review September 13, 2022 09:35
@laushinka laushinka requested a review from a team September 13, 2022 09:35
@laushinka laushinka marked this pull request as draft September 13, 2022 09:36
@github-actions github-actions bot added the team: webapp (Issue belongs to the WebApp team) label Sep 13, 2022
@AlexTugarev (Member)

Hi @laushinka! Is this ready for review?

@laushinka (Contributor, Author)

> Hi @laushinka! Is this ready for review?

@AlexTugarev I had issues where the previews would fail, but it looks like the builds succeed now. Would you like to review it? I can squash the changes later to avoid the build possibly breaking in the middle of a review.

@laushinka laushinka marked this pull request as ready for review September 13, 2022 15:19
@laushinka (Contributor, Author)

Holding to squash later.

/hold

if (
    !(
        instance.status.phase === "running" ||
        durationLongerThanSeconds(Date.parse(instance.creationTime), maxTimeToRunningPhaseSeconds)
Member

If I understood correctly, you are adding a timeout for the pending phase.
We handle timeouts for the various phases in this method. It would be interesting to understand why the case in L583 doesn't cover this.

Member

@svenefftinge We have two controlling methods, which check instances in different phases, to clearly separate concerns. This was introduced after we detected outdated and broken logic in the old, single method. 🙃

  1. This one here is responsible for handling phases which ws-manager is responsible for, e.g. things that for whatever reason fell through the cracks, incl. running and pending.
  2. The one you linked is responsible for controlling phases WebApp is responsible for, e.g. preparing and building.

FYI: There is a diagram that tries to illustrate this a little bit.
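
A rough sketch of that split: the phase groupings follow the comment above, while the function and constant names are made up purely for illustration and do not exist in the codebase.

    // Illustration only: the responsibility split described above, with placeholder names.
    type Phase = "preparing" | "building" | "pending" | "creating" | "initializing" | "running" | "stopping" | "stopped";

    // 1. Phases ws-manager owns: reconciled against ws-manager's view (this PR's code path).
    const WS_MANAGER_OWNED: Phase[] = ["pending", "creating", "initializing", "running", "stopping"];
    // 2. Phases WebApp owns: handled by the other controlling method with its own timeouts.
    const WEBAPP_OWNED: Phase[] = ["preparing", "building"];

    function responsibleController(phase: Phase): "ws-manager-bridge" | "webapp" | "none" {
        if (WS_MANAGER_OWNED.includes(phase)) return "ws-manager-bridge";
        if (WEBAPP_OWNED.includes(phase)) return "webapp";
        return "none"; // e.g. "stopped" needs no controlling
    }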

Member

Ah, yes, that makes sense. I was confusing pending and preparing

Member

> I was confusing pending and preparing

Yeah, I agree. All of this feels more complex than it should 🙄

@laushinka I'd really love to have at least a comment on this line which explains why we think this is an instance we do (not) want to stop.

@svenefftinge (Member) commented Sep 14, 2022

Looking at the logic now (btw, it's a little hard to wrap my head around the negated "or", tbh 😇), it seems like it would also handle all stopping cases. That is generally great, as we have leaking stopping states (#12938, #12939, #10624) that we should handle.
But we should apply the timeout on top of stoppingTime instead of creationTime in this case.
Also, I think this code would now also trigger for the phases WebApp is responsible for if this timeout is smaller than the respective timeout for the phase.

How about we explicitly list the phases we want to handle in this function (i.e. ws-manager responsibilities) and use a single timeout against max(CreationTime, StartedTime, StoppingTime) (i.e. the last signal we got)? The timeout doesn't really need to be configurable, I think, but can be rather small and static. We mainly want to make sure we don't act on anything that was removed from ws-manager just this moment (i.e. races).
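
A minimal sketch of that suggestion, not code from this PR: the field names follow the camelCase columns visible in the diff excerpts, and the timeout value is an arbitrary placeholder.

    // Sketch of the proposal: handle only ws-manager-owned phases, with one small static
    // timeout measured against the latest signal we received for the instance.
    const CONTROL_TIMEOUT_SECONDS = 60; // placeholder; "rather small and static"

    interface InstanceTimes {
        creationTime: string;
        startedTime?: string;
        stoppingTime?: string;
        status: { phase: string };
    }

    function lastSignal(i: InstanceTimes): number {
        return Math.max(
            Date.parse(i.creationTime),
            i.startedTime ? Date.parse(i.startedTime) : 0,
            i.stoppingTime ? Date.parse(i.stoppingTime) : 0,
        );
    }

    function shouldActOnUnknownInstance(i: InstanceTimes, now: number = Date.now()): boolean {
        // Explicitly list the phases this function is responsible for (ws-manager's side);
        // WebApp-owned phases (preparing, building) are handled elsewhere.
        const wsManagerPhases = ["pending", "creating", "initializing", "running", "stopping"];
        if (!wsManagerPhases.includes(i.status.phase)) {
            return false;
        }
        // Don't act on anything we heard from ws-manager about only moments ago (races).
        return now - lastSignal(i) > CONTROL_TIMEOUT_SECONDS * 1000;
    }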

@@ -603,6 +619,7 @@ export class WorkspaceManagerBridge implements Disposable {
         const nowISO = now.toISOString();
         info.latestInstance.stoppingTime = nowISO;
         info.latestInstance.stoppedTime = nowISO;
+        info.latestInstance.status.message = `Stopped by ws-manager-bridge. Previously in phase ${info.latestInstance.status.phase}`;
Member

👍

@laushinka laushinka force-pushed the lau/ws-pending branch 2 times, most recently from 21bc213 to 1ea3e3b on September 14, 2022 12:50
log.info(
    { instanceId, workspaceId: instance.workspaceId },
    "Database says the instance is running, but wsman does not know about it. Marking as stopped in database.",
    { installation },
Member

nit: phase would be nice to have in here as well.

Member

Could you also put all the properties into one object? I.e., as in:

log.info("my message ...", { all, the, properties })

Member

@svenefftinge There is a reason we have two bags: The first one is the OWI (owner-workspace-instance) shape. Having this typed means it's very easy to find all logs related to a specific user or workspace or instance, across all components, across all tools (logs and tracing). Moving it out of there breaks this. 😕
The second bag at the end is specifically for payload (arbitrary shape), which ends up in a different place in log entries.
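
In other words, the call shape looks roughly like the sketch below. Only the argument order reflects the convention described above; the import path and the wrapper function are assumptions for illustration, not the exact code from this PR.

    // Sketch of the two-bag logging convention.
    import { log } from "@gitpod/gitpod-protocol/lib/util/logging"; // assumed import path

    function logUnknownInstance(
        instance: { id: string; workspaceId: string; status: { phase: string } },
        installation: string,
    ): void {
        log.info(
            // first bag: the typed OWI context, searchable across components and tools
            { instanceId: instance.id, workspaceId: instance.workspaceId },
            "Database says the instance is running, but wsman does not know about it.",
            // second bag: arbitrary payload, which lands in a different place in the log entry
            { installation, phase: instance.status.phase },
        );
    }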

Member

Oh sorry, I didn't know we have this special OWI object.

Contributor (Author)

I didn't know either. Will create a PR to fix it back!

"Database says the instance is running, but wsman does not know about it. Marking as stopped in database.",
{ installation },
);
await this.markWorkspaceInstanceAsStopped(ctx, ri, new Date());
Member

@svenefftinge @laushinka What about rolling out this change more defensively? Maybe by copying some code, or by introducing a separate branch for all checks that are "new"?

And then either use a feature flag for the rollout, or start with "logging first, enable later"? My point is that this is at the core of our software, and we really don't want to ruin someone's day because of assumptions we make about old code of ours. 🧘

Contributor (Author)

Definitely. I was thinking about how to test this better, because so far I tested it by cordoning the preview env node, but it feels lacking.

"Logging first, enable later" could be a start. Could you elaborate on the separate-branch idea?

Contributor (Author)

Also happy to have another pair of eyes when testing. So far my flow is:

  1. Cordon the node
  2. Run a workspace
  3. See it stuck on pending
  4. See stoppingTime and stoppedTime inserted into the DB ✅

But for some reason it takes a while until the phase is stopped.

Member

Maybe you then put the old condition back in and add the new if branch with just logging. We can review the log messages and then activate the 'stopping' in another deploy.

Contributor (Author)

YES let's do that 🚀

@laushinka laushinka changed the title from "Stops stuck workspaces" to "[bridge] Logs stuck workspace instances to validate fix" on Sep 14, 2022
-        if (instance.status.phase !== "running") {
+        const phase = instance.status.phase;
+        if (phase !== "running") {
+            // This below if block is to validate the planned fix
@laushinka (Contributor, Author) commented Sep 14, 2022

I did it this way because we want to keep the previous check. Therefore I also had to remove the if running check below. Does this make sense? @svenefftinge @geropl
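
Assembled from the excerpts above, the resulting shape is roughly the following. This is a sketch, not the literal merged diff: the log message is illustrative, and durationLongerThanSeconds and log are the helpers already visible in the excerpts; in the real code this sits inside the existing control loop rather than a standalone function.

    // Sketch: keep the previous "not running" handling and add a branch that only logs
    // instances matching the planned fix's condition (no stopping yet).
    function controlNonRunningInstance(
        instance: { id: string; workspaceId: string; stoppingTime?: string; status: { phase: string } },
        installation: string,
    ): void {
        const phase = instance.status.phase;
        if (phase !== "running") {
            // This below if block is to validate the planned fix
            if (
                phase === "pending" ||
                phase === "creating" ||
                phase === "initializing" ||
                (phase === "stopping" &&
                    instance.stoppingTime !== undefined &&
                    durationLongerThanSeconds(Date.parse(instance.stoppingTime), 10))
            ) {
                log.info(
                    { instanceId: instance.id, workspaceId: instance.workspaceId },
                    "Instance is stuck in a phase ws-manager does not know about.", // illustrative message
                    { installation, phase },
                );
            }
            // ...the previous handling of non-running instances continues unchanged here...
        }
    }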

Member

perfect! let's see how it does in production 😅

@laushinka (Contributor, Author) commented Sep 14, 2022

/werft run

👍 started the job as gitpod-build-lau-ws-pending.16
(with .werft/ from main)

@roboquat roboquat merged commit f2af999 into main Sep 14, 2022
@roboquat roboquat deleted the lau/ws-pending branch September 14, 2022 20:09
phase === "initializing" ||
(phase === "stopping" &&
instance.stoppingTime &&
durationLongerThanSeconds(Date.parse(instance.stoppingTime), 10))
@geropl (Member) commented Sep 15, 2022

@laushinka One more comment on the timeout discussion (sorry for doing this out of band, but I couldn't wrap my head around it yesterday): We have to make sure that short disconnects between application and workspace clusters don't lead to falsely stopped workspaces. We've seen instances in the past where we had problems on and off for a couple of minutes (might have been due to bad gRPC connection options, might or might not be better by now). With this change, this would lead to a situation where we mark the workspace as stopped in the database while users can still use and work in their workspaces perfectly fine. 😬

I guess we just have to be very careful with this change. Let's monitor this for now and for a couple of days, plus double-check whether we find hints in the logs on bridge -> ws-manager re-connects (this should be the relevant line in code).

Labels
  • deployed: webapp (Meta team change is running in production)
  • deployed (Change is completely running in production)
  • release-note-none
  • size/M
  • team: webapp (Issue belongs to the WebApp team)

5 participants