
[bridge] Logs stuck workspace instances to validate fix #12902


Merged: 1 commit from lau/ws-pending into main on Sep 14, 2022

Conversation

@laushinka (Contributor) commented Sep 13, 2022

Description

This PR checks and logs if a workspace instance that ws-manager does not know about is in the pending, creating, or initializing phase, or if it has been in stopping for longer than 10 seconds (a guard in case of a race condition).
We will monitor the logs and validate whether we can deploy the planned fix here.
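
For reference, the condition described above corresponds roughly to the following sketch. This is not the merged code; the types and the durationLongerThanSeconds helper are simplified stand-ins for what ws-manager-bridge actually uses.

    // Sketch only: mirrors the condition described in this PR, with simplified types.
    type WorkspacePhase = "pending" | "creating" | "initializing" | "running" | "stopping" | "stopped";

    interface WorkspaceInstanceLike {
        creationTime: string;
        stoppingTime?: string;
        status: { phase: WorkspacePhase };
    }

    // Stand-in for the helper used in the diff excerpts below.
    function durationLongerThanSeconds(since: number, seconds: number, now: number = Date.now()): boolean {
        return now - since > seconds * 1000;
    }

    function shouldLogAsStuck(instance: WorkspaceInstanceLike): boolean {
        const phase = instance.status.phase;
        return (
            phase === "pending" ||
            phase === "creating" ||
            phase === "initializing" ||
            // guard against races: only treat "stopping" as stuck after ~10 seconds
            (phase === "stopping" &&
                instance.stoppingTime !== undefined &&
                durationLongerThanSeconds(Date.parse(instance.stoppingTime), 10))
        );
    }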

Related Issue(s)

Relates #11397
Relates #12955

How to test

Release Notes

NONE

Documentation

Werft options:

  • /werft with-preview

@werft-gitpod-dev-com

started the job as gitpod-build-lau-ws-pending.2 because the annotations in the pull request description changed
(with .werft/ from main)

@laushinka (Contributor, Author) commented Sep 13, 2022

/werft run

👍 started the job as gitpod-build-lau-ws-pending.4
(with .werft/ from main)

@laushinka laushinka marked this pull request as ready for review September 13, 2022 09:35
@laushinka laushinka requested a review from a team September 13, 2022 09:35
@laushinka laushinka marked this pull request as draft September 13, 2022 09:36
@github-actions github-actions bot added the team: webapp (Issue belongs to the WebApp team) label Sep 13, 2022
@AlexTugarev (Member)

Hi @laushinka! Is this ready for review?

@laushinka (Contributor, Author)

> Hi @laushinka! Is this ready for review?

@AlexTugarev I had issues where the previews would fail, but it looks like the builds succeed now. Would you like to review it? I can squash the changes later to avoid the build possibly breaking in the middle of a review.

@laushinka laushinka marked this pull request as ready for review September 13, 2022 15:19
@laushinka (Contributor, Author)

Holding to squash later.

/hold

if (
    !(
        instance.status.phase === "running" ||
        durationLongerThanSeconds(Date.parse(instance.creationTime), maxTimeToRunningPhaseSeconds)
Member

If I understood correctly, you are adding a timeout for the pending phase.
We handle timeouts for the various phases in this method. It would be interesting to understand why the case in L583 doesn't cover this.

Member

@svenefftinge We have two controlling methods, which check instances in different phases, to clearly separate concerns. This was introduced after we detected outdated and broken logic in the old, single method. 🙃

  1. This one here is responsible for handling phases which ws-manager is responsible for, e.g. things that for whatever reason fell through the cracks, incl. running and pending.
  2. The one you linked is responsible for controlling phases WebApp is responsible for, e.g. preparing and building.

FYI: There is a diagram that tries to illustrate this a little bit.
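
A rough sketch of that split: the phase groupings follow the comment above, while the function and constant names are made up purely for illustration and do not exist in the codebase.

    // Illustration only: the responsibility split described above, with placeholder names.
    type Phase = "preparing" | "building" | "pending" | "creating" | "initializing" | "running" | "stopping" | "stopped";

    // 1. Phases ws-manager owns: reconciled against ws-manager's view (this PR's code path).
    const WS_MANAGER_OWNED: Phase[] = ["pending", "creating", "initializing", "running", "stopping"];
    // 2. Phases WebApp owns: handled by the other controlling method with its own timeouts.
    const WEBAPP_OWNED: Phase[] = ["preparing", "building"];

    function responsibleController(phase: Phase): "ws-manager-bridge" | "webapp" | "none" {
        if (WS_MANAGER_OWNED.includes(phase)) return "ws-manager-bridge";
        if (WEBAPP_OWNED.includes(phase)) return "webapp";
        return "none"; // e.g. "stopped" needs no controlling
    }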

Member

Ah, yes, that makes sense. I was confusing pending and preparing

Member

> I was confusing pending and preparing

Yeah, I agree. All of this feels more complex than it should 🙄

@laushinka I'd really love to have at least a comment on this line which explains why we think this is an instance we do (not) want to stop.

@svenefftinge (Member) commented Sep 14, 2022

Looking at the logic now (btw, it's a little hard to wrap my head around the negated "or", tbh 😇), it seems like it would also handle all stopping cases. That is generally great, as we have leaking stopping states (#12938, #12939, #10624) that we should handle.
But we should apply the timeout on top of stoppingTime instead of creationTime in this case.
Also, I think this code would now also trigger for the phases WebApp is responsible for if this timeout is smaller than the respective timeout for the phase.

How about we explicitly list the phases we want to handle in this function (i.e. ws-manager responsibilities) and use a single timeout against max(CreationTime, StartedTime, StoppingTime) (i.e. the last signal we got)? The timeout doesn't really need to be configurable, I think, but can be rather small and static. We mainly want to make sure we don't act on anything that was removed from ws-manager just this moment (i.e. races).
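
A minimal sketch of that suggestion, not code from this PR: the field names follow the camelCase columns visible in the diff excerpts, and the timeout value is an arbitrary placeholder.

    // Sketch of the proposal: handle only ws-manager-owned phases, with one small static
    // timeout measured against the latest signal we received for the instance.
    const CONTROL_TIMEOUT_SECONDS = 60; // placeholder; "rather small and static"

    interface InstanceTimes {
        creationTime: string;
        startedTime?: string;
        stoppingTime?: string;
        status: { phase: string };
    }

    function lastSignal(i: InstanceTimes): number {
        return Math.max(
            Date.parse(i.creationTime),
            i.startedTime ? Date.parse(i.startedTime) : 0,
            i.stoppingTime ? Date.parse(i.stoppingTime) : 0,
        );
    }

    function shouldActOnUnknownInstance(i: InstanceTimes, now: number = Date.now()): boolean {
        // Explicitly list the phases this function is responsible for (ws-manager's side);
        // WebApp-owned phases (preparing, building) are handled elsewhere.
        const wsManagerPhases = ["pending", "creating", "initializing", "running", "stopping"];
        if (!wsManagerPhases.includes(i.status.phase)) {
            return false;
        }
        // Don't act on anything we heard from ws-manager about only moments ago (races).
        return now - lastSignal(i) > CONTROL_TIMEOUT_SECONDS * 1000;
    }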

@@ -603,6 +619,7 @@ export class WorkspaceManagerBridge implements Disposable {
         const nowISO = now.toISOString();
         info.latestInstance.stoppingTime = nowISO;
         info.latestInstance.stoppedTime = nowISO;
+        info.latestInstance.status.message = `Stopped by ws-manager-bridge. Previously in phase ${info.latestInstance.status.phase}`;
Member

👍

@laushinka laushinka force-pushed the lau/ws-pending branch 2 times, most recently from 21bc213 to 1ea3e3b on September 14, 2022 12:50
log.info(
    { instanceId, workspaceId: instance.workspaceId },
    "Database says the instance is running, but wsman does not know about it. Marking as stopped in database.",
    { installation },
Member

nit: phase would be nice to have in here as well.

Member

Could you also put all the properties into one object? I.e., as in:

log.info("my message ...", { all, the, properties })

Member

@svenefftinge There is a reason we have two bags: The first one is the OWI (owner-workspace-instance) shape. Having this typed means it's very easy to find all logs related to a specific user or workspace or instance, across all components, across all tools (logs and tracing). Moving it out of there breaks this. 😕
The second bag at the end is specifically for payload (arbitrary shape), which ends up in a different place in log entries.
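
In other words, the call shape looks roughly like the sketch below. Only the argument order reflects the convention described above; the import path and the wrapper function are assumptions for illustration, not the exact code from this PR.

    // Sketch of the two-bag logging convention.
    import { log } from "@gitpod/gitpod-protocol/lib/util/logging"; // assumed import path

    function logUnknownInstance(
        instance: { id: string; workspaceId: string; status: { phase: string } },
        installation: string,
    ): void {
        log.info(
            // first bag: the typed OWI context, searchable across components and tools
            { instanceId: instance.id, workspaceId: instance.workspaceId },
            "Database says the instance is running, but wsman does not know about it.",
            // second bag: arbitrary payload, which lands in a different place in the log entry
            { installation, phase: instance.status.phase },
        );
    }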

Member

Oh sorry, I didn't know we have this special OWI object.

Contributor (Author)

I didn't know either. Will create a PR to fix it back!

"Database says the instance is running, but wsman does not know about it. Marking as stopped in database.",
{ installation },
);
await this.markWorkspaceInstanceAsStopped(ctx, ri, new Date());
Member

@svenefftinge @laushinka What about rolling out this change more defensively? Maybe by copying some code, or by introducing a separate branch for all checks that are "new"?

And then either use a feature flag for the rollout, or start with "logging first, enable later"? My point is that this is at the core of our software, and we really don't want to ruin someone's day because of assumptions we make about old code of ours. 🧘

Contributor (Author)

Definitely. I was thinking about how to test this better, because so far I tested it by cordoning the preview env node, but it feels lacking.

"Logging first, enable later" could be a start. Could you elaborate on the separate-branch idea?

Contributor (Author)

Also happy to have another pair of eyes when testing. So far my flow is:

  1. Cordon the node
  2. Run a workspace
  3. See it stuck on pending
  4. See stoppingTime and stoppedTime inserted into the DB ✅

But for some reason it takes a while until the phase is stopped.

Member

Maybe you then put the old condition back in and add the new if branch with just logging. We can review the log messages and then activate the 'stopping' in another deploy.

Contributor (Author)

YES let's do that 🚀

@laushinka laushinka changed the title from "Stops stuck workspaces" to "[bridge] Logs stuck workspace instances to validate fix" on Sep 14, 2022
-        if (instance.status.phase !== "running") {
+        const phase = instance.status.phase;
+        if (phase !== "running") {
+            // This below if block is to validate the planned fix
@laushinka (Contributor, Author) commented Sep 14, 2022

I did it this way because we want to keep the previous check. Therefore I also had to remove the if running check below. Does this make sense? @svenefftinge @geropl
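
Assembled from the excerpts above, the resulting shape is roughly the following. This is a sketch, not the literal merged diff: the log message is illustrative, and durationLongerThanSeconds and log are the helpers already visible in the excerpts; in the real code this sits inside the existing control loop rather than a standalone function.

    // Sketch: keep the previous "not running" handling and add a branch that only logs
    // instances matching the planned fix's condition (no stopping yet).
    function controlNonRunningInstance(
        instance: { id: string; workspaceId: string; stoppingTime?: string; status: { phase: string } },
        installation: string,
    ): void {
        const phase = instance.status.phase;
        if (phase !== "running") {
            // This below if block is to validate the planned fix
            if (
                phase === "pending" ||
                phase === "creating" ||
                phase === "initializing" ||
                (phase === "stopping" &&
                    instance.stoppingTime !== undefined &&
                    durationLongerThanSeconds(Date.parse(instance.stoppingTime), 10))
            ) {
                log.info(
                    { instanceId: instance.id, workspaceId: instance.workspaceId },
                    "Instance is stuck in a phase ws-manager does not know about.", // illustrative message
                    { installation, phase },
                );
            }
            // ...the previous handling of non-running instances continues unchanged here...
        }
    }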

Member

perfect! let's see how it does in production 😅

@laushinka (Contributor, Author) commented Sep 14, 2022

/werft run

👍 started the job as gitpod-build-lau-ws-pending.16
(with .werft/ from main)

@roboquat roboquat merged commit f2af999 into main Sep 14, 2022
@roboquat roboquat deleted the lau/ws-pending branch September 14, 2022 20:09
phase === "initializing" ||
(phase === "stopping" &&
instance.stoppingTime &&
durationLongerThanSeconds(Date.parse(instance.stoppingTime), 10))
@geropl (Member) commented Sep 15, 2022

@laushinka One more comment on the timeout discussion (sorry for doing this out of band, but I couldn't wrap my head around it yesterday): We have to make sure that short disconnects between application and workspace clusters don't lead to falsely stopped workspaces. We've seen instances in the past where we had problems on and off for a couple of minutes (might have been due to bad gRPC connection options, might or might not be better by now). With this change, this would lead to a situation where we mark the workspace as stopped in the database while users can still use and work in their workspaces perfectly fine. 😬

I guess we just have to be very careful with this change. Let's monitor this for now and for a couple of days, plus double-check whether we find hints in the logs on bridge -> ws-manager re-connects (this should be the relevant line in code).

Labels
  • deployed: webapp (Meta team change is running in production)
  • deployed (Change is completely running in production)
  • release-note-none
  • size/M
  • team: webapp (Issue belongs to the WebApp team)

5 participants