[bridge] Logs stuck workspace instances to validate fix #12902
Conversation
started the job as gitpod-build-lau-ws-pending.2 because the annotations in the pull request description changed
/werft run 👍 started the job as gitpod-build-lau-ws-pending.4
Force-pushed from e2d6c7a to e74a946
Force-pushed from 2e6ba92 to 5ad3020
Hi @laushinka! Is this ready for review?
@AlexTugarev I had issues where the previews would fail, but it looks like the builds succeed now. Would you like to review it? I can squash the changes later to avoid the build possibly breaking in the middle of a review.
Holding to squash later. /hold
if (
    !(
        instance.status.phase === "running" ||
        durationLongerThanSeconds(Date.parse(instance.creationTime), maxTimeToRunningPhaseSeconds)
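For reference, durationLongerThanSeconds appears to be a small helper that answers whether more than a given number of seconds have passed since a timestamp. A minimal sketch of what it presumably does (the signature is an assumption, not copied from the repository):

// Assumed behavior of the helper used above: true if `since` lies more than `seconds` seconds in the past.
function durationLongerThanSeconds(since: number, seconds: number, now: number = Date.now()): boolean {
    return (now - since) / 1000 > seconds;
}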
If I understood correctly, you are adding a timeout for the pending phase.
We handle timeouts for the various phases in this method. It would be interesting to understand why the case in L583 doesn't handle it.
@svenefftinge We have two controlling methods, which check instances in different phases, to clearly separate concerns. This was introduced after we detected outdated and broken logic in the old, single method. 🙃
- This one here is responsible for handling phases which ws-manager is responsible for, e.g. things that for whatever reason fell through the cracks, including running and pending.
- The one you linked is responsible for controlling phases WebApp is responsible for, e.g. preparing and building.
FYI: There is a diagram that tries to illustrate this a little bit.
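For illustration only, the split described above could be pictured roughly like this; the function names, phase lists, and types are assumptions rather than the actual bridge code:

// Hypothetical sketch of the two controlling responsibilities described above.
type Phase = "preparing" | "building" | "pending" | "creating" | "initializing" | "running" | "stopping" | "stopped";

interface Instance {
    id: string;
    status: { phase: Phase };
}

// Controls phases ws-manager owns, i.e. anything that fell through the cracks on that side.
async function controlWsManagerOwnedInstances(instances: Instance[]): Promise<void> {
    const owned = instances.filter((i) =>
        ["pending", "creating", "initializing", "running", "stopping"].includes(i.status.phase),
    );
    // ...compare `owned` against what ws-manager actually reports and act on stuck instances...
}

// Controls phases WebApp owns, with its own per-phase timeouts.
async function controlWebAppOwnedInstances(instances: Instance[]): Promise<void> {
    const owned = instances.filter((i) => ["preparing", "building"].includes(i.status.phase));
    // ...apply the preparing/building timeouts...
}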
Ah, yes, that makes sense. I was confusing pending and preparing.
> I was confusing pending and preparing

Yeah, I agree. All of this feels more complex than it should 🙄
@laushinka I'd really love to have at least a comment on this line which explains why we think this is an instance we do (not) want to stop.
Looking at the logic now (btw. it's a little hard to wrap my head around the negated or, tbh 😇), it seems like it would also handle all stopping cases. Which is generally great, as we have leaking stopping states (#12938, #12939, #10624) that we should handle.
But we should apply the timeout on top of stoppingTime instead of creationTime in this case.
Also, I think this code would now also trigger for the phases WebApp is responsible for if the timeout is smaller than the respective timeout for the phase.
How about we explicitly list the phases we want to handle in this function (i.e. ws-manager responsibilities) and use a single timeout against the max(CreationTime, StartedTime, StoppingTime) (i.e. the last signal we got)? The timeout doesn't really need to be configurable, I think, but can be rather small and static. We mainly want to make sure we don't act on anything that just in this moment got removed from ws-manager (i.e. races).
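A rough sketch of what that suggestion might look like, with hypothetical names and a static timeout (this is not code from the PR):

// Proposed shape: explicitly list the ws-manager-owned phases and apply one small,
// static timeout against the most recent signal we have for the instance.
const WS_MANAGER_PHASES = ["pending", "creating", "initializing", "running", "stopping"];
const STALE_AFTER_SECONDS = 30; // small and static; only needs to cover races with ws-manager

interface InstanceTimes {
    status: { phase: string };
    creationTime: string;
    startedTime?: string;
    stoppingTime?: string;
}

function looksStuck(instance: InstanceTimes, now: number = Date.now()): boolean {
    if (!WS_MANAGER_PHASES.includes(instance.status.phase)) {
        return false; // phases WebApp owns (preparing, building) are handled elsewhere
    }
    // "The last signal we got": the newest of creation, started, and stopping times.
    const lastSignal = Math.max(
        Date.parse(instance.creationTime),
        instance.startedTime ? Date.parse(instance.startedTime) : 0,
        instance.stoppingTime ? Date.parse(instance.stoppingTime) : 0,
    );
    return (now - lastSignal) / 1000 > STALE_AFTER_SECONDS;
}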
Force-pushed from 222ee32 to d45759e
@@ -603,6 +619,7 @@ export class WorkspaceManagerBridge implements Disposable {
    const nowISO = now.toISOString();
    info.latestInstance.stoppingTime = nowISO;
    info.latestInstance.stoppedTime = nowISO;
    info.latestInstance.status.message = `Stopped by ws-manager-bridge. Previously in phase ${info.latestInstance.status.phase}`;
👍
Force-pushed from 21bc213 to 1ea3e3b
log.info(
    { instanceId, workspaceId: instance.workspaceId },
    "Database says the instance is running, but wsman does not know about it. Marking as stopped in database.",
    { installation },
nit: phase would be nice to have in here as well.
Could you also put all the properties into one object?
I.e. as in:
log.info("my message ...", { all, the, properties })
@svenefftinge There is a reason we have two bags: The first one is the OWI (owner-workspace-instance) shape. Having this typed means it's very easy to find all logs related to a specific user, workspace, or instance, across all components and across all tools (logs and tracing). Moving it out of there breaks this. 😕
The second bag at the end is specifically for the payload (arbitrary shape), which ends up in a different place in log entries.
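A hedged illustration of that convention, modeled on the call in this diff (the import path and the concrete property values are assumptions):

// Import path as commonly used elsewhere in the Gitpod codebase (assumed here).
import { log } from "@gitpod/gitpod-protocol/lib/util/logging";

// Illustrative values; in the bridge these come from the instance and cluster being inspected.
const userId = "some-user";
const workspaceId = "some-workspace";
const instanceId = "some-instance";
const installation = "some-cluster";
const phase = "pending";

// First bag: the typed OWI (owner/workspace/instance) context, so this entry can be correlated
// with everything else logged for the same user, workspace, or instance across components.
// Last bag: an arbitrary payload, which lands in a separate place in the log entry.
log.info(
    { userId, workspaceId, instanceId },
    "Database says the instance is running, but wsman does not know about it.",
    { installation, phase },
);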
Oh sorry, I didn't know we have this special OWI object.
I didn't know either. Will create a PR to fix it back!
"Database says the instance is running, but wsman does not know about it. Marking as stopped in database.", | ||
{ installation }, | ||
); | ||
await this.markWorkspaceInstanceAsStopped(ctx, ri, new Date()); |
@svenefftinge @laushinka What about rolling out this change more defensively? Maybe by copying some code, or introducing a separate branch for all checks that are "new"?
And then either use a feature flag for the rollout, or start with "logging first, enable later"? My point is that this is at the core of our software, and we really don't want to be ruining someone's days because of assumptions we make about old code of ours. 🧘
Definitely. I was thinking about how to test this better, because so far I have tested it by cordoning the preview env node, but it feels lacking.
"Logging first, enable later" could be a start. Could you elaborate on the separate branch idea?
Also happy to have another pair of eyes when testing. So far my flow is:
- Cordon the node
- Run a workspace
- See it stuck on pending
- See stoppingTime and stoppedTime inserted into the DB ✅
But for some reason it takes a while until the phase is stopped.
Maybe you then put the old condition back in and add the new if branch with just logging. We can review the log messages and then activate the 'stopping' in another deploy.
YES let's do that 🚀
Force-pushed from 1ea3e3b to 8ee8f4b
Force-pushed from 8ee8f4b to 80337be
if (instance.status.phase !== "running") {
const phase = instance.status.phase;
if (phase !== "running") {
    // This below if block is to validate the planned fix
I did it this way because we want to keep the previous check. Therefore I also had to remove the if-running check below. Does this make sense? @svenefftinge @geropl
perfect! let's see how it does in production 😅
/werft run 👍 started the job as gitpod-build-lau-ws-pending.16
phase === "initializing" || | ||
(phase === "stopping" && | ||
instance.stoppingTime && | ||
durationLongerThanSeconds(Date.parse(instance.stoppingTime), 10)) |
@laushinka One more comment on the timeout discussion (sorry for doing this out of band, but I couldn't wrap my head around it yesterday): We have to make sure that short disconnects between application and workspace clusters don't lead to falsely stopped workspaces. We've seen instances in the past where we had problems on and off for a couple of minutes (might have been due to bad gRPC connection options, might or might not be better by now). With this change, that would lead to a situation where we mark the workspace as stopped in the database (as far as the database is concerned, the workspaces are stopping) while users can use and work in their workspaces perfectly fine. 😬
I guess we just have to be very careful with this change. Let's monitor this for now and for a couple of days. Plus, double-check whether we find hints in the logs on bridge -> ws-manager re-connects (this should be the relevant line in code).
Description
This PR checks and logs if a workspace instance that ws-manager does not know about is in the pending, creating, or initializing state, or if it has been in stopping for longer than 10 seconds (a guard in case of a race condition). We will monitor the logs and validate whether we can deploy the planned fix here.
Related Issue(s)
Relates #11397
Relates #12955
How to test
Release Notes
Documentation
Werft options: