-
Notifications
You must be signed in to change notification settings - Fork 1.3k
[bridge] Logs stuck workspace instances to validate fix #12902
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -446,13 +446,13 @@ export class WorkspaceManagerBridge implements Disposable { | |
const installation = this.cluster.name; | ||
log.debug("Controlling instances...", { installation }); | ||
|
||
const runningInstances = await this.workspaceDB | ||
const nonStoppedInstances = await this.workspaceDB | ||
.trace(ctx) | ||
.findRunningInstancesWithWorkspaces(installation, undefined, true); | ||
|
||
// Control running workspace instances against ws-manager | ||
try { | ||
await this.controlRunningInstances(ctx, runningInstances, clientProvider); | ||
await this.controlNonStoppedWSManagerManagedInstances(ctx, nonStoppedInstances, clientProvider); | ||
|
||
disconnectStarted = Number.MAX_SAFE_INTEGER; // Reset disconnect period | ||
} catch (err) { | ||
|
@@ -466,7 +466,7 @@ export class WorkspaceManagerBridge implements Disposable { | |
} | ||
|
||
// Control workspace instances against timeouts | ||
await this.controlInstancesTimeouts(ctx, runningInstances); | ||
await this.controlInstancesTimeouts(ctx, nonStoppedInstances); | ||
|
||
log.debug("Done controlling instances.", { installation }); | ||
} catch (err) { | ||
|
@@ -485,17 +485,17 @@ export class WorkspaceManagerBridge implements Disposable { | |
* This methods controls all instances that we have currently marked as "running" in the DB. | ||
* It checks whether they are still running with their respective ws-manager, and if not, marks them as stopped in the DB. | ||
*/ | ||
protected async controlRunningInstances( | ||
protected async controlNonStoppedWSManagerManagedInstances( | ||
parentCtx: TraceContext, | ||
runningInstances: RunningWorkspaceInfo[], | ||
clientProvider: ClientProvider, | ||
) { | ||
const installation = this.config.installation; | ||
|
||
const span = TraceContext.startSpan("controlRunningInstances", parentCtx); | ||
const span = TraceContext.startSpan("controlNonStoppedWSManagerManagedInstances", parentCtx); | ||
const ctx = { span }; | ||
try { | ||
log.debug("Controlling running instances...", { installation }); | ||
log.debug("Controlling non-stopped instances that are managed by WS Manager...", { installation }); | ||
|
||
const runningInstancesIdx = new Map<string, RunningWorkspaceInfo>(); | ||
runningInstances.forEach((i) => runningInstancesIdx.set(i.latestInstance.id, i)); | ||
|
@@ -504,9 +504,27 @@ export class WorkspaceManagerBridge implements Disposable { | |
const actuallyRunningInstances = await client.getWorkspaces(ctx, new GetWorkspacesRequest()); | ||
actuallyRunningInstances.getStatusList().forEach((s) => runningInstancesIdx.delete(s.getId())); | ||
|
||
// runningInstancesIdx only contains instances that ws-manager is not aware of | ||
for (const [instanceId, ri] of runningInstancesIdx.entries()) { | ||
const instance = ri.latestInstance; | ||
if (instance.status.phase !== "running") { | ||
const phase = instance.status.phase; | ||
if (phase !== "running") { | ||
// This below if block is to validate the planned fix | ||
if ( | ||
phase === "pending" || | ||
phase === "creating" || | ||
phase === "initializing" || | ||
(phase === "stopping" && | ||
instance.stoppingTime && | ||
durationLongerThanSeconds(Date.parse(instance.stoppingTime), 10)) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @laushinka One more comment on the timeout discussion (sorry for doing this out of band, but couldn't wrap my head around yesterday): We have to make sure that short disconnects between application and workspace clusters don't lead to falsely stopped workspaces. We've seen instances in the past where we had problems on and off for a couple of minutes (might have been due to bad gRPC connection options, might or might not be better by now). With this change, this would lead to a situation where we mark the workspace as stopped in the database, while I guess we just be very carefule with this change. Let's monitor this for now, and a couple of days. Plus double-check if we find hints in the logs on bridge -> ws-manager re-connects (this should be the relevant line in code). |
||
) { | ||
log.info("Logging to validate #12902. Should mark as stopped in database.", { | ||
instanceId, | ||
workspaceId: instance.workspaceId, | ||
installation, | ||
phase, | ||
}); | ||
} | ||
log.debug({ instanceId }, "Skipping instance", { | ||
laushinka marked this conversation as resolved.
Show resolved
Hide resolved
|
||
phase: instance.status.phase, | ||
creationTime: instance.creationTime, | ||
|
@@ -516,9 +534,8 @@ export class WorkspaceManagerBridge implements Disposable { | |
} | ||
|
||
log.info( | ||
{ instanceId, workspaceId: instance.workspaceId }, | ||
"Database says the instance is running, but wsman does not know about it. Marking as stopped in database.", | ||
{ installation }, | ||
{ instanceId, workspaceId: instance.workspaceId, installation, phase }, | ||
); | ||
await this.markWorkspaceInstanceAsStopped(ctx, ri, new Date()); | ||
} | ||
|
@@ -603,6 +620,7 @@ export class WorkspaceManagerBridge implements Disposable { | |
const nowISO = now.toISOString(); | ||
info.latestInstance.stoppingTime = nowISO; | ||
info.latestInstance.stoppedTime = nowISO; | ||
info.latestInstance.status.message = `Stopped by ws-manager-bridge. Previously in phase ${info.latestInstance.status.phase}`; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 👍 |
||
info.latestInstance.status.phase = "stopped"; | ||
await this.workspaceDB.trace(ctx).storeInstance(info.latestInstance); | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did it this way because we want to keep the previous check. Therefore I also had to remove the
if running
check below. Does this make sense? @svenefftinge @geroplThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perfect! let's see how it does in production 😅