Skip to content

[bridge] Logs stuck workspace instances to validate fix #12902

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Sep 14, 2022
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
36 changes: 27 additions & 9 deletions components/ws-manager-bridge/src/bridge.ts
Original file line number Diff line number Diff line change
Expand Up @@ -446,13 +446,13 @@ export class WorkspaceManagerBridge implements Disposable {
const installation = this.cluster.name;
log.debug("Controlling instances...", { installation });

const runningInstances = await this.workspaceDB
const nonStoppedInstances = await this.workspaceDB
.trace(ctx)
.findRunningInstancesWithWorkspaces(installation, undefined, true);

// Control running workspace instances against ws-manager
try {
await this.controlRunningInstances(ctx, runningInstances, clientProvider);
await this.controlNonStoppedWSManagerManagedInstances(ctx, nonStoppedInstances, clientProvider);

disconnectStarted = Number.MAX_SAFE_INTEGER; // Reset disconnect period
} catch (err) {
Expand All @@ -466,7 +466,7 @@ export class WorkspaceManagerBridge implements Disposable {
}

// Control workspace instances against timeouts
await this.controlInstancesTimeouts(ctx, runningInstances);
await this.controlInstancesTimeouts(ctx, nonStoppedInstances);

log.debug("Done controlling instances.", { installation });
} catch (err) {
Expand All @@ -485,17 +485,17 @@ export class WorkspaceManagerBridge implements Disposable {
* This methods controls all instances that we have currently marked as "running" in the DB.
* It checks whether they are still running with their respective ws-manager, and if not, marks them as stopped in the DB.
*/
protected async controlRunningInstances(
protected async controlNonStoppedWSManagerManagedInstances(
parentCtx: TraceContext,
runningInstances: RunningWorkspaceInfo[],
clientProvider: ClientProvider,
) {
const installation = this.config.installation;

const span = TraceContext.startSpan("controlRunningInstances", parentCtx);
const span = TraceContext.startSpan("controlNonStoppedWSManagerManagedInstances", parentCtx);
const ctx = { span };
try {
log.debug("Controlling running instances...", { installation });
log.debug("Controlling non-stopped instances that are managed by WS Manager...", { installation });

const runningInstancesIdx = new Map<string, RunningWorkspaceInfo>();
runningInstances.forEach((i) => runningInstancesIdx.set(i.latestInstance.id, i));
Expand All @@ -504,9 +504,27 @@ export class WorkspaceManagerBridge implements Disposable {
const actuallyRunningInstances = await client.getWorkspaces(ctx, new GetWorkspacesRequest());
actuallyRunningInstances.getStatusList().forEach((s) => runningInstancesIdx.delete(s.getId()));

// runningInstancesIdx only contains instances that ws-manager is not aware of
for (const [instanceId, ri] of runningInstancesIdx.entries()) {
const instance = ri.latestInstance;
if (instance.status.phase !== "running") {
const phase = instance.status.phase;
if (phase !== "running") {
// This below if block is to validate the planned fix
Copy link
Contributor Author

@laushinka laushinka Sep 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it this way because we want to keep the previous check. Therefore I also had to remove the if running check below. Does this make sense? @svenefftinge @geropl

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perfect! let's see how it does in production 😅

if (
phase === "pending" ||
phase === "creating" ||
phase === "initializing" ||
(phase === "stopping" &&
instance.stoppingTime &&
durationLongerThanSeconds(Date.parse(instance.stoppingTime), 10))
Copy link
Member

@geropl geropl Sep 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@laushinka One more comment on the timeout discussion (sorry for doing this out of band, but couldn't wrap my head around yesterday): We have to make sure that short disconnects between application and workspace clusters don't lead to falsely stopped workspaces. We've seen instances in the past where we had problems on and off for a couple of minutes (might have been due to bad gRPC connection options, might or might not be better by now). With this change, this would lead to a situation where we mark the workspace as stopped in the database, while users can use and work in their workspaces workspaces are stopping perfectly fine. 😬

I guess we just be very carefule with this change. Let's monitor this for now, and a couple of days. Plus double-check if we find hints in the logs on bridge -> ws-manager re-connects (this should be the relevant line in code).

) {
log.info("Logging to validate #12902. Should mark as stopped in database.", {
instanceId,
workspaceId: instance.workspaceId,
installation,
phase,
});
}
log.debug({ instanceId }, "Skipping instance", {
phase: instance.status.phase,
creationTime: instance.creationTime,
Expand All @@ -516,9 +534,8 @@ export class WorkspaceManagerBridge implements Disposable {
}

log.info(
{ instanceId, workspaceId: instance.workspaceId },
"Database says the instance is running, but wsman does not know about it. Marking as stopped in database.",
{ installation },
{ instanceId, workspaceId: instance.workspaceId, installation, phase },
);
await this.markWorkspaceInstanceAsStopped(ctx, ri, new Date());
}
Expand Down Expand Up @@ -603,6 +620,7 @@ export class WorkspaceManagerBridge implements Disposable {
const nowISO = now.toISOString();
info.latestInstance.stoppingTime = nowISO;
info.latestInstance.stoppedTime = nowISO;
info.latestInstance.status.message = `Stopped by ws-manager-bridge. Previously in phase ${info.latestInstance.status.phase}`;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

info.latestInstance.status.phase = "stopped";
await this.workspaceDB.trace(ctx).storeInstance(info.latestInstance);

Expand Down