
Improve monitoring and feedback for errors during workspace cluster selection #7829


Closed
kylos101 opened this issue Jan 26, 2022 · 17 comments · Fixed by #8486
Labels
priority: highest (user impact) · team: webapp


@kylos101
Contributor

Bug description

On start, some workspaces get stuck on "Allocating resources" (reported on Discord).

Another example from Twitter.

Steps to reproduce

Unconfirmed, but try starting a workspace when prebuild load is very high.

Workspace affected

vipertools-css-eamifbfe80m

Expected behavior

The workspace should start instead of getting stuck.

Example repository

n/a

Anything else?

The issue occurred for some users on January 25 @ 10:52pm UTC.

@kylos101 kylos101 added priority: highest (user impact) and team: workspace labels Jan 26, 2022
@kylos101 kylos101 moved this to Scheduled in 🌌 Workspace Team Jan 26, 2022
@kylos101 kylos101 changed the title Some workspaces are getting stuck in "Allocation Resources" Some workspaces do not start, are stuck in Preparing, displaying "Allocating resources..." Jan 26, 2022
@kylos101 kylos101 moved this from Scheduled to In Progress in 🌌 Workspace Team Jan 26, 2022
@thisisommore

I can confirm this; in my case, every first workspace for the same context gets stuck.

@kylos101 kylos101 removed the status in 🌌 Workspace Team Jan 28, 2022
@kylos101 kylos101 moved this to Scheduled in 🌌 Workspace Team Jan 30, 2022
@kylos101
Contributor Author

kylos101 commented Feb 1, 2022

This may be an issue with server and the number of socket connections, which would put it on the WebApp side of the house.

To start, let's gauge the frequency of this error by searching the logs. Do the entries for the error line up with peaks in our volume? If yes, a simple mitigation may be to scale up server (+1 replica).

This may be difficult to recreate, but please try and report the result here.

cc: @jldec @JanKoehnlein as a heads up. I am going to add this to your inbox as well, to get your thoughts.

@JanKoehnlein
Contributor

Thanks for the heads up. Let us know if you have a bit more information.

@thisisommore

It seems to be working now; I never ran into it again.

@princerachit princerachit self-assigned this Feb 7, 2022
@princerachit princerachit moved this from Scheduled to In Progress in 🌌 Workspace Team Feb 8, 2022
@princerachit
Contributor

I see that usually when this error occurred, it was preceded by `{"phase": "stopped", "message": "Workspace cannot be started: Error: no available workspace cluster to choose from!", "conditions": {"failed": "Error: no available workspace cluster to choose from!"}}`

@princerachit
Contributor

I looked for the warning "websocket in strange state", but it does not seem too frequent during that time.

@princerachit
Contributor

I cannot find anything in the dashboard that establishes a relationship between server and the error.

@kylos101
Contributor Author

kylos101 commented Feb 9, 2022

Hey @princerachit, @sagor999 and I observed this behavior yesterday while server event loop lag was high.

@geropl @JanKoehnlein can you think of a way for us to artificially cause server event loop lag to increase (not in production), so we can test this hypothesis?

@kylos101
Contributor Author

Another potential culprit is ws-manager-bridge; if you can think of a way to make it perform poorly, that would be an ideal scenario for seeing whether it causes this condition to occur.

@kylos101 kylos101 moved this to Blocked in 🌌 Workspace Team Feb 10, 2022
@geropl
Member

geropl commented Feb 10, 2022

@kylos101
Regarding reproducing high event-loop lag: no, I don't know of any way to do so, sorry. It might be related to general slowness, but its involvement in persistent problems (longer than 1-2s) is rather unlikely.

Another potential culprit is ws-manager-bridge

This, in turn, is very likely. First of all, you should see either a) high DB CPU usage, or b) high ws-manager-bridge CPU usage.
If neither is the case, you could also c) look for "deadlock" in the (ws-manager-)bridge logs.

Also, I'm not sure how to reproduce it. If you want to do it artificially to inspect the effect on the system, you could try inserting an `await new Promise(resolve => setTimeout(resolve, 2000));` here.
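
A minimal sketch of that artificial delay, assuming it is injected into the bridge's status-update path; `handleStatusUpdate` and `WorkspaceStatus` below are hypothetical stand-ins, not the real ws-manager-bridge API:

```typescript
// Sketch only: simulate a slow ws-manager-bridge by stalling each status update.
// Both names here are hypothetical stand-ins for the real bridge code.
interface WorkspaceStatus {
    instanceId: string;
    phase: string;
}

async function handleStatusUpdate(status: WorkspaceStatus): Promise<void> {
    // Artificial 2-second delay to mimic a slow/deadlocked bridge (test environments only!)
    await new Promise<void>((resolve) => setTimeout(resolve, 2000));
    console.log(`processed status for ${status.instanceId}: ${status.phase}`);
}
```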

@geropl
Member

geropl commented Feb 10, 2022

This message from @princerachit looks interesting as well:

I see that usually when this error occurred, it was preceded by `{"phase": "stopped", "message": "Workspace cannot be started: Error: no available workspace cluster to choose from!", "conditions": {"failed": "Error: no available workspace cluster to choose from!"}}`

This could happen if workspaceClusters are not registered properly (for whatever reason; I cannot imagine one 🤷).

@kylos101 kylos101 moved this from Blocked to Scheduled in 🌌 Workspace Team Feb 14, 2022
@kylos101 kylos101 moved this to In Progress in 🍎 WebApp Team Feb 16, 2022
@kylos101 kylos101 removed the status in 🍎 WebApp Team Feb 16, 2022
@kylos101
Contributor Author

Hi @JanKoehnlein and @jldec 👋, #8173 was closed in favor of this issue. I wanted to give you a heads-up because I know you had #8173 scheduled and might be wondering what happened. 😄

Also, @csweichel shared some feedback on related metrics for the WebApp side, which could help with alerting.

CC: @princerachit

@kylos101 kylos101 removed the status in 🌌 Workspace Team Feb 17, 2022
@geropl
Member

geropl commented Feb 24, 2022

@JanKoehnlein IMO it makes sense to schedule this, and to add a metric for, and an alert on, the scheduling success rate.

@geropl geropl moved this to Scheduled in 🍎 WebApp Team Feb 24, 2022
@kylos101
Contributor Author

Hey @JanKoehnlein @geropl and @jldec 👋 ,

I spent some time researching this issue today for workspace shaal-drupalpod-wp2k1qd0khc and this report in Discord.

May I ask you to review and let us know if there's anything else we can do to help?

  1. This log from the gitpod project indicates server tried to start the workspace instance, but could not find a workspace cluster.
    • It seems like server tried starting only once.
    • Would it make sense to retry with exponential back-off for up to 3-5 minutes? 🤔 Perhaps even surfacing the attempts to the user?
    • Related traces. It doesn't look like this instance ever actually made it to a workspace cluster.
  2. I found zero entries for this workspace instance ID, but was able to find owner IDs in the workspace-clusters project in the same timeframe. This tells us the user was able to run other workspaces, just not this one.
  3. While doing this research, I realized it is hard for us to correlate startWorkspace requests with web app, and we don't log them (unless debug is enabled), so I made this issue to help.

CC: @aledbf @sagor999

@geropl
Member

geropl commented Feb 28, 2022

@kylos101 Thx for the analysis, and I 💯% agree.

Would it make sense to retry with exponential back-off for up to 3-5 minutes

I think we should re-try, but not over such a long time period; more on the order of a couple of seconds. Longer outages like this one should instead escalate, triggered by metrics and alerts.
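
A minimal sketch of such a short re-try loop; `startOnCluster`, the attempt count, and the delay are assumptions for illustration:

```typescript
// Sketch: re-try cluster selection a few times over a couple of seconds before giving up.
// `startOnCluster` is a hypothetical stand-in for the actual start logic.
async function startWithRetry(
    startOnCluster: () => Promise<void>,
    maxAttempts: number = 3,
    delayMs: number = 2000,
): Promise<void> {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            await startOnCluster();
            return;
        } catch (err) {
            // after the last attempt, give up and let the caller surface the failure
            if (attempt === maxAttempts) throw err;
            await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
        }
    }
}
```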

If the re-try fails as well because of "no cluster available", we should present users with a message that explains:

  • 1.) this is a temporary issue
  • 2.) we're already aware (due to alerts).

IMO the best way to convey this message is to update WorkspaceInstance and set a special condition startFailed. 👍
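
A rough sketch of what that condition could look like; the surrounding shape is an assumption, modeled on the failed condition visible in the log message quoted above:

```typescript
// Sketch: conditions on a WorkspaceInstance; only `startFailed` is the new idea here.
interface WorkspaceInstanceConditions {
    // exists today, e.g. "Error: no available workspace cluster to choose from!"
    failed?: string;
    // hypothetical: set when the start attempt (including re-tries) failed,
    // so the dashboard can show a dedicated, user-friendly message
    startFailed?: string;
}
```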

Suggestions for metrics names:

  • gitpod_server_start_workspace_success_total
  • gitpod_server_start_workspace_retries_total
  • gitpod_server_start_workspace_failure_total
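
A sketch of those counters, assuming a prom-client-style metrics setup (the help texts are illustrative):

```typescript
import { Counter } from "prom-client";

// Sketch: counters for the suggested metric names; help texts are illustrative.
export const startWorkspaceSuccessTotal = new Counter({
    name: "gitpod_server_start_workspace_success_total",
    help: "Workspace starts that succeeded",
});
export const startWorkspaceRetriesTotal = new Counter({
    name: "gitpod_server_start_workspace_retries_total",
    help: "Re-tried workspace start attempts",
});
export const startWorkspaceFailureTotal = new Counter({
    name: "gitpod_server_start_workspace_failure_total",
    help: "Workspace starts that ultimately failed",
});

// e.g. on a failed cluster selection:
// startWorkspaceFailureTotal.inc();
```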

@geropl geropl changed the title Some workspaces do not start, are stuck in Preparing, displaying "Allocating resources..." Improve monitoring and feedback for errors during workspace cluster selection Feb 28, 2022
@geropl geropl self-assigned this Feb 28, 2022
@geropl geropl added team: webapp and removed team: workspace labels Feb 28, 2022
@geropl geropl moved this from Scheduled to In Progress in 🍎 WebApp Team Mar 1, 2022
Repository owner moved this from In Progress to Done in 🍎 WebApp Team Mar 7, 2022
@eliezedeck

I'm still getting this issue right now: https://eliezedeck-workspace-yb0kjyp3sio.ws-eu67.gitpod.io/

You should be able to see the actual project on your end. Currently, I consistently get stuck at "Allocating resources ..."

@pawlean
Contributor

pawlean commented Oct 4, 2022

Hey @eliezedeck! I saw your message on Discord and wanted to close this loop 🔁 Everything should be good again, but let us know if you see any other issues.
