
Improve monitoring and feedback for errors during workspace cluster selection #7829


Closed
kylos101 opened this issue Jan 26, 2022 · 17 comments · Fixed by #8486
Labels
priority: highest (user impact) · team: webapp


@kylos101
Contributor

Bug description

On start, some workspaces get stuck on "Allocating resources" (reported on Discord).

Another example from Twitter.

Steps to reproduce

Unconfirmed, but try starting a workspace when prebuild load is very high.

Workspace affected

vipertools-css-eamifbfe80m

Expected behavior

The workspace should start instead of getting stuck.

Example repository

n/a

Anything else?

The issue occurred for some users on January 25 @ 10:52pm UTC.

@kylos101 kylos101 added priority: highest (user impact) and team: workspace labels Jan 26, 2022
@kylos101 kylos101 moved this to Scheduled in 🌌 Workspace Team Jan 26, 2022
@kylos101 kylos101 changed the title Some workspaces are getting stuck in "Allocation Resources" Some workspaces do not start, are stuck in Preparing, displaying "Allocating resources..." Jan 26, 2022
@kylos101 kylos101 moved this from Scheduled to In Progress in 🌌 Workspace Team Jan 26, 2022
@thisisommore

I can confirm this; in my case, every first workspace for the same context gets stuck.

@kylos101 kylos101 removed the status in 🌌 Workspace Team Jan 28, 2022
@kylos101 kylos101 moved this to Scheduled in 🌌 Workspace Team Jan 30, 2022
@kylos101
Contributor Author

kylos101 commented Feb 1, 2022

This may be an issue with server and the number of socket connections, which would put it on the WebApp side of the house.

To start, let's gauge the frequency of this error by searching the logs. Do the entries for the error line up with peaks in our volume? If yes, a simple mitigation may be to scale up server (+1 replica).

This may be difficult to recreate, but please try and report the result here.

cc: @jldec @JanKoehnlein as a heads up. I am going to add this to your inbox as well, to get your thoughts.

@JanKoehnlein
Contributor

Thanks for the heads up. Let us know if you have a bit more information.

@thisisommore

It seems to be working now; I never ran into it again.

@princerachit princerachit self-assigned this Feb 7, 2022
@princerachit princerachit moved this from Scheduled to In Progress in 🌌 Workspace Team Feb 8, 2022
@princerachit
Contributor

I see that usually when this error occurred, it was preceded by `{"phase": "stopped", "message": "Workspace cannot be started: Error: no available workspace cluster to choose from!", "conditions": {"failed": "Error: no available workspace cluster to choose from!"}}`

@princerachit
Contributor

I looked for the warning "websocket in strange state", but it does not seem too frequent during that time.

@princerachit
Contributor

I cannot find anything in the dashboard that establishes a relationship between server and the error.

@kylos101
Contributor Author

kylos101 commented Feb 9, 2022

Hey @princerachit, @sagor999 and I observed this behavior yesterday while server event loop lag was high.

@geropl @JanKoehnlein can you think of a way for us to artificially cause server event loop lag to increase (not in production), so we can test this hypothesis?

@kylos101
Contributor Author

Another potential culprit is ws-manager-bridge; if you can think of a way to make it perform poorly, that would be an ideal scenario for seeing whether it causes this condition to occur.

@kylos101 kylos101 moved this to Blocked in 🌌 Workspace Team Feb 10, 2022
@geropl
Member

geropl commented Feb 10, 2022

@kylos101
Regarding reproducing high event-loop lag: no, I don't know of any way to do so, sorry. It might be related to general slowness, but its involvement in persistent problems (longer than 1-2s) is rather unlikely.

Another potential culprit is ws-manager-bridge

This, in turn, is very likely. First of all, you should see either a) high DB CPU usage, or b) high ws-manager-bridge CPU usage.
If neither is the case, you could also c) look for "deadlock" in the (ws-manager-)bridge logs.

Also, I'm not sure how to reproduce it. If you want to do it artificially to inspect the effect on the system, you could try inserting an `await new Promise(resolve => setTimeout(resolve, 2000));` here.
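
A minimal sketch of that artificial delay, assuming it is injected into the bridge's status-update path; `handleStatusUpdate` and `WorkspaceStatus` below are hypothetical stand-ins, not the real ws-manager-bridge API:

```typescript
// Sketch only: simulate a slow ws-manager-bridge by stalling each status update.
// Both names here are hypothetical stand-ins for the real bridge code.
interface WorkspaceStatus {
    instanceId: string;
    phase: string;
}

async function handleStatusUpdate(status: WorkspaceStatus): Promise<void> {
    // Artificial 2-second delay to mimic a slow/deadlocked bridge (test environments only!)
    await new Promise<void>((resolve) => setTimeout(resolve, 2000));
    console.log(`processed status for ${status.instanceId}: ${status.phase}`);
}
```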

@geropl
Member

geropl commented Feb 10, 2022

This message from @princerachit looks interesting as well:

I see that usually when this error occurred, it was preceded by `{"phase": "stopped", "message": "Workspace cannot be started: Error: no available workspace cluster to choose from!", "conditions": {"failed": "Error: no available workspace cluster to choose from!"}}`

This could happen if workspaceClusters are not registered properly (for whatever reason; I cannot imagine one 🤷).

@kylos101 kylos101 moved this from Blocked to Scheduled in 🌌 Workspace Team Feb 14, 2022
@kylos101 kylos101 moved this to In Progress in 🍎 WebApp Team Feb 16, 2022
@kylos101 kylos101 removed the status in 🍎 WebApp Team Feb 16, 2022
@kylos101
Contributor Author

Hi @JanKoehnlein and @jldec 👋, #8173 was closed in favor of this issue. I wanted to give you a heads-up because I know you had #8173 scheduled and might be wondering what happened. 😄

Also, @csweichel shared some feedback on related metrics for the WebApp side, which could help with alerting.

CC: @princerachit

@kylos101 kylos101 removed the status in 🌌 Workspace Team Feb 17, 2022
@geropl
Member

geropl commented Feb 24, 2022

@JanKoehnlein IMO it makes sense to schedule this, and to add a metric for, and an alert on, the scheduling success rate.

@geropl geropl moved this to Scheduled in 🍎 WebApp Team Feb 24, 2022
@kylos101
Contributor Author

Hey @JanKoehnlein @geropl and @jldec 👋 ,

I spent some time researching this issue today for workspace shaal-drupalpod-wp2k1qd0khc and this report in Discord.

May I ask you to review and let us know if there's anything else we can do to help?

  1. This log from the gitpod project indicates server tried to start the workspace instance, but could not find a workspace cluster.
    • It seems like server tried starting only once.
    • Would it make sense to retry with exponential back-off for up to 3-5 minutes? 🤔 Perhaps even surfacing the attempts to the user?
    • Related traces. It doesn't look like this instance ever actually made it to a workspace cluster.
  2. I found zero entries for this workspace instance ID, but was able to find owner IDs in the workspace-clusters project in the same timeframe. This tells us the user was able to run other workspaces, just not this one.
  3. While doing this research, I realized it is hard for us to correlate startWorkspace requests with web app, and we don't log them (unless debug is enabled), so I made this issue to help.

CC: @aledbf @sagor999

@geropl
Member

geropl commented Feb 28, 2022

@kylos101 Thx for the analysis, and I 💯% agree.

Would it make sense to retry with exponential back-off for up to 3-5 minutes

I think we should re-try, but not over such a long time period; more on the order of a couple of seconds. Longer outages like this one should instead escalate, triggered by metrics and alerts.
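
A minimal sketch of such a short re-try loop; `startOnCluster`, the attempt count, and the delay are assumptions for illustration:

```typescript
// Sketch: re-try cluster selection a few times over a couple of seconds before giving up.
// `startOnCluster` is a hypothetical stand-in for the actual start logic.
async function startWithRetry(
    startOnCluster: () => Promise<void>,
    maxAttempts: number = 3,
    delayMs: number = 2000,
): Promise<void> {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            await startOnCluster();
            return;
        } catch (err) {
            // after the last attempt, give up and let the caller surface the failure
            if (attempt === maxAttempts) throw err;
            await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
        }
    }
}
```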

If the re-try fails as well because of "no cluster available", we should present users with a message that explains:

  • 1.) this is a temporary issue
  • 2.) we're already aware (due to alerts).

IMO the best way to convey this message is to update WorkspaceInstance and set a special condition startFailed. 👍
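
A rough sketch of what that condition could look like; the surrounding shape is an assumption, modeled on the failed condition visible in the log message quoted above:

```typescript
// Sketch: conditions on a WorkspaceInstance; only `startFailed` is the new idea here.
interface WorkspaceInstanceConditions {
    // exists today, e.g. "Error: no available workspace cluster to choose from!"
    failed?: string;
    // hypothetical: set when the start attempt (including re-tries) failed,
    // so the dashboard can show a dedicated, user-friendly message
    startFailed?: string;
}
```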

Suggestions for metrics names:

  • gitpod_server_start_workspace_success_total
  • gitpod_server_start_workspace_retries_total
  • gitpod_server_start_workspace_failure_total
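
A sketch of those counters, assuming a prom-client-style metrics setup (the help texts are illustrative):

```typescript
import { Counter } from "prom-client";

// Sketch: counters for the suggested metric names; help texts are illustrative.
export const startWorkspaceSuccessTotal = new Counter({
    name: "gitpod_server_start_workspace_success_total",
    help: "Workspace starts that succeeded",
});
export const startWorkspaceRetriesTotal = new Counter({
    name: "gitpod_server_start_workspace_retries_total",
    help: "Re-tried workspace start attempts",
});
export const startWorkspaceFailureTotal = new Counter({
    name: "gitpod_server_start_workspace_failure_total",
    help: "Workspace starts that ultimately failed",
});

// e.g. on a failed cluster selection:
// startWorkspaceFailureTotal.inc();
```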

@geropl geropl changed the title Some workspaces do not start, are stuck in Preparing, displaying "Allocating resources..." Improve monitoring and feedback for errors during workspace cluster selection Feb 28, 2022
@geropl geropl self-assigned this Feb 28, 2022
@geropl geropl added team: webapp and removed team: workspace labels Feb 28, 2022
@geropl geropl moved this from Scheduled to In Progress in 🍎 WebApp Team Mar 1, 2022
Repository owner moved this from In Progress to Done in 🍎 WebApp Team Mar 7, 2022
@eliezedeck

I'm still getting this issue right now: https://eliezedeck-workspace-yb0kjyp3sio.ws-eu67.gitpod.io/

You should be able to see the actual project on your end. Currently, I consistently get stuck at "Allocating resources ..."

@pawlean
Contributor

pawlean commented Oct 4, 2022

Hey @eliezedeck! I saw your message on Discord and wanted to close this loop 🔁 Everything should be good again, but let us know if you see any other issues.
