Improve monitoring and feedback for errors during workspace cluster selection #7829
Comments
I can confirm this; in my case, every first workspace of the same context is stuck.
This may be an issue with
To start, let's gauge the frequency of this error by searching the logs. Do the entries for the error line up with peaks in our volume? If yes, a simple mitigation may be to scale up server (+1 replica). This may be difficult to reproduce, but please try, and report the result here. cc: @jldec @JanKoehnlein as a heads-up. I am going to add this to your inbox as well, to get your thoughts.
Thanks for the heads up. Let us know if you have a bit more information.
It seems to be working now; I never ran into it again.
I see that when this error occurred, it was usually preceded by
I looked for the warning
I cannot find anything in the dashboard that establishes a relation between server and the error.
Hey @princerachit, @sagor999 and I observed this behavior yesterday while server event loop lag was high. @geropl @JanKoehnlein can you think of a way for us to artificially cause server event loop lag to increase (not in production), so we can test this hypothesis? |
Another potential culprit is ws-manager-bridge; if you can think of a way to make it perform poorly, that would be an ideal scenario to see whether it causes this condition to occur.
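One way to induce event loop lag for such a test is to busy-wait on the Node.js main thread and measure how late a timer fires. This is a hypothetical sketch (the function names are made up, not Gitpod code):

```typescript
// Hypothetical sketch: block the Node.js event loop on purpose and
// measure the resulting timer lag. Names are illustrative only.
function busyWaitMs(ms: number): void {
  const end = Date.now() + ms;
  // A tight loop on the main thread prevents any timer from firing.
  while (Date.now() < end) { /* burn CPU */ }
}

function measureLag(sampleMs: number): Promise<number> {
  const scheduled = Date.now();
  return new Promise((resolve) => {
    // Lag = how much later than requested the callback actually ran.
    setTimeout(() => resolve(Date.now() - scheduled - sampleMs), sampleMs);
  });
}

async function demo(): Promise<number> {
  const lag = measureLag(10); // timer due in ~10ms
  busyWaitMs(200);            // but the loop is blocked for ~200ms
  return lag;                 // so the observed lag is large
}
```

Running something like this inside a staging server process (never production) would let us confirm whether high event loop lag reproduces the stuck "Allocating resources" state.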
@kylos101
This, in turn, is very likely. If that were the case, you should see either a) high DB CPU usage, or b) high ws-manager-bridge CPU usage. I'm also not sure how to reproduce it. If you want to do it artificially to inspect the effect on the system, you could try inserting a
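For instance, an async call on the affected path could be wrapped so every invocation pays extra latency. This is purely illustrative (`withArtificialDelay`, `listClusters`, and the cluster names are assumptions, not Gitpod's code):

```typescript
// Hypothetical helper: delay an async operation by a fixed amount to
// simulate a slow database or a slow ws-manager-bridge round trip.
function withArtificialDelay<T>(op: () => Promise<T>, delayMs: number): Promise<T> {
  return new Promise<void>((resolve) => setTimeout(resolve, delayMs)).then(op);
}

// Usage sketch: pretend this is a cluster lookup that normally returns fast.
async function listClusters(): Promise<string[]> {
  return ["eu-cluster-1", "us-cluster-1"]; // made-up cluster names
}

const slowListClusters = () => withArtificialDelay(listClusters, 500);
```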
This message from @princerachit looks interesting as well:
This could happen if workspaceClusters are not registered properly (for whatever reason; I cannot imagine one 🤷).
Hi @JanKoehnlein and @jldec 👋 , #8173 was closed in favor of this issue. I wanted to give you a heads up, because I know you had #8173 scheduled and might be wondering what happened. 😄 Also, @csweichel shared some feedback on related metrics for the WebApp side, which could help with alerting. CC: @princerachit
@JanKoehnlein IMO it makes sense to schedule this, and to add a metric for, and alert on, the scheduling success rate.
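A minimal sketch of such a success-rate metric (assumed names; a real deployment would more likely expose a pair of Prometheus counters and compute the ratio in the alert rule):

```typescript
// Hypothetical in-memory stand-in for a scheduling success-rate metric.
// In production this would typically be two monotonic counters
// (attempts and failures) with the rate derived at query time.
class SchedulingMetrics {
  private attempts = 0;
  private failures = 0;

  recordAttempt(succeeded: boolean): void {
    this.attempts++;
    if (!succeeded) this.failures++;
  }

  // Fraction of attempts that successfully found a workspace cluster.
  successRate(): number {
    return this.attempts === 0 ? 1 : (this.attempts - this.failures) / this.attempts;
  }
}
```

An alert could then fire when the rate drops below a threshold for some sustained window, which covers exactly the "stuck on Allocating resources" symptom described above.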
Hey @JanKoehnlein @geropl and @jldec 👋 , I spent some time researching this issue today for workspace
May I ask you to review and let us know if there's anything else we can do to help?
@kylos101 Thanks for the analysis, I 💯% agree.
I think we should retry, but over a couple of seconds rather than such a long time period. Longer outages like this one should instead escalate, triggered by metrics and alerts. For users, if the retry also fails because of "no cluster available", we should present a message that explains:
IMO the best way to transport this message is to update
Suggestions for metric names:
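The short retry window described above could look roughly like this (a sketch under assumptions; the retry budget, the flaky operation, and the error text are all hypothetical):

```typescript
// Hypothetical sketch: retry cluster selection for a few seconds, then
// give up and surface the error so metrics and alerts can escalate.
async function retryForSeconds<T>(
  op: () => Promise<T>,
  timeoutMs: number,
  intervalMs: number,
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  let lastError: unknown;
  for (;;) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      if (Date.now() + intervalMs >= deadline) break; // retry budget exhausted
      await new Promise((r) => setTimeout(r, intervalMs));
    }
  }
  // Caller shows the "no cluster available" message and records the failure.
  throw lastError;
}
```

With a budget of a couple of seconds, transient registration gaps are absorbed silently, while a genuine outage still fails fast enough to trigger the user-facing message and the alert.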
I'm still getting this issue right now. You should be able to see the actual project on your end, but currently I consistently get stuck at "Allocating resources ...".
Hey @eliezedeck! I saw your message on Discord and wanted to close this loop 🔁 Everything should be good again, but let us know if you see any other issues. |
Bug description
On start, some workspaces get stuck on "Allocating resources" (reported on Discord).
Another example from Twitter.
Steps to reproduce
Unconfirmed, but try starting a workspace when prebuild load is very high.
Workspace affected
vipertools-css-eamifbfe80m
Expected behavior
It should start, instead of getting stuck.
Example repository
n/a
Anything else?
The issue occurred for some users on January 25 @ 10:52pm UTC.