workspaces stuck in INITIALIZING #14108
Comments
Potentially related changes:
This appears to have triggered again just now: https://gitpod.pagerduty.com/incidents/Q1HZ3KVKZ31OZ8
For gen72, we took a pprof profile, as @jenting shared above, and also observed that goroutine counts were very high (10k and 20k); this was fixed in gen73 via #14214. We cannot necessarily recreate the original issue (being unable to start workspaces, accruing Regular Not Active workspaces), though. However, the goroutine count is now stable in gen73 (below 1k for both clusters while active). We verified the goroutine behavior during load testing prior to shipping gen73, and updated our baseline accordingly. If ws-manager is unable to start workspaces again and they get stuck in Regular Not Active, we should carefully observe its dashboard and create a new issue. I am going to move this issue to Done; we've validated that the goroutine leak is fixed.
incident: https://app.incident.io/incidents/260
affected cluster: us72
We noticed that new workspaces got stuck in INITIALIZING (their content readiness probe failed), while their pods had status RUNNING.
The only thing we identified was that ws-manager consumed a lot of RAM and CPU. Restarting it mitigated the symptoms.
These are the logs from around the time when CPU & allocation rates started rocketing: logs
I see only two errors:
The container could not be located when the pod was deleted. The container used to be Running (instanceId: f5345fbd-2625-45ec-b58f-a719df59e477, log)
doFinalize failed (instanceId: d309b3b3-61fe-428d-b46c-8e6c5558fc1d, log)
Other resources: