workspaces stuck in INITIALIZING #14108

Closed
geropl opened this issue Oct 22, 2022 · 6 comments
Labels
team: workspace (Issue belongs to the Workspace team) · type: bug (Something isn't working) · type: incident (Gitpod.io service is unstable)

Comments


geropl commented Oct 22, 2022

incident: https://app.incident.io/incidents/260
affected cluster: us72

We noticed that new workspaces got stuck in INITIALIZING (their content readiness probe failed) while their pods had status RUNNING.
The only anomaly we identified was that ws-manager consumed a lot of RAM and CPU. Restarting it mitigated the symptoms.

[screenshot]

These are the logs from around the time the CPU and allocation rates started to spike: logs
I see only two errors:

  • The container could not be located when the pod was deleted. The container used to be Running (instanceId: f5345fbd-2625-45ec-b58f-a719df59e477, log)
  • doFinalize failed (instanceId: d309b3b3-61fe-428d-b46c-8e6c5558fc1d, log)

Other resources:

@geropl geropl added the type: bug (Something isn't working), type: incident (Gitpod.io service is unstable), and team: workspace (Issue belongs to the Workspace team) labels on Oct 22, 2022

geropl commented Oct 22, 2022


mrsimonemms commented Oct 25, 2022

This appears to have triggered again just now - https://gitpod.pagerduty.com/incidents/Q1HZ3KVKZ31OZ8

[screenshot]


jenting commented Oct 25, 2022

ws-manager pprof
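
For anyone reading along: a goroutine profile like the one linked above can be captured from any Go service that exposes the standard net/http/pprof handlers. A minimal sketch follows; the localhost:6060 address is illustrative, not ws-manager's actual debug configuration.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve the pprof endpoints on a private port. A goroutine dump is then
	// available at http://localhost:6060/debug/pprof/goroutine?debug=2, or
	// interactively via: go tool pprof http://localhost:6060/debug/pprof/goroutine
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```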

@kylos101 kylos101 moved this to Breakdown in 🌌 Workspace Team Oct 25, 2022
atduarte commented Oct 27, 2022

Potentially resolved by #14214.

@aledbf were you able to reproduce the issue and verify that what is described in this issue is not happening anymore?

@atduarte atduarte moved this from Breakdown to In Progress in 🌌 Workspace Team Oct 27, 2022

utam0k commented Oct 27, 2022

> The container could not be located when the pod was deleted. The container used to be Running (instanceId: f5345fbd-2625-45ec-b58f-a719df59e477, log)

I have created an issue for this: #12021

kylos101 commented Oct 28, 2022

> @aledbf were you able to reproduce the issue and verify that what is described in this issue is not happening anymore?

For gen72, we took a pprof dump, as @jenting shared above, and also observed that the goroutine count was very high (between 10k and 20k); this was fixed in gen73 via #14214.
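
(For context on what a leak of this kind typically looks like in Go: a common pattern is a worker goroutine blocked forever on a channel send after its caller has timed out. The sketch below is illustrative only; it is not the actual bug addressed by #14214.)

```go
package main

import "time"

func slowWork() string {
	time.Sleep(time.Second)
	return "done"
}

// leakyLookup runs slowWork in a goroutine but gives up after a timeout.
// Because ch is unbuffered, the worker's send blocks forever once the
// caller has returned: one goroutine leaked per timed-out call.
func leakyLookup(timeout time.Duration) (string, bool) {
	ch := make(chan string) // unbuffered; the fix is make(chan string, 1)
	go func() {
		ch <- slowWork() // blocks forever if the select below already timed out
	}()
	select {
	case v := <-ch:
		return v, true
	case <-time.After(timeout):
		return "", false // the worker goroutine is now stuck on its send
	}
}

func main() {
	for i := 0; i < 3; i++ {
		leakyLookup(10 * time.Millisecond) // each call leaks one goroutine
	}
	time.Sleep(2 * time.Second) // the three workers are still blocked here
}
```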

We could not reliably reproduce the original issue (workspaces failing to start and accruing as Regular Not Active), though. However, the goroutine count is now stable in gen73 (below 1k for both clusters while active). We asserted this goroutine behavior during load testing prior to shipping gen73 and updated our baseline accordingly.
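
As an aside, a baseline like this can also be asserted in-process with runtime.NumGoroutine(). A minimal sketch; the 1k threshold mirrors the baseline mentioned above, while the interval and logging are illustrative.

```go
package main

import (
	"log"
	"runtime"
	"time"
)

// watchGoroutines periodically logs the current goroutine count and flags
// it whenever it crosses the given baseline.
func watchGoroutines(baseline int, every time.Duration) {
	for range time.Tick(every) {
		if n := runtime.NumGoroutine(); n > baseline {
			log.Printf("goroutine count %d exceeds baseline %d", n, baseline)
		} else {
			log.Printf("goroutine count %d (ok)", n)
		}
	}
}

func main() {
	watchGoroutines(1000, 30*time.Second)
}
```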

If ws-manager is unable to start workspaces again and they get stuck in Regular Not Active, we should carefully observe its dashboard and create a new issue.

I am going to move this issue to Done; we've validated that the goroutine leak is fixed.

Repository owner moved this from In Progress to Awaiting Deployment in 🌌 Workspace Team Oct 28, 2022
@kylos101 kylos101 moved this from Awaiting Deployment to Done in 🌌 Workspace Team Oct 28, 2022