workspaces stuck in INITIALIZING #14108

Closed
geropl opened this issue Oct 22, 2022 · 6 comments
Labels
team: workspace (Issue belongs to the Workspace team) · type: bug (Something isn't working) · type: incident (Gitpod.io service is unstable)

Comments


geropl commented Oct 22, 2022

incident: https://app.incident.io/incidents/260
affected cluster: us72

We noticed that new workspaces got stuck in INITIALIZING (their content readiness probe failed) while their pods had status RUNNING.
The only anomaly we identified was that ws-manager consumed a lot of RAM and CPU. Restarting it mitigated the symptoms.

[screenshot]

These are the logs from around the time the CPU and allocation rates started to spike: logs
I see only two errors:

  • The container could not be located when the pod was deleted. The container used to be Running (instanceId: f5345fbd-2625-45ec-b58f-a719df59e477, log)
  • doFinalize failed (instanceId: d309b3b3-61fe-428d-b46c-8e6c5558fc1d, log)

Other resources:

@geropl geropl added the type: bug (Something isn't working), type: incident (Gitpod.io service is unstable), and team: workspace (Issue belongs to the Workspace team) labels on Oct 22, 2022

geropl commented Oct 22, 2022


mrsimonemms commented Oct 25, 2022

This appears to have triggered again just now - https://gitpod.pagerduty.com/incidents/Q1HZ3KVKZ31OZ8

[screenshot]


jenting commented Oct 25, 2022

ws-manager pprof
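
For anyone reading along: a goroutine profile like the one linked above can be captured from any Go service that exposes the standard net/http/pprof handlers. A minimal sketch follows; the localhost:6060 address is illustrative, not ws-manager's actual debug configuration.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve the pprof endpoints on a private port. A goroutine dump is then
	// available at http://localhost:6060/debug/pprof/goroutine?debug=2, or
	// interactively via: go tool pprof http://localhost:6060/debug/pprof/goroutine
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```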

@kylos101 kylos101 moved this to Breakdown in 🌌 Workspace Team Oct 25, 2022
atduarte commented Oct 27, 2022

Potentially resolved by #14214.

@aledbf were you able to reproduce the issue and verify that what is described in this issue is not happening anymore?

@atduarte atduarte moved this from Breakdown to In Progress in 🌌 Workspace Team Oct 27, 2022

utam0k commented Oct 27, 2022

> The container could not be located when the pod was deleted. The container used to be Running (instanceId: f5345fbd-2625-45ec-b58f-a719df59e477, log)

I have created an issue for this: #12021

kylos101 commented Oct 28, 2022

> @aledbf were you able to reproduce the issue and verify that what is described in this issue is not happening anymore?

For gen72, we took a pprof dump, as @jenting shared above, and also observed that the goroutine count was very high (between 10k and 20k); this was fixed in gen73 via #14214.
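
(For context on what a leak of this kind typically looks like in Go: a common pattern is a worker goroutine blocked forever on a channel send after its caller has timed out. The sketch below is illustrative only; it is not the actual bug addressed by #14214.)

```go
package main

import "time"

func slowWork() string {
	time.Sleep(time.Second)
	return "done"
}

// leakyLookup runs slowWork in a goroutine but gives up after a timeout.
// Because ch is unbuffered, the worker's send blocks forever once the
// caller has returned: one goroutine leaked per timed-out call.
func leakyLookup(timeout time.Duration) (string, bool) {
	ch := make(chan string) // unbuffered; the fix is make(chan string, 1)
	go func() {
		ch <- slowWork() // blocks forever if the select below already timed out
	}()
	select {
	case v := <-ch:
		return v, true
	case <-time.After(timeout):
		return "", false // the worker goroutine is now stuck on its send
	}
}

func main() {
	for i := 0; i < 3; i++ {
		leakyLookup(10 * time.Millisecond) // each call leaks one goroutine
	}
	time.Sleep(2 * time.Second) // the three workers are still blocked here
}
```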

We could not reliably reproduce the original issue (workspaces failing to start and accruing as Regular Not Active), though. However, the goroutine count is now stable in gen73 (below 1k for both clusters while active). We asserted this goroutine behavior during load testing prior to shipping gen73 and updated our baseline accordingly.
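
As an aside, a baseline like this can also be asserted in-process with runtime.NumGoroutine(). A minimal sketch; the 1k threshold mirrors the baseline mentioned above, while the interval and logging are illustrative.

```go
package main

import (
	"log"
	"runtime"
	"time"
)

// watchGoroutines periodically logs the current goroutine count and flags
// it whenever it crosses the given baseline.
func watchGoroutines(baseline int, every time.Duration) {
	for range time.Tick(every) {
		if n := runtime.NumGoroutine(); n > baseline {
			log.Printf("goroutine count %d exceeds baseline %d", n, baseline)
		} else {
			log.Printf("goroutine count %d (ok)", n)
		}
	}
}

func main() {
	watchGoroutines(1000, 30*time.Second)
}
```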

If ws-manager is unable to start workspaces again and they get stuck in Regular Not Active, we should carefully observe its dashboard and create a new issue.

I am going to move this issue to Done; we've validated that the goroutine leak is fixed.

Repository owner moved this from In Progress to Awaiting Deployment in 🌌 Workspace Team Oct 28, 2022
@kylos101 kylos101 moved this from Awaiting Deployment to Done in 🌌 Workspace Team Oct 28, 2022