[observability] Show workspace success rate by cluster & include prebuilds #9026

kylos101 · 2022-03-30T17:33:49Z

Is your feature request related to a problem? Please describe

While working on this issue, we found that failed prebuilds reduce the workspace success rate. Why? Prebuilds are counted in the grpc_server_handled_total metric, which is a component of our calculation:

    sum(rate(grpc_server_handled_total{grpc_method="StartWorkspace",grpc_code!~"OK|ResourceExhausted"}[1d])) OR on() vector(0)
    /
    sum(rate(grpc_server_handled_total{grpc_method="StartWorkspace"}[1d]))

Additionally, we found it difficult to view the success rate by cluster (now that we have 4).

Describe the behaviour you'd like

Add a new pane that has one success rate line for each cluster, alternating the legend color between clusters.
Remove the prebuild filters from the stop measures, because the start metric includes prebuilds too (many failed prebuilds this week caused the success rate to plummet).

i.e.

    sum(rate(gitpod_ws_manager_workspace_stops_total{reason="failed"}[1d])) OR on() vector(0)
    /
    sum(rate(gitpod_ws_manager_workspace_stops_total[1d]))

Avoid treating prebuilds that users cancel manually as failures. This will require some research, as it's unclear whether the status to exclude is Canceled, Aborted, or something else. This is a guess; you'll need to read the code and/or test to confirm. The existing PromQL excludes OK and ResourceExhausted.
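
The per-cluster pane could be sketched roughly as follows. This assumes the metric carries a `cluster` label, which is not confirmed here; substitute whatever label actually identifies the cluster. Because a bare `vector(0)` has no `cluster` label, the fallback is built from the denominator multiplied by 0, so clusters with traffic but no failures still report 0. The `Canceled` entry in the regex is only a placeholder for whichever code turns out to represent manually cancelled prebuilds, pending the research described above.

```promql
# Failure ratio per cluster; `cluster` is an assumed label name.
(
  sum by (cluster) (rate(grpc_server_handled_total{grpc_method="StartWorkspace",grpc_code!~"OK|ResourceExhausted|Canceled"}[1d]))
  # Fallback: 0 for clusters that had traffic but no failures.
  or sum by (cluster) (rate(grpc_server_handled_total{grpc_method="StartWorkspace"}[1d])) * 0
)
/
sum by (cluster) (rate(grpc_server_handled_total{grpc_method="StartWorkspace"}[1d]))
```

With a legend template keyed on the same label, each cluster would get its own line in the pane.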

Describe alternatives you've considered

Update the grpc_server_handled_total metric to include a field that allows us to filter by workspace type and exclude prebuilds. I opted not to go down this path because we should pay more attention to prebuilds.

Additional context

This will give us a better picture of the current state for all workspaces, not just regular workspaces.

kylos101 moved this to Scheduled in 🌌 Workspace Team Mar 30, 2022

kylos101 added the groundwork: scheduled and operations: observability labels and removed the groundwork: scheduled label Mar 30, 2022

princerachit self-assigned this Apr 1, 2022

princerachit mentioned this issue Apr 4, 2022

[dashboard] Update success criteria dashboard #9098

Merged

roboquat closed this as completed in #9098 Apr 4, 2022

Repository owner moved this from Scheduled to Done in 🌌 Workspace Team Apr 4, 2022

princerachit reopened this Apr 5, 2022

princerachit moved this from Done to In Progress in 🌌 Workspace Team Apr 5, 2022

kylos101 mentioned this issue Apr 6, 2022

[observability] Update success criteria formula #9146

Merged

kylos101 linked a pull request Apr 6, 2022 that will close this issue

[observability] Update success criteria formula #9146

Merged

roboquat closed this as completed in #9146 Apr 6, 2022

Repository owner moved this from In Progress to Done in 🌌 Workspace Team Apr 6, 2022
