[observability] Show workspace success rate by cluster & include prebuilds #9026

Closed
kylos101 opened this issue Mar 30, 2022 · 0 comments · Fixed by #9098 or #9146
Labels
operations: observability This issue relates to the observability of Gitpod (metrics, logs, traces)

kylos101 commented Mar 30, 2022

Is your feature request related to a problem? Please describe

While working on this issue, we found that failed prebuilds reduce the workspace success rate. Why? Prebuilds are counted in the grpc_server_handled_total metric, which is a component of our calculation:

    sum(rate(grpc_server_handled_total{grpc_method="StartWorkspace",grpc_code!~"OK|ResourceExhausted"}[1d])) OR on() vector(0)
    /
    sum(rate(grpc_server_handled_total{grpc_method="StartWorkspace"}[1d]))

Additionally, we found it was difficult to view success rate by cluster (now that we have 4).
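
To make the effect concrete, here is a minimal Python sketch of the same ratio: calls whose gRPC code is neither OK nor ResourceExhausted, divided by all StartWorkspace calls. The counts are invented for illustration, and `failure_ratio` is a hypothetical helper, not part of Gitpod or Prometheus.

```python
# Codes the existing query does not count as failures.
NON_FAILURE_CODES = {"OK", "ResourceExhausted"}


def failure_ratio(handled_by_code):
    """Share of StartWorkspace calls whose gRPC code counts as a failure.

    Follows the shape of the query: failures / all calls, with an empty
    numerator treated as 0 (the role of `OR on() vector(0)`).
    """
    total = sum(handled_by_code.values())
    if total == 0:
        return 0.0  # assumption: report 0 when there is no traffic at all
    failed = sum(
        n for code, n in handled_by_code.items() if code not in NON_FAILURE_CODES
    )
    return failed / total


# Hypothetical counts, not taken from any real cluster.
regular = {"OK": 95, "ResourceExhausted": 1, "Internal": 4}
prebuilds = {"OK": 30, "Internal": 20}

# The metric does not distinguish workspace types, so both land in one series.
combined = {}
for counts in (regular, prebuilds):
    for code, n in counts.items():
        combined[code] = combined.get(code, 0) + n

regular_only = failure_ratio(regular)      # 4 failures out of 100 calls
with_prebuilds = failure_ratio(combined)   # 24 failures out of 150 calls
```

With these made-up numbers the failure ratio rises from 0.04 to 0.16 once prebuilds are mixed in, even though regular workspaces behaved the same.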

Describe the behaviour you'd like

  1. Add a new pane that has one success rate line for each cluster, alternating the legend color for clusters
  2. Remove the prebuild filters from the stop measures, because the start metric includes prebuilds too (many failed prebuilds this week caused the success rate to plummet).

i.e.

    sum(rate(gitpod_ws_manager_workspace_stops_total{reason="failed"}[1d])) OR on() vector(0)
    /
    sum(rate(gitpod_ws_manager_workspace_stops_total[1d]))
  3. Avoid treating prebuilds which users cancel manually as failures. This will require some research, as it's unclear if the status to exclude is Canceled or Aborted or something else. This is a guess, you'll need to read code and/or test to confirm. The existing PromQL excludes OK and ResourceExhausted.
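
As a rough illustration of what the per-cluster pane would plot, the sketch below groups stop counts by cluster and derives one success rate per cluster. It assumes the success rate is one minus the failed-stop ratio and that each sample carries a cluster identifier; the cluster names, the "stopped" reason, and the `success_rate_by_cluster` helper are all hypothetical.

```python
from collections import defaultdict


def success_rate_by_cluster(stops):
    """Compute one success rate per cluster from stop counts.

    stops: iterable of (cluster, reason, count) tuples, where
    reason == "failed" marks a failed stop (as in the query above).
    Prebuilds are deliberately not filtered out.
    """
    failed = defaultdict(float)
    total = defaultdict(float)
    for cluster, reason, count in stops:
        total[cluster] += count
        if reason == "failed":
            failed[cluster] += count
    # Skip clusters with no stops to avoid dividing by zero.
    return {c: 1 - failed[c] / total[c] for c in total if total[c] > 0}


# Hypothetical samples; names and values are invented for illustration.
stops = [
    ("cluster-a", "failed", 5),
    ("cluster-a", "stopped", 95),
    ("cluster-b", "failed", 30),
    ("cluster-b", "stopped", 70),
]
rates = success_rate_by_cluster(stops)
```

Here `rates` holds one entry per cluster, so a single unhealthy cluster stands out instead of being averaged into a global figure.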

Describe alternatives you've considered

Update the grpc_server_handled_total metric to include a field that allows us to filter by workspace type, and exclude prebuilds. I opted to not go down this path, because we should pay more attention to prebuilds.

Additional context

This will give us a better picture of the current state for all workspaces, not just regular workspaces.

@kylos101 kylos101 moved this to Scheduled in 🌌 Workspace Team Mar 30, 2022
@kylos101 kylos101 added groundwork: scheduled, operations: observability and removed groundwork: scheduled labels Mar 30, 2022
@princerachit princerachit self-assigned this Apr 1, 2022
Repository owner moved this from Scheduled to Done in 🌌 Workspace Team Apr 4, 2022
@princerachit princerachit reopened this Apr 5, 2022
@princerachit princerachit moved this from Done to In Progress in 🌌 Workspace Team Apr 5, 2022
@kylos101 kylos101 linked a pull request Apr 6, 2022 that will close this issue
Repository owner moved this from In Progress to Done in 🌌 Workspace Team Apr 6, 2022