Is your feature request related to a problem? Please describe
While working on this issue, we found that failed prebuilds reduce the workspace success rate. Why? Prebuilds are counted in the grpc_server_handled_total metric, which is a component of our calculation:
sum(rate(grpc_server_handled_total{grpc_method="StartWorkspace",grpc_code!~"OK|ResourceExhausted"}[1d])) OR on() vector(0)
/
sum(rate(grpc_server_handled_total{grpc_method="StartWorkspace"}[1d]))
Additionally, we found it difficult to view the success rate by cluster (now that we have 4).
Describe the behaviour you'd like
Add a new pane with one success rate line per cluster, alternating the legend color between clusters.
Remove the prebuild filters from the stop measures, because the start metric includes prebuilds too (many failed prebuilds this week caused the success rate to plummet).
i.e.
sum(rate(gitpod_ws_manager_workspace_stops_total{reason="failed"}[1d])) OR on() vector(0)
/
sum(rate(gitpod_ws_manager_workspace_stops_total[1d]))
Avoid treating prebuilds that users cancel manually as failures. This will require some research, as it's unclear whether the status to exclude is Canceled, Aborted, or something else. This is a guess; you'll need to read the code and/or test to confirm. The existing PromQL excludes OK and ResourceExhausted.
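As a rough sketch of the per-cluster pane, the existing start query could be grouped by a cluster label. The label name `cluster` is an assumption (use whatever label actually distinguishes the 4 clusters), and the `1 - (...)` wrapper is only there to turn the failure ratio into a success rate. A global `or on() vector(0)` fallback carries no labels, so this sketch falls back to the per-cluster total multiplied by 0 instead. The parentheses are deliberate, since PromQL binds `/` tighter than `or`.

```promql
# Sketch only: per-cluster StartWorkspace success rate.
# ASSUMPTION: series carry a `cluster` label; rename to match the real label.
1 - (
  (
    # Failed starts per cluster (same exclusions as the existing query).
    sum by (cluster) (rate(grpc_server_handled_total{grpc_method="StartWorkspace",grpc_code!~"OK|ResourceExhausted"}[1d]))
    or
    # Fallback of 0 for clusters that have starts but no failures.
    sum by (cluster) (rate(grpc_server_handled_total{grpc_method="StartWorkspace"}[1d])) * 0
  )
  /
  # All starts per cluster.
  sum by (cluster) (rate(grpc_server_handled_total{grpc_method="StartWorkspace"}[1d]))
)
```

Once the status code for manually canceled prebuilds is confirmed, it could be appended to the `grpc_code!~` regex (for example `"OK|ResourceExhausted|Canceled"`, if Canceled turns out to be the right one).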
Describe alternatives you've considered
Update the grpc_server_handled_total metric to include a field that allows us to filter by workspace type, and exclude prebuilds. I opted to not go down this path, because we should pay more attention to prebuilds.
Additional context
This will give us a better picture of the current state for all workspaces, not just regular workspaces.