Skip to content

[ws-manager-bridge] Use more reasonable duration buckets for workspace instance updates #12798

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Sep 9, 2022

Conversation

jankeromnes
Copy link
Contributor

Description

Currently, we are tracking the duration of the handleStatusUpdate span, and storing it in the following buckets:

buckets: prom.exponentialBuckets(2, 2, 8),

  • < 2 seconds
  • < 4 seconds
  • < 8 seconds
  • < 16 seconds
  • < 32 seconds
  • < 64 seconds
  • < 128 seconds
  • < 256 seconds
  • < +Infinity

However, most of the time, this span takes actually around 50ms:

Screenshot 2022-09-09 at 09 31 55

Thus, all our measurements land in the very first "< 2 seconds" bucket, which is not very useful.

Instead, I propose that we store durations in the following buckets:

prom.exponentialBuckets(0.050, 2, 8)
  • < 50ms
  • < 100ms
  • < 200ms
  • < 400ms
  • < 800ms
  • < 1600ms
  • < 3200ms
  • < 6400ms
  • < +Infinity

Related Issue(s)

Fixes #

How to test

Release Notes

NONE

Documentation

Werft options:

  • /werft with-preview

@jankeromnes jankeromnes requested a review from a team September 9, 2022 07:43
@github-actions github-actions bot added the team: webapp Issue belongs to the WebApp team label Sep 9, 2022
Copy link
Member

@geropl geropl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@easyCZ
Copy link
Member

easyCZ commented Sep 9, 2022

How much more useful is the extra precision for below <2 secs? I'm asking because if there's no direct value (aside from more accurate data), we could drop the precision and maybe only have 2 buckets in the [0, 2] sec range.

For example, if our SLO was at 2 sec, there would actually be almost no value in making this change.

@roboquat roboquat merged commit dadc064 into main Sep 9, 2022
@roboquat roboquat deleted the jx/fix-duration-buckets branch September 9, 2022 07:51
@jankeromnes
Copy link
Contributor Author

jankeromnes commented Sep 9, 2022

Many thanks @geropl for the fast review! 🚀 (I hope this will make the deployment in 7min 😅)

@easyCZ Good point, maybe 8 buckets isn't actually needed here (I just fixed the existing bucket values, but didn't question the existing bucket count).

But I think we can do much better than a 2 seconds SLO for handling status updates. This is user-facing after all. Maybe something around 200ms or 400ms could be a better SLO for 99+% of updates (and still easily achievable).

@roboquat roboquat added deployed: webapp Meta team change is running in production deployed Change is completely running in production labels Sep 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
deployed: webapp Meta team change is running in production deployed Change is completely running in production release-note-none size/XS team: webapp Issue belongs to the WebApp team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants