research: time metrics with honeycomb #1115

Open
kratsg opened this issue Oct 15, 2020 · 10 comments
Labels: research (experimental stuff)

Comments

@kratsg (Contributor) commented Oct 15, 2020

Description

See the Python SDK: https://github.com/honeycombio/libhoney-py

Workflow I had in mind

  • master: baseline time that we always know works (and we can do nightly metrics to be sure)
  • PR opens: if it makes one of our metrics slower, trigger some alert or comment on what got slower, with a link to Honeycomb for the details

In general, we won't merge in PRs unless we can fix the slow stuff.

@ismith:

  • Caveat: Honeycomb is ideally meant for a lookback window of no more than 2 weeks; you can set a query to look back up to two months.
  • In events, you can specify fields. You get duration_ms for ~free; you also might get the function name for free. But you'll want to add, in config, the branch name and PR [id].
  • Then you can do a query that creates a graph of master vs. non-master, and define thresholds for yay/nay.
  • We do not have an automated GitHub check, so no automated enforcement, but we do offer Slack/email/PagerDuty and webhooks if you want something custom.
  • If you blog this when you're done we'll give you stickers and maybe a t-shirt.
  • In a Coveralls world this might be configurable as "PR is red, may not merge", same as if you failed CI. We don't offer that out of the box, and I don't know that you want that. But setting it up to comment on the PR is not hard to build with a webhook.
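As a concrete (hedged) illustration of the fields @ismith describes, here is a minimal sketch using the low-level libhoney-py SDK; the write key, dataset name, field names, and the run_fit stand-in are all illustrative rather than anything pyhf ships:

```python
import time

import libhoney


def run_fit():
    """Hypothetical stand-in for the pyhf call being timed."""
    time.sleep(0.1)


# placeholder credentials/dataset; real values would come from CI secrets
libhoney.init(writekey="YOUR_WRITE_KEY", dataset="pyhf-benchmarks")

ev = libhoney.new_event()
ev.add_field("branch", "master")  # in CI, read from e.g. GITHUB_REF instead
ev.add_field("pr", 1115)          # illustrative PR id
start = time.perf_counter()
run_fit()
ev.add_field("duration_ms", (time.perf_counter() - start) * 1e3)
ev.send()
libhoney.close()  # flush pending events before the process exits
```

A Honeycomb query grouping on `branch` could then plot master against PR branches and drive the alert/webhook mentioned above.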

@kratsg added the research label Oct 15, 2020
@matthewfeickert (Member)

Probably also worth looking at airspeed velocity as this seems to be basically exactly what I had in mind.

@matthewfeickert (Member)

This might be worth looking into if we can get an external grant to pay for us to run a small Digital Ocean or AWS instance to host this. Seems pretty valuable.

@matthewfeickert (Member)

> Probably also worth looking at airspeed velocity as this seems to be basically exactly what I had in mind.

NumPy and SciPy use asv for benchmarks, so it might be worth looking at how they do it.

An interesting thing is that asv will run the benchmarks on old commits, so you can build up the performance history automatically.

I think(?) this might be possible to do with just a repo over in the pyhf org that runs things on a cron job.
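As a rough sketch of what an asv benchmark module for pyhf might look like (the file name and workload are illustrative; asv's convention is to time methods whose names start with time_):

```python
# benchmarks/benchmarks.py -- illustrative layout only
import pyhf


class TimeHypotest:
    """asv times the body of each ``time_*`` method; ``setup`` is not timed."""

    def setup(self):
        # build a small model and dataset outside the timed region
        self.model = pyhf.simplemodels.uncorrelated_background(
            signal=[12.0, 11.0], bkg=[50.0, 52.0], bkg_uncertainty=[3.0, 7.0]
        )
        self.data = [51.0, 48.0] + self.model.config.auxdata

    def time_hypotest(self):
        pyhf.infer.hypotest(1.0, self.data, self.model)
```

Pointing an `asv.conf.json` at the pyhf repository and running this on a schedule (or against a range of old commits) is what would build up the history described above.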

@matthewfeickert (Member)

cf. also "Is GitHub Actions suitable for running benchmarks?", where the answer is: yes.

@matthewfeickert (Member)

And pydata/xarray#5796 provides basically a template for how to do all of this!

@matthewfeickert (Member)

In glotzerlab/signac#776 @bdice mentions

> We deleted the CI script for benchmarks from signac 2.0 anyway, because it's not reliable and we want to use asv instead.

@bdice I would love to talk to you about asv sometime, as we've been wanting to set it up for pyhf for a while but haven't yet. If you have insights on how to get going with it, I'd be quite keen to learn.

@bdice commented Jul 3, 2022

You can see signac's benchmarks defined here: https://github.com/glotzerlab/signac/blob/master/benchmarks/benchmarks.py

And the asv config: https://github.com/glotzerlab/signac/blob/master/asv.conf.json

And here's a quick reference I wrote on how to use asv: https://docs.signac.io/projects/core/en/latest/support.html#benchmarking

I have mixed feelings about it. It can be difficult to make asv do what I want sometimes, and the project's development has been rather slow. Sometimes I wish for features that don't exist (like being able to have greater control over test setup/teardown to ensure that caches are cleared between runs without having to regenerate input data -- something like pytest fixtures would be helpful). I've run into a handful of situations while running asv that felt like bugs but were difficult to trace down. I don't know of better alternatives to asv unless you have the time and energy to roll your own Python scripts, which is what signac had done for a long time. Eventually the maintenance of those DIY scripts and their limitations were annoying enough that outsourcing to asv felt like a good decision.
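For readers comparing notes, a hedged sketch of the hooks asv does provide: as I understand it, setup_cache runs once per benchmark and its pickled return value is passed as the first argument to setup and the timed methods, which is roughly the level of control described above (the pyhf workload is again just illustrative):

```python
import pyhf


class TimeHypotestCached:
    def setup_cache(self):
        # runs once per benchmark; asv pickles the return value and passes it
        # to setup() and the time_* methods, so input generation is not re-timed
        model = pyhf.simplemodels.uncorrelated_background(
            signal=[12.0, 11.0], bkg=[50.0, 52.0], bkg_uncertainty=[3.0, 7.0]
        )
        return {"spec": model.spec}

    def setup(self, cached):
        # rebuild cheap per-run state here; finer pytest-fixture-style control
        # (e.g. clearing caches between repeats) is the part that is missing
        self.model = pyhf.Model(cached["spec"])
        self.data = [51.0, 48.0] + self.model.config.auxdata

    def time_hypotest(self, cached):
        pyhf.infer.hypotest(1.0, self.data, self.model)
```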

edit: I read some of the thread above. I have had really mediocre experiences with running benchmarks as a part of CI or on shared servers. Dedicated local hardware is the only way I've ever gotten metrics that I really trust, especially for a project like signac that is heavy on I/O. The results from Quansight on GitHub Actions were extremely helpful for calibrating my own experience of annoyance with CI benchmarks in the past. I don't think the metrics they see for false positives and highly noisy data are good enough for what the signac project has needed in the past -- local benchmarks are much less variable in my experience.

@astrojuanlu

Hi folks, @matthewfeickert asked me to leave my 2 cents here a few days ago. Basically 2 things:

> Dedicated local hardware is the only way I've ever gotten metrics that I really trust, especially for a project like signac that is heavy on I/O.

This is 100% correct. Here are the benchmarks we ran a few years ago in poliastro: the noisy lines are my own laptop (supposedly while running nothing else); the almost-straight line is a cheap dedicated server we rented from https://www.kimsufi.com/. Slower, but infinitely more useful.

[benchmark plot: noisy laptop timings vs. the nearly flat dedicated-server line]

> I have mixed feelings about it. It can be difficult to make asv do what I want sometimes, and the project's development has been rather slow.

Recently they got a grant (https://pandas.pydata.org/community/blog/asv-pandas-grant.html) and managed to revamp the CI and make a release. The project has not seen more commits since then, so I agree it's not very active, but I'm not aware of any alternatives. The closest one would be https://github.com/ionelmc/pytest-benchmark/, but it's equally inactive.
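For comparison, pytest-benchmark hooks into an existing pytest suite via a benchmark fixture rather than a separate harness; a minimal, hypothetical example with a pyhf workload:

```python
# test_benchmark_hypotest.py -- hypothetical pytest-benchmark usage
import pyhf


def test_hypotest_speed(benchmark):
    model = pyhf.simplemodels.uncorrelated_background(
        signal=[12.0, 11.0], bkg=[50.0, 52.0], bkg_uncertainty=[3.0, 7.0]
    )
    data = [51.0, 48.0] + model.config.auxdata
    # the fixture calls the target repeatedly and reports min/mean/stddev
    benchmark(pyhf.infer.hypotest, 1.0, data, model)
```

It doesn't track history across commits on its own the way asv does, though saved runs can be compared between sessions.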

@matthewfeickert (Member)

Following up on @astrojuanlu's excellent points: I was talking with @gordonwatts about this at the 2022 IRIS-HEP Institute Retreat, and he mentioned that he might have some dedicated AWS machines we could potentially use (or at least trial a demo on). Gordon, could you elaborate on this? My memory from last week isn't as clear as it was the next day.

@gordonwatts

We have an account connected with IRIS-HEP for benchmarking (@masonproffitt and I were going to use it for some benchmarking for our ADL Benchmark paper work, but that didn't happen). It is still active; only Mason and I have access. You do get a dedicated machine of a specific size (at least, that is what the web interface says). So if one can build a script that does the complete install and then runs the benchmarks, this could be a cheap-ish way to run them.
