
feat(bigTent slo): Custom Validate function for big tent SLO schema #2016

Merged: Leo-DiCara merged 50 commits into main from ld/big_tent_validate on Mar 11, 2025

Leo-DiCara (Contributor) commented Jan 31, 2025

Please don't merge this until all work in https://github.com/grafana/slo/issues/2697 is complete. We want to ensure that the SLO Plugin API is merged and running in prod before we merge the terraform-provider changes.

This adds a custom validate function for queries. It checks that the basic form of the Big Tent SLO JSON is followed before we allow the user to round-trip it to the API. It checks for:

  • Basic JSON structure
  • refId field
  • datasource field
  • uid and type subfields in datasource

Closes https://github.com/grafana/slo/issues/2702, https://github.com/grafana/slo/issues/2876, https://github.com/grafana/slo/issues/2875
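To make the four checks concrete, here is a minimal standalone sketch of a validator that performs them in order. `validateGrafanaQueries` is a hypothetical name, and the sketch returns plain Go errors instead of the Terraform `diag.Diagnostics` the provider uses. It is an illustration of the checks listed above, not the code merged in this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// validateGrafanaQueries is a hypothetical stand-in for the PR's validator.
// It returns plain errors rather than Terraform diag.Diagnostics.
func validateGrafanaQueries(raw string) []error {
	// Basic JSON structure: the value must decode to a list of objects.
	var queries []map[string]any
	if err := json.Unmarshal([]byte(raw), &queries); err != nil {
		return []error{fmt.Errorf("expected a JSON list of query objects: %v", err)}
	}

	var errs []error
	for i, q := range queries {
		// refId field: fall back to the list index in later messages if absent.
		refID, ok := q["refId"].(string)
		if !ok || refID == "" {
			errs = append(errs, fmt.Errorf("query %d: missing refId", i))
			refID = fmt.Sprintf("#%d", i)
		}
		// datasource field must be an object.
		ds, ok := q["datasource"].(map[string]any)
		if !ok {
			errs = append(errs, fmt.Errorf("query (refId: %s): missing datasource", refID))
			continue
		}
		// uid and type subfields inside datasource.
		for _, field := range []string{"uid", "type"} {
			if v, ok := ds[field].(string); !ok || v == "" {
				errs = append(errs, fmt.Errorf("query (refId: %s): missing datasource: %s", refID, field))
			}
		}
	}
	return errs
}

func main() {
	good := `[{"refId":"A","datasource":{"uid":"abc","type":"prometheus"}}]`
	bad := `[{"refId":"B","datasource":{"type":"graphite"}}]`
	fmt.Println(len(validateGrafanaQueries(good))) // 0
	for _, err := range validateGrafanaQueries(bad) {
		fmt.Println(err)
	}
}
```

Collecting every problem, rather than stopping at the first one, mirrors the `diags = append(diags, ...)` pattern used in the provider, so a user sees all missing fields in a single plan.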

Leo-DiCara requested review from a team as code owners January 31, 2025 22:52

In order to lower resource usage and have a faster runtime, PRs will not run Cloud tests automatically.
To do so, a Grafana Labs employee must trigger the cloud acceptance tests workflow manually.

Leo-DiCara changed the title from "feat(bigTent slo): Custom Validate function for big tent SLO schema" to "[WIP] feat(bigTent slo): Custom Validate function for big tent SLO schema" on Feb 1, 2025

elainevuong (Contributor) commented Feb 5, 2025

I'd like to see if we can remove the warnings for valid Prometheus queries when we validate the query field. It might not be a fantastic customer experience for existing users to suddenly get warnings from the provider for queries they haven't changed.

Perhaps we can investigate whether DiffSuppressFunc is an option worth pursuing?
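In terraform-plugin-sdk/v2, a DiffSuppressFunc receives the attribute key, the old and new values, and the `ResourceData`, and returns true to hide a diff. Below is a minimal sketch of the comparison such a hook could delegate to, written without the SDK so it runs standalone. `jsonEquivalent` is a hypothetical helper. One caveat applies: this approach targets plan diffs, not validation, so on its own it may not silence the warnings.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// jsonEquivalent reports whether two attribute values hold the same JSON
// document, ignoring whitespace and object key order. Values that are not
// JSON (for example a plain PromQL expression) fall back to exact comparison.
func jsonEquivalent(oldValue, newValue string) bool {
	var o, n any
	if json.Unmarshal([]byte(oldValue), &o) != nil || json.Unmarshal([]byte(newValue), &n) != nil {
		return oldValue == newValue
	}
	return reflect.DeepEqual(o, n)
}

func main() {
	fmt.Println(jsonEquivalent(`{"refId":"A","hide":false}`, `{ "hide": false, "refId": "A" }`)) // true
}
```

Array order is still significant under `reflect.DeepEqual`, which is what you want for a query list, where order is meaningful.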

elainevuong (Contributor) commented:

Just want to circle back: I verified that if we use the Terraform Provider as-is, it DOES show warnings, even if there isn't a diff in the field.

My last thought is I'm not even sure validate runs if a tf plan doesn't generate a diff for that field. This would mean unchanged SLOs wouldn't generate warnings. tf validate would generate these warnings every time.

With the way this PR is currently set up, we've seen this warning in both Ratio SLOs (due to the ValidateBigTent on the success and total metrics) AND in the Freeform PromQL SLOs. These warnings appear both on "CREATE" of a new SLO and on "UPDATE" of an existing SLO.

I think we need two changes:

  • drop the ValidateBigTent() on both the success_metric and total_metric
  • remove the warning if we can't parse the query as JSON

Screenshots, if anyone is interested in viewing what the user experience is like:

CREATE - Ratio SLO: [screenshot]

CREATE - Adv Freeform Prom SLO: [screenshot]

UPDATE - Adv Freeform Prom SLO: [screenshot: updating an existing Adv Freeform SLO]

```terraform
description = "Terraform Description"
query {
grafana_queries {
grafana_queries = jsonencode([
```

elainevuong (Contributor) commented Feb 25, 2025:

issue: can we use a Big Tent SLO example that matches the format and refIds the frontend generates (i.e. the Success and Total refIds), so it's easier for users to follow along? The example currently used looks like a non-standard SLO we used for initial testing.

Maybe something like this:

```json
[
  {
    "datasource": {
      "type": "graphite",
      "uid": "datasource-uid"
    },
    "refId": "Success",
    "target": "groupByNode(perSecond(web.*.http.2xx_success.*.*), 3, 'avg')"
  },
  {
    "datasource": {
      "type": "graphite",
      "uid": "datasource-uid"
    },
    "refId": "Total",
    "target": "groupByNode(perSecond(web.*.http.5xx_errors.*.*), 3, 'avg')"
  },
  {
    "datasource": {
      "type": "__expr__",
      "uid": "__expr__"
    },
    "expression": "$Success / $Total",
    "refId": "Expression",
    "type": "math"
  }
]
```

```go
diags = append(diags, diag.Diagnostic{
	Severity: diag.Error,
	Summary:  "Missing Required Field",
	Detail:   fmt.Sprintf("expected Big Tent Query (refId:%v) to have a uid", refID),
})
```

elainevuong (Contributor) commented Feb 26, 2025:

issue: do we want to use "Big Tent" language in our error messages? How about "supported datasource must specify a datasource: uid" and/or "datasource: type"?

nit: did you want to specify datasource: uid rather than just uid? Same comment for datasource: type above, since the uid and type fields are part of the datasource struct. This is what it looks like in the Terraform terminal (I removed the type field and then the uid field to see how the error message appears to a user):

[Screenshot 2025-02-26 at 12:08:33 PM]
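Below is a sketch of the wording suggested in this comment: no "Big Tent" jargon, and the nested field spelled out. `missingDatasourceField` is a hypothetical helper. In the provider, a string like this would presumably go in the `Detail` of the `diag.Diagnostic` shown above.

```go
package main

import "fmt"

// missingDatasourceField is a hypothetical helper that phrases the error the
// way the review suggests: no "Big Tent" wording, and the nested field spelled
// out as "datasource: uid" or "datasource: type".
func missingDatasourceField(refID, field string) error {
	return fmt.Errorf("supported datasource (refId: %s) must specify a datasource: %s", refID, field)
}

func main() {
	fmt.Println(missingDatasourceField("Success", "uid"))
	// prints: supported datasource (refId: Success) must specify a datasource: uid
}
```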

Leo-DiCara changed the title from "[WIP] feat(bigTent slo): Custom Validate function for big tent SLO schema" to "feat(bigTent slo): Custom Validate function for big tent SLO schema" on Feb 26, 2025

elainevuong (Contributor) commented:

Sample unhappy grafanaquery response from the API. Notice the three apostrophes in the Graphite query (`'avg'''`).

The query below returns:

```json
{
    "code": 400,
    "error": "SLO failed validation: invalid parameters on query please check your input values"
}
```

```json
[
  {
    "datasource": {
      "type": "graphite",
      "uid": "becy9yvjmuz9ca"
    },
    "refId": "Success",
    "target": "groupByNode(perSecond(web.*.http.2xx_success.*.*), 1, 'avg''')"
  },
  {
    "datasource": {
      "type": "graphite",
      "uid": "becy9yvjmuz9ca"
    },
    "refId": "Total",
    "target": "groupByNode(perSecond(web.*.http.*.*.*), 1, 'avg')"
  },
  {
    "datasource": {
      "type": "__expr__",
      "uid": "__expr__"
    },
    "expression": "$Success / $Total",
    "refId": "Expression",
    "type": "math"
  }
]
```
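The provider-side checks only cover the JSON shape (refId, datasource, uid, type). A malformed query string like this one passes them and is rejected only by the API. The sketch below shows one way a client could turn the 400 body into a readable message. `apiError` and `describeAPIError` are hypothetical names, and the struct assumes only the `code`/`error` shape seen in the response above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// apiError mirrors the 400 body shown above: {"code": ..., "error": "..."}.
type apiError struct {
	Code    int    `json:"code"`
	Message string `json:"error"`
}

// describeAPIError turns an error response body into a one-line message,
// falling back to the raw body if it does not have the expected shape.
func describeAPIError(body []byte) string {
	var e apiError
	if err := json.Unmarshal(body, &e); err != nil || e.Code == 0 {
		return "unrecognized error response: " + string(body)
	}
	return fmt.Sprintf("API returned %d: %s", e.Code, e.Message)
}

func main() {
	body := []byte(`{"code": 400, "error": "SLO failed validation: invalid parameters on query please check your input values"}`)
	fmt.Println(describeAPIError(body))
}
```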

ellisda (Contributor) left a comment:

LGTM. We'll have some minor doc additions here, with instructions to prototype queries in a dashboard panel and copy/paste them here, but we can add that later as well.

````markdown
### Basic
### Ratio

```terraform
````

praise: we've been meaning to swap / fix this for a while. Not part of Big Tent, but a good improvement.

```terraform
"uid" : "datasource-uid"
},
refId : "Success",
target : "groupByNode(perSecond(web.*.http.2xx_success.*.*), 3, 'avg'')"
```

issue: the extra ' char was from testing and is invalid in its current form.

Suggested change:

```diff
- target : "groupByNode(perSecond(web.*.http.2xx_success.*.*), 3, 'avg'')"
+ target : "groupByNode(perSecond(web.*.http.2xx_success.*.*), 3, 'avg')"
```

Comment on lines +145 to +146:

```terraform
refId : "Total",
target : "groupByNode(perSecond(web.*.http.5xx_errors.*.*), 3, 'avg')"
```

suggestion: 5xx errors isn't really a great "total" query, so we should give a better example. But I think we may have used this one on the website too, so we need to fix both.

Leo-DiCara (Author) replied:

Yeah, this is the exact same query.

````markdown
}
}
}
```

For a list of currently supported datasources review the [documentation](https://grafana.com/docs/grafana-cloud/alerting-and-irm/slo/set-up/additionaldatasources/#supported-data-sources).
````
nit / suggestion: most of our docs seem to try to have more informative link titles than "docs" (though I say "docs" internally all the time)

Suggested change:

```diff
- For a list of currently supported datasources review the [documentation](https://grafana.com/docs/grafana-cloud/alerting-and-irm/slo/set-up/additionaldatasources/#supported-data-sources).
+ For a complete list, see [supported data sources](https://grafana.com/docs/grafana-cloud/alerting-and-irm/slo/set-up/additionaldatasources/#supported-data-sources).
```

```go
Type:        schema.TypeList,
MaxItems:    1,
Optional:    true,
Description: "Array for holding a set of grafana queries",
```

Suggested change (remove this line):

```diff
- Description: "Array for holding a set of grafana queries",
```

The other query types seem to skip this outer description.


```markdown
### Grafana Queries - Any supported datasource

Grafana Queries use the grafana_queries field. It expects a JSON string list of valid grafana query JSON objects.
```

Suggested change:

```diff
- Grafana Queries use the grafana_queries field. It expects a JSON string list of valid grafana query JSON objects.
+ Grafana Queries use the grafana_queries field. It expects a JSON string list of valid grafana query JSON objects, the same as you'll find assigned to a Grafana Dashboard panel `targets` field.
```
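Following the "prototype in a dashboard panel" workflow mentioned above, this sketch pulls the `targets` array out of a panel's JSON and re-encodes it as a compact JSON string, the kind of value the grafana_queries field expects. `targetsFromPanel` is a hypothetical helper, and it assumes the panel JSON has a top-level `targets` list of objects. Because it round-trips through `map[string]any`, object keys come out sorted.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// targetsFromPanel extracts a panel's "targets" array and returns it as a
// compact JSON string. Keys are re-emitted in sorted order.
func targetsFromPanel(panelJSON []byte) (string, error) {
	var panel struct {
		Targets json.RawMessage `json:"targets"`
	}
	if err := json.Unmarshal(panelJSON, &panel); err != nil {
		return "", err
	}
	if len(panel.Targets) == 0 {
		return "", fmt.Errorf("panel has no targets field")
	}
	var targets []map[string]any
	if err := json.Unmarshal(panel.Targets, &targets); err != nil {
		return "", fmt.Errorf("targets is not a list of objects: %w", err)
	}
	out, err := json.Marshal(targets)
	return string(out), err
}

func main() {
	panel := []byte(`{"title":"Requests","targets":[{"refId":"A","datasource":{"uid":"u","type":"prometheus"},"expr":"up"}]}`)
	s, err := targetsFromPanel(panel)
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```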

Duologic removed the request for review from a team on March 4, 2025 12:02
Leo-DiCara merged commit 7ec4803 into main on Mar 11, 2025
26 checks passed
Leo-DiCara deleted the ld/big_tent_validate branch on March 11, 2025 19:50
3 participants