Keywords for describing array patterns; "concat", "someOf", array repetition/kleene star #1323
Some web frameworks use arrays like this to specify query strings with duplicate key names. Do these keywords require some sort of new underlying JSON Schema feature? If not, would this be a better fit in the vocabularies repo?
Yeah, another good example. Sometimes it's just less verbose to do an array with odd/even pairings rather than an array of vector arrays.
The "concat" keyword does not fit well with how most implementations perform validation, and most implementations will probably implement a naive backtracking algorithm in O(n^2) time or worse, even though it can be done in O(n) with much pre-computation. The others should be fairly easy: I think "someOf" or a similar feature has been requested multiple times (though I can't find any off-hand), and it's fundamentally the same thing as "oneOf".
@awwright can you explain a bit more about how "concat" would work?
@handrews It's essentially the same thing as concatenation of characters or groups in regular expressions:
There are three elements being concatenated together here (one alternation, then two character classes; the second matches exactly one character). It can get complicated when there's a repetition followed by an alternation:
So to write a schema with odd/even items that ends with a boolean, it might look like this:

{
  type: "array",
  concat: [
    { type: "array", prefixItems: [ {type: "string"}, {type: "integer"} ], minItems: 2, maxItems: 2, repeatMin: 0, repeatMax: null },
    { type: "array", additionalItems: { type: "boolean" }, minItems: 1, maxItems: 1 }
  ]
}

...there are different ways the "repeat" keywords could interact with other keywords; I haven't fully thought it out yet. Maybe they only apply to "concat" instead of any of the *items keywords.
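For illustration only (these keywords are proposed here, not standard JSON Schema), here is a hand-rolled Python check of the shape that schema describes: zero or more (string, integer) pairs followed by exactly one boolean. This is not a JSON Schema implementation, just the intended acceptance behavior:

```python
# Hand-rolled sketch of the proposed schema's meaning: zero or more
# (string, integer) pairs, then exactly one boolean at the end.
def valid(arr):
    # The last item must be a boolean (and the array must be non-empty).
    if not arr or type(arr[-1]) is not bool:
        return False
    pairs = arr[:-1]
    # Everything before it must form complete (string, integer) pairs.
    if len(pairs) % 2 != 0:
        return False
    return all(
        isinstance(pairs[i], str) and type(pairs[i + 1]) is int
        for i in range(0, len(pairs), 2)
    )

print(valid(["volume", 11, "muted", 0, True]))  # True
print(valid([True]))                            # True  (zero pairs)
print(valid(["volume", 11]))                    # False (no trailing boolean)
```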
@awwright thanks, I think I understand this now. Definitely interesting, and yes possibly involving an underlying change in JSON Schema capabilities.
However, keywords with a more complex impact on how an applicator applies and evaluates its schemas might require a more substantial change, so I'm in favor of keeping this issue in this repository while we consider that. Ultimately I think this would fit better as an extension vocabulary than in the standard applicator vocabulary, but it's a really interesting idea for one!
I'm going to rewrite this with formatting I find easier to follow, and in YAML because I'm lazy.

type: array
concat:
  - type: array
    prefixItems:
      - type: string
      - type: integer
    minItems: 2
    maxItems: 2
    repeatMin: 0
    repeatMax: null
  - type: array
    additionalItems:
      type: boolean
    minItems: 1
    maxItems: 1

OK, looking at it like this, I see where the difficulties would be in the current model. Basically, it's the block of repeatMin/repeatMax keywords. I think this would need to be reworked a bit to make sure that any control keywords are outside of the schema object that they need to control. Which has its own wrinkles, because you're trying to change the start point of evaluation.
I agree with @handrews. I think what this needs is a new keyword for the repeating content. I'd like to propose another new keyword to handle this:

type: array
concat:
  - type: array
    repeatedItems:
      - type: string
      - type: integer
    repeatMin: 0
    repeatMax: null
  - type: array
    additionalItems:
      type: boolean
    minItems: 1
    maxItems: 1

Here, repeatedItems carries the repeating content, and the repetition keywords sit alongside it. This also solves Henry's concern that the control keywords were inside the schema object they need to control.
The second subschema, as written, evaluates the whole array, which in this case would fail. One way that I can see to resolve this is to just remove "concat":

type: array
repeatedItems:
  - type: string
  - type: integer
repeatMin: 0
repeatMax: null
items:
  type: boolean
minItems: 1
maxItems: 1

This solves the whole-array evaluation problem. Another option:

type: array
repeatedItems:
  - type: string
  - type: integer
repeatMin: 0
repeatMax: null
prefixItems:
  type: boolean
items: false

But then it's strange that "prefixItems" isn't specifying what comes first, and it removes a lot of flexibility... which leads back to using "concat".
I think this is a really exciting idea. Array validation is definitely a place where JSON Schema is lacking.
I disagree. It's certainly not super common, but I've seen quite a few SO questions over the years where I've had to tell people that JSON Schema can't express the weird sequence they need to describe. If we can come up with a good way to improve that situation, I think it's very well worth considering for the spec. I'd like to suggest something similar to what @gregsdennis proposed, but a little inverted: the repetition keywords would sit adjacent to the items keywords they modify, within each "concat" subschema. Here are some examples. Sorry, I'm sticking with JSON because I find YAML difficult to read.
{
"type": "array",
"concat": [
{
"prefixItems": [{ "const": "a" }]
},
{
"items": { "const": "b" },
"repeatMin": 0,
"repeatMax": 3
},
{
"prefixItems": [{ "const": "c" }]
},
{ "items": false }
]
}

We can also define repetition at the root, where it applies to the adjacent "concat":

{
{
"type": "array",
"concat": [
{ "prefixItems": [{ "const": "a" }, { "const": "b" }] }
],
"repeatMax": 3,
"unevaluatedItems": false
}
Nesting is also possible:

{
"type": "array",
"concat": [
{ "prefixItems": [{ "const": "a" }] },
{
"concat": [
{
"items": { "const": "b" },
"repeatMin": 1,
"repeatMax": 2
},
{ "prefixItems": [{ "const": "c" }] }
],
"repeatMax": 2
},
{ "items": false }
]
}

Alternation can be done with "anyOf":

{
{
"type": "array",
"concat": [
{
"anyOf": [
{ "prefixItems": [{ "const": "a" }] },
{ "prefixItems": [{ "const": "b" }] }
]
},
{ "prefixItems": [{ "const": "c" }] }
]
}

I don't want to get deep into arguing about keyword naming just yet.
I think this still breaks a basic tenet of how JSON Schema works: schema objects evaluate the instance independently. For example, in the schemas above, { "prefixItems": [{ "const": "c" }] } is itself dynamic because it doesn't know where to start and relies on the evaluations of previous subschemas for this information. (Note we do have keywords that rely on evaluations from other keywords, but we don't have that behavior in subschemas.) For every other applicator that contains multiple subschemas, the schema objects operate completely independently of their siblings, thus each evaluates the entire (local) instance. This new behavior is a significant departure from the current operating model.
@gregsdennis yes, I agree. I think this could be addressed by creating a new kind of scope. This is where I think it's important to not have "operator" keywords inside the schema objects they control. However, adjacent keywords impacting each other is within the current paradigm. It's not immediately clear to me whether this introduces complexity beyond what our current paradigm supports. I think the key thing is to first see if we can keep this within the paradigm at all.
Oh, I agree that there is an element that doesn't fit the current architecture. Sorry if I didn't make that clear enough.
I'd have to implement it to decide if I think it's a significant departure or something that could be relatively easily incorporated. I honestly don't have a solution in mind, so it very well could end up being a significant change.
I don't know if this was specifically in response to my comments, but I think the way I defined
100% agree, but at this point I don't think it's going to be possible to support this kind of thing without breaking the mold somewhere. I think we'll have to come up with a few ideas, determine how they don't fit the model, determine how the model could be adjusted, and decide if the break is worth it. Of course, I'll be thrilled to be wrong if someone comes up with a solution that does fit the model.
The difference is that It's only the interaction of either Currently, the applicator behavior is defined in §7.5 as:
So if a
We do not have a direct precedent for a
We do have a dynamic parent lookup behavior that
Scope! Yes! That's what feels weird. Taking from @jdesrosiers' example:

{
"type": "array",
"concat": [
{
"prefixItems": [{ "const": "a" }]
},
{
"items": { "const": "b" },
"repeatMin": 0,
"repeatMax": 3
},
{
"prefixItems": [{ "const": "c" }]
},
{ "items": false }
]
}

In this schema, the scope (adjacent or dynamic) of the repetition keywords isn't clear.
Yeah. Here's another way to think about "concat": depending on what you know about the subschemas, you can optimize this process, potentially as good as O(n), but worst case (when the subschemas are completely opaque), you'll need a recursive algorithm in O(n^m) time (instance length raised to the subschema count—yikes!).
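That worst case can be sketched directly. The following is a naive recursive splitter, assuming only that each subschema is an opaque predicate over an array slice; the function and predicate names are illustrative, not from any implementation:

```python
from collections.abc import Callable, Sequence

# Each "subschema" is modeled as an opaque predicate over a slice of the array.
Subschema = Callable[[Sequence], bool]

def concat_valid(items: Sequence, subschemas: list[Subschema]) -> bool:
    """True if items can be split into consecutive segments, one per subschema."""
    if not subschemas:
        return len(items) == 0  # zero subschemas: only the empty array matches
    head, *rest = subschemas
    # Try every possible split point for the first segment, then recurse.
    return any(
        head(items[:i]) and concat_valid(items[i:], rest)
        for i in range(len(items) + 1)
    )

# Example: an even-length run of (string, int) pairs, then exactly one boolean.
pairs = lambda seg: len(seg) % 2 == 0 and all(
    isinstance(seg[i], str) and type(seg[i + 1]) is int
    for i in range(0, len(seg), 2)
)
one_bool = lambda seg: len(seg) == 1 and type(seg[0]) is bool

print(concat_valid(["a", 1, "b", 2, True], [pairs, one_bool]))  # True
print(concat_valid(["a", 1, "b"], [pairs, one_bool]))           # False
```

Because the predicates are opaque, every split point must be tried at every level, which is where the exponential blow-up comes from.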
Right. I think that's effectively what @handrews was trying to say, but I didn't entirely follow the argument. Usually a keyword that uses an annotation as input would look that annotation up first and then do its work. It's unprecedented, but is it really significantly different if it looks up annotations while it's doing its work, after it processes each sub-schema? That's the case of the sub-schemas passing the new starting point back to "concat". @handrews, I think what made it hard for me to follow your last comment was that you introduced some keywords and I wasn't clear how you intended them to be defined and used. Maybe you can provide an example.
(scrolls back) oh, huh. I thought I'd written one out but I didn't. It's probably in a half-written comment in another tab somewhere. I'll do that shortly.
Yeah, this is a major part of it. We sort of have an architectural channel here already. I think in addition to the individual communication channels, we need to think through the larger communication patterns. To date, we have had very simple lateral-adjacent, child-to-parent, and parent-to-child communications. Combining those into a cousin-to-cousin communication, even when implemented on existing communication channels, introduces a set of interactions that is more difficult to visualize and reason about.
As far as the specification text would be concerned, there shouldn't be a need to define how communication channels work. For defining the "concat" keyword, it should suffice to say "This keyword is valid when there is any way to split the instance into segments, such that each segment is valid against the corresponding subschema." (Note in the case of zero subschemas, only the zero-length array matches.) This is because there are different, equally legitimate ways to implement the algorithm. When defining how an implementation works, consider the non-deterministic cases: if I split apart a string into an array of one item per character, could I write a schema that matches exactly the same strings as a given regular expression? While the idea of "passing the starting point" might be used internally by an implementation, this is only an optimization: either it must support backtracking, or you have to support nondeterministic results (that is, multiple logic paths being followed at the same time). And this technique won't be used by an O(n) algorithm at all, which will compute the union of the branches before consuming the input.
This is not about defining how to implement the communication, it's about whether such communication channels (or other algorithmic approaches) should exist at all, as something that JSON Schema implementations should, in general, support. Which would no doubt lead to them being used in future keywords. Adding these communication channels would have implications for parallel execution and other optimizations, just as adding runtime child-to-parent communication (for the unevaluated* keywords) did. This is what we need to consider before designing keywords that depend on this sort of communication channel and control flow existing.
Haha, I just saw that I proposed something like this a couple years ago. 🤦♂️
This proposal randomly invaded my brain the other day while I was hiking and wouldn't leave. One problem we identified was that subschemas needed to know where in the array to start evaluating. So, I was thinking that in order to solve these problems, all the information about what index of the array is being evaluated needs to be defined by the keyword itself. It can't delegate to subschemas. I started off with a crazy verbose syntax and simplified until I ended up with something that was reasonable to use. Mostly coincidentally, I ended up with something that mimics regular expression syntax pretty closely. (I'm going to use a different keyword name because this is very different from the original proposal.) Here's a schema that represents an array with any number of "a"s but always ends with a "b".

{
"$comment": "^a*b$",
"type": "array",
"sequence": [
{ "const": "a" }, "*",
{ "const": "b" },
"$"
]
}

This one represents a sequence of pairs.

{
"$comment": "^(ab)*",
"type": "array",
"sequence": [
[
{ "const": "a" },
{ "const": "b" }
], "*"
]
}

Here's a complex one just to show what's possible.

{
"$comment": "^a(b{1,2}c){0,2}$",
"type": "array",
"sequence": [
{ "const": "a" },
[
{ "const": "b" }, "{1,2}",
{ "const": "c" }
], "{0,2}",
"$"
]
}

This approach is intuitive because it looks and works just like a regular expression. This would be very powerful, and it would be easy to introduce additional operators to add more functionality as needed in the future. The main downside of this approach is that implementations of this keyword are essentially regular expression engines (although most likely with limited functionality). I'm sure that will prove difficult to implement properly. I implemented a very simple engine that seems to be working correctly and reasonably efficiently so far, but I would be surprised if I wasn't missing something. I doubt that the demand for this much power in describing array items is going to be high enough for something like this to ever make it into the main specification, but it would be interesting to have as a custom vocabulary keyword.
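Under this flat syntax, and with the toy restriction that every schema term is a single-character "const" over an array of single-character strings, the keyword can even be compiled straight to a conventional regex. A minimal sketch, with `compile_seq` and `matches` as hypothetical helper names:

```python
import re

# Toy compiler for the proposed "sequence" keyword. Assumptions: every schema
# term is {"const": <single character>}, and the instance is an array of
# single-character strings, so matching an array reduces to matching a string.
def compile_seq(seq):
    out = []
    for tok in seq:
        if isinstance(tok, dict):          # schema term: match one item
            out.append(re.escape(tok["const"]))
        elif isinstance(tok, list):        # group: parenthesized subsequence
            out.append("(?:" + compile_seq(tok) + ")")
        elif tok == "$":                   # end anchor
            out.append("$")
        else:                              # postfix operator: *, +, ?, {m,n}
            out.append(tok)
    return "".join(out)

def matches(seq, items):
    # re.match anchors at the start, mirroring the implicit "^".
    return re.match(compile_seq(seq), "".join(items)) is not None

# The "^a(b{1,2}c){0,2}$" example above:
seq = [{"const": "a"},
       [{"const": "b"}, "{1,2}", {"const": "c"}], "{0,2}",
       "$"]
print(matches(seq, ["a", "b", "c", "b", "b", "c"]))  # True
print(matches(seq, ["a", "b"]))                      # False
```

A real implementation would of course have to match arbitrary subschemas against arbitrary items rather than characters, but the compilation step shows how directly the flat syntax maps onto regex semantics.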
I think this is a good start, but I'm not a fan of the operator strings. I have an allergic reaction to magic strings. Encoding logic in strings is a step toward that. (And it's another thing we don't have precedent for.) Also having to "interpret" intent from the type of item (e.g. a string needs to be parsed, an array is a subsequence, etc.) is hard to deal with. I have to do a lot of this in my JSON Logic implementation, and it's a real headache. I think the more "JSON Schema" argument is that this is expressing "min iterations" and "max iterations" constraints, which should be keywords. Though more verbose, I'd prefer this:

{
"$comment": "^a(b{1,2}c){0,2}$",
"type": "array",
"sequence": [
{ "control": "start" },
{ "schema": { "const": "a" } },
{
"schema": {
"sequence": [
{
"schema": { "const": "b" },
"maxCount": 2
},
{ "schema": { "const": "c" } }
]
},
"minCount": 0,
"maxCount": 2
},
{ "control": "end" }
]
}
I think this fits better into the JSON Schema paradigm, and it's just as expressive.
Your alternative is close to one of the intermediate steps I went through when iterating on this problem, but we can't nest schemas like in your example. I would expect your example to describe an array as an item rather than a segment of an array. More importantly, we would have the problem of needing to pass the starting/ending position to/from the schema evaluation, which is the problem I was trying to solve in the first place. This is easily fixed by introducing a new sub-keyword for groupings rather than using nested schemas:

{
"$comment": "^a(b{1,2}c){0,2}$",
"type": "array",
"sequence": [
{ "control": "start" },
{ "schema": { "const": "a" } },
{
"group": [
{
"schema": { "const": "b" },
"maxCount": 2
},
{ "schema": { "const": "c" } }
],
"minCount": 0,
"maxCount": 2
},
{ "control": "end" }
]
}

This would be equivalent to what I proposed, but my problem with this approach is the use of the structured non-schema object to describe terms. It's something we've actively avoided in the past. We even replaced a keyword that does this ("dependencies"). I prefer the flat version. I think it's more user friendly. It's easier to write and to read. I tried to use keywords to describe min/max matches, but there wasn't a way to do it that didn't include some kind of wrapper, and I didn't think that was worth the decreased usability the wrapper brings. One of the things I really like about the flat version is that the syntax is so burdenless that it could reasonably be used to describe simple cases while also being powerful enough to describe complex cases.
I'm happy with this.
We use non-schema constructs for keyword values all the time:
The difference between those and this is that this also contains other data.
In this example, the
This unnecessarily complicates implementation, and it doesn't align with JSON Schema's notion that items are independent. (I know we're in the context of a keyword here.) Everywhere else an array is used, the items in that array are independent of each other with no context shared between them. It makes sense, then, to minimize the context shared between the items for this keyword. It's better if the repetition specifiers are directly associated with what's being repeated, which is what my proposal does.
This way, each item in "sequence" is self-contained.
It may appear to be simpler to read/write, but as I mentioned, it's SO much more difficult to implement. (Maybe it's easier in a dynamic language like JS, but in a strongly typed language, it's very burdensome.) I could do it, but I would hate it. (This comes from experience; I don't want to go down this road again.) Having strings (which are magic and have to be parsed), arrays (which imply subsequences), and booleans/objects (which signify schemas) is a very interpretive approach that we haven't even come close to with JSON Schema. There is no precedent for this. I'm strongly against including something like this in the spec. Maybe as an external vocab, but not in the spec. How would we validate the interpretive style of "sequence" with a meta-schema?
That's not what I meant. By "structured object", I mean an object whose property names are semantic rather than values. That doesn't include arrays or objects used like dictionaries. I'm talking about objects within a keyword that have their own keywords, but aren't schemas.
Those are examples of "structured objects". There's no technical reason why a keyword can't contain structured objects. It's a usability concern. It's easier (especially for beginners) to understand a schema when all objects are either a schema or a dictionary of schemas. An object property name is always either a user provided value or JSON Schema keyword. I'm ok with making an exception if there's compelling reason, but IMO I don't think this clears that bar.
I think it's worth making things slightly more complicated for implementers if it makes authoring schemas and reading schemas easier for users. There is a good reason for it.
I think this is an accidental property. It happens that it's true, but there's no reason it needs to be preserved.
In the POC implementation I did, I compile the more user-friendly flat structure into something very similar to the more implementation-friendly structure you're arguing for. That way I get the best of both worlds, easy for users and easy to evaluate.
I grant you that it would be more annoying to implement in a dynamic language, but I don't think it's hard enough that it's worth sacrificing user experience.
True, but this is a complex feature, and I think it's not surprising that we'd end up with something we've never had before. As long as it fits in the JSON Schema architecture, using a more complex structure than we've used before to describe one of the most complex features ever proposed is reasonable.
I mostly agree. It's not that I'm opposed to the keyword (in either form), but that it's so rarely needed that I don't see it ever having enough demand to make it into the spec.
I disagree. I can't back the implicit/interpretive syntax right now. It's a headache to implement and more trouble than it's worth. Sorry.
I could say this about "an object whose property names are semantic rather than values." What I've proposed is closer to what JSON Schema already has, and is therefore more easily digestible. There have been other proposals which have aimed to do this, and we've struck them down for the very reason you're citing: keywords don't themselves have properties. If we're going to introduce something new, it should be something that is at least similar to what's already there, not something that's drastically different.
It's absolutely an intentional property of JSON Schema!
This independent evaluation has always been a foundational aspect of JSON Schema. An item like

{
"schema": { "const": "b" },
"maxCount": 2
}

is fully defined, so it can be processed independently.
It seems like we aren't going to agree on this, but that's probably ok since no one is recommending it be considered for the spec. My main concern is that we seem to disagree so passionately about which properties of JSON Schema should be considered important architectural constraints, which are norms worth preserving, and which are coincidental convergences that don't need preserving. The bottom line for me is that I don't think the nested structured object approach is user-friendly enough. Neither approach breaks any important architectural constraints, and both break norms or coincidental convergences (depending on who you ask). Which is preferred is purely a matter of personal opinion. I wish there was a path to compromise, but I don't think there's an in-between in this case. A couple things occurred to me yesterday that I wanted to mention even though I don't see this keyword moving forward. It occurred to me there needs to be a way to represent alternatives. If you have something simple like a choice between two patterns, neither proposal can express it yet. The other thing that occurred to me is that error reporting for this keyword is probably going to be awkward. Generally, in an applicator like this, I'd report any sub-schema failures that occurred if the keyword fails, but that could get excessive (and likely unhelpful) quickly depending on the pattern defined in the keyword. For example, an item might be validated by several different schemas in the evaluation process. Some of those failures might be normal and don't need to be reported. I don't think that finding which failures are relevant would be an easy task, but I haven't put much thought into it yet.
In thinking about implementing this, I tried to think in terms of regex (like in the examples we've been using), and I think I've come across a problem with my proposal.

^a*(ab)+$

The process here is:
This demonstrates that each term's match can depend on what the following terms need to consume. So my item-subschema independence argument is moot. Another problem is that my model can't represent unbounded repetition. Trying to build this using my model would be

{
"type": "array",
"sequence": [
{ "control": "start" },
{
"schema": { "const": "a" },
"minCount": 0,
"maxCount": ?? // if the default is 1, then how do you represent unlimited?
},
{
"schema": { "const": "ab" },
// min count defaults to 1
"maxCount": ?? // if the default is 1, then how do you represent unlimited?
},
{ "control": "end" }
]
}

I'm still not really a fan of the implicit syntax because of my need for strongly-typed-everything, but if we can be explicit about the grammar, I think I might be persuaded.
I think this can be validated with a meta-schema easily enough using an "anyOf" of the component forms. That takes care of the components, but we may also need to specify which components can come in which order. Maybe. I'm not sure. I'd also like us to specify the algorithm I laid out in the first section of this comment, or something similar. Am I missing anything, @jdesrosiers?
I think it's more complicated than that. Regular expression terms are said to match greedily, meaning they match as many as possible, not as few as possible. When I put your example into regex101.com, it describes a* as matching as many times as possible, "giving back as needed (greedy)".
This example exposes a problem in my initial implementation (:cry:). It doesn't handle the "giving back as necessary" part properly. This forced me to do a little reading on the theory involved. If I understand correctly, this is the difference between deterministic finite automata (DFA) and non-deterministic finite automata (NFA). I reimplemented using the Thompson Construction algorithm described in that article to compile to an NFA and used the recommended search algorithm. That worked out well, but had a few consequences. I think these are general regexp properties and not specific to this algorithm, but I'm not 100% sure about that. This approach assumes that the expression is bounded, so it's not easy to implement explicit bounding with "$". I also found the bounded repetition operators ("{m,n}") difficult to support. Leaving these things out feels lazy, but it's also a feature. Sticking to what's easy to implement makes it more likely others will choose to implement it.
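The state-set style of NFA evaluation can be illustrated with a hand-built automaton for the ^a*(ab)+$ example. This is a sketch of the simulation only, not Thompson's construction itself: the states and transitions are wired by hand rather than compiled from a pattern, and no backtracking is needed because all live states advance together:

```python
# NFA-style matching over array items: instead of backtracking, advance a
# *set* of possible states one item at a time (the "union of the branches").
def match_a_star_ab_plus(items):
    """Hand-built NFA for the pattern ^a*(ab)+$ over an array of items."""
    # States: 0 = inside a*, 1 = expecting 'a' of an (ab) pair,
    #         2 = expecting 'b' of an (ab) pair, 3 = accept (>= 1 pair seen)
    states = {0, 1}
    for item in items:
        nxt = set()
        if 0 in states and item == "a":
            nxt |= {0, 1}        # consume an 'a' in a*; may also start a pair
        if 1 in states and item == "a":
            nxt.add(2)           # 'a' of a pair; now expect 'b'
        if 2 in states and item == "b":
            nxt |= {1, 3}        # pair complete: accept, or start another pair
        states = nxt
        if not states:
            return False         # no live states: the match has failed
    return 3 in states

print(match_a_star_ab_plus(["a", "a", "b"]))       # True  (a, then one pair)
print(match_a_star_ab_plus(["a", "b", "a", "b"]))  # True  (two pairs)
print(match_a_star_ab_plus(["a", "a"]))            # False (no complete pair)
```

Note that the "giving back" behavior falls out for free: after ["a", "a"], the simulation is simultaneously in "still in a*" and "mid-pair" states, so no explicit backtracking is ever required, and each item is examined exactly once.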
This shouldn't be an issue for strongly typed languages in general, just ones that don't have support for union types. It's been a while since I've worked with C#, but I think it falls in that category.
Great! The rules you laid out were what I had in mind as well. (However, you've left out support for alternation ("|").)
That's exactly the kind of thing this keyword would be able to express! Here's what I came up with. (I thought of an alternate keyword name, "itemPattern", that I prefer over "sequence".)

"$defs": {
"itemPattern": {
"type": "array",
"itemPattern": [
[
{
"if": { "type": "array" },
"then": { "$ref": "#/$defs/itemPattern" },
"else": { "$dynamicRef": "meta" }
},
{ "enum": ["?", "*", "+"] }, "?",
"|",
{ "const": "|" }
], "*"
]
}
}
I'd rather not specify an algorithm. I think these are well defined concepts that we just need to reference. I don't know this domain well enough to do that yet, but I think that should be sufficient.
Wouldn't that already be supported with "anyOf"?
It occurred to me at one point that
Hopefully that makes sense. I didn't notice it until I wrote the meta-schema for this keyword and realized I needed something I couldn't express with "anyOf".
Most regular/context-free grammars describing a subset of JSON documents have an equivalent representation in JSON Schema. However, there are some kinds of JSON arrays that can be specified with a regular expression but that have no JSON Schema equivalent. For example, an array whose items are an odd number of positive integers, then a boolean:
(Some whitespace/decimal forms supported in JSON are removed here for brevity; it is more complicated, but suffice to say it exists.)
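As a sketch of the kind of regular expression meant here (the original expression did not survive in this copy; this version likewise ignores whitespace and most numeric forms):

```python
import re

# String analogue of the example: serialized JSON for an array containing an
# odd number of positive integers followed by a boolean. One integer, then
# zero or more further *pairs* of integers, keeps the count odd.
odd_ints_then_bool = re.compile(
    r"\[[1-9][0-9]*,(?:[1-9][0-9]*,[1-9][0-9]*,)*(?:true|false)\]"
)

print(bool(odd_ints_then_bool.fullmatch("[1,true]")))       # True  (1 integer)
print(bool(odd_ints_then_bool.fullmatch("[1,2,3,false]")))  # True  (3 integers)
print(bool(odd_ints_then_bool.fullmatch("[1,2,true]")))     # False (2 integers)
```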
In order to be able to convert between a regular/context-free language and JSON Schema, there need to be some keywords that implement the "array" counterparts to regular expressions in strings:
I was trying to spec out a hierarchy of JSON Schema features (mostly by processing & memory complexity), and I realized array items with regular patterns was missing from the O(n) complexity class.
The lack of feedback relating to this sort of thing suggests this would have limited usefulness as a core requirement, but I'd like to describe it here for completeness. The only instance I can think of is arrays where even-indexed items represent a name and odd-indexed items represent the value. Node.js describes raw HTTP headers this way.