Elasticsearch 2.x does not appear to fully support JSON #15404
The change was not made arbitrarily or lightly; the problem is genuine ambiguity. For example, given two simple JSON documents:

Document A:

```json
{
  "foo": {
    "bar.baz": 5
  }
}
```

Document B:

```json
{
  "foo": {
    "bar": {
      "baz": 7
    }
  }
}
```

If these were in the same document, what does `foo.bar.baz` refer to? It's not just Elasticsearch that "doesn't like" ambiguity like this; take, for example, jq, or accessing JSON members as objects in Javascript. Would you say neither of those tools supports JSON fully? The complexity of having to remove the "." from field names is better than the ambiguity of allowing both forms.
@dakrone - thank you for the response! You compare this limitation to both Javascript and jq, but your assertion is incorrect. Both Javascript and jq fully support JSON with periods in field names, and provide syntax for working with them.

You can use associative-array syntax for accessing fields with periods in them in Javascript as well. If there now exists a subset of possible JSON documents that are in compliance with the JSON standard, and that validate properly with tools that implement the standard, but that cannot be inserted into Elasticsearch, then Elasticsearch no longer supports the JSON standard. While I understand that the decision was made in order to avoid complexity in the query parser and to avoid some queries possibly resolving in unexpected ways, I find the decision regrettable.

I'll add that in general I'm very happy with Elasticsearch; I don't want my opinion on this matter to come across as a general opinion about ES or the engineers who contribute to it. Cheers.
This appears to be the crux of your argument. Namely, you're asserting that since Elasticsearch (intentionally) returns an error on some conforming JSON texts, Elasticsearch does not support the JSON standard.

JSON is a data interchange format, and the JSON standard specifies the JSON grammar; the JSON standard does not put requirements on applications beyond specifying what a conforming JSON text is and requirements for parsing conforming JSON text. Most importantly, there is no standard nor practical requirement that applications that accept JSON must handle without error every conforming JSON text. Consider an application that uses the JSON format for application configuration; some configurations can be conforming JSON text but invalid for the application; the application can reject those and still be in conformance with how it parses the configuration file. Similarly, a web server that accepts a request with media type application/json can reject the request body at the application layer even though it is conforming JSON text.

And that is what Elasticsearch does. You can send it conforming JSON text via HTTP, but if it does not meet the requirements that Elasticsearch puts on JSON documents, Elasticsearch will give you an HTTP Bad Request. Elasticsearch will have parsed the document according to the JSON standard, and only then does it tell the client that this is a bad request for Elasticsearch.

If, however, you were to find a conforming JSON document that Elasticsearch does not parse and represent correctly internally according to the JSON standard, or a JSON response from Elasticsearch that is not conforming JSON text, then there would be a legitimate issue and it would be addressed appropriately.
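The configuration analogy can be made concrete. Below is a toy sketch (all names invented for illustration, not any real application's API) of a loader that parses conforming JSON text per the standard, then rejects it at the application layer:

```javascript
// A toy config loader: parsing follows the JSON standard, then an
// application-layer check rejects keys the application does not allow,
// even though the input was conforming JSON text.
const ALLOWED_KEYS = new Set(["host", "port"]);

function loadConfig(jsonText) {
  const config = JSON.parse(jsonText); // JSON-standard parsing succeeds
  for (const key of Object.keys(config)) {
    if (!ALLOWED_KEYS.has(key)) {
      // Application-layer rejection, analogous to an HTTP 400 Bad Request
      throw new Error("invalid configuration key: " + key);
    }
  }
  return config;
}
```

A call like `loadConfig('{"hosst": "x"}')` throws despite the argument being perfectly valid JSON, while the parsing step itself never deviates from the standard.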
Elasticsearch no longer allows storage of key names that are in compliance with the ECMA 404 definition of key names. "It parses the full standard in order to reject a subset of it" is a poor argument for Elasticsearch's support of JSON.

This is an apples-to-oranges comparison. Elasticsearch is not rejecting conforming JSON text that describes a configuration or API call that Elasticsearch does not accept; it is rejecting the storage of key names that are conforming JSON text within JSON objects that should be perfectly valid to store. In this case, the structure of the JSON object is perfectly valid, and what you are in fact rejecting is the JSON data exchange format as described by the standard.

From what I'm hearing, it sounds like this decision is permanent. Since I have no control over it, I'll work around it as best I can.
Neither the word "key" nor the term "key name" appears in ECMA 404. The word "key" does appear in RFC 7159, but only in the context of the RFC requirement levels; the term "key name" never appears in RFC 7159. Both documents do refer to "name/value" pairs. However, there is no requirement in either specification that an application must accept all names as valid for that application. The concerns you're discussing are concerns at the application layer, on which the JSON standard places no restrictions.
The only valid argument that Elasticsearch does not support JSON would be an example of conforming JSON text that is not parsed correctly by Elasticsearch, or an example of a response from Elasticsearch with media type application/json that is not conforming JSON text. Elasticsearch, like any other application that consumes JSON, places application logic on top of JSON. This is valid, and in conformance with the intended uses of JSON.
It is not, because the analogy is for a concern at the application layer, just as Elasticsearch rejecting fields with dots in their name is a concern at the application layer. The JSON standard places no requirements on the application layer.
Again, this is a concern at the application layer. The JSON standard does not concern itself with concepts at the application layer such as "storage". It merely specifies what conforming JSON text is, and how it is to be parsed. It does not place requirements on the application layer and rejecting fields with dots in their name is a concern at the application layer.
Such JSON is valid, and Elasticsearch correctly parses that JSON and then rejects it at the application layer. This does not violate ECMA 404 nor RFC 7159.
I'm hesitant to use a word like "permanent" but it is highly unlikely that this will change. Relates #12068.
Hello, I've been following along, and I'd like to point out that both node and jq handle keys with periods in them.
Not allowing dots in field names is a logical policy decision enforced in the application layer of Elasticsearch. Elasticsearch correctly parses the field names with dots in them, and then makes a logical policy decision to reject those at the application layer. This is not a violation of the JSON standard, and is consistent with the intended uses of JSON as a data interchange format. The JSON standard places no restrictions on the application layer of an application that consumes conforming JSON text.
@jasontedor It seems your argument for not supporting these features is that you proactively choose not to "at the application layer". Support for JSON is explicitly advertised on the Elasticsearch product page under the "Schema-Free" heading. If one cannot actually index a valid JSON file (with periods in key names) into an Elasticsearch database, surely you should not be claiming that you can?
The claim that Elasticsearch supports JSON does not translate into Elasticsearch having to accept, without any restrictions whatsoever, every conforming JSON text that is handed to it. There are rules to using the system; they must be understood and followed, and having them doesn't violate any claim that Elasticsearch supports JSON. For example, if you specify a field as having a numeric type in the mapping, a document in which that field holds an incompatible value will be rejected, even though the document is conforming JSON text.

The only valid argument that Elasticsearch does not fully support JSON is if there exists conforming JSON text that Elasticsearch does not parse correctly, or if there exists a JSON response body from Elasticsearch that is not valid JSON. If either of those possibilities occurs, then we have a legitimate issue and it will be addressed appropriately. But there is still no standard nor practical requirement that forbids an application that consumes conforming JSON text from making logical policy decisions, based on the contents of the JSON text, at the application layer. This is perfectly within the use cases of JSON as a data interchange format.
Respectfully, I strongly disagree with the notion that Elasticsearch can break its promise to index any valid JSON document and reject content based on rather arbitrary limitations on field names that are not part of the JSON standard (as defined in https://tools.ietf.org/html/rfc7159). The standard defines what a legal field name is, and Elasticsearch no longer supports all legal field names. This is a major breaking change that is deeply affecting us at multiple points in our architecture. Essentially, JSON compatibility was sacrificed in favor of syntactic sugar in the query language, to be able to refer to nested objects in an unambiguous way. That is valid of course, but I'm now confronted with multiple external sources of perfectly valid JSON that used to index just fine and that I can no longer index as is, as well as gigabytes of indices that I have to worry about migrating and testing.

Migration to ES2 is a nightmare so far because of this. I'm months into planning the migration and still have a gazillion open issues, all revolving around finding and working around stupid dots in field names. This will likely continue to block us for some time, and it is not as if I haven't got more important things to worry about than field interpunction. Even after I actually migrate this thing successfully, I fully expect frequent regressions of dotted JSON slipping through and causing errors.

So, I respect that this decision was taken, and I respect the fact that it wasn't taken lightly. But I do hope that Elasticsearch finds a way back to being a general-purpose JSON document store, which it currently isn't anymore. IMHO, more can be and should have been done to make this less painful. One fix that comes to mind is to simply disable dynamic index creation for fields with dots, which is probably what people would prefer rather than the entire document being rejected with some error about dots.
We are talking about unmapped fields here that are being dynamically mapped. If you then want the field indexed anyway, all you need to do is rename it or copy it to some field with the dots replaced with underscores (this could even be a mapping feature: auto_convert_dots:true). A couple of new mapping features to enable/disable this behavior would probably fix things for most users and unbreak JSON compatibility. Any indexed field would be guaranteed to be dot-free this way, and instead of fixing the data or the intake pipeline, all you need to fix is your mappings.
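A client-side sketch of the rewrite being suggested (note that the auto_convert_dots mapping option above is a proposal, not an actual Elasticsearch feature): recursively replace dots in object keys with underscores before sending the document for indexing.

```javascript
// Recursively rewrite keys so the document is safe for Elasticsearch 2.x.
// This is a client-side workaround sketch, not an Elasticsearch feature.
function replaceDotsInKeys(value) {
  if (Array.isArray(value)) {
    return value.map(replaceDotsInKeys);
  }
  if (value !== null && typeof value === "object") {
    const out = {};
    for (const key of Object.keys(value)) {
      out[key.replace(/\./g, "_")] = replaceDotsInKeys(value[key]);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}

// replaceDotsInKeys({"foo": {"bar.baz": 5}}) yields {"foo": {"bar_baz": 5}}
```

The trade-off is exactly the one raised in this thread: the indexed field names no longer match the original data, so queries and downstream consumers have to know about the rewrite.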
@jillesvangurp thank you - I was certain I couldn't be the only person this breaking change caused issues for. For what it's worth, I've written a small daemon, which I'm now using, that rewrites the "."s found in field names in my syslog traffic. It was aggravating to burn engineering hours on turning JSON into "Elasticsearch JSON", but the service is working for us. The library includes a Go RFC 3164-compliant syslog parser and a mutator that can scan keys and change them; in case it might be useful for your current pain, it's available at https://github.com/digitalocean/captainslog
There seems to be a fundamental difference in what the commenters here think the term "support" means. One camp sees it as end-to-end support; the other sees it as an input format.

I would contend that "JSON support" means end-to-end support. That is, saying Elasticsearch supports JSON implies to me that I can provide Elasticsearch with a valid, arbitrarily structured JSON document and Elasticsearch will index its contents. As long as the JSON supplied is (ECMA 404) valid, Elasticsearch's application layer should handle it. Even more so since disallowing periods in the name of a collection's name/value pair is new to ES 2.x. Saying that Elasticsearch supports JSON is, in my mind, tantamount to saying that JSON objects should flow through the system without limitation. If that's not the intent of the Elasticsearch dev team and/or Elasticsearch BV, then that should be clearly indicated in the project's documentation. I'd also suggest that the team avoid describing Elasticsearch as supporting JSON, because of the obvious confusion associated with that phrase. Rather, the documentation should clearly state (where appropriate) that JSON is only used as a data transfer format, or that Elasticsearch supports a subset of JSON.

As a small addendum, I've been looking for Elasticsearch documentation on the character/format restrictions for field names. All I managed to find were this issue and a couple of other issues on GitHub. I did find that ES 2.x relies on Lucene 5.x, and as far as I can tell Lucene 5.x only requires that field names are strings. I also found some docs for Solr that clearly specify the format of a valid field name.
@SVincent The appeal to a formal standard was implied in the following from the first comment in this issue (emphasis on "by definition" added here):
and made explicit in the third comment in this issue (the second by the OP):
Saying that an implementation does not support a formal standard has a well-established understanding: there are requirements in the formal standard placed on all implementations that the implementation does not meet. An appeal to an actual formal standard was made in the fifth comment in this issue (the third by the OP):
As the comments in this issue continued along these lines with additional appeals to ECMA 404 and RFC 7159, the meaning of "fully supports JSON" was solidified. I think it is fair to claim that we have been talking about the same thing: whether or not Elasticsearch is in compliance with the JSON standard.
In this issue, it does not.
This is a requirement that Elasticsearch has never provided.
It's in the breaking changes for 2.0.
The requirement is not from Lucene; it's a requirement from the logic that Elasticsearch builds on top of Lucene.
@jillesvangurp It is not arbitrary and it was, as you note, not taken lightly. The ultimate reason was to avoid ambiguity, a dangerous problem. This is covered thoroughly in the breaking changes for 2.0, #5972, #7112, #11337, #12068, and #14359.
Please note these key clauses from RFC 7159:
and
I maintain that even without these clauses, Elasticsearch can make a logical policy decision at the application layer to reject certain conforming JSON texts, but these clauses leave no doubt.
During the Project Lumberjack discussions, we were talking about ways to 'flatten' references to multi-tier JSON structure elements, and during that discussion we picked ! as the level delimiter: because it's a reserved character in so many languages, people are very unlikely to use it in a name. As this discussion demonstrates, using a period as the delimiter is problematic, as it's a common character in variable names. Perhaps the easiest path forward is to tweak Elasticsearch so that the delimiter character is configurable. For ES 2.0, leave the default as '.' (as currently defined and documented), and consider migrating to '!' going forward. Rsyslog uses ! as the delimiter, and it is the default on Linux distros right now. So in spite of people discussing the issue and the problems of using dot as the separator and agreeing on a 'standard', it seems that the different logging systems have gone in different directions (the nice thing about standards is that there are so many to choose from )-:
Throwing exceptions on JSON field names that contain characters (such as ".") that are valid in JSON means that, by definition, Elasticsearch no longer fully supports JSON. I am finding this especially problematic. A large number of programming languages use the period to denote calling a method/function on an object/struct. In cases where logging is used to pinpoint issues in code execution, this results in confusion when having to substitute these characters so that Elasticsearch will accept them.
Consider the usefulness of a case such as:
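One hypothetical case of the kind being described, where a field name carries the code path that produced the measurement (all names and values here are invented for illustration):

```javascript
// A hypothetical log event keyed by the method that produced it.
const logEvent = {
  "timestamp": "2015-12-11T10:00:00Z",
  "OrderService.submitOrder.latencyMs": 42
};

// Elasticsearch 2.x rejects the dotted key, so it has to be rewritten
// before indexing, and the field name drifts from the code it names:
const indexable = {};
for (const key of Object.keys(logEvent)) {
  indexable[key.replace(/\./g, "_")] = logEvent[key];
}
// indexable["OrderService_submitOrder_latencyMs"] === 42
```

The rewritten name no longer matches the method reference a developer would search for in the codebase, which is the loss of fidelity being described.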
The technical solution for an end user is of course simple: replace the characters that Elasticsearch no longer supports with another character. However, the side effect is that the logs I have to change in order to accommodate Elasticsearch 2.x no longer supporting valid characters become more distanced from what I am trying to communicate with them.
Am I missing something about this change? I'm hoping that I am - but in my tests it does seem as simple as "Elasticsearch no longer supports periods in field names, period" - pardon the pun! ;)