Elasticsearch 2.x does not appear to fully support JSON #15404

Closed

taotetek opened this issue Dec 12, 2015 · 15 comments

@taotetek

Throwing exceptions on JSON field names that contain characters (such as ".") that are valid in JSON means that by definition Elasticsearch no longer fully supports JSON. I am finding this especially problematic: a large number of programming languages use the period to denote calling a method or function on an object or struct. In cases where logging is used to pinpoint issues in code execution, having to substitute these characters so that Elasticsearch will accept them creates confusion.

Consider the usefulness of a case such as:

{ "type":"net.Conn", "function":"Dial", "error":"could not connect"}

The technical solution for an end user is of course simple: replace the characters that Elasticsearch no longer supports with another character. However, the side effect is that the logs I have to change to accommodate Elasticsearch 2.x become more distanced from what I am trying to communicate with them.

Am I missing something about this change? I'm hoping that I am - but in my tests it does seem as simple as "Elasticsearch no longer supports periods in field names, period" - pardon the pun! ;)

@dakrone
Member

dakrone commented Dec 12, 2015

Am I missing something about this change?

The change was not made arbitrarily or lightly; the problem is supporting
various JSON nesting methods without ambiguity.

For example, given two simple JSON documents:

Document A:

{
  "foo": {
    "bar.baz": 5
  }
}

Document B:

{
  "foo": {
    "bar": {
      "baz": 7
    }
  }
}

If these were in the same index, what would foo.bar.baz refer to? 5 or 7?
While it would be useful to be able to have fields with periods in them,
allowing them leads to even more confusion than not allowing them.
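
To make the collision concrete, here is a minimal sketch (in Python, not Elasticsearch's actual code) of what happens when both structures are flattened into dotted field paths, as a query engine has to do:

import json

def flatten(obj, prefix=""):
    """Flatten a nested JSON object into dotted field paths."""
    paths = {}
    for name, value in obj.items():
        path = prefix + "." + name if prefix else name
        if isinstance(value, dict):
            paths.update(flatten(value, path))
        else:
            paths[path] = value
    return paths

print(flatten(json.loads('{"foo": {"bar.baz": 5}}')))       # {'foo.bar.baz': 5}
print(flatten(json.loads('{"foo": {"bar": {"baz": 7}}}')))  # {'foo.bar.baz': 7}
# Two different structures collapse to the same dotted path, so a query
# for foo.bar.baz can no longer tell them apart.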

It's not just Elasticsearch that "doesn't like" ambiguity like this; take,
for example, the jq tool:

» echo '{"foo": {"bar.baz": 5}}' | jq .foo
{
  "bar.baz": 5
}
» echo '{"foo": {"bar.baz": 5}}' | jq .foo.bar
null
» echo '{"foo": {"bar.baz": 5}}' | jq .foo.bar.baz
null

Or accessing JSON members as object properties in JavaScript:

» node
> s = JSON.parse('{"foo": {"bar.baz": 5}}')
{ foo: { 'bar.baz': 5 } }
> s.foo
{ 'bar.baz': 5 }
> s.foo.bar.baz
TypeError: Cannot read property 'baz' of undefined
    at repl:1:11
    at REPLServer.self.eval (repl.js:110:21)
    at Interface.<anonymous> (repl.js:239:12)
    at Interface.emit (events.js:95:17)
    at Interface._onLine (readline.js:203:10)
    at Interface._line (readline.js:532:8)
    at Interface._ttyWrite (readline.js:761:14)
    at ReadStream.onkeypress (readline.js:100:10)
    at ReadStream.emit (events.js:98:17)
    at emitKey (readline.js:1096:12)
> s.foo."bar.baz"
...
... (node expects more input instead of resolving "bar.baz" as a key)

Would you say that neither of those tools supports JSON fully?

The complexity of having to remove the "." from field names is preferable to
ambiguous field resolution.

@taotetek
Author

@dakrone - thank you for the response!

You compare this limitation to both JavaScript and jq - but your assertion is incorrect. Both JavaScript and jq fully support JSON with periods in field names, and provide syntax for working with them, as follows:

echo "{\"first.name\":\"brian\"}" | jq .'["first.name"]'
"brian"

You can use this associative array syntax for accessing fields with periods in them in JavaScript as well.

If there now exists a subset of possible JSON documents that are in compliance with the JSON standard and will validate properly with tools that implement the standard, but cannot be inserted into Elasticsearch, then Elasticsearch no longer supports the JSON standard.

While I understand that the decision was made in order to avoid complexity in the query parser and to avoid some queries possibly resolving in unexpected ways, I find the decision regrettable.

I'll add to this that in general, I'm very happy with Elasticsearch - I don't want my opinion on this matter to come across as a general opinion about ES or the engineers who contribute to it.

Cheers,
Brian

@jasontedor
Member

If there now exists a subset of possible JSON documents that are in compliance with the JSON standard and will validate properly with tools that implement the standard, but cannot be inserted into Elasticsearch, then Elasticsearch no longer supports the JSON standard.

This appears to be the crux of your argument. Namely, you're asserting that since Elasticsearch (intentionally) returns an error on some conforming JSON texts, Elasticsearch does not support the JSON standard.

JSON is a data interchange format, and the JSON standard specifies the JSON grammar; the JSON standard does not put a requirement on applications beyond specifying what a conforming JSON text is and requirements for parsing conforming JSON text.

Most importantly, there is no standard nor practical requirement that applications that accept JSON must handle without error every conforming JSON text.

Consider an application that uses the JSON format for application configuration; some configurations can be conforming JSON text but will be invalid for the application; the application can reject those but still be in conformance with how it parses the configuration file.
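
As an illustration (a hypothetical Python configuration loader, not anything from Elasticsearch), parsing can conform to the JSON standard exactly while the application still rejects the result:

import json

ALLOWED_KEYS = {"host", "port", "timeout"}  # hypothetical configuration schema

def load_config(text):
    config = json.loads(text)  # parsing conforms to the JSON standard
    unknown = set(config) - ALLOWED_KEYS
    if unknown:
        # Conforming JSON text, rejected at the application layer.
        raise ValueError("unknown configuration keys: %s" % sorted(unknown))
    return config

load_config('{"host": "localhost", "port": 9200}')     # accepted
load_config('{"host": "localhost", "colour": "red"}')  # raises ValueError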

Similarly, a web server that accepts a request with media type application/json can reject requests from clients that contain conforming JSON text in the request body, but are not valid requests for that web server. By way of a specific example, Twitter uses JSON to represent tweets and the associated metadata when it is communicating with a client; this does not mean that it must accept as a valid tweet every JSON document sent to one of its endpoints by a Twitter client.

And that is what Elasticsearch does. You can send it conforming JSON text via HTTP, but if it does not meet the requirements that Elasticsearch puts on JSON documents, Elasticsearch will give you an HTTP Bad Request. Elasticsearch will have parsed this document according to the JSON standard, and then after that it tells the client this is a bad request for Elasticsearch.

If, however, you were to find a conforming JSON document that Elasticsearch does not parse and represent correctly internally according to the JSON standard, or a JSON response from Elasticsearch that is not conforming JSON text, then there would be a legitimate issue and it would be addressed appropriately.

@taotetek
Author

Elasticsearch no longer allows storage of key names that are in compliance with the ECMA 404 definition of key names. "It parses the full standard in order to reject a subset of it" is a poor argument for Elasticsearch's support of JSON.

Consider an application that uses the JSON format for application configuration; some configurations can be conforming JSON text but will be invalid for the application

This is an apples to oranges comparison. Elasticsearch is not rejecting conforming JSON text that describes a configuration or API call that elasticsearch does not accept. Elasticsearch is rejecting the storage of JSON keys that are conforming JSON text within JSON objects that should be perfectly valid to store. In this case, the structure of the JSON object is perfectly valid, and what you are in fact rejecting is the JSON data exchange format as described by the standard.

From what I'm hearing, it sounds like this decision is permanent. Since I have no control over it, I'll work around it as best as I can.

@jasontedor
Member

Elasticsearch no longer allows storage of key names that are in compliance with the ECMA 404 definition of key names.

The word "key" nor the term "key name" never appear in ECMA 404. The word "key" does appear in RFC 7159, but only in the context of referring to the RFC requirement levels; the term "key name" never appears in RFC 7159. Both documents do refer to "name/value" pairs. However, there is no requirement by either specification that an application must accept all names as being valid for that application.

The concerns that you're discussing are concerns at the application layer on which the JSON standard places no restrictions.

"It parses the full standard in order to reject a subset of it" is a poor argument for Elasticsearch's support of JSON.

The only valid argument that Elasticsearch does not support JSON is to provide an example of conforming JSON text that is not parsed correctly by Elasticsearch, or to provide an example of a response with media type application/json that is not conforming JSON text.

Elasticsearch, like any other application that consumes JSON, places application logic on top of JSON. This is valid, and in conformance with the intended uses of JSON.

Consider an application that uses the JSON format for application configuration; some configurations can be conforming JSON text but will be invalid for the application

This is an apples to oranges comparison.

It is not, because the analogy is for a concern at the application layer, just as Elasticsearch rejecting fields with dots in their name is a concern at the application layer. The JSON standard places no requirements on the application layer.

Elasticsearch is rejecting the storage of JSON keys that are conforming JSON text within JSON objects that should be perfectly valid to store.

Again, this is a concern at the application layer. The JSON standard does not concern itself with concepts at the application layer such as "storage". It merely specifies what conforming JSON text is, and how it is to be parsed. It does not place requirements on the application layer and rejecting fields with dots in their name is a concern at the application layer.

In this case, the structure of the JSON object is perfectly valid, and what you are in fact rejecting is the JSON data exchange format as described by the standard.

Such JSON is valid, and Elasticsearch correctly parses that JSON and then rejects it at the application layer. This does not violate ECMA 404 nor RFC 7159.

From what I'm hearing, it sounds like this decision is permanent.

I'm hesitant to use a word like "permanent" but it is highly unlikely that this will change. Relates #12068.

@bryanl

bryanl commented Dec 14, 2015

Hello, I've been following along, and I'd like to point out two things:

❯❯ node
> s = JSON.parse('{"foo": {"bar.baz": 5}}')
{ foo: { 'bar.baz': 5 } }
> s.foo
{ 'bar.baz': 5 }
> s.foo["bar.baz"]
5

and

echo '{"foo": {"bar.baz": 5}}' | jq '.foo["bar.baz"]'
5

Both node and jq handle keys with periods in them.

@jasontedor
Member

Both node and jq handle keys with periods in them.

JSON.parse from ECMAScript and jq are general purpose in their handling of JSON.

Not allowing dots in field names is a logical policy decision enforced in the application layer of Elasticsearch. Elasticsearch correctly parses the field names with dots in them, and then makes a logical policy decision to reject those at the application layer. This is not a violation of the JSON standard, and is consistent with the intended uses of JSON as a data interchange format. The JSON standard places no restrictions on the application layer of an application that consumes conforming JSON text.

@doot0

doot0 commented Dec 16, 2015

@jasontedor It seems your argument for not supporting these features is that you proactively choose not to "at the application layer". Your support for JSON is clearly implied on the Elasticsearch product page under the "Schema-Free" heading.

If one cannot actually index a valid JSON file (with periods in key names) into an elasticsearch DB, surely you should not be claiming that you can?

@jasontedor
Member

If one cannot actually index a valid JSON file (with periods in key names) into an elasticsearch DB, surely you should not be claiming that you can?

The claim that Elasticsearch supports JSON does not translate into Elasticsearch having to accept, without any restrictions whatsoever, every conforming JSON text that is handed to it. There are rules to using the system; they must be understood and followed, and having them doesn't violate any claim that Elasticsearch supports JSON.

For example, if you specify a field as having type long in a mapping, and then pass Elasticsearch a document for which that field cannot be parsed as a valid long, then Elasticsearch can make a logical policy decision at the application layer to reject that document. This is but one of many reasons that conforming JSON text can be rejected at the application layer.
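
The same pattern in miniature (a hypothetical Python check, not Elasticsearch's mapping code): the document parses as conforming JSON, and only afterwards is it rejected because a field does not fit the declared type:

import json

MAPPING = {"bytes_sent": int}  # hypothetical mapping: field name -> expected type

def index_document(text):
    doc = json.loads(text)  # conforming JSON text parses fine
    for field, expected in MAPPING.items():
        if field in doc and not isinstance(doc[field], expected):
            # Logical policy decision at the application layer.
            raise ValueError("field %r is not a valid %s" % (field, expected.__name__))
    return doc

index_document('{"bytes_sent": 512}')     # accepted
index_document('{"bytes_sent": "many"}')  # rejected: conforming JSON, invalid for the application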

The only valid argument that Elasticsearch does not fully support JSON is if there exists conforming JSON text that Elasticsearch does not parse correctly, or if there exists a JSON response body from Elasticsearch that is not valid JSON. If either of those possibilities occur, then we have a legitimate issue and it will be addressed appropriately.

But there is still no standard or practical requirement preventing an application that consumes conforming JSON text from making logical policy decisions based on the contents of that text at the application layer. This is perfectly within the use cases of JSON as a data interchange format.

@jillesvangurp
Contributor

Respectfully, I strongly disagree with the notion that Elasticsearch can break its promise to index any valid JSON document and reject content based on rather arbitrary limitations on field names that are not part of the JSON standard (as defined in https://tools.ietf.org/html/rfc7159). The standard defines what is a legal field name, and Elasticsearch no longer supports all legal field names.

This is a major breaking change that is deeply affecting us in multiple points in our architecture. Essentially, JSON compatibility was sacrificed in favor of syntactic sugar in the query language, to be able to refer to nested objects in an unambiguous way. That is valid of course, but I'm now confronted with multiple external sources of perfectly valid JSON that used to index just fine and that I can no longer index as-is in Elasticsearch, as well as gigabytes of indices that I have to worry about migrating and testing. Migration to ES2 is a nightmare so far because of this. I'm months into planning the migration and still have a gazillion open issues, all revolving around finding and working around stupid dots in field names. This will likely continue to block us for some time, and it is not like I haven't got more important stuff to worry about than field interpunction. Also, even after I actually migrate this thing successfully, I fully expect frequent regressions of dotted JSON slipping through and causing errors in the future as well.

So, I respect that this decision was taken. Also, I respect the fact that it wasn't taken lightly. But I do hope that Elasticsearch finds a way back to being a general-purpose JSON document store, which it currently isn't. IMHO more could and should have been done to make this less painful.

One fix that comes to mind is to simply disable dynamic mapping for fields with dots, which is probably what people would prefer over the entire document being rejected with some error about dots. We are talking about unmapped fields here that are being dynamically mapped. If you then want the field indexed anyway, all you need to do is rename it or copy it to a field with the dots replaced by underscores (this could even be a mapping feature: auto_convert_dots: true). A couple of new mapping features to enable/disable this behavior would probably fix things for most users and unbreak JSON compatibility. Any indexed field would be guaranteed to be dot-free this way, and instead of fixing the data or the intake pipeline, all you would need to fix is your mappings.
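
A minimal client-side sketch of that renaming workaround (in Python; auto_convert_dots above is only a proposed mapping feature, so here the renaming happens before the document ever reaches Elasticsearch):

def replace_dots(obj, replacement="_"):
    """Recursively replace dots in field names before indexing."""
    if isinstance(obj, dict):
        return {key.replace(".", replacement): replace_dots(value, replacement)
                for key, value in obj.items()}
    if isinstance(obj, list):
        return [replace_dots(item, replacement) for item in obj]
    return obj

print(replace_dots({"type": "net.Conn", "function": "Dial", "error": "could not connect"}))
# {'type': 'net_Conn', 'function': 'Dial', 'error': 'could not connect'}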

@taotetek
Author

@jillesvangurp thank you - I was certain I couldn't be the only person this breaking change caused issues for. For what it's worth, I've written a small daemon I'm now using that replaces the "."s found in field names in my syslog traffic. It was aggravating to burn engineering hours on turning JSON into "Elasticsearch JSON", but the service is working for us. The library includes a golang RFC 3164-compliant syslog parser and a mutator that can scan keys and change them - in case it might be useful for your current pain, it's available at https://github.com/digitalocean/captainslog

@dotproto

There seems to be a fundamental difference in what the commenters here think the term "support" means. One camp sees it as end-to-end support; the other sees it as an input format.

I would contend that “JSON support” means end-to-end support. That is, saying Elasticsearch supports JSON implies to me that I can provide Elasticsearch with a valid, arbitrarily structured JSON document and Elasticsearch will index its contents. As long as the JSON supplied is (ECMA 404) valid, Elasticsearch’s application layer should handle it. Even more so since disallowing periods in the name of a collection's name/value pair is new to ES 2.x.

Saying that Elasticsearch supports JSON is, in my mind, tantamount to saying that JSON objects should flow through the system without limitation. If that’s not the intent of the Elasticsearch dev team and/or Elasticsearch BV, then that should be clearly indicated in the project’s documentation. I’d also suggest that the team avoid describing Elasticsearch as supporting JSON, because of the obvious confusion associated with that phrase. Rather, the documentation should clearly state (where appropriate) that JSON is only used as a data transfer format, or that Elasticsearch supports a subset of JSON.


As a small addendum, I’ve been looking for Elasticsearch documentation on the character/format restrictions for field names. All I managed to find were this issue and a couple of other issues on GitHub.

I did find that ES 2.x is built on Lucene 5.x, and as far as I can tell Lucene 5.x only requires that field names are strings. I also found some docs for Solr that clearly specify the format of a valid field name.

@jasontedor
Member

One camp sees it as end-to-end support; the other sees it as an input format.

@SVincent The appeal to a formal standard was implied in the following from the first comment in this issue (emphasis on "by definition" added here):

Throwing exceptions on JSON field names that contain characters (such as ".") that are valid in JSON means that by definition Elasticsearch no longer fully supports JSON.

and made explicit in the third comment in this issue (the second by the OP):

Elasticsearch no longer supports the JSON standard.

Saying that an implementation does not support a formal standard has a well-established meaning: there are requirements in the formal standard, placed on all implementations, that the implementation does not meet.

An appeal to an actual formal standard was made in the fifth comment in this issue (the third by the OP):

Elasticsearch no longer allows storage of key names that are in compliance with the ECMA 404 definition of key names. "It parses the full standard in order to reject a subset of it" is a poor argument for Elasticsearch's support of JSON.

As the comments in this issue continued along these lines with additional appeals to ECMA 404 and RFC 7159, the meaning of "fully supports JSON" was solidified. I think it is fair to claim that we have been talking about the same thing: whether or not Elasticsearch is in compliance with the JSON standard.

I would contend that “JSON support” means end-to-end support.

In this issue, it does not.

Saying that Elasticsearch supports JSON is, in my mind, tantamount to saying that JSON objects should flow through the system without limitation.

This is a requirement that Elasticsearch has never provided.

As a small addendum, I’ve been looking for Elasticsearch documentation on the character/format restrictions for field names.

It's in the breaking changes for 2.0.

I did find that ES 2.x is built on Lucene 5.x, and as far as I can tell Lucene 5.x only requires that field names are strings.

The requirement is not from Lucene, it's a requirement from the logic that Elasticsearch builds on top of Lucene.

@jasontedor
Member

Respectfully, I strongly disagree with the notion that Elasticsearch can break its promise to index any valid json documents and reject content based on rather arbitrary limitations on field names

@jillesvangurp It is not arbitrary, and it was, as you note, not taken lightly. The ultimate reason was to avoid ambiguity, a dangerous problem. This is covered thoroughly in the breaking changes for 2.0, #5972, #7112, #11337, #12068, and #14359.

that are not part of the JSON standard (as defined in https://tools.ietf.org/html/rfc7159). The standard defines what is a legal field name, and Elasticsearch no longer supports all legal field names.

Please note these key clauses from RFC 7159:

An object is an unordered collection of zero or more name/value
pairs, where a name is a string and a value is a string, number,
boolean, null, object, or array.

and

An implementation may set limits on the length and
character contents of strings.

I maintain that even without these clauses, Elasticsearch can make a logical policy decision at the application layer to reject certain conforming JSON texts, but these clauses leave no doubt.

@davidelang

During the Project Lumberjack discussions, we were talking about ways to 'flatten' references to multi-tier JSON structure elements, and during that discussion we picked ! as the level delimiter: because it's a reserved character in so many languages, people are very unlikely to use it in a name. As this discussion demonstrates, using a period as the delimiter is problematic because it's a common character in variable names.

Perhaps the easiest path forward is to tweak Elasticsearch so that the delimiter character is configurable. For ES 2.0, leave the default as '.' (as currently defined and documented), and consider migrating to '!' going forward.
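
A sketch of the idea (in Python; the configurable delimiter is the suggestion above, not an existing Elasticsearch option). With '!' as the level delimiter, dotted field names no longer collide with nesting:

import json

def flatten(obj, delimiter="!", prefix=""):
    """Flatten nested objects using a configurable level delimiter."""
    paths = {}
    for name, value in obj.items():
        path = prefix + delimiter + name if prefix else name
        if isinstance(value, dict):
            paths.update(flatten(value, delimiter, path))
        else:
            paths[path] = value
    return paths

print(flatten(json.loads('{"foo": {"bar.baz": 5}}')))      # {'foo!bar.baz': 5}
print(flatten(json.loads('{"foo": {"bar": {"baz": 7}}}')))  # {'foo!bar!baz': 7}
# The two structures now flatten to distinct paths.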

Rsyslog uses ! as the delimiter, and it is the default on Linux distros right now.
syslog-ng uses . as the delimiter, so it would have the same problem that ES currently has.
logstash accesses multi-tier data as [level1][level2]
nxlog doesn't implement any support for multi-level variables
sumologic uses . as the delimiter, so it would have the same problem

So in spite of people discussing the issue and the problems of using dot as the separator, and agreeing on a 'standard', it seems that the different logging systems have gone in different directions (the nice thing about standards is that there are so many to choose from )-:
