Add a streaming json API to libserialize #12740

Merged: 4 commits merged into rust-lang:master on Apr 30, 2014
Conversation

@nical (Contributor) commented Mar 6, 2014

Hi rust enthusiasts,

With this patch I propose to add a "streaming" API to the existing json parser in libserialize.

By "streaming" I mean a parser that let you act on JsonEvents that are generated as while parsing happens, as opposed to parsing the entire source, generating a big data structure and working with this data structure. I think both approaches have their pros and cons so this pull request adds the streaming API, preserving the existing one.

The streaming API is simple: It consist into an Iterator that consumes an Iterator. JsonEvent is an enum with values such as NumberValue(f64), BeginList, EndList, BeginObject, etc.
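
As a rough sketch, the event enum would look something like this (the variant list here is abbreviated, and the trailing comment names variants that are assumed):

```rust
enum JsonEvent {
    BeginObject,
    EndObject,
    BeginList,
    EndList,
    NumberValue(f64),
    // ... plus string, boolean and null values, and error events
}
```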

The user would ideally use the API as follows:

```rust
for evt in StreamingParser::new(src) {
  match evt {
    BeginList => {
       // ...
    }
    // ...
  }
}
```

The iterator provides a stack() method returning a slice of StackNodes which represent "where we currently are" in the logical structure of the json stream (for instance at "foo.bar[3].x" you get [ Key("foo"), Key("bar"), Index(3), Key("x") ].)

I wrote "ideally" above because the current way rust expands for loops, you can't call the stack() method because the iterator is already borrowed. So for know you need to manually advance the iterator in the loop. I hope this is something we can cope with, until for loops are better integrated with the compiler.

Streaming parsers are useful when you want to read from a json stream, generate a custom data structure, and you know how the json is going to be structured. For example, imagine you have to parse a 3D mesh file represented in the json format. In this case you probably expect large arrays of vertices, and using the generic parser will be very inefficient because it will create a big list of all these vertices, which you then copy into a contiguous array afterwards (so you end up doing a lot of small allocations, parsing the json once, and then walking the resulting data structure). With a streaming parser, you can append the vertices to a contiguous array as they come in, without paying the cost of creating the intermediate Json data structure. You get far fewer allocations, since you write directly into the final data structure, and you can be smart about how you pre-allocate it.
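
A sketch of the mesh case under these assumptions (a flat "vertices" array of numbers; the stack check is elided):

```rust
let mut parser = StreamingParser::new(src);
let mut vertices: Vec<f64> = Vec::new();
loop {
    match parser.next() {
        Some(NumberValue(n)) => {
            // assume we verified via parser.stack() that we are
            // currently inside the "vertices" array
            vertices.push(n);
        }
        Some(_) => { /* handle the other events */ }
        None => break,
    }
}
```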

I added this directly into serialize::json rather than in its own library because it turns out I can reuse most of the existing code, whereas maintaining a separate library (which I did originally) forces me to duplicate this code.

I wrote this trying to minimize the size of the patch, so there may be places where the code could be nicer at the expense of more changes (let me know what you prefer).

This is my first (potential) contribution to rust, so please let me know if I am doing something wrong (maybe I should have first introduced this proposal on the mailing list, or opened a github issue, etc.?). I work a few meters away from @pnkfelix so I am not too hard to find :)

```rust
'0' .. '9' | '-' => match self.parse_number() {
    Ok(f) => Ok(Number(f)),
    Err(e) => Err(e),
},
```
Review comment (Contributor):

This would be more concise as `self.parse_number().map(|f| Number(f))`.

@lilyball (Contributor) commented Mar 6, 2014

I only reviewed for code quality, not for correctness. I also didn't check if the test suite was comprehensive, although based on the number of tests, it seems decent but could always stand to have more. Specifically, I'm concerned that all the possible error cases aren't fully tested (because there's lots of ways to malform a json stream).

Overall, I think it looks decent, but I have one major qualm. I don't like how expected works. It's a bitfield that's trying to take the place of a state machine. I would much prefer to see the parser operate using a true state machine, where it has a state field that is an enum (rather than a bitfield). There aren't that many states a JSON parser can enter while parsing, and each state has an unambiguous set of expected tokens.

Besides being a cleaner and more understandable approach, this also solves a big problem I have with expected, which is that it's a bitfield but in a number of places you treat it as a straight value. And by that I mean if you determine that it contains, e.g. ExpectValue, you try and parse a value and if that fails, you error out. But it's a bitfield, it could conceivably have contained other possible accepted tokens that you didn't try to parse. From reading the code, whenever expected contains ExpectValue, the only other flag it may contain is ExpectEnd and you test for eof() before testing for the value. But that's an assumption your code is making that's not actually enforced anywhere.

On the other hand, the state-based approach leaves out all of these assumptions. ExpectValue would be split into 3 states: StateBegin (where a value is required), StateObjectValue (where a value is expected inside of an object), and StateArrayValue (where a value is expected inside of an array). I would also suggest using two separate states while parsing objects, StateObjectKey and StateObjectValue, rather than trying to parse the key and value in one pass. That should make the overall state machine a bit simpler.

Once you've done this, you can then just wrap your whole state machine in a loop {}, because not every state will emit an event (e.g. StateObjectKey will move to StateObjectValue without emitting anything). You can also then say self.p.parse_whitespace() at the top of the loop instead of doing it in each state, because whitespace can be ignored everywhere. The only other place you'd need to parse whitespace is after consuming a comma (because that's the only place where you need to parse whitespace before transitioning to a new state). With this done, all you have to do is ensure every single state either consumes characters, or transitions to a new state (and with the latter, make sure there's no state cycle that consumes nothing, but in your case the only state transition that doesn't involve consuming characters would be moving to StateEOF).

This also cleans up some of your state flags. For example, StateObjectKey could have a single boolean value, making it StateObjectKey(bool), and that boolean indicates whether any values have been parsed in this object so far. That value will serve double-duty, both indicating whether a comma is valid to parse at this point, and whether the stack needs to be popped if '}' is encountered.
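
A minimal sketch of the suggested state enum (the state names come from this review; the comments describe the intended transitions and are otherwise assumed):

```rust
enum ParserState {
    StateBegin,            // start of input: a value is required
    StateObjectKey(bool),  // in an object; bool = an entry was already parsed
    StateObjectValue,      // in an object, right after a key
    StateArrayValue,       // in an array, expecting a value
    StateEOF,              // nothing more may follow
}
```

The `next()` method then wraps a `match` on this state in a single `loop {}`, calling `parse_whitespace()` once at the top of each iteration.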

@nical (Contributor, Author) commented Mar 7, 2014

Thanks a lot for the review! I have addressed the easy parts locally and started refactoring the logic to use a proper state machine. I'll hopefully have time to sort it out this weekend.


```rust
impl<T: Iterator<char>> Iterator<JsonEvent> for StreamingParser<T> {
    fn next(&mut self) -> Option<JsonEvent> {
        if self.expect == ExpectNothing {
```
Review comment (Member):

Stylistically this could just be

```rust
fn next(&mut self) -> Option<JsonEvent> {
    if self.expect == ExpectNothing {
        None
    } else {
        Some(self.parse())
    }
}
```

@huonw (Member) commented Mar 7, 2014

This is cool! Would it be possible to add some short benchmarks? (e.g. decoding [1, 2, 3, 4], {"foo": null, "bar": "baz"} etc.)
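
Something along these lines, perhaps (a sketch; Bencher comes from the built-in test harness, the other names from this PR):

```rust
#[bench]
fn bench_streaming_small_list(b: &mut Bencher) {
    b.iter(|| {
        // drive the parser over a small input, discarding the events
        for evt in StreamingParser::new("[1, 2, 3, 4]".chars()) {
            let _ = evt;
        }
    });
}
```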

@huonw (Member) commented Mar 7, 2014

Also, I wonder if taking a &str rather than an Iterator<char> would be more efficient: it would allow one to use slices directly out of the &str rather than having to allocate new ~str's for everything.

@lilyball (Contributor) commented Mar 7, 2014

@huonw JSON supports string escapes, so &str wouldn't work as parsing these escapes generates a ~str. At best you could get a str::MaybeOwned<>, but it's not worth it.

The big benefit to taking Iterator<char> is this allows for streaming directly from a io::BufferedReader without having to read the whole JSON file into memory beforehand. This enables cool things like having a streaming JSON protocol that never actually terminates the topmost object.
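
A hypothetical wiring of the two (the io API names of the era are assumed; error handling is elided):

```rust
let file = File::open(&Path::new("big.json"));
let mut reader = BufferedReader::new(file);
// chars() yields io results, so unwrap each one for this sketch
let parser = StreamingParser::new(reader.chars().map(|c| c.unwrap()));
```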

@huonw (Member) commented Mar 7, 2014

Hm, that's true about escapes, but I disagree with

> At best you could get a str::MaybeOwned<>, but it's not worth it.

I'd guess that almost all JSON strings will be unescaped (especially ones in object keys).


I was thinking we could have a resumable parser, so one could read a chunk out of a reader, pass it into the JSON decoder and get string slices out of that chunk. (With the leading string possibly continuing from the previous run.) Theoretically this behaviour would reduce to the same as an Iterator<char> when passing strings with char_len == 1 each time (probably less efficient, but I'd guess that it would be very rare to actually be forced to pass individual characters in a performance sensitive environment).
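
The MaybeOwned idea in miniature (a sketch; `unescape` is a hypothetical helper):

```rust
fn string_value<'a>(raw: &'a str) -> str::MaybeOwned<'a> {
    if raw.contains('\\') {
        str::Owned(unescape(raw)) // allocate only when an escape forces it
    } else {
        str::Slice(raw)           // borrow straight out of the input chunk
    }
}
```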

@lilyball (Contributor) commented Mar 7, 2014

@huonw I said it's not worth it because it massively complicates the ability to do streaming JSON, as then you do have to explicitly manage your buffer, and the JSON parser would also have to be extended in order to buffer incomplete tokens internally. It's certainly possible, but I doubt it's really going to be that much of a performance win (compared to whatever you're actually doing with the JSON), and it will make the API much harder to use.

@robertg commented Mar 9, 2014

Use vim-trailing-whitespace, or an equivalent to prevent your builds from failing.

@erickt (Contributor) commented Mar 22, 2014

This is great. I've only started reviewing this, but it is what I wanted back when I wrote the first parser; I wasn't able to do it then due to some long-closed issues. The first major thing I see is that I doubt we need two parsers. Can you merge the two?

@nical (Contributor, Author) commented Mar 22, 2014

@erickt sure, I am looking into using the streaming parser to build Json objects. Basically this just takes the existing parser's logic and makes it consume JsonEvents rather than chars.
This is turning into a complete rewrite of serialize::json; I hope you guys don't mind (when we take contributions in Gecko we try to avoid big rewrite patches).
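
Roughly the shape this unification takes (a sketch; build_list and build_object are assumed helper methods, and error plumbing is elided):

```rust
struct Builder<T> {
    parser: Parser<T>,
}

impl<T: Iterator<char>> Builder<T> {
    fn build(&mut self) -> Result<Json, BuilderError> {
        // consume JsonEvents instead of re-scanning characters
        match self.parser.next() {
            Some(NumberValue(n)) => Ok(Number(n)),
            Some(ListStart) => self.build_list(),
            Some(ObjectStart) => self.build_object(),
            // ... the remaining events and the error cases
            _ => fail!("sketch: handle the rest"),
        }
    }
}
```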

```rust
ListStart => { ParseList(true) }
ObjectStart => { ParseObject(true) }
_ => { ParseBeforeFinish }
};
```
Review comment (Contributor):

This should be indented with 4 spaces, not two. Also, we tend to leave off the { ... } from simple one-expression match arms like this.

```rust
if self.ch_is('}') {
    self.bump();
    return Ok(Object(values));
```

```rust
pub fn stack<'l>(&'l self) -> &'l Stack {
```
Review comment (Contributor):

Needs documentation.

@erickt (Contributor) commented Mar 24, 2014

Overall this is awesome. I would wait for #13107 to land first before landing this though.

My next plan for all this is that we can remove the superfluous conversion of a &str to a Json to a value. We can do this by merging your Builder struct with Decoder. Once that's done, we can also add support for decoding to a Json value. The way we do this is by having its Decodable implementation directly access the builder, as in something like:

```rust
impl Decodable<json::Decoder> for Json {
    fn decode(d: &mut json::Decoder) -> Result<Json, Error> {
        d.builder.build()
    }
}
```

If you're interested, feel free to do that part of this PR, or if you'd rather get this landed before it bitrots too much, you or I can do it in a future PR.

@edwardw (Contributor) commented Mar 25, 2014

cc me.

@nical (Contributor, Author) commented Mar 25, 2014

@erickt I'll be happy to help with the Decoder simplification, although I would prefer to do it as a separate pull request.
No problem with waiting for #13107. In the meantime I guess I should write an RFC, since this is a user-visible change to libserialize.

@erickt (Contributor) commented Mar 25, 2014

@nical: yeah, hold off on simplifying Decoder, there may be some issues with that approach that @nikomatsakis has brought up.

@nical (Contributor, Author) commented Mar 25, 2014

Also, I'd like to eventually add the following feature to json::Stack:

```rust
fn matches(&self, pattern: &str) -> bool
```

which would compare the current stack with a string using a lightweight regex-like syntax like "+.foo.#.$":

- '*' would mean an element of any type
- '#' would mean an index
- '$' would mean a key
- and '+' would mean "any number of..."

I haven't given much thought to the syntax and I want to experiment with different solutions. For example, it might be better to build an object from the string and then use this object to compare against the stack, as the object could have a representation that is more efficient to compare against the stack.
Maybe a macro could help build the object from a string, I don't know.

Anyway I think it would be nice to be able to compare the stack with a string like:

```rust
for evt in parser {
  if parser.matches(json_stack!("mesh.$.vertices.#")) {
    // we know we are parsing a big array of floats so the logic here becomes simple...
    // ...
  }
  // ...
}
```

This is something I'd like to investigate later, any thoughts?

@pnkfelix (Member) commented:

cc me

@erickt (Contributor) commented Mar 28, 2014

@nical: #13107 has landed so you are good to go!

Regarding your pattern matching on json structures: that sounds like a great idea, though I'm not sure it should live in Rust proper, but it'd make a great module. If you start going down this road, you should check out some of these projects, which are trying to do something similar by bringing the ideas of XQuery or XPath to json: JSPath, jsoniq, and JSONPath.

@nical (Contributor, Author) commented Mar 28, 2014

The rebase is quite awful because of changes that went into the error results. I don't want the parser's error enum to contain the decoder's variants, especially since rust will force users to match against all variants. This leaves me with something like this:

```rust
// An error that can be generated from wrong json syntax during parsing
struct SyntaxError {
    line: u32, col: u32, reason: ErrorCode
}

// The output of the parser
enum JsonEvent {
    ListStart,
    // ...
    ParseError(SyntaxError),
    IoError(io::IoError),
}

// The Builder is implemented on top of the parser.
enum BuilderError {
    ParserError2(SyntaxError), // :(
    IoError2(io::IoError),     // :(
}

// The Decoder is implemented on top of the Builder
enum DecoderError {
    ParseError3(SyntaxError),  // :(
    IoError3(io::IoError),     // :(
    ExpectedError(~str, ~str),
    MissingFieldError(~str),
    UnknownVariantError(~str),
}
```

It looks like the only two ways to avoid those enum variant name collisions are to either move the builder and decoder into their own submodules, or to get very creative with the English language.
I'll go ahead with the former approach. If anyone has a better solution, please let me know.

@erickt (Contributor) commented Mar 28, 2014

@nical: Ugh, that is pretty ugly. Can both the Parser and Builder use the same error enum? As in:

```rust
enum ParserError {
    SyntaxError(SyntaxError),
    IoError(IoError),
}

enum JsonEvent {
    ListStart,
    ...
    Error(JsonError),
}

impl Parser {
    fn parse(&mut self, ...) -> Result<JsonEvent, JsonError> { ... }
}

impl Builder {
    fn build(&mut self, ...) -> Result<Json, JsonError> { ... }
}

enum DecoderError {
    ParserError(JsonError),
    ExpectedError(~str, ~str),
    ...
}
```

@nical (Contributor, Author) commented Mar 28, 2014

@erickt You are right (with JsonError == ParserError I suppose). I have 13 hours to spend in a plane tomorrow so I guess I'll have plenty of time to fix that.

@lilyball (Contributor) commented:
If I understand this correctly, the Parser can throw SyntaxErrors and IoErrors, and the Builder is built on top of the Parser and can throw both of those errors in addition to a set of errors specific to the decoder?

@erickt's suggestion looks similar to something that will work, although JsonError is undefined and I think parse() and build() need to return different errors.

Also, if both functions are yielding a Result then JsonEvent should remove the Error variant entirely, no?

```rust
pub enum ParserError {
    SyntaxError(SyntaxError),
    IoError(IoError),
}

pub enum BuilderError {
    ParseError(ParserError),
    // decoder errors
    ExpectedError(~str, ~str),
    MissingFieldError(~str),
    UnknownVariantError(~str)
}

pub enum JsonEvent {
    ListStart,
    ... // note: no Error event, that's handled by the Result
}

impl Parser {
    pub fn parse(&mut self, ...) -> Result<JsonEvent, ParserError> { ... }
}

impl Builder {
    pub fn build(&mut self, ...) -> Result<Json, BuilderError> { ... }
}
```

Also, FWIW, if you do need to have the same variants in multiple enums, you can simulate scoped enum variants by using a mod named after the enum. There was a proposal a while back to add this to the language, but it was decided at the time that it was unnecessary; I like it anyway. Implementing this looks something like:

```rust
pub type MyEnum = MyEnum::MyEnum; // expose the enum at this scope
pub mod MyEnum {
    // unfortunately the real enum has to be public for the parent module to
    // see it, which means MyEnum::MyEnum is valid, despite being weird.
    pub enum MyEnum {
        VariantOne,
        VariantTwo
    }
    // add any impls of MyEnum here
}
```
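
Matching then goes through the module path, e.g. (sketch):

```rust
fn describe(e: MyEnum) -> &'static str {
    match e {
        MyEnum::VariantOne => "one",
        MyEnum::VariantTwo => "two",
    }
}
```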

@nical (Contributor, Author) commented Mar 30, 2014

Just as a heads up: after trying a few different ways, I ended up flattening the history of my branch and manually re-applying all of the changes that had gone into json.rs since the last rebase, in one big commit here: https://github.com/nical/rust/tree/json-rebase
I'll get all that back into the json-streaming branch one way or another, for it to show up in the pull request soon.

@nical (Contributor, Author) commented Apr 10, 2014

I think that the branch is ready to be merged or to go through a new round of reviews.

@nical (Contributor, Author) commented Apr 16, 2014

Let me know if there is something that I should change in this pull request. At the moment I am not doing anything and the PR is slowly bitrotting.

@erickt (Contributor) commented Apr 16, 2014

@nical: This is awesome, sorry I missed your update a week ago. r=me once it's rebased on top of HEAD.

@nical (Contributor, Author) commented Apr 24, 2014

I have been short on spare time lately but I finally got the branch rebased \o/

@erickt (Contributor) commented Apr 25, 2014

@nical: unfortunately it looks like the build failed with this error:

```
---- [run-pass] run-pass/issue-4016.rs stdout ----

error: compilation failed!
command: x86_64-unknown-linux-gnu/stage1/bin/rustc /home/travis/build/mozilla/rust/src/test/run-pass/issue-4016.rs -L x86_64-unknown-linux-gnu/test/run-pass --target=x86_64-unknown-linux-gnu -L x86_64-unknown-linux-gnu/test/run-pass/issue-4016.stage1-x86_64-unknown-linux-gnu.libaux -C prefer-dynamic -o x86_64-unknown-linux-gnu/test/run-pass/issue-4016.stage1-x86_64-unknown-linux-gnu --cfg rtopt --cfg debug -L x86_64-unknown-linux-gnu/rt

stdout:
------------------------------------------
------------------------------------------

stderr:
------------------------------------------
/home/travis/build/mozilla/rust/src/test/run-pass/issue-4016.rs:16:37: 16:48 error: use of undeclared type name `json::Error`
/home/travis/build/mozilla/rust/src/test/run-pass/issue-4016.rs:16 trait JD : Decodable<json::Decoder, json::Error> { }
                                                                                                        ^~~~~~~~~~~
error: aborting due to previous error
------------------------------------------

task '[run-pass] run-pass/issue-4016.rs' failed at 'explicit failure', /home/travis/build/mozilla/rust/src/compiletest/runtest.rs:969

failures:
    [run-pass] run-pass/issue-4016.rs
```

@nical (Contributor, Author) commented Apr 27, 2014

I fixed the issue and rebased today. make check runs without failure on my machine.

@nical (Contributor, Author) commented Apr 28, 2014

The Travis CI build passed \o/

bors added a commit that referenced this pull request Apr 30, 2014
@bors merged commit 02c45de into rust-lang:master Apr 30, 2014
@alexcrichton (Member) commented:

I think this has been causing some spurious failures on the bots which look quite odd to me. Do you know why this would happen?

http://buildbot.rust-lang.org/builders/auto-win-32-nopt-t/builds/4761/steps/test/logs/stdio

@nical (Contributor, Author) commented May 3, 2014

That's weird,

```
---- json::tests::test_read_list_streaming stdout ----
task 'json::tests::test_read_list_streaming' failed at 'assertion failed: (left == right) && (right == left) (left: BooleanValue(true), right: BooleanValue(true))', C:\bot\slave\auto-win-32-nopt-t\build\src\libserialize\json.rs:3064

---- json::tests::test_read_object_streaming stdout ----
task 'json::tests::test_read_object_streaming' failed at 'assertion failed: (left == right) && (right == left) (left: BooleanValue(true), right: BooleanValue(true))', C:\bot\slave\auto-win-32-nopt-t\build\src\libserialize\json.rs:3064
```

It says that equality failed, but also that left and right are both BooleanValue(true).

@alexcrichton (Member) commented:

That is the odd part indeed. This has appeared multiple times as well (not sure why)
