Add a streaming json API to libserialize #12740

Conversation
```
'0' .. '9' | '-' => match self.parse_number() {
    Ok(f) => Ok(Number(f)),
    Err(e) => Err(e),
},
```
This would be more concise as `self.parse_number().map(|f| Number(f))`.
I only reviewed for code quality, not for correctness. I also didn't check whether the test suite is comprehensive, although based on the number of tests it seems decent, but it could always stand to have more. Specifically, I'm concerned that all the possible error cases aren't fully tested (because there are lots of ways to malform a json stream).

Overall, I think it looks decent, but I have one major qualm. I don't like how […]. Besides being a cleaner and more understandable approach, this also solves a big problem I have with […]. On the other hand, the state-based approach leaves out all of these assumptions. Once you've done this, you can then just wrap your whole state machine in a […]. This also cleans up some of your state flags. For example, […]
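To make the suggestion concrete, here is a minimal sketch of the state-machine shape being argued for. All names here (State, Event, Parser, step) are illustrative, not taken from this PR:

```rust
// A state-machine-driven parser: every assumption about "where we are"
// in the document lives in `state`, instead of in recursion depth or
// ad-hoc boolean flags.
enum State {
    BeforeValue, // expecting the start of a value
    AfterValue,  // a value just ended; expecting ',' or a closing bracket
    Finished,    // the top-level value is complete
}

enum Event {
    ListStart,
    ListEnd,
    NumberValue(f64),
    // ...
}

struct Parser {
    state: State,
}

impl Parser {
    // Each call consumes some input, emits at most one event, and moves
    // to the next state. An Iterator wrapper can then just call this
    // method until it returns None.
    fn step(&mut self) -> Option<Event> {
        match self.state {
            State::Finished => None,
            State::BeforeValue => {
                // ...parse one value, transition to AfterValue...
                unimplemented!()
            }
            State::AfterValue => {
                // ...consume ',' or a closing bracket, transition...
                unimplemented!()
            }
        }
    }
}
```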
Thanks a lot for the review! I have addressed the easy parts locally and started refactoring the logic to use a proper state machine. I'll hopefully have time to sort it out this weekend.
```
impl<T: Iterator<char>> Iterator<JsonEvent> for StreamingParser<T> {
    fn next(&mut self) -> Option<JsonEvent> {
        if self.expect == ExpectNothing {
```
Stylistically this could just be

```
fn next(&mut self) -> Option<JsonEvent> {
    if self.expect == ExpectNothing {
        None
    } else {
        Some(self.parse())
    }
}
```
This is cool! Would it be possible to add some short benchmarks? (e.g. decoding […])
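A benchmark could look something like the sketch below. This is only an illustration: the `use` path and the sample document are assumptions, and `StreamingParser` is the iterator added by this PR:

```rust
#![feature(test)] // #[bench] requires the nightly `test` crate
extern crate test;

use serialize::json::StreamingParser; // hypothetical path to the PR's parser
use test::Bencher;

#[bench]
fn bench_stream_small_doc(b: &mut Bencher) {
    // A made-up sample document; any representative JSON would do.
    let src = r#"{"foo": [1, 2, 3], "bar": {"baz": true}}"#;
    b.iter(|| {
        // Drain every event so the whole document is actually parsed.
        for _ in StreamingParser::new(src.chars()) {}
    });
}
```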
Also, I wonder if taking a […]
@huonw JSON supports string escapes, so […]. The big benefit to taking […]
Hm, that's true about escapes, but I disagree with […]

I'd guess that almost all JSON strings will be unescaped (especially ones in object keys). I was thinking we could have a resumable parser, so one could read a chunk out of a reader, pass it into the JSON decoder and get string slices out of that chunk. (With the leading string possibly continuing from the previous run.) Theoretically this behaviour would reduce to the same as an […]
@huonw I said it's not worth it because it massively complicates the ability to do streaming JSON, as then you do have to explicitly manage your buffer, and the JSON parser would also have to be extended in order to buffer incomplete tokens internally. It's certainly possible, but I doubt it's really going to be that much of a performance win (compared to whatever you're actually doing with the JSON), and it will make the API much harder to use.
Use vim-trailing-whitespace, or an equivalent, to prevent your builds from failing.
This is great. I've only started reviewing this, but this is what I wanted back when I wrote the first parser; I wasn't able to do it then due to some long-closed issues. The first major thing I see is that I doubt we need two parsers. Can you merge the two?
@erickt sure, I am looking into using the streaming parser to build Json objects. Basically this is just taking the existing parser's logic and making it consume JsonEvents rather than chars.
```
ListStart => { ParseList(true) }
ObjectStart => { ParseObject(true) }
_ => { ParseBeforeFinish }
};
```
This should be indented with 4 spaces, not two. Also, we tend to leave off the `{ ... }` from simple one-expression match arms like this.
```
if self.ch_is('}') {
    self.bump();
    return Ok(Object(values));
```

```
pub fn stack<'l>(&'l self) -> &'l Stack {
```
Needs documentation.
Overall this is awesome. I would wait for #13107 to land first before landing this though. My next plan for all this is that we can remove the superfluous conversion of a […]

If you're interested, feel free to do that part of this PR, or if you'd rather get this landed before it bitrots too much, you or I can do it in a future PR.
cc me.
@nical: yeah, hold off on simplifying […]
Also, I'd like to eventually add the following feature to json::Stack:

```
fn matches(&self, pattern: &str) -> bool
```

which would compare the current stack with a string using a lightweight regex-like syntax like `"+.foo.#.$"`. I haven't given much thought to the syntax, and I want to experiment with different solutions; for example, it might be better to build an object from the string and then use this object to compare against the stack, as the object could have a representation that is more efficient to compare against. Anyway, I think it would be nice to be able to compare the stack with a string like: […]

This is something I'd like to investigate later, any thoughts?
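As a thought experiment, such a matcher might look like the sketch below. Everything here is hypothetical: the StackElement type, the pattern semantics (only "#" for any index and "$" for any key are handled, not "+"), and the free-function form instead of a method:

```rust
// Hypothetical stack matcher: compares a parser stack against a
// dot-separated pattern, where "#" matches any index and "$" any key.
enum StackElement<'a> {
    Key(&'a str),
    Index(u32),
}

fn matches(stack: &[StackElement], pattern: &str) -> bool {
    let parts: Vec<&str> = pattern.split('.').collect();
    if parts.len() != stack.len() {
        return false;
    }
    stack.iter().zip(parts.iter()).all(|(elem, part)| match (elem, *part) {
        (&StackElement::Index(_), "#") => true, // any index
        (&StackElement::Key(_), "$") => true,   // any key
        (&StackElement::Key(k), p) => k == p,   // literal key
        (&StackElement::Index(i), p) => p.parse::<u32>() == Ok(i),
    })
}
```

Pre-compiling the pattern into a small vector of matcher tokens, as the comment above suggests, would avoid re-splitting the string on every comparison.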
cc me
@nical: #13107 has landed so you are good to go! Regarding your pattern matching on json structures, that sounds like a great idea, though I'm not sure it should live in Rust proper; it'd make a great module, however. If you start going down this road, you should check out some of these projects. They're trying to do something similar by bringing the ideas of XQuery or XPath to json: JSPath, jsoniq, and JSONPath.
The rebase is quite awful because of changes that went into the error results. I don't want the parser's error enum variants to contain the decoder's, especially since rust will force users to match against all variants. This leaves me with something like this: […]

It looks like the only two ways to avoid those enum variant name collisions are to either move the builder and decoder into their own sub-modules, or get very creative with the English language.
@nical: Ugh, that is pretty ugly. Can both the Parser and Builder use the same error enum? As in: […]

@erickt You are right (with JsonError == ParserError I suppose). I have 13 hours to spend on a plane tomorrow so I guess I'll have plenty of time to fix that.
If I understand this correctly, @erickt's suggestion looks similar to something that will work, although […]. Also, if both functions are yielding a […]:

```
pub enum ParserError {
    SyntaxError(SyntaxError),
    IoError(IoError),
}

pub enum BuilderError {
    ParseError(ParserError),
    // decoder errors
    ExpectedError(~str, ~str),
    MissingFieldError(~str),
    UnknownVariantError(~str)
}

pub enum JsonEvent {
    ListStart,
    ... // note: no Error event, that's handled by the Result
}

impl Parser {
    pub fn parse(&mut self, ...) -> Result<JsonEvent, ParserError> { ... }
}

impl Builder {
    pub fn build(&mut self, ...) -> Result<Json, BuilderError> { ... }
}
```

Also, FWIW, if you do need to have the same variants in multiple enums, you can simulate scoped enum variants by using a mod named after the enum. There was a proposal a while back to add this to the language spec, but it was decided at that time that it was unnecessary; I like it, though. Implementing this looks something like:

```
pub type MyEnum = MyEnum::MyEnum; // expose the enum at this scope
pub mod MyEnum {
    // unfortunately the real enum has to be public for the parent module to see it
    // which means MyEnum::MyEnum is valid, despite being weird.
    pub enum MyEnum {
        VariantOne,
        VariantTwo
    }
    // add any impls of MyEnum here
}
```
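For what it's worth, usage of that trick would then read something like this (a hypothetical snippet in the Rust of the time; today's enums scope their variants natively):

```rust
// Variants are always qualified by the enum's name, so two enums can
// share variant names without colliding in the same scope.
fn describe(e: MyEnum) -> &'static str {
    match e {
        MyEnum::VariantOne => "one",
        MyEnum::VariantTwo => "two",
    }
}
```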
Just as a heads-up: after trying a few different ways, I ended up flattening the history of my branch and manually re-applying all of the changes that had gone into json.rs since the last rebase, in one big commit here: https://github.com/nical/rust/tree/json-rebase
I think that the branch is ready to be merged or to go through a new round of reviews.

Let me know if there is something that I should change in this pull request. At the moment I am not doing anything and the PR is slowly bitrotting.
@nical: This is awesome, sorry I missed your update a week ago. r=me once it's rebased on top of HEAD.

I have been short on spare time lately but I finally got the branch rebased \o/
@nical: unfortunately it looks like the build failed with this error: […]
I fixed the issue and rebased today. `make check` runs without failure on my machine.

The Travis CI build passed \o/
I think this has been causing some spurious failures on the bots, which look quite odd to me. Do you know why this would happen? http://buildbot.rust-lang.org/builders/auto-win-32-nopt-t/builds/4761/steps/test/logs/stdio
That's weird:

```
---- json::tests::test_read_list_streaming stdout ----
---- json::tests::test_read_object_streaming stdout ----
```

It says that equality failed, but also says that left and right are both BooleanValue(true).
That is the odd part indeed. This has appeared multiple times as well (not sure why).
Hi rust enthusiasts,
With this patch I propose to add a "streaming" API to the existing json parser in libserialize.
By "streaming" I mean a parser that let you act on JsonEvents that are generated as while parsing happens, as opposed to parsing the entire source, generating a big data structure and working with this data structure. I think both approaches have their pros and cons so this pull request adds the streaming API, preserving the existing one.
The streaming API is simple: it consists of an Iterator<JsonEvent> that consumes an Iterator<char>. JsonEvent is an enum with values such as NumberValue(f64), BeginList, EndList, BeginObject, etc.
The user would ideally use the API as follows:
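```
for evt in StreamingParser::new(src) {
    match evt {
        BeginList => {
            // ...
        }
        // ...
    }
}
```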
The iterator provides a stack() method returning a slice of StackNodes which represent "where we currently are" in the logical structure of the json stream (for instance at "foo.bar[3].x" you get [ Key("foo"), Key("bar"), Index(3), Key("x") ].)
I wrote "ideally" above because the current way rust expands for loops, you can't call the stack() method because the iterator is already borrowed. So for know you need to manually advance the iterator in the loop. I hope this is something we can cope with, until for loops are better integrated with the compiler.
Streaming parsers are useful when you want to read from a json stream, generate a custom data structure, and you know how the json is going to be structured. For example, imagine you have to parse a 3D mesh file represented in the json format. In this case you probably expect to have large arrays of vertices, and using the generic parser will be very inefficient because it will create a big list of all these vertices that you then copy into a contiguous array afterwards (so you end up doing a lot of small allocations, parsing the json once, and then walking the resulting data structure). With a streaming parser, you can add the vertices to a contiguous array as they come in without paying the cost of creating the intermediate Json data structure. You have far fewer allocations, since you write directly into the final data structure, and you can be smart about how you pre-allocate it.
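A sketch of that vertex case, under stated assumptions: `expected_vertex_count` and the `in_vertices` stack test are made-up helpers, while NumberValue, next() and stack() are the pieces described above:

```rust
// Collect a large "vertices" array straight into a contiguous Vec<f64>
// without ever building an intermediate Json value.
let mut vertices: Vec<f64> = Vec::with_capacity(expected_vertex_count);
let mut parser = StreamingParser::new(src);
loop {
    let evt = match parser.next() {
        Some(e) => e,
        None => break, // end of the json stream
    };
    // Manually advancing the iterator (rather than using a for loop)
    // leaves the parser free to be queried through stack().
    if let NumberValue(x) = evt {
        if in_vertices(parser.stack()) {
            vertices.push(x);
        }
    }
}
```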
I added this directly into serialize::json rather than into its own library because it turns out I can reuse most of the existing code, whereas maintaining a separate library (which I did originally) forced me to duplicate this code.
I wrote this trying to minimize the size of the patch, so there may be places where the code could be nicer at the expense of more changes (let me know what you prefer).
This is my first (potential) contribution to rust, so please let me know if I am doing something wrong (maybe I should have first introduced this proposition on the mailing list, or opened a github issue, etc.?). I work a few meters away from @pknfelix so I am not too hard to find :)