Support RawBson objects. #133
Comments
That's right. Every time we interact with Bson's data, we need to construct a full `Bson` value. But as you have discovered, it is very hard to make `Bson` zero-copy. So it is important to have a raw type.
Great. I've made a little bit of progress on the `rawbson` prototype.
Now with ObjectId deserialization in jcdyer/rawbson@4939f0b. This implements a proof of concept for zero-copy deserialization of custom types.

The implementation is a little janky at the moment, because the as_object_id() method on RawBson returns a bson::ObjectId, but deserialize only works with a custom ObjectId type (defined in the tests for now), because it requires a different intermediate representation than the current implementation, which uses a hex representation. I couldn't use the existing format because that would require allocating a string for the intermediate representation, just to later cast it back to binary in the bson::ObjectId value. The current deserialize implementation follows the pattern used for serde_toml's Datetime type, which the serde docs reference as an example of implementing types outside the serde object model.

I currently have an enum for deserializing bson keys and elements, but having worked through the ObjectId example, I think the better plan is to implement a separate KeyDeserializer or CStrDeserializer struct that can handle null-terminated strings.
Proof of concept was pretty successful. I didn't figure out the best way to decode all the possible values using serde, but I think I got a good start. I'm starting to work on pulling my changes in-tree and integrating more closely with this library. See the work in progress at https://github.com/jcdyer/bson-rs/tree/rawbson. I'll open a PR for it when it gets a little closer.
Point of potential conflict: the legacy generic binary subtype (0x02) is defined in the bson spec as containing a (duplicate) i32 length specifier inside the binary data, so:

Current generic (0x00): `int32 length | subtype | byte[length] data`

Deprecated binary (0x02): `int32 length | subtype | int32 length2 | byte[length2] data`

Based on what I understand the spec to be saying, the second i32 is always equal to the first i32 minus 4. The existing implementation returns a binary value of …
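To make the two layouts concrete, here is a small std-only sketch of extracting the logical contents of a binary element under each subtype. The function name, error handling, and the assumption that `outer_len` counts everything after the subtype byte are illustrative, not this crate's actual API:

```rust
/// Extract the logical contents of a binary element, given the outer i32
/// length, the subtype byte, and the payload bytes that follow them.
/// (Illustrative sketch; names are not the crate's real API.)
fn binary_contents(outer_len: i32, subtype: u8, payload: &[u8]) -> Result<&[u8], String> {
    if payload.len() != outer_len as usize {
        return Err("outer length does not match payload size".into());
    }
    match subtype {
        0x02 => {
            // Deprecated generic binary: payload = i32 inner_len + data bytes.
            if payload.len() < 4 {
                return Err("0x02 binary too short to hold inner length".into());
            }
            let inner_len =
                i32::from_le_bytes([payload[0], payload[1], payload[2], payload[3]]);
            // Per the spec reading above, inner_len must equal outer_len - 4.
            if inner_len != outer_len - 4 {
                return Err("inner length must equal outer length - 4".into());
            }
            Ok(&payload[4..])
        }
        // Current generic (0x00) and other subtypes: payload is the data itself.
        _ => Ok(payload),
    }
}
```

The open question is which of the two views (with or without the embedded length prefix) the accessor should return for 0x02 data.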
Also, I ran into an issue in my own work, where an old database document had a string field that held a 12-byte ObjectId, which is obviously not utf-8, and which was causing the driver to panic. The new implementation will not panic until the document is parsed to `Bson`. The closest solution I have in the imperative api is to call …. Any suggestions on a good api would be welcome.
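One possible shape for such an API is a fallible accessor that surfaces the UTF-8 error instead of panicking, plus a fallback that hands back the raw bytes. This is a std-only sketch; both function names are hypothetical, not part of the crate:

```rust
use std::str::Utf8Error;

/// Hypothetical fallible accessor: report invalid UTF-8 instead of panicking.
fn try_as_str(raw: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(raw)
}

/// Hypothetical fallback for legacy documents: if the "string" field actually
/// holds raw bytes (e.g. a 12-byte ObjectId), return them untouched.
fn str_or_bytes(raw: &[u8]) -> Result<&str, &[u8]> {
    std::str::from_utf8(raw).map_err(|_| raw)
}
```

The caller can then decide whether a non-UTF-8 "string" is an error or legacy binary data.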
I'm fairly pleased with the Deserializer implementation for Binary, though. It can be deserialized either with Deserializer::deserialize_bytes() or with Deserializer::deserialize_map(). The deserialize_map implementation allows the caller to provide a Deserialize implementation that only accepts certain binary subtypes (see the Uuid example included in the tests).
I managed to integrate it with bson-rs, and get both the rawbson and bson types acting as a deserialization source. Mongo will be returning results into an owned `Vec<u8>`, so I plan to define

```rust
pub struct RawBsonDocBuf {
    value: Vec<u8>,
}
```

and implement deserialize on that. I think most of the heavy lifting can be passed off to the borrowed type.
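The owned/borrowed split described here can be sketched along the lines of `String`/`&str`, with the owned buffer delegating all real work to a borrowed view. Names and methods below are illustrative assumptions, not the crate's final API:

```rust
/// Borrowed view over raw BSON bytes (illustrative sketch).
pub struct RawBsonDoc<'a> {
    data: &'a [u8],
}

impl<'a> RawBsonDoc<'a> {
    pub fn new(data: &'a [u8]) -> Self {
        RawBsonDoc { data }
    }
    /// Length of the underlying byte buffer.
    pub fn len(&self) -> usize {
        self.data.len()
    }
}

/// Owned buffer; accessors delegate to the borrowed type.
pub struct RawBsonDocBuf {
    value: Vec<u8>,
}

impl RawBsonDocBuf {
    pub fn new(value: Vec<u8>) -> Self {
        RawBsonDocBuf { value }
    }
    /// Borrow the owned bytes as a `RawBsonDoc`, mirroring String -> &str.
    pub fn as_doc(&self) -> RawBsonDoc<'_> {
        RawBsonDoc::new(&self.value)
    }
}
```

With this shape, deserialization and field access only need to be implemented once, on the borrowed type.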
That is new to me... I hadn't noticed that.
I just walked through your code. It looks awesome, and the functionality is complete.
Thank you. I'm working on some benchmarks; not quite done yet. Some early results:
Given the above, I think I'd like to add the RawBsonDocBuf, then consolidate the implementations of the owned and borrowed types.

I also think it might make sense to have RawBsonDoc do a validation pass at construction time, to make sure that what got passed looks like bson. In particular, I think it should verify that lengths and zero-terminators line up at the right places, and that ElementType bytes are correct. Parsing individual objects beyond that (verifying utf-8, for example, or checking the format of binary subtypes) can be delayed to field access time.

Once all of that is done, I'll open up a PR, and we can start hammering on the API design, and answer some of the following questions together:
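The construction-time check described above might start with the document framing: the i32 length prefix and the trailing NUL. A minimal std-only sketch (function name and error type are hypothetical; element-level checks are omitted):

```rust
/// Verify BSON document framing at construction time: the i32 length prefix
/// must match the buffer, and the document must end with a zero byte.
/// Per-field parsing (UTF-8, binary subtypes) is deferred to access time.
fn validate_framing(data: &[u8]) -> Result<(), String> {
    // The smallest valid document is an empty one: i32 length (5) + NUL.
    if data.len() < 5 {
        return Err("document shorter than minimal BSON (5 bytes)".into());
    }
    let declared = i32::from_le_bytes([data[0], data[1], data[2], data[3]]) as usize;
    if declared != data.len() {
        return Err("length prefix does not match buffer length".into());
    }
    if data[data.len() - 1] != 0 {
        return Err("document does not end with a zero byte".into());
    }
    Ok(())
}
```

A full pass would additionally walk each element, checking the ElementType byte and that each element's declared extent lands inside the buffer.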
PR is up. Looking forward to your feedback, @zonyitoo.
@saghm and team: I see this repo just changed ownership. Please let me know if there are any changes needed to make this work meet the needs of the new rust driver.
Yup. bson-rust is now managed by the MongoDB team.
Tracked in jira.mongodb.org/browse/RUST-284
Hello @jcdyer! We're about to start work on implementing a raw document type, and we'd like to base it off of the work you've done. Is it all right with you if I make a pull request that starts with a commit containing the work you've done (attributed to you, of course) and then add follow-up commits as we iterate on the design and make changes? We want to make sure to credit you for the awesome work you've done on this problem, so having your code merged as a separate commit with a follow-up commit containing our changes seems like the easiest way to achieve this.
@saghm That's great news! I'll create a new PR to include the current version of rawbson in-tree. As a heads up, I gave a bit of a stab at including it in the mongo-rust-driver, and unfortunately it's not 100% straightforward, as there are a number of places where responses get deserialized for response metadata, but since rawbson isn't a serde deserialization target, it doesn't quite work out of the box. I was starting to work around that by creating a FromDoc trait that is implemented for T: Deserialize<'de>, and provides from_doc(&'de rawbson::Doc) -> T, but also lets you extract embedded documents as raw documents. Unfortunately, it requires a lot of new boilerplate, but there may be a cleaner way to do it.
New PR: #229. Basically, I created a …. Note that the raw type isn't a drop-in replacement, for a few reasons:
I'm working on an implementation of a zero-copy RawBson type / family of types that store the input data as an `&[u8]` and operate directly on the bytes. This seems like a potentially better direct-from-the-wire format for the mongo crate, as it's more performant, and can still support many of the methods that the current `Bson`/`OrderedDocument` types support. Is this something you'd be interested in incorporating into this crate?

It's a WIP, but as it stands now, you can directly instantiate a raw document from a `&[u8]`, and from that query into the document using `doc.get("key")?.as_<type>()?`, similar to the current implementation. Currently, keying into the document is a linear operation, but it could be sped up by validating the bson & storing offsets in an index during construction.

It would not support `.get_mut()` operations, and if we wanted to add that in the future, they would need to return (parsed) `Bson` objects, since in the general case you can't manipulate the data without allocating new buffers.

I do envision having conversion functions from the raw types to the parsed ones, though the other direction would complicate things, because the current implementation uses `&[u8]`, and there would need to be an owner of the raw data (`Cow<'_, str>`?).

Code overview:

Still under design: how this interfaces with encode and decode functions, `impl serde::Deserialize` types, and mongo query functions.
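The linear keyed lookup described above can be sketched over raw bytes with std only. This is a deliberately minimal illustration that supports just int32 (0x10) and string (0x02) elements and does no bounds hardening; the function name and return type are assumptions, not the crate's actual API:

```rust
/// Linearly scan a raw BSON document for `key`, returning the value's raw
/// bytes. Supports only int32 (0x10) and string (0x02) elements; a real
/// implementation would cover every ElementType and validate bounds.
fn get<'a>(doc: &'a [u8], key: &str) -> Option<&'a [u8]> {
    if doc.len() < 5 {
        return None; // too short to be a BSON document
    }
    // Skip the 4-byte document length; stop before the trailing NUL.
    let mut i = 4;
    while i < doc.len() - 1 {
        let tag = doc[i];
        i += 1;
        // Parse the null-terminated element key (a cstring).
        let key_end = i + doc[i..].iter().position(|&b| b == 0)?;
        let this_key = std::str::from_utf8(&doc[i..key_end]).ok()?;
        i = key_end + 1;
        // Compute the value's extent from the element type.
        let val_len = match tag {
            0x10 => 4, // int32
            0x02 => {
                // string: i32 length (which counts the trailing NUL) + bytes
                4 + i32::from_le_bytes([doc[i], doc[i + 1], doc[i + 2], doc[i + 3]]) as usize
            }
            _ => return None, // element type unsupported in this sketch
        };
        if this_key == key {
            return Some(&doc[i..i + val_len]);
        }
        i += val_len; // not a match: skip over the value and continue
    }
    None
}
```

An offset index built during a validation pass would replace this scan with a single map lookup.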