
Support RawBson objects. #133


Closed
jcdyer opened this issue Oct 23, 2019 · 18 comments
Labels
tracked-in-jira Ticket filed in Mongo's Jira system

Comments

@jcdyer
Contributor

jcdyer commented Oct 23, 2019

I'm working on an implementation of a zero-copy RawBson type (really a small family of types) that stores the input data as an &[u8] and operates directly on the bytes. This seems like a potentially better direct-from-the-wire format for the mongo crate, as it's more performant and can still support many of the methods that the current Bson / OrderedDocument types support. Is this something you'd be interested in incorporating into this crate?

It's a WIP, but as it stands now, you can instantiate a raw document directly from a &[u8] and then query into the document using doc.get("key")?.as_<type>()?, similar to the current implementation. Currently, keying into the document is a linear scan, but it could be sped up by validating the BSON and storing offsets in an index during construction.
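
As a rough illustration of what that linear scan looks like (a sketch only, not the actual WIP code; the value_len helper below covers just a handful of element types):

fn get_raw<'a>(doc: &'a [u8], key: &str) -> Option<&'a [u8]> {
    // A document is an i32 total length, a run of elements, then 0x00.
    // Each element is a type byte, a null-terminated key, and a value.
    let mut pos = 4;
    while pos + 1 < doc.len() {
        let element_type = doc[pos];
        pos += 1;
        let key_len = doc[pos..].iter().position(|&b| b == 0)?;
        let this_key = std::str::from_utf8(&doc[pos..pos + key_len]).ok()?;
        pos += key_len + 1;
        let len = value_len(element_type, &doc[pos..])?;
        if this_key == key {
            return doc.get(pos..pos + len);
        }
        pos += len; // skip values we aren't looking for
    }
    None
}

// Value sizes for a few element types; a real implementation covers them all.
fn value_len(element_type: u8, data: &[u8]) -> Option<usize> {
    let prefix = |d: &[u8]| Some(i32::from_le_bytes(d.get(..4)?.try_into().ok()?) as usize);
    match element_type {
        0x01 => Some(8),                 // double
        0x02 => Some(4 + prefix(data)?), // string: i32 prefix + bytes (incl. null)
        0x03 | 0x04 => prefix(data),     // embedded document / array
        0x07 => Some(12),                // ObjectId
        0x08 => Some(1),                 // boolean
        0x10 => Some(4),                 // int32
        0x12 => Some(8),                 // int64
        _ => None,                       // remaining types elided in this sketch
    }
}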

It would not support .get_mut() operations, and if we wanted to add those in the future, they would need to return (parsed) Bson objects, since in the general case you can't manipulate the data without allocating new buffers.

I do envision having conversion functions from the raw types to the parsed ones, though the other direction would complicate things, because the current implementation uses &[u8], and there would need to be an owner of the raw data (Cow<'_, [u8]>?).

Code overview:

pub struct RawBsonDoc<'a> {
    data: &'a [u8],
}

impl<'a> RawBsonDoc<'a> {
    pub fn new(data: &'a [u8]) -> Option<RawBsonDoc<'a>>;
    pub fn get(&'a self, key: &str) -> Option<RawBson<'a>>;
    pub fn to_parsed_doc(&self) -> OrderedDocument;
}

pub struct RawBsonArray<'a> {
    data: &'a [u8],
}

impl<'a> RawBsonArray<'a> {
    pub fn new(data: &'a [u8]) -> Option<RawBsonArray<'a>>;
    pub fn get(&'a self, index: usize) -> Option<RawBson<'a>>;
    pub fn to_parsed_array(&self) -> ??;
}

pub struct RawBson<'a> {
    bsontype: BsonType,
    data: &'a [u8],
}

impl<'a> RawBson<'a> {
    // only constructed from RawBsonDoc::get(&str) or RawBsonArray::get(usize)
    pub fn as_str(&self) -> Option<&'a str>;
    pub fn as_doc(&self) -> Option<RawBsonDoc<'a>>;
    pub fn as_array(&self) -> Option<RawBsonArray<'a>>;
    pub fn to_parsed_bson(&self) -> Bson;
}

// Not publicly exposed
enum BsonType {
    String,
    Document,
    Array,
    // ...
}
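
Hypothetical usage of the sketched API (wire_bytes is assumed to hold a BSON document received off the wire; error handling elided):

let doc = RawBsonDoc::new(wire_bytes).expect("invalid BSON");
let name = doc.get("name").and_then(|b| b.as_str());   // Option<&str>, zero-copy
let pets = doc.get("pets").and_then(|b| b.as_array()); // Option<RawBsonArray>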

Still under design: how this interfaces with the encode and decode functions, how these types implement serde::Deserialize, and how they plug into mongo query functions.

@zonyitoo
Contributor

zonyitoo commented Oct 24, 2019

Is this something you'd be interested in incorporating into this crate?

Definitely. Every time we interact with Bson data we need to construct a Bson value, which requires many data copies. I've always wanted a RawBson just like what you're currently working on.

But as you have discovered, it is very hard to make RawBson implement serde::Deserialize. And in practice, we often need to return a Bson object, which takes ownership of the data. So RawBson is limited to use cases where you have the raw bytes of a BSON document but only want some of its data. Once you want to pass it to other parts of your program, you still have to copy the data.

So it is important to have a to_owned(&self) method on RawBson, RawBsonDoc, RawBsonArray, and so on, like the to_parsed_bson(&self) you mentioned in your example.

@jcdyer
Contributor Author

jcdyer commented Oct 28, 2019

Great. I've made a bit of progress on the serde::Deserialize implementation, and I currently have a working prototype that can deserialize to String, &str (with zero-copy borrowing from the original slice), *Map<T, U>, and structs. I'm doing the work in a separate repo and will pull it in when I get a little further along, but you can see the WIP at https://github.com/jcdyer/rawbson

@jcdyer
Contributor Author

jcdyer commented Oct 28, 2019

Now with ObjectId deserialization in jcdyer/rawbson@4939f0b. This implements a proof of concept for zero-copy deserialization of custom types. The implementation is a little janky at the moment: the as_object_id() method on RawBson returns a bson::ObjectId, but deserialization only works with a custom ObjectId type (defined in the tests for now), because it requires a different intermediate representation than the current implementation, which uses hex. I couldn't use the existing format because that would require allocating a string for the intermediate representation, just to cast it back to binary in the bson::ObjectId value afterwards.

The current deserialize implementation follows the pattern used for the toml crate's Datetime type, which the serde docs reference as an example of implementing types outside the serde data model.
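
A hedged sketch of what that pattern could look like here (the marker constant and all other names are invented for illustration, not taken from the rawbson code):

use serde::de::{self, Deserialize, Deserializer, Visitor};

// A cooperating Deserializer recognizes this "magic" newtype name and feeds
// the visitor the raw 12 bytes instead of an allocated hex string.
pub const OBJECT_ID_NEWTYPE: &str = "$__bson_raw_object_id";

pub struct RawObjectId(pub [u8; 12]);

impl<'de> Deserialize<'de> for RawObjectId {
    fn deserialize<D: Deserializer<'de>>(deserializer: D) -> Result<Self, D::Error> {
        struct IdVisitor;
        impl<'de> Visitor<'de> for IdVisitor {
            type Value = RawObjectId;
            fn expecting(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
                f.write_str("a 12-byte BSON ObjectId")
            }
            fn visit_bytes<E: de::Error>(self, v: &[u8]) -> Result<RawObjectId, E> {
                let bytes: [u8; 12] =
                    v.try_into().map_err(|_| E::invalid_length(v.len(), &self))?;
                Ok(RawObjectId(bytes))
            }
        }
        deserializer.deserialize_newtype_struct(OBJECT_ID_NEWTYPE, IdVisitor)
    }
}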

I currently have one enum for deserializing both bson keys and elements, but having worked through the ObjectId example, I think the better plan is to implement a separate KeyDeserializer or CStrDeserializer struct that can handle null-terminated strings.

@jcdyer
Contributor Author

jcdyer commented Nov 2, 2019

The proof of concept was pretty successful. I haven't figured out the best way to decode every possible value using serde, but I think I've got a good start. I'm now working on pulling my changes in-tree and integrating more closely with this library. See the work in progress at https://github.com/jcdyer/bson-rs/tree/rawbson. I'll open a PR for it when it gets a little closer.

@jcdyer
Contributor Author

jcdyer commented Nov 4, 2019

Point of potential conflict: the legacy generic binary type (0x02) is defined in the bson spec as containing a (duplicate) i32 length specifier inside the binary data, so:

Current generic: (i32 length) "\x00" data

Deprecated binary: (i32 length) "\x02" (i32 length) data

As I read the spec, the second i32 is always equal to the first i32 minus 4.

The existing implementation returns a binary value of (i32 length) data, while my implementation returns just data, stripping out the second length specifier.
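
For concreteness, a sketch of how reading a binary value could strip that redundant inner length (illustrative only; data is assumed to be the value region of a binary element, after the type byte and key):

fn binary_payload(data: &[u8]) -> Option<&[u8]> {
    let len = i32::from_le_bytes(data.get(..4)?.try_into().ok()?) as usize;
    let subtype = *data.get(4)?;
    let payload = data.get(5..5 + len)?;
    if subtype == 0x02 {
        // Deprecated binary: an inner i32 that must equal len - 4.
        let inner = i32::from_le_bytes(payload.get(..4)?.try_into().ok()?) as usize;
        if inner != len.checked_sub(4)? {
            return None; // malformed: the two lengths disagree
        }
        payload.get(4..)
    } else {
        Some(payload)
    }
}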

@jcdyer
Contributor Author

jcdyer commented Nov 4, 2019

Also, I ran into an issue in my own work where an old database document had a string field holding a 12-byte ObjectId, which is obviously not valid UTF-8, and which was causing the driver to panic. The new implementation will not panic until the document is parsed to Bson or the data from that field is accessed as a string. I'm thinking about ways to work with that data without crashing, and one option (currently implemented) is to allow a string-valued Bson element to deserialize into a field of bytes (&[u8] or Vec<u8>). This gives a reasonable workaround, but it is only accessible via serde.
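
For example, a field declared like this would tolerate the bad data (a sketch assuming the raw deserializer hands string bytes to visit_bytes; serde_bytes is used here to request bytes rather than a sequence):

use serde::Deserialize;

#[derive(Deserialize)]
struct LegacyRecord {
    // Should be a string, but old documents stored raw ObjectId bytes here.
    #[serde(with = "serde_bytes")]
    maybe_string: Vec<u8>,
}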

The closest solution I have in the imperative API is to call as_bytes() on the RawBson object, which returns the pointed-to data directly (without reference to the element_type); but for a string, this returns the bytes prefixed by the four-byte length specifier, not just the bytes that make up the string. I'm not entirely pleased with it, but I haven't come up with a better API that handles all types gracefully. In theory, there could be a similar problem with document keys, but I haven't seen it happen, so I'm sticking with the &str-or-panic representation for keys.

Any suggestions on a good api would be welcome.

@jcdyer
Contributor Author

jcdyer commented Nov 4, 2019

I'm fairly pleased with the Deserializer implementation for Binary, though. A binary value can be deserialized either via Deserializer::deserialize_bytes() or via Deserializer::deserialize_map(). The deserialize_map implementation allows the caller to provide a Deserialize implementation that only accepts certain binary subtypes (see the Uuid example included in the tests).
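
A hedged sketch of the subtype-checking idea (the entry shape and the "subtype"/"bytes" names are invented here; the real intermediate representation lives in the rawbson tests):

use serde::de::{self, Deserialize, Deserializer, MapAccess, Visitor};

struct Uuid(Vec<u8>);

impl<'de> Deserialize<'de> for Uuid {
    fn deserialize<D: Deserializer<'de>>(d: D) -> Result<Self, D::Error> {
        struct UuidVisitor;
        impl<'de> Visitor<'de> for UuidVisitor {
            type Value = Uuid;
            fn expecting(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
                f.write_str("a BSON binary value with the UUID subtype")
            }
            fn visit_map<A: MapAccess<'de>>(self, mut map: A) -> Result<Uuid, A::Error> {
                let (_, subtype): (String, u8) = map
                    .next_entry()?
                    .ok_or_else(|| de::Error::custom("missing subtype"))?;
                if subtype != 0x04 {
                    return Err(de::Error::custom("expected UUID subtype (0x04)"));
                }
                let (_, bytes): (String, serde_bytes::ByteBuf) = map
                    .next_entry()?
                    .ok_or_else(|| de::Error::custom("missing bytes"))?;
                Ok(Uuid(bytes.into_vec()))
            }
        }
        d.deserialize_map(UuidVisitor)
    }
}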

@jcdyer
Contributor Author

jcdyer commented Nov 13, 2019

I managed to integrate it with bson-rs and get both the rawbson and bson types acting as deserialization sources. Mongo will be returning results in an owned Vec<u8>, though, and if we don't want callers to have to create a home for that Vec, I'll need to create an owning version of my RawBsonDoc type, along the lines of:

pub struct RawBsonDocBuf {
    value: Vec<u8>
}

and implement deserialize on that. I think most of the heavy lifting can be passed off to an as_ref() (or deref) that returns a RawBsonDoc<'a>, where 'a is the lifetime of the RawBsonDocBuf.
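
A minimal sketch of that borrow (new_unchecked is hypothetical; validation is assumed to have happened when the buffer was constructed):

impl RawBsonDocBuf {
    pub fn as_doc(&self) -> RawBsonDoc<'_> {
        // Re-validating on every borrow would be wasted work, hence the
        // hypothetical unchecked constructor.
        RawBsonDoc::new_unchecked(&self.value)
    }
}

One wrinkle: a literal Deref impl is awkward, because Deref::deref must return a reference and there is no RawBsonDoc<'a> value stored inside the buffer to point at; an as_ref-style method that returns the borrowed type by value sidesteps that.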

@zonyitoo
Contributor

The legacy generic binary type (0x02) is defined in the bson spec as containing a (duplicate) i32 length specifier inside the binary data

That is new to me... I haven't noticed that.

@zonyitoo
Contributor

Just walked through your code. Looks awesome and functionally complete.

@jcdyer
Contributor Author

jcdyer commented Nov 13, 2019

Thank you.

I'm working on some benchmarks; they're not quite done yet. Some early results:

  • The bulk of the time is spent building the Bson object.
  • Accessing arbitrary elements in a large flat document is faster in OrderedDocument than in RawBsonDoc (because finding a given element in a RawBsonDoc is a linear scan).
  • Iterating over all elements is faster in RawBsonDoc.
  • Accessing a deeply-nested element is faster in RawBsonDoc.
  • Mostly as a result of the above, the time from having a Vec to getting the values of late elements in a large, flat document (1000 keys, short string values) is lower for RawBsonDoc until you've accessed ~16 elements; beyond that, OrderedDocument wins.
  • Converting bytes -> OrderedDocument using decode_document and converting bytes -> RawBsonDoc -> OrderedDocument take about the same amount of time.

Given the above, I think I'd like to add the RawBsonDocBuf, then consolidate the implementations of decode_document and From<RawBsonDoc> for Bson.

I also think it might make sense to have RawBsonDoc do a validation pass at construction time, to make sure that what got passed in looks like BSON. In particular, it should verify that lengths and zero terminators line up at the right places and that ElementType bytes are valid. Parsing individual values beyond that (verifying UTF-8, for example, or checking the format of binary subtypes) can be deferred to field access time.
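
The outermost framing checks could look something like this (a sketch with invented names; the per-element walk is elided):

fn valid_framing(data: &[u8]) -> bool {
    if data.len() < 5 {
        return false; // minimum document: i32 length + trailing 0x00
    }
    let declared = i32::from_le_bytes(data[..4].try_into().unwrap()) as usize;
    // The declared length covers the whole document, including the length
    // prefix itself and the zero terminator.
    declared == data.len() && data[data.len() - 1] == 0
}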

Once all of that is done, I'll open a PR, and we can start hammering on the API design and answering some of the following questions together:

  • What does the overall workflow from bytes to raw types to parsed bson, to T: Deserialize<'de> types look like?
  • How and when should T: Serialize types produce RawBson types, and when should they produce bytes directly?
  • How much are we willing to break with this feature to clean up the API?
  • And of course, how should this be integrated into the mongodb library?

@jcdyer
Contributor Author

jcdyer commented Dec 20, 2019

PR is up. Looking forward to your feedback, @zonyitoo.

@jcdyer
Contributor Author

jcdyer commented Jan 5, 2020

@saghm and team:

I see this repo just changed ownership. Please let me know if there are any changes needed to make this work meet the needs of the new rust driver.

@zonyitoo
Contributor

zonyitoo commented Jan 5, 2020

PR is up. Look forward to your feedback, @zonyitoo.

Yup. bson-rust is now managed by the MongoDB team.

@pooooodles

Tracked in jira.mongodb.org/browse/RUST-284

@saghm
Contributor

saghm commented Feb 3, 2021

Hello @jcdyer! We're about to start work on implementing a raw document type, and we'd like to base it off the work you've done. Is it all right with you if I make a pull request that starts with a commit containing your work (attributed to you, of course) and then add follow-up commits as we iterate on the design and make changes? We want to make sure to credit you for the awesome work you've done on this problem, and having your code merged as a separate commit with follow-up commits containing our changes seems like the easiest way to achieve that.

@jcdyer
Contributor Author

jcdyer commented Feb 5, 2021

@saghm That's great news! I'll create a new PR to include the current version of rawbson in-tree.

As a heads up, I took a stab at including it in the mongo-rust-driver, and unfortunately it's not 100% straightforward: there are a number of places where responses get deserialized for response metadata, and since rawbson isn't a serde deserialization target, it doesn't quite work out of the box. I started working around that by creating a FromDoc trait that is implemented for T: Deserialize<'de> and provides from_doc(&'de rawbson::Doc) -> T, but also lets you extract embedded documents as raw documents. Unfortunately, it requires a lot of new boilerplate; there may be a cleaner way to do it.
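
The rough shape of that trait, with every name below invented for illustration (this is not the experiment's actual API):

use serde::Deserialize;

pub struct RawDoc { /* validated raw bytes */ }
pub struct RawError;

pub trait FromDoc<'de>: Sized {
    fn from_doc(doc: &'de RawDoc) -> Result<Self, RawError>;
}

// Ordinary types route through serde.
impl<'de, T: Deserialize<'de>> FromDoc<'de> for T {
    fn from_doc(doc: &'de RawDoc) -> Result<Self, RawError> {
        // In a real implementation this would drive a serde Deserializer
        // over doc's bytes.
        let _ = doc;
        unimplemented!()
    }
}

Presumably part of the boilerplate comes from that blanket impl: a separate impl that returns embedded documents as &'de RawDoc would overlap with it, so extra machinery is needed.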

@jcdyer
Contributor Author

jcdyer commented Feb 7, 2021

New PR: #229

Basically, I created a pub mod raw, pulled the code into it, added the benchmarks, updated dependencies, and fixed the import paths.

Note that the raw type isn't a drop-in replacement, for a few reasons:

  1. It's read-only. I've been thinking about ways to add limited write capability. For instance, you could support a fairly ergonomic and performant append operation on DocBuf (but not Doc), less performant if you check for duplicate keys, and less performant still on arrays; and you can modify fixed-size element types in place without rewriting the whole document. Deletion and other updates, though, would require rewriting the whole document in the general case. (See the append sketch after this list.)
  2. Parsing happens at access time rather than up front, which means you can sometimes get useful data out of a malformed document, but it also means every read operation needs to be able to return a MalformedDocument error.
  3. See above about deserialization.
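
A sketch of the append idea from point 1, reusing the RawBsonDocBuf from earlier in the thread (layout per the BSON spec; this is illustrative, not the PR's code):

impl RawBsonDocBuf {
    // Append a string element by splicing it in ahead of the trailing 0x00
    // and patching the i32 length prefix. No duplicate-key check.
    pub fn append_str(&mut self, key: &str, value: &str) {
        self.value.pop(); // drop the document terminator
        self.value.push(0x02); // element type: string
        self.value.extend_from_slice(key.as_bytes());
        self.value.push(0); // keys are null-terminated cstrings
        let len = (value.len() + 1) as i32; // string length includes its null
        self.value.extend_from_slice(&len.to_le_bytes());
        self.value.extend_from_slice(value.as_bytes());
        self.value.push(0); // string terminator
        self.value.push(0); // restore the document terminator
        let total = (self.value.len() as i32).to_le_bytes();
        self.value[..4].copy_from_slice(&total); // patch the length prefix
    }
}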
