Skip to content

RUST-284: Incorporate rawbson code from rawbson = "0.2.1" in mod raw #229

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 48 commits into from
Oct 25, 2021

Conversation

jcdyer
Copy link
Contributor

@jcdyer jcdyer commented Feb 7, 2021

Implements #133

@jcdyer jcdyer mentioned this pull request Feb 7, 2021
@saghm
Copy link
Contributor

saghm commented Feb 8, 2021

Thanks for the pull request @jcdyer! Is it all right with you if I start pushing commits to this branch? We're interested in getting this out to users as soon as possible (specifically in our 2.0 release, which we just released an alpha for), so it will go more quickly if we can write the code for changes we want to make ourselves. We definitely would love your input as a reviewer for the code we write!

@jcdyer
Copy link
Contributor Author

jcdyer commented Feb 9, 2021

by all means! Use it as you see fit. I'm happy to answer questions, discuss implementation ideas, review code, or whatever helps.

src/raw/mod.rs Outdated
DocBuf {
data: data.into_boxed_slice(),
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function doesn't actually do anything unsafe. I marked it that way out of a suspicion that it might someday in the future, so it's marked to prevent backwards compatibility semver issues, but everything interacts with the underlying slice in a completely safe manner.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting; I was wondering why it was marked as unsafe. I think if it ever does need to become unsafe, we would probably prefer to just add a separate method and deprecate this one rather than marking it as unsafe even though it currently doesn't need to be, so I've removed that annotation in my follow-up commit.

src/raw/mod.rs Outdated
/// Error to indicate that either a value was empty or it contained an unexpected
/// type, for use with the direct getters.
#[derive(Debug, PartialEq)]
pub enum RawError {
Copy link
Contributor Author

@jcdyer jcdyer Feb 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to merge this with the primary crate Error type. The main difference is needing to be able to signal malformed bson when accessing elements.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, we don't actually have a single error type in the crate, but each submodule defines their own (e.g. document::ValueAccessError, oid::Error, etc.). Given the current situation, it makes sense to leave this as-is for now, and maybe in a 2.0 release we can considering unifying the crate under a single error type.

@saghm
Copy link
Contributor

saghm commented Feb 10, 2021

Thanks so much for this PR! We're going to hold off for a couple of weeks before merging the serde implementation in order to focus on the API that we need to be able to update the driver for the 2.0 release, but we definitely plan on using it once we're ready, and we'll make sure that you get proper credit for it.

I've pushed some small API and documentation changes as a separate commit and added the rest of the team as reviewers to the pull request.

Copy link
Contributor

@patrickfreed patrickfreed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jcdyer, thanks again for the PR on this--it looks great so far! To avoid possibly duplicating any work though, @saghm will handle making any code changes @isabelatkinson and I request going forward, but feel free to chime in during the review as you see fit.

@jcdyer
Copy link
Contributor Author

jcdyer commented Feb 14, 2021

To summarize some of my comments from the other day, I'd caution against thinking about (or advertising) RawDocument simply as a faster Document, or about trying to force the two to look exactly the same at the API level. It is a different data structure with different performance characteristics. In many cases, it is faster, but not all.

In some cases, as the benchmarks show, RawDocument is significantly slower than Document. For instance, reading an element out of a document by name is O(n) in RawDocument, since you have to start at the beginning and read each element out of the RawDocument until you find the one you want, while Document is O(1), since it is just a hashmap. If you only need a single element, RawDocument is faster since the parsing time overwhelms the measurement, but as the number of elements accessed goes up, raw document becomes O(M * N) (M= number of accesses, N=size of document), while Document is O(M * 1).

Where there is clear similarity between the two types, it makes sense to keep a similar API, but document mutation is not one of those situations. Raw document can support some mutations fairly simply, but others are quite complex, and should be discouraged:

RawDocument append

Easy. O(1) performance (assuming a reasonably sized element)

  1. Remove the trailing \0
  2. add the new element
  3. append a trailing \0
  4. rewrite the length marker at the beginning of the document.

Requires an O(n) read to guarantee non-duplicate keys, because you have to iterate over the document to compare with the existing keys.

RawArray append

Note: there currently isn't an owning RawArray type, since that's not part of the bson spec. If such a type were created, this could be implemented.

Easy, O(n) performance:

  1. Iterate the document to get the current length
  2. append as above with length as the key.

Non-size-changing element modification

(e.g., changing the value of a fixed size element, without changing its type)

  1. Access the desired element (O(n))
  2. Modify the element.

General element modification / element replacement

Slow, moderately complicated

  1. Find the desired element (O(n))
  2. Reserve all following elements to a buffer. (O(n) write)
  3. Modify / replace the element.
  4. Append the reserved Elements (O(n) write)
  5. Rewrite the length marker at the beginning of the document.

Modifying a nested document:

Slow and complicated (O(N * depth of nesting)).

For a nested key x.y.z:

  1. Find element x. Copy all following elements to a buffer as for general modification
  2. Find element y in embedded document x. Copy all following elements to a buffer as for general element modification.
  3. Find element z in embedded document y: Copy all following elements to a buffer as for general element modification.
  4. Modify element z.
  5. Append reserved elements for document y.
  6. Rewrite the length marker at the beginning of y.
  7. Append reserved elements for document x.
  8. Rewrite the length marker at the beginning of x.
  9. Append reserved elements for top-level document.
  10. Rewrite the length marker for top-level document.

Note, this algorithmic complexity is exacerbated by the fact that the data is owned by the top-level document, so inner documents aren't even capable of resizing themselves to accomodate modified elements.

Summary

I would recommend supporting the following operations:

  1. non-key-checking (O(1)) push and extend.
  2. key-checking (O(n)) push and extend.
  3. impl FromIterator<Item=(&str, &RawBson)> for RawDocument

Then document that other modifications should be made either by iterating over the existing document and collecting to a fresh RawDocument, or by deserializing to a more appropriate data structure to perform mutations (a Document or custom type), then serializing back to RawDocument.

Copy link
Contributor

@patrickfreed patrickfreed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed overview @jcdyer ! I definitely agree that RawDocument is not a strict improvement performance-wise for all use-cases. I think the lion's share of the perf benefits will come from within the driver actually, since it should allow us to skip an intermediate serialization step for insertions and potentially reduce a lot of heap allocations in other places. It's unclear if deserialization will improve significantly though, since that requires mostly key lookups.

I also agree with the summary / suggested operations. One additional operation that may be worth including is entry deletion, since that should be pretty straightforward as well (will be O(n) as with all other key-based operations though). This would make it easier for a user to do a replace should they have to, though again we should try to encourage them to use a different type if they plan to do mutations of that sort. I guess they could also trivially implement it via a filter + collect + non-key-checking append, though, so maybe its not necessary.

src/raw/error.rs Outdated
@@ -0,0 +1,27 @@
/// An error that occurs when attempting to parse raw BSON bytes.
#[derive(Debug, PartialEq)]
pub enum Error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As part of refactoring this into a separate submodule, I removed the Raw prefix from the Error and Result types defined here. This is consistent with how we define errors elsewhere in the library (e.g. bson::de::Error and bson::ser::Error), and users will need to import it with bson::raw::{Error, Result} anyhow.

As for RawDocument and friends, I do not think we should remove the prefix here. raw::Error and raw::Result are essentially semantically equivalent to other Error and Result types both within this library and externally, Document and RawDocument are semantically different in significant ways. Put another way, I don't think it would be confusing for readers of code using this library to gloss over what the exact Result or Error type is given that their expectations of how they work will be correct, but glossing over whether a regular Document or a RawDocument is used seems much more dangerous.

Copy link
Contributor

@patrickfreed patrickfreed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the tests aren't compiling after the module reorganization.

src/raw/array.rs Outdated
}
}

/// An iterator over borrwed raw BSON array values.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: typo here (should be borrowed)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

src/raw/doc.rs Outdated
/// # use bson::raw::{RawDocument, Error};
/// let doc = RawDocument::new(b"\x13\x00\x00\x00\x02hi\x00\x06\x00\x00\x00y'all\x00\x00".to_vec())?;
/// let mut iter = doc.iter();
/// let (key, value) = iter.next().unwrap()?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it necessary to have both unwrap and ? here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The return type is Option<Result<,>>. It could also be .unwrap().unwrap().

@pooooodles pooooodles changed the title Incorporate rawbson code from rawbson = "0.2.1" in mod raw RUST-284: Incorporate rawbson code from rawbson = "0.2.1" in mod raw Feb 18, 2021
@pooooodles pooooodles added the tracked-in-jira Ticket filed in Mongo's Jira system label Feb 18, 2021
@jcdyer
Copy link
Contributor Author

jcdyer commented Feb 19, 2021

Tangential to this, are there any plans to derive(Copy) for ObjectId? Its current impl is stack allocated, fixed size, and smaller than the pointer for a &[u8] on 64 bit platforms, and would clean up a lot of code.

@saghm
Copy link
Contributor

saghm commented Feb 19, 2021

Tangential to this, are there any plans to derive(Copy) for ObjectId? Its current impl is stack allocated, fixed size, and smaller than the pointer for a &[u8] on 64 bit platforms, and would clean up a lot of code.

Yeah, I think this probably is worth doing. Feel free to open a separate PR with this change!

@jcdyer
Copy link
Contributor Author

jcdyer commented Apr 14, 2021

Where does this stand? From what I can see, it seems to have stalled out again. Is there anything I can do to help get it moving again?

@patrickfreed
Copy link
Contributor

We had to put this on the backburner for a little while to focus on getting the 2.0 release of the driver out as well as a few other internal priorities. We'll be picking this up again in the near term future after the 2.0 stable release is out. Thanks again for your continued patience and all the work you've put into this so far, and sorry about the delay!

@patrickfreed
Copy link
Contributor

Hey @jcdyer, just wanted to let you know that we've picked this up again now that the 2.0 has been released. As a first step, I'll be tidying things up and addressing any of the outstanding comments from the last round of reviews. After that's done, I'll open it up to the team for another review. As with last time, feel free to chime in with any comments as you see fit.

Thanks again for getting the ball rolling on this!

Copy link
Contributor

@patrickfreed patrickfreed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The recent batch of commits made a number of API and implementation changes, most notably:

  • Rewrote RawBson as an enum. If we want to introduce an API that matches the old RawBson (i.e. a struct that wraps a reference to a slice of bytes), then I think an Entry API might be a good opportunity for that.
  • Renamed RawDocumentRef to RawDoc. This name matches the convention of String/str, and it is also slightly more correct, since the type isn't a reference itself but has to be referred to as part of one, being dynamically sized.
    • Something like RawDocumentBuf/RawDocument would also work fine IMO (matches PathBuf/Path)
  • Renamed RawArray to RawArr. This was done for similar reasons as the RawDoc rename, and it also makes room for an owned version of RawArray should we need it.
  • Baked in a bunch of bounds checking into the code that parses the raw BSON into Rust types.
  • Introduced a ValueAccessError type for the typed getters. Originally, we were planning on moving away from this paradigm, but given that we decided to keep it for the typed getters on Document, I think it made sense to reintroduce it here.
  • Broke off error enums into ErrorKind and ValueAccessErrorKind to enable storing of the key / other metadata on the actual error structs.
  • Implemented Copy for Decimal128
  • Introduced a RawDbPointer type
  • Removed the RawTimestamp type - we can just rely on the regular one since it's Copy
  • Hooked up the raw document API to the corpus tests
  • Touched up the documentation
  • Moved code into separate files (document.rs and iter.rs are new files containing mostly code that already existing in the PR, props.rs was simply moved to the test directory)
  • Marked RawDoc and RawArr as #[repr(transparent)], which provides the necessary safety guarantees for doing the casts we do in new_unchecked. This means we'll need to commit to the byte layout of these types too in addition to their APIs.

I filed a ticket to cover the work for introducing an append method to RawDocument, which can be done separately from this PR.

I think this PR is good to be reviewed again, let me know if you have any questions about the above changes.

src/raw/array.rs Outdated
}
}

/// An iterator over borrwed raw BSON array values.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Contributor

@abr-egn abr-egn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

src/raw/doc.rs Outdated
/// Note that the internal structure of the bytes representing the
/// BSON elements is _not_ validated at all by this method. If the
/// bytes do not conform to the BSON spec, then method calls on
/// the RawDocument will return Errors where appropriate.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// the RawDocument will return Errors where appropriate.
/// the RawDoc will return Errors where appropriate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed this and a few of the other non-linkified references to types in this file.

src/raw/array.rs Outdated
inner: Iter<'a>,
}

impl<'a> Iterator for RawArrIter<'a> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One validation we're missing here is checking to make sure that the keys in the array are well-formed (i.e. incrementing integers starting with 0). This could lead to some confusing behavior in which malformed BSON bytes that include an "\x04" tag followed by a regular document are successfully interpreted as an array. I think we should add some checks here to ensure that the document we're dealing with is a proper array-style document.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surprisingly, accepting array documents with messed up indexes is actually required by the bson corpus tests. e.g. the "Single Element array with index set incorrectly to ab" test is included under "valid": https://github.com/mongodb/specifications/blob/master/source/bson-corpus/tests/array.json#L23

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, that's interesting.

Cargo.toml Outdated
uuid = { version = "0.8.1", features = ["serde", "v4"] }
proptest = "0.10"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like this crate is now at 1.0.0 -- should we bump the dependency?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, fixed.

Copy link
Contributor

@isabelatkinson isabelatkinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

src/raw/array.rs Outdated
inner: Iter<'a>,
}

impl<'a> Iterator for RawArrIter<'a> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, that's interesting.

@patrickfreed
Copy link
Contributor

Sounds good, went ahead and renamed RawDocument to RawDocumentBuf and RawDoc to RawDocument. I also renamed RawArr to RawArray and rexported all three types from lib.rs as part of that.

@patrickfreed patrickfreed merged commit a5c1e20 into mongodb:master Oct 25, 2021
@patrickfreed
Copy link
Contributor

Thanks again for spearheading this @jcdyer; we're super happy to have this finally merged!

@jcdyer
Copy link
Contributor Author

jcdyer commented Oct 26, 2021

@patrickfreed That's fantastic! Looking forward to seeing the new speediness in action.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tracked-in-jira Ticket filed in Mongo's Jira system
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants