
RUST-870 Support deserializing directly from raw BSON #276


Merged

Conversation

Contributor

@patrickfreed commented Jul 2, 2021

RUST-870

This PR introduces two new functions to the public API which allow for deserializing a T directly from raw BSON bytes instead of going through Document first: from_reader and from_slice. The primary goal of these functions is to enable performance improvements in the driver, though they should be generally useful on their own. This PR also enables borrowed deserialization via from_slice (#231, RUST-688).

The implementation involved introducing a new deserializer that reads directly from raw BSON. The code for this lives in src/de/raw.rs and is modeled on serde's Implementing a Deserializer example and serde_json's Deserializer. I encourage reading through the example and keeping both handy while reviewing this PR.

This also fixes RUST-880, RUST-884, and partially RUST-882.

@@ -180,9 +180,9 @@ axes:
- id: "extra-rust-versions"
values:
- id: "min"
display_name: "1.43 (minimum supported version)"
display_name: "1.48 (minimum supported version)"
Contributor Author

The driver is currently on 1.47, but in order to ergonomically convert between a Vec<u8> and an array of bytes for Decimal128, I needed Rust 1.48. I'm sort of thinking we should consider bumping this all the way to the latest stable right before the 2.0.0 release to give us the newest possible version to work with for a while. Thoughts?

Contributor

SGTM.

Contributor

sgtm as well


use bson::{Bson, Deserializer, Serializer};

macro_rules! bson {
Contributor Author

These tests used separate macros, I think because they were written before the public doc! and bson! macros existed. I updated them to use the crate's actual macros and to test both deserialization techniques.

@@ -297,7 +300,11 @@ impl From<i64> for Bson {

impl From<u32> for Bson {
fn from(a: u32) -> Bson {
Contributor Author

This was leading to integers that were too big wrapping around. See RUST-882 for more info.
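A sketch of the intended behavior with simplified stand-in types (not the crate's exact code): a u32 that doesn't fit in an i32 should widen to a 64-bit BSON integer rather than wrap.

```rust
use std::convert::TryFrom;

#[derive(Debug, PartialEq)]
enum Bson {
    Int32(i32),
    Int64(i64),
}

impl From<u32> for Bson {
    fn from(a: u32) -> Bson {
        // Previously `a as i32` would wrap for values above i32::MAX.
        match i32::try_from(a) {
            Ok(i) => Bson::Int32(i),
            Err(_) => Bson::Int64(i64::from(a)), // u32 always fits in i64
        }
    }
}

fn main() {
    assert_eq!(Bson::from(7u32), Bson::Int32(7));
    assert_eq!(Bson::from(u32::MAX), Bson::Int64(4294967295));
}
```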

@@ -583,7 +587,7 @@ impl Bson {
"$oid": v.to_string(),
}
}
Bson::DateTime(v) if v.timestamp_millis() >= 0 && v.to_chrono().year() <= 99999 => {
Bson::DateTime(v) if v.timestamp_millis() >= 0 && v.to_chrono().year() <= 9999 => {
Contributor Author

See RUST-884

@@ -345,6 +272,122 @@ pub(crate) fn deserialize_bson_kvp<R: Read + ?Sized>(
Ok((key, val))
}

impl Binary {
Contributor Author

These impls were moved mostly as-is from the match expression above so that they could be reused in the raw deserializer.

where
V: MapAccess<'de>,
{
let values = DocumentVisitor::new().visit_map(visitor)?;
Ok(Bson::from_extended_document(values))
Contributor Author

The logic here is much the same, but it was updated to use serde's MapAccess API for the extended JSON format parsing to achieve better performance. Prior to these changes, a Document would be allocated and deserialized each time, and then its format would be parsed. Now we deserialize the extended format directly, allowing us to go straight from raw BSON to our BSON types where possible. If the input data doesn't match an extended JSON format, it's simply deserialized to a Document.

@@ -684,52 +683,6 @@ impl<'a> OccupiedEntry<'a> {
}
}

pub(crate) struct DocumentVisitor {
Contributor Author

This logic was implemented directly in BsonVisitor and wasn't needed anymore.

.expect(&description);

let mut native_to_bson_bson_to_native_cv = Vec::new();
let canonical_bson = hex::decode(&valid.canonical_bson).expect(&description);
Contributor Author

I expanded the corpus tests to cover both deserializers, as they weren't doing so before. This uncovered a few bugs.

let bson = hex::decode(&decode_error.bson).expect("should decode from hex");
Document::from_reader(bson.as_slice()).expect_err(decode_error.description.as_str());

// the from_reader implementation supports deserializing from lossy UTF-8
Contributor Author

See RUST-886

@@ -172,26 +273,6 @@ fn run_test(test: TestFile) {
}
}

// native_to_bson( bson_to_native(dB) ) = cB

Contributor Author

These were shifted earlier so that they can run even without the decimal128 feature flag enabled.

@patrickfreed patrickfreed marked this pull request as ready for review July 2, 2021 20:00

src/de/raw.rs Outdated
/// Read the next element type and update the root deserializer with it.
///
/// Returns `Ok(None)` if the document has been fully read and has no more elements.
fn read_next_tag(&mut self) -> Result<Option<ElementType>> {
Contributor

I think this might be easier to follow if Deserializer had a fn that read, parsed the tag, and updated current_type, and this one called that and checked/updated length_remaining.

let subtype = BinarySubtype::from(read_u8(&mut self.bytes)?);
match subtype {
BinarySubtype::Generic => {
visitor.visit_borrowed_bytes(self.bytes.read_slice(len as usize)?)
Contributor

Making sure I'm following: this means that generic Binary values will deserialize to &[u8], while all other subtypes will be a {"$binary": {"subType": ..., "base64": ...}} map structure, right? Which I guess means a data structure intended to be deserialized from raw BSON must conform to either the generic shape or the everything-else shape, and can't accept both.

src/de/raw.rs Outdated
Ok(())
}

fn read_cstr(&mut self) -> Result<&'a str> {
Contributor

Why are utf8 errors here a hard failure rather than the lossy behavior of read_str (or the other way around)?

Contributor Author

Since the common case for invalid UTF-8 was server error message values rather than in the keys, I opted to only do the lossy handling for those. That being said, it probably makes more sense to be consistent throughout. Once we decide a way forward on the lossy UTF-8 situation, I'll unify this with read_str.

/// - deserializing from the serialized document produces `expected_value`
/// - round trip through raw BSON:
/// - deserializing a `T` from the raw BSON version of `expected_doc` produces `expected_value`
/// - desierializing a `Document` from the raw BSON version of `expected_doc` produces
Contributor

Typo: "deserializing"

Contributor Author

fixed


"a": 1,
"b": 2,
};
bson::from_document::<Foo>(doc.clone()).expect_err("extra filds should cause failure");
Contributor

typo here ("filds" -> "fields")

Contributor Author

fixed

@@ -182,7 +185,7 @@ impl Debug for Bson {

impl From<f32> for Bson {
fn from(a: f32) -> Bson {
Bson::Double(a as f64)
Bson::Double(a.into())
Contributor

is this just a stylistic change?

Contributor Author

Yep, to signify that this conversion is in fact lossless, whereas an as cast can be lossy depending on the conversion.

/// This is mainly useful when reading raw BSON returned from a MongoDB server, which
/// in rare cases can contain invalidly truncated strings (https://jira.mongodb.org/browse/SERVER-24007).
/// For most use cases, `bson::from_slice` can be used instead.
pub fn from_reader_utf8_lossy<R, T>(reader: R) -> Result<T>
Contributor Author

Per our discussion in Slack, I added new functions for the lossy UTF-8 approach.

cc @abr-egn @isabelatkinson

@patrickfreed patrickfreed merged commit 7ccf82b into mongodb:master Jul 8, 2021