-
Notifications
You must be signed in to change notification settings - Fork 144
RUST-870 Support deserializing directly from raw BSON #276
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RUST-870 Support deserializing directly from raw BSON #276
Conversation
This reverts commit a5a08dada7bd2c465084d4dd0ae9201140d259c4.
@@ -180,9 +180,9 @@ axes: | |||
- id: "extra-rust-versions" | |||
values: | |||
- id: "min" | |||
display_name: "1.43 (minimum supported version)" | |||
display_name: "1.48 (minimum supported version)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The driver is currently on 1.47, but in order to ergonomically convert between a Vec<u8>
and an array of bytes for Decimal128
, I needed Rust 1.48. I'm sort of thinking we should consider bumping this all the way to the latest stable right before the 2.0.0 release to give us the newest possible version to work with for a while. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sgtm as well
|
||
use bson::{Bson, Deserializer, Serializer}; | ||
|
||
macro_rules! bson { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests used separate macros, I think because they were written before the public doc!
and bson!
macros existed. I updated them to use the crate's actual macros and to test both deserialization techniques.
@@ -297,7 +300,11 @@ impl From<i64> for Bson { | |||
|
|||
impl From<u32> for Bson { | |||
fn from(a: u32) -> Bson { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was leading to integers that were too big wrapping around. See RUST-882 for more info.
@@ -583,7 +587,7 @@ impl Bson { | |||
"$oid": v.to_string(), | |||
} | |||
} | |||
Bson::DateTime(v) if v.timestamp_millis() >= 0 && v.to_chrono().year() <= 99999 => { | |||
Bson::DateTime(v) if v.timestamp_millis() >= 0 && v.to_chrono().year() <= 9999 => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See RUST-884
@@ -345,6 +272,122 @@ pub(crate) fn deserialize_bson_kvp<R: Read + ?Sized>( | |||
Ok((key, val)) | |||
} | |||
|
|||
impl Binary { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these impls were moved mostly as-is from the above match
so that they could be reused in the raw deserializer.
where | ||
V: MapAccess<'de>, | ||
{ | ||
let values = DocumentVisitor::new().visit_map(visitor)?; | ||
Ok(Bson::from_extended_document(values)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic here is much the same, but it was updated to make use of serde's MapAccess
API for the extended JSON format parts to achieve better performance. Prior to these changes, a Document
would allocated and deserialized each time, and then its format would be parsed. Instead, we deserialize the extended format directly, allowing us to go straight from Raw BSON to our BSON types where possible. If the input data doesn't match an extended JSON format, it's just deserialized to a Document
.
@@ -684,52 +683,6 @@ impl<'a> OccupiedEntry<'a> { | |||
} | |||
} | |||
|
|||
pub(crate) struct DocumentVisitor { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic was implemented directly in BsonVisitor
and wasn't needed anymore.
.expect(&description); | ||
|
||
let mut native_to_bson_bson_to_native_cv = Vec::new(); | ||
let canonical_bson = hex::decode(&valid.canonical_bson).expect(&description); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I expanded the corpus tests to cover both deserializers, as they weren't doing so before. This uncovered a few bugs.
src/tests/spec/corpus.rs
Outdated
let bson = hex::decode(&decode_error.bson).expect("should decode from hex"); | ||
Document::from_reader(bson.as_slice()).expect_err(decode_error.description.as_str()); | ||
|
||
// the from_reader implementation supports deserializing from lossy UTF-8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See RUST-886
@@ -172,26 +273,6 @@ fn run_test(test: TestFile) { | |||
} | |||
} | |||
|
|||
// native_to_bson( bson_to_native(dB) ) = cB | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these were shifted earlier so that they can be run even without the decimal128 flag.
@@ -180,9 +180,9 @@ axes: | |||
- id: "extra-rust-versions" | |||
values: | |||
- id: "min" | |||
display_name: "1.43 (minimum supported version)" | |||
display_name: "1.48 (minimum supported version)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM.
src/de/raw.rs
Outdated
/// Read the next element type and update the root deserializer with it. | ||
/// | ||
/// Returns `Ok(None)` if the document has been fully read and has no more elements. | ||
fn read_next_tag(&mut self) -> Result<Option<ElementType>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this might be easier to follow if Deserializer
had a fn that read, parsed the tag, and updated current_type
, and this one called that and checked/updated length_remaining
.
let subtype = BinarySubtype::from(read_u8(&mut self.bytes)?); | ||
match subtype { | ||
BinarySubtype::Generic => { | ||
visitor.visit_borrowed_bytes(self.bytes.read_slice(len as usize)?) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making sure I'm following: this means that generic Binary
values will deserialize to &[u8]
where all other subtypes will be a {"$binary: {"subType": ..., "base64": ...}}
map structure, right? Which I guess means a data structure intended to be deserialized from raw BSON must conform to either the generic shape, or the everything-else shape, and can't accept both.
src/de/raw.rs
Outdated
Ok(()) | ||
} | ||
|
||
fn read_cstr(&mut self) -> Result<&'a str> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are utf8 errors here a hard failure rather than the lossy behavior of read_str
(or the other way around)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the common case for invalid UTF-8 was server error message values rather than in the keys, I opted to only do the lossy handling for those. That being said, it probably makes more sense to be consistent throughout. Once we decide a way forward on the lossy UTF-8 situation, I'll unify this with read_str
.
serde-tests/test.rs
Outdated
/// - deserializing from the serialized document produces `expected_value` | ||
/// - round trip through raw BSON: | ||
/// - deserializing a `T` from the raw BSON version of `expected_doc` produces `expected_value` | ||
/// - desierializing a `Document` from the raw BSON version of `expected_doc` produces |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: "deserializing"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
@@ -180,9 +180,9 @@ axes: | |||
- id: "extra-rust-versions" | |||
values: | |||
- id: "min" | |||
display_name: "1.43 (minimum supported version)" | |||
display_name: "1.48 (minimum supported version)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sgtm as well
serde-tests/test.rs
Outdated
"a": 1, | ||
"b": 2, | ||
}; | ||
bson::from_document::<Foo>(doc.clone()).expect_err("extra filds should cause failure"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo here ("filds" -> "fields")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
@@ -182,7 +185,7 @@ impl Debug for Bson { | |||
|
|||
impl From<f32> for Bson { | |||
fn from(a: f32) -> Bson { | |||
Bson::Double(a as f64) | |||
Bson::Double(a.into()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this just a stylistic change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, to signify that this is in fact lossless, whereas as
can be lossy depending on the conversion.
/// This is mainly useful when reading raw BSON returned from a MongoDB server, which | ||
/// in rare cases can contain invalidly truncated strings (https://jira.mongodb.org/browse/SERVER-24007). | ||
/// For most use cases, `bson::from_slice` can be used instead. | ||
pub fn from_reader_utf8_lossy<R, T>(reader: R) -> Result<T> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
per our discussion in slack, I added new functions for the lossy UTF-8 approach.
RUST-870
This PR introduces two new functions to the public API which allow for deserializing a
T
directly from raw BSON bytes instead of going throughDocument
first:from_reader
andfrom_slice
. The primary goal of these functions is to enable performance improvements in the driver, though they should be generally useful on their own. This PR also enables borrowed deserialization viafrom_slice
(#231, RUST-688).The implementation of this involved introducing a new deserializer that reads directly from raw BSON. The code for this lives in
srd/de/raw.rs
, and is modeled off of the Implementing a Deserializer example andserde_json
'sDeserializer
. I encourage reading through the example and having both handy while reviewing this PR.This also fixes RUST-880, RUST-884, and partially RUST-882.