Skip to content

RUST-1082 Implement Serialize and Deserialize for raw BSON types #318

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

patrickfreed
Copy link
Contributor

RUST-1082

This PR adds Serialize and Deserialize implementations for the raw BSON types.

e.g. the following is now possible:

#[derive(Debug, Deserialize, Serialize)]
struct MyData<'a> {
    name: &'a str,

    #[serde(borrow)]
    binary: RawBinary<'a>,

    #[serde(borrow)]
    doc: &'a RawDocument,

    #[serde(borrow)]
    some_field: RawBson<'a>,
}

The implementations behave as follows:

  • &RawDocument:
    • for human readable formats: iterate over document and serialize each kvp
    • for binary formats: serialize the document as a newtype around a slice of bytes.
      • The BSON serializers handle this by interpreting the name of the newtype
    • Open Question: should this use something like { "$rawBsonDocument": <bytes> } for binary, non-BSON formats? This will allow RawBson to round trip, but that's about it.
  • RawDocumentBuf: same as RawDocument. In a future PR it will be able to deserialize from arbitrary maps, but that requires the ability to append.
  • RawBson:
    • Defer to the wrapped type
    • The types that map directly to the serde data model (e.g. integers) are deserialized as-is, similar to Bson.
    • For everything else, the extended JSON serde data model equivalent is used, also similar to Bson
      • for many types, the model needed to be changed a little bit to ensure to extra allocations are done
  • &RawArray
    • Same as &RawDocument, but uses a different newtype name to signal an array.
  • RawJavaScriptCodeWithScope
    • serialize to/from: { "$code": <borrowed string>, "$scope": <borrowed string> }
  • RawDbPointer
    • serialize to/from: { "$ref": <borrowed string>, "$id": <objectid> }
  • ObjectId
    • Same as before, but it can now also deserialize from { "$oid": <12 byte slice> }
  • RawBinary
    • On human readable platforms, serializes to ext JSON format in serde data model
    • On binary formats, serializes to { "$binary": { "bytes": <borrowed bytes>, subType: <u8> } } in the serde data model
    • In both cases, the corresponding serializer handles it and correctly encodes as BSON
    • In order for RawBinary to round trip on non-BSON formats, the separate bytes model (rather than base64 string model) was required. It is also much faster.
    • Unfortunately, we can't also update Binary to exhibit this behavior, as it would technically be a breaking change.
  • RawRegex:
    • serialize to/from extended JSON format but uses borrowed instead of owned strings
  • DateTime: updated to also deserialize from { "$date": <i64> } to avoid a string allocation during deserialization.

}

#[test]
fn all_types() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is the same, but I moved the struct definition and initialization out of the body to reuse it in other tests.

"max_key": { "$maxKey": 1 },
});

assert_eq!(serde_json::to_value(&v).unwrap(), json);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test only verifies that serialization works, since deserializing binary values from JSON is very limited (serde_json treats byte slices as arrays of integers, which Binary can't round trip from). I filed RUST-1091 to cover the work for adding more robust tests in this area though.

/// Hint provided to the deserializer via `deserialize_newtype_struct` as to the type of thing
/// being deserialized.
#[derive(Debug, Clone, Copy)]
enum DeserializerHint {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I introduced this hint type as a way of tracking whether or not we're deserializing raw BSON types. If we are, we use slightly different visiting patterns. For example, a document will be visited as a slice of bytes rather than as a map. It's also used for the UUID serialization hinting behavior introduced in #314. The hint communicated for raw BSON the same way it is for UUIDs, i.e. via a specifically named newtype struct. See deserialize_newtype_struct for where that happens.

) -> Result<V::Value>
/// Deserialize the next element in the BSON, using the type of the element along with the
/// provided hint to determine how to visit the data.
fn deserialize_next<V>(&mut self, visitor: V, hint: DeserializerHint) -> Result<V::Value>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to borrow from the underling BSON data, we had to introduce new MapAccess implementations for the various types that didn't get them in the initial implementation (e.g. Regex).

}

#[derive(Debug, Clone, Copy)]
enum SerializerHint {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to deserialization, we use a hint here to indicate what the next call to serialize_bytes is serializing.

@@ -59,6 +65,34 @@ struct ParseError {
string: String,
}

struct FieldVisitor<'a, T>(&'a str, PhantomData<T>);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type is used to deserialize the value provided by a valid corpus test. We need to do some runtime hackery since the field name isn't the same per test and is instead provided in the field_name field.

@patrickfreed patrickfreed marked this pull request as ready for review November 2, 2021 15:57
Copy link
Contributor

@abr-egn abr-egn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Extremely minor comments only.

On a larger scope, it's difficult for me to keep all the cogs in my head to get a sense of the overall machine operation - a walkthrough would be very welcome at some point.

src/de/raw.rs Outdated
/// Whether the first key has been deserialized yet or not.
deserialized_first: bool,

/// Whether or not this document being deserialized is for anarray or not.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: "anarray" => "an array"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

src/raw/bson.rs Outdated
where
E: serde::de::Error,
{
crate::de::convert_unsigned_to_signed_raw(value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: might be worth just useing convert_unsigned_to_signed_raw?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Contributor Author

@patrickfreed patrickfreed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree, this code is pretty hard to read due to the sheer amount of boilerplate involved, especially across multiple serializer/deserializer implementations. I filed RUST-1098 to cover the work for writing up a doc that explains it all at a high level.

Copy link
Contributor

@isabelatkinson isabelatkinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm with one typo fix

RawBson::Document(d) => Ok(d),

// For non-BSON formats, RawDocument gets serialized as bytes, so we need to deserialize
// from them here too. For BSON, the deserialzier will return an error if it
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// from them here too. For BSON, the deserialzier will return an error if it
// from them here too. For BSON, the deserializer will return an error if it

@patrickfreed patrickfreed merged commit e47b276 into mongodb:master Nov 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants