Skip to content

marshmallow_dataclasses.class_schema() fails in multithreaded app due to thread-safety issue #282

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
vitaly-krugl opened this issue May 4, 2025 · 4 comments

Comments

@vitaly-krugl
Copy link

vitaly-krugl commented May 4, 2025

I came across a marshmallow-dataclass-related race condition in my new code that was resulting in the marshmallow exception when using class_schema() while marshalling and unmarshalling dataclasses from threads of a multi-threaded application:

marshmallow.exceptions.RegistryError: Class with name '{{a marshmallow data class that I defined}}' was not found. You may need to import the class

I root caused the exception to the following. My code was invoking marshmallow_dataclass.class_schema() from threads:

request = class_schema({{any one of a number of my dataclasses}})().load(…)  

It turns out that class_schema() may invoke marshmallow.class_registry.register(), which is not thread-safe. It performs operations on a global dict, and marshmallow.class_registry.register() is in excess of a single GIL-protected byte code.

I have not yet come up with a simple test case, but it's easy to see via simple code inspection that marshmallow.class_registry.register() (and class_schema() as a result) is not thread-safe: https://github.com/marshmallow-code/marshmallow/blob/a651fbcb85f4a2fc1bbc9df168e66476854fe178/src/marshmallow/class_registry.py#L26-L69.

I was able to work around the race condition by forcing generation of the schema at import time by calling class_schema() on all of my dataclasses at the top level of the module. However, in a large code base with multiple contributors, it's nearly impossible to ensure that a workaround like that won't be missed and show up in production, since some of the scenarios might not trigger the race condition in the test environment, but would show up in production later.

On a related note:

  1. Why is generation of the Schema deferred until .Schema() or class_schema()() is invoked? Presumably, if someone is using marshmallow_dataclass.dataclass, they will be marshalling and/or unmarshalling with that class, so pre-generating the schema wouldn't be a waste. And if they are not using it for marshalling/unmarshalling, then they should just be using the builtin dataclasses.dataclass, instead.
  2. If there is not going to be a fix for this race condition, what's a full proof workaround (that's not error-prone)?
@dairiki
Copy link
Collaborator

dairiki commented May 5, 2025

  1. Why is generation of the Schema deferred until .Schema() or class_schema()() is invoked?

Because forward references are allowed in type annotations. E.g.

from __future__ import annotations

from dataclasses import dataclass

@dataclass
class A:
    b: B

# Cannot construct schema for A yet, because type B is not yet defined.

@dataclass
class B:
    x: int

I haven't looked at this in a while, but I vaguely recall that things can get even more twisted when types are defined in function-local scopes.

@dairiki
Copy link
Collaborator

dairiki commented May 5, 2025

It turns out that class_schema() may invoke marshmallow.class_registry.register(), which is not thread-safe.

Again, I haven't looked at this for a bit, so I may be mistaken, but my gut response is:

Constructing Marshmallow schemas is not thread-safe. The metaclass for marshmallow.Schema invokes class_register.register, therefore, defining subclasses of marshmallow.Schema is not thread-safe. (That may be what you are already saying.)

That being the case, my first reaction is that we (marshmallow_dataclass) should not be expected to go to any pains to make class_schema() (our construction process for creating Marshmallow schema classes) any more thread-safe than Marshmallow is. If creating schema classes in Marshmallow is not thread-safe, it should not be surprising that using marshmallow_dataclass to construct Marshmallow schema classes is similarly un-thread-safe. (Creating instances of Marshmallow schema classes is thread-safe. At least, it should be.)

@vitaly-krugl
Copy link
Author

vitaly-krugl commented May 5, 2025 via email

@sinback
Copy link

sinback commented May 5, 2025

@dairiki I'm only becoming familiar with this issue now, but I think the Marshmallow maintainers recently (~5 months ago) might be making progress towards fixing this in their experimental module marshmallow-code/marshmallow#2707. I think first release it became available in is 4.0.0. But... I'm not yet 100% sure if the Context object they're now allowing to be a non-dict could help the class registry though since I can't tell for sure what Context is other than a project-internal context object.

Many issues involving non-threadsafety of nested Schema definitions have been closed after they merged this PR, so I'm hopeful the context object is used by the class_registry and that would mean this change would help the class_registry become thread-safe somehow?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants