Skip to content

Fix metadata deserialization in async mode for PGVector #125

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 20 commits into
base: main
Choose a base branch
from

Conversation

shamspias
Copy link

Problem

When using asynchronous methods with PGVector (async_mode=True), the metadata field retrieved from the database may be of type Fragment (from asyncpg) or other non-dict types. This causes a ValidationError when the Document class expects metadata to be a dictionary.

Solution

This pull request modifies the _results_to_docs_and_scores method to ensure that metadata is correctly converted into a dictionary before creating Document instances. The method now handles different possible types of metadata and attempts to deserialize it into a dict.

Changes

  • Updated _results_to_docs_and_scores method in PGVector class to handle metadata deserialization for different types (e.g., dict, str, Fragment).

Testing

  • Tested with async_mode=True and confirmed that the metadata field is correctly deserialized and no longer causes validation errors.
  • Ensured that the change does not affect the behavior when async_mode=False.

Related Issues

@jaimeescano
Copy link

Facing this very same issue. Wondering when this commit could be merged.
Big credits to @shamspias for providing the fix.

Regards

@eyurtsev eyurtsev self-assigned this Oct 15, 2024
@shamspias shamspias requested a review from eyurtsev October 16, 2024 07:09
@shamspias
Copy link
Author

@eyurtsev, please let me know if there is anything else I need to do. If not, please merge it.

@simadimonyan
Copy link

Did you fix the issue? I have related: langchain-ai/langchain#28029

@shamspias
Copy link
Author

Did you fix the issue? I have related: langchain-ai/langchain#28029

yes I did fixed the issue you can use the branch if you like to

@samsiuatpurple
Copy link

Good work! It has fixed an issue in my langgraph app. Can a maintainer merge this as it's a blocker for us. Thanks!

@lucelsbyleverageAI
Copy link

Hi @shamspias @samsiuatpurple - I just ran into the same issue and used your PR code. It prevented the error but now it's just returning empty metadata. Assume you aren't having the same?

@shamspias
Copy link
Author

Hi @shamspias @samsiuatpurple - I just ran into the same issue and used your PR code. It prevented the error but now it's just returning empty metadata. Assume you aren't having the same?

It works fine, I run several project with this PR.

@fritzebner
Copy link

Can someone tell us when this change will be in a release? When is the next release for langchain-postgres?

@shamspias
Copy link
Author

Can someone tell us when this change will be in a release? When is the next release for langchain-postgres?

same questions

Copy link
Contributor

@ccurme ccurme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @shamspias, are you interested in iterating on this? It still needs some work:

  • The code breaks out the gate (json is not imported)
  • Most importantly, there are no tests.

The most useful thing would be a unit test or reproducible example demonstrating how to generate non-dict metadata.

elif isinstance(metadata, str):
metadata = json.loads(metadata)
elif hasattr(metadata, 'buf'):
# For Fragment types (e.g., from asyncpg)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only psycopg3 is supported

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @eyurtsev, thanks for the review! I understand that only psycopg3 is officially supported. However, I’ve received reports of issues in async mode that suggest some users might be encountering non‑dict metadata (perhaps inadvertently using asyncpg or similar drivers).

This patch adds defensive logic to convert metadata that isn’t already a dict (for example, when it’s a JSON string, a Fragment‑like object with a buf attribute, or an object with a decode() method) into a proper dict. This conversion will only trigger in cases where the metadata isn’t already a dict—so for psycopg3 users nothing changes.

I’ve also added unit tests to simulate these scenarios and ensure the conversion works as expected. Please let me know if you’d like any adjustments or if you think we should further restrict this behavior given our psycopg3-only support.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shamspias feel free to @ me if I don't respond quickly enough.

Are you able to create a minimal reproduction against the actual vectorstore? If so, you can send me the code snippet and I'm happy to update the tests myself.

This patch adds defensive logic to convert metadata that isn’t already a dict (for example, when it’s a JSON string, a Fragment‑like object with a buf attribute, or an object with a decode() method) into a proper dict. This conversion will only trigger in cases where the metadata isn’t already a dict—so for psycopg3 users nothing changes.

Can you confirm that this is specifically from asyncpg where you're seeing the failures?

We definitely don't want to mock the results from asyncpg. If we want to support asynpcg, the way to do it is to run the full suite of tests with that driver.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @eyurtsev, I’ve put together a minimal reproduction that runs against a real Postgres instance (with pgvector) using asyncpg. The test confirms that the defensive logic for non-dict metadata triggers correctly without mocking. Let me know if you’d like the code snippet or any adjustments!

@@ -0,0 +1,99 @@
import sqlalchemy
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll want to update the unit tests to reproduce behavior against the actual driver / database rather than rely on mocks. The code in question is designed specifically to work with a database, so requires more systematic testing.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @eyurtsev! I’ve replaced the mocked unit tests with an integration test that runs against a live Postgres instance (with pgvector). It ensures the metadata deserialization logic is tested with the actual driver and database, just as you requested. Let me know if there’s anything else you’d like changed!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eyurtsev After testing with new version of psycopg3, I can confirm that the issue has been resolved. I no longer encounter any errors during metadata deserialization, and the problem is no longer reproducible. Thanks for your support! However, it might be beneficial to update the code to include an extra condition as a safeguard for the future, ensuring similar issues are avoided.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants