Check write snapshot compatibility #1678

Open
Fokko opened this issue Feb 18, 2025 · 6 comments

@Fokko
Contributor

Fokko commented Feb 18, 2025

Feature Request / Improvement

Java and Python have a different approach here. I don't have all the historical context, but prior to Iceberg V2 tables, there was no such thing as operations:

Image

I think this is a good thing to validate against.

This should happen in the _commit method of the _SnapshotProducer. Similar to Java:

There's also a small section on conflict resolution.

- When doing an `Append`: Adding new data
  - All okay: concurrent `{Append,Replace,Overwrite,Delete}` commits don't affect the operation, so we can just append.
- When doing a `Replace`: Replacing existing data (e.g. compaction)
  - Ok: Append
  - Not ok: Replace, Overwrite, Delete. We should fail for now; later we can check whether there is any overlap (e.g. compare whether they touch the same partitions).
- When doing an `Overwrite`: Adding and deleting data
  - Not ok: Append, Replace, Overwrite, Delete. We should fail for now; later we can check whether there is any overlap (e.g. compare whether they touch the same partitions).
- When doing a `Delete`
  - Not ok: Append, Replace, Overwrite, Delete. We should fail for now; later we can check whether there is any overlap (e.g. compare whether they touch the same partitions/predicate). We should also take into account the difference between MoR and CoW.
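
To make the rules above concrete, here is a minimal sketch of how the simple cases could be encoded as a lookup table. The names `Operation`, `ALLOWED_CONCURRENT`, and `validate_concurrent_operations` are hypothetical and written only for illustration; this is not the existing PyIceberg API, and it ignores the partition-overlap and MoR/CoW refinements mentioned above.

```python
from enum import Enum
from typing import Iterable


class Operation(Enum):
    # Stand-in for the four snapshot operations named in the list above.
    APPEND = "append"
    REPLACE = "replace"
    OVERWRITE = "overwrite"
    DELETE = "delete"


# Hypothetical table: for each operation being committed, the set of
# concurrently committed operations that are safe to ignore.
ALLOWED_CONCURRENT = {
    Operation.APPEND: frozenset(Operation),            # everything is okay
    Operation.REPLACE: frozenset({Operation.APPEND}),  # only appends are okay
    Operation.OVERWRITE: frozenset(),                  # fail on anything
    Operation.DELETE: frozenset(),                     # fail on anything
}


def validate_concurrent_operations(
    committing: Operation, concurrent: Iterable[Operation]
) -> None:
    """Raise ValueError if any concurrently committed operation conflicts."""
    allowed = ALLOWED_CONCURRENT[committing]
    for op in concurrent:
        if op not in allowed:
            raise ValueError(
                f"Cannot commit {committing.value}: "
                f"conflicting concurrent {op.value}"
            )
```

With this shape, the later refinements (comparing partitions or predicates) would replace the unconditional failure for a given pair of operations without changing the callers.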

Let's only do the very simple cases at first, so we can add the others one by one and keep the PR within a reasonable size.

Once we have this in place, we can also do automatic retries: #269

@Fokko
Contributor Author

Fokko commented Feb 19, 2025

Let me know if anyone is interested in contributing this, otherwise I'll take a stab at it myself 🤗

@kaushiksrini
Contributor

hey @Fokko, i'd be interested in contributing this if you haven't started already!

@Fokko
Contributor Author

Fokko commented Feb 21, 2025

@kaushiksrini I haven't, feel free to pick this up 👍

@Fokko
Contributor Author

Fokko commented Mar 3, 2025

@kaushiksrini Gentle ping, updates on this? I think a lot of folks would benefit from having this. If you don't have the time, I'm also happy to take a stab at it

@kaushiksrini
Contributor

Hey @Fokko, actively working on this - should have a PR out soon. Had a few questions:

  1. From the _SnapshotProducer class, what function should I call to refresh the table and get the latest snapshot available?
  2. To fetch snapshots between two IDs, I see there is a utility function in Iceberg that returns a list. I couldn't find it in the Python client - could you point me to where this function would exist?

Thanks!

@Fokko
Contributor Author

Fokko commented Mar 20, 2025

Thanks @kaushiksrini for picking this up. My apologies, I missed this comment:

> From the _SnapshotProducer class, what function should I call to refresh the table and get the latest snapshot available?

Looking at the PR, you already found this :)

> To fetch snapshots between two IDs, I see there is a utility function in Iceberg that returns a list. I couldn't find it in the Python client - could you point me to where this function would exist?

I don't think we have this one; we can add it to `snapshots.py`.
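
As a rough sketch of what such a helper might look like: the function below walks the parent chain from the newer snapshot back to the older one. `SnapshotStub` and `ancestors_between` are hypothetical names used only for this illustration; the stub assumes each snapshot exposes `snapshot_id` and `parent_snapshot_id`, and the choice to exclude the older snapshot from the result is also an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Optional


@dataclass(frozen=True)
class SnapshotStub:
    # Minimal stand-in for a table snapshot; only the two fields
    # needed to follow the lineage are modelled here.
    snapshot_id: int
    parent_snapshot_id: Optional[int] = None


def ancestors_between(
    snapshots_by_id: Dict[int, SnapshotStub],
    latest_id: int,
    oldest_id: Optional[int],
) -> Iterator[SnapshotStub]:
    """Yield snapshots from latest_id back to, but not including, oldest_id.

    If oldest_id is None (or is never reached), the walk continues to the
    root of the lineage.
    """
    current = snapshots_by_id.get(latest_id)
    while current is not None and current.snapshot_id != oldest_id:
        yield current
        parent_id = current.parent_snapshot_id
        # Stop when there is no parent or the parent is unknown.
        current = snapshots_by_id.get(parent_id) if parent_id is not None else None


# Example lineage: 1 <- 2 <- 3 <- 4
lineage = {
    1: SnapshotStub(1),
    2: SnapshotStub(2, 1),
    3: SnapshotStub(3, 2),
    4: SnapshotStub(4, 3),
}
newer_ids = [s.snapshot_id for s in ancestors_between(lineage, 4, 2)]
```

Here `newer_ids` holds the snapshots committed after snapshot 2, newest first, which is the set a compatibility check would need to inspect.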
