Concurrent writes failures #1084
As pointed out in the Slack channel by @sungwy, this is caused by the following two issues:
^ The second link should be this one: #819
FYI, according to the docs, "SQLite is not built for concurrency, you should use this catalog for exploratory or development purposes."
I know. This issue exists with both PostgreSQL and SQLite; SQLite just makes the reproduction a bit simpler. You're right to point it out though, since other users might otherwise want to use SQLite in production.
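For reference, here is a sketch of pointing pyiceberg's SQL catalog at SQLite versus PostgreSQL; the connection strings and warehouse locations are placeholders, not values from this issue:

```python
from pyiceberg.catalog.sql import SqlCatalog

# SQLite-backed catalog: convenient for development, but SQLite itself is not
# built for concurrent writers.
dev_catalog = SqlCatalog(
    "dev",
    uri="sqlite:///catalog.db",
    warehouse="file:///tmp/warehouse",
)

# PostgreSQL-backed catalog: the commit conflict described in this issue still
# occurs, it is just harder to reproduce than with SQLite.
prod_catalog = SqlCatalog(
    "prod",
    uri="postgresql+psycopg2://user:password@db-host:5432/iceberg",
    warehouse="s3://warehouse-bucket/",
)
```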
Here's some code that worked for me:

```python
import pyarrow as pa
from pyiceberg.catalog import Catalog
from tenacity import retry, stop_after_attempt, wait_exponential


def append_to_table_with_retry(pa_df: pa.Table, table_name: str, catalog: Catalog) -> None:
    """Append a pyarrow table to the catalog table using tenacity's exponential backoff."""

    @retry(
        wait=wait_exponential(multiplier=1, min=4, max=32),
        stop=stop_after_attempt(20),
        reraise=True,
    )
    def append_with_retry() -> None:
        table = catalog.load_table(table_name)  # <-- if another process appends between this line ...
        table.append(pa_df)  # <-- ... and this one, the commit fails and tenacity retries.

    append_with_retry()
```
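One possible refinement (my suggestion, not part of the snippet above): restrict the retry to pyiceberg's `CommitFailedException`, so unrelated errors surface immediately instead of being retried. This is a drop-in replacement for the inner `append_with_retry` above:

```python
from pyiceberg.exceptions import CommitFailedException
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    retry=retry_if_exception_type(CommitFailedException),  # only retry commit conflicts
    wait=wait_exponential(multiplier=1, min=4, max=32),
    stop=stop_after_attempt(20),
    reraise=True,
)
def append_with_retry() -> None:
    table = catalog.load_table(table_name)  # reload to pick up the latest snapshot
    table.append(pa_df)
```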
This doesn't work (at least not efficiently) if you're writing rather large files with high concurrency. For example, many threads each uploading a 1 GB dataframe can end up uploading every dataframe many times, since this approach retries the entire operation. That is a huge waste of bandwidth and performs worse than implementing a GIL (Global Iceberg Lock). I ended up migrating our data to ClickHouse. It's an entirely different beast, but it provides much better performance for our use case anyway. I'm happy to revisit pyiceberg once commit retries are implemented.
Apache Iceberg version
0.7.1
Please describe the bug 🐞
Summary
I'm currently trying to migrate a couple of dataframes with a custom hive-like storage scheme to Iceberg. After a lot of fiddling I managed to load the dataframes from Azure storage, create the table in the Iceberg catalog (currently using SQLite + local fs), and append fragments from the Parquet dataset. As soon as I add a thread pool, I always run into concurrency issues.
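A minimal reproduction sketch of the setup described above; the catalog location, dataset path, table identifier, and worker count are all assumptions for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

import pyarrow.dataset as ds
from pyiceberg.catalog.sql import SqlCatalog

catalog = SqlCatalog(
    "default",
    uri="sqlite:///catalog.db",         # local SQLite catalog (assumed path)
    warehouse="file:///tmp/warehouse",  # local filesystem warehouse (assumed path)
)

dataset = ds.dataset("export/", format="parquet")  # hypothetical source dataset

def append_fragment(fragment: ds.Fragment) -> None:
    # Every worker re-loads the table and appends one fragment; two workers
    # committing at the same time is what triggers the failures reported here.
    table = catalog.load_table("default.events")  # hypothetical table identifier
    table.append(fragment.to_table())

with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(append_fragment, dataset.get_fragments()))
```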
Errors
I get either of the following two error messages:
or
Sources
I use `Dataset.get_fragments` and insert the data into an Iceberg table with identical partitioning. I can work around this error with a GIL (Global Iceberg Lock, pun intended), which is just a `threading.Lock()` ensuring that every `load_table()` + `table.append()` pair happens atomically; see the sketch below. But that kills almost all of the possible performance gains. I also plan to run this in some Celery workers, so a `threading.Lock()` is not an option going forward anyway.

Attachments: azure_import.py, pyproject.toml
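For completeness, a sketch of the lock-based workaround described above; the lock and function names are mine:

```python
import threading

import pyarrow as pa
from pyiceberg.catalog import Catalog

# Global Iceberg Lock: serializes the load-and-append pair so no other thread
# can commit between load_table() and append().
GIL = threading.Lock()

def append_locked(pa_df: pa.Table, table_name: str, catalog: Catalog) -> None:
    with GIL:
        table = catalog.load_table(table_name)
        table.append(pa_df)
```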