mkdocs/docs/api.md (+17 −4)
```python
table.append(df)
```

You can delete some of the data from the table by calling `tbl.delete()` with a desired `delete_filter`:

```python
tbl.delete(delete_filter="city == 'Paris'")
```

In the above example, any records where the `city` field equals `Paris` will be deleted.
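To make the filter semantics concrete: a delete keeps every row that does not match the predicate. A minimal stdlib sketch over hypothetical in-memory records (illustrative only, not how PyIceberg evaluates filters internally):

```python
# Hypothetical records standing in for rows of the table.
records = [
    {"city": "Amsterdam", "lat": 52.37, "long": 4.90},
    {"city": "Paris", "lat": 48.86, "long": 2.35},
    {"city": "Drachten", "lat": 53.11, "long": 6.10},
]

# A delete with delete_filter="city == 'Paris'" behaves like keeping
# only the rows that do NOT satisfy the predicate.
remaining = [row for row in records if not (row["city"] == "Paris")]

print([row["city"] for row in remaining])  # → ['Amsterdam', 'Drachten']
```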
| Key | Type | Default | Description |
|-----|------|---------|-------------|
|`commit.manifest.target-size-bytes`| Size in bytes | 8388608 (8MB) | Target size when merging manifest files |
|`commit.manifest.min-count-to-merge`| Number of manifests | 100 | Minimum number of manifests to accumulate before merging |
|`commit.manifest-merge.enabled`| Boolean | False | Controls whether to automatically merge manifests on writes |
<!-- prettier-ignore-start -->

!!! note "Fast append"

    Unlike the Java implementation, PyIceberg defaults to the [fast append](api.md#write-support), so `commit.manifest-merge.enabled` is set to `False` by default.

<!-- prettier-ignore-end -->
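Taken together, these settings can be sketched as a simple decision rule: merge only when manifest merging is enabled and at least `commit.manifest.min-count-to-merge` manifests have accumulated, then pack them toward the target size. The Python below is an illustrative simplification, not PyIceberg's actual commit logic:

```python
def should_merge_manifests(
    manifest_sizes: list[int],
    enabled: bool = False,          # commit.manifest-merge.enabled
    min_count_to_merge: int = 100,  # commit.manifest.min-count-to-merge
) -> bool:
    """Merge only when enabled and enough manifests have accumulated."""
    return enabled and len(manifest_sizes) >= min_count_to_merge


def pack_manifests(sizes: list[int], target_size: int = 8 * 1024 * 1024) -> list[list[int]]:
    """Greedily group manifest sizes into bins of roughly target_size bytes."""
    bins: list[list[int]] = []
    current: list[int] = []
    current_size = 0
    for size in sizes:
        if current and current_size + size > target_size:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins
```

With the defaults shown above (fast append), `should_merge_manifests` always returns `False`.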
# FileIO
Iceberg works with the concept of a FileIO which is a pluggable module for reading, writing, and deleting files. By default, PyIceberg will try to initialize the FileIO that's suitable for the scheme (`s3://`, `gs://`, etc.) and will use the first one that's installed.
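That resolution step can be sketched as a lookup from scheme to an ordered list of candidate backends, returning the first whose package is importable. The mapping and function below are illustrative, not PyIceberg's actual registry:

```python
import importlib.util

# Hypothetical preference-ordered candidate modules per URI scheme.
_IO_CANDIDATES = {
    "s3": ["pyarrow", "s3fs"],
    "gs": ["gcsfs"],
    "demo": ["definitely_not_installed_xyz", "json"],  # purely for illustration
}


def resolve_file_io(scheme: str) -> str:
    """Return the first installed candidate module for the given scheme."""
    for module in _IO_CANDIDATES.get(scheme, []):
        if importlib.util.find_spec(module) is not None:
            return module
    raise ValueError(f"No FileIO available for scheme {scheme!r}")


print(resolve_file_io("demo"))  # → json
```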
For the FileIO there are several configuration options available:
| Key | Example | Description |
|-----|---------|-------------|
| gcs.project-id | my-gcp-project | Configure Google Cloud Project for GCS FileIO. |
| gcs.oauth2.token | ya29.dr.AfM... | String representation of the access token used for temporary access. |
| gcs.oauth2.token-expires-at | 1690971805918 | Configure expiration for credential generated with an access token. Milliseconds since epoch. |
| gcs.access | read_only | Configure client to have specific access. Must be one of 'read_only', 'read_write', or 'full_control'. |
| gcs.consistency | md5 | Configure the check method when writing files. Must be one of 'none', 'size', or 'md5'. |
| gcs.cache-timeout | 60 | Configure the cache expiration time in seconds for object metadata cache. |
| gcs.requester-pays | False | Configure whether to use requester-pays requests. |
| gcs.session-kwargs | {} | Configure a dict of parameters to pass on to aiohttp.ClientSession; can contain, for example, proxy settings. |
| gcs.endpoint | http://0.0.0.0:4443 | Configure an alternative endpoint for the GCS FileIO to access (format: protocol://host:port). If not given, defaults to the value of the environment variable "STORAGE_EMULATOR_HOST"; if that is not set either, the standard Google endpoint is used. |
| gcs.default-location | US | Configure the default location where buckets are created, like 'US' or 'EUROPE-WEST3'. |
| gcs.version-aware | False | Configure whether to support object versioning on the GCS bucket. |
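All of these are plain string-keyed properties; a catalog configuration is just a dict that carries them. A minimal sketch with placeholder values (how you pass the dict to `load_catalog` depends on your setup):

```python
# Placeholder values; the gcs.* keys mirror the table above.
gcs_properties = {
    "gcs.project-id": "my-gcp-project",
    "gcs.default-location": "US",
    "gcs.cache-timeout": "60",
    "gcs.requester-pays": "False",
}

# Property values are strings, even for numeric or boolean settings.
assert all(isinstance(v, str) for v in gcs_properties.values())
```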
| Key | Example | Description |
|-----|---------|-------------|
| glue.id | 111111111111 | Configure the 12-digit ID of the Glue Catalog |
| glue.skip-archive | true | Configure whether to skip the archival of older table versions. Defaults to true |
| glue.endpoint | https://glue.us-east-1.amazonaws.com | Configure an alternative endpoint of the Glue service for GlueCatalog to access |

<!-- markdown-link-check-enable-->
## DynamoDB Catalog
If you want to use AWS DynamoDB as the catalog, you can use the last two ways to configure PyIceberg and refer
PyIceberg uses multiple threads to parallelize operations.
# Backward Compatibility
Previous Java implementations (`<1.4.0`) incorrectly assume the optional attribute `current-snapshot-id` to be a required attribute in TableMetadata. This means that if `current-snapshot-id` is missing in the metadata file (e.g. on table creation), the application will throw an exception without being able to load the table. This assumption has been corrected in more recent Iceberg versions. However, it is possible to force PyIceberg to create a table with a metadata file that will be compatible with previous versions. This can be configured by setting the `legacy-current-snapshot-id` property to `"True"` in the configuration file, or by setting the `PYICEBERG_LEGACY_CURRENT_SNAPSHOT_ID` environment variable. Refer to the [PR discussion](https://github.com/apache/iceberg-python/pull/473) for more details on the issue.
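For example, opting in via the environment variable named above is just a string-valued setting. A sketch (the truthiness check shown is illustrative, not PyIceberg's parsing code):

```python
import os

# Opt in to writing metadata that stays readable by Java < 1.4.0.
os.environ["PYICEBERG_LEGACY_CURRENT_SNAPSHOT_ID"] = "True"

# Illustrative check, mimicking how a string-valued setting could be read.
legacy = os.environ.get("PYICEBERG_LEGACY_CURRENT_SNAPSHOT_ID", "False").lower() == "true"
print(legacy)  # → True
```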
# Nanoseconds Support
PyIceberg currently only supports up to microsecond precision in its TimestampType. PyArrow timestamp types in 's' and 'ms' will be upcast automatically to 'us' precision timestamps on write. Timestamps in 'ns' precision can also be downcast automatically on write if desired. This can be configured by setting the `downcast-ns-timestamp-to-us-on-write` property to `"True"` in the configuration file, or by setting the `PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE` environment variable. Refer to the [nanoseconds timestamp proposal document](https://docs.google.com/document/d/1bE1DcEGNzZAMiVJSZ0X1wElKLNkT9kRkk0hDlfkXzvU/edit#heading=h.ibflcctc9i1d) for more details on the long-term roadmap for nanoseconds support.
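The downcast itself is integer truncation: dividing a nanosecond epoch value by 1,000 yields microseconds, discarding sub-microsecond detail. A stdlib illustration (not PyArrow's actual cast implementation):

```python
NS_PER_US = 1_000


def downcast_ns_to_us(ts_ns: int) -> int:
    """Truncate a nanosecond epoch timestamp to microsecond precision."""
    return ts_ns // NS_PER_US


ts_ns = 1_690_971_805_918_123_456  # nanoseconds since epoch
print(downcast_ns_to_us(ts_ns))   # → 1690971805918123 (the trailing 456 ns are lost)
```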
mkdocs/docs/how-to-release.md (+15)

The guide to release PyIceberg.
The first step is to publish a release candidate (RC) and publish it to the public for testing and validation. Once the vote has passed on the RC, the RC turns into the new release.
## Preparing for a release
Before running the release candidate, we want to remove any APIs that were marked for removal under the `@deprecated` tag for this release.
For example, the API with the following deprecation tag should be removed when preparing for the 0.2.0 release.
```python
@deprecated(
    deprecated_in="0.1.0",
    removed_in="0.2.0",
    help_message="Please use load_something_else() instead",
)
```
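For illustration, such a decorator can be sketched with the standard `warnings` module. This is a simplified stand-in, not PyIceberg's actual `@deprecated` helper, and `load_something` is a hypothetical function:

```python
import functools
import warnings


def deprecated(deprecated_in: str, removed_in: str, help_message: str = ""):
    """Warn at call time that the wrapped function is scheduled for removal."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is deprecated since {deprecated_in} "
                f"and will be removed in {removed_in}. {help_message}",
                DeprecationWarning,
                stacklevel=2,
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator


@deprecated(
    deprecated_in="0.1.0",
    removed_in="0.2.0",
    help_message="Please use load_something_else() instead",
)
def load_something():
    return "something"
```

Calling `load_something()` still works but emits a `DeprecationWarning` naming the version in which it will disappear.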
## Running a release candidate
Make sure that the version is correct in `pyproject.toml` and `pyiceberg/__init__.py`. Correct means that it reflects the version that you want to release.