Throw an exception when the data buffer size for allocation is too large #2155

kounelisagis · 2025-02-07T16:33:19Z

Let's throw an exception in cases where the total domain size overflows a uint64 and the estimator returns the maximum amount of data that can be read at once (UINT64_MAX).
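To make the overflow case concrete, here is a minimal Python sketch of the arithmetic. The helper `estimated_nbytes` is hypothetical and only mimics the saturating behaviour described above; it is not the actual TileDB estimator.

```python
UINT64_MAX = 2**64 - 1


def estimated_nbytes(dim_extents, cell_nbytes):
    """Hypothetical stand-in for a size estimator.

    Multiplies the extents of a dense domain by the cell size and
    saturates at UINT64_MAX once the product no longer fits in a uint64.
    """
    total = cell_nbytes
    for extent in dim_extents:
        total *= extent
        if total > UINT64_MAX:
            return UINT64_MAX
    return total


# A 1000 x 1000 domain of 8-byte cells fits comfortably.
small = estimated_nbytes([1000, 1000], 8)

# A 2^32 x 2^32 domain of 8-byte cells needs 2^67 bytes, so the estimate saturates.
huge = estimated_nbytes([2**32, 2**32], 8)
```

A saturated estimate like `huge` is the value this PR proposes to reject rather than pass on to the allocator.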

johnkerl · 2025-02-07T16:36:05Z

tiledb/core.cc

@@ -98,6 +98,10 @@ struct BufferInfo {
        try {
            dtype = tiledb_dtype(data_type, cell_val_num);
            elem_nbytes = tiledb_datatype_size(type);
+            if (data_nbytes >
+                static_cast<uint64_t>(std::numeric_limits<intptr_t>::max())) {


This is good and it is a step forward. We should merge this PR with at least this change.
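For readers more at home in Python, the guard in the hunk above can be sketched roughly as follows. This is an illustration, not the PR's code: `check_data_nbytes` and the error message are made up, and `sys.maxsize` stands in for `std::numeric_limits<intptr_t>::max()` (they coincide on typical 64-bit builds). `OverflowError` is chosen because that is what the PR's test expects.

```python
import sys


def check_data_nbytes(data_nbytes):
    """Hypothetical guard: reject sizes that cannot be a valid allocation size."""
    # sys.maxsize is the largest Py_ssize_t value (2**63 - 1 on typical 64-bit builds).
    if data_nbytes > sys.maxsize:
        raise OverflowError(
            f"Data buffer size {data_nbytes} is too large for allocation"
        )
    return data_nbytes
```

A request of `2**64 - 1` bytes, the saturated estimate from the description, fails this check, while ordinary sizes pass through unchanged.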

However, boxes will OOM long before request sizes approach 2^63. I don't want us to hard-code assumptions about RAM sizes (AWS offers instances with a terabyte of RAM, which is 2^40 bytes!), but as food for thought, maybe we can make the cap smaller than int64 max.
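For a sense of scale, the gap between those two numbers is simple arithmetic:

```python
TIB = 2**40            # one terabyte (tebibyte) of RAM, in bytes
INT64_MAX = 2**63 - 1  # the cap currently checked in this PR

# How many 1 TiB machines' worth of memory the current cap allows.
ratio = (INT64_MAX + 1) // TIB
```

`ratio` is 2^23, so a request just under the current cap is still millions of times larger than the memory of a one-terabyte instance.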

Ideally if numpy has a defined maximum size for its arrays (like .NET has) we would check that, but I couldn't find it after a search.

I was thinking, and discussed with Agis, that we could make this max something safe enough, e.g. 2GB, and make it configurable through a py.* config option, similar to what we can do today with incomplete buffer sizes through cfg["py.max_buffer_bytes"].

Then we could also improve the user experience: when we throw, the error message could suggest increasing that config value or using A.query(incomplete=True) to read the data.
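A rough sketch of what such a configurable cap with an actionable message might look like. Everything here is an assumption for illustration: the config key `py.max_single_alloc_bytes` does not exist, `check_alloc` is a made-up helper, a plain dict stands in for the TileDB config, and `MemoryError` is just one plausible exception type.

```python
DEFAULT_MAX_ALLOC_BYTES = 2 * 1024**3  # the 2GB default floated in this thread

# Hypothetical config key, used only for illustration.
MAX_ALLOC_KEY = "py.max_single_alloc_bytes"


def check_alloc(nbytes, cfg=None):
    """Raise with a helpful hint if a single buffer would exceed the cap."""
    limit = int((cfg or {}).get(MAX_ALLOC_KEY, DEFAULT_MAX_ALLOC_BYTES))
    if nbytes > limit:
        raise MemoryError(
            f"Requested buffer of {nbytes} bytes exceeds the {limit}-byte limit. "
            f"Increase cfg['{MAX_ALLOC_KEY}'] or read with A.query(incomplete=True)."
        )
    return nbytes
```

With the default, a 3GB request raises; passing `{MAX_ALLOC_KEY: 4 * 1024**3}` lets the same request through.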

A 2GB default limit sounds fine, and it can be overridden with an optional parameter like A.query(ignore_buffer_size_check=True). Customizing it with a config option sounds like overkill.

Users should really use incomplete queries if possible, and way before the 2GB threshold. Allocating contiguous buffers of this size has disadvantages like memory fragmentation.
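The idea behind preferring bounded reads can be illustrated without any TileDB API. The sketch below only computes cell ranges whose buffers stay under a byte budget; `chunk_ranges` is a hypothetical helper and says nothing about how incomplete queries are implemented internally.

```python
def chunk_ranges(n_cells, cell_nbytes, max_chunk_bytes):
    """Yield (start, stop) cell ranges whose buffers fit in max_chunk_bytes.

    Each chunk holds at least one cell, so a single cell larger than the
    budget still gets its own chunk.
    """
    cells_per_chunk = max(1, max_chunk_bytes // cell_nbytes)
    for start in range(0, n_cells, cells_per_chunk):
        yield start, min(start + cells_per_chunk, n_cells)


# Ten 8-byte cells read with a 32-byte budget: four cells per chunk.
ranges = list(chunk_ranges(n_cells=10, cell_nbytes=8, max_chunk_bytes=32))
```

Each range can be read into a small, reusable buffer, which avoids a single large contiguous allocation.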

We also have the (undocumented?) py.alloc_max_bytes, as discussed with @ihnorton. I will take a look at it.

ypatia · 2025-02-20T11:54:18Z

tiledb/tests/test_fixes.py

+
+        with tiledb.open(uri, mode="r") as A:
+            with pytest.raises(OverflowError) as exc:
+                A[:]


That's probably another issue to look at, but why would we even allocate a user buffer for retrieving results from an empty array?

kounelisagis added 2 commits February 7, 2025 18:28

Add throw

c51944c

Add test

94ef25b

kounelisagis requested review from ihnorton and teo-tsirpanis February 7, 2025 16:33

johnkerl reviewed Feb 7, 2025

johnkerl approved these changes Feb 7, 2025

ypatia reviewed Feb 20, 2025

kounelisagis commented Feb 7, 2025

johnkerl Feb 7, 2025

teo-tsirpanis Feb 7, 2025

ypatia Feb 20, 2025

teo-tsirpanis Feb 20, 2025

kounelisagis Feb 20, 2025

ypatia Feb 20, 2025
