[azure-storage-blob] Blob Upload never finishes if a generator is passed #17418
Comments
Thanks for reaching out @mth-cbc! We'll investigate ASAP.
Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @xgithubtriage.

Issue Details
Describe the bug

I want to pass a generator of the form:

```python
def data_generator():
    # do something
    yield data
```

to the function `upload_blob()` to avoid the need to hold all of my (potentially huge) data in memory. The upload, however, never finishes even after the generator is exhausted.

To Reproduce

Steps to reproduce the behavior:
1. Create a `BlobClient`.
2. Call `blob_client.upload_blob(data=<generator>)`.

Here is a minimal script with which I could trigger the behaviour every time:

```python
from azure.storage.blob import BlobClient

# Size of raw_string combined with "range" in `test_generator` needs
# to be bigger than the chunk size (otherwise everything is put in one
# go, instead of a "stream" case)
raw_string = "a" * 1024 * 1024 * 3

def test_generator():
    for it in range(0, 2):
        print(it)
        yield raw_string
    print("Exhausted generator")

test: BlobClient = BlobClient.from_connection_string(
    conn_str="<conn_str>",
    container_name="ttp",
    blob_name="generator_test.txt",
)
test.upload_blob(data=test_generator(), overwrite=True)
print("Finished")
```

The corresponding output shows the generator being exhausted ("Exhausted generator" is printed), but "Finished" never appears.
Even after waiting several minutes the program does not finish.

Expected behavior

The upload should succeed normally and the blob should contain all the data produced by the generator.

Additional context

I wasn't completely sure if the `upload_blob` function was supposed to work as I expected. So I investigated a bit further and I might have found the underlying problem. The `IterStreamer` class used to wrap the generator does not behave as I would expect. Here is a script which visualises my observations:

```python
import six

# Copy of the IterStreamer class from
# https://github.com/Azure/azure-sdk-for-python/blob/af066bb6990279e5868407918d9cd82ad159e7e7/sdk/storage/azure-storage-blob/azure/storage/blob/_shared/uploads.py#L503
class IterStreamer(object):
    """
    File-like streaming iterator.
    """

    def __init__(self, generator, encoding="UTF-8"):
        self.generator = generator
        self.iterator = iter(generator)
        self.leftover = b""
        self.encoding = encoding

    def __len__(self):
        return self.generator.__len__()

    def __iter__(self):
        return self.iterator

    def seekable(self):
        return False

    def __next__(self):
        return next(self.iterator)

    next = __next__  # Python 2 compatibility.

    def tell(self, *args, **kwargs):
        raise NotImplementedError("Data generator does not support tell.")

    def seek(self, *args, **kwargs):
        raise NotImplementedError("Data generator is unseekable.")

    def read(self, size):
        data = self.leftover
        # self.leftover = b""  # <-- My "bugfix"
        count = len(self.leftover)
        try:
            while count < size:
                chunk = self.__next__()
                if isinstance(chunk, six.text_type):
                    chunk = chunk.encode(self.encoding)
                data += chunk
                count += len(chunk)
        except StopIteration:
            pass

        if count > size:
            self.leftover = data[size:]

        return data[:size]


test_data = ["some mock data, content does not matter"] * 100
test_stream = IterStreamer(test_data)
read_size = 1024
loop_counter = 0

# Logic closely resembling the code in the function `get_chunk_streams`
# in the class `_ChunkUploader`:
# https://github.com/Azure/azure-sdk-for-python/blob/af066bb6990279e5868407918d9cd82ad159e7e7/sdk/storage/azure-storage-blob/azure/storage/blob/_shared/uploads.py#L159
while True:
    data = b""
    temp = test_stream.read(read_size)
    if not isinstance(temp, six.binary_type):
        raise TypeError("Blob data should be of type bytes.")
    data += temp or b""

    # Debug Print
    print(data)

    # We have read an empty string and so are at the end
    # of the buffer or we have read a full chunk.
    if temp == b"":  # or len(data) == self.chunk_size:
        break

    # Stop an unending loop
    loop_counter += 1
    if loop_counter > 30:
        print("Unending Loop")
        break
```
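To make the failure mode concrete, here is a short trace of `read()` using the class above (my own illustration, not part of the original report): once the iterator is exhausted, `read()` keeps returning the stale `self.leftover` instead of `b""`, so a consumer that waits for an empty read never terminates.

```python
stream = IterStreamer(["ab"] * 2)
print(stream.read(3))  # b'aba' -> leftover is now b'b'
print(stream.read(3))  # b'b'   -> iterator exhausted, the leftover is returned
print(stream.read(3))  # b'b'   -> leftover is never cleared, so this repeats forever
```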
In the "unfixed" version, the output shows the stale leftover bytes being printed again and again until the safety counter trips and prints "Unending Loop": `read()` never returns `b""`. If you comment in the "bugfix" line, the output changes accordingly: the data is printed in 1024-byte chunks, a final read returns `b""`, and the loop terminates as expected.
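For anyone hitting this before a fix lands, here is a minimal sketch of a user-side workaround: wrap the generator in a file-like object whose `read()` clears the leftover up front, and pass that stream to `upload_blob` instead of the raw generator. The `GeneratorStream` name and the adapter itself are my own illustration under that assumption, not part of the SDK.

```python
import io

class GeneratorStream(io.RawIOBase):
    """Hypothetical adapter exposing a bytes/str generator as a readable
    stream (not part of azure-storage-blob)."""

    def __init__(self, generator, encoding="UTF-8"):
        self.iterator = iter(generator)
        self.leftover = b""
        self.encoding = encoding

    def readable(self):
        return True

    def read(self, size=-1):
        # Clear the leftover first -- the one-line fix the issue proposes.
        data, self.leftover = self.leftover, b""
        try:
            while size < 0 or len(data) < size:
                chunk = next(self.iterator)
                if isinstance(chunk, str):
                    chunk = chunk.encode(self.encoding)
                data += chunk
        except StopIteration:
            pass
        if 0 <= size < len(data):
            self.leftover = data[size:]
            data = data[:size]
        return data
```

Usage would then look like `test.upload_blob(data=GeneratorStream(test_generator()), overwrite=True)`.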
Hi @mth-cbc, great catch!! And great work on your workaround there :) Sorry for the inconvenience.