Set the chunk size in the bulk helper based on bytes #199
Comments
This is definitely something I'd consider. The only reason I didn't include this from the start is that I was trying to find a better way to deal with the serialization; right now it requires accessing [...]. If you want to take a stab at it, go ahead; otherwise I am happy to implement it myself.
In addition, helpers.streaming_bulk() blindly tries to post the entire chunk's worth of data regardless of the maximum size ES will accept (I am occasionally getting org.elasticsearch.common.netty.handler.codec.frame.TooLongFrameException: HTTP content length exceeded 104857600 bytes). By chunking on bytes, streaming_bulk() would never issue a call to client.bulk() that would raise this exception.
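For anyone hitting this before a byte-based option lands, a minimal workaround is to group the actions by their approximate serialized size and hand each group to the helper as a single chunk. This is only a sketch: `chunk_by_bytes` is my own name, and `json.dumps()` is used as a rough stand-in for the client's serializer, so the limit is approximate.

```python
import json

from elasticsearch import Elasticsearch, helpers


def chunk_by_bytes(actions, max_bytes=10 * 1024 * 1024):
    """Yield lists of actions whose approximate serialized size stays under max_bytes.

    json.dumps() is used as a rough stand-in for the client's own serializer,
    so the limit is approximate rather than exact.
    """
    chunk, size = [], 0
    for action in actions:
        action_size = len(json.dumps(action).encode("utf-8"))
        if chunk and size + action_size > max_bytes:
            yield chunk
            chunk, size = [], 0
        chunk.append(action)
        size += action_size
    if chunk:
        yield chunk


es = Elasticsearch()
actions = (
    {"_index": "my-index", "_type": "doc", "_source": {"blob": "x" * 2048, "n": i}}
    for i in range(100000)
)

for group in chunk_by_bytes(actions, max_bytes=10 * 1024 * 1024):
    # chunk_size=len(group) makes streaming_bulk send the whole group
    # as a single bulk request.
    for ok, result in helpers.streaming_bulk(es, group, chunk_size=len(group)):
        if not ok:
            print("failed:", result)
```

Each group then goes out as one bulk request, so none of them should trip the 104857600-byte limit mentioned above.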
@honzakral I would like to give it a try. Should it be chunk_size='100mb' or chunk_size=100?
I think [...]
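On the chunk_size='100mb' vs. chunk_size=100 question above: one way to support both call styles is to treat an int as a document count and a string as a byte budget. The sketch below is purely illustrative (`parse_chunk_size` is not part of elasticsearch-py); it only shows how a helper could tell the two apart.

```python
import re

# Hypothetical normalizer for the two call styles floated above:
# chunk_size=100 (a document count) vs. chunk_size='100mb' (a byte budget).
# Nothing here is part of the elasticsearch-py API.
_UNITS = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}


def parse_chunk_size(chunk_size):
    """Return ('docs', n) for an int, or ('bytes', n) for a size string like '100mb'."""
    if isinstance(chunk_size, int):
        return "docs", chunk_size
    match = re.match(r"^(\d+)\s*([kmg]?b)$", chunk_size.strip().lower())
    if not match:
        raise ValueError("expected an int or a string like '100mb', got %r" % (chunk_size,))
    number, unit = match.groups()
    return "bytes", int(number) * _UNITS[unit]


print(parse_chunk_size(100))      # ('docs', 100)
print(parse_chunk_size("100mb"))  # ('bytes', 104857600)
```

An alternative is to expose two separate parameters (a document count plus a byte limit), which avoids this kind of type sniffing entirely.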
Hi,
Is it possible to have an option to set the chunk size in bytes for the bulk indexing helper?
The reason is that we want to bulk index documents, but our documents are not all necessarily the same size. Therefore, if we have a mix of document lengths, it makes more sense to bulk index in chunks measured in bytes rather than in document counts.
E.g. I might set the chunk size to 1 MB (this could be 100 documents of 10 KB each, or 5 documents of 200 KB each).
This would prevent us from encountering memory issues with Elasticsearch.
Is this an enhancement you would consider?
Thanks,
Prash
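To illustrate the request above, here is roughly how a byte-based cap might look from the caller's side. The max_chunk_bytes keyword is used only as a placeholder name for the proposed option; it is not something the bulk helper accepted at the time this issue was filed.

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

# Documents of widely varying sizes (roughly 1 KB to 200 KB); a fixed
# document count per chunk gives no control over the request size in bytes.
documents = [{"blob": "x" * (1024 * (1 + i % 200))} for i in range(5000)]
actions = ({"_index": "my-index", "_type": "doc", "_source": doc} for doc in documents)

# Proposed behaviour: cap each bulk request at roughly 1 MB of serialized
# data, however many documents that turns out to be.
helpers.bulk(es, actions, max_chunk_bytes=1024 * 1024)
```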