
Set the chunk size in the bulk helper based on bytes #199


Closed
pmajmudar opened this issue Feb 17, 2015 · 4 comments
Comments

@pmajmudar

Hi,

Is it possible to have an option to set the chunk size in bytes for the bulk indexing helper?

The reason is that we want to bulk index documents, but our documents are not all necessarily the same size. Since we have a mix of document lengths, it makes more sense to bulk index in chunks measured in bytes.

E.g. I might set the chunk size to 1 MB (this could be 100 documents of 10 KB each, or 5 documents of 200 KB each).

This would prevent us from encountering memory issues with Elasticsearch.

Is this an enhancement you would consider?

Thanks,

Prash
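
As an illustration of the request, here is a minimal byte-based chunking sketch. The `chunk_by_bytes` helper is hypothetical (not part of the library) and assumes the actions have already been serialized to JSON strings:

```python
# Hypothetical helper, not part of elasticsearch-py: group pre-serialized
# actions into chunks whose combined size stays under a byte limit.
def chunk_by_bytes(serialized_actions, max_bytes=1024 * 1024):
    chunk, size = [], 0
    for action in serialized_actions:
        action_size = len(action.encode("utf-8"))
        # close the current chunk if adding this action would push it over the limit
        if chunk and size + action_size > max_bytes:
            yield chunk
            chunk, size = [], 0
        chunk.append(action)
        size += action_size
    if chunk:
        yield chunk
```

An action larger than `max_bytes` would still be yielded as a chunk of its own; the helper only guarantees it never combines actions past the limit.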

@honzakral
Contributor

This is definitely something I'd consider. The only reason I didn't include this from the start is that I was trying to find a better way to deal with the serialization; right now it requires accessing client.transport.serializer. But I guess that is OK.

If you want to take a stab at it go ahead, otherwise I am happy to implement it myself.
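
A rough sketch of what that could look like, assuming the helper keeps using client.transport.serializer and receives actions as (action, source) pairs (both assumptions made for illustration only):

```python
# Illustrative only: serialize each action with the client's own serializer
# and report its byte size, so the caller can decide where to cut a chunk.
def serialized_sizes(client, actions):
    serializer = client.transport.serializer
    for action, source in actions:
        lines = [serializer.dumps(action)]
        if source is not None:
            lines.append(serializer.dumps(source))
        payload = "\n".join(lines) + "\n"
        yield payload, len(payload.encode("utf-8"))
```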

@friedmans

In addition, helpers.streaming_bulk() blindly tries to post the entire chunk's worth of data regardless of the maximum size ES will accept (I occasionally get org.elasticsearch.common.netty.handler.codec.frame.TooLongFrameException: HTTP content length exceeded 104857600 bytes). By chunking on bytes, streaming_bulk() would never issue a call to client.bulk() that would raise this exception.
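
For context, 104857600 bytes is Elasticsearch's default http.max_content_length of 100mb (configurable in elasticsearch.yml). A defensive sketch under that assumption, where `safe_bulk` is a hypothetical wrapper over pre-serialized bulk bodies:

```python
# Hypothetical guard, assuming pre-serialized bulk bodies and the default
# server-side limit of http.max_content_length = 100mb.
MAX_CONTENT_LENGTH = 100 * 1024 * 1024

def safe_bulk(client, bodies, max_bytes=MAX_CONTENT_LENGTH):
    for body in bodies:
        if len(body.encode("utf-8")) > max_bytes:
            raise ValueError("bulk body would exceed http.max_content_length")
        client.bulk(body=body)
```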

@kpanic

kpanic commented Sep 10, 2015

@honzakral I would like to give it a try. Should it be chunk_size='100mb' or chunk_size=100, or a different param like bulk_size=100 (in MB)? Your thoughts?

@honzakral
Contributor

I think max_chunk_bytes would be a good name. Then the chunk should be at most that size and contain at most bulk_size documents.
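
Later versions of the helpers do accept a max_chunk_bytes limit alongside chunk_size. A usage sketch: the index name, document shape, and limits below are illustrative, the client is assumed to point at a local cluster, and older Elasticsearch versions would also need a _type in each action:

```python
# Usage sketch: cap each bulk request both by document count and by
# serialized size. Index name, document shape, and limits are illustrative.
from elasticsearch import Elasticsearch, helpers

client = Elasticsearch()  # assumes a cluster on localhost:9200

actions = (
    {"_index": "docs", "_source": {"body": "x" * 10000}}
    for _ in range(1000)
)

helpers.bulk(
    client,
    actions,
    chunk_size=500,                    # at most 500 documents per request
    max_chunk_bytes=10 * 1024 * 1024,  # and at most ~10 MB of serialized data
)
```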

rciorba added a commit to rciorba/elasticsearch-py that referenced this issue Mar 2, 2018