Reuse buffer to read gRPC messages from the underlying reader #3220
Conversation
@alextomaili could you please run some benchmarks measuring the performance improvement? I'm particularly interested in the bytes allocated, QPS, and latency stats. The reason I ask: while the change may seem innocuous, I found a similar change to produce a lower memory footprint at the cost of a QPS/latency regression for a lot of common use-cases (e.g. small messages < 1 KiB in size) in my work to reuse Marshal byte buffers (#3167) - this is because of the overhead introduced by sync.Pool. But a threshold similar to #3167 might not be straightforward, because we don't know how big the message is going to be?
I suggest measuring these stats for messages of size 2^n for n in [0, 15].
You are right about the overhead of sync.Pool. If the Go runtime doesn't have high GC pressure - if the memory footprint per second is not large - the overhead introduced by sync.Pool outweighs the savings:

maxConcurrentCalls_6-reqSize_1024B-respSize_1024B - degradation
maxConcurrentCalls_6-reqSize_32768B-respSize_32768B - degradation

But under high GC pressure (in this test, high concurrency) I got an improvement for large requests/responses:

maxConcurrentCalls_64-reqSize_1024B-respSize_1024B - still degradation
maxConcurrentCalls_64-reqSize_8192B-respSize_8192B - improvement
maxConcurrentCalls_64-reqSize_32768B-respSize_32768B - improvement

So the balance between pooling and plain allocation depends on GC pressure (how many bytes the application allocates and frees per second). In this test we can tune GC pressure not only through message size but also through maxConcurrentCalls and the network parameters that simulate the real world (for example, latency). Latency keeps more data 'in flight' between sender and receiver, which raises GC pressure. I've also run the test with a workload closer to my real project and saw an improvement.
I think it is very difficult to know the GC pressure of any real application, and therefore to calculate some precise threshold for using the pool instead of plain allocation. But you are right, a limit can be used to allocate small buffers without the pool. I think it is better to expose a setting and allow the user to define this limit; 1024 looks like a good default value for this threshold. In general, the user should be able to configure it.
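A minimal sketch of that threshold idea, assuming the 1024B default suggested above (all names here are hypothetical, not grpc-go API):

```go
package recvpool

import "sync"

// defaultPoolThreshold is an assumption taken from this thread's
// discussion; it is not a real grpc-go constant.
const defaultPoolThreshold = 1024

var pool = sync.Pool{
	New: func() interface{} { return make([]byte, 32*1024) },
}

// acquire returns a buffer of the requested size, bypassing the pool
// for small sizes where plain allocation beats pool synchronization.
func acquire(size int) []byte {
	if size <= defaultPoolThreshold {
		return make([]byte, size)
	}
	buf := pool.Get().([]byte)
	if cap(buf) < size {
		return make([]byte, size) // pooled buffer too small; allocate
	}
	return buf[:size]
}

// release returns large buffers to the pool; small ones are simply
// dropped and left for the GC to collect.
func release(buf []byte) {
	if cap(buf) > defaultPoolThreshold {
		pool.Put(buf[:cap(buf)])
	}
}
```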
@alextomaili thanks for the detailed benchmarks! These look very promising indeed. Do you have any idea why the QPS benchmark is so erratic? It goes from 8.76% to -9.17% to 10.84% between 2^10 and 2^12, with similar issues around 2^13, 2^14 and 2^15. Could this just be measurement error? Does running the benchmark for longer than 10s make the variance go away? Also, do you have any benchmarks with latency=0ms and max concurrency=1 for 2^{0..15} with reqSize=respSize? Using reqSize=512B in the tests skews the results. For example, if there is a regression with reqSize=1B and respSize=1B, but an improvement with reqSize=2048B, we'll know that the threshold is somewhere in between. The 1024B threshold for the send/Marshal case might not necessarily apply to the recv case that you're dealing with. We'll want to find the lowest threshold possible :)
I'll try to provide more tests in the near future. As far as I understand, in the "Go world" the overhead of sync.Pool is considered acceptable for most cases - that's why Go doesn't have any "lock-free" pools in the standard library. What do you think about this?
@adtac how about using the "striped locks" pattern to reduce the overhead of sync.Pool?
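For illustration, a minimal sketch of what such a striped pool might look like (hypothetical names, not grpc-go API; the stripe count of 32 matches the experiment described below):

```go
package stripedpool

import (
	"sync"
	"sync/atomic"
)

// stripes is the number of independent sync.Pool instances the load
// is spread across.
const stripes = 32

// StripedPool distributes Get/Put calls round-robin across several
// sync.Pools to reduce contention on any single pool.
type StripedPool struct {
	pools [stripes]sync.Pool
	next  uint32
}

func NewStripedPool(newFn func() interface{}) *StripedPool {
	p := &StripedPool{}
	for i := range p.pools {
		p.pools[i].New = newFn
	}
	return p
}

func (p *StripedPool) Get() interface{} {
	i := atomic.AddUint32(&p.next, 1) % stripes
	return p.pools[i].Get()
}

func (p *StripedPool) Put(x interface{}) {
	i := atomic.AddUint32(&p.next, 1) % stripes
	p.pools[i].Put(x)
}
```

Usage would be e.g. `pool := NewStripedPool(func() interface{} { return make([]byte, 32*1024) })`; the trade-off is exactly the one described next: less contention per pool, but a higher chance of hitting an empty stripe and falling back to allocation.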
This approach can increase the probability of taking a pooled object via the fast path, and it can also reduce contention on the mutex inside sync.Pool when the fast path fails. Of course, it also increases the probability of falling back to allocation when the selected pool is empty. I tried using 32 instances of sync.Pool and got positive results for some cases, for example:

maxConcurrentCalls_6-reqSize_8192B-respSize_8192B - one sync.Pool
maxConcurrentCalls_6-reqSize_8192B-respSize_8192B - 32 instances of sync.Pool

significant profit here. But for some cases I got a small degradation:

maxConcurrentCalls_6-reqSize_1024B-respSize_1024B - one sync.Pool
maxConcurrentCalls_6-reqSize_1024B-respSize_1024B - 32 instances of sync.Pool

small degradation here. I think this is not a bad idea if we can't easily attach a sync.Pool to some high-level abstraction such as a "connection" or "session" (each "connection" or "session" could use its own pool, giving less contention than a single global pool). If we want to use a pool deep inside a low-level abstraction such as a "parser" or "codec", we can use a ring of several pools to reduce overhead. What do you think about this?
@alextomaili did you consider using multiple sync.Pools, one for each desired size?
A solution with 2^n-sized pools will behave the same as a single pool if most requests or responses have the same size (or sizes close enough to fall into the same 2^n bucket). The same pool would be selected for most RPC calls, and we would have the same contention problem on that pool, so the performance test would fail. In the real world, an application's workload may have uniform or varied request/response sizes; it depends on the particular application. I'm not sure, but I think the best way is to have some "allocator" abstraction.
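For reference, a hedged sketch of the 2^n size-class idea under discussion (illustrative only; the bucket bounds of 64 B to 32 KiB are arbitrary choices for the sketch):

```go
package sizeclass

import (
	"math/bits"
	"sync"
)

const (
	minShift = 6  // smallest bucket: 64 B
	maxShift = 15 // largest bucket: 32 KiB
)

// classes holds one sync.Pool per power-of-two size class.
var classes [maxShift - minShift + 1]sync.Pool

// bucketFor returns the smallest bucket whose capacity covers size.
func bucketFor(size int) int {
	if size <= 1<<minShift {
		return 0
	}
	return bits.Len(uint(size-1)) - minShift // ceil(log2(size)) bucket
}

func get(size int) []byte {
	if size > 1<<maxShift {
		return make([]byte, size) // too large to pool
	}
	b := bucketFor(size)
	if v := classes[b].Get(); v != nil {
		return v.([]byte)[:size]
	}
	return make([]byte, size, 1<<(b+minShift))
}

func put(buf []byte) {
	c := cap(buf)
	if c < 1<<minShift || c > 1<<maxShift {
		return // outside pooled range; let the GC reclaim it
	}
	b := bits.Len(uint(c)) - 1 - minShift // floor(log2(cap)) bucket
	classes[b].Put(buf[:c])
}
```

Note that put rounds the capacity *down* to a bucket, so every buffer in a bucket is guaranteed to be large enough for any request served from it, which is what makes the `[:size]` reslice in get safe.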
Sorry for the late reply. More benchmark numbers would be good. One idea is to run with varying payload sizes and come up with a threshold for enabling buffer reuse. Ideally we should have a good default value, and also make it configurable (similar to this TODO). Also, the sibling PR to reuse buffers on the sending side was reverted a while ago (#3338) because of a detected race, and we haven't had time to roll it forward yet. The team has been busy with other priorities, and unfortunately will probably still not have much time to review this (sorry again). For now, don't feel urged to take action. We will send an update here once some of us have time. Thanks!
Is this pooling something that could be made opt-in?
Closing this PR as it's unlikely to get reviews soon (#3220 (comment)). I added a reference to this PR in the performance tracking issue, so we can get back to this in future performance work (#2816 (comment)).
A way to reuse memory on the read path. Partially resolves #1455.
A sync.Pool is used to reuse the []byte buffer that the parser's recvMsg method uses to read data from the I/O layer. I've tried to keep the changes to the code minimal.
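A minimal sketch of the approach, assuming gRPC's 5-byte message framing (1-byte compression flag followed by a 4-byte big-endian length); the names bufferPool, getBuffer, putBuffer and the simplified recvMsg below are illustrative, not the actual PR diff:

```go
package recvbuf

import (
	"encoding/binary"
	"io"
	"sync"
)

// bufferPool hands out byte slices that the caller returns once the
// message has been fully consumed (e.g. after unmarshaling).
var bufferPool = sync.Pool{
	New: func() interface{} { return make([]byte, 0, 32*1024) },
}

func getBuffer(size int) []byte {
	buf := bufferPool.Get().([]byte)
	if cap(buf) < size {
		return make([]byte, size) // pooled buffer too small; allocate
	}
	return buf[:size]
}

func putBuffer(buf []byte) {
	bufferPool.Put(buf[:0]) // reset length, keep capacity for reuse
}

// recvMsg mirrors the parser's read loop: read the 5-byte header,
// then read the payload into a pooled buffer instead of a fresh one.
func recvMsg(r io.Reader) ([]byte, error) {
	var header [5]byte
	if _, err := io.ReadFull(r, header[:]); err != nil {
		return nil, err
	}
	length := binary.BigEndian.Uint32(header[1:])
	buf := getBuffer(int(length))
	if _, err := io.ReadFull(r, buf); err != nil {
		putBuffer(buf)
		return nil, err
	}
	return buf, nil // caller must putBuffer(buf) when done with it
}
```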