
Reuse buffer to reads gRPC messages from the underlying reader #3220


Closed
wants to merge 5 commits

Conversation

alextomaili

@alextomaili alextomaili commented Dec 2, 2019

A way to reuse memory on the read path; partially resolves #1455.
A sync.Pool is used to reuse the []byte that the parser's recvMsg method uses to read data from the I/O layer. I've tried to make minimal changes to the code to provide this.
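A simplified sketch of the approach (illustrative only; the pool, threshold, and function names below are not the PR's actual code):

package recvpool

import (
	"encoding/binary"
	"io"
	"sync"
)

// msgBufPool reuses the []byte that the read path fills from the wire.
var msgBufPool = sync.Pool{
	New: func() interface{} { b := make([]byte, 0, 1024); return &b },
}

// readMessage reads one length-prefixed gRPC-style message (1 byte compression
// flag + 4 byte big-endian length + payload) into a pooled buffer, hands it to
// handle, and returns the buffer to the pool afterwards.
func readMessage(r io.Reader, handle func(pf byte, msg []byte) error) error {
	var hdr [5]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return err
	}
	length := int(binary.BigEndian.Uint32(hdr[1:]))

	bp := msgBufPool.Get().(*[]byte)
	defer msgBufPool.Put(bp) // the backing array is reused for the next message
	if cap(*bp) < length {
		*bp = make([]byte, length) // pooled buffer too small: grow it
	}
	msg := (*bp)[:length]
	if _, err := io.ReadFull(r, msg); err != nil {
		return err
	}
	return handle(hdr[0], msg)
}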

@alextomaili alextomaili changed the title Reuse recv buffer Reuse receive buffer Dec 2, 2019
@alextomaili alextomaili changed the title Reuse receive buffer Reuse buffer to reads gRPC messages from the underlying readerreader Dec 2, 2019
@alextomaili alextomaili changed the title Reuse buffer to reads gRPC messages from the underlying readerreader Reuse buffer to reads gRPC messages from the underlying reader Dec 2, 2019
@adtac
Contributor

adtac commented Dec 2, 2019

@alextomaili could you please run some benchmarks measuring the performance improvement? I'm particularly interested in the bytes allocated, QPS and latency stats. The reason I ask is, while the change may seem innocuous, I found a similar change to produce a lower memory footprint at the cost of a QPS/latency regression for a lot of common use-cases (e.g. small messages < 1 KiB in size) in my work to reuse Marshal byte buffers: #3167 - this is because of the overhead introduced by sync.Pool.

But a threshold similar to #3167 might not be straightforward because we don't know how big the message is going to be?

@adtac
Contributor

adtac commented Dec 2, 2019

I suggest measuring these stats for messages of size 2^n for n in [0, 15]. See benchmark/benchmain for more details.

@alextomaili
Author

> @alextomaili could you please run some benchmarks measuring the performance improvement? I'm particularly interested in the bytes allocated, QPS and latency stats. The reason I ask is, while the change may seem innocuous, I found a similar change to produce a lower memory footprint at the cost of a QPS/latency regression for a lot of common use-cases (e.g. small messages < 1 KiB in size) in my work to reuse Marshal byte buffers: #3167 - this is because of the overhead introduced by sync.Pool.
>
> But a threshold similar to #3167 might not be straightforward because we don't know how big the message is going to be?

You are right about the overhead of sync.Pool.
Before opening this pull request I looked at my project, where I have a large memory footprint from large requests and responses, and several thousand concurrent gRPC requests. Under those conditions memory allocation is more costly than pool access: the mutator thread has to help the GC with part of its work, and runtime.mallocgc can end up calling mcentral's grow function, allocating new memory pages, and so on.

But if the Go runtime is not under high GC pressure, i.e. the memory footprint per second is not that large, the overhead introduced by sync.Pool is more significant than the cost of allocation. In that case the pool causes a performance degradation.
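To see this trade-off in isolation, here is a standalone micro-benchmark sketch (not part of the PR; it only contrasts per-message allocation with a global sync.Pool):

package bufpool // in a _test.go file; run with: go test -bench=. -benchmem

import (
	"sync"
	"testing"
)

const bufSize = 32 * 1024

var (
	pool = sync.Pool{New: func() interface{} { b := make([]byte, bufSize); return &b }}
	sink []byte // keeps results reachable so the allocations are not optimized away
)

// BenchmarkAlloc allocates a fresh buffer per message, like the current read path.
func BenchmarkAlloc(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		sink = make([]byte, bufSize)
	}
}

// BenchmarkPool recycles a buffer through a global sync.Pool, like this PR.
func BenchmarkPool(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		bp := pool.Get().(*[]byte)
		sink = *bp
		pool.Put(bp)
	}
}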

maxConcurrentCalls_6-reqSize_1024B-respSize_1024B - degradation

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_40s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_6-reqSize_1024B-respSize_1024B-compressor_off-channelz_false-preloader_false
               Title       Before        After Percentage
            TotalOps       734872       658622   -10.38%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     20943.92     18722.08   -10.60%
           Allocs/op       182.38       180.19    -1.10%
             ReqT/op 150501785.60 134885785.60   -10.38%
            RespT/op 150501785.60 134885785.60   -10.38%
            50th-Lat    295.356µs    314.253µs     6.40%
            90th-Lat    433.068µs    526.701µs    21.62%
            99th-Lat    834.935µs   1.088915ms    30.42%
             Avg-Lat    325.959µs     363.72µs    11.58%

maxConcurrentCalls_6-reqSize_32768B-respSize_32768B - degradation

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_40s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_6-reqSize_32768B-respSize_32768B-compressor_off-channelz_false-preloader_false
               Title       Before        After Percentage
            TotalOps        62670        55289   -11.78%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op    387553.53    324762.04   -16.20%
           Allocs/op       204.25       203.03    -0.49%
             ReqT/op 410714112.00 362341990.40   -11.78%
            RespT/op 410714112.00 362341990.40   -11.78%
            50th-Lat   1.191082ms   1.134289ms    -4.77%
            90th-Lat   1.876727ms   1.853473ms    -1.24%
            99th-Lat 103.441306ms 103.502044ms     0.06%
             Avg-Lat   3.828599ms   4.350153ms    13.62%

But under high GC pressure (in this test, high concurrency) I got an improvement for large requests/responses.

maxConcurrentCalls_64-reqSize_1024B-respSize_1024B - still degradation

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_40s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_64-reqSize_1024B-respSize_1024B-compressor_off-channelz_false-preloader_false
               Title       Before        After Percentage
            TotalOps      1664220      1227733   -26.23%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     20391.44     18176.40   -10.86%
           Allocs/op       164.91       163.56    -0.61%
             ReqT/op 340832256.00 251439718.40   -26.23%
            RespT/op 340832256.00 251439718.40   -26.23%
            50th-Lat   1.355563ms   1.424376ms     5.08%
            90th-Lat   2.118802ms   2.260912ms     6.71%
            99th-Lat    3.30048ms    3.64462ms    10.43%
             Avg-Lat   1.537601ms   2.084266ms    35.55%

maxConcurrentCalls_64-reqSize_8192B-respSize_8192B - improvement

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_40s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_64-reqSize_8192B-respSize_8192B-compressor_off-channelz_false-preloader_false
               Title       Before        After Percentage
            TotalOps       539952       615913    14.07%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op    105707.60     86726.28   -17.96%
           Allocs/op       167.26       165.33    -1.20%
             ReqT/op 884657356.80 1009111859.20    14.07%
            RespT/op 884657356.80 1009111859.20    14.07%
            50th-Lat   3.072226ms   2.797397ms    -8.95%
            90th-Lat   4.450878ms   4.080753ms    -8.32%
            99th-Lat  105.25502ms 103.807479ms    -1.38%
             Avg-Lat   4.742041ms   4.155658ms   -12.37%

maxConcurrentCalls_64-reqSize_32768B-respSize_32768B - improvement

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_40s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_64-reqSize_32768B-respSize_32768B-compressor_off-channelz_false-preloader_false
               Title       Before        After Percentage
            TotalOps       217214       442775   103.84%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op    369395.87    287347.15   -22.21%
           Allocs/op       180.83       174.78    -3.32%
             ReqT/op 1423533670.40 2901770240.00   103.84%
            RespT/op 1423533670.40 2901770240.00   103.84%
            50th-Lat   7.512864ms   5.323876ms   -29.14%
            90th-Lat  10.593854ms   6.636968ms   -37.35%
            99th-Lat  112.86715ms  10.449085ms   -90.74%
             Avg-Lat  11.795354ms   5.781082ms   -50.99%

So the balance between pooling and plain allocation depends on GC pressure (how many bytes the application allocates and frees per second). In these tests GC pressure can be tuned not only by message size but also by maxConcurrentCalls and by network parameters that simulate the real world (for example latency): latency keeps more data in flight between sender and receiver, which raises GC pressure.

I've also run the test with a workload closer to my real project, and got an improvement.

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_5ms-kbps_0-MTU_0-maxConcurrentCalls_512-reqSize_512B-respSize_1B-compressor_off-channelz_false-preloader_false
               Title       Before        After Percentage
            TotalOps       319563       371869    16.37%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     11986.34     11444.61    -4.52%
           Allocs/op       163.58       161.87    -1.22%
             ReqT/op 130893004.80 152317542.40    16.37%
            RespT/op    255650.40    297495.20    16.37%
            50th-Lat  14.563851ms  13.454534ms    -7.62%
            90th-Lat  18.634558ms  16.081431ms   -13.70%
            99th-Lat  36.288989ms  22.068679ms   -39.19%
             Avg-Lat  16.021279ms  13.776757ms   -14.01%

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_5ms-kbps_0-MTU_0-maxConcurrentCalls_512-reqSize_512B-respSize_2B-compressor_off-channelz_false-preloader_false
               Title       Before        After Percentage
            TotalOps       341904       377656    10.46%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     11980.38     11434.52    -4.56%
           Allocs/op       163.55       162.12    -0.61%
             ReqT/op 140043878.40 154687897.60    10.46%
            RespT/op    547046.40    604249.60    10.46%
            50th-Lat  14.416785ms  13.356723ms    -7.35%
            90th-Lat  18.095852ms   15.85057ms   -12.41%
            99th-Lat  23.307488ms  19.795119ms   -15.07%
             Avg-Lat  14.979136ms  13.560647ms    -9.47%

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_5ms-kbps_0-MTU_0-maxConcurrentCalls_512-reqSize_512B-respSize_4B-compressor_off-channelz_false-preloader_false
               Title       Before        After Percentage
            TotalOps       342419       376522     9.96%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     11984.90     11440.06    -4.54%
           Allocs/op       163.79       162.06    -0.61%
             ReqT/op 140254822.40 154223411.20     9.96%
            RespT/op   1095740.80   1204870.40     9.96%
            50th-Lat  14.533073ms  13.378471ms    -7.94%
            90th-Lat   17.92624ms  15.723916ms   -12.29%
            99th-Lat  22.202938ms  19.503239ms   -12.16%
             Avg-Lat  14.953979ms  13.601179ms    -9.05%

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_5ms-kbps_0-MTU_0-maxConcurrentCalls_512-reqSize_512B-respSize_8B-compressor_off-channelz_false-preloader_false
               Title       Before        After Percentage
            TotalOps       340796       376465    10.47%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     12018.16     11460.66    -4.64%
           Allocs/op       163.76       162.03    -0.61%
             ReqT/op 139590041.60 154200064.00    10.47%
            RespT/op   2181094.40   2409376.00    10.47%
            50th-Lat   14.59463ms  13.403286ms    -8.16%
            90th-Lat  18.185229ms  15.782412ms   -13.21%
            99th-Lat  22.176101ms  19.410377ms   -12.47%
             Avg-Lat  15.031515ms  13.605416ms    -9.49%

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_5ms-kbps_0-MTU_0-maxConcurrentCalls_512-reqSize_512B-respSize_16B-compressor_off-channelz_false-preloader_false
               Title       Before        After Percentage
            TotalOps       339114       374975    10.57%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     12069.80     11493.27    -4.77%
           Allocs/op       163.74       162.00    -1.22%
             ReqT/op 138901094.40 153589760.00    10.57%
            RespT/op   4340659.20   4799680.00    10.57%
            50th-Lat  14.614218ms  13.387205ms    -8.40%
            90th-Lat    18.2816ms  16.043594ms   -12.24%
            99th-Lat  24.156692ms  19.860276ms   -17.79%
             Avg-Lat  15.107326ms  13.656498ms    -9.60%

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_5ms-kbps_0-MTU_0-maxConcurrentCalls_512-reqSize_512B-respSize_32B-compressor_off-channelz_false-preloader_false
               Title       Before        After Percentage
            TotalOps       346584       359070     3.60%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     12141.70     11581.28    -4.61%
           Allocs/op       163.69       161.86    -1.22%
             ReqT/op 141960806.40 147075072.00     3.60%
            RespT/op   8872550.40   9192192.00     3.60%
            50th-Lat  14.177739ms  13.521948ms    -4.63%
            90th-Lat  17.701229ms  16.476963ms    -6.92%
            99th-Lat  22.281718ms  24.634706ms    10.56%
             Avg-Lat  14.775105ms  14.257454ms    -3.50%

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_5ms-kbps_0-MTU_0-maxConcurrentCalls_512-reqSize_512B-respSize_64B-compressor_off-channelz_false-preloader_false
               Title       Before        After Percentage
            TotalOps       348289       384524    10.40%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     12308.63     11687.16    -5.05%
           Allocs/op       163.91       162.27    -0.61%
             ReqT/op 142659174.40 157501030.40    10.40%
            RespT/op  17832396.80  19687628.80    10.40%
            50th-Lat  14.256468ms  13.217909ms    -7.28%
            90th-Lat  17.536147ms  15.187689ms   -13.39%
            99th-Lat  21.222931ms  18.310036ms   -13.73%
             Avg-Lat  14.704441ms  13.319408ms    -9.42%

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_5ms-kbps_0-MTU_0-maxConcurrentCalls_512-reqSize_512B-respSize_128B-compressor_off-channelz_false-preloader_false
               Title       Before        After Percentage
            TotalOps       330041       374420    13.45%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     12648.78     11957.80    -5.46%
           Allocs/op       163.54       162.17    -0.61%
             ReqT/op 135184793.60 153362432.00    13.45%
            RespT/op  33796198.40  38340608.00    13.45%
            50th-Lat   14.92007ms  13.407448ms   -10.14%
            90th-Lat  18.911377ms  15.958007ms   -15.62%
            99th-Lat  28.221499ms  21.039375ms   -25.45%
             Avg-Lat  15.514947ms   13.67655ms   -11.85%

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_5ms-kbps_0-MTU_0-maxConcurrentCalls_512-reqSize_512B-respSize_256B-compressor_off-channelz_false-preloader_false
               Title       Before        After Percentage
            TotalOps       323812       369991    14.26%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     13369.58     12508.79    -6.44%
           Allocs/op       163.66       162.01    -0.61%
             ReqT/op 132633395.20 151548313.60    14.26%
            RespT/op  66316697.60  75774156.80    14.26%
            50th-Lat   15.40051ms  13.581621ms   -11.81%
            90th-Lat  19.433097ms  16.140994ms   -16.94%
            99th-Lat  24.836988ms  19.319386ms   -22.22%
             Avg-Lat  15.812784ms  13.840626ms   -12.47%

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_5ms-kbps_0-MTU_0-maxConcurrentCalls_512-reqSize_512B-respSize_512B-compressor_off-channelz_false-preloader_false
               Title       Before        After Percentage
            TotalOps       328578       355582     8.22%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     14792.76     13632.71    -7.84%
           Allocs/op       163.90       162.19    -0.61%
             ReqT/op 134585548.80 145646387.20     8.22%
            RespT/op 134585548.80 145646387.20     8.22%
            50th-Lat  15.237658ms  13.589742ms   -10.81%
            90th-Lat  18.900834ms  16.589147ms   -12.23%
            99th-Lat  24.081753ms   26.64683ms    10.65%
             Avg-Lat  15.585984ms  14.403926ms    -7.58%


unary-networkMode_none-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_5ms-kbps_0-MTU_0-maxConcurrentCalls_512-reqSize_512B-respSize_1024B-compressor_off-channelz_false-preloader_false
               Title       Before        After Percentage
            TotalOps       318251       346125     8.76%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     17730.37     15954.75   -10.02%
           Allocs/op       163.56       161.74    -1.22%
             ReqT/op 130355609.60 141772800.00     8.76%
            RespT/op 260711219.20 283545600.00     8.76%
            50th-Lat  15.615728ms  14.126063ms    -9.54%
            90th-Lat  19.666395ms  17.766488ms    -9.66%
            99th-Lat  24.868109ms  22.269329ms   -10.45%
             Avg-Lat  16.090872ms   14.79186ms    -8.07%

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_5ms-kbps_0-MTU_0-maxConcurrentCalls_512-reqSize_512B-respSize_2048B-compressor_off-channelz_false-preloader_false
               Title       Before        After Percentage
            TotalOps       331330       300934    -9.17%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     23565.03     20627.22   -12.47%
           Allocs/op       163.45       161.98    -1.22%
             ReqT/op 135712768.00 123262566.40    -9.17%
            RespT/op 542851072.00 493050265.60    -9.17%
            50th-Lat  15.048113ms  15.895586ms     5.63%
            90th-Lat  18.305359ms  19.945852ms     8.96%
            99th-Lat   24.00916ms   32.78962ms    36.57%
             Avg-Lat    15.4572ms   17.01176ms    10.06%

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_5ms-kbps_0-MTU_0-maxConcurrentCalls_512-reqSize_512B-respSize_4096B-compressor_off-channelz_false-preloader_false
               Title       Before        After Percentage
            TotalOps       281440       311958    10.84%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     36364.35     30506.04   -16.11%
           Allocs/op       163.13       161.47    -1.23%
             ReqT/op 115277824.00 127777996.80    10.84%
            RespT/op 922222592.00 1022223974.40    10.84%
            50th-Lat  17.843261ms  15.846567ms   -11.19%
            90th-Lat  21.142938ms  19.578716ms    -7.40%
            99th-Lat  25.764803ms  26.690153ms     3.59%
             Avg-Lat  18.194985ms  16.522516ms    -9.19%

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_5ms-kbps_0-MTU_0-maxConcurrentCalls_512-reqSize_512B-respSize_8192B-compressor_off-channelz_false-preloader_false
               Title       Before        After Percentage
            TotalOps       168330       181644     7.91%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     58966.56     48349.30   -18.01%
           Allocs/op       163.08       160.74    -1.84%
             ReqT/op  68947968.00  74401382.40     7.91%
            RespT/op 1103167488.00 1190422118.40     7.91%
            50th-Lat  30.069821ms  27.667035ms    -7.99%
            90th-Lat  35.946447ms  32.715381ms    -8.99%
            99th-Lat  42.342304ms   37.44183ms   -11.57%
             Avg-Lat  30.436577ms    28.2138ms    -7.30%

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_5ms-kbps_0-MTU_0-maxConcurrentCalls_512-reqSize_512B-respSize_16384B-compressor_off-channelz_false-preloader_false
               Title       Before        After Percentage
            TotalOps        98405        97621    -0.80%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op    103040.74     83020.55   -19.43%
           Allocs/op       166.33       162.15    -2.40%
             ReqT/op  40306688.00  39985561.60    -0.80%
            RespT/op 1289814016.00 1279537971.20    -0.80%
            50th-Lat  50.842655ms  51.712057ms     1.71%
            90th-Lat  55.893491ms  60.824624ms     8.82%
            99th-Lat  75.816268ms  73.247985ms    -3.39%
             Avg-Lat  52.158855ms  52.539495ms     0.73%

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_10s-trace_false-latency_5ms-kbps_0-MTU_0-maxConcurrentCalls_512-reqSize_512B-respSize_32768B-compressor_off-channelz_false-preloader_false
               Title       Before        After Percentage
            TotalOps        52926        59300    12.04%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op    184143.16    142777.34   -22.46%
           Allocs/op       167.09       164.99    -1.80%
             ReqT/op  21678489.60  24289280.00    12.04%
            RespT/op 1387423334.40 1554513920.00    12.04%
            50th-Lat   94.25564ms  83.992549ms   -10.89%
            90th-Lat 117.684338ms 109.912643ms    -6.60%
            99th-Lat 152.575617ms 158.677828ms     4.00%
             Avg-Lat  97.116674ms  86.566325ms   -10.86%

I think it is very difficult to know the GC pressure of an arbitrary real application and therefore to calculate a precise threshold for using the pool instead of plain allocation. But you are right that some limit can be used to allocate small buffers without the pool. I think it is better to expose a setting and let the user define this limit; 1024 looks like a good default value for this threshold.

In general the user should be able to

  • enable/disable memory pooling
  • define the range of buffer sizes that should be pooled: a lower bound to avoid sync.Pool overhead for small buffers, and an upper bound to avoid holding on to very large buffers
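For illustration, a minimal sketch of such a size-gated pool; minPooledSize and maxPooledSize stand in for the user-configurable limits and are not part of this PR:

package sizegate

import "sync"

// Hypothetical user-configurable bounds: buffers outside [minPooledSize,
// maxPooledSize] bypass the pool entirely.
var (
	minPooledSize = 1024
	maxPooledSize = 1 << 20
	bufPool       = sync.Pool{
		New: func() interface{} { b := make([]byte, 0, minPooledSize); return &b },
	}
)

// getBuf returns a buffer of length n, pooled only when n is inside the range.
func getBuf(n int) *[]byte {
	if n < minPooledSize || n > maxPooledSize {
		b := make([]byte, n) // small or huge buffers: plain allocation, the GC frees them
		return &b
	}
	bp := bufPool.Get().(*[]byte)
	if cap(*bp) < n {
		*bp = make([]byte, n)
	}
	*bp = (*bp)[:n]
	return bp
}

// putBuf recycles a buffer, dropping anything that is outside the range.
func putBuf(bp *[]byte) {
	if c := cap(*bp); c >= minPooledSize && c <= maxPooledSize {
		bufPool.Put(bp)
	}
}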

@adtac
Contributor

adtac commented Dec 4, 2019

@alextomaili thanks for the detailed benchmarks! These look very promising indeed.

Do you have any idea why the QPS benchmark is very erratic? It goes from 8.76% to -9.17% to 10.84% between 2^10 and 2^12. Similar issues around 2^13, 2^14 and 2^15. Could this just be measurement error? Does running the benchmark for longer than 10s make the variance go away?

Also, do you have any benchmarks with latency=0ms and max concurrency=1 for 2^{0..15} with reqSize=respSize? Using reqSize=512B in the tests skews it. For example, if there is a regression with reqSize=1B and respSize=1B, but an improvement with reqSize=2048B, we'll know that the threshold is somewhere in between.

The 1024B threshold for the send/Marshal case might not necessarily apply to the recv case that you're dealing with. We'll want to find the lowest threshold possible :)

@alextomaili
Author

> @alextomaili thanks for the detailed benchmarks! These look very promising indeed.
>
> Do you have any idea why the QPS benchmark is very erratic? It goes from 8.76% to -9.17% to 10.84% between 2^10 and 2^12. Similar issues around 2^13, 2^14 and 2^15. Could this just be measurement error? Does running the benchmark for longer than 10s make the variance go away?
>
> Also, do you have any benchmarks with latency=0ms and max concurrency=1 for 2^{0..15} with reqSize=respSize? Using reqSize=512B in the tests skews it. For example, if there is a regression with reqSize=1B and respSize=1B, but an improvement with reqSize=2048B, we'll know that the threshold is somewhere in between.
>
> The 1024B threshold for the send/Marshal case might not necessarily apply to the recv case that you're dealing with. We'll want to find the lowest threshold possible :)

I'll try to provide more tests in the near future.
Meanwhile, I want to ask about the general approach to using sync.Pool in this project.
Pull request #2813 introduced a pool of []byte and doesn't worry about the size of the pooled slices. In my project I can see high contention on this pool; the delay it introduces is comparable to the delay on the mutex that protects controlBuffer (which serializes the message flow).

[image: mutex profile]

As far as I understand, in the Go world the overhead of sync.Pool is considered acceptable for most cases; that's why Go doesn't ship a "lock-free" pool in the standard library.

What do you think about this?

@alextomaili
Author

@adtac how about using a "striped locks" pattern to reduce the overhead of sync.Pool?
We can have several sync.Pool instances instead of one global instance and select a pool round-robin.

type (
	recvBufferPools struct {
		current             uint32
		bucketCount         uint32
		initialBufferSize   int
		maxPooledBufferSize int
		pools               []*sync.Pool
	}
)

var recvBufferPool = newRecvBufferPools(recvBufferPoolBucketsCount, initialRecvBufferSize, maxPooledRecvBufferSize)

func newRecvBufferPools(bucketCount uint32, initialBufferSize int, maxPooledBufferSize int) *recvBufferPools {
	p := &recvBufferPools{
		bucketCount:         bucketCount,
		initialBufferSize:   initialBufferSize,
		maxPooledBufferSize: maxPooledBufferSize,
		pools:               make([]*sync.Pool, 0, bucketCount),
	}

	var i uint32
	for i = 0; i < p.bucketCount; i++ {
		p.pools = append(p.pools, &sync.Pool{
			New: func() interface{} {
				return &recvBuffer{
					buf: make([]byte, p.initialBufferSize),
				}
			},
		})
	}

	return p
}

func (p *recvBufferPools) getBuffer() *recvBuffer {
	i := atomic.AddUint32(&p.current, 1) % p.bucketCount
	b := p.pools[i].Get().(*recvBuffer)
	b.poolIndex = i
	return b
}

func (p *recvBufferPools) freeBuffer(b *recvBuffer) {
	if b != nil && b.cap() < p.maxPooledBufferSize && b.poolIndex < p.bucketCount {
		p.pools[b.poolIndex].Put(b)
	}
}

This can increase the probability of taking a pooled object via the fast path, and it can also reduce contention on the mutex inside sync.Pool when the fast path fails. Of course it also increases the probability of falling back to memory allocation when the selected pool is empty.

I've tried using 32 instances of sync.Pool and got positive results for some cases,

for example

maxConcurrentCalls_6-reqSize_8192B-respSize_8192B - one sync.Pool

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_40s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_6-reqSize_8192B-respSize_8192B-compressor_off-channelz_false-preloader_false
               Title       Before        After Percentage
            TotalOps       229956       214257    -6.83%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op    107606.77     90527.60   -15.87%
           Allocs/op       187.89       186.29    -0.53%
             ReqT/op 376759910.40 351038668.80    -6.83%
            RespT/op 376759910.40 351038668.80    -6.83%
            50th-Lat    475.679µs    459.293µs    -3.44%
            90th-Lat    912.167µs    922.482µs     1.13%
            99th-Lat   1.699435ms   1.868889ms     9.97%
             Avg-Lat   1.042933ms   1.119393ms     7.33%

maxConcurrentCalls_6-reqSize_8192B-respSize_8192B - 32 instances of sync.Pool

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_40s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_6-reqSize_8192B-respSize_8192B-compressor_off-channelz_false-preloader_false
               Title       Before        After Percentage
            TotalOps       229956       370150    60.97%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op    107606.77    113474.69     5.45%
           Allocs/op       187.89       191.77     2.13%
             ReqT/op 376759910.40 606453760.00    60.97%
            RespT/op 376759910.40 606453760.00    60.97%
            50th-Lat    475.679µs    455.601µs    -4.22%
            90th-Lat    912.167µs    967.353µs     6.05%
            99th-Lat   1.699435ms   2.158552ms    27.02%
             Avg-Lat   1.042933ms    647.638µs   -37.90%

A significant improvement here.

But for some cases I got a small degradation:

maxConcurrentCalls_6-reqSize_1024B-respSize_1024B - one sync.Pool

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_40s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_6-reqSize_1024B-respSize_1024B-compressor_off-channelz_false-preloader_false
               Title       Before        After Percentage
            TotalOps       734872       658622   -10.38%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     20943.92     18722.08   -10.60%
           Allocs/op       182.38       180.19    -1.10%
             ReqT/op 150501785.60 134885785.60   -10.38%
            RespT/op 150501785.60 134885785.60   -10.38%
            50th-Lat    295.356µs    314.253µs     6.40%
            90th-Lat    433.068µs    526.701µs    21.62%
            99th-Lat    834.935µs   1.088915ms    30.42%
             Avg-Lat    325.959µs     363.72µs    11.58%

maxConcurrentCalls_6-reqSize_1024B-respSize_1024B - 32 instances of sync.Pool

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_40s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_6-reqSize_1024B-respSize_1024B-compressor_off-channelz_false-preloader_false
               Title       Before        After Percentage
            TotalOps       734872       652726   -11.18%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     20943.92     21520.63     2.75%
           Allocs/op       182.38       182.10     0.00%
             ReqT/op 150501785.60 133678284.80   -11.18%
            RespT/op 150501785.60 133678284.80   -11.18%
            50th-Lat    295.356µs    310.461µs     5.11%
            90th-Lat    433.068µs    519.426µs    19.94%
            99th-Lat    834.935µs   1.131208ms    35.48%
             Avg-Lat    325.959µs    366.967µs    12.58%

A small degradation here.

I think this is not a bad idea if we can't easily attach a sync.Pool to some higher-level abstraction such as a "connection" or "session" (each "connection" or "session" could use its own pool, so we would have less contention per pool than with one global pool).

If we want to use a pool deep inside a low-level abstraction such as the "parser" or "codec", we can use a ring of several pools to reduce the overhead.

What do you think about this?

@tigrato

tigrato commented Dec 9, 2019

@alextomaili did you consider using multiple sync.Pools, one per desired size?
Right now, if the desired capacity is bigger than the pooled object, you pull the object from the pool, check its size, and then allocate a new slice with the desired size anyway.
One simple way to solve this is to use 2^n-sized pools, where for a given desired capacity you choose the pool whose objects are large enough.
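For illustration, a sketch of that size-class scheme (bucket chosen by next power of two; the constants and names are hypothetical):

package sizeclass

import (
	"math/bits"
	"sync"
)

// Size-class pools: bucket k holds buffers of capacity 1<<k, up to 64 KiB.
const maxClass = 16 // 1<<16 = 64 KiB

var classPools [maxClass + 1]sync.Pool

func init() {
	for k := range classPools {
		size := 1 << uint(k)
		classPools[k].New = func() interface{} { b := make([]byte, size); return &b }
	}
}

// class returns the smallest k with 1<<k >= n.
func class(n int) int {
	if n <= 1 {
		return 0
	}
	return bits.Len(uint(n - 1))
}

// getBuf returns a buffer of length n with capacity >= n, falling back to a
// plain allocation when n is larger than the biggest size class.
func getBuf(n int) *[]byte {
	k := class(n)
	if k > maxClass {
		b := make([]byte, n)
		return &b
	}
	bp := classPools[k].Get().(*[]byte)
	*bp = (*bp)[:n]
	return bp
}

// putBuf returns a buffer to the pool of its size class (exact powers of two only).
func putBuf(bp *[]byte) {
	c := cap(*bp)
	if c == 0 || c&(c-1) != 0 || class(c) > maxClass {
		return // not a pooled size class; let the GC take it
	}
	classPools[class(c)].Put(bp)
}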

@alextomaili
Author

> @alextomaili did you consider using multiple sync.Pools, one per desired size?
> Right now, if the desired capacity is bigger than the pooled object, you pull the object from the pool, check its size, and then allocate a new slice with the desired size anyway.
> One simple way to solve this is to use 2^n-sized pools, where for a given desired capacity you choose the pool whose objects are large enough.

A solution with 2^n-sized pools will behave the same as a single pool if most requests or responses have the same size (or sizes close enough to fall into the same 2^n bucket): the same pool will be selected for most RPC calls and we will again have contention on that pool, so the performance test will fail. In the real world an application may have a workload with varying or with uniform request/response sizes; it depends on the particular application.

I'm not sure, but I think the best way is to have some "allocator" abstraction. The parser, codec, and similar components would acquire a buffer from the allocator and return the buffer when it is no longer needed. The user could provide an allocator with their own logic, whatever is acceptable for their application. The default allocator could just allocate a byte buffer and never reuse it, so the default would not break existing behavior, while the user would still be able to provide custom logic to reuse memory.
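A sketch of what such an allocator abstraction could look like (the interface and names are hypothetical, not part of this PR):

package allocator

import "sync"

// Allocator is a hypothetical hook the parser/codec would use for message buffers.
type Allocator interface {
	Get(n int) *[]byte // returns a buffer with len n
	Put(*[]byte)       // called once the message has been consumed
}

// heapAllocator is the default: plain allocation, nothing reused, so existing
// behavior is unchanged.
type heapAllocator struct{}

func (heapAllocator) Get(n int) *[]byte { b := make([]byte, n); return &b }
func (heapAllocator) Put(*[]byte)       {}

// poolAllocator is an opt-in implementation backed by sync.Pool.
type poolAllocator struct{ pool sync.Pool }

func newPoolAllocator() *poolAllocator {
	return &poolAllocator{pool: sync.Pool{
		New: func() interface{} { b := make([]byte, 0, 1024); return &b },
	}}
}

func (a *poolAllocator) Get(n int) *[]byte {
	bp := a.pool.Get().(*[]byte)
	if cap(*bp) < n {
		*bp = make([]byte, n)
	}
	*bp = (*bp)[:n]
	return bp
}

func (a *poolAllocator) Put(bp *[]byte) { a.pool.Put(bp) }

// Compile-time checks that both implementations satisfy the interface.
var (
	_ Allocator = heapAllocator{}
	_ Allocator = (*poolAllocator)(nil)
)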

@menghanl menghanl assigned menghanl and unassigned adtac Jan 9, 2020
@menghanl
Contributor

menghanl commented Apr 3, 2020

Sorry for the late reply.

More benchmark numbers would be good. One idea is to run with Varying Payload Sizes, and come up with a threshold to enable buffer reuse. Ideally we should have a good default value, and also make it configurable (Similar to this TODO).

Also, the sibling PR to reuse buffers on the sending side was reverted a while ago (#3338) because of a detected race, and we haven't had time to roll it forward yet.
This PR is a little simpler, but we should try to avoid that as well.

The team has been busy with other priorities, and unfortunately will probably still not have much time to review this (sorry again). For now, don't feel urged to take actions. We will send an update here once some of us have time. Thanks!

@alextomaili
Author

alextomaili commented May 3, 2020

Benchmark with Varying Payload Sizes
I've used this histogram as -reqPayloadCurveFiles and -respPayloadCurveFiles

1,128,20
129,512,20
513,2048,20
2049,16384,20
16385,65536,20

bench time 60s

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_5ms-kbps_0-MTU_0-maxConcurrentCalls_512-reqPayloadCurve_1dbe98c2-respPayloadCurve_1dbe98c2-compressor_off-channelz_false-preloader_false
               Title       Before        After Percentage
            TotalOps       776112       909811    17.23%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op    140347.89    115157.40   -17.95%
           Allocs/op       166.54       163.61    -1.80%
             ReqT/op         0.00         0.00      NaN%
            RespT/op         0.00         0.00      NaN%
            50th-Lat  37.618762ms  30.611973ms   -18.63%
            90th-Lat  48.710401ms  41.595845ms   -14.61%
            99th-Lat  78.041103ms 141.249535ms    80.99%
             Avg-Lat  39.584339ms  33.793715ms   -14.63%
           GoVersion     go1.12.1     go1.12.1
         GrpcVersion   1.30.0-dev   1.30.0-dev

The results look good, but we have a degradation in the 99th percentile; this is the result of additional contention on the sync.Pool.

bench-time 300s

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_5m0s-trace_false-latency_5ms-kbps_0-MTU_0-maxConcurrentCalls_512-reqPayloadCurve_1dbe98c2-respPayloadCurve_1dbe98c2-compressor_off-channelz_false-preloader_false
               Title       Before        After Percentage
            TotalOps      5245298      5569002     6.17%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op    136012.60    113086.90   -16.86%
           Allocs/op       165.71       163.31    -1.21%
             ReqT/op         0.00         0.00      NaN%
            RespT/op         0.00         0.00      NaN%
            50th-Lat  27.976065ms  26.186599ms    -6.40%
            90th-Lat  35.392111ms  33.768794ms    -4.59%
            99th-Lat  49.906611ms  51.804034ms     3.80%
             Avg-Lat  29.281373ms  27.578538ms    -5.82%
           GoVersion     go1.12.1     go1.12.1
         GrpcVersion   1.30.0-dev   1.30.0-dev

These results also look good, though the 99th percentile of latency is again worse.

About the 99th percentile of latency: of course sync.Pool brings additional contention, and it may affect the 99th percentile, but this contention is not a major factor for the given benchmark.

I've obtained a mutex profile during this benchmark:

[image: pool-contention mutex profile]
A - contention on my pool
B - contention on the existing pool in the transport layer
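For reference, a mutex profile like this can be collected from a Go process via the standard runtime/pprof API (a generic sketch, not necessarily how the benchmark harness does it):

package main

import (
	"os"
	"runtime"
	"runtime/pprof"
)

func main() {
	// Sample, on average, one of every 5 mutex contention events.
	runtime.SetMutexProfileFraction(5)
	defer runtime.SetMutexProfileFraction(0)

	runWorkload() // the code under test, e.g. the gRPC benchmark client/server

	f, err := os.Create("mutex.pprof")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	// Write the accumulated mutex profile; inspect with `go tool pprof mutex.pprof`.
	if err := pprof.Lookup("mutex").WriteTo(f, 0); err != nil {
		panic(err)
	}
}

func runWorkload() { /* placeholder for the benchmark workload */ }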

@kevinconaway

Is this pooling something that could be opt-in via DialOption or something similar so that it can be enabled when needed but off by default?
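For illustration, such an opt-in could follow the usual functional-option pattern; here is a self-contained sketch (WithRecvBufferPool and recvBufferPoolConfig are hypothetical names, not existing grpc-go API):

package pooloption

// recvBufferPoolConfig is a hypothetical per-connection setting: pooling stays
// off unless an option turns it on, matching the "off by default" request.
type recvBufferPoolConfig struct {
	enabled          bool
	minSize, maxSize int // only buffers in [minSize, maxSize] would be pooled
}

// Option mutates the config, in the style of a grpc.DialOption.
type Option func(*recvBufferPoolConfig)

// WithRecvBufferPool enables receive-buffer pooling for sizes in [min, max].
func WithRecvBufferPool(min, max int) Option {
	return func(c *recvBufferPoolConfig) {
		c.enabled, c.minSize, c.maxSize = true, min, max
	}
}

// newConfig applies the options; with no options the zero value keeps today's
// allocate-per-message behaviour.
func newConfig(opts ...Option) *recvBufferPoolConfig {
	c := &recvBufferPoolConfig{}
	for _, o := range opts {
		o(c)
	}
	return c
}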

@menghanl
Contributor

menghanl commented May 3, 2021

Closing this PR as it's unlikely to get reviews soon (#3220 (comment)). I added a reference to this PR in the performance tracking issue, so we can get back to this in future performance work (#2816 (comment))
Thanks for the contribution, and sorry again for the trouble.

@menghanl menghanl closed this May 3, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 31, 2021