Grizzly provider TimeoutException making async requests #328
Comments
Investigating now...
Root cause: the request-in-flight idle timeout is being triggered while a connection is sitting idle in the pool. Good catch and thanks for logging the issue.
Interesting. Just so it's clear: when the connection is currently not being used for a request (so it's sitting in a pool of available connections), that's when the idle timeout is getting called? Thanks for looking into it. I believe the Netty provider may have the same problem.
Well, that's one part of it. The wrong timeout is being called, but digging more, it's actually not the root cause. Will be getting back to investigating this morning.
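For reference, the idle-in-pool timeout and the in-flight request timeout are separate settings on AsyncHttpClientConfig. A minimal sketch using the same builder methods that appear in the test code later in this thread (values are illustrative only):

import org.asynchttpclient.AsyncHttpClientConfig;

AsyncHttpClientConfig config = new AsyncHttpClientConfig.Builder()
        .setConnectionTimeoutInMs(9000)           // time allowed to establish a connection
        .setRequestTimeoutInMs(9000)              // time allowed for a request/response in flight
        .setIdleConnectionInPoolTimeoutInMs(3000) // time an unused connection may sit in the pool
        .build();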
…se positives are present however with these changes and recent updates to Grizzly.
Okay, cleaned up the code that would yield some false positives. That said, the issue persisted. However, I believe I've finally narrowed the problem down. The same test case doesn't fail with the Netty provider, and this appears to be due to the fact that, by default, Netty doesn't do async connections whereas Grizzly does. If I enable async connections on the Netty provider, then I see the same result as with the Grizzly provider. I'll be making some changes shortly to support configurable connection behavior.
- Introduce async connect mode configuration option within AsyncHttpClientConfig. Defaults to false meaning connections will be made synchronously.
- Removed Netty-specific version of async connect mode configuration option.
- Updated both providers to respect new configuration option.
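If the option surfaced as a builder flag, usage would look roughly like the sketch below; the method name is an assumption on my part and is not confirmed anywhere in this thread.

import org.asynchttpclient.AsyncHttpClientConfig;

AsyncHttpClientConfig config = new AsyncHttpClientConfig.Builder()
        .setAsyncConnectMode(false) // assumed method name; false = synchronous connects (the default)
        .build();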
Mind pulling the latest updates from master and verifying resolution on your end?
Sure, I'm trying it now...
It doesn't build... looks like you need to change providers/netty/src/test/java/org/asynchttpclient/providers/netty/RetryNonBlockingIssue.java from: to:
Fixed, thanks.
Beginning warmup...
.. then, in another benchmark run, I saw:
java.io.IOException: Remotely Closed
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "AsyncHttpClient-Callback"
...
I can give you a copy of the oome-heap-dump.hprof file if you'd like (if so, how?). I'm going to inspect it as well.
Interesting. I'd recommend using something like Dropbox to share the dump.
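As an aside, heap dumps can also be captured on demand from inside the JVM via the HotSpot diagnostic MXBean (in addition to running with -XX:+HeapDumpOnOutOfMemoryError); a minimal sketch on a HotSpot JVM:

import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class HeapDumper {
    // Writes an .hprof snapshot of the current heap; live=true dumps only reachable objects.
    public static void dump(String path) throws Exception {
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(path, true);
    }
}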
It's possible there are memory leak(s), having to do with Grizzly BufferChunk, ByteChunk, CharChunk, and/or DataChunk...
Instance Counts for All Classes (excluding platform):
156126 instances of class org.glassfish.grizzly.http.util.BufferChunk
Heap Histogram (rows not captured): Class / Instance Count / Total Size
Okay, thanks. I'll be looking into it shortly.
…done with it. When testing 8000 connections, this reduces the count of *Chunk instances from the 300K count to the low 20Ks.
Committed some changes that should improve the memory profile.
Unfortunately, there's still some thrash with HttpRequestPacketImpl. The ThreadCache it uses to reduce the overhead of creating these instances doesn't work with AHC, as the thread it's created on is different from the one it's returned to the cache on, so when the main thread goes to create a new one, it's always a cache miss. Need to look into a workaround.
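To illustrate the cross-thread recycling problem (this is not Grizzly's actual ThreadCache, just a hypothetical sketch of the pattern): a per-thread free list only helps if the thread that allocates an object is also the one that recycles it.

import java.util.ArrayDeque;
import java.util.Deque;

public class ThreadLocalPool<T> {

    private final ThreadLocal<Deque<T>> freeList = new ThreadLocal<Deque<T>>() {
        @Override
        protected Deque<T> initialValue() {
            return new ArrayDeque<T>();
        }
    };

    // Called by the allocating thread; returns null (a cache miss) unless this same
    // thread previously recycled an instance.
    public T poll() {
        return freeList.get().poll();
    }

    // If this is called from an I/O worker thread, the instance lands on that worker's
    // list, so the request thread's next poll() is still a miss and a new object is allocated.
    public void recycle(T instance) {
        freeList.get().push(instance);
    }
}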
… we can't rely on Grizzly ThreadLocal caching.
So, thinking about this more, this did find a problem with the underlying objects not being recycled properly. However, I think the test case could be re-written to provide a more realistic memory/GC profile. Maybe I'm wrong, but I don't think any application is going to execute thousands of requests, stash them in a list, and then later loop through the list. For example, if I use your test code and try to execute 200,000 requests, the test will OOM. However, if I re-write the test like so:

public static void main(String[] args) {
AsyncHttpClientConfig config = new AsyncHttpClientConfig.Builder()
.setMaximumConnectionsTotal(-1)
.setMaximumConnectionsPerHost(4500)
.setCompressionEnabled(false)
.setAllowPoolingConnection(true /* keep-alive connection */)
// .setAllowPoolingConnection(false /* no keep-alive connection */)
.setConnectionTimeoutInMs(9000).setRequestTimeoutInMs(9000)
.setIdleConnectionInPoolTimeoutInMs(3000).build();
final AsyncHttpClient client =
new AsyncHttpClient(new GrizzlyAsyncHttpProvider(config), config);
final BlockingQueue<Future<Response>> requestQueue =
new LinkedBlockingQueue<Future<Response>>();
final int warmupRequests = 200000;
final CountDownLatch latch = new CountDownLatch(2);
final Runnable producer = new Runnable() {
@Override
public void run() {
final String testUrl = "http://localhost:8080";
for (int i = 0; i < warmupRequests; i++) {
try {
requestQueue.add(client.prepareGet(testUrl).execute());
} catch (IOException e) {
System.err.println("Failed to execute get at iteration #" + i);
}
}
latch.countDown();
}
};
final Runnable consumer = new Runnable() {
@Override
public void run() {
int counter = 0;
while (counter < warmupRequests) {
try {
Future<Response> responseFuture =
requestQueue.take();
counter++;
responseFuture.get();
} catch (Exception e) {
System.err.println("Failed to execute get at iteration #" + counter);
e.printStackTrace();
client.close();
System.exit(1);
}
}
latch.countDown();
}
};
Thread t1 = new Thread(producer);
Thread t2 = new Thread(consumer);
t1.start();
t2.start();
try {
latch.await();
} catch (InterruptedException e) {
e.printStackTrace();
}
System.out.println("Test complete...");
client.close();
}

I'm able to execute all of the requests without issue, no major GCs, and the heap profile over time looks much better.
Just checking back if we're good on this or not.
I didn't have time to try this today, so I'll try it in the morning. But it does look like another reasonable way to benchmark it, and if it works, it should probably get documented somewhere. I will let you know as soon as I try it (thanks!).
Sorry this took me a while to thoroughly try and get back to you, but here's what it's doing now, with your new test client code.
Using 100 threads, 10 requests per thread, 10 batches, all of the requests consistently complete with zero errors.
Using 100 threads, 40 requests per thread, 10 batches, I first get lots of these:
java.util.concurrent.ExecutionException: java.net.ConnectException: Cannot assign requested address
.. then I get lots of these:
java.net.ConnectException: Connection refused
.. then a bunch of these:
java.io.IOException: Remotely Closed
.. then, the benchmark test does not complete, it just hangs. Then when I do a full JVM stack dump, I find lots of these:
"pool-1-thread-3" prio=10 tid=0x00007fdbd0005000 nid=0x5815 waiting on condition [0x00007fdbd4da9000]
"pool-1-thread-2" prio=10 tid=0x00007fdbd0003000 nid=0x5814 waiting on condition [0x00007fdbd4eaa000]
"pool-1-thread-1" prio=10 tid=0x00007fdbd0002000 nid=0x5813 waiting on condition [0x00007fdbd4fab000]
Just for comparison, I tried the same kind of test client using the Netty provider, and it also has trouble with the same number of threads and requests, hanging all of its threads like this:
"New I/O worker #1" daemon prio=10 tid=0x00007fc514371800 nid=0x7162 runnable [0x00007fc50ad64000]
Also, after your recent changes to async-http-client, I get a lot of these:
java.io.IOException: Remotely Closed
.. followed by these:
java.io.IOException: closed
.. followed by these:
java.io.IOException: The pool is closed
.. followed by these:
Failed to execute get at iteration #211
.. followed by these:
java.util.concurrent.TimeoutException: Timeout exceeded
.. followed by these:
Failed to execute get at iteration #1878
.. and the test hangs there. Looking at a full JVM stack dump, I find threads doing this:
"pool-1-thread-2" prio=10 tid=0x00007f11ec037000 nid=0x1a9f waiting on condition [0x00007f11eb1f0000]
"pool-1-thread-1" prio=10 tid=0x00007f11ec035800 nid=0x1a9e waiting on condition [0x00007f11eb2f1000]
"connection-pool-delays-thread-pool(1)" daemon prio=10 tid=0x00007f122c57e000 nid=0x1a87 waiting on condition [0x00007f1220ab0000]
"Grizzly-IdleTimeoutFilter-IdleCheck" daemon prio=10 tid=0x00007f122c417800 nid=0x1a86 waiting on condition [0x00007f1222283000]
Sorry for the delay. I've been on vacation. Will be returning to this issue this week.
Quite alright. I'm still here. Welcome back.
Well, after looking at the wire traffic, I've come to the conclusion that my hardware can't handle 4500 simultaneous connections. I've captured the wire traffic for a timed-out connection from two vantage points: Grizzly's and Wireshark's. From Grizzly's perspective, there is one complete request/response cycle on the connection followed by another request, but the entire response isn't received. Wireshark's view of the traffic confirms Grizzly's. However, Wireshark shows a lot of congestion (retransmissions from the server to the client). The server stops sending data for 9 seconds and then the client closes the connection (per the timeout settings). I've uploaded a tcpdump to Dropbox which can be loaded in Wireshark and shows the packets for the stream in question: https://www.dropbox.com/s/455mmno6jnuymyk/dump.pcap.gz My testing is being done over a wireless connection to a remote server. I plan to move my testing to a two-machine setup on our corporate LAN here shortly. I'm curious, can you share details of your testing environment?
First, I agree you can't successfully perform 4500 concurrent HTTP requests, each over their own connection, over wireless. :) Next, over a wired connection that I believe has plenty of bandwidth, with AHC+Grizzly I see the above problems, and with AHC+Netty I see similar problems, but with the Jetty 8 client or Jetty 9 client I do not see those problems. With that same level of concurrency and the same number of requests, to the same server on another machine that I was using with AHC+Grizzly and AHC+Netty, the requests all succeed with a 200 status code, and the average response time is pretty low (~4ms). I can share any details you're wondering about the test environment. Here's a brief description, but let me know what else you're wondering: 2 OpenStack KVM VMs, each running 64-bit Ubuntu Linux 12.04. Lots of time/attention was spent by our internal cloud team to make sure that the network connectivity between these VMs performs well and has no shortage of bandwidth. On the server VM:
On the client VM:
I switched back and forth between Tomcat and Apache HTTPD as the server, benchmarking and comparing times just to be sure that the Tomcat server is responding optimally over HTTP, and it is. Incidentally, I ended up with the same client performance regardless of which of these two servers I had answering. With the Jetty 8 client and Jetty 9 client doing async requests on OpenJDK 1.6 and 1.7, with 100 threads, 40 requests per thread, 10 batches, the Jetty client is consistently getting through all requests successfully (no errors) with an average response time of ~4ms. Then, running that same test with AHC+Grizzly or AHC+Netty, I consistently get at least hundreds of errors and an average response time of anywhere from ~6ms to ~12ms. Nothing changed here other than the client being tested.
Hi, would it be possible for you to share your whole client code (both AHC and Jetty), please? In #328 (comment), you wrote that you had stacktraces coming from JDKFuture, which should not be possible (JDKFuture should only be used by the JDKProvider); is this a mistake? Cheers, Stéphane
Any news on the release of the benchmark code?
I'm busy with end-of-sprint activities over here until Monday evening, but I have done the bulk of the work to prepare the code to be released. Once I'm done preparing it, it has to go through Legal, and that usually takes about a day. So, I'm guessing I can show it to you about mid-week next week.
Okay, eBay's version of the client benchmark suite is now available here: https://github.com/eBay/web-client-benchmark Let me know if you have any questions about it...
Okay, my counterpart and I have had time to run this benchmark on some beefier equipment. The benchmarks, as they currently stand, aren't quite apples-to-apples. The main problem we see is that the Jetty/Apache benchmarks use around 64 connections, while the Grizzly/Netty benchmarks are configured to use 4500. We do see the timeout occur with Grizzly + 4500 connections when requesting the 500k.json content. However, if we reduce the number of active connections to be in line with Jetty (i.e., 64), then the benchmark is successful. Unfortunately, this is not the case for the Netty provider. The Jetty client is around 150 req/s faster in our tests (in the large request test, it's about a 12-request difference). I suspect part of the difference is due to the fact that we have to spend some cycles translating Grizzly objects to the AHC equivalents (headers, cookies, etc.). However, watching the benchmark, there are a lot of GCs, including several full GCs (same for the Netty provider when it's running). This is something we'll be investigating more in-depth in the near future. Thoughts? Comments?
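For reference, narrowing the AHC side to match the Jetty/Apache benchmarks would just be a configuration change; a sketch reusing the builder methods already shown in this thread:

import org.asynchttpclient.AsyncHttpClientConfig;

AsyncHttpClientConfig config = new AsyncHttpClientConfig.Builder()
        .setMaximumConnectionsTotal(64)     // align with the ~64 connections the Jetty/Apache tests use
        .setMaximumConnectionsPerHost(64)
        .build();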
Stéphane: It looks like the Netty provider is ignoring the max-connections-per-host configuration. When we set it to 64, there were 6000+ open connections.
Another difference I'm observing is that the response isn't being consumed in the Jetty9 largeAsyncResponse test case. The test uses a listener that is only notified when the request/response is complete. If you look at the largeSyncResponse test, it's using ContentResponse which will consume/buffer the response. In AHC's case, the response will be buffered by default unless a custom listener is used.
@rlubke Yep, the Netty pool is quite buggy... @jbrittain I agree with Ryan, not storing the response chunks could make a big difference on GC activity as
I refactored my copy of the benchmark tests a bit so that each one created a String containing the response body once the response was complete. I believe that would make all of the benchmark tests consume the response. Now I'm re-running the tests, and the main thing I'm seeing different right now is: after I updated to the latest async-http-client 2.0.0-SNAPSHOT sources, Grizzly 2.3.6-SNAPSHOT, and Netty 3.6.6.Final, I am now getting enough exceptions when I try to use either Grizzly or Netty that my benchmark tests hang and don't complete. I did try setting different timeouts and different max-connections settings, and the ones I had earlier appear to work best, but the benchmarks still hang. In the same setup, the Jetty 9 client (now also building a response String) completes all benchmarks, and encounters only a small number of errors in the testSyncLargeResponses test. During the Jetty 9 benchmark runs, I am now seeing quite a bit more GC activity, which is what I'd expect now that it's building and dereferencing response strings. This is on Linux with OpenJDK 1.7.0-03. Running with the Grizzly provider, here are a couple of the exceptions I'm now seeing:
java.util.concurrent.TimeoutException: Timeout exceeded
Hi Jason, could you push your updates to your GitHub repo, please?
@slandelle: Okay, it's now committed here:
I think it's sufficient to update the Jetty9 test to use the BufferingResponseListener without having to update all of the tests to actually produce a body. The key difference was that in the Jetty9 test cases, the sync version internally buffered the response chunks and the async version did not. So there was a massive difference in GC behavior between the two tests. In the case of AHC, unless you tell AHC otherwise, it will buffer the content internally. My two cents: remove the creation of the body strings.
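A sketch of what that Jetty9 change could look like, assuming the stock Jetty 9 client API; the URL is illustrative and error handling is reduced to "throws Exception":

import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.api.Result;
import org.eclipse.jetty.client.util.BufferingResponseListener;

public class Jetty9BufferedAsyncExample {
    public static void main(String[] args) throws Exception {
        HttpClient jettyClient = new HttpClient();
        jettyClient.start();
        jettyClient.newRequest("http://localhost:8080/500k.json")
                .send(new BufferingResponseListener() {
                    @Override
                    public void onComplete(Result result) {
                        if (result.isSucceeded()) {
                            // The listener buffered the body, mirroring what ContentResponse
                            // does for the sync test, so both paths now consume the response.
                            System.out.println("received " + getContentAsString().length() + " chars");
                        }
                    }
                });
    }
}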
Regarding this:
We're not able to reproduce this on our end, though we're not running the Grizzly2Benchmark, which I think should be excluded or re-written. It was intended as a quick and dirty example of how to run thousands of requests without storing them in a Collection. However, if an error occurs on the consumer, it will close the client, and the producer isn't checking whether the client is open or not, so you see IOExceptions about the pool being closed. Regarding the Grizzly side of things, we'll need more info. Have you verified that only 64 connections are being opened? What are some of the other errors you're seeing, or are they all timeouts? Have you tried adding headers to the request so you can correlate the request/response via Wireshark (see the sketch below)? Any additional information to help us out?
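A sketch of the header idea, reusing the client, testUrl, and requestQueue names from the test code earlier in this thread; the header name is just an illustrative choice, and exception handling is omitted:

for (int i = 0; i < warmupRequests; i++) {
    // Tag each request so a timed-out exchange can be matched up with its packets in Wireshark.
    requestQueue.add(client.prepareGet(testUrl)
            .addHeader("X-Request-Id", String.valueOf(i))
            .execute());
}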
Okay, that's very interesting. If there's that much difference between your setup and my setup, then I'm interested in your average benchmark response times of the testAsyncLargeResponses test of AHC vs. Jetty 9, and I'm also interested in hearing about what OS and hardware that is (two separate machines: one server and one client? Bare metal or VMs? CPUs, RAM, etc). I will have a look at the things you mentioned also.
Jason,
There's a small snag on my end and I can't access the machines to gather the sys info. As soon as said snag has been cleared, I'll follow up.
System info:
Box A:
Box B:
Ryan: Okay, and neither of these is a virtual machine? Bare metal running Linux? Also, you're running the benchmark client on one, then a web server on the other (separate) machine? If all of that is the case, then the main difference appears to be that we're running on OpenStack KVM virtualized VMs.
Correct.
Also correct.
I'll have to see if there are some VMs we can run the tests on here.
Hi, |
@vmalladi are you running on a VM?
@jbrittain Out of curiosity, have you contacted the OpenStack folks to see if there's any VM tuning that may be performed to increase I/O performance?
@rlubke I have not yet. Our next step here is to run the same benchmark on our VMware-hypervised VMs. I'm within a couple of days of getting the results there. That would show us whether or not it's one specific type of VM. Our OpenStack VMs aren't that well tuned at this point, and our VMware VMs are. I will update you here as soon as I know the results.
@rlubke No, not in a VM; it's an Ubuntu desktop used for development. Things have come down with the latest Grizzly 2.3.6-SNAPSHOT versions. Now that the release version is here, we are going to try with the 2.3.6 release of Grizzly.
@vmalladi Grizzly 2.3.6 was released yesterday (integrated into AHC today). No firm date for AHC 2.0. Will see if we can do an AHC 2.0 beta release.
@rlubke Congrats on Grizzly 2.3.6!
@jbrittain Thanks!
@jbrittain Hey Jason, since you were active on the mailing list recently, I thought I'd follow up on the VM issue. Any update?
@rlubke Hi again. I spent some time trying to get the benchmark suite running on our other VMs, and it ended up not working due to a combination of Maven bugs plus a lack of network connectivity to the Nexus repositories from those specific VMs. So, I wasn't able to test it there. But there have also been new releases of Netty and Grizzly, so I can re-test with those in the original VMs. Haven't done that quite yet. But I think some of the bugs causing what I initially reported are better understood now (by you guys). Improvements in the provider code and in both Netty and Grizzly will eventually fix it, I believe.
Old one, no idea what it's still worth. |
Yeah, this is too far out of date. Thanks for all your help with it earlier (Ryan and you)!
Motivation: Currently when either we or the server send Connection: close, we correctly do not return that connection to the pool. However, we rely on the server actually performing the connection closure: we never call close() ourselves. This is unnecessarily optimistic: a server may absolutely fail to close this connection. To protect our own file descriptors, we should make sure that any connection we do not return to the pool is closed.
Modifications: If we think a connection is closing when we release it, we now call close() on it defensively.
Result: We no longer leak connections when the server fails to close them.
Fixes AsyncHttpClient#324.
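A hypothetical sketch of the release policy this commit message describes; Connection and ConnectionPool here are stand-in interfaces, not the provider's real classes.

interface Connection {
    boolean isOpen();
    void close();
}

interface ConnectionPool {
    boolean offer(Connection connection);
}

class ReleasePolicy {
    private final ConnectionPool pool;

    ReleasePolicy(ConnectionPool pool) {
        this.pool = pool;
    }

    void release(Connection connection, boolean keepAlive) {
        // Only a healthy keep-alive connection goes back to the pool; anything else is
        // closed defensively rather than trusting the server to close it, so file
        // descriptors are not leaked.
        if (keepAlive && connection.isOpen() && pool.offer(connection)) {
            return;
        }
        connection.close();
    }
}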
When making async requests using the Grizzly provider (from AHC 2.0.0-SNAPSHOT), I get some TimeoutExceptions that should not occur. The server is serving these requests very rapidly, and the JVM isn't GCing very much. The requests are served in a fraction of a second, but the Grizzly provider says they timed out after 9 seconds. If I set the Grizzly provider's timeout to a higher number of seconds, then it times out after that many seconds instead.
Some stack trace examples:
java.util.concurrent.TimeoutException: Timeout exceeded
at org.asynchttpclient.providers.grizzly.GrizzlyAsyncHttpProvider.timeout(GrizzlyAsyncHttpProvider.java:485)
at org.asynchttpclient.providers.grizzly.GrizzlyAsyncHttpProvider$3.onTimeout(GrizzlyAsyncHttpProvider.java:276)
at org.glassfish.grizzly.utils.IdleTimeoutFilter$DefaultWorker.doWork(IdleTimeoutFilter.java:382)
at org.glassfish.grizzly.utils.IdleTimeoutFilter$DefaultWorker.doWork(IdleTimeoutFilter.java:362)
at org.glassfish.grizzly.utils.DelayedExecutor$DelayedRunnable.run(DelayedExecutor.java:158)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
Another stack trace:
java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Timeout exceeded
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252)
at java.util.concurrent.FutureTask.get(FutureTask.java:111)
at org.asynchttpclient.providers.grizzly.GrizzlyResponseFuture.get(GrizzlyResponseFuture.java:165)
at org.ebaysf.webclient.benchmark.NingAhcGrizzlyBenchmarkTest.asyncWarmup(NingAhcGrizzlyBenchmarkTest.java:105)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.ebaysf.webclient.benchmark.AbstractBenchmarkTest.doBenchmark(AbstractBenchmarkTest.java:168)
at org.ebaysf.webclient.benchmark.NingAhcGrizzlyBenchmarkTest.testAsyncLargeResponses(NingAhcGrizzlyBenchmarkTest.java:84)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:45)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:119)
at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:101)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.maven.surefire.booter.ProviderFactory$ClassLoaderProxy.invoke(ProviderFactory.java:103)
at com.sun.proxy.$Proxy0.invoke(Unknown Source)
at org.apache.maven.surefire.booter.SurefireStarter.invokeProvider(SurefireStarter.java:150)
at org.apache.maven.surefire.booter.SurefireStarter.runSuitesInProcess(SurefireStarter.java:91)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:69)
Caused by: java.util.concurrent.TimeoutException: Timeout exceeded
at org.asynchttpclient.providers.grizzly.GrizzlyAsyncHttpProvider.timeout(GrizzlyAsyncHttpProvider.java:485)
at org.asynchttpclient.providers.grizzly.GrizzlyAsyncHttpProvider$3.onTimeout(GrizzlyAsyncHttpProvider.java:276)
at org.glassfish.grizzly.utils.IdleTimeoutFilter$DefaultWorker.doWork(IdleTimeoutFilter.java:382)
at org.glassfish.grizzly.utils.IdleTimeoutFilter$DefaultWorker.doWork(IdleTimeoutFilter.java:362)
at org.glassfish.grizzly.utils.DelayedExecutor$DelayedRunnable.run(DelayedExecutor.java:158)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
Here's what my asyncWarmup() method looks like:
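Roughly, the warmup issues a batch of async GETs and then blocks on each future; a minimal sketch of that shape (the client field, URL, and request count are assumptions, not the original code):

private void asyncWarmup(final String testUrl) throws Exception {
    List<Future<Response>> futures = new ArrayList<Future<Response>>();
    for (int i = 0; i < 4000; i++) {
        futures.add(client.prepareGet(testUrl).execute());
    }
    for (Future<Response> future : futures) {
        future.get(); // the TimeoutException above surfaces from this call
    }
}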
And here's how the client is initialized:
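Presumably along the lines of the configuration posted earlier in this thread; the exact values used in the benchmark are an assumption here:

AsyncHttpClientConfig config = new AsyncHttpClientConfig.Builder()
        .setMaximumConnectionsTotal(-1)
        .setMaximumConnectionsPerHost(4500)
        .setCompressionEnabled(false)
        .setAllowPoolingConnection(true)
        .setConnectionTimeoutInMs(9000)
        .setRequestTimeoutInMs(9000)
        .setIdleConnectionInPoolTimeoutInMs(3000)
        .build();
AsyncHttpClient client = new AsyncHttpClient(new GrizzlyAsyncHttpProvider(config), config);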