Skip to content

Spurious ReactiveStreamsTest#testConnectionDoesNotGetClosed failure #1380

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
slandelle opened this issue Mar 29, 2017 · 18 comments
Closed

Spurious ReactiveStreamsTest#testConnectionDoesNotGetClosed failure #1380

slandelle opened this issue Mar 29, 2017 · 18 comments
Assignees
Labels
Milestone

Comments

@slandelle
Copy link
Contributor

slandelle commented Mar 29, 2017

Reproducer: org.asynchttpclient.reactivestreams.ReactiveStreamsTest#main (might have to run several times):

testConnectionDoesNotGetClosed(org.asynchttpclient.reactivestreams.ReactiveStreamsTest)  Time elapsed: 0.342 sec  <<< FAILURE!
java.lang.AssertionError:  expected [[B@95f1422] but found [[B@31b7112d]
	at org.asynchttpclient.reactivestreams.ReactiveStreamsTest.testConnectionDoesNotGetClosed(ReactiveStreamsTest.java:66)
@farmdawgnation
Copy link
Contributor

farmdawgnation commented Jul 3, 2017

I checked the most recent build from Travis.

I found this buried in the logs.

Results :

Failed tests: 
  BasicHttpsTest.multipleSequentialPostRequestsOverHttps:95->lambda$multipleSequentialPostRequestsOverHttps$5:96->lambda$null$4:105 expected [hello there] but found [<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 500 No handler enqueued</title>
</head>
<body><h2>HTTP ERROR 500</h2>
<p>Problem accessing /foo/bar. Reason:
<pre>    No handler enqueued</pre></p><hr><a href="http://eclipse.org/jetty">Powered by Jetty:// 9.4.6.v20170531</a><hr/>
</body>
</html>
]

  ReactiveStreamsTest.testConnectionDoesNotGetClosed:256 Invalid response byte at position 12000 expected [111] but found [45]

I'm not sure, but it looks like there's more than one issue going on here...

@slandelle
Copy link
Contributor Author

I'm pretty sure I got the BasicHttpsTest.multipleSequentialPostRequestsOverHttps failure pinned, see #1434

But sadly, it's completely unrelated to the ReactiveStreamsTest.testConnectionDoesNotGetClosed one :(

@slandelle
Copy link
Contributor Author

I can get ReactiveStreamsTest#main to fail almost every time on my machine:

Request DefaultHttpRequest(decodeResult: success, version: HTTP/1.1)
PUT /foo/test HTTP/1.1
X-content-length: 265495
X-content-md5: sBFA+zEHvFHWTKspPX86Gg==
transfer-encoding: chunked
host: localhost:62655
accept: */*
user-agent: AHC/2.1

Response DefaultHttpResponse(decodeResult: success, version: HTTP/1.1)
HTTP/1.1 200 
X-x-content-length: 265495
X-x-content-md5: sBFA+zEHvFHWTKspPX86Gg==
X-transfer-encoding: chunked
X-host: localhost:62655
X-accept: */*
X-user-agent: AHC/2.1
X-pathInfo: /foo/test
X-KEEP-ALIVE: 127.0.0.1:62713
X-content-length: 265495
content-md5: BJhrrx4cp/H3htzVPIJGIg==
Content-Type: text/html;charset=UTF-8
Transfer-Encoding: chunked
Date: Wed, 05 Jul 2017 11:25:16 GMT

2017-07-05 13:25:17,842 [AsyncHttpClient-60-1] DEBUG org.asynchttpclient.netty.channel.ChannelManager - Adding key: http://localhost:62655 for channel [id: 0xbafe3c5f, L:/127.0.0.1:62713 - R:localhost/127.0.0.1:62655]
java.lang.AssertionError: Server side payload MD5 invalid expected [sBFA+zEHvFHWTKspPX86Gg==] but found [BJhrrx4cp/H3htzVPIJGIg==]
	at org.testng.Assert.fail(Assert.java:94)
	at org.testng.Assert.failNotEquals(Assert.java:513)
	at org.testng.Assert.assertEqualsImpl(Assert.java:135)
	at org.testng.Assert.assertEquals(Assert.java:116)
	at org.testng.Assert.assertEquals(Assert.java:190)
	at org.asynchttpclient.reactivestreams.ReactiveStreamsTest.testConnectionDoesNotGetClosed(ReactiveStreamsTest.java:260)
	at org.asynchttpclient.reactivestreams.ReactiveStreamsTest.main(ReactiveStreamsTest.java:278)
2017-07-05 13:25:17,853 [main] INFO org.apache.coyote.http11.Http11NioProtocol - Pausing ProtocolHandler ["http-nio-auto-1-62655"]
2017-07-05 13:25:17,861 [main] INFO org.apache.catalina.core.StandardService - Stopping service Tomcat
2017-07-05 13:25:17,871 [main] INFO org.apache.coyote.http11.Http11NioProtocol - Stopping ProtocolHandler ["http-nio-auto-1-62655"]
Exception in thread "main" java.lang.AssertionError: Invalid response byte at position 12000 expected [111] but found [45]
	at org.testng.Assert.fail(Assert.java:94)
	at org.testng.Assert.failNotEquals(Assert.java:513)
	at org.testng.Assert.assertEqualsImpl(Assert.java:135)
	at org.testng.Assert.assertEquals(Assert.java:116)
	at org.testng.Assert.assertEquals(Assert.java:326)
	at org.asynchttpclient.reactivestreams.ReactiveStreamsTest.testConnectionDoesNotGetClosed(ReactiveStreamsTest.java:266)
	at org.asynchttpclient.reactivestreams.ReactiveStreamsTest.main(ReactiveStreamsTest.java:278)

@farmdawgnation
Copy link
Contributor

I've been thinking about this failure for a few days now and still can't figure out what would be causing this. What's the exact command you're running?

@slandelle
Copy link
Contributor Author

I just run main method from IntelliJ.

@farmdawgnation
Copy link
Contributor

Ah, there's a main method in the unit test. I was just running the test from mvn on the command line. Let me try that.

@farmdawgnation
Copy link
Contributor

Okay, I was able to reproduce the failure. I don't have a solution here yet but I think I've learned some interesting things:

  • It doesn't behave like a race condition. I can't fix it using arbitrary Thread.sleeps in the test or in the test server.
  • The failure is always the same:
    • It occurs at offset 12,000 in the byte array
    • It always expects byte 111 in that position but finds byte 45.
    • Bytes after this position appear to be wrong too.

Position 12000 is exactly at 12 kilobytes. Is that a magic number of a buffer somewhere important?

@farmdawgnation
Copy link
Contributor

Oh. Also. When the entire content fails it fails the same way. The MD5 of the bad response is the same each time.

@slandelle
Copy link
Contributor Author

It occurs at offset 12,000 in the byte array

I've seen it happen at offset 24,000 too

12000 bytes is not 12kb :)

The problem is not server side as I was able to reproduce with Jetty, Netty and Tomcat.

12000 has not be some buffer size, either in Netty or kernel but I wasn't able to spot it. It happens on OSX (my machine) and Linux (Travis).

No idea what happens. Looks like buffer corruption.

@farmdawgnation
Copy link
Contributor

I'm inclined to say the same. The fact that it's the same failure every time on my machine is just weird though.

In any event I'm skeptical that the bug is in AHC. Might be worthwhile to try playing with the underlying Netty version to see if the problem disappears. Have you done that yet?

@slandelle
Copy link
Contributor Author

Might be worthwhile to try playing with the underlying Netty version to see if the problem disappears. Have you done that yet?

I can reproduce with Netty 4.1.0.Final. I didn't bother digging further.

In any event I'm skeptical that the bug is in AHC.

This feature involves many libs:

  • AHC itself
  • netty
  • netty-reactive-streams
  • rxjava
  • rxjava-reactive-streams

Hard to figure out where issue might happen.
I've checked netty-reactive-streams sources and couldn't spot an issue.

@farmdawgnation
Copy link
Contributor

Can we backport this test to an earlier version of AHC?

My thought on this is that if AHC 2.0 exhibits the same bug then this isn't a regression and we should feel free to move forward releasing AHC 2.1.

@slandelle
Copy link
Contributor Author

Can we backport this test to an earlier version of AHC?

It fails too on AHC 2.0

My thought on this is that if AHC 2.0 exhibits the same bug then this isn't a regression and we should feel free to move forward releasing AHC 2.1.

I currently have no idea if this bug is only limited to reactivestreams based features or if it's a critical issue that just happened to be spotted here. I don't feel comfortable releasing the next stable version with such doubt lurking around.

@farmdawgnation
Copy link
Contributor

I currently have no idea if this bug is only limited to reactivestreams based features or if it's a critical issue that just happened to be spotted here.

I don't know the codebase well enough to answer this question for sure, but the surface area of AHC itself on the critical path for this feature seems relatively small to me. The fact that it fails, too, on AHC 2.0 means that in your shoes I'd probably release the code with a Known Bug caveat in the release notes. That is just me, however.

I don't have time this weekend, but perhaps next weekend I'll have time to dig in and see if I can play with dependencies and determine whether or not changing netty versions and such causes the problem to disappear.

@slandelle
Copy link
Contributor Author

@farmdawgnation Please give it a try. I'm pretty sure it was a RxJava 1 bug that was fixed in RxJava 2. There are other foobar stuffs related to reactive streams that I want to clean up before releasing 2.1. I'll work on this when I'll be back from vacation.

@farmdawgnation
Copy link
Contributor

Huzzah! It seems to resolved the issue on my end, too!

@slandelle
Copy link
Contributor Author

Great, thanks for the feedback!

@slandelle
Copy link
Contributor Author

Issue still happens almost every time on Travis.
I think I have a clue: playframework/netty-reactive-streams#37

I think Subscription is being sometimes requested from 2 different threads: from the calling thread and from the channel's event loop. When such switch happen, I think chunks can be written out of order as the ones read from the calling thread have to jump into the event loop.

@slandelle slandelle self-assigned this Dec 12, 2017
@slandelle slandelle added this to the 2.1.0 milestone Dec 12, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants