Skip to content

Subscription stops working doing automated tests #499

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
gklijs opened this issue Dec 4, 2019 · 20 comments
Closed

Subscription stops working doing automated tests #499

gklijs opened this issue Dec 4, 2019 · 20 comments
Labels
type: bug Something isn't working

Comments

@gklijs
Copy link
Contributor

gklijs commented Dec 4, 2019

Library Version
1.4.2

Describe the bug
When running a load test, I very early on get time-outs. I have a comparable implementation using Java, and it runs a whole lot longer. It's about 2 minutes vs. 30 minutes. I don't see any error in the logs. What will happen is that every second a couple of times a subscription will be started, a message will be received, through Kafka and Webflux, and then the subscription will be stopped again.
A preliminary comparison can be found here.

To Reproduce
At kafka-graphql-examples the implementation can be found. It's possible on mac/linux to run the whole test using the scripts.
I know this is quite a lot, and the test is doing a lot more than needed. I might try to create a test in Kotlin, using the example project that produces a similar result.

Expected behavior
To be at least on par with the java version, or be faster because of the use of coroutines.

I saw there was already an open issue for changing the subscription implementation, #358. It might fix the problem.

@gklijs gklijs added the type: bug Something isn't working label Dec 4, 2019
@smyrick
Copy link
Contributor

smyrick commented Dec 4, 2019

@gklijs Thank you for running tests against our library! It is helpful to see performance metrics outside of @ExpediaGroup

Do you think you could run the same perf test against this application? https://github.com/graphql-java-kickstart/graphql-spring-boot/tree/master/example-graphql-subscription

This is the graphql-java subscriptions example also using spring boot. If this is showing similar numbers than this issue really should be opened on graphql-java instead as we are not implementing much logic on top of the subscription execution.

@gklijs
Copy link
Contributor Author

gklijs commented Dec 5, 2019

The Java test is already based on that, it's in https://github.com/openweb-nl/kafka-graphql-examples/tree/ge-java. It's performing much better than this one, so the problems seems not entirely with graphql-java.

@smyrick
Copy link
Contributor

smyrick commented Dec 5, 2019

Ah ok, I apologise. I did not see it was a separate branch. This definitely is worth further investigation then. I don't have time to investigate this week but this is definitely high priority. Of course any community members are open to help!

@gklijs
Copy link
Contributor Author

gklijs commented Dec 8, 2019

Tried some things, using Flow instead of WebFlux to pass as publisher. But it keeps the same. Shat after the first generator is created it stopped.
I did notice sometimes the web socket fails, where a refresh is enough to get it back working. Also as soon as a subscription is stop, there aren't any keep alive messages been received after that. This is different from java-kickstart which send a "ka" message every 15 seconds.

@gklijs
Copy link
Contributor Author

gklijs commented Dec 8, 2019

I think this line, keepAliveSubscriptions[session.id]?.cancel() should be removed. I will try that out tomorrow.

@gklijs
Copy link
Contributor Author

gklijs commented Dec 8, 2019

Getting closer to a solution, I'm sure that problem for failing the test is that the messages from the websocket don't arrive anymore, without the client reconnecting.

@gklijs gklijs changed the title Poor performance using subscriptions Subscription stops working doing automated tests Dec 8, 2019
@gklijs
Copy link
Contributor Author

gklijs commented Dec 8, 2019

I'm able to reproduce it by starting and ending a subscription 128 times, after that the server doesn't send the data back, but dus keep sending the ak (with the pr version).

@gklijs
Copy link
Contributor Author

gklijs commented Dec 8, 2019

There seem to be some kind of congestions of multiple Flux that aren't properly released. I added some debug statements and called the subscription till it broke. It ended with:

2019-12-08 20:45:06 [reactor-http-nio-2] INFO  c.e.g.s.e.ApolloSubscriptionProtocolHandler - Handle method called with {"id":"deposit-f4b53533-d610-48af-a49c-d69ed179c8b9","type":"start","payload":{"query":"subscription ($amount: Int! $uuid: String! $username: String! $iban: String!){\n                                            money_transfer(amount: $amount username: $username  descr: \"deposit by re-graph\"\n                                           from: \"cash\" token: \"cash\" to: $iban uuid: $uuid)\n                                           {reason success uuid}}","variables":{"amount":1000,"username":"fxdgdgdgfgdfgd","iban":"NL51OPEN0807758846","uuid":"f4b53533-d610-48af-a49c-d69ed179c8b9"}}}
2019-12-08 20:45:06 [reactor-http-nio-2] INFO  c.e.g.s.e.ApolloSubscriptionProtocolHandler - Start subscription called with SubscriptionOperationMessage(type=start, id=deposit-f4b53533-d610-48af-a49c-d69ed179c8b9, payload={query=subscription ($amount: Int! $uuid: String! $username: String! $iban: String!){
                                            money_transfer(amount: $amount username: $username  descr: "deposit by re-graph"
                                           from: "cash" token: "cash" to: $iban uuid: $uuid)
                                           {reason success uuid}}, variables={amount=1000, username=fxdgdgdgfgdfgd, iban=NL51OPEN0807758846, uuid=f4b53533-d610-48af-a49c-d69ed179c8b9}})
2019-12-08 20:45:06 [reactor-http-nio-2] INFO  c.e.g.s.e.SimpleSubscriptionHandler - executeSubscription method called with variable {amount=1000, username=fxdgdgdgfgdfgd, iban=NL51OPEN0807758846, uuid=f4b53533-d610-48af-a49c-d69ed179c8b9}
2019-12-08 20:45:06 [reactor-http-nio-2] TRACE c.e.g.s.e.ApolloSubscriptionProtocolHandler - WebSocket GraphQL subscription subscribe, WebSocketSessionID=58a7f94c OperationMessageID=deposit-f4b53533-d610-48af-a49c-d69ed179c8b9
2019-12-08 20:45:06 [reactor-http-nio-2] INFO  c.e.g.s.e.ApolloSubscriptionProtocolHandler - Handle method called with {"id":"deposit-f4b53533-d610-48af-a49c-d69ed179c8b9","type":"stop"}
2019-12-08 20:45:06 [reactor-http-nio-2] INFO  c.e.g.s.e.ApolloSubscriptionProtocolHandler - Stop subscription called with SubscriptionOperationMessage(type=stop, id=deposit-f4b53533-d610-48af-a49c-d69ed179c8b9, payload=null)
2019-12-08 20:45:06 [reactor-http-nio-2] INFO  c.e.g.s.e.ApolloSubscriptionProtocolHandler - Handle method called with {"id":"deposit-f8d15699-2435-44ca-948e-7fb827739c2d","type":"start","payload":{"query":"subscription ($amount: Int! $uuid: String! $username: String! $iban: String!){\n                                            money_transfer(amount: $amount username: $username  descr: \"deposit by re-graph\"\n                                           from: \"cash\" token: \"cash\" to: $iban uuid: $uuid)\n                                           {reason success uuid}}","variables":{"amount":1000,"username":"fxdgdgdgfgdfgd","iban":"NL51OPEN0807758846","uuid":"f8d15699-2435-44ca-948e-7fb827739c2d"}}}
2019-12-08 20:45:06 [reactor-http-nio-2] INFO  c.e.g.s.e.ApolloSubscriptionProtocolHandler - Start subscription called with SubscriptionOperationMessage(type=start, id=deposit-f8d15699-2435-44ca-948e-7fb827739c2d, payload={query=subscription ($amount: Int! $uuid: String! $username: String! $iban: String!){
                                            money_transfer(amount: $amount username: $username  descr: "deposit by re-graph"
                                           from: "cash" token: "cash" to: $iban uuid: $uuid)
                                           {reason success uuid}}, variables={amount=1000, username=fxdgdgdgfgdfgd, iban=NL51OPEN0807758846, uuid=f8d15699-2435-44ca-948e-7fb827739c2d}})
2019-12-08 20:45:06 [reactor-http-nio-2] INFO  c.e.g.s.e.SimpleSubscriptionHandler - executeSubscription method called with variable {amount=1000, username=fxdgdgdgfgdfgd, iban=NL51OPEN0807758846, uuid=f8d15699-2435-44ca-948e-7fb827739c2d}
2019-12-08 20:45:06 [reactor-http-nio-2] TRACE c.e.g.s.e.ApolloSubscriptionProtocolHandler - WebSocket GraphQL subscription subscribe, WebSocketSessionID=58a7f94c OperationMessageID=deposit-f8d15699-2435-44ca-948e-7fb827739c2d
2019-12-08 20:45:06 [reactor-http-nio-2] INFO  c.e.g.s.e.ApolloSubscriptionProtocolHandler - Handle method called with {"id":"deposit-f8d15699-2435-44ca-948e-7fb827739c2d","type":"stop"}
2019-12-08 20:45:06 [reactor-http-nio-2] INFO  c.e.g.s.e.ApolloSubscriptionProtocolHandler - Stop subscription called with SubscriptionOperationMessage(type=stop, id=deposit-f8d15699-2435-44ca-948e-7fb827739c2d, payload=null)

Somehow the handle method is not called anymore. Only after refreshing the page the situation seems to restore itself, and after some seconds, the cancelation of all the all subscriptions pass by in the logging.
Weirdly enough there is also no data send to the server from that point anymore, so it might be a problem in the client, but that's weird since with the other two servers it is working.

@smyrick
Copy link
Contributor

smyrick commented Dec 10, 2019

@gklijs Does your GraphQL client send the terminate message?

https://github.com/openweb-nl/kafka-graphql-examples/blob/dd96117a08d8abd241a52077661423b6be065ee9/test/src/nl/openweb/test/graphql_client.clj#L43

If not then the session will not close and it will still keep sending keep alive messages:

@gklijs
Copy link
Contributor Author

gklijs commented Dec 11, 2019

That client is only used for testing, and they stay open till the end of the test. It might still be a good idea to add them through, and most likely the reason of some error logging at the and of the test.
It's also happening from the frontend through, which does have a complete implementation.
So while

is called, that subscription is not really stopped. It might go wrong because the publisher is concatenated with the ok message, which somehow prevents the cancel? The java implementation of the subscription start is pretty similar otherwise.

@gklijs
Copy link
Contributor Author

gklijs commented Dec 11, 2019

Thanks for merging, although it was only a small improvement. Removing the concat

.concatWith(Flux.just(SubscriptionOperationMessage(type = GQL_COMPLETE.type, id = operationMessage.id)))
changes the behaviour, so I'm on the right track, but removing it of course isn't. I'm going to tread up on what is supposed to happen according to the spec's and/or compare with other libraries to make a proper fix.

@dariuszkuc
Copy link
Collaborator

Just a side note -> I just noticed that you are using Kotlin Channel primitives which if not handled correctly can be troublesome and lead to deadlocks.

@gklijs
Copy link
Contributor Author

gklijs commented Dec 13, 2019

Good to know, I changed that part several times now. I could do it similar to the java version, using webflux. But even then it had the same problem. I also changed it to be a bit less experimental, using broadcast.

@smyrick
Copy link
Contributor

smyrick commented Dec 16, 2019

Running the perf tests against the example app in graphql-kotlin and this example app in graphql-java

GraphQL Java

Test Config

config:
  target: 'ws://localhost:3000/stockticker'
  ws:
    rejectUnauthorized: false
  phases:
    # Duration in seconds
    - duration: 600
      arrivalRate: 20
scenarios:
  - name: "Run GraphQL Subscription"
    engine: "ws"
    flow:
      - send:
          query: |-
            subscription TestSubscription {
              stockQuotes {
                dateTime
                stockCode
                stockPrice
                stockPriceChange
              }
            }
      - think: 2

Results

All virtual users finished
Summary report @ 13:44:46(-0800) 2019-12-16
  Scenarios launched:  12000
  Scenarios completed: 12000
  Requests completed:  12000
  RPS sent: 19.93
  Request latency:
    min: 0
    max: 2
    median: 0.1
    p95: 0.2
    p99: 0.4
  Scenario counts:
    Run GraphQL Subscription: 12000 (100%)
  Codes:
    0: 12000

✨  Done in 604.24s.

Screen Shot 2019-12-16 at 1 45 40 PM


GraphQL Kotlin

Test Config

config:
  target: 'ws://localhost:8080/subscriptions'
  ws:
    rejectUnauthorized: false
    subprotocols:
      - graphql-ws
  phases:
    - duration: 600
      arrivalRate: 20
scenarios:
  - name: "Run GraphQL Subscription"
    engine: "ws"
    flow:
      - send:
          type: "connection_init"
      - send:
          type: "start"
          id: "1"
          payload:
            query: |-
              subscription TestSubscription {
                counter
              }
      - think: 2
      - send:
          type: "connection_terminate"

Results

All virtual users finished
Summary report @ 13:59:46(-0800) 2019-12-16
  Scenarios launched:  12000
  Scenarios completed: 12000
  Requests completed:  36000
  RPS sent: 59.81
  Request latency:
    min: 0
    max: 3.5
    median: 0.1
    p95: 0.2
    p99: 0.3
  Scenario counts:
    Run GraphQL Subscription: 12000 (100%)
  Codes:
    0: 36000

✨  Done in 603.64s.

Screen Shot 2019-12-16 at 2 01 04 PM

@smyrick
Copy link
Contributor

smyrick commented Dec 16, 2019

So graphql-kotlin is using a lot more memory. It is freeing up on GC but that is orders larger. graphql-java was jumping back and forth from 10MB to ~40MB. graphql-kotlin was jumping from 30MB to 200MB.

The difference in request count is that the kotlin implementation uses the graphql-ws subprotocol so it needs to send more messages

@dariuszkuc
Copy link
Collaborator

So graphql-kotlin is using a lot more memory. It is freeing up on GC but that is orders larger. graphql-java was jumping back and forth from 10MB to ~40MB. graphql-kotlin was jumping from 30MB to 200MB.

We are using SpringBoot which is much heavier than the basic Jetty server started from graphql-java-subscription-example.

@gklijs
Copy link
Contributor Author

gklijs commented Dec 17, 2019

Might be related, especially since the graphql-java was already having to less memory available in the setup, and using the sand for graphql-kotlin. Although I tried at least once with just running the Kotlin part locally, with more resources.

@gklijs
Copy link
Contributor Author

gklijs commented Dec 17, 2019

Good news, it seems fixed. I don't know if it's a small changes I did on the current master, or it's just the latest master + enough memory. Hope to get that clear tonight, so I can do a whole performance run overnight. I think there is also nothing there now to prevent the same client subscribing with the same id twice, without stopping between. That's on of the things I changed locally. Will create a pr if needed.

@gklijs
Copy link
Contributor Author

gklijs commented Dec 18, 2019

Tested with #520 overnight and of the 5 runs, 3 ran for over on hour. So once the pr is merged, this can be closed.

@gklijs
Copy link
Contributor Author

gklijs commented Dec 18, 2019

Fixed by #520 (and maybe partly by some of the fixes before it)

@gklijs gklijs closed this as completed Dec 18, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: bug Something isn't working
Development

No branches or pull requests

3 participants