"java.net.SocketException: Socket closed" when in a cluster mode + Docker + acquireHostList enabled #384

Closed
wajda opened this issue Apr 20, 2021 · 8 comments · Fixed by #385
@wajda
Contributor

wajda commented Apr 20, 2021

The issue was first discovered here AbsaOSS/spline#869

The error occurs under a specific combination of circumstances: cluster mode + Docker + acquireHostList=true

My understanding of what is happening is the following.
When the VST connection is established, the respective HostHandler asks the VstCommunication class to refresh the host list from the server. When the new hosts are added to the set, the old ones (unless they point to exactly the same ip:port) are immediately discarded along with all associated connection pools and sockets.
The problem is that the connection instance that has just been created, the one that triggered the host list refresh in the first place and is being returned from the VstCommunication.connect() method, holds a pointer to a host that might have just been discarded (and its socket closed) during that refresh. As a result, in these circumstances the VstCommunication.connect() method returns a connection that is already dead at the moment of creation, with all the consequences.

This is exactly what happens when ArangoDB runs in a virtualized environment (Docker in our case) where the networking is set up so that the client process addresses the server via a different IP (or host name) than the one the server sees from inside its network.
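
The sequence described above can be reproduced with a small self-contained simulation. This is only a sketch of the reasoning, not the driver's code: `Connection`, `HostSet` and `connect` are hypothetical stand-ins, and the two addresses are illustrative values for "what the client dials" versus "what the server reports".

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HostListRefreshSketch {

    /** Hypothetical stand-in for a driver connection that wraps a socket. */
    static final class Connection {
        final String endpoint;
        private boolean open = true;

        Connection(String endpoint) { this.endpoint = endpoint; }
        boolean isOpen() { return open; }
        void close() { open = false; }
    }

    /** Hypothetical host registry keyed by "host:port". */
    static final class HostSet {
        private final Map<String, Connection> hosts = new LinkedHashMap<>();

        Connection add(String endpoint) {
            return hosts.computeIfAbsent(endpoint, Connection::new);
        }

        /** Replaces the host list: hosts absent from the acquired list are closed and dropped. */
        void refresh(List<String> acquired) {
            hosts.entrySet().removeIf(entry -> {
                boolean stale = !acquired.contains(entry.getKey());
                if (stale) {
                    entry.getValue().close();
                }
                return stale;
            });
            acquired.forEach(this::add);
        }
    }

    /** Mimics the problematic flow: connect, refresh the host list, return the original connection. */
    static Connection connect(HostSet hosts, String configured, List<String> acquired) {
        Connection connection = hosts.add(configured);
        hosts.refresh(acquired); // may close the connection created one line above
        return connection;
    }

    public static void main(String[] args) {
        // Client dials "localhost", the server reports itself under its internal address.
        Connection mismatched = connect(new HostSet(), "localhost:8529",
                Arrays.asList("172.17.0.1:8529"));
        System.out.println(mismatched.endpoint + " open=" + mismatched.isOpen());

        // Client dials the same address the server reports: the connection survives.
        Connection matching = connect(new HostSet(), "172.17.0.1:8529",
                Arrays.asList("172.17.0.1:8529"));
        System.out.println(matching.endpoint + " open=" + matching.isOpen());
    }
}
```

In the first case the returned connection is already closed, because its host key is not in the acquired list. In the second case the keys match and nothing is discarded.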

The issue is reproducible by spinning up a DB cluster via arangodb-starter in Docker and running the ArangoDBTest.execute_acquireHostList_enabled() test method against it.

@wajda
Contributor Author

wajda commented Apr 20, 2021

The solution would be to simply check whether the connection instance is still alive before returning it from the VstCommunication.connect() method. If it is not, keep re-getting the connection from the host handler until a usable one is received.
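
A minimal sketch of that idea, under stated assumptions: the `Connection` interface and the `connectAlive` helper are hypothetical and are not the driver's API, and the `Supplier` stands in for the host handler. The attempt limit is my addition to keep the loop from spinning forever. The proposal above does not mention one.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.function.Supplier;

public class ConnectRetrySketch {

    /** Hypothetical minimal view of a connection: all we need is a liveness check. */
    interface Connection {
        boolean isOpen();
    }

    /**
     * Keeps asking the supplier (standing in for the host handler) for a connection
     * until it returns an open one, giving up after maxAttempts tries.
     */
    static <C extends Connection> C connectAlive(Supplier<C> hostHandler, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            C candidate = hostHandler.get();
            if (candidate != null && candidate.isOpen()) {
                return candidate;
            }
        }
        throw new IllegalStateException(
                "No usable connection after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) {
        Connection closed = () -> false; // discarded during the host list refresh
        Connection open = () -> true;    // obtained from the refreshed host list
        Deque<Connection> handedOut = new ArrayDeque<>(Arrays.asList(closed, open));

        Connection usable = connectAlive(handedOut::poll, 3);
        System.out.println("open=" + usable.isOpen());
    }
}
```

Here the first connection handed out is dead, so the helper asks again and returns the second one.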

@rashtao
Collaborator

rashtao commented Apr 26, 2021

Hi @wajda ,
how exactly do you start the cluster? And how exactly do you access it?

@wajda
Contributor Author

wajda commented Apr 26, 2021

This happens in multiple environments. First we ran into this issue on Kubernetes on AWS, then my colleague reproduced it locally using Docker 20, while it worked for me on Fedora's docker 19 (moby-engine). After installing Docker-ce 20 the issue occurred for me as well.

On my localhost I use the following setup to reproduce it:

  • Linux (Fedora 33),
  • Docker-ce 20 (everything is by default, no customization at all)
  • I start the ArangoDB cluster via arangodb-starter using the following command:
docker run -it --rm \
  --name=adb \
  -p 8528:8528 \
  -v /var/run/docker.sock:/var/run/docker.sock arangodb/arangodb-starter \
  --starter.local \
  --starter.address=172.17.0.1 \
  --docker.container=adb

(Not sure whether I used -v in my last tests; I tried different combinations. It doesn't affect the way the error occurs, though: it's consistently reproducible either way.)

For the driver config, enable acquireHostList. Otherwise nothing special.

Then I access it via VST on localhost
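
For reference, the client side of this setup might look roughly like the fragment below. It is a sketch assuming the 6.x line of arangodb-java-driver on the classpath (builder methods `host`, `useProtocol`, `acquireHostList`). Port 8529 and the absence of credentials are assumptions for a default local starter setup.

```java
import com.arangodb.ArangoDB;
import com.arangodb.Protocol;

public class AcquireHostListConfig {
    public static void main(String[] args) {
        ArangoDB arangoDB = new ArangoDB.Builder()
                .host("localhost", 8529)      // address as seen from the client side (assumed port)
                .useProtocol(Protocol.VST)    // VST is the protocol this issue is about
                .acquireHostList(true)        // ask the server for its own view of the host list
                .build();
        try {
            // Any request will do; it forces the driver to open a connection.
            System.out.println(arangoDB.getVersion().getVersion());
        } finally {
            arangoDB.shutdown();
        }
    }
}
```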

@wajda
Contributor Author

wajda commented Apr 26, 2021

@rashtao
Collaborator

rashtao commented Apr 28, 2021

Hi @wajda ,
I think the error is caused by the fact that you connect the driver to localhost, but since you have acquireHostList=true, the same host has a different name in the returned host list (e.g. tcp://172.17.0.1:8529).

Can you please try connecting the driver to 172.17.0.1:8529 instead of localhost?

@wajda
Contributor Author

wajda commented Apr 28, 2021

On a local Docker, yes, that would work (as I mentioned in AbsaOSS/spline#869 (comment)). The problem is that it's not always possible in real prod environments with more complicated networking, where, for instance, IPs are auto-generated and aren't stable enough to be put in a config file. So the first client connection needs to be made via an alias, for example.

@dvagapov

I have an ArangoDB cluster on Kubernetes.

Connection string to arangodb: arangodb-cluster.arango-namespace.svc.cluster.local:8529

Kubernetes doesn't provide stable IPs and operates via DNS names:

Arango pod names:
arangodb-cluster-agnt-260nvfct-87c535 20.0.63.123
arangodb-cluster-agnt-glwzb7hd-87c535 20.0.63.81
arangodb-cluster-agnt-optv4hel-87c535 20.0.62.19

arangodb-cluster-crdn-c4eas9vw-044b59 20.0.62.185
arangodb-cluster-crdn-hrf4bdbi-044b59 20.0.63.232
arangodb-cluster-crdn-xraip1sl-044b59 20.0.63.99

arangodb-cluster-prmr-9njqj9u2-044b59 20.0.63.161
arangodb-cluster-prmr-aotybl68-044b59 20.0.62.167
arangodb-cluster-prmr-wdmsmksh-044b59 20.0.63.147

If I delete the pod "arangodb-cluster-prmr-9njqj9u2-044b59 20.0.63.161", Kubernetes will create a new pod with a new IP address.

@rashtao
Copy link
Collaborator

rashtao commented Apr 29, 2021

Thanks for clarifying, it makes sense to me now!
