kdc vagrant fixture time out while booting #39977


Closed
andyb-elastic opened this issue Mar 12, 2019 · 11 comments
Labels
:Delivery/Build Build or test infrastructure · Team:Delivery Meta label for Delivery team · >test-failure Triaged test failures from CI

@andyb-elastic
Contributor

This has happened in a couple of recent builds. It may be another, more general issue with Vagrant or this image, but I'm not sure.

Looks like:

==> krb5kdc: Clearing any previously set network interfaces...
==> krb5kdc: Preparing network interfaces based on configuration...
    krb5kdc: Adapter 1: nat
==> krb5kdc: Forwarding ports...
    krb5kdc: 88 (guest) => 60088 (host) (adapter 1)
    krb5kdc: 22 (guest) => 2222 (host) (adapter 1)
==> krb5kdc: Booting VM...
==> krb5kdc: Waiting for machine to boot. This may take a few minutes...
    krb5kdc: SSH address: 127.0.0.1:2222
    krb5kdc: SSH username: vagrant
    krb5kdc: SSH auth method: private key
Timed out while waiting for the machine to boot. This means that
Vagrant was unable to communicate with the guest machine within
the configured ("config.vm.boot_timeout" value) time period.

If you look above, you should be able to see the error(s) that
Vagrant had when attempting to connect to the machine. These errors
are usually good hints as to what may be wrong.

If you're using a custom box, make sure that networking is properly
working and you're able to connect to the machine. It is a common
problem that networking isn't setup properly in these boxes.
Verify that authentication configurations are also setup properly,
as well.

If the box appears to be booting properly, you may want to increase
the timeout ("config.vm.boot_timeout") value.

> Task :plugins:repository-hdfs:krb5kdcFixture FAILED
:plugins:repository-hdfs:krb5kdcFixture (Thread[Execution worker for ':' Thread 9,5,main]) completed. Took 6 mins 18.541 secs.

:plugins:repository-hdfs:krb5kdcFixture#stop (Thread[Execution worker for ':' Thread 9,5,main]) started.

> Task :plugins:repository-hdfs:krb5kdcFixture#stop
Task ':plugins:repository-hdfs:krb5kdcFixture#stop' is not up-to-date because:
  Task has not declared any outputs despite executing actions.
Custom actions are attached to task ':plugins:repository-hdfs:krb5kdcFixture#stop'.
Starting process 'command 'vagrant''. Working directory: /var/lib/jenkins/workspace/elastic+elasticsearch+master+intake/plugins/repository-hdfs Command: vagrant halt krb5kdc
Successfully started process 'command 'vagrant''
==> krb5kdc: Attempting graceful shutdown of VM...
    krb5kdc: Guest communication could not be established! This is usually because
    krb5kdc: SSH is not running, the authentication information was changed,
    krb5kdc: or some other networking issue. Vagrant will force halt, if
    krb5kdc: capable.
==> krb5kdc: Forcing shutdown of VM...
:plugins:repository-hdfs:krb5kdcFixture#stop (Thread[Execution worker for ':' Thread 9,5,main]) completed. Took 19.844 secs.
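For reference, the timeout that Vagrant complains about is set in the fixture's Vagrantfile. A minimal sketch of raising it follows; the "krb5kdc" machine name comes from the log above, but the rest of the fixture's actual settings are assumptions, and Vagrant's default boot_timeout is 300 seconds:

```ruby
# Hypothetical Vagrantfile excerpt for the krb5kdc fixture.
# config.vm.boot_timeout controls how long Vagrant waits for SSH
# before failing with the "Timed out while waiting for the machine
# to boot" error seen above; the default is 300 seconds.
Vagrant.configure("2") do |config|
  config.vm.define "krb5kdc" do |kdc|
    kdc.vm.boot_timeout = 600 # double the default wait
  end
end
```

That said, in this failure the guest never became reachable over SSH at all ("Guest communication could not be established"), so a longer timeout would most likely just make the build fail more slowly.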

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/2498/console
build-intake-master-2498.txt.zip

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.6+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=zulu11,nodes=immutable&&linux&&docker/165/console

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.6+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=java8,nodes=immutable&&linux&&docker/165/console

@andyb-elastic andyb-elastic added :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI labels Mar 12, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@ywelsch ywelsch added the :Delivery/Build Build or test infrastructure label Mar 13, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@ywelsch ywelsch removed the :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs label Mar 13, 2019
@alpar-t
Contributor

alpar-t commented Mar 13, 2019

@elasticdog this is happening because our Ubuntu image now has Vagrant installed and we are picking that up, but it's apparently not working.

@elasticdog
Contributor

@atorok that is interesting...both VirtualBox and Vagrant were added to the Ubuntu 18.04 image to test nested virtualization; specifically, I've been looking at running the packaging tests in GCP. The general gist is that it works, but it is very slow. I've not experienced any direct failures after bringing up a VM like the one shown in the output here, but I would expect more of our images to have VirtualBox/Vagrant installed moving forward if we can figure out where the performance hit comes from.

I'm open to suggestions on the approach you'd like to take here to prevent unexpected runs.

@alpar-t
Contributor

alpar-t commented Mar 15, 2019

@elasticdog I've been thinking about having a way to label images as "experimental", "beta", etc. to allow for experimentation in a way that limits the overall impact without putting a lot of burden on the team and infrastructure. It should be something simple; setting up a separate image build or some complex versioning scheme would not help us.

I think we should add a "stable" label and require it for critical jobs that pick from a pool of workers, like intake and PR builds. Then, when we do something like this, we could remove the "stable" label from Ubuntu 18.04 so it would no longer be used for those jobs, and add a job that runs only on this platform. It may be a bit of overkill for this situation, but I imagine there will be others we will want to try in the future, like adding more cores once we better parallelize the build, and changes in general that require some experimentation to get right.

In the meantime, we have been migrating away from Vagrant-based test fixtures (this is the last one), and we are inching towards a setup that would allow us to run the packaging tests against GCP, with each test running on its own VM to make them much faster. Even if we adapt the build to make that easier, we would still need Packer 2.0-based images. I think that won't be too hard to port from the existing ones; probably the more time-consuming bit is to also generate identical VirtualBox images.
VMX might give us some benefits right now, but in the long term I think running the tests in GCP VMs is what we should aim for.

@alpar-t
Contributor

alpar-t commented Mar 15, 2019

This would be fixed by #34095

@elasticdog
Contributor

elasticdog commented Mar 15, 2019

I definitely agree that it would be good to come up with some mechanism to have experimental images without affecting the regular ones. It can be tricky to balance our own Packer build times and complexity, but I'll think about how we can introduce things like this more cleanly (I hadn't expected there to be any side effects to having this software baked into the images).

@jaymode
Member

jaymode commented Mar 18, 2019

@romseygeek
Contributor

@ywelsch
Contributor

ywelsch commented Jul 29, 2019

A more recent failure of this (not sure if it's related or a different failure):

:test:fixtures:krb5kdc-fixture:composeUp FAILED
Building peppa
Building hdfs
Recreating 276cd8b2e6fb66d48075e05cd8bd231c_krb5kdc-fixture__peppa_1 ... 
Recreating 276cd8b2e6fb66d48075e05cd8bd231c_krb5kdc-fixture__hdfs_1  ... 
Recreating 276cd8b2e6fb66d48075e05cd8bd231c_krb5kdc-fixture__peppa_1 ... error

ERROR: for 276cd8b2e6fb66d48075e05cd8bd231c_krb5kdc-fixture__peppa_1  Cannot start service peppa: unable to remount dir as readonly: device or resource busy
Recreating 276cd8b2e6fb66d48075e05cd8bd231c_krb5kdc-fixture__hdfs_1  ... done

ERROR: for peppa  Cannot start service peppa: unable to remount dir as readonly: device or resource busy
Encountered errors while bringing up the project.
Stopping 276cd8b2e6fb66d48075e05cd8bd231c_krb5kdc-fixture__hdfs_1 ... 
Stopping 276cd8b2e6fb66d48075e05cd8bd231c_krb5kdc-fixture__hdfs_1 ... done
Removing 276cd8b2e6fb66d48075e05cd8bd231c_krb5kdc-fixture__hdfs_1               ... 
Removing 276cd8b2e6fb66d48075e05cd8bd231c_krb5kdc-fixture__peppa_1              ... 
Removing a75dff4beb91_276cd8b2e6fb66d48075e05cd8bd231c_krb5kdc-fixture__peppa_1 ... 
Removing 276cd8b2e6fb66d48075e05cd8bd231c_krb5kdc-fixture__peppa_1              ... done
Removing a75dff4beb91_276cd8b2e6fb66d48075e05cd8bd231c_krb5kdc-fixture__peppa_1 ... done
Removing 276cd8b2e6fb66d48075e05cd8bd231c_krb5kdc-fixture__hdfs_1               ... done
Removing network 276cd8b2e6fb66d48075e05cd8bd231c_krb5kdc-fixture__default

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':test:fixtures:krb5kdc-fixture:composeUp'.
> Exit-code 1 when calling /usr/local/bin/docker-compose, stdout: N/A

Build scan: https://gradle.com/s/zhozfy4xnifjs

@alpar-t
Contributor

alpar-t commented Sep 5, 2019

We switched to Docker-based fixtures, so this is no longer applicable.

@alpar-t alpar-t closed this as completed Sep 5, 2019
@mark-vieira mark-vieira added the Team:Delivery Meta label for Delivery team label Nov 11, 2020

9 participants