Docker build task fails on CentOS 8 ARM CI workers #71138

Closed
mark-vieira opened this issue Mar 31, 2021 · 10 comments · Fixed by #71199
Labels
:Delivery/Packaging RPM and deb packaging, tar and zip archives, shell and batch scripts Team:Delivery Meta label for Delivery team

Comments

@mark-vieira
Contributor

I'm seeing a very strange thing. Building our Docker images on the new CentOS ARM workers is failing due to a missing command, specifically addgroup. Sure enough, this utility doesn't exist on those machines. My question is, why does it matter? This command is being executed inside the container so why would the setup on the host machine make any difference? Building the image on an Ubuntu host where addgroup exists works just fine. Am I missing something here?

https://gradle-enterprise.elastic.co/s/ee3bnyz42wxei/console-log?task=:distribution:docker:buildAarch64DockerImage#L8552

@mark-vieira mark-vieira added the :Delivery/Packaging RPM and deb packaging, tar and zip archives, shell and batch scripts label Mar 31, 2021
@elasticmachine elasticmachine added the Team:Delivery Meta label for Delivery team label Mar 31, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-delivery (Team:Delivery)

@mark-vieira
Contributor Author

@pugnascotia There's something obvious I'm missing, right?

@mark-vieira
Contributor Author

I already checked whether the templated Dockerfile differs between the two environments; they are identical apart from timestamps.

$ diff Dockerfile Dockerfile.ubuntu 
236c236
< LABEL org.label-schema.build-date="2021-03-31T17:56:58.918160148Z" \
---
> LABEL org.label-schema.build-date="2021-03-31T16:07:17.857400544Z" \
246c246
<   org.opencontainers.image.created="2021-03-31T17:56:58.918160148Z" \
---
>   org.opencontainers.image.created="2021-03-31T16:07:17.857400544Z" \
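To rule out timestamp noise when repeating this comparison, one option is to have diff skip the label lines entirely. A minimal sketch, assuming a diff that supports -I (GNU diffutils does) and using the label names from the output above; the function name is hypothetical:

```shell
# Hypothetical helper: compare two generated Dockerfiles while ignoring
# hunks that consist only of the timestamp label lines shown above.
# Assumes a diff implementation that supports -I (GNU diffutils does).
diff_ignoring_timestamps() {
  diff \
    -I 'org\.label-schema\.build-date=' \
    -I 'org\.opencontainers\.image\.created=' \
    "$1" "$2"
}

# Usage:
#   diff_ignoring_timestamps Dockerfile Dockerfile.ubuntu
```

If the two files differ only in those labels, this prints nothing; any other difference still shows up as a normal hunk.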

Running docker build manually in the generated context gives me the same result as well, so it's nothing funny to do with the Gradle integration.

@pugnascotia
Contributor

Uh...that's weird. I guess if busybox --list-full on ARM didn't include addgroup then we wouldn't create a link for it. But that would be really peculiar, and would have failed before now, surely.
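For readers unfamiliar with the linking step being discussed: busybox --list-full prints one applet path per line, and a symlink to the busybox binary is created for each. The following is a minimal sketch of that idea, not the actual Dockerfile contents; the function name and its arguments are hypothetical.

```shell
# Hypothetical sketch of the applet-linking step.
#   $1 = path to the busybox binary
#   $2 = root directory to populate (for example /rootfs)
link_busybox_applets() {
  busybox_bin=$1
  root=$2
  # --list-full prints applet paths relative to /, one per line (e.g. bin/ls).
  # If the binary cannot be executed, nothing is printed and no links are made.
  "$busybox_bin" --list-full | while read -r applet; do
    mkdir -p "$root/$(dirname "$applet")"
    # Leave existing files (real binaries such as bash or curl) untouched.
    [ -e "$root/$applet" ] || ln -s "$busybox_bin" "$root/$applet"
  done
}
```

Note that a busybox binary that fails to run produces an empty applet list, so the loop silently creates nothing rather than failing the build at that step.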

@mark-vieira
Contributor Author

mark-vieira commented Mar 31, 2021

Right, and wouldn't it blow up on the Ubuntu ARM machine as well?

@pugnascotia
Contributor

You'd think so. I also just built the image on my Mac M1 and it was fine.

@mark-vieira
Contributor Author

Ok, it's not due to the command missing on the host. Changing addgroup to groupadd results in the same error despite the latter being on the host. Ok, so that makes sense.

My guess is the whole copying from rootfs stuff isn't behaving the way we want. Comparing docker info between CentOS and Ubuntu, this is the only meaningful difference:

---
>   Backing Filesystem: extfs

So CentOS is using XFS whereas Ubuntu is using extfs.
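A quick way to pull just that field out on each worker is to filter the plain-text docker info output. This is only a sketch: the function name is hypothetical, and the line is present only for storage drivers that report a backing filesystem.

```shell
# Hypothetical helper: read `docker info` text on stdin and print the value
# of the "Backing Filesystem" line, if the storage driver reports one.
backing_filesystem() {
  sed -n 's/^[[:space:]]*Backing Filesystem:[[:space:]]*//p'
}

# Usage (requires a running Docker daemon):
#   docker info | backing_filesystem
```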

@mark-vieira
Contributor Author

The /rootfs filesystem on the CentOS variant just seems... empty.

Step 25/47 : RUN ls -al /rootfs/usr/bin
 ---> Running in ad5037fb36e1
total 7104
dr-xr-xr-x. 1 root root      18 Mar 31 22:34 .
drwxr-xr-x. 1 root root      17 Mar 31 22:34 ..
-rwxr-xr-x. 1 root root      29 Jul 21  2020 alias
-rwxr-xr-x. 1 root root 1235400 Jul 21  2020 bash
lrwxrwxrwx. 1 root root      10 Jul 21  2020 bashbug -> bashbug-64
-rwxr-xr-x. 1 root root    7316 Jul 21  2020 bashbug-64
-rwxr-xr-x. 1 root root      26 Jul 21  2020 bg
-rwxr-xr-x. 1 root root 1148524 Mar 31 22:34 busybox
-rwxr-xr-x. 1 root root    3283 Jul 20  2020 catchsegv
-rwxr-xr-x. 1 root root      26 Jul 21  2020 cd
-rwxr-xr-x. 1 root root      31 Jul 21  2020 command
-rwxr-xr-x. 1 root root 2720904 Mar 31 22:33 curl
-rwxr-xr-x. 1 root root      26 Jul 21  2020 fc
-rwxr-xr-x. 1 root root      26 Jul 21  2020 fg
-rwxr-xr-x. 1 root root   69728 Apr  7  2020 funzip
-rwxr-xr-x. 1 root root   80352 Jul 20  2020 gencat
-rwxr-xr-x. 1 root root   76472 Jul 20  2020 getconf
-rwxr-xr-x. 1 root root   84608 Jul 20  2020 getent
-rwxr-xr-x. 1 root root      31 Jul 21  2020 getopts
-rwxr-xr-x. 1 root root      28 Jul 21  2020 hash
-rwxr-xr-x. 1 root root   96368 Jul 20  2020 iconv
-rwxr-xr-x. 1 root root      28 Jul 21  2020 jobs
-rwxr-xr-x. 1 root root    5342 Jul 20  2020 ldd
-rwxr-xr-x. 1 root root   98032 Jul 20  2020 locale
-rwxr-xr-x. 1 root root   82152 Jul 20  2020 makedb
-rwxr-xr-x. 1 root root   78584 Jul 20  2020 pldd
-rwxr-xr-x. 1 root root      28 Jul 21  2020 read
lrwxrwxrwx. 1 root root       4 Jul 21  2020 sh -> bash
-rwxr-xr-x. 1 root root    4281 Jul 20  2020 sotruss
-rwxr-xr-x. 1 root root   79824 Jul 20  2020 sprof
-rwxr-xr-x. 1 root root   23904 Mar 31 22:34 tini
-rwxr-xr-x. 1 root root      28 Jul 21  2020 type
-rwxr-xr-x. 1 root root   15370 Jul 20  2020 tzselect
-rwxr-xr-x. 1 root root      30 Jul 21  2020 ulimit
-rwxr-xr-x. 1 root root      29 Jul 21  2020 umask
-rwxr-xr-x. 1 root root      31 Jul 21  2020 unalias
-rwxr-xr-x. 2 root root  202096 Apr  7  2020 unzip
-rwxr-xr-x. 1 root root  136144 Apr  7  2020 unzipsfx
-rwxr-xr-x. 1 root root      28 Jul 21  2020 wait
-rwxr-xr-x. 1 root root  278696 May 11  2019 zip
-rwxr-xr-x. 1 root root  141544 May 11  2019 zipcloak
-rwxr-xr-x. 1 root root    2953 Oct 10  2008 zipgrep
-rwxr-xr-x. 2 root root  202096 Apr  7  2020 zipinfo
-rwxr-xr-x. 1 root root  140456 May 11  2019 zipnote
-rwxr-xr-x. 1 root root  140440 May 11  2019 zipsplit

That's all that's there. Should be loads more. I'm wondering if there is weird symlink stuff happening implicitly that we're relying on.

@mark-vieira
Contributor Author

Aha!

/bin/sh: /rootfs/bin/busybox: cannot execute binary file: Exec format error

Are we downloading the wrong busybox? This is typically the error you get when you try to run x86 on aarch64 or vice versa. Investigating...
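One way to test the wrong-architecture theory without extra tooling is to read the ELF header directly. The sketch below assumes a little-endian ELF file and reads the 2-byte e_machine field at offset 18; the function name is hypothetical, and only the two architectures relevant here are mapped.

```shell
# Hypothetical helper: report the target architecture of a little-endian
# ELF binary from the e_machine field (2 bytes at offset 18).
elf_machine() {
  # od prints the two bytes as unsigned decimals, low byte first.
  bytes=$(od -An -tu1 -j18 -N2 "$1") || return 1
  set -- $bytes
  [ "$#" -eq 2 ] || { echo "unknown"; return 1; }
  case $(( $1 + $2 * 256 )) in
    62)  echo "x86_64" ;;   # EM_X86_64
    183) echo "aarch64" ;;  # EM_AARCH64
    *)   echo "unknown" ;;
  esac
}

# Usage:
#   elf_machine /rootfs/bin/busybox
```

If this reports the expected architecture, the Exec format error has some other cause than a mismatched download.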

@mark-vieira
Contributor Author

mark-vieira commented Mar 31, 2021

Yeah, so something is busted with their public build on CentOS 8. CentOS has a much older kernel than Ubuntu, so maybe that's the incompatibility? Interestingly enough, I can run the busybox Docker image just fine. So maybe a workaround would be, instead of downloading the busybox binary directly, to do a FROM busybox and copy the bits we need? That assumes the binaries in the Docker image are statically compiled (which I believe they are, since the image is based on scratch).
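The proposed workaround could look roughly like the following multi-stage build, fed to docker build on stdin. This is a sketch only: the stage name, the centos:8 base image, the function name, and the destination path are all assumptions for illustration, not the project's actual Dockerfile.

```shell
# Hypothetical sketch: take the busybox binary from the official busybox
# image in a multi-stage build instead of downloading it separately.
#   $1 = tag for the resulting image
build_with_image_busybox() {
  # "-" tells docker build to read the Dockerfile from stdin (no context).
  docker build -t "$1" - <<'EOF'
# Stage that only exists to provide the busybox binary.
FROM busybox AS busybox-src

# Assumed base image; the real builder stage may differ.
FROM centos:8
COPY --from=busybox-src /bin/busybox /rootfs/bin/busybox
EOF
}

# Usage (requires a running Docker daemon):
#   build_with_image_busybox busybox-copy-demo
```

Copying from the image avoids depending on the separately published binary, at the cost of relying on the image's layout staying stable.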
