Improve test runner performance by restoring from CRaC checkpoint #147

fapdash · 2025-03-12T15:59:30Z

Hey there, this PR is the result of some work I did for the OPPSEE project.
As part of that work I was evaluating the Exercism java-test-runner and was unhappy with the performance.
CRaC provides a significant performance boost, but it has some tradeoffs, listed below.
I also explain which alternatives were evaluated and why I think CRaC provides the most benefits.
Let me know what you think. :)

The test runner suffers from the slow startup time of the JVM. My goal with this PR was to significantly improve the performance of the Java test runner.

At the time of writing there are several options to improve JVM startup time:

1. Graal native image (AOT)
2. Shared AppCDS
3. Project Leyden
4. CRaC

Native image can't be used for the test runner as it has to dynamically load classes at runtime and Graal AOT depends on a closed world assumption.

Shared AppCDS improves performance. In my testing a test run went down from ~4s to ~3s.
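For readers unfamiliar with AppCDS, here is a minimal sketch of the usual two-phase flow with a dynamic archive. It is not taken from this PR: the jar and archive names are hypothetical placeholders, and the script only prints the two `java` invocations rather than running them. The flags `-XX:ArchiveClassesAtExit` and `-XX:SharedArchiveFile` are standard HotSpot options (JDK 13 and later).

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical file names; substitute the real jar and archive paths.
jar="test-runner.jar"
archive="test-runner.jsa"

# Training run: record the classes that get loaded and write the archive on JVM exit.
training_command() {
    printf 'java -XX:ArchiveClassesAtExit=%s -jar %s\n' "${archive}" "${jar}"
}

# Later runs: reuse the archived class metadata at startup.
cached_command() {
    printf 'java -XX:SharedArchiveFile=%s -jar %s\n' "${archive}" "${jar}"
}

training_command
cached_command
```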

Project Leyden is very promising, as it tries to do as much work as possible ahead of time without making a closed world assumption. At the time of this writing the project is only in early access, so it's probably going to take a while before it lands in an LTS release.

CRaC brings the best performance improvements, but with some caveats:

- build gets more complicated
- the CRIU based engine needs to run with two additional capabilities:
  - CHECKPOINT_RESTORE
  - SYS_PTRACE
- performance speedup depends on how well the JVM gets warmed up before the checkpoint is taken

bin/run-tests-in-docker.sh had to be changed to start a new container for each test. The restored JVM needs to be run as a specific PID, so it can only be restored once per container life cycle. The test run still finishes faster than before.
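A rough sketch of the one-container-per-test idea, not the actual contents of `bin/run-tests-in-docker.sh`. It assumes one sub-directory per test case, mounts it at a hypothetical `/solution` path, and only prints the `docker run` commands. The two `--cap-add` values are the capabilities listed above; the image name is the one used later in this thread.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Image name as used in this PR; the /solution mount point is hypothetical.
image="exercism/java-test-runner-crac-restore"

# Emit one `docker run` per test directory, so that every restore happens
# in a fresh container instead of reusing a long-lived one.
container_commands() {
    local root="$1" dir abs
    for dir in "${root}"/*/; do
        [ -d "${dir}" ] || continue          # glob matched nothing
        abs="$(cd "${dir}" && pwd)"          # bind mounts need absolute paths
        printf 'docker run --rm --cap-add=CHECKPOINT_RESTORE --cap-add=SYS_PTRACE --volume %s:/solution %s\n' \
            "${abs}" "${image}"
    done
}

container_commands "${1:-tests}"
```

Piping the output to `sh` would execute the commands one after another; the sketch deliberately stops at printing them.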

This commit uses the CRIU engine as I had some issues getting the warp engine to work properly. The warp engine is also only supported by Azul right now and isn't compatible with musl / Alpine yet. In my tests the runtime of my example test went down from ~4s to >1s.
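For orientation, the general CRaC lifecycle looks roughly like the sketch below. It follows the flow in the public CRaC documentation rather than the scripts in this PR; the checkpoint directory and jar name are hypothetical, and the commands are only printed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical names: the checkpoint directory and jar are placeholders.
checkpoint_dir="crac-checkpoint"
jar="test-runner.jar"

crac_commands() {
    # 1. Start the JVM with a checkpoint target and let it warm up.
    printf 'java -XX:CRaCCheckpointTo=%s -jar %s\n' "${checkpoint_dir}" "${jar}"
    # 2. From a second shell, ask the running JVM to checkpoint itself.
    printf 'jcmd %s JDK.checkpoint\n' "${jar}"
    # 3. Later, resume the warmed-up process instead of starting a cold JVM.
    printf 'java -XX:CRaCRestoreFrom=%s\n' "${checkpoint_dir}"
}

crac_commands
```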

By switching to an Alpine based image this change also reduces the size of the container (exercism/java-test-runner-crac-checkpoint) to 271MB, down from 464MB previously.

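CRaC documentation:

- https://crac.org/
- https://docs.azul.com/core/crac/crac-introduction
- https://openjdk.org/projects/crac/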

fapdash · 2025-03-12T16:04:47Z

I assume that my changes currently break the deploy?
Currently I'm using exercism/java-test-runner-crac-restore as image name and I introduced an extra script for the container image build.

SleeplessByte · 2025-03-13T16:07:07Z

> I assume that my changes currently break the deploy? Currently I'm using exercism/java-test-runner-crac-restore as image name and I introduced an extra script for the container image build.

Why do you think your changes will break the deploy?

fapdash · 2025-03-13T16:26:35Z

@SleeplessByte The GH action builds the image directly from the Dockerfile, but with this change it has to call bin/build-crac-checkpoint-image.sh to create the image. This is necessary because we first want to run the initial image to create a checkpoint. The actual production image is then created by committing the container of that run to an image.

SleeplessByte · 2025-03-13T16:44:18Z

Ahhh yes. @fapdash that makes sense.

@ErikSchierboom is there already a pattern for this in another track? IMO it makes sense what's being proposed here...

ErikSchierboom · 2025-03-13T19:06:32Z

> @ErikSchierboom is there already a pattern for this in another track? IMO it makes sense what's being proposed here...

Not really. I'll have to take some time to look at this, but maybe @iHiD can chip in to see if he wants to support those extra flags needing to be passed in.

ErikSchierboom · 2025-03-14T07:51:47Z

So if I'm getting this correctly, the Docker build will actually build another Docker image, correct?

- callers can pass in the optional variable `docker_build_script`
- script should produce a Docker image
- if the script variable is set the default Docker image build step isn't executed

See exercism/java-test-runner#147 for context

fapdash · 2025-03-14T16:49:13Z

@ErikSchierboom Not sure what you mean by "another Docker image", but I'll try and describe the build process.

The build consists of 3 steps:

1. Building an image from the Dockerfile (java-test-runner/bin/build-crac-checkpoint-image.sh, line 10 in 524f74a):

       docker build -t exercism/java-test-runner-crac-checkpoint .

2. Starting a container from the image created in the previous step and running some tests to warm up the JVM.

3. Committing the container from the previous step to an image (java-test-runner/bin/build-crac-checkpoint-image.sh, line 35 in 37de823):

       docker commit --change='ENTRYPOINT ["sh", "/opt/test-runner/bin/run-restore-from-checkpoint.sh"]' java-test-runner-crac exercism/java-test-runner-crac-restore
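Put together, the three steps could be sketched as follows. The `docker build` and `docker commit` lines are the ones quoted above; the `docker run` invocation in step 2 is an assumption (only the container name `java-test-runner-crac` is known from the commit command), and the `run` helper is hypothetical. By default the sketch only prints the commands; it executes them when `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print each command; execute it only when DRY_RUN=0 is set explicitly.
run() {
    printf '+ %s\n' "$*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

build_restore_image() {
    # Step 1: build the image whose entrypoint creates the checkpoint.
    run docker build -t exercism/java-test-runner-crac-checkpoint .

    # Step 2 (assumed invocation): run it under a fixed container name so the
    # checkpoint ends up inside that container's file system.
    run docker run --name java-test-runner-crac \
        --cap-add=CHECKPOINT_RESTORE --cap-add=SYS_PTRACE \
        exercism/java-test-runner-crac-checkpoint

    # Step 3: snapshot the container, checkpoint included, as the final image.
    run docker commit \
        --change='ENTRYPOINT ["sh", "/opt/test-runner/bin/run-restore-from-checkpoint.sh"]' \
        java-test-runner-crac exercism/java-test-runner-crac-restore
}

build_restore_image
```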

My suggestion would be to add an additional optional step to the GHA. It's only run when the caller passes in the variable `docker-build-script`. In that case the [default build](https://github.com/exercism/github-actions/blob/0b4b469c208ceda80022b2f90b0a2c39d3b5c88c/.github/workflows/docker-build-push-image.yml#L98-L109) wouldn't be run. I've opened a PR here: https://github.com/exercism/github-actions/pull/208

fapdash · 2025-03-16T02:45:57Z

@ErikSchierboom I think I found a way to build the image without changing exercism/github-actions/.github/workflows/docker-build-push-image.yml. It's inspired by Micronaut's CRaC Gradle plugin.

The new deploy action works as follows:

1. Build an image with an entrypoint that creates a CRaC checkpoint. This is now done via docker/build-push-action, so the image gets cached. Uses Dockerfile.createCheckpoint.
2. Run the image as a container to create the checkpoint. The checkpoint gets written into a directory on the host.
3. Run Exercism's docker-build-push-image.yml action. The Docker build copies the checkpoint generated in step 2 from the host file system. Uses a different Dockerfile than step 1.
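As a local approximation of that flow (not the actual workflow, which uses docker/build-push-action and the reusable Exercism action), the steps might look like the sketch below. Only `Dockerfile.createCheckpoint` comes from this thread; the builder tag, the host and container directories, the final image tag and the `run` helper are hypothetical. Commands are printed unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print each command; execute it only when DRY_RUN=0 is set explicitly.
run() {
    printf '+ %s\n' "$*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Hypothetical names: builder tag, host directory and in-container path.
builder_image="java-test-runner-create-checkpoint"
host_dir="${PWD}/build/crac-checkpoint"
container_dir="/opt/test-runner/crac-checkpoint"

deploy_steps() {
    # Step 1: image whose entrypoint creates the checkpoint.
    run docker build -f Dockerfile.createCheckpoint -t "${builder_image}" .
    # Step 2: run it; the bind mount makes the checkpoint land on the host.
    run docker run --rm --cap-add=CHECKPOINT_RESTORE --cap-add=SYS_PTRACE \
        --volume "${host_dir}:${container_dir}" "${builder_image}"
    # Step 3: the regular Dockerfile copies the checkpoint from the build context.
    run docker build -t exercism/java-test-runner .
}

deploy_steps
```

Writing the checkpoint to the host keeps the final image a plain `docker build`, which is what lets the existing build action stay unchanged.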

I'm unsure if it'll be a problem that we set up buildx twice, once in this repo's deploy action and then inside the docker-build-push-image action that we're calling.

Please don't do a detailed review yet, there are still some minor todos: cleanup, documentation, strict version pinning in the GHAs. But I'm interested in feedback regarding the new build approach. :)

ErikSchierboom · 2025-03-16T10:14:46Z

I like it!

fapdash · 2025-03-16T20:21:09Z

Cool, thanks for the feedback. :)
I'll try to ready this PR up for review next weekend!

iHiD · 2025-03-19T00:12:38Z

Do you still need me on this?

SleeplessByte · 2025-03-19T01:18:01Z

> Do you still need me on this?

Not right now I think. If #147 (comment) works, no flags are needed.

fapdash · 2025-04-23T16:32:21Z

@ErikSchierboom @SleeplessByte Ready for review!


fapdash · 2025-04-23T16:59:09Z

Meh, just realized that the deploy job already broke last month, it just doesn't get reported inside of the PR.
So my idea in #147 (comment) doesn't work. :(
Latest run: https://github.com/exercism/java-test-runner/actions/runs/14623350771

It's not possible to call a reusable workflow inside of steps, it has to be called by a job, which in turn can't have any steps.

There's the possibility to split the build into two jobs, but the jobs don't share a common workspace. We'd have to first upload the artifacts (jar+checkpoint) and download them in the job calling the reusable workflow: https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/storing-and-sharing-data-from-a-workflow#passing-data-between-jobs-in-a-workflow
So we'd still have to change the GitHub action upstream.
This could be an alternative to exercism/github-actions#208, but I'm not sure if it offers any benefits? Wdyt?

ErikSchierboom · 2025-04-24T08:53:45Z

I think the Docker build script is actually a nicer alternative here, but I'll let @iHiD chime in too.

fapdash requested a review from a team as a code owner March 12, 2025 15:59

fapdash mentioned this pull request Mar 14, 2025

Allow building Docker image via script exercism/github-actions#208

Open

fapdash added 8 commits April 23, 2025 18:33

Remove temp file system for JavaFX

d071b61

Not needed for Exercism

Fix deploy / restructure build process

28e6975

Fix Gradle build in CI job?

cad5cc5

Use COPY --link for better caching

011c65e

https://docs.docker.com/reference/dockerfile/#copy---link

Also use setup-gradle action for caching in ci.yml

469ed11

Get rid of duplication by adding an optional argument

22c00c1

I evaluated `getopt` and `getopts` and then decided to keep it simple and just force clients to pass --no-build as the fourth argument.

Fold create-checkpoint.sh into build-crac-checkpoint-image.sh and add…

020c813

… docs

fapdash force-pushed the use-crac branch from 4ba4a87 to 020c813 Compare April 23, 2025 16:34
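The `--no-build` convention from commit 22c00c1 above can be illustrated with a few lines of shell. This is a sketch of the general idea, not the script from the PR: the meaning of the first three positional arguments is not stated in this thread, so they are treated as opaque, and `should_build` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Succeeds when the image should be built, fails when the caller passed
# --no-build as the fourth positional argument. No getopt/getopts involved.
should_build() {
    [ "${4:-}" != "--no-build" ]
}

if should_build "$@"; then
    echo "building image"
else
    echo "skipping image build"
fi
```

The trade-off is that the flag is position-dependent: it is ignored anywhere but fourth place, which is the price for avoiding option parsing.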
