Improve test runner performance by restoring from CRaC checkpoint #147

Open: fapdash wants to merge 8 commits into main


fapdash
Contributor

@fapdash fapdash commented Mar 12, 2025

Hey there, this PR is the result of some work I did for the OPPSEE project.
As part of that work I was evaluating the Exercism java-test-runner and was unhappy with the performance.
CRaC provides a significant performance boost, but it has some tradeoffs, listed below.
I also explain which alternatives were evaluated and why I think CRaC provides the most benefits.
Let me know what you think. :)


The test runner suffers from the slow startup time of the JVM. My goal with this PR was to significantly improve the performance of the Java test runner.

There are at the time of writing several options to improve JVM startup time:

  1. Graal native image (AOT)
  2. Shared AppCDS
  3. Project Leyden
  4. CRaC

Native image can't be used for the test runner as it has to dynamically load classes at runtime and Graal AOT depends on a closed world assumption.

Shared AppCDS improves performance. In my testing, a test run went down from ~4s to ~3s.
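
For reference, a dynamic AppCDS archive can be created and used with two standard HotSpot flags (JDK 13 or newer). This is only a sketch: the jar and archive file names are placeholders rather than the ones used in this PR, and the helper functions merely print the commands.

```shell
# Placeholder names; adjust to the real jar and archive locations.
JAR="test-runner.jar"
ARCHIVE="test-runner.jsa"

# Training run: dump the loaded classes into an archive when the JVM exits.
appcds_dump_cmd() {
    echo "java -XX:ArchiveClassesAtExit=$ARCHIVE -jar $JAR"
}

# Later runs: map the pre-parsed class data from the archive at startup.
appcds_run_cmd() {
    echo "java -XX:SharedArchiveFile=$ARCHIVE -jar $JAR"
}

appcds_dump_cmd
appcds_run_cmd
```

To share the archive between containers, it would have to be created at image build time and shipped inside the image.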

Project Leyden is very promising, as it tries to do as much work as possible ahead of time without making a closed world assumption. At the time of this writing the project is only in early access, so it's probably going to take a while before it lands in an LTS release.

CRaC brings the best performance improvements, but with some caveats:

  • build gets more complicated
  • the CRIU based engine needs to run with two additional capabilities:
    • CHECKPOINT_RESTORE
    • SYS_PTRACE
  • performance speedup depends on how well the JVM gets warmed up before the checkpoint is taken
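
As a rough sketch of the capability requirement, a restore container could be started with Docker's `--cap-add` flag. The image name is the one mentioned later in this thread; the helper name `restore_cmd` and the `--rm` flag are assumptions for illustration, and the function only prints the command.

```shell
# Print the docker command that would start a container with the two
# extra capabilities the CRIU engine needs. Nothing is executed here.
restore_cmd() {
    image="$1"
    echo "docker run --rm --cap-add CHECKPOINT_RESTORE --cap-add SYS_PTRACE $image"
}

restore_cmd exercism/java-test-runner-crac-restore
```

Whichever platform runs the production containers would need to grant the same two capabilities.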

bin/run-tests-in-docker.sh had to be changed to start a new container for each test. The restored JVM needs to be run as a specific PID, so it can only be restored once per container life cycle. The test run still finishes faster than before.
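
A minimal sketch of the one-container-per-test idea, assuming a hypothetical directory layout with one subdirectory per test case and a hypothetical `/solution` mount point; the actual `bin/run-tests-in-docker.sh` may look quite different. The function only prints the commands it would run.

```shell
# Print one `docker run` per test directory, so every test gets a fresh
# container and therefore a fresh restore of the checkpointed JVM.
# IMAGE and the /solution mount point are assumptions for illustration.
IMAGE="exercism/java-test-runner-crac-restore"

per_test_commands() {
    tests_root="$1"
    for test_dir in "$tests_root"/*/; do
        # Skip the unexpanded glob when there are no subdirectories.
        [ -d "$test_dir" ] || continue
        slug=$(basename "$test_dir")
        echo "docker run --rm --cap-add CHECKPOINT_RESTORE --cap-add SYS_PTRACE -v $test_dir:/solution $IMAGE $slug"
    done
}
```

Calling `per_test_commands tests` would list one command per subdirectory of `tests`.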

This commit uses the CRIU engine because I had some issues getting the warp engine to work properly. The warp engine is also only supported by Azul right now and isn't compatible with musl / Alpine yet. In my tests the runtime of my example test went down from ~4s to >1s.

By switching to an Alpine-based image, this change also reduces the size of the container image (exercism/java-test-runner-crac-checkpoint) to 271MB, down from 464MB.

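CRaC documentation:

  • https://crac.org/
  • https://docs.azul.com/core/crac/crac-introduction
  • https://openjdk.org/projects/crac/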

@fapdash fapdash requested a review from a team as a code owner March 12, 2025 15:59
@fapdash
Contributor Author

fapdash commented Mar 12, 2025

I assume that my changes currently break the deploy?
Currently I'm using exercism/java-test-runner-crac-restore as image name and I introduced an extra script for the container image build.

@SleeplessByte
Member

> I assume that my changes currently break the deploy? Currently I'm using exercism/java-test-runner-crac-restore as image name and I introduced an extra script for the container image build.

Why do you think your changes will break the deploy?

@fapdash
Contributor Author

fapdash commented Mar 13, 2025

@SleeplessByte The GH action builds the image directly from the Dockerfile, but with this change it has to call bin/build-crac-checkpoint-image.sh to create the image. This is necessary because we first want to run the initial image to create a checkpoint; the actual production image is then created by committing the container of that run to an image.

@SleeplessByte
Member

Ahhh yes. @fapdash that makes sense.

@ErikSchierboom is there already a pattern for this in another track? IMO it makes sense what's being proposed here...

@ErikSchierboom
Member

> @ErikSchierboom is there already a pattern for this in another track? IMO it makes sense what's being proposed here...

Not really. I'll have to take some time to look at this, but maybe @iHiD can chip in to see if he wants to support those extra flags needing to be passed in.

@ErikSchierboom
Member

So if I'm getting this correctly, the Docker build will actually build another Docker image, correct?

fapdash added a commit to fapdash/exercism--github-actions that referenced this pull request Mar 14, 2025
- callers can pass in the optional variable `docker_build_script`
- script should produce a Docker image
- if the script variable is set the default Docker image build step
  isn't executed

See exercism/java-test-runner#147 for context
@fapdash
Contributor Author

fapdash commented Mar 14, 2025

@ErikSchierboom Not sure what you mean by "another Docker image", but I'll try to describe the build process.

The build consists of 3 steps:

  1. Building an image from the Dockerfile
    docker build -t exercism/java-test-runner-crac-checkpoint .
  2. Starting a container from the image created in the previous step and running some tests to warm up the JVM before the checkpoint is created.
  3. Committing the container from the previous step to an image
    docker commit --change='ENTRYPOINT ["sh", "/opt/test-runner/bin/run-restore-from-checkpoint.sh"]' java-test-runner-crac exercism/java-test-runner-crac-restore
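
Taken together, the three steps could look roughly like the following dry-run sketch. The image names, the container name, and the `docker commit` arguments come from the steps above; the `--cap-add` flags on `docker run` are an assumption based on the capabilities listed in the PR description, and the real `bin/build-crac-checkpoint-image.sh` may differ.

```shell
CHECKPOINT_IMAGE="exercism/java-test-runner-crac-checkpoint"
RESTORE_IMAGE="exercism/java-test-runner-crac-restore"
CONTAINER="java-test-runner-crac"

# Print commands by default; set DRY_RUN=0 to actually execute them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

build_restore_image() {
    # 1. Build the image that is used to create the checkpoint.
    run docker build -t "$CHECKPOINT_IMAGE" .
    # 2. Run it once; the checkpoint ends up in the container's file system.
    #    (The capability flags are an assumption, see the text above.)
    run docker run --cap-add CHECKPOINT_RESTORE --cap-add SYS_PTRACE \
        --name "$CONTAINER" "$CHECKPOINT_IMAGE"
    # 3. Commit the stopped container, swapping in the restore entrypoint.
    run docker commit \
        --change='ENTRYPOINT ["sh", "/opt/test-runner/bin/run-restore-from-checkpoint.sh"]' \
        "$CONTAINER" "$RESTORE_IMAGE"
}

build_restore_image
```

Because the checkpoint lives inside the committed container, a plain `docker build` from the Dockerfile cannot produce the final image on its own.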

My suggestion would be to add an additional optional step to the GHA. It's only run when the caller passes in the variable `docker-build-script`. In that case the [default build](https://github.com/exercism/github-actions/blob/0b4b469c208ceda80022b2f90b0a2c39d3b5c88c/.github/workflows/docker-build-push-image.yml#L98-L109) wouldn't be run. I've opened a PR here: https://github.com/exercism/github-actions/pull/208

@fapdash
Contributor Author

fapdash commented Mar 16, 2025

@ErikSchierboom I think I found a way to build the image without changing exercism/github-actions/.github/workflows/docker-build-push-image.yml. It's inspired by Micronaut's CRaC Gradle plugin.

The new deploy action works as follows:

  1. Build an image with an entrypoint that creates a CRaC checkpoint. This is now done via docker/build-push-action, so the image gets cached. Uses Dockerfile.createCheckpoint.
  2. Run the image as a container to create the checkpoint. The checkpoint gets written into a directory on the host.
  3. Run Exercism's docker-build-push-image.yml action. The Docker build copies the checkpoint generated in step 2 from the host file system. Uses a different Dockerfile than step 1.
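
The same flow can be sketched as plain Docker commands. `Dockerfile.createCheckpoint` is named above; the image tags, the host directory `build/checkpoint`, the in-container path `/opt/test-runner/checkpoint`, and the capability flags are assumptions for illustration. In the actual deploy action, steps 1 and 3 are meant to be carried out by `docker/build-push-action` and the reusable workflow rather than by hand.

```shell
# Hypothetical names, for illustration only.
CHECKPOINT_IMAGE="java-test-runner-create-checkpoint"
FINAL_IMAGE="exercism/java-test-runner"
HOST_CHECKPOINT_DIR="$PWD/build/checkpoint"

deploy_commands() {
    # 1. Image whose entrypoint creates the checkpoint.
    echo "docker build -f Dockerfile.createCheckpoint -t $CHECKPOINT_IMAGE ."
    # 2. Run it with a bind mount so the checkpoint lands on the host.
    echo "docker run --rm --cap-add CHECKPOINT_RESTORE --cap-add SYS_PTRACE -v $HOST_CHECKPOINT_DIR:/opt/test-runner/checkpoint $CHECKPOINT_IMAGE"
    # 3. Regular build; its Dockerfile copies the checkpoint from the build context.
    echo "docker build -t $FINAL_IMAGE ."
}

deploy_commands
```

Compared with the `docker commit` approach, the final image here comes from an ordinary `docker build`, which is what lets the existing reusable workflow stay unchanged.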

I'm unsure if it'll be a problem that we set up buildx twice, once in this repo's deploy action and then inside the docker-build-push-image action that we're calling.

Please don't do a detailed review yet, there are still some minor todos: cleanup, documentation, strict version pinning in the GHAs. But I'm interested in feedback regarding the new build approach. :)

@ErikSchierboom
Member

I like it!

@fapdash
Contributor Author

fapdash commented Mar 16, 2025

Cool, thanks for the feedback. :)
I'll try to ready this PR up for review next weekend!

@iHiD
Member

iHiD commented Mar 19, 2025

Do you still need me on this?

@SleeplessByte
Member

> Do you still need me on this?

Not right now I think. If #147 (comment) works, no flags are needed.

@fapdash
Contributor Author

fapdash commented Apr 23, 2025

@ErikSchierboom @SleeplessByte Ready for review!

fapdash added 8 commits April 23, 2025 18:33
The test runner suffers from the slow startup time of the JVM.
My goal with this PR was to significantly improve the performance
of the Java test runner.

There are at the time of writing several options to improve JVM
startup time:
1. Graal native image (AOT)
2. Shared AppCDS
3. Project Leyden
4. CRaC

Native image can't be used for the test runner as it has to
dynamically load classes at runtime and Graal AOT depends on a closed
world assumption.

Shared AppCDS improves performance. In my testing, a test run went
down from ~4s to ~3s.

Project Leyden is very promising, as it tries to do as much work
as possible ahead of time without making a closed world assumption.
At the time of this writing the project is only in early access,
so it's probably going to take a while before it lands in an LTS
release.

CRaC brings the best performance improvements, but with some caveats:
- build gets more complicated
- the CRIU based engine needs to run with two additional capabilities:
    - CHECKPOINT_RESTORE
    - SYS_PTRACE
- performance speedup depends on how well the JVM gets warmed up
  before the checkpoint is taken

bin/run-tests-in-docker.sh had to be adjusted to start a new
container for each test. The restored JVM needs to be run as a
specific PID, so it can only be restored once per container life
cycle.
The test run still finishes faster than before.

This commit uses the CRIU engine because I had some issues getting the
warp engine to work properly. The warp engine is also only supported by
Azul right now and isn't compatible with musl / Alpine yet.
In my tests the runtime of my example test went down from ~4s to
>1s.
By switching to an Alpine-based image, this change also reduces the size
of the container image (exercism/java-test-runner-crac-checkpoint) to
271MB, down from 464MB.

CRaC documentation:
- https://crac.org/
- https://docs.azul.com/core/crac/crac-introduction
- https://openjdk.org/projects/crac/
Not needed for Exercism
I evaluated `getopt` and `getopts` and
then decided to keep it simple and just force clients
to pass --no-build as the fourth argument.
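
A minimal sketch of such a positional check, assuming the first three arguments are the usual test runner arguments (slug, solution directory, output directory); the function name `should_build` is hypothetical and not taken from the PR.

```shell
# Succeed (meaning: build) unless the fourth positional argument is
# exactly --no-build.
should_build() {
    [ "${4:-}" != "--no-build" ]
}

if should_build "$@"; then
    echo "building image"
else
    echo "skipping image build"
fi
```

The trade-off compared with `getopt` or `getopts` is that the flag is only recognized in that one position.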
@fapdash
Contributor Author

fapdash commented Apr 23, 2025

Meh, just realized that the deploy job already broke last month, it just doesn't get reported inside of the PR.
So my idea in #147 (comment) doesn't work. :(
Latest run: https://github.com/exercism/java-test-runner/actions/runs/14623350771

It's not possible to call a reusable workflow inside of steps, it has to be called by a job, which in turn can't have any steps.

There's the possibility to split the build into two jobs, but the jobs don't share a common workspace. We'd have to first upload the artifacts (jar+checkpoint) and download them in the job calling the reusable workflow: https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/storing-and-sharing-data-from-a-workflow#passing-data-between-jobs-in-a-workflow
So we'd still have to change the GitHub action upstream.
This could be an alternative to exercism/github-actions#208, but I'm not sure if it offers any benefits? Wdyt?

@ErikSchierboom
Member

I think the Docker build script is actually a nicer alternative here, but I'll let @iHiD chime in too.
