Improve test runner performance by restoring from CRaC checkpoint #147
Conversation
I assume that my changes currently break the deploy?
Why do you think your changes will break the deploy?
@SleeplessByte The GH action builds the image directly from the Dockerfile, but with this change it has to call `bin/build-crac-checkpoint-image.sh` to create the image. This is necessary because we first want to run the initial image to create a checkpoint; the actual production image is created by committing the container of that run to an image.
Ahhh yes. @fapdash that makes sense. @ErikSchierboom is there already a pattern for this in another track? IMO it makes sense what's being proposed here...
Not really. I'll have to take some time to look at this, but maybe @iHiD can chip in to see if he wants to support those extra flags needing to be passed in.
So if I'm getting this correctly, the Docker build will actually build another Docker image, correct?
- callers can pass in the optional variable `docker_build_script`
- script should produce a Docker image
- if the script variable is set the default Docker image build step isn't executed

See exercism/java-test-runner#147 for context
@ErikSchierboom Not sure what you mean by "another Docker image", but I'll try to describe the build process. The build consists of 3 steps (sketched below):

1. Build the initial image directly from the Dockerfile.
2. Run that image so the warmed-up JVM can create a CRaC checkpoint.
3. Commit the container of that run to create the actual production image.
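A minimal sketch of what `bin/build-crac-checkpoint-image.sh` could boil down to; the intermediate tag and the warmup entrypoint are assumptions, while the production image name and the two capabilities are the ones mentioned in this PR:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Step 1: build the initial image directly from the Dockerfile.
docker build -t java-test-runner-initial .   # intermediate tag is an assumption

# Step 2: run it with the capabilities CRIU needs; the container's entrypoint
# warms up the JVM and writes a checkpoint (e.g. via -XX:CRaCCheckpointTo).
docker run --name checkpoint-run \
  --cap-add CHECKPOINT_RESTORE \
  --cap-add SYS_PTRACE \
  java-test-runner-initial

# Step 3: commit that container, checkpoint included, as the production image.
docker commit checkpoint-run exercism/java-test-runner-crac-checkpoint
docker rm checkpoint-run
```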
My suggestion would be to add an additional optional step to the GHA. It's only run when the caller passes in the variable `docker-build-script`. In that case the [default build](https://github.com/exercism/github-actions/blob/0b4b469c208ceda80022b2f90b0a2c39d3b5c88c/.github/workflows/docker-build-push-image.yml#L98-L109) wouldn't be run. I've opened a PR here: https://github.com/exercism/github-actions/pull/208
@ErikSchierboom I think I found a way to build the image without changing the reusable workflow. The new deploy action works as follows:
I'm unsure if it'll be a problem that we set up buildx twice, once in this repo's deploy action and then inside the reusable workflow.

Please don't do a detailed review yet, there are still some minor todos: cleanup, documentation, strict version pinning in the GHAs. But I'm interested in feedback regarding the new build approach. :)
I like it! |
Cool, thanks for the feedback. :) |
Do you still need me on this? |
Not right now I think. If #147 (comment) works, no flags are needed. |
@ErikSchierboom @SleeplessByte Ready for review! |
Not needed for Exercism
I evaluated `getopt` and `getopts` and then decided to keep it simple and just force clients to pass `--no-build` as the fourth argument.
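A minimal sketch of that handling, assuming hypothetical names for the first three positional arguments:

```bash
# bin/run-tests-in-docker.sh (sketch): the first three positional arguments
# are assumed placeholders; only the optional fourth --no-build flag is new.
exercise_slug="$1"
solution_dir="$2"
output_dir="$3"

build_image=true
if [ "${4:-}" = "--no-build" ]; then
  build_image=false
fi

if [ "$build_image" = true ]; then
  ./bin/build-crac-checkpoint-image.sh   # script name taken from this PR
fi
```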
Meh, just realized that the deploy job already broke last month, it just doesn't get reported inside the PR. It's not possible to call a reusable workflow from inside a job's steps.

There's the possibility to split the build into two jobs, but the jobs don't share a common workspace. We'd have to first upload the artifacts (jar + checkpoint) and download them in the job calling the reusable workflow: https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/storing-and-sharing-data-from-a-workflow#passing-data-between-jobs-in-a-workflow
I think the Docker build script is actually a nicer alternative here, but I'll let @iHiD chime in too. |
Hey there, this PR is the result of some work I did for the OPPSEE project.
As part of that work I was evaluating the Exercism java-test-runner and was unhappy with the performance.
CRaC provides a significant performance boost, but it has some tradeoffs, listed below.
I also explain which alternatives were evaluated and why I think CRaC provides the most benefits.
Let me know what you think. :)
The test runner suffers from the slow startup time of the JVM. My goal with this PR was to significantly improve the performance of the Java test runner.
There are, at the time of writing, several options to improve JVM startup time:

1. Graal native image (AOT)
2. Shared AppCDS
3. Project Leyden
4. CRaC
Native image can't be used for the test runner, as the runner has to dynamically load classes at runtime and Graal AOT depends on a closed-world assumption.
Shared AppCDS improves performance. In my testing, a test run went down from ~4s to ~3s.
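For context, dynamic AppCDS archiving (available since JDK 13) roughly works like this; the jar name is a placeholder:

```bash
# First run: record the loaded classes and dump them to an archive on exit.
java -XX:ArchiveClassesAtExit=test-runner.jsa -jar test-runner.jar

# Subsequent runs: map the shared archive to skip class loading/verification work.
java -XX:SharedArchiveFile=test-runner.jsa -jar test-runner.jar
```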
Project Leyden is very promising, as it tries to do as much work as possible ahead of time without making a closed-world assumption. At the time of this writing the project is only in early access, so it's probably going to take a while before it lands in an LTS release.
CRaC brings the best performance improvements, but with some caveats:

- the build gets more complicated
- the CRIU-based engine needs to run with two additional capabilities (see the sketch after this list):
  - CHECKPOINT_RESTORE
  - SYS_PTRACE
- the performance speedup depends on how well the JVM gets warmed up before the checkpoint is taken
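For illustration, a restore inside a container then needs both capabilities; a hedged sketch, where the checkpoint path is an assumption while the flag itself is standard CRaC:

```bash
# Restore the warmed-up JVM from the checkpoint taken at build time.
# /checkpoint is an assumed path inside the committed image.
docker run --rm \
  --cap-add CHECKPOINT_RESTORE \
  --cap-add SYS_PTRACE \
  exercism/java-test-runner-crac-checkpoint \
  java -XX:CRaCRestoreFrom=/checkpoint
```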
`bin/run-tests-in-docker.sh` had to be changed to start a new container for each test. The restored JVM needs to run under a specific PID, so it can only be restored once per container lifecycle. The test run still finishes faster than before.
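In other words, roughly the following, with a hypothetical loop over solution directories and an illustrative mount target:

```bash
# One fresh container per test run: CRIU restores the JVM onto the PID it
# had at checkpoint time, so each container can restore exactly once.
for solution in tests/*/; do
  docker run --rm \
    --cap-add CHECKPOINT_RESTORE \
    --cap-add SYS_PTRACE \
    --mount "type=bind,src=$PWD/$solution,dst=/solution" \
    exercism/java-test-runner-crac-checkpoint
done
```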
This commit uses the CRIU engine, as I had some issues getting the warp engine to work properly. The warp engine is also only supported by Azul right now and isn't compatible with musl / Alpine yet. In my tests the runtime of my example test went down from ~4s to just over 1s.
By switching to an Alpine based image this change also reduces the size of the image (`exercism/java-test-runner-crac-checkpoint`) to 271MB, down from 464MB.
CRaC documentation:

- https://crac.org/
- https://docs.azul.com/core/crac/crac-introduction
- https://openjdk.org/projects/crac/