[CRaC] Fix hangup after restoring #34372

Open
YaSuenag wants to merge 1 commit into main from pr/crac-restore-hang

@YaSuenag commented Feb 6, 2025

I ran the following ApplicationRunner Spring Boot app and took a checkpoint with CRIU. After restoring, the app did not finish.

  @Override
  public void run(ApplicationArguments args) throws Exception {
    if (args.containsOption("checkpoint")) {
      System.out.println("Ready to obtain checkpoint...");
      // Block here until the process has been restored from the checkpoint
      cpCoordinator.await();
    }
    System.out.println("from Spring Boot App");
  }

I took a thread dump and got the following stack trace. It shows that the thread started by the beforeCheckpoint CRaC handler is waiting on a CyclicBarrier.

"prevent-shutdown" #29 [1504] prio=5 os_prio=0 cpu=0.17ms elapsed=25.76s tid=0x00007feb1017db00 nid=1504 waiting on condition  [0x00007feb4e22b000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base/Native Method)
        - parking to wait for  <0x000000008a9279b0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(java.base/LockSupport.java:371)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block(java.base/AbstractQueuedSynchronizer.java:519)
        at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.base/ForkJoinPool.java:3780)
        at java.util.concurrent.ForkJoinPool.managedBlock(java.base/ForkJoinPool.java:3725)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base/AbstractQueuedSynchronizer.java:1707)
        at java.util.concurrent.CyclicBarrier.dowait(java.base/CyclicBarrier.java:236)
        at java.util.concurrent.CyclicBarrier.await(java.base/CyclicBarrier.java:364)
        at org.springframework.context.support.DefaultLifecycleProcessor$CracResourceAdapter.awaitPreventShutdownBarrier(DefaultLifecycleProcessor.java:634)
        at org.springframework.context.support.DefaultLifecycleProcessor$CracResourceAdapter.lambda$beforeCheckpoint$0(DefaultLifecycleProcessor.java:606)
        at org.springframework.context.support.DefaultLifecycleProcessor$CracResourceAdapter$$Lambda/0x00007feb501c37c0.run(Unknown Source)
        at java.lang.Thread.runWith(java.base/Thread.java:1596)
        at java.lang.Thread.run(java.base/Thread.java:1583)
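
For readers who want to see the same thread state outside of Spring, here is a minimal sketch, assuming nothing beyond the JDK. The class name `ParkedOnBarrierSketch` and the thread name `prevent-shutdown-demo` are made up for illustration. It shows that a lone waiter on a two-party CyclicBarrier sits in WAITING with `CyclicBarrier.await` on its stack, which is the shape of the dump above.

```java
import java.util.Arrays;
import java.util.concurrent.CyclicBarrier;

// Hypothetical demo class; it is not part of Spring or of this PR.
final class ParkedOnBarrierSketch {

    // Returns true if a lone waiter on a two-party barrier was observed
    // in WAITING state with CyclicBarrier.await on its stack.
    static boolean loneWaiterParks() {
        CyclicBarrier barrier = new CyclicBarrier(2);
        Thread waiter = new Thread(() -> {
            try {
                barrier.await(); // the second party never arrives
            } catch (Exception ex) {
                // reset() below breaks the barrier and releases this thread
            }
        }, "prevent-shutdown-demo");
        waiter.setDaemon(true); // daemon, so the demo can never block JVM exit
        waiter.start();

        boolean parked = false;
        try {
            // Poll for up to five seconds until the waiter has parked.
            long deadline = System.nanoTime() + 5_000_000_000L;
            while (!parked && System.nanoTime() < deadline) {
                parked = waiter.getState() == Thread.State.WAITING
                        && Arrays.stream(waiter.getStackTrace()).anyMatch(frame ->
                                frame.getClassName().equals("java.util.concurrent.CyclicBarrier")
                                        && frame.getMethodName().equals("await"));
                if (!parked) {
                    Thread.sleep(10);
                }
            }
            barrier.reset(); // wake the waiter with a BrokenBarrierException
            waiter.join(5000);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            return false;
        }
        return parked;
    }

    public static void main(String[] args) {
        System.out.println("lone waiter parked on barrier: " + loneWaiterParks());
    }
}
```

The waiter stays parked until something else arrives at the barrier or breaks it, so a non-daemon thread in this state keeps the JVM alive.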

I investigated CracResourceAdapter: the prevent-shutdown thread might pass through its second awaitPreventShutdownBarrier() call if that thread runs before the awaitPreventShutdownBarrier() call in beforeCheckpoint().

We need separate barriers for beforeCheckpoint and afterRestore for this to work as expected.
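
The idea of one barrier per phase can be sketched as follows. This is a standalone illustration, not Spring's CracResourceAdapter and not the patch in this PR. The class `SeparateBarriersSketch`, its fields, and the five-second timeouts are assumptions made so the demo cannot hang.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: names and structure are illustrative only.
final class SeparateBarriersSketch {

    // One two-party barrier per phase, so an arrival meant for one phase
    // can never be paired with an arrival meant for the other.
    private final CyclicBarrier beforeCheckpointBarrier = new CyclicBarrier(2);
    private final CyclicBarrier afterRestoreBarrier = new CyclicBarrier(2);

    private Thread preventShutdown;

    // Starts the non-daemon helper thread and waits until it is running.
    void beforeCheckpoint() {
        preventShutdown = new Thread(() -> {
            await(beforeCheckpointBarrier);
            // The checkpoint would be taken while this thread is parked here.
            await(afterRestoreBarrier);
        }, "prevent-shutdown");
        preventShutdown.setDaemon(false);
        preventShutdown.start();
        await(beforeCheckpointBarrier);
    }

    // Releases the helper thread once the process has been restored.
    void afterRestore() {
        await(afterRestoreBarrier);
    }

    // The timeout exists only to keep this demo from hanging; a real
    // checkpoint can stay on disk for an arbitrary amount of time.
    private static void await(CyclicBarrier barrier) {
        try {
            barrier.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted at barrier", ex);
        } catch (BrokenBarrierException | TimeoutException ex) {
            throw new IllegalStateException("barrier was not tripped", ex);
        }
    }

    // Runs both phases once; returns true if the helper thread terminated.
    static boolean runOnce() {
        SeparateBarriersSketch sketch = new SeparateBarriersSketch();
        sketch.beforeCheckpoint();
        sketch.afterRestore();
        try {
            sketch.preventShutdown.join(5000);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !sketch.preventShutdown.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("prevent-shutdown finished: " + runOnce());
    }
}
```

With a dedicated barrier per phase, the helper thread's second wait can only be released by afterRestore(), regardless of how the threads are scheduled around the checkpoint.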

Signed-off-by: Yasumasa Suenaga <[email protected]>
@YaSuenag force-pushed the pr/crac-restore-hang branch from 5136e9e to 13fbbd1 Feb 6, 2025 03:19
@spring-projects-issues added the "status: waiting-for-triage" label Feb 6, 2025
@sdeleuze self-assigned this Feb 6, 2025
@sdeleuze added the "in: core" and "type: enhancement" labels Feb 6, 2025
@snicoll removed the "status: waiting-for-triage" label Feb 7, 2025
@snicoll added this to the 6.2.x milestone Feb 7, 2025
@sdeleuze modified the milestones: 6.2.x, 6.2.6 Mar 21, 2025
@sdeleuze
Contributor

Could you please attach or share a link to the repository of your reproducer?

@sdeleuze added the "status: waiting-for-feedback" label Mar 21, 2025
@YaSuenag
Author

YaSuenag commented Mar 22, 2025

> Could you please attach or share a link to the repository of your reproducer?

Reproducer is here: https://github.com/YaSuenag/checkpointer/tree/main/example/springboot-cli

This is an example for checkpointer, a library that implements CRaC event hooks and coordinates them with CRIU events. It is not CRaC itself, so event handling might be faster than with CRaC, and it might surface race conditions that could not be seen with CRaC.

@spring-projects-issues added the "status: feedback-provided" label and removed the "status: waiting-for-feedback" label Mar 22, 2025
@sdeleuze removed the "status: feedback-provided" label Apr 4, 2025
@sdeleuze modified the milestones: 6.2.6, 7.0.x Apr 4, 2025