Skip to content

[AutoDiff] Pullbacks w/ loops can segfault #68392

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jkshtj opened this issue Sep 8, 2023 · 7 comments · Fixed by #68413
Closed

[AutoDiff] Pullbacks w/ loops can segfault #68392

jkshtj opened this issue Sep 8, 2023 · 7 comments · Fixed by #68413
Assignees
Labels
AutoDiff bug A deviation from expected or documented behavior. Also: expected but undesirable behavior. run-time crash Bug → crash: Swift code crashed during execution runtime The Swift Runtime swift 5.9

Comments

@jkshtj
Copy link
Contributor

jkshtj commented Sep 8, 2023

Description
Recently we fixed an issue in the Autodiff runtime due to which PBs w/ loops used to leak memory. A fix for the issue was merged in #67944. While validating the fix on our internal test suite we uncovered another memory related bug (a double free) because of which certain pullbacks w/ loops can segfault.

Below is a minimal reproducer for the issue.

import _Differentiation

@differentiable(reverse) func B(x: Float) -> Float {
    var y = x + 1
    for _ in 0 ..< 1 {}
    return y
}

func f() {
    let (_, pb) = valueWithPullback(at: 1.0, of: B)
    pb(21.0)
}

f()

Running the above program leads to a segfault.

Expected Behavior
The program should not segfault and pullback of B should be runnable multiple times.

@jkshtj jkshtj added bug A deviation from expected or documented behavior. Also: expected but undesirable behavior. triage needed This issue needs more specific labels labels Sep 8, 2023
@jkshtj
Copy link
Contributor Author

jkshtj commented Sep 8, 2023

Why is there a double-free happening in this example?
In #67944 we fixed the autodiff loop-context memory management built-ins to properly free non-trivial context that may be captured by loopy pullbacks. It might seem surprising that fixing memory leaks has lead to double-frees. However, it turns out that not all non-trivial context, that is captured by loopy pullbacks, used to leak -- some of it was being freed by the pullback code itself, thus making the eventual "free" performed by the autodiff builtins a "double-free".

We missed the issue in #67944 because a double-free only happens if non-trivial context is captured in a region before or inside the loop. From the autodiff runtime perspective this leads to the context being stored in non-top level subcontexts. Then because of the code we generate for loopy pullbacks a double-free happens. The following example will clarify what I mean.

For loopy pullbacks capturing context after the end of for loop we have -

// A function whose pullback used to leak memory.
// This is what we fixed in #67944.
@differentiable(reverse) 
func B(x: Float) -> Float {
   for _ in 0 ..< 1 {}
   // In the below SIL we can see that the derivative of 
   // Float.+ is retrieved from the top level context.
   return x+1
}

// The corresponding SIL for this function looks as below (truncated to include useful parts only).
sil private @B: $@convention(thin) (Float, @guaranteed Builtin.NativeObject) -> Float {
bb0(%0 : $Float, %1 : $Builtin.NativeObject):

// This is where we are retrieving the derivative of Float.+.
// It is coming from the top level context.
  %2 = builtin "autoDiffProjectTopLevelSubcontext"(%1 : $Builtin.NativeObject) : $Builtin.RawPointer
  %3 = pointer_to_address %2 : $Builtin.RawPointer to [strict] $*(predecessor: $PredEnum, @callee_guaranteed (Float) -> Float)
  %4 = load %3 : $*(predecessor: $PredEnum, @callee_guaranteed (Float) -> Float) 

// We then +1 the ref count of the derivative
  retain_value %4 : $(predecessor: $PredEnum, @callee_guaranteed (Float) -> Float)

  %6 = tuple_extract %4 : $(predecessor: $PredEnum, @callee_guaranteed (Float) -> Float), 0 
  %7 = tuple_extract %4 : $(predecessor: $PredEnum, @callee_guaranteed (Float) -> Float), 1 
  %8 = apply %7(%0) : $@callee_guaranteed (Float) -> Float
 
// And the do a -1 for the derivative's ref count here
 strong_release %7 : $@callee_guaranteed (Float) -> Float

  switch_enum %6 : $PredEnum, case #$B_bb3__Pred__src_0_wrt_0.bb1!enumelt: bb2
...
}

// The loop context is freed by the autodiff runtime when the pullback goes out of scope.

For loopy pullbacks capturing context before or inside the for loop we have -

@differentiable(reverse) 
func B(x: Float) -> Float {
    // In the below SIL we can see that the derivative of 
    // Float.+ is retrieved from a non-top level subcontext.
    var y = x + 1
    for _ in 0 ..< 1 {}
    return y
}

// The corresponding SIL for this function looks as below (truncated to include useful parts only).
enum $PredEnum {
  case bb2(Builtin.RawPointer)
  case bb0((_: (Float) -> Float))
}

sil private @B: $@convention(thin) (Float, @guaranteed Builtin.NativeObject) -> Float {
...
bb2(%43 : $Builtin.RawPointer):
  // This is where we are retrieving the derivative of Float.+.
  // It is coming from a non-top level subcontext.                    
  %44 = pointer_to_address %43 : $Builtin.RawPointer to [strict] $*(predecessor: $PredEnum) 
  %45 = load %44 : $*(predecessor: $PredEnum) 
  br bb3(%33 : $Float, %37 : $Float, %45 : $(predecessor: $PredEnum)) 

bb3(%47 : $Float, %48 : $Float, %49 : $(predecessor: $PredEnum)): 
  %50 = tuple_extract %49 : $(predecessor: $PredEnum), 0 

 // Here, if PredEnum is the bb0 case, we go to bb5 and the derivative of
// Float.+ is passed to it as an @owned value.
  switch_enum %50 : $PredEnum, case #$PredEnum.bb2!enumelt: bb4, case #$PredEnum.bb0!enumelt: bb5

bb5(%65 : $(_: @callee_guaranteed (Float) -> Float)): 
  %66 = tuple_extract %65 : $(_: @callee_guaranteed (Float) -> Float), 0                        
  %75 = apply %66(%70) : $@callee_guaranteed (Float) -> Float 
  
  // bb5 received the derivative of Float.+ as an @owned value
  // and has no option but to release it.
  strong_release %66 : $@callee_guaranteed (Float) -> Float                          
  ...
}

// The loop context is freed by the autodiff runtime when the pullback goes out of scope
// **double-freeing** the derivative of Float.+, that was already freed in the bb5 basic block
// of the pullback of our original function.
//
// Note that this behavior also means that the above pullback can essentially only be run once,
// even without the memory leak fixes we made in #67944.

@jkshtj
Copy link
Contributor Author

jkshtj commented Sep 8, 2023

I have been working on fixing the issue, however it seems like I have reached an impasse and need some guidance/suggestions.

In order to solve this issue I needed to establish 3 high-level rules. The rules weren't only necessitated by the problem at hand -- in my mind they also seem like the logical/right thing to do.

  1. Loopy pullbacks should not free any of their context. All loop context should be freed by the autodiff runtime.
  2. Trampoline blocks should not do anything more than siphoning pullback arguments from successor blocks to predecessor blocks (both as per original function).
  3. All non-trivial pullback arguments must be propagated and received as @guaranteed arguments so that they don't need to be freed in the pullback.

With the above rules laid out a pullback which initially may have looked like this -

sil private @B: $@convention(thin) (Float, @guaranteed Builtin.NativeObject) -> Float {
...
bb2(%43 : $Builtin.RawPointer):
  %44 = pointer_to_address %43 : $Builtin.RawPointer to [strict] $*(predecessor: $PredEnum) 
  %45 = load %44 : $*(predecessor: $PredEnum) 
  br bb3(%33 : $Float, %37 : $Float, %45 : $(predecessor: $PredEnum)) 

bb3(%47 : $Float, %48 : $Float, %49 : $(predecessor: $PredEnum)): 
  %50 = tuple_extract %49 : $(predecessor: $PredEnum), 0 
  switch_enum %50 : $PredEnum, case #$PredEnum.bb2!enumelt: bb4, case #$PredEnum.bb0!enumelt: bb5

bb5(%65 : $(_: @callee_guaranteed (Float) -> Float)): 
  %66 = tuple_extract %65 : $(_: @callee_guaranteed (Float) -> Float), 0                        
  %75 = apply %66(%70) : $@callee_guaranteed (Float) -> Float 
  
  // bb5 received the derivative of Float.+ as an @owned value
  // and has no option but to release it.
  strong_release %66 : $@callee_guaranteed (Float) -> Float                          
  ...
}

Will now look like this -

bb2(%41 : $Builtin.RawPointer):                   
  br bb3(%33 : $Float, %37 : $Float, %41 : $Builtin.RawPointer) 

bb3(%43 : $Float, %44 : $Float, %45 : $Builtin.RawPointer): 
  %46 = pointer_to_address %45 : $Builtin.RawPointer to [strict] $*(predecessor: $PredEnum) 
  %47 = load %46 : $*(predecessor: $PredEnum) 
  %48 = tuple_extract %47 : $(predecessor: $PredEnum), 0 
  switch_enum %48 : $PredEnum, case #$PredEnum.bb2!enumelt: bb4, case #$PredEnum.bb0!enumelt: bb5 

bb5(%67 : $@guaranteed(_: @callee_guaranteed (Float) -> Float)): 
  %68 = tuple_extract %67 : $(_: @callee_guaranteed (Float) -> Float), 0 
  strong_retain %68 : $@callee_guaranteed (Float) -> Float 
  %78 = apply %68(%73) : $@callee_guaranteed (Float) -> Float 
  strong_release %68 : $@callee_guaranteed (Float) -> Float 
}

Notice that the trampoline block now simply forwards pullback arguments and non-trivial pullback context is received as @guaranteed dispensing with the need to do a "free".

@jkshtj
Copy link
Contributor Author

jkshtj commented Sep 8, 2023

While this change solves most of the issues, it falls short for a function like this.

public func C(from r: Float) -> V 
where 
    V: Swift.Numeric
{
    let values: [V] = D(from: r)
    var sum = V.zero
    
    for i in 1 ..< 10 {
         sum = V.plus(lhs: sum, rhs: values[i])
    }
    return sum
}

@differentiable(reverse where V: Differentiable)
public func D(from r: Float) -> [V] {
    return []
}

The compiler fails to compile this program and fails during SIL verification with the following error -

Have operand with incompatible ownership?!
Value: **%93** = destructure_tuple %92 : $(predecessor: _AD__$s16BroadcastKeyPath1BV1C4fromxSf_tSjRzlF_bb1__Pred__src_0_wrt_0_SjRz16_Differentiation14DifferentiableRz13TangentVectorAaBPQzRszl<τ_0_0>) // user: %107
User:   switch_enum %93 : $_AD__$s16BroadcastKeyPath1BV1C4fromxSf_tSjRzlF_bb1__Pred__src_0_wrt_0_SjRz16_Differentiation14DifferentiableRz13TangentVectorAaBPQzRszl<τ_0_0>, case #_AD__$s16BroadcastKeyPath1BV1C4fromxSf_tSjRzlF_bb1__Pred__src_0_wrt_0_SjRz16_Differentiation14DifferentiableRz13TangentVectorAaBPQzRszl.bb2!enumelt: bb4, case #_AD__$s16BroadcastKeyPath1BV1C4fromxSf_tSjRzlF_bb1__Pred__src_0_wrt_0_SjRz16_Differentiation14DifferentiableRz13TangentVectorAaBPQzRszl.bb0!enumelt: bb6, forwarding: @guaranteed // id: %107
Operand Number: 0
Conv: owned
Constraint:
<Constraint Kind:guaranteed LifetimeConstraint:NonLifetimeEnding>

Why?
The issue happens as a very direct result of rule #3 above. It happens due to the context that is captured inside the loop in the original function. Because the derivative captured inside the loop is generic (by way of being the derivative of a generic function) the basic block enum storing the derivative also becomes generic. This leads to an ownership kind mismatch between -

  1. The switch_enum instruction forwarding the pullback argument from successor to predeccesor basic block.
  2. The pullback's basic block tuple argument retrieved through the raw pointer (Builtin.RawPointer), received in the successor block.

The OSSA SIL for the problematic basic block looks like this -

enum $PredEnum<τ_0_0> where τ_0_0 : Numeric, τ_0_0 : Differentiable, τ_0_0 == τ_0_0.TangentVector {
  case bb2(Builtin.RawPointer)

 // This case makes PredEnum generic
  case bb0((_: (Array<τ_0_0>.DifferentiableView) -> Float))
}

bb3(%88 : $Float, %89 : @owned $Array<τ_0_0>.DifferentiableView, %90 : $Builtin.RawPointer): 
  %91 = pointer_to_address %90 : $Builtin.RawPointer to [strict] $*(predecessor: $PredEnum<τ_0_0>) 
  
  // Notice the `[copy]` flag on the `load`. This means that PredEnum is 
  // considered a non-trivial value by way of being generic.
  %92 = load [copy] %91 : $*(predecessor: $PredEnum<τ_0_0>) 

  %93 = destructure_tuple %92 : $(predecessor: $PredEnum<τ_0_0>) 
  %97 = copy_value %89 : $Array<τ_0_0>.DifferentiableView 
  destroy_value %89 : $Array<τ_0_0>.DifferentiableView 
  
 // The compiler does not like that we're trying to forward an @owned 
 // value as a @guaranteed value and crashes.
  switch_enum %93 : $PredEnum<τ_0_0>, case #$PredEnum.bb2!enumelt: bb4, case #$PredEnum.bb0!enumelt: bb6, forwarding: @guaranteed 

@jkshtj
Copy link
Contributor Author

jkshtj commented Sep 8, 2023

So that's where I'm at with my attempt to fix this issue. I need guidance/help with the following questions to proceed.

  1. Is there a way we can tell the compiler to not treat %92 in the above example as an owned value? I
  • I have looked into the unchecked_ownership_conversion SIL instruction but that does not solve the issue and the compiler gives the below error message.

    Non trivial values, non address values, and non guaranteed function args must have at least one lifetime ending use?!

  1. Is my approach even sane or am I going about this problem the wrong way?

@jkshtj
Copy link
Contributor Author

jkshtj commented Sep 8, 2023

@asl @rxwei @BradLarson The recent segfaults that we have seen in our internal test suite after #67944 was merged have proven to be trickier to fix than expected. I have tried to explain the problem, my approach to solve it and the problems of my proposed solution.

Could you guys take a look and provide any suggestions/guidance you might have?

P.S. - Apologize for the long write up!

@asl asl added AutoDiff runtime The Swift Runtime run-time crash Bug → crash: Swift code crashed during execution and removed triage needed This issue needs more specific labels labels Sep 8, 2023
@asl
Copy link
Contributor

asl commented Sep 8, 2023

I think one of the questions here is that why tuple containing enum with pullback payload is considered as a trivial type (note %45 below):

// %43                                            // user: %44
bb1(%43 : $Builtin.RawPointer):                   // Preds: bb5
  %44 = pointer_to_address %43 : $Builtin.RawPointer to [strict] $*(predecessor: _AD__$s2pb1B1xS2f_tF_bb1__Pred__src_0_wrt_0) // user: %45
  %45 = load [trivial] %44 : $*(predecessor: _AD__$s2pb1B1xS2f_tF_bb1__Pred__src_0_wrt_0) // user: %46
  br bb3(%64 : $Float, %65 : $Float, %45 : $(predecessor: _AD__$s2pb1B1xS2f_tF_bb1__Pred__src_0_wrt_0)) // id: %46

// %51                                            // users: %72, %57, %63, %55
// %52                                            // users: %72, %63
// %53                                            // user: %54
bb3(%51 : $Float, %52 : $Float, %53 : $(predecessor: _AD__$s2pb1B1xS2f_tF_bb1__Pred__src_0_wrt_0)): // Preds: bb1 bb2
  %54 = destructure_tuple %53 : $(predecessor: _AD__$s2pb1B1xS2f_tF_bb1__Pred__src_0_wrt_0) // user: %59
  debug_value %51 : $Float, let, name "x", argno 1 // id: %55
  copy_addr %14 to %10 : $*Float                  // id: %56
  debug_value %51 : $Float, let, name "x", argno 1 // id: %57
  copy_addr %14 to %6 : $*Float                   // id: %58
  switch_enum %54 : $_AD__$s2pb1B1xS2f_tF_bb1__Pred__src_0_wrt_0, case #_AD__$s2pb1B1xS2f_tF_bb1__Pred__src_0_wrt_0.bb2!enumelt: bb4, case #_AD__$s2pb1B1xS2f_tF_bb1__Pred__src_0_wrt_0.bb0!enumelt: bb6, forwarding:
@owned // id: %59
...
// %71                                            // user: %72
bb6(%71 : @owned $(_: @callee_guaranteed (Float) -> Float)): // Preds: bb3
  br bb7(%51 : $Float, %52 : $Float, %71 : $(_: @callee_guaranteed (Float) -> Float)) // id: %72

Somewhere here it looks like an ownership gap to me...

@asl asl self-assigned this Sep 9, 2023
@asl
Copy link
Contributor

asl commented Sep 9, 2023

Working on the fix

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
AutoDiff bug A deviation from expected or documented behavior. Also: expected but undesirable behavior. run-time crash Bug → crash: Swift code crashed during execution runtime The Swift Runtime swift 5.9
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants