Apply AD gradient if optimizer is a first-order one #1365


Closed
wupeifan opened this issue Jul 29, 2020 · 16 comments

Comments

@wupeifan
Contributor

When estimating the MLE and MAP, the routines from Optim.jl are called. However, even for gradient-based methods, Optim.jl only supports ForwardDiff as an AD backend and falls back to finite differences otherwise. Therefore, it makes sense to use the AD backend selected in Turing (Turing.setadbackend) to define the gradient function and feed it to the optimizer.

Basically, we replace https://github.com/TuringLang/Turing.jl/blob/master/src/modes/ModeEstimation.jl#L383 with a structure like

if optimizer isa Optim.FirstOrderOptimizer
   ...
else
   ...
end

In the first branch, you will need to pass the original function and another function that computes the function value and gradient based on Turing.gradient_logp.
An example is https://github.com/JuliaNLSolvers/Optim.jl/blob/7b660484724755ee2b306dd3ceed13d6633067ae/test/multivariate/optimize/interface.jl#L54
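
For concreteness, here is a minimal sketch of passing an explicit gradient to Optim.jl; the objective f and gradient g! below are placeholder stand-ins, not Turing API (in practice both would come from Turing.gradient_logp):

using Optim

# Stand-in objective (e.g. the negative log density) and its in-place gradient.
f(x) = sum(abs2, x)
function g!(G, x)
    G .= 2 .* x
    return G
end

init_vals = zeros(2)
# With an explicit gradient, Optim uses it instead of finite differences.
Optim.optimize(f, g!, init_vals, Optim.LBFGS())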

@pkofod

pkofod commented Jul 29, 2020

Let me know if you need help here. Is it possible to calculate the gradient and get the value simultaneously with your setup? That would help performance.

@wupeifan
Contributor Author

I think in this setting the gradient is Turing.gradient_logp. Therefore, you can wrap the value and the gradient together. @mohamed82008 also pointed out https://github.com/JuliaNLSolvers/NLSolversBase.jl/blob/9e89526817d5932489d8fdfead5f6e537c1291a1/src/objective_types/oncedifferentiable.jl#L172 in the Slack discussion thread.
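
A minimal sketch of that "value and gradient together" pattern with Optim.only_fg!; the quadratic objective is a placeholder, not Turing's actual log-density object:

using Optim

function fg!(F, G, x)
    if G !== nothing
        G .= 2 .* x            # stand-in for the gradient from Turing.gradient_logp
    end
    if F !== nothing
        return sum(abs2, x)    # stand-in for the objective value
    end
    return nothing
end

Optim.optimize(Optim.only_fg!(fg!), zeros(2), Optim.LBFGS())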

@devmotion
Member

I guess the branch is not needed? It seems one could just define

function (f::OptimLogDensity)(F, G, H, x)
    if G !== nothing
        ...
    end
    if H !== nothing
        ...
    end
    if F !== nothing
        return ...
    end
    nothing
end

and then call Optim.optimize(Optim.only_fgh!(f), ...), similar to https://github.com/JuliaNLSolvers/Optim.jl/blob/7b660484724755ee2b306dd3ceed13d6633067ae/test/multivariate/optimize/interface.jl#L73-L78.
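
For reference, a self-contained sketch of that fgh! pattern with a placeholder quadratic objective (not Turing's OptimLogDensity):

using Optim

function fgh!(F, G, H, x)
    if G !== nothing
        G .= 2 .* x                # gradient of the stand-in objective
    end
    if H !== nothing
        H .= 0.0
        for i in eachindex(x)
            H[i, i] = 2.0          # Hessian of the stand-in objective
        end
    end
    if F !== nothing
        return sum(abs2, x)        # objective value
    end
    return nothing
end

Optim.optimize(Optim.only_fgh!(fgh!), fill(1.0, 2), Optim.Newton())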

@wupeifan
Contributor Author

Yeah, probably. Thanks!
I'm swamped with more urgent stuff so I might not be able to make a PR myself for this one... Maybe wait for Cameron after he finishes his qualifier, or maybe I'll revisit this in a couple of days...

@cpfiffer
Member

cpfiffer commented Aug 3, 2020

@mohamed82008 what would be a quick way to get the Hessian as well? I'm switching things up a little to use gradient_logp to get the gradient too, but I'm wondering if I should add a method like hessian_logp. Anyone have thoughts on that?

@wupeifan
Contributor Author

wupeifan commented Aug 3, 2020

Hessians are usually computationally expensive, which is why the plain Newton method is often not preferred. I don't think adding Hessians is strictly necessary for optimization per se, but maybe other people have different ideas.

That said, I think it would be great if the information matrix could be provided...

@cpfiffer
Member

cpfiffer commented Aug 3, 2020

Well, if we're going to use the Optim.only_fgh! method (or the non-Hessian variant Optim.only_fg!), we should probably consider extending support for Hessians in an efficient way in case someone wants to use Newton or the other Hessian-based methods.

We've already got support for the information matrix, but I'm not sure whether it's finite-difference based (I think it is). You can do it with

using StatsBase

m = optimize(model, MLE())
StatsBase.informationmatrix(m)

@wupeifan
Contributor Author

wupeifan commented Aug 3, 2020

in case someone wants to use Newton or the other Hessian-based methods.

For small-scale problems, yes. I don't know, maybe other people have better ideas.

@mohamed82008
Member

I'm wondering if I should add a method like hessian_logp.

I think that's reasonable, but perhaps as a separate PR. For now, we can say that second-order methods are not supported if H !== nothing.

@wupeifan
Contributor Author

wupeifan commented Aug 5, 2020

We've got support already for the information matrix, but I'm not sure if it's finite-difference based or not (I think it is) -- you can do it with

using StatsBase

m = optimize(model, MLE())
StatsBase.informationmatrix(m)

The information matrix actually throws an error. I think it calls ForwardDiff automatically. @cpfiffer

@cpfiffer
Member

cpfiffer commented Aug 5, 2020

Welp, all the more reason to get the actual Hessian stuff built in too.

@ChrisRackauckas
Collaborator

@mohamed82008 what would be a quick way to get the Hessian as well? I'm switching things up a little to use gradient_logp to get the gradient too, but I'm wondering if I should add a method like hessian_logp. Anyone have thoughts on that?

Just do forward mode over whatever reverse mode is chosen.
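
A minimal sketch of forward-over-reverse, assuming the chosen reverse-mode backend (Zygote here) accepts ForwardDiff dual numbers for the function in question:

using ForwardDiff, Zygote

f(x) = sum(abs2, x) / 2               # stand-in for the log density

# Reverse-mode gradient, differentiated once more with forward mode
# to obtain the Hessian.
grad(x) = Zygote.gradient(f, x)[1]
hess(x) = ForwardDiff.jacobian(grad, x)

hess(randn(3))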

@wupeifan
Contributor Author

Just do forward mode over whatever reverse mode is chosen.

With custom adjoints, users may not be able to provide Hessian information for some of the intermediate steps, so the forward-mode pass might not go through.
I think it's not trivial when custom adjoints are provided; however, what you said should be feasible if everything is handled by an AD backend...

@wupeifan
Contributor Author

Thanks for implementing this feature request! I just experimented with the new code and it works perfectly.

@cpfiffer
Member

cpfiffer commented Aug 20, 2020 via email

@wupeifan
Contributor Author

@cpfiffer It's more of a feasibility issue for me, as previously I couldn't run gradient-based methods at all. It will certainly be faster than the previous simplex method.
