blogposts/2019-03-05-dp-vs-rl.md
Differentiation is what makes deep learning tick; given a function $y = f(x)$ we use the gradient $\frac{dy}{dx}$ to figure out how a change in $x$ will affect $y$. Despite the mathematical clothing, gradients are actually a very general and intuitive concept. Forget the formulas you had to stare at in school; let's do something more fun, like throwing stuff.
When we throw things with a trebuchet, our $x$ represents a setting (say, the size of the counterweight, or the angle of release), and $y$ is the distance the projectile travels before landing. If you're trying to aim, the gradient tells you something very useful – whether a change in aim will increase or decrease the distance. To maximise distance, just follow the gradient.
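To make that concrete, here's a tiny sketch of the idea in Julia using Zygote (the source-to-source AD behind Flux's ∂P work). The projectile-range formula below is just a hypothetical stand-in for a real trebuchet simulator; the point is only that `gradient` tells us which way to nudge each setting.

```julia
using Zygote

# Hypothetical stand-in for a trebuchet simulator: distance travelled as a
# function of release angle (radians) and launch speed (m/s).
distance(angle, speed) = speed^2 * sin(2angle) / 9.81

angle, speed = 0.3, 20.0

# How does a small change in each setting affect the distance?
∂angle, ∂speed = gradient(distance, angle, speed)

# To throw further, just follow the gradient (one step of gradient ascent).
angle += 0.01 * ∂angle
speed += 0.01 * ∂speed
```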
Now we have that, let's do something interesting with it.
A simple way to use this is to aim the trebuchet at a target, using gradients to fine-tune the angle of release; this kind of thing is common under the name of _parameter estimation_, and we've [covered examples like it before](https://julialang.org/blog/2019/01/fluxdiffeq). We can make things more interesting by going meta: instead of aiming the trebuchet given a single target, we'll optimise a neural network that can aim it given _any_ target. Here's how it works: the neural net takes two inputs, the target distance in metres and the current wind speed. The network spits out trebuchet settings (the mass of the counterweight and the angle of release) that get fed into the simulator, which calculates the achieved distance. We then compare to our target, and _backpropagate through the entire chain_, end to end, to adjust the weights of the network. Our "dataset" is a randomly chosen set of targets and wind speeds.
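Here's a rough, self-contained sketch of that training loop, written with current Flux syntax. The closed-form `shoot` function is a made-up stand-in for the differentiable trebuchet simulator, and the layer sizes and hyperparameters are purely illustrative, not the post's actual code.

```julia
using Flux

# Hypothetical differentiable "simulator": distance achieved for a given wind
# speed, release angle and counterweight mass. A real trebuchet simulation
# would replace this closed-form stand-in.
shoot(wind, angle, weight) = weight * sin(2angle) + wind

# The network maps (target distance, wind speed) to (release angle, counterweight).
model = Chain(Dense(2 => 16, tanh), Dense(16 => 2))

# Run the predicted settings through the simulator and compare the achieved
# distance to the target; gradients flow through the simulator itself.
function loss(m, target, wind)
    angle, weight = m([target, wind])
    (shoot(wind, angle, weight) - target)^2
end

opt = Flux.setup(Adam(1e-3), model)
for _ in 1:1_000
    target, wind = 20 + 80rand(), 5rand()   # randomly chosen target and wind
    grads = Flux.gradient(m -> loss(m, target, wind), model)
    Flux.update!(opt, model, grads[1])
end
```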
A nice property of this simple model is that training it is _fast_, because we've expressed exactly what we want from the model in a fully differentiable way.
This is about the simplest possible control problem, which we use mainly for illustrative purposes.
A more recognisable control problem is [CartPole](https://gym.openai.com/envs/CartPole-v0/), the "hello world" for reinforcement learning. The task is to learn to balance an upright pole by nudging its base left or right. Our setup is broadly similar to the trebuchet case: a [Julia implementation](https://github.com/tejank10/Gym.jl) means we can directly treat the reward produced by the environment as a loss. ∂P allows us to switch seamlessly from model-free to model-based RL.
The astute reader may notice a snag. The action space for cartpole – nudge left or right – is discrete, and therefore not differentiable. We solve this by introducing a _differentiable discretisation_, defined [like so](https://github.com/FluxML/model-zoo/blob/cdda5cad3e87b216fa67069a5ca84a3016f2a604/games/differentiable-programming/cartpole/DiffRL.jl#L32):
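The linked file has the actual definition; the snippet below is only a sketch of the general trick, written with Zygote's custom-adjoint macro: discretise on the forward pass, but pass gradients straight through on the backward pass.

```julia
using Zygote

# Forward pass: collapse the network's output to one of the two allowed
# actions – nudge left (-1) or nudge right (+1).
discretise(x) = x < 0 ? -1.0 : 1.0

# Custom adjoint: when backpropagating, pretend discretise was the identity,
# so gradients flow through the non-differentiable step unchanged
# (a straight-through estimator).
Zygote.@adjoint discretise(x) = discretise(x), Δ -> (Δ,)

gradient(x -> 3 * discretise(x), 0.2)   # (3.0,) – as if discretise were `identity`
```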
In other words, we force the gradient to behave as if $f$ were the identity function.
The results speak for themselves. Where RL methods need to train for hundreds of episodes before solving the problem, the ∂P model only needs around 5 episodes to win conclusively.
An important aim for RL is to handle _delayed reward_, when an action doesn't help us until several steps in the future. ∂P allows this too, and in a very familiar way: when the environment is differentiable, we can actually train the agent using backpropagation through time, just like a recurrent net! In this case the environmental state becomes the "hidden state" that changes between time steps.
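A bare-bones sketch of what that looks like, with a hypothetical differentiable `step` function in place of a real physics model and illustrative sizes throughout: the environment is unrolled for a fixed horizon, the per-step costs are summed, and the whole thing is differentiated with respect to the policy's parameters.

```julia
using Flux

# Hypothetical differentiable environment: one step of some dynamics, returning
# the next state and an instantaneous cost. A differentiable pendulum model
# would slot in here.
function step(state, action)
    next = 0.9f0 .* state .+ 0.1f0 .* action
    cost = sum(abs2, next) + 0.01f0 * sum(abs2, action)
    return next, cost
end

policy = Chain(Dense(2 => 32, tanh), Dense(32 => 1))

# Unroll the environment for T steps and sum the costs; the environment state
# plays the role of a recurrent network's hidden state, so this is literally
# backpropagation through time.
function rollout_loss(p, state; T = 50)
    total = 0f0
    for _ in 1:T
        action = p(state)
        state, cost = step(state, action)
        total += cost
    end
    return total
end

opt = Flux.setup(Adam(1e-3), policy)
s0 = Float32[1.0, 0.0]
grads = Flux.gradient(p -> rollout_loss(p, s0), policy)
Flux.update!(opt, policy, grads[1])
```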
To demonstrate this technique we looked at the [pendulum](https://github.com/openai/gym/wiki/Pendulum-v0) environment, where the task is to swing a pendulum until it stands upright, keeping it balanced with minimal effort. This is hard for RL models; after around 20 episodes of training the problem is solved, but often the route to a solution is visibly sub-optimal. In contrast, BPTT can beat the [RL leaderboard](https://github.com/openai/gym/wiki/Leaderboard#pendulum-v0) in _a single episode of training_. It's instructive to actually watch this episode unfold; at the beginning of the recording the strategy is random, and the model improves over time. The pace of learning is almost alarming.
Despite only experiencing a single episode, the model generalises well to handle any initial angle and, when restarted, follows something pretty close to the optimal strategy.
This is just the beginning; the real wins will come from applying ∂P to environments that are too hard for RL to work with at all, where rich simulations and models already exist (as in much of engineering and the sciences), and where interpretability is an important factor (as in medicine).
blogposts/2019-09-11-simulating-the-motion-of-charges.md
Using some basic plotting functions, I graphed my charges at every training iteration and put them all together in a gif.
*Red charges are positive, blue charges are negative, and the size of each dot is proportional to the absolute value of its charge.*
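Something along these lines reproduces that kind of animation with Plots.jl. The names and data here are made up for illustration: `history` stands for the recorded positions at each training iteration and `q` for the signed charge values.

```julia
using Plots

# Hypothetical data: (x, y) positions of every charge at each iteration,
# plus the signed charge values.
n_iters, n_charges = 200, 100
q = randn(n_charges)
history = [randn(n_charges, 2) for _ in 1:n_iters]   # stand-in for the real trajectory

anim = @animate for pos in history
    scatter(pos[:, 1], pos[:, 2];
            color = ifelse.(q .> 0, :red, :blue),   # red = positive, blue = negative
            markersize = 6 .* abs.(q) .+ 2,         # dot size ∝ |charge|
            legend = false, xlims = (-3, 3), ylims = (-3, 3))
end

gif(anim, "charges.gif", fps = 15)
```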
But does this behaviour match what real charges would do? Let’s prove this.
Let’s take a trivial case: a system of two charges, one positive and one negative, placed diametrically opposite each other with respect to the origin.
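Here's a sketch of what that might look like in code. The representation (a matrix of positions plus a vector of charge values) and the plain gradient-descent step are assumptions for illustration, not the post's exact implementation.

```julia
using LinearAlgebra: norm
using Zygote

# Two charges on the x-axis, diametrically opposite the origin.
pos = [ 1.0  0.0;    # positive charge
       -1.0  0.0]    # negative charge
q = [1.0, -1.0]

# Coulomb potential energy, U = Σ_{i<j} qᵢqⱼ / ‖rᵢ − rⱼ‖ (taking k = 1).
function potential(pos, q)
    U = 0.0
    for i in 1:length(q), j in i+1:length(q)
        U += q[i] * q[j] / norm(pos[i, :] .- pos[j, :])
    end
    return U
end

# "Training": repeatedly move each charge down the gradient of the energy.
η = 0.05
for _ in 1:50
    g = gradient(p -> potential(p, q), pos)[1]
    global pos = pos .- η .* g
end
```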
What do we expect to happen? The charges should move together.
And they do!
_(What’s interesting is that they seem to have overshot and actually crossed each other at one point, only to be drawn back together once more – an apt parallel to real charges overshooting due to the inertia of their motion.)_
Let’s go back to that system of 100 charges and plot the potential energy at every training iteration.
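In the same hypothetical setup as the two-charge sketch above (reusing its `potential` function), recording and plotting the energy per iteration looks something like this; the 100-charge initialisation and step size are again illustrative stand-ins.

```julia
using Zygote, Plots

# A made-up 100-charge system: random positions, alternating ±1 charges.
pos = 4 .* randn(100, 2)
q   = [isodd(i) ? 1.0 : -1.0 for i in 1:100]

η = 1e-3
energies = Float64[]
for _ in 1:200
    g = gradient(p -> potential(p, q), pos)[1]   # `potential` as defined above
    global pos = pos .- η .* g
    push!(energies, potential(pos, q))
end

plot(energies; xlabel = "training iteration", ylabel = "total potential energy",
     legend = false)
```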
It’s safe to say that the system has converged to a value of approximately -1360.