Commit 06dbbdf

update Mega README with more transparent perspectives on the ablations
1 parent ede0b53 commit 06dbbdf

3 files changed (+56 -7 lines)

configs/experiment/mega/lra-image/README.md

+56 -7
@@ -1,33 +1,53 @@
## Mega: SSM Ablations - EMA and S4(D)

The Mega model from "Mega: Moving Average Equipped Gated Attention" has been implemented in this codebase.
Roughly, this model combines an *exponential moving average* (EMA) component with gating and attention. Although it stems from a quite different motivation and was developed concurrently, the EMA component ends up very similar to S4 (in particular S4D).
This folder thus contains a limited set of ablations comparing these components.
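
To make the similarity concrete, here is a minimal NumPy sketch (illustrative only, not code from this repo; shapes and coefficients are made up) of Mega's damped-EMA update written as the diagonal linear recurrence that S4D also computes:

```
import numpy as np

# Mega's damped EMA update is h_t = alpha * u_t + (1 - alpha * delta) * h_{t-1}:
# a linear recurrence with a diagonal, real transition. S4D runs the same
# diagonal recurrence h_t = A h_{t-1} + B u_t, but with complex A initialized
# from HiPPO theory. Illustrative sketch only -- not this repo's implementation.
def damped_ema(u, alpha, delta):
    # u: (length, dim); alpha, delta: (dim,) with entries in [0, 1].
    A = 1.0 - alpha * delta          # diagonal transition, cf. S4D's A
    h = np.zeros(u.shape[1])
    out = []
    for u_t in u:                    # the same scan any diagonal SSM performs
        h = A * h + alpha * u_t      # alpha plays the role of S4D's B
        out.append(h)
    return np.stack(out)

y = damped_ema(np.random.randn(16, 4), alpha=np.full(4, 0.5), delta=np.full(4, 0.9))
```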

### Disclaimers

**Reproducibility:** These ablations were run from an internal codebase in Nov 2022, which should be equivalent to this PR (https://github.com/HazyResearch/state-spaces/commit/e9ce652126cc773dcb6bb7d6f7270c425d4a36a2), although they have not been reproduced in this codebase and may have slight discrepancies. Furthermore, other parts of the code have changed since then.

**Limited datasets:** These ablations were run only on the LRA-Image task, which is a toy task, and only in the single setting where the Mega chunk size is $c=128$. Although the results below show the S4 variants outperforming EMA in this setting, **the full Mega-chunk model ($c=1024$) performs much better**, and preliminary ablations showed that for $c=1024$, Mega-EMA outperformed Mega-S4D by 0.5-1 points on these particular hyperparameter settings.
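
For context on the chunk size: Mega's chunked attention splits the sequence into non-overlapping chunks of length $c$ and applies softmax attention within each chunk, so cost grows as $O(Lc)$ rather than $O(L^2)$, with the EMA component carrying information across chunks. A toy sketch (illustrative only, not this repo's implementation):

```
import numpy as np

# Toy single-head chunked attention (illustrative only): attend within
# non-overlapping chunks of size c, so cost is O(L * c) rather than O(L^2).
def chunked_attention(q, k, v, c):
    L, d = q.shape
    out = np.empty_like(v)
    for s in range(0, L, c):
        qs, ks, vs = q[s:s + c], k[s:s + c], v[s:s + c]
        scores = qs @ ks.T / np.sqrt(d)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        out[s:s + c] = (w / w.sum(axis=-1, keepdims=True)) @ vs
    return out

q = k = v = np.random.randn(1024, 16)
y = chunked_attention(q, k, v, c=128)   # the c=128 setting used in these ablations
```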

### Results

| Model | Params | s/epoch | Val Acc |
| -------------------- | -------- | --------- | --------- |
| (large) Mega-EMA^ | 2.73M | 180 | 82.56 |
| (large) Mega-EMA-Repro | 2.65M | 124 | 83.42 |
| (large) Mega-S4D-Real | 2.65M | 121 | 84.44 |
| (large) Mega-S4D | 2.65M | 122 | 86.22 |
| (large) Mega-S4 | 2.67M | 138 | 86.68 |
| | | | |
| (small) Mega-EMA | 299K | 51 | 81.16 |
| (small) Mega-EMA-Repro | 279K | 51 | 80.76 |
| (small) Mega-S4D-Real | 279K | 54 | 81.20 |
| (small) Mega-S4D | 279K | 53 | 81.46 |
| (small) Mega-S4 | 284K | 61 | 81.63 |
| | | | |
| (large) EMA | 4.35M | 129 | 70.96 |
| (large) EMA-Repro | 3.96M | 119 | 71.52 |
| (large) S4D-Real | 3.96M | 105 | 74.30 |
| (large) S4D | 3.96M | 105 | 88.28 |
| (large) S4 | 4.15M | 118 | 88.70 |
| | | | |
| (small) EMA | 333K | 31 | 69.96 |
| (small) EMA-Repro | 267K | 30 | 69.38 |
| (small) S4D-Real | 267K | 32 | 70.88 |
| (small) S4D | 267K | 30 | 82.78 |
| (small) S4 | 300K | 39 | 84.76 |

These runs correspond to the experiment files
`{large-mega,small-mega,small,large}-{ema,ema-with-s4,s4d-real,s4d}.yaml`
described below.
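
For convenience, the whole grid could be launched with a small driver along these lines (a sketch only; the brace pattern may not expand to configs that all exist, so check `configs/experiment/mega/lra-image/` first):

```
import itertools
import subprocess

# Hypothetical driver for the ablation grid named above; verify each config
# file exists before launching, since the pattern may not be a full product.
sizes = ["large-mega", "small-mega", "small", "large"]
layers = ["ema", "ema-with-s4", "s4d-real", "s4d"]
for size, layer in itertools.product(sizes, layers):
    subprocess.run(
        ["python", "-m", "train", f"experiment=mega/lra-image/{size}-{layer}"],
        check=True,
    )
```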

^ Speed differences stem from a different implementation of the bidirectional logic and are not inherent to the model. The EMA-Repro runs use the same faster version that the S4(D) baselines use.
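
For intuition about what this footnote refers to, here is a purely illustrative sketch of bidirectional convolution logic (not necessarily either implementation compared above): one kernel is applied causally and a second to the time-reversed sequence; implementations mainly differ in how the two passes are organized, which changes speed but not the function computed.

```
import numpy as np

# Illustrative bidirectional convolution (not either implementation compared
# above): one causal pass forward, one causal pass on the reversed sequence.
def causal_conv(u, k):
    return np.convolve(u, k)[: len(u)]

def bidirectional(u, k_fwd, k_bwd):
    return causal_conv(u, k_fwd) + causal_conv(u[::-1], k_bwd)[::-1]

y = bidirectional(np.random.randn(32), np.random.randn(32), np.random.randn(32))
```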

------------

### Large Mega Models

@@ -56,6 +76,11 @@

```
python -m train experiment=mega/lra-image/large-mega-s4d
```
Same model but replacing the EMA component with the original (complex) S4D.

```
python -m train experiment=mega/lra-image/large-mega-s4d '~model.layer.disc' '~model.layer.force_real' model.layer.mode=nplr model.layer.measure=legs
```
Same model but replacing S4D with S4.
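
For readers unfamiliar with these Hydra overrides: the `~` prefix deletes a key from the config, and the remaining assignments swap the kernel parameterization. The sketch below mimics the effect (the starting values are illustrative assumptions, not the repo's exact defaults):

```
from omegaconf import OmegaConf

# Mimic the effect of the Hydra overrides above on the layer config.
# Starting values are assumptions for illustration, not the repo's defaults.
cfg = OmegaConf.create(
    {"model": {"layer": {"mode": "diag", "disc": "zoh", "force_real": False}}}
)
layer = cfg.model.layer
del layer["disc"]          # '~model.layer.disc'
del layer["force_real"]    # '~model.layer.force_real'
layer.mode = "nplr"        # S4's NPLR parameterization
layer.measure = "legs"     # HiPPO-LegS measure
print(OmegaConf.to_yaml(cfg))
```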

----------

### Small Mega Models

@@ -99,6 +124,11 @@

```
python -m train experiment=mega/lra-image/small-mega-s4d
```
Same as above, but with the original (complex-valued) S4D layer.

```
python -m train experiment=mega/lra-image/small-mega-s4d '~model.layer.disc' '~model.layer.force_real' model.layer.mode=nplr model.layer.measure=legs
```
Same model but replacing S4D with S4.

----------
The `{small,large}-{<model>}.yaml` experiments use a block with only an SSM convolution.

@@ -111,8 +141,7 @@

```
python -m train experiment=mega/lra-image/small-s4d
```

Pure S4D module from the S4D paper (no Attention).

To make the models more directly comparable, some architecture flags were tweaked to match the Mega models (namely using pre-batch-norm rather than post-layer-norm).
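
As a toy illustration of the two norm placements (a PyTorch-style sketch with a placeholder inner layer and shapes, not this repo's block implementation):

```
import torch
from torch import nn

# Toy residual blocks contrasting the two norm placements mentioned above.
# Placeholder inner layer and shapes; not this repo's actual block.
class PostLayerNormBlock(nn.Module):
    """Post-norm: y = LayerNorm(x + f(x)), operating on (batch, length, dim)."""
    def __init__(self, d, f):
        super().__init__()
        self.f, self.norm = f, nn.LayerNorm(d)
    def forward(self, x):
        return self.norm(x + self.f(x))

class PreBatchNormBlock(nn.Module):
    """Pre-norm: y = x + f(BatchNorm(x)), operating on (batch, dim, length)."""
    def __init__(self, d, f):
        super().__init__()
        self.f, self.norm = f, nn.BatchNorm1d(d)
    def forward(self, x):
        return x + self.f(self.norm(x))

x = torch.randn(8, 64, 128)                       # (batch, dim, length)
y = PreBatchNormBlock(64, nn.Conv1d(64, 64, 1))(x)
```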

```
python -m train experiment=mega/lra-image/small-s4d-real
```

@@ -129,11 +158,16 @@

```
python -m train experiment=mega/lra-image/small-ema-with-s4d
```
Same as above, but use settings to match the parameter count of S4D.

```
python -m train experiment=mega/lra-image/small-s4d '~model.layer.disc' '~model.layer.force_real' model.layer.mode=nplr model.layer.measure=legs
```
Same model but replacing S4D with S4.

-----------

### Earlier runs with different warmup steps

The above configs have been updated with more warmup steps.
Earlier versions of these experiments were run with everything exactly the same except that all runs had `scheduler.num_warm_steps=1000`. These are the earlier results.

| Model | Params | s/epoch | Val Acc |
@@ -150,3 +184,18 @@
| (small) S4D-Real | 200K | 32 | 70.34 |
| (small) S4D | 200K | 31 | 84.40 |

-----------

### Runs in Mega repo

| Model | Params | s/epoch | Val Acc |
| -------------------- | -------- | --------- | --------- |
| Mega-EMA (original) | 2.82M | 195 | 86.10 |
| Mega-S4D-Real | 2.74M | 152 | 87.00 |
| Mega-S4D | 2.74M | 152 | 87.12 |
| Mega-S4 | 2.77M | 163 | 87.42 |

Again, it is stressed that these were all run in a **very limited task setting**, and that Mega-EMA likely outperforms the S4(D) baselines for the setting $c=1024$ with these hyperparameters.