## Mega: SSM Ablations - EMA and S4(D)
The Mega model from "Mega: Moving Average Equipped Gated Attention" has been implemented in this codebase.
Roughly, this model combines an *exponential moving average* (EMA) component with gating and attention. Although it stems from a quite different motivation and was developed concurrently, the EMA component ends up very similar to S4 (in particular S4D).
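Concretely, both components can be viewed as a diagonal linear recurrence followed by an output projection. Schematically (notation loosely follows the two papers; treat this as a sketch rather than the exact parameterizations):

$$h_t = \alpha \odot (\beta x_t) + (1 - \alpha \odot \delta) \odot h_{t-1}, \qquad y_t = \eta^\top h_t \qquad \text{(damped EMA)}$$

$$h_t = \bar{A} h_{t-1} + \bar{B} x_t, \qquad y_t = C h_t \quad \text{with } \bar{A} \text{ diagonal} \qquad \text{(S4D)}$$

The main differences lie in how the diagonal transition is parameterized: the EMA decay $1 - \alpha \odot \delta$ is real-valued, while S4D uses a complex diagonal $\bar{A}$ derived from HiPPO theory.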
This folder thus contains a limited set of ablations comparing these components.
### Disclaimers
**Reproducibility:** These ablations were run from an internal codebase in Nov 2022, which should be equivalent to this PR (https://github.com/HazyResearch/state-spaces/commit/e9ce652126cc773dcb6bb7d6f7270c425d4a36a2), although they have not been reproduced in this codebase and may have slight discrepancies. Furthermore, other parts of the code have changed since then.
**Limited datasets:**
These ablations were run only on the LRA-Image task, which is a toy task, and with the single setting where the Mega chunk size is $c=128$. Although the results below show the S4 variants outperforming EMA in this setting, **the full Mega-chunk model with $c=1024$ performs much better**, and preliminary ablations showed that for $c=1024$, Mega-EMA outperformed Mega-S4D by 0.5-1 points with these particular hyperparameter settings.
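For context, Mega-chunk applies attention within non-overlapping chunks of length $c$; since the LRA-Image sequence length is 1024, $c=1024$ amounts to full attention. Below is a minimal sketch of the chunking idea, using plain dot-product self-attention as a stand-in for Mega's gated attention (a hypothetical helper, not the repo's implementation):

```python
import torch

def chunked_self_attention(x: torch.Tensor, c: int) -> torch.Tensor:
    """Apply (a stand-in for) self-attention independently within
    non-overlapping chunks of length c, so the cost is O(L * c) rather
    than O(L^2). With c equal to the sequence length, this reduces to
    full quadratic attention."""
    B, L, D = x.shape
    assert L % c == 0, "sequence length must be divisible by chunk size"
    xc = x.reshape(B * (L // c), c, D)  # fold chunks into the batch dim
    attn = torch.softmax(xc @ xc.transpose(-1, -2) / D**0.5, dim=-1)
    return (attn @ xc).reshape(B, L, D)
```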
### Results
^ Speed differences stem from a different implementation of the bidirectional logic and are not inherent to the model. The EMA-Repro runs use the same faster version that the S4(D) baselines use.
Pure S4D module from the S4D paper (no Attention).
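To illustrate what the bidirectional logic involves: a causal sequence layer can be made bidirectional by also running it on the time-reversed input and combining the two outputs, and implementations differ in how they batch and combine these passes, which affects speed. A hedged sketch of one common scheme (not necessarily the implementation used here):

```python
import torch

def bidirectional(layer, x: torch.Tensor) -> torch.Tensor:
    """Run a causal layer over the sequence and over its time reversal,
    then sum the two outputs. Assumes x has shape (batch, length, dim).
    Variants (e.g. concatenating and projecting, or batching the two
    directions together) trade off speed and parameter count."""
    fwd = layer(x)
    bwd = torch.flip(layer(torch.flip(x, dims=[1])), dims=[1])
    return fwd + bwd
```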
To make the models more directly comparable, some architecture flags were tweaked to match the Mega models (namely using pre-batch-norm rather than post-layer-norm).
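For reference, a pre-norm block applies normalization before the mixing layer ($x + \mathrm{mixer}(\mathrm{norm}(x))$), whereas a post-norm block normalizes after the residual sum ($\mathrm{norm}(x + \mathrm{mixer}(x))$). A minimal sketch of the pre-batch-norm variant, with a generic `mixer` standing in for the actual S4D/EMA layer (assumed names, not the repo's module):

```python
import torch
import torch.nn as nn

class PreBatchNormBlock(nn.Module):
    """Residual block with batch norm applied *before* the mixing layer.
    A post-layer-norm block would instead compute norm(x + mixer(x))
    with a LayerNorm."""
    def __init__(self, d_model: int, mixer: nn.Module):
        super().__init__()
        self.norm = nn.BatchNorm1d(d_model)
        self.mixer = mixer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model, length), channels-first for BatchNorm1d
        return x + self.mixer(self.norm(x))
```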
The above configs have been updated with more warmup steps.
Earlier versions of these experiments were identical except that all runs used `scheduler.num_warm_steps=1000`; those earlier results are shown below.
| Model | Params | s/epoch | Val Acc |
|---|---|---|---|
Again, it is stressed that these results are all for a **very limited task setting**, and that Mega-EMA likely outperforms the S4(D) baselines in the $c=1024$ setting with these hyperparameters.