From 2a62b90411f1276c88eb4d8e1942509f3637491e Mon Sep 17 00:00:00 2001
From: Florian Lemaitre <florian@lemaitre.re>
Date: Thu, 19 Apr 2018 16:24:25 +0200
Subject: [PATCH 1/9] Added permutation and shuffling primitives

---
 proposals/simd/SIMD.md | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/proposals/simd/SIMD.md b/proposals/simd/SIMD.md
index afcdc9c09..77302974c 100644
--- a/proposals/simd/SIMD.md
+++ b/proposals/simd/SIMD.md
@@ -211,10 +211,28 @@ The input lane value, `x`, is interpreted the same way as for the splat
 instructions. For the `i8` and `i16` lanes, the high bits of `x` are ignored.
 
 ### Shuffle lanes
+
+#### Immediate permutation rule
+* `v8x16.permute(a: v128, s: LaneIdx16[16]) -> v128`
+* `v16x8.permute(a: v128, s: LaneIdx8[8]) -> v128`
+* `v32x4.permute(a: v128, s: LaneIdx4[4]) -> v128`
+* `v64x2.permute(a: v128, s: LaneIdx2[2]) -> v128`
 * `v8x16.shuffle(a: v128, b: v128, s: LaneIdx32[16]) -> v128`
+* `v16x8.shuffle(a: v128, b: v128, s: LaneIdx16[8]) -> v128`
+* `v32x4.shuffle(a: v128, b: v128, s: LaneIdx8[4]) -> v128`
+* `v64x2.shuffle(a: v128, b: v128, s: LaneIdx4[2]) -> v128`
 
-Create vector with lanes selected from the lanes of two input vectors:
+Create vector with lanes selected from the lanes of the input vector:
+
+```python
+def S.permute(a, s):
+    result = S.New()
+    for i in range(S.Lanes):
+        result[i] = a[s[i]]
+    return result
+```
 
+Create vector with lanes selected from the lanes of two input vectors:
 ```python
 def S.shuffle(a, b, s):
     result = S.New()
@@ -226,6 +244,18 @@ def S.shuffle(a, b, s):
     return result
 ```
 
+#### Variable permutation rule
+* `v8x16.permuteVar(a: v128, s: v128) -> v128`
+* `v16x8.permuteVar(a: v128, s: v128) -> v128`
+* `v32x4.permuteVar(a: v128, s: v128) -> v128`
+* `v64x2.permuteVar(a: v128, s: v128) -> v128`
+* `v8x16.shuffleVar(a: v128, b: v128, s: v128) -> v128`
+* `v16x8.shuffleVar(a: v128, b: v128, s: v128) -> v128`
+* `v32x4.shuffleVar(a: v128, b: v128, s: v128) -> v128`
+* `v64x2.shuffleVar(a: v128, b: v128, s: v128) -> v128`
+
+Same as non-`Var`, but where indices are runtime values.
+
 ## Integer arithmetic
 
 Wrapping integer arithmetic discards the high bits of the result.

From 219cc12509d7238ae89f72f1574c0d7cfba513f4 Mon Sep 17 00:00:00 2001
From: Florian Lemaitre <florian@lemaitre.re>
Date: Thu, 19 Apr 2018 16:43:23 +0200
Subject: [PATCH 2/9] Add reduction paragraph

Reductions are computed with permutes.
---
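
Reviewer note: a minimal Python sketch of the reduction idea, modelling a
`v128` holding four `f32` lanes as a plain list. The helper names (`permute`,
`add`, `reduce_add_f32x4`) are hypothetical and only mirror the pseudocode
style used in SIMD.md. The sketch assumes that each step permutes the running
partial sum, which is one reading of the intended data flow and not a
transcription of the listing in this patch.

```python
def permute(a, s):
    """Model of S.permute: result lane i takes lane s[i] of a."""
    return [a[i] for i in s]


def add(a, b):
    """Model of a lane-wise add such as f32x4.add."""
    return [x + y for x, y in zip(a, b)]


def reduce_add_f32x4(v):
    """Sum the four lanes of v with two permute+add steps (assumed data flow)."""
    # Step 1: swap the lower and upper halves, then add.
    # Lanes become [v0+v2, v1+v3, v2+v0, v3+v1].
    t = add(v, permute(v, [2, 3, 0, 1]))
    # Step 2: swap the neighbours inside each pair, then add.
    # Every lane now holds the full sum.
    r = add(t, permute(t, [1, 0, 3, 2]))
    # Read lane 0, as f32x4.extract_lane 0 would.
    return r[0]
```

The same two-step pattern should carry over to other associative and
commutative lane-wise operations (`min`, `max`, `and`, `or`) by swapping out
`add`.
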
 proposals/simd/SIMD.md | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/proposals/simd/SIMD.md b/proposals/simd/SIMD.md
index 77302974c..98ad29ab8 100644
--- a/proposals/simd/SIMD.md
+++ b/proposals/simd/SIMD.md
@@ -705,3 +705,28 @@ Lane-wise saturating conversion from floating point to integer using the IEEE
 resulting lane is 0. If the rounded integer value of a lane is outside the
 range of the destination type, the result is saturated to the nearest
 representable integer value.
+
+
+## Reductions
+
+There is no instruction for reductions.
+Instead, one can use permutations to reduce lane-wise operations like `add`, `min`, `max`, `and`, `or`...
+
+Here is an example to reduce add on f32x4:
+```
+get_local 0
+v32x4.permute 2 3 0 1  ;; swap the lower part with the higher part of the vector
+f32x4.add
+get_local 0
+v32x4.permute 1 0 3 2  ;; swap the 2 first elements together, and the 2 last elements together
+f32x4.add
+f32x4.extract_lane 0  ;; extract the first element
+```
+
+Here is an example to reduce add on f64x2:
+```
+get_local 0
+v64x2.permute 1 0  ;; swap the lower part with the higher part of the vector
+f64x2.add
+f64x2.extract_lane 0  ;; extract the first element
+```

From f25b996d33456d5df457939f3878415dbcaddc5f Mon Sep 17 00:00:00 2001
From: Florian Lemaitre <florian@lemaitre.re>
Date: Thu, 19 Apr 2018 23:57:19 +0200
Subject: [PATCH 3/9] Shorter encoding for reduce add on f32x4

---
 proposals/simd/SIMD.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/proposals/simd/SIMD.md b/proposals/simd/SIMD.md
index 98ad29ab8..e194c01c3 100644
--- a/proposals/simd/SIMD.md
+++ b/proposals/simd/SIMD.md
@@ -715,7 +715,7 @@ Instead, one can use permutations to reduce lane-wise operations like `add`, `mi
 Here is an example to reduce add on f32x4:
 ```
 get_local 0
-v32x4.permute 2 3 0 1  ;; swap the lower part with the higher part of the vector
+v64x2.permute 1 0  ;; swap the lower part with the higher part of the vector
 f32x4.add
 get_local 0
 v32x4.permute 1 0 3 2  ;; swap the 2 first elements together, and the 2 last elements together

From 9146aeb2aaa77faa62fda7fd9edb76170eccc6ed Mon Sep 17 00:00:00 2001
From: Florian Lemaitre <florian@lemaitre.re>
Date: Wed, 8 Aug 2018 23:30:52 +0200
Subject: [PATCH 4/9] Removed controversial shuffle instructions

---
 proposals/simd/SIMD.md | 21 ---------------------
 1 file changed, 21 deletions(-)

diff --git a/proposals/simd/SIMD.md b/proposals/simd/SIMD.md
index e194c01c3..c6d6b0536 100644
--- a/proposals/simd/SIMD.md
+++ b/proposals/simd/SIMD.md
@@ -213,25 +213,11 @@ instructions. For the `i8` and `i16` lanes, the high bits of `x` are ignored.
 ### Shuffle lanes
 
 #### Immediate permutation rule
-* `v8x16.permute(a: v128, s: LaneIdx16[16]) -> v128`
-* `v16x8.permute(a: v128, s: LaneIdx8[8]) -> v128`
-* `v32x4.permute(a: v128, s: LaneIdx4[4]) -> v128`
-* `v64x2.permute(a: v128, s: LaneIdx2[2]) -> v128`
 * `v8x16.shuffle(a: v128, b: v128, s: LaneIdx32[16]) -> v128`
 * `v16x8.shuffle(a: v128, b: v128, s: LaneIdx16[8]) -> v128`
 * `v32x4.shuffle(a: v128, b: v128, s: LaneIdx8[4]) -> v128`
 * `v64x2.shuffle(a: v128, b: v128, s: LaneIdx4[2]) -> v128`
 
-Create vector with lanes selected from the lanes of the input vector:
-
-```python
-def S.permute(a, s):
-    result = S.New()
-    for i in range(S.Lanes):
-        result[i] = a[s[i]]
-    return result
-```
-
 Create vector with lanes selected from the lanes of two input vectors:
 ```python
 def S.shuffle(a, b, s):
@@ -245,14 +231,7 @@ def S.shuffle(a, b, s):
 ```
 
 #### Variable permutation rule
-* `v8x16.permuteVar(a: v128, s: v128) -> v128`
-* `v16x8.permuteVar(a: v128, s: v128) -> v128`
-* `v32x4.permuteVar(a: v128, s: v128) -> v128`
-* `v64x2.permuteVar(a: v128, s: v128) -> v128`
 * `v8x16.shuffleVar(a: v128, b: v128, s: v128) -> v128`
-* `v16x8.shuffleVar(a: v128, b: v128, s: v128) -> v128`
-* `v32x4.shuffleVar(a: v128, b: v128, s: v128) -> v128`
-* `v64x2.shuffleVar(a: v128, b: v128, s: v128) -> v128`
 
 Same as non-`Var`, but where indices are runtime values.
 

From 89b72454310b4bae13f51ae547970b1962c5baff Mon Sep 17 00:00:00 2001
From: Florian Lemaitre <florian@lemaitre.re>
Date: Sat, 11 Aug 2018 15:31:47 +0200
Subject: [PATCH 5/9] snake case for `shuffle_var`

---
 proposals/simd/SIMD.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/proposals/simd/SIMD.md b/proposals/simd/SIMD.md
index c6d6b0536..6b2fa5a43 100644
--- a/proposals/simd/SIMD.md
+++ b/proposals/simd/SIMD.md
@@ -231,9 +231,9 @@ def S.shuffle(a, b, s):
 ```
 
 #### Variable permutation rule
-* `v8x16.shuffleVar(a: v128, b: v128, s: v128) -> v128`
+* `v8x16.shuffle_var(a: v128, b: v128, s: v128) -> v128`
 
-Same as non-`Var`, but where indices are runtime values.
+Same as `v8x16.shuffle`, but the lane indices are runtime values taken from the vector `s`.
 
 ## Integer arithmetic
 

From 64dc7ae8d3b19c4166756163fc166564d43675e0 Mon Sep 17 00:00:00 2001
From: Florian Lemaitre <florian@lemaitre.re>
Date: Thu, 13 Dec 2018 11:29:30 +0100
Subject: [PATCH 6/9] Use shuffle instead of permute in reduction examples

---
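
Reviewer note: a small Python sketch of why a two-input shuffle can stand in
for the removed single-input permute when both operands are the same vector.
Vectors are modelled as plain lists, and `shuffle` and `permute` are
hypothetical helpers written from the prose description in SIMD.md (indices
below the lane count select from `a`, the remaining ones from `b`). It is not
the spec's own pseudocode.

```python
def shuffle(a, b, s):
    """Model of S.shuffle: indices below len(a) pick from a, the rest from b."""
    lanes = len(a)
    return [a[i] if i < lanes else b[i - lanes] for i in s]


def permute(a, s):
    """Model of the removed S.permute: result lane i takes lane s[i] of a."""
    return [a[i] for i in s]


# With the same vector passed twice, shuffle reproduces permute
# for any index list that stays below the lane count.
v = [10, 20, 30, 40]
assert shuffle(v, v, [1, 0, 3, 2]) == permute(v, [1, 0, 3, 2])
```

This is why each former `permute` in the examples is preceded by a second
`get_local 0` once it becomes a `shuffle`.
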
 proposals/simd/SIMD.md | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/proposals/simd/SIMD.md b/proposals/simd/SIMD.md
index 6b2fa5a43..e215d24ea 100644
--- a/proposals/simd/SIMD.md
+++ b/proposals/simd/SIMD.md
@@ -694,10 +694,12 @@ Instead, one can use permutations to reduce lane-wise operations like `add`, `mi
 Here is an example to reduce add on f32x4:
 ```
 get_local 0
-v64x2.permute 1 0  ;; swap the lower part with the higher part of the vector
+get_local 0
+v64x2.shuffle 1 0  ;; swap the lower part with the higher part of the vector
 f32x4.add
 get_local 0
-v32x4.permute 1 0 3 2  ;; swap the 2 first elements together, and the 2 last elements together
+get_local 0
+v32x4.shuffle 1 0 3 2  ;; swap lanes 0 and 1, and swap lanes 2 and 3
 f32x4.add
 f32x4.extract_lane 0  ;; extract the first element
 ```
@@ -705,7 +707,8 @@ f32x4.extract_lane 0  ;; extract the first element
 Here is an example to reduce add on f64x2:
 ```
 get_local 0
-v64x2.permute 1 0  ;; swap the lower part with the higher part of the vector
+get_local 0
+v64x2.shuffle 1 0  ;; swap the lower part with the higher part of the vector
 f64x2.add
 f64x2.extract_lane 0  ;; extract the first element
 ```

From 12aed66888bdca14c2caee3c217e672458580c15 Mon Sep 17 00:00:00 2001
From: Florian Lemaitre <florian@lemaitre.re>
Date: Sat, 2 Mar 2019 12:38:11 +0100
Subject: [PATCH 7/9] Fixed paragraph positioning

---
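
Reviewer note: a minimal Python sketch of the modulo behaviour described for
`v8x16.permute_dyn` in this patch. The vector is modelled as a list of 16
lane values. The function body is an assumption written from the prose
("indices are first fit into the range `[0, 15]` via a modulo"), not a copy of
the spec pseudocode.

```python
LANES = 16  # v8x16: sixteen 8-bit lanes


def permute_dyn(a, s):
    """Model of v8x16.permute_dyn: result lane i is a[s[i] mod 16]."""
    assert len(a) == LANES and len(s) == LANES
    # Out-of-range indices wrap around instead of trapping or yielding zero.
    return [a[idx % LANES] for idx in s]
```

Under this reading, an index of 16 selects lane 0 and an index of 255 selects
lane 15.
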
 proposals/simd/SIMD.md | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/proposals/simd/SIMD.md b/proposals/simd/SIMD.md
index 68d3d1d0d..d129be149 100644
--- a/proposals/simd/SIMD.md
+++ b/proposals/simd/SIMD.md
@@ -288,6 +288,7 @@ instructions. For the `i8` and `i16` lanes, the high bits of `x` are ignored.
 
 #### Immediate permutation rule
 * `v8x16.shuffle(a: v128, b: v128, imm: ImmLaneIdx32[16]) -> v128`
+
 Returns a new vector with lanes selected from the lanes of the two input vectors
 `a` and `b` specified in the 12 byte wide immediate mode operand `imm`. This
 instruction is encoded with 12 bytes providing the indices of the elements to
@@ -295,6 +296,7 @@ return. The indices `i` in range `[0, 15]` select the `i`-th element of `a`. The
 indices in range `[16, 31]` select the `i - 16`-th element of `b`.
 
 * `v16x8.shuffle(a: v128, b: v128, imm: ImmLaneIdx16[8]) -> v128`
+
 Returns a new vector with lanes selected from the lanes of the two input vectors
 `a` and `b` specified in the 3 byte wide immediate mode operand `imm`. This
 instruction is encoded with 3 bytes providing the indices of the elements to
@@ -302,6 +304,7 @@ return. The indices `i` in range `[0, 7]` select the `i`-th element of `a`. The
 indices in range `[8, 15]` select the `i - 8`-th element of `b`.
 
 * `v32x4.shuffle(a: v128, b: v128, imm: ImmLaneIdx8[4]) -> v128`
+
 Returns a new vector with lanes selected from the lanes of the two input vectors
 `a` and `b` specified in the 2 byte wide immediate mode operand `imm`. This
 instruction is encoded with 2 bytes providing the indices of the elements to
@@ -309,6 +312,7 @@ return. The indices `i` in range `[0, 3]` select the `i`-th element of `a`. The
 indices in range `[4, 7]` select the `i - 4`-th element of `b`.
 
 * `v64x2.shuffle(a: v128, b: v128, imm: ImmLaneIdx4[2]) -> v128`
+
 Returns a new vector with lanes selected from the lanes of the two input vectors
 `a` and `b` specified in the 1 byte wide immediate mode operand `imm`. This
 instruction is encoded with 1 bytes providing the indices of the elements to
@@ -328,8 +332,10 @@ def S.shuffle(a, b, s):
 
 #### Variable permutation rule
 * `v8x16.permute_dyn(a: v128, s: v128) -> v128`
+
 Returns a new vector with lanes selected from the lanes of the first input vector
-`a` and specified in the second input vector `s`.
+`a` and specified in the second input vector `s`. Each index in `s` is first
+reduced modulo 16 so that it falls in the range `[0, 15]`.
 
 ```python
 def S.permute_dyn(a, s):

From c3ce95aa10fe78c62d9487e9a5884b95a3ce5c44 Mon Sep 17 00:00:00 2001
From: Florian Lemaitre <florian@lemaitre.re>
Date: Sat, 2 Mar 2019 14:58:52 +0100
Subject: [PATCH 8/9] Fixed length of v8x16.shuffle immediate

---
 proposals/simd/SIMD.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/proposals/simd/SIMD.md b/proposals/simd/SIMD.md
index d129be149..526ec72f1 100644
--- a/proposals/simd/SIMD.md
+++ b/proposals/simd/SIMD.md
@@ -290,8 +290,8 @@ instructions. For the `i8` and `i16` lanes, the high bits of `x` are ignored.
 * `v8x16.shuffle(a: v128, b: v128, imm: ImmLaneIdx32[16]) -> v128`
 
 Returns a new vector with lanes selected from the lanes of the two input vectors
-`a` and `b` specified in the 12 byte wide immediate mode operand `imm`. This
-instruction is encoded with 12 bytes providing the indices of the elements to
+`a` and `b` specified in the 10 byte wide immediate mode operand `imm`. This
+instruction is encoded with 10 bytes providing the indices of the elements to
 return. The indices `i` in range `[0, 15]` select the `i`-th element of `a`. The
 indices in range `[16, 31]` select the `i - 16`-th element of `b`.
 

From 3152128de8d39ceaefcce7a08cc719b2d89a6f62 Mon Sep 17 00:00:00 2001
From: Lemaitre <florian@lemaitre.re>
Date: Sun, 31 Mar 2019 17:38:04 +0200
Subject: [PATCH 9/9] Updated Binary and text encoding

---
 proposals/simd/BinarySIMD.md | 5 ++++-
 proposals/simd/TextSIMD.md   | 5 ++++-
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/proposals/simd/BinarySIMD.md b/proposals/simd/BinarySIMD.md
index ff51a7f0f..0b6151fc1 100644
--- a/proposals/simd/BinarySIMD.md
+++ b/proposals/simd/BinarySIMD.md
@@ -167,4 +167,7 @@ The `v8x16.shuffle2_imm` instruction has 16 bytes after `simdop`.
 | `f64x2.convert_s/i64x2`   |    `0xb1`| -                  |
 | `f64x2.convert_u/i64x2`   |    `0xb2`| -                  |
 | `v8x16.shuffle1`          |    `0xc0`| -                  |
-| `v8x16.shuffle2_imm`      |    `0xc1`| s:LaneIdx32[16]    |
\ No newline at end of file
+| `v8x16.shuffle2_imm`      |    `0xcc`| s:LaneIdx32[16]    |
+| `v16x8.shuffle2_imm`      |    `0xcd`| s:LaneIdx16[8]     |
+| `v32x4.shuffle2_imm`      |    `0xce`| s:LaneIdx8[4]      |
+| `v64x2.shuffle2_imm`      |    `0xcf`| s:LaneIdx4[2]      |
diff --git a/proposals/simd/TextSIMD.md b/proposals/simd/TextSIMD.md
index fc3a7e7d2..6fda0692f 100644
--- a/proposals/simd/TextSIMD.md
+++ b/proposals/simd/TextSIMD.md
@@ -20,8 +20,11 @@ The canonical text format used for printing `v128.const` instructions is
 v128.const i32x4 0xNNNNNNNN 0xNNNNNNNN 0xNNNNNNNN 0xNNNNNNNN
 ```
 
-### v8x16.shuffle2_imm
+### Shuffling using immediate indices
 
 ```
 v8x16.shuffle2_imm i5 i5 i5 i5 i5 i5 i5 i5 i5 i5 i5 i5 i5 i5 i5 i5
+v16x8.shuffle2_imm i4 i4 i4 i4 i4 i4 i4 i4
+v32x4.shuffle2_imm i3 i3 i3 i3
+v64x2.shuffle2_imm i2 i2
 ```