Port all active C# hardware intrinsics APIs for SSE from SIMD native algorithms #668

briancylui · 2018-08-08T22:00:40Z

What's new:

Implemented all remaining active SSE intrinsics, including their software fallbacks and passing unit tests
Implemented the performance tests of all remaining active SSE intrinsics, except for those that accept AlignedArray as an argument
Performance test results for all applicable, active SSE intrinsics are updated in Progress on porting ML.NET native SIMD algorithms to managed code briancylui/machinelearning#1

Description from #562:

with unit tests and performance tests for two target frameworks: .NET Core App 3.0 and .NET Standard 2.0.
.NET Core App 3.0 gets the new C# hardware intrinsics APIs, while .NET Standard 2.0 gets the original native SIMD algorithms.
Several things have made this multi-targeting feature possible.
1. The new CpuMathUtils class that contains the new APIs is implemented as a partial class with method definitions split into two separate files (src\Microsoft.ML.CpuMath\CpuMathUtils.[target].cs), only one of which is compiled based on the target framework.
2. The Microsoft.ML.CpuMath.csproj file makes the switching decision to compile the right files based on the target framework.

Structure:

All new hardware intrinsics APIs are contained in src\Microsoft.ML.CpuMath.
Unit tests for the two target frameworks live in test\Microsoft.ML.CpuMath.UnitTests.[target], and contain the same content except for the target framework in .csproj.
Performance tests live in test\Microsoft.ML.CpuMath.PerformanceTests.

Changes to users:

Originally, users call intrinsics APIs via the SseUtils class in src\Microsoft.ML.CpuMath\Sse.cs, but now they call them via the new CpuMathUtils class, which will handle switching between SSE and AVX in the future.
CpuMathUtils methods and SseUtils methods share the same signature, but CpuMathUtils methods additionally call a new helper class (SseIntrinsics) for C# implementations of SSE operations.

Future follow-up for `CpuMath` enhancement

Suggestions on CpuMath enhancement in this PR scheduled for future follow-ups have been compiled into an issue page (Suggestions on CpuMath enhancement briancylui/machinelearning#2).

List of new SSE intrinisics implemented

• MatMulA
• MatMulTranA
• MatMulPA
• MatMulTranPA
• SdcaL1UpdateU
• SdcaL1UpdateSU
• AddScaleCopyU
• SumU
• AddScalarU
• SumSqDiffU
• SumAbsDiffU
• MaxAbsDiffU
• MaxAbsU
• ScaleSrcU
• ScaleAddU
• ZeroItemsU
• ZeroMatrixItemsCoreU

cc: @eerhardt @tannergooding @safern @danmosemsft

…s, passing unit tests, and performance tests Note: Performance tests for functions that involve AlignedArray are not implemented.

briancylui · 2018-08-08T23:28:56Z

test OSX10.13 Debug please

briancylui · 2018-08-08T23:37:41Z

All 5 checks have passed. No merge conflicts.

tannergooding · 2018-08-09T00:05:13Z

src/Microsoft.ML.CpuMath/SseIntrinsics.cs

@@ -55,7 +78,7 @@ private static unsafe void Store4(Vector128<float> x, float* dst, int* idx)
        }

        [MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
-        private static Vector128<float> VectorSum(in Vector128<float> vector)


Why drop in?

Nice catch - amended thanks!

safern · 2018-08-09T00:27:48Z

src/Microsoft.ML.CpuMath/CpuMathUtils.netcoreapp.cs

+            {
+                if (!tran)
+                {
+                    Contracts.Assert(0 <= crun && crun <= dst.Size);


This feels odd to me, it feels more readable to do crun >= 0 -- also, you could do a common Contracts.Assert for that, it seems like in all the blocks you're doing 0 <= crun

It is also generally useful to split && asserts into two separate asserts, as it allows easy diagnosis on failure.

Thank you for your feedback! Tanner also raised similar concerns about Contracts.Asserts like count > 0. I have adopted these changes and documented them in briancylui#2.

safern · 2018-08-09T00:28:30Z

src/Microsoft.ML.CpuMath/CpuMathUtils.netcoreapp.cs

+            {
+                if (!tran)
+                {
+                    Contracts.Assert(0 <= crun && crun <= dst.Size);


Here also, and all the places where this can be unified and change the comparison to be crun >= 0

safern · 2018-08-09T00:29:45Z

src/Microsoft.ML.CpuMath/CpuMathUtils.netcoreapp.cs

+        public static void Add(float a, float[] dst, int count)
+        {
+            Contracts.AssertNonEmpty(dst);
+            Contracts.Assert(0 < count);


count > 0? I know it is just the order, but we usually read from left to right, so if you change the sides it is read differently. (0 is smaller than count -> count is greater than 0)

safern · 2018-08-09T00:31:54Z

src/Microsoft.ML.CpuMath/CpuMathUtils.netcoreapp.cs

+        // dst = a * src
+        public static void Scale(float a, float[] src, float[] dst, int count)
+        {
+            Contracts.AssertNonEmpty(src);


If AssertNonEmpty just checks for the length to be greater than 0 then this doesn't feel necessary to me, since you're asserting below that 0 < count and then that count <= src.Length, which automatically achieves that. If count is greater than 0 then it means src.Length must be greater than 0 as well.

It is true that 0 < count && count <= src.Length already takes care of checking if src is empty, and Contracts.AssertNonEmpty<T> doesn't seem to print any useful error messages here:

machinelearning/src/Microsoft.ML.Core/Utilities/Contracts.cs

Lines 922 to 927 in d0664c1

[Conditional("DEBUG")]

public static void AssertNonEmpty<T>(ICollection<T> args)

{

if (Size(args) == 0)

DbgFail();

}

It seems to me that we may want to first check whether src is non-empty. If not, the program does not even bother to do the two checks for count. 0 < count && count <= src.Length sounds to me more like a restriction on count given the knowledge that src is a non-empty array. If src were really empty, with the AssertNonEmpty, users could see from the stack trace that the line Contracts.AssertNonEmpty(src) is the problem, but without the AsssertNonEmpty, they would see from the stack trace that the line Contracts.Assert(count <= src.Length) is the problem, which might lead them to check for both count and src.

Yes, let's check first wether src is non-empty and then just check if count <= src.Length.

Contracts.Assert(count > 0); Contracts.AssertNonEmpty(src); Contracts.Assert(count <= src.Length);

As @tannergooding stated in another comment, let's break out the && asserts.

safern · 2018-08-09T00:34:16Z

src/Microsoft.ML.CpuMath/CpuMathUtils.netcoreapp.cs

+        {
+            if (Sse.IsSupported)
+            {
+                if (mean == 0)


this whole block can be simplified to:

return mean == 0 ? SseIntrinsics.SumAbsU(src) : SseIntrinsics.SumAbsDiffU(mean, src);

Great minds think alike - your way is actucally the preferred way from the original author:

machinelearning/src/Microsoft.ML.CpuMath/Sse.cs

Lines 1070 to 1081 in d0664c1

public static float SumAbs(float mean, float[] src, int offset, int count)

{

Contracts.AssertNonEmpty(src);

Contracts.Assert(0 < count);

Contracts.Assert(0 <= offset && offset <= src.Length - count);

unsafe

{

fixed (float* psrc = &src[offset])

return (mean == 0 ? Thunk.SumAbsU(psrc, count) : Thunk.SumAbsDiffU(mean, psrc, count));

}

}

I thought breaking down the ?: operator into an if-else statement would have made it a bit more readable, but I guess the ?: operator looks better in this case. Thank you for bringing this up!

eerhardt · 2018-08-09T14:57:42Z

src/Microsoft.ML.CpuMath/CpuMathUtils.netcoreapp.cs

@@ -9,6 +9,174 @@ namespace Microsoft.ML.Runtime.Internal.CpuMath
 {
    public static partial class CpuMathUtils
    {
+        public static void MatTimesSrc(bool tran, bool add, AlignedArray mat, AlignedArray src, AlignedArray dst, int crun)


(probably for future clean up, like other naming suggestions) tran doesn't really describe what this is. It should probably be transpose.

Agree and thanks - documented this in briancylui#2.

eerhardt · 2018-08-09T15:02:18Z

src/Microsoft.ML.CpuMath/CpuMathUtils.netcoreapp.cs

+                    SseIntrinsics.MatMulTranA(add, mat, src, dst, dst.Size, crun);
+                }
+            }
+            else


I just realized we don't have any unit tests for these software fallback implementations. I think that is OK for this PR, but it is something to put on your TODO list.

tannergooding · 2018-08-09T15:10:18Z

src/Microsoft.ML.CpuMath/SseIntrinsics.cs

+                {
+                    Vector128<float> x01 = Sse.LoadAlignedVector128(pSrcCurrent);
+                    // Replicate each slot of x01 into its own register.
+                    Vector128<float> x11 = Sse.Shuffle(x01, x01, 0x55);


A comment about what each of these is extracting would be useful:

0x00 == X

0x55 == Y

0xAA == Z

0xFF == W

Thanks and adopted the change. May I know if there is any preference/convention for using XYZW instead of ABCD? I used ABCD in comments of private functions earlier thinking that all users can always infer the correct elements from their alphabetical order, but not sure if it's the best way.

eerhardt · 2018-08-09T15:11:44Z

src/Microsoft.ML.CpuMath/CpuMathUtils.netcoreapp.cs

+        {
+            Contracts.Assert(0 < ccol && ccol <= cfltRow);
+
+            // REVIEW NEEDED: Since the two methods below do not involve any SSE hardware intrinsics, no software fallback is needed.


I'm not understanding here. If there are no SIMD operations involved in these methods, then let's just have the software function in CpuMathUtils and not have the method at all in SseIntrinsics. SseIntrinsics should be only for when we use the Sse operations.

ZeroItemsU and ZeroMatrixItemsCore are two functions in Sse.cpp like all other functions currently in SseIntrinsics.cs, but since even their native implementation does not involve Intel hardware intrinsics like _mm_set_ps at all, there is no need to implement software fallbacks. We require SSE support only for us to call those Intel hardware intrinsics, which are absent in ZeroItemsU and ZeroMatrixItemsCore.

While the implementations of these two functions are software-based, they do involve pointers and follow the C++ syntax in Sse.cpp. I can put these two functions in CpuMathUtils and rewrite them so that they follow the C# syntax.

Before I proceed to modify these function, may I know if I have given relevant info, and if this change is still desired?

I think the right change here is just to move your current code (using pointers) from SseIntrinsics into CpuMathUtils. I assume the reason it is using pointers is for speed. Let's keep it that way until it can be proved we don't need pointers.

Agree - thanks! ZeroItemsU and ZeroMatrixItemsCore call the Ptr method which is private in SseIntrinsics. Should I make Ptr in SseIntrinsics internal instead of Private?

The above change could solve the issue. Another way is to remove the dependence on Ptr in CpuMathUtils:

private static unsafe void ZeroItemsU(AlignedArray dst, int c, int[] indices, int cindices) { for (int i = 0; i < cindices; i++) { int index = indices[i]; Contracts.Assert(index >= 0); Contracts.Assert(index < c); dst[index] = 0; } /* Original version calls the Ptr method, which is private in SseIntrinsics fixed (float* pDstStart = &dst.Items[0]) fixed (int* pidx = &indices[0]) { float* pdst = Ptr(dst, pDstStart); for (int i = 0; i < cindices; ++i) { int index = pidx[i]; Contracts.Assert(0 <= index && index < c); pdst[index] = 0; } } */ } private static unsafe void ZeroMatrixItemsCore(AlignedArray dst, int c, int ccol, int cfltRow, int[] indices, int cindices) { int ivLogMin = 0; int ivLogLim = ccol; int ivPhyMin = 0; for (int i = 0; i < cindices; i++) { int index = indices[i]; Contracts.Assert(index >= 0); Contracts.Assert(index < c); int col = index - ivLogMin; if ((uint)col >= (uint)ccol) { Contracts.Assert(index < ivLogMin || index >= ivLogLim); int row = index / ccol; ivLogMin = row * ccol; ivLogLim = ivLogMin + ccol; ivPhyMin = row * cfltRow; Contracts.Assert(index >= ivLogMin); Contracts.Assert(index < ivLogLim); col = index - ivLogMin; } dst[ivPhyMin + col] = 0; } /* Original version calls the Ptr method, which is private in SseIntrinsics fixed (float* pDstStart = &dst.Items[0]) fixed (int* pidx = &indices[0]) { float* pdst = Ptr(dst, pDstStart); int ivLogMin = 0; int ivLogLim = ccol; int ivPhyMin = 0; for (int i = 0; i < cindices; ++i) { int index = pidx[i]; Contracts.Assert(0 <= index && index < c); int col = index - ivLogMin; if ((uint)col >= (uint)ccol) { Contracts.Assert(ivLogMin > index || index >= ivLogLim); int row = index / ccol; ivLogMin = row * ccol; ivLogLim = ivLogMin + ccol; ivPhyMin = row * cfltRow; Contracts.Assert(ivLogMin <= index && index < ivLogLim); col = index - ivLogMin; } pdst[ivPhyMin + col] = 0; } } */ }

Please let me know which approach I should follow (the former or latter).

You're "latter" approach isn't using pointers at all.

I was thinking we wouldn't need to call Ptr at all. That method moves the array around in memory so it is guaranteed to be aligned. Since we aren't using SIMD instructions to set all the values to 0, we can just pin the arrays using

fixed (float* pDstStart = &dst.Items[0]) fixed (int* pidx = &indices[0])

And then continue with the logic without calling Ptr.

Thank you! I think this is doable, but pdst[index] = 0 in ZeroItemsU and pdst[ivPhyMin + col] = 0 will be affected if we do not call Ptr and increment pDstStart to get the pointer pdst that points to the base of dst. How should I proceed?

You can just use the pointer you have when you pinned the array:

internal static unsafe void ZeroItemsU(AlignedArray dst, int c, int[] indices, int cindices) { fixed (float* pdst = &dst.Items[0]) fixed (int* pidx = &indices[0]) { for (int i = 0; i < cindices; ++i) { int index = pidx[i]; Contracts.Assert(0 <= index && index < c); pdst[index] = 0; } } }

eerhardt · 2018-08-09T15:13:50Z

test/Microsoft.ML.CpuMath.PerformanceTests/SsePerformanceTests.cs


-        private float[] src, dst, original, src1, src2;
+        // Naming follows from SseIntrinsics.
+        private const int CbAlign = 16;


This doesn't appear to be used.

Nice catch - thanks! I originally implemented perf tests for the intrinsics that involved matrix operations and AlignedArray. It was a remnant of a clean-up of unused constants and variables.

eerhardt · 2018-08-09T15:30:30Z

src/Microsoft.ML.CpuMath/SseIntrinsics.cs

+                    Sse.StoreAligned(pDstCurrent, res0);
+
+                    pDstCurrent += 4;
+                    pMatCurrent += 3 * ccol;


Did you figure out why this is 3 * ccol?

Maybe because here it is adding the ccol 3 times to pMatTemp which is a copy of pMatCurrent and at the end it needs to reset pMatCurrent to the right position?

https://github.com/dotnet/machinelearning/pull/668/files/02bfbe6c0f2f8d12ef7ef37006557d4f0616a054#diff-1212867e04546a8e692d4e263dcc390cR130

It is following the same pattern in this method:

https://github.com/dotnet/machinelearning/pull/668/files/02bfbe6c0f2f8d12ef7ef37006557d4f0616a054#diff-1212867e04546a8e692d4e263dcc390cR292

Agree and thanks! We may get a slightly better view from the native code here:

machinelearning/src/Native/CpuMathNative/Sse.cpp

Lines 108 to 146 in d0664c1

EXPORT_API(void) MatMulA(bool add, _In_ const float * pmat, _In_ const float * psrc, _Inout_ float * pdst, int crow, int ccol)

{

const float * psLim = psrc + ccol;

const float * pdLim = pdst + crow;

const float * pm = pmat;

for (float * pd = pdst; pd < pdLim; pd += 4, pm += 3 * ccol)

{

__m128 res0 = _mm_setzero_ps();

__m128 res1 = res0;

__m128 res2 = res0;

__m128 res3 = res0;

for (const float * ps = psrc; ps < psLim; ps += 4, pm += 4)

{

const float * pmTmp;

__m128 x01 = _mm_load_ps(pmTmp = pm);

__m128 x11 = _mm_load_ps(pmTmp += ccol);

__m128 x21 = _mm_load_ps(pmTmp += ccol);

__m128 x31 = _mm_load_ps(pmTmp += ccol);

__m128 x02 = _mm_load_ps(ps);

x01 = _mm_mul_ps(x01, x02);

x11 = _mm_mul_ps(x11, x02);

x21 = _mm_mul_ps(x21, x02);

x31 = _mm_mul_ps(x31, x02);

res0 = _mm_add_ps(res0, x01);

res1 = _mm_add_ps(res1, x11);

res2 = _mm_add_ps(res2, x21);

res3 = _mm_add_ps(res3, x31);

}

// Add up the entries of each, with the 4 results in res0

res0 = _mm_hadd_ps(res0, res1);

res2 = _mm_hadd_ps(res2, res3);

res0 = _mm_hadd_ps(res0, res2);

if (add)

res0 = _mm_add_ps(res0, _mm_load_ps(pd));

_mm_store_ps(pd, res0);

}

}

pMatCurrent corresponds to pm in the native code in the window. For correctness, pm needs to be incremented by 4 * ccol in each iteration of the outer for loop. We see that at the end of such iteration, pm is incremented by 3 * ccol only - this is because the inner for loop increments pm by 4 in each of its own iteration. SInce the inner for loop runs for (psLim - psrc) / 4 = ccol / 4 iterations, the entire inner for loop increments pm by (ccol / 4) * 4 = ccol. Together with the increment by 3 * ccol done by the outer for loop, pm is incremented by 4 * ccol in each iteration of the outer for loop, which is the desired behavior.

eerhardt · 2018-08-09T15:32:47Z

src/Microsoft.ML.CpuMath/SseIntrinsics.cs

+
+                float* pSrcEnd = psrc + ccol;
+                float* pDstEnd = pdst + crow;
+                float* pSrcCurrent = psrc;


Why do we need another set of pointers here? Can't we just use psrc?

Actually, I think there is a bug here. Shouldn't pSrcCurrent get reset to psrc after each inner loop?

From the C++ code:

for (const float * ps = psrc; ps < psLim; ps += 4, pm += 4) { const float * pmTmp; __m128 x01 = _mm_load_ps(pmTmp = pm); __m128 x11 = _mm_load_ps(pmTmp += ccol); __m128 x21 = _mm_load_ps(pmTmp += ccol); __m128 x31 = _mm_load_ps(pmTmp += ccol); __m128 x02 = _mm_load_ps(ps);

Maybe we need a few more unit tests to cover different shapes of matrices. We only have 4x4 and 4x8 tests.

In reply to: 208976581 [](ancestors = 208976581)

Reply to the first comment:
In this case, yes, we can use psrc directly and feel free to increment it in the for loops without having a pSrcCurrent, since we don't have to keep track of the start of the array (adjusted for the base of the AlignedArray) in this function. The same goes for dst and mat.

In other functions that do not involve matrix multiplication, I have adopted the convention to only increment pXCurrent, while using pX to indicate a fixed pointer that indicates the start of the array X. Should I skip using pXCurrent just for these 4-6 functions that involve matrix operations?

Reply to the second comment:

Nice catch - thank you! I have looked at all 4 functions that involve matrix multiplications to ensure the same thing doesn't happen elsewhere in these 4 functions.

Since the dst array has a constant size of 4, the outer for loop never has a second iteration, and thus the 2 unit tests implemented for each matrix-op function did not manage to catch this bug. I will probably add a 8x4 test.

Note: These matrix-op functions have the suffix A, which means that the inputs are "aligned and padded for SSE operations," so both dimensions of the input matrix/vector have to be multiples of 4 (except 1).

eerhardt · 2018-08-09T16:01:34Z

src/Microsoft.ML.CpuMath/SseIntrinsics.cs

+            fixed (float* pdst = dst)
+            {
+                float* pDstEnd = pdst + dst.Length;
+                float* pSrcCurrent = psrc;


We shouldn't need pSrcCurrent and pDstCurrent, right? We can just use psrc and pdst here.

This time, it may be different from the case in the matrix-op function (e.g. MatMulA). Since here psrc is fixed to point to src in the fixed statement, psrc cannot be modified or incremented inside the fixed block. I assigned a new variable pSrcCurrent the value of psrc, so that we can increment pSrcCurrent to achieve the desired behavior.

eerhardt · 2018-08-09T16:06:05Z

src/Microsoft.ML.CpuMath/SseIntrinsics.cs

@@ -340,13 +850,148 @@ internal static unsafe float SumAbsU(Span<float> src)
                    pSrcCurrent += 4;
                }

-                result = VectorSum(in result);
+                result = VectorSum(result);


It may be nice to keep the in here and below. It is good documentation for the reader.

It is not just good documentation, it can result in different code being emitted for certain cases.

You should always specify in for a variable that is passed that way, IMO.

eerhardt · 2018-08-09T16:08:51Z

src/Microsoft.ML.CpuMath/SseIntrinsics.cs

+            Vector128<float> result = Sse.SetZeroVector128();
+            Vector128<float> mask;
+
+            if (Sse2.IsSupported)


Maybe refactor this common chunk out into a method GetAbsMask or something similar.

eerhardt · 2018-08-09T16:10:54Z

src/Microsoft.ML.CpuMath/SseIntrinsics.cs

+                    pSrcCurrent += 4;
+                }
+
+                Vector128<float> x1 = Sse.Shuffle(result, result, 0xB1);


These 4 lines are duplicated with the other Max method. Can we extract a common method here?

Agree and thanks!

Should I make SumAbsU call SumAbsDiffU and MaxAbsU call MaxAbsDiffU with mean == 0f?

No, I wouldn't do it that way. If you did it that way, you always would need to emit Subtract(0) calls, which would slow the function down.

What I mean is just these 4 duplicated lines that take the result, and get the MaxScalar value from it. Similar to the VectorSum method.

Thanks! Yes, I share the same thought, and have been on the same page with you. Wanted to see if it could be a way to reduce similar-looking codes :)

eerhardt · 2018-08-09T16:13:10Z

src/Microsoft.ML.CpuMath/SseIntrinsics.cs

+            {
+                float* pdst = Ptr(dst, pDstStart);
+
+                // REVIEW NEEDED: This line expands to (void)(c); but is it necessary?


This is not necessary in C#. It is used in C++ to get the compiler to stop complaining that c is only used in an Assert. (It complains in Release builds becuase c is never used, because the assert gets compiled out.)

Thanks - I will remove these two comments. Same for below.

eerhardt · 2018-08-09T16:32:47Z

src/Microsoft.ML.CpuMath/SseIntrinsics.cs

+
+                Vector128<float> xPrimal = Sse.SetAllVector128(primalUpdate);
+
+                Vector128<float> signMask = Sse.SetAllVector128(-0.0f); // 1000 0000 ...


(nit) I know this is ported from the C++ code, but I think the comment is slightly confusing/misleading. typically I would expect hexadecimal representation here 0x8000 0000. I guess this is binary representation, which is why it is using the ....

it may make sense to call this out as binary using the 0b prefix. Or my preference would be just to use hex - 0x8000 0000.

eerhardt · 2018-08-09T16:35:37Z

src/Microsoft.ML.CpuMath/SseIntrinsics.cs

+                    Vector128<float> xDst1 = Load4(pdst1, pIdxCurrent);
+                    xDst1 = Sse.Add(xDst1, Sse.Multiply(xSrc, xPrimal));
+
+                    Vector128<float> xSign = Sse.And(xDst1, signMask); // result = 10000... if xDst1 is negative or 00000 otherwise


I think this chunk of code can be refactored out into a method, since it is duplicated with the above method.

…g soon)

briancylui · 2018-08-09T22:45:09Z

Responded to all PR feedback so far. Relevant changes are included in the last two commits.

Now ready for a second round of review if any.

eerhardt

This looks good to me. Thanks for the good work, @briancylui.

I'd like to get @safern and @tannergooding to have a look as well. @safern and @tannergooding, when you approve, can you merge this? I will be out for the next 2 working days.

helloguo · 2018-08-10T15:41:06Z

@briancylui

I see the code usually uses while (pDstCurrent + 4 <= pDstEnd) for the loop bound checking both in this PR and PR #562, do you mind changing it to while (pDstCurrent <= pDstEnd - 4)?

Take DotU as an example, when using while (pSrcCurrent + 4 <= pEnd), the code gen looks like this:

0x7ffbc7fae07c		Block 10:										
0x7ffbc7fae07c		vmovups xmm0, xmmword ptr [rdx]	
0x7ffbc7fae081		vmovups xmm1, xmmword ptr [rax]	
0x7ffbc7fae086		vmulps xmm0, xmm0, xmm1	
0x7ffbc7fae08b		vaddps xmm6, xmm6, xmm0	
0x7ffbc7fae090		add rdx, 0x10	
0x7ffbc7fae094		add rax, 0x10	
0x7ffbc7fae098		lea r8, ptr [rdx+0x10]	
0x7ffbc7fae09c		cmp r8, rcx										
0x7ffbc7fae09f		jbe 0x7ffbc7fae07c <Block 10>

When using while (pSrcCurrent <= pEnd - 4), the code gen looks like this:

0x7ffbc7f8e07c		Block 10:										
0x7ffbc7f8e07c		vmovups xmm0, xmmword ptr [rdx]	
0x7ffbc7f8e081		vmovups xmm1, xmmword ptr [rax]	
0x7ffbc7f8e086		vmulps xmm0, xmm0, xmm1	
0x7ffbc7f8e08b		vaddps xmm6, xmm6, xmm0	
0x7ffbc7f8e090		add rdx, 0x10	
0x7ffbc7f8e094		add rax, 0x10	
0x7ffbc7f8e098		cmp rdx, r8	
0x7ffbc7f8e09b		jbe 0x7ffbc7f8e07c <Block 10>

So the second one saves the instruction lea r8, ptr [rdx+0x10], which is redundant with add rdx, 0x10.

tannergooding · 2018-08-10T16:03:09Z

@helloguo, I believe that is covered in the existing notes on briancylui#2

The initial port is basically a direct translation of the C++ code, and the refactorings/cleanup are going to come in a separate PR (for the most part).

briancylui · 2018-08-10T16:23:40Z

@helloguo Thank you for your comment. At this point, we aim for a clean port of the original native code, from which we will obtain a baseline perf. All suggestions for future follow-up are documented in briancylui#2, where there are some suggestions about optimizing the loops. I have added your suggestion into the issue page. After implementing the new optimizations in a separate PR, we will be able to obtain new perf results and compare them against the baseline.

safern · 2018-08-10T21:29:01Z

src/Microsoft.ML.CpuMath/CpuMathUtils.netcoreapp.cs

+            {
+                if (!tran)
+                {
+                    Contracts.Assert(crun >= 0);


since this assert is in all the paths, we can move it out to the common path, right?

safern · 2018-08-10T21:30:39Z

src/Microsoft.ML.CpuMath/CpuMathUtils.netcoreapp.cs

+            {
+                if (!tran)
+                {
+                    Contracts.Assert(crun >= 0);


safern · 2018-08-10T21:31:40Z

src/Microsoft.ML.CpuMath/CpuMathUtils.netcoreapp.cs

+            Add(a, new Span<float>(dst, 0, count));
+        }
+
+        // dst += a


Nit: I think it is pretty clear what this is doing. Maybe this comment can be deleted.

safern · 2018-08-10T21:35:59Z

src/Microsoft.ML.CpuMath/CpuMathUtils.netcoreapp.cs

+                for (int i = 0; i < cindices; ++i)
+                {
+                    int index = pidx[i];
+                    Contracts.Assert(0 <= index && index < c);


Nit: more asserts using && that can be splitted in this file.

safern · 2018-08-10T21:36:06Z

src/Microsoft.ML.CpuMath/CpuMathUtils.netcoreapp.cs

+                for (int i = 0; i < cindices; ++i)
+                {
+                    int index = pidx[i];
+                    Contracts.Assert(0 <= index && index < c);


safern · 2018-08-10T21:36:13Z

src/Microsoft.ML.CpuMath/CpuMathUtils.netcoreapp.cs

+                        ivLogLim = ivLogMin + ccol;
+                        ivPhyMin = row * cfltRow;
+
+                        Contracts.Assert(ivLogMin <= index && index < ivLogLim);


safern · 2018-08-10T21:40:18Z

src/Microsoft.ML.CpuMath/SseIntrinsics.cs

+        [MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
+        private static Vector128<float> GetAbsMask()
+        {
+            return (Sse2.IsSupported) ?


Nit: you don't need to wrap Sse2.IsSupported in ()

safern

LGTM.

briancylui added 4 commits August 6, 2018 14:42

Implemented all remaining active SSE intrinsics

5cd4918

Moved CpuMathUtils' architecture-dependent members into SseIntrinsics

079dd48

Implemented all new referenced SSE intrinsics, with software fallback…

5e1854d

…s, passing unit tests, and performance tests Note: Performance tests for functions that involve AlignedArray are not implemented.

Minor clean-up before submitting PR

81d0c29

tannergooding reviewed Aug 9, 2018

View reviewed changes

Minor changes

02bfbe6

safern reviewed Aug 9, 2018

View reviewed changes

eerhardt reviewed Aug 9, 2018

View reviewed changes

tannergooding reviewed Aug 9, 2018

View reviewed changes

eerhardt reviewed Aug 9, 2018

View reviewed changes

briancylui added 2 commits August 9, 2018 15:08

Respond to PR feedback, except for implementing new unit tests (comin…

be3281d

…g soon)

Respond to PR feedback: Implemented new unit tests

17b2f79

eerhardt approved these changes Aug 9, 2018

View reviewed changes

briancylui mentioned this pull request Aug 10, 2018

Suggestions on CpuMath enhancement briancylui/machinelearning#2

Open

14 tasks

tannergooding approved these changes Aug 10, 2018

View reviewed changes

safern reviewed Aug 10, 2018

View reviewed changes

Respond to PR feedback: Style changes

ef979b1

safern approved these changes Aug 10, 2018

View reviewed changes

safern closed this Aug 10, 2018

safern reopened this Aug 10, 2018

safern merged commit b51d9f9 into dotnet:master Aug 10, 2018

briancylui mentioned this pull request Aug 18, 2018

Port all relevant AVX hardware intrinsics C# APIs from SIMD native algorithms #691

Merged

This was referenced Sep 5, 2018

Suggestions on CpuMath Enhancement #823

Open

CpuMath Enhancement: Make bound checking of loops in hardware intrinsics more efficient #835

Open

ghost locked as resolved and limited conversation to collaborators Mar 29, 2022

	[Conditional("DEBUG")]
	public static void AssertNonEmpty<T>(ICollection<T> args)
	{
	if (Size(args) == 0)
	DbgFail();
	}

	public static float SumAbs(float mean, float[] src, int offset, int count)
	{
	Contracts.AssertNonEmpty(src);
	Contracts.Assert(0 < count);
	Contracts.Assert(0 <= offset && offset <= src.Length - count);

	unsafe
	{
	fixed (float* psrc = &src[offset])
	return (mean == 0 ? Thunk.SumAbsU(psrc, count) : Thunk.SumAbsDiffU(mean, psrc, count));
	}
	}

	EXPORT_API(void) MatMulA(bool add, _In_ const float * pmat, _In_ const float * psrc, _Inout_ float * pdst, int crow, int ccol)
	{
	const float * psLim = psrc + ccol;
	const float * pdLim = pdst + crow;
	const float * pm = pmat;
	for (float * pd = pdst; pd < pdLim; pd += 4, pm += 3 * ccol)
	{
	__m128 res0 = _mm_setzero_ps();
	__m128 res1 = res0;
	__m128 res2 = res0;
	__m128 res3 = res0;
	for (const float * ps = psrc; ps < psLim; ps += 4, pm += 4)
	{
	const float * pmTmp;
	__m128 x01 = _mm_load_ps(pmTmp = pm);
	__m128 x11 = _mm_load_ps(pmTmp += ccol);
	__m128 x21 = _mm_load_ps(pmTmp += ccol);
	__m128 x31 = _mm_load_ps(pmTmp += ccol);
	__m128 x02 = _mm_load_ps(ps);
	x01 = _mm_mul_ps(x01, x02);
	x11 = _mm_mul_ps(x11, x02);
	x21 = _mm_mul_ps(x21, x02);
	x31 = _mm_mul_ps(x31, x02);
	res0 = _mm_add_ps(res0, x01);
	res1 = _mm_add_ps(res1, x11);
	res2 = _mm_add_ps(res2, x21);
	res3 = _mm_add_ps(res3, x31);
	}

	// Add up the entries of each, with the 4 results in res0
	res0 = _mm_hadd_ps(res0, res1);
	res2 = _mm_hadd_ps(res2, res3);
	res0 = _mm_hadd_ps(res0, res2);

	if (add)
	res0 = _mm_add_ps(res0, _mm_load_ps(pd));
	_mm_store_ps(pd, res0);
	}
	}


		Vector128<float> xPrimal = Sse.SetAllVector128(primalUpdate);

		Vector128<float> signMask = Sse.SetAllVector128(-0.0f); // 1000 0000 ...

Port all active C# hardware intrinsics APIs for SSE from SIMD native algorithms #668

Port all active C# hardware intrinsics APIs for SSE from SIMD native algorithms #668

Conversation

briancylui commented Aug 8, 2018 • edited Loading

What's new:

Description from #562:

Structure:

Changes to users:

Future follow-up for CpuMath enhancement

List of new SSE intrinisics implemented

briancylui commented Aug 8, 2018

briancylui commented Aug 8, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

safern Aug 9, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

briancylui Aug 9, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

eerhardt Aug 9, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

briancylui commented Aug 9, 2018

eerhardt left a comment

Choose a reason for hiding this comment

helloguo commented Aug 10, 2018

tannergooding commented Aug 10, 2018

briancylui commented Aug 10, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

safern left a comment

Choose a reason for hiding this comment

briancylui commented Aug 8, 2018 •

edited

Loading

Future follow-up for `CpuMath` enhancement

safern Aug 9, 2018 •

edited

Loading

briancylui Aug 9, 2018 •

edited

Loading

eerhardt Aug 9, 2018 •

edited

Loading