Skip to content

Port all active C# hardware intrinsics APIs for SSE from SIMD native algorithms #668

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Aug 10, 2018

Conversation

briancylui
Copy link
Contributor

@briancylui briancylui commented Aug 8, 2018

What's new:

  1. Implemented all remaining active SSE intrinsics, including their software fallbacks and passing unit tests
  2. Implemented the performance tests of all remaining active SSE intrinsics, except for those that accept AlignedArray as an argument
  3. Performance test results for all applicable, active SSE intrinsics are updated in Progress on porting ML.NET native SIMD algorithms to managed code briancylui/machinelearning#1

Description from #562:

  • with unit tests and performance tests for two target frameworks: .NET Core App 3.0 and .NET Standard 2.0.
  • .NET Core App 3.0 gets the new C# hardware intrinsics APIs, while .NET Standard 2.0 gets the original native SIMD algorithms.
  • Several things have made this multi-targeting feature possible.
    1. The new CpuMathUtils class that contains the new APIs is implemented as a partial class with method definitions split into two separate files (src\Microsoft.ML.CpuMath\CpuMathUtils.[target].cs), only one of which is compiled based on the target framework.
    2. The Microsoft.ML.CpuMath.csproj file makes the switching decision to compile the right files based on the target framework.

Structure:

  1. All new hardware intrinsics APIs are contained in src\Microsoft.ML.CpuMath.
  2. Unit tests for the two target frameworks live in test\Microsoft.ML.CpuMath.UnitTests.[target], and contain the same content except for the target framework in .csproj.
  3. Performance tests live in test\Microsoft.ML.CpuMath.PerformanceTests.

Changes to users:

  1. Originally, users call intrinsics APIs via the SseUtils class in src\Microsoft.ML.CpuMath\Sse.cs, but now they call them via the new CpuMathUtils class, which will handle switching between SSE and AVX in the future.
  2. CpuMathUtils methods and SseUtils methods share the same signature, but CpuMathUtils methods additionally call a new helper class (SseIntrinsics) for C# implementations of SSE operations.

Future follow-up for CpuMath enhancement

  1. Suggestions on CpuMath enhancement in this PR scheduled for future follow-ups have been compiled into an issue page (Suggestions on CpuMath enhancement briancylui/machinelearning#2).

List of new SSE intrinisics implemented

• MatMulA
• MatMulTranA
• MatMulPA
• MatMulTranPA
• SdcaL1UpdateU
• SdcaL1UpdateSU
• AddScaleCopyU
• SumU
• AddScalarU
• SumSqDiffU
• SumAbsDiffU
• MaxAbsDiffU
• MaxAbsU
• ScaleSrcU
• ScaleAddU
• ZeroItemsU
• ZeroMatrixItemsCoreU

cc: @eerhardt @tannergooding @safern @danmosemsft

@briancylui
Copy link
Contributor Author

test OSX10.13 Debug please

@briancylui
Copy link
Contributor Author

All 5 checks have passed. No merge conflicts.

@@ -55,7 +78,7 @@ private static unsafe void Store4(Vector128<float> x, float* dst, int* idx)
}

[MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
private static Vector128<float> VectorSum(in Vector128<float> vector)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why drop in?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch - amended thanks!

{
if (!tran)
{
Contracts.Assert(0 <= crun && crun <= dst.Size);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels odd to me, it feels more readable to do crun >= 0 -- also, you could do a common Contracts.Assert for that, it seems like in all the blocks you're doing 0 <= crun

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is also generally useful to split && asserts into two separate asserts, as it allows easy diagnosis on failure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your feedback! Tanner also raised similar concerns about Contracts.Asserts like count > 0. I have adopted these changes and documented them in briancylui#2.

{
if (!tran)
{
Contracts.Assert(0 <= crun && crun <= dst.Size);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here also, and all the places where this can be unified and change the comparison to be crun >= 0

public static void Add(float a, float[] dst, int count)
{
Contracts.AssertNonEmpty(dst);
Contracts.Assert(0 < count);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

count > 0? I know it is just the order, but we usually read from left to right, so if you change the sides it is read differently. (0 is smaller than count -> count is greater than 0)

// dst = a * src
public static void Scale(float a, float[] src, float[] dst, int count)
{
Contracts.AssertNonEmpty(src);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If AssertNonEmpty just checks for the length to be greater than 0 then this doesn't feel necessary to me, since you're asserting below that 0 < count and then that count <= src.Length, which automatically achieves that. If count is greater than 0 then it means src.Length must be greater than 0 as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is true that 0 < count && count <= src.Length already takes care of checking if src is empty, and Contracts.AssertNonEmpty<T> doesn't seem to print any useful error messages here:

[Conditional("DEBUG")]
public static void AssertNonEmpty<T>(ICollection<T> args)
{
if (Size(args) == 0)
DbgFail();
}

It seems to me that we may want to first check whether src is non-empty. If not, the program does not even bother to do the two checks for count. 0 < count && count <= src.Length sounds to me more like a restriction on count given the knowledge that src is a non-empty array. If src were really empty, with the AssertNonEmpty, users could see from the stack trace that the line Contracts.AssertNonEmpty(src) is the problem, but without the AsssertNonEmpty, they would see from the stack trace that the line Contracts.Assert(count <= src.Length) is the problem, which might lead them to check for both count and src.

Copy link
Member

@safern safern Aug 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let's check first wether src is non-empty and then just check if count <= src.Length.

Contracts.Assert(count > 0);
Contracts.AssertNonEmpty(src);
Contracts.Assert(count <= src.Length);

As @tannergooding stated in another comment, let's break out the && asserts.

{
if (Sse.IsSupported)
{
if (mean == 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this whole block can be simplified to:

return mean == 0 ? SseIntrinsics.SumAbsU(src) : SseIntrinsics.SumAbsDiffU(mean, src);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great minds think alike - your way is actucally the preferred way from the original author:

public static float SumAbs(float mean, float[] src, int offset, int count)
{
Contracts.AssertNonEmpty(src);
Contracts.Assert(0 < count);
Contracts.Assert(0 <= offset && offset <= src.Length - count);
unsafe
{
fixed (float* psrc = &src[offset])
return (mean == 0 ? Thunk.SumAbsU(psrc, count) : Thunk.SumAbsDiffU(mean, psrc, count));
}
}

I thought breaking down the ?: operator into an if-else statement would have made it a bit more readable, but I guess the ?: operator looks better in this case. Thank you for bringing this up!

@@ -9,6 +9,174 @@ namespace Microsoft.ML.Runtime.Internal.CpuMath
{
public static partial class CpuMathUtils
{
public static void MatTimesSrc(bool tran, bool add, AlignedArray mat, AlignedArray src, AlignedArray dst, int crun)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(probably for future clean up, like other naming suggestions) tran doesn't really describe what this is. It should probably be transpose.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree and thanks - documented this in briancylui#2.

SseIntrinsics.MatMulTranA(add, mat, src, dst, dst.Size, crun);
}
}
else
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized we don't have any unit tests for these software fallback implementations. I think that is OK for this PR, but it is something to put on your TODO list.

{
Vector128<float> x01 = Sse.LoadAlignedVector128(pSrcCurrent);
// Replicate each slot of x01 into its own register.
Vector128<float> x11 = Sse.Shuffle(x01, x01, 0x55);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A comment about what each of these is extracting would be useful:

  • 0x00 == X
  • 0x55 == Y
  • 0xAA == Z
  • 0xFF == W

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks and adopted the change. May I know if there is any preference/convention for using XYZW instead of ABCD? I used ABCD in comments of private functions earlier thinking that all users can always infer the correct elements from their alphabetical order, but not sure if it's the best way.

{
Contracts.Assert(0 < ccol && ccol <= cfltRow);

// REVIEW NEEDED: Since the two methods below do not involve any SSE hardware intrinsics, no software fallback is needed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not understanding here. If there are no SIMD operations involved in these methods, then let's just have the software function in CpuMathUtils and not have the method at all in SseIntrinsics. SseIntrinsics should be only for when we use the Sse operations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ZeroItemsU and ZeroMatrixItemsCore are two functions in Sse.cpp like all other functions currently in SseIntrinsics.cs, but since even their native implementation does not involve Intel hardware intrinsics like _mm_set_ps at all, there is no need to implement software fallbacks. We require SSE support only for us to call those Intel hardware intrinsics, which are absent in ZeroItemsU and ZeroMatrixItemsCore.

While the implementations of these two functions are software-based, they do involve pointers and follow the C++ syntax in Sse.cpp. I can put these two functions in CpuMathUtils and rewrite them so that they follow the C# syntax.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before I proceed to modify these function, may I know if I have given relevant info, and if this change is still desired?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the right change here is just to move your current code (using pointers) from SseIntrinsics into CpuMathUtils. I assume the reason it is using pointers is for speed. Let's keep it that way until it can be proved we don't need pointers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree - thanks! ZeroItemsU and ZeroMatrixItemsCore call the Ptr method which is private in SseIntrinsics. Should I make Ptr in SseIntrinsics internal instead of Private?

The above change could solve the issue. Another way is to remove the dependence on Ptr in CpuMathUtils:

        private static unsafe void ZeroItemsU(AlignedArray dst, int c, int[] indices, int cindices)
        {
            for (int i = 0; i < cindices; i++)
            {
                int index = indices[i];
                Contracts.Assert(index >= 0);
                Contracts.Assert(index < c);
                dst[index] = 0;
            }

            /* Original version calls the Ptr method, which is private in SseIntrinsics
            fixed (float* pDstStart = &dst.Items[0])
            fixed (int* pidx = &indices[0])
            {
                float* pdst = Ptr(dst, pDstStart);

                for (int i = 0; i < cindices; ++i)
                {
                    int index = pidx[i];
                    Contracts.Assert(0 <= index && index < c);
                    pdst[index] = 0;
                }
            }
            */
        }

        private static unsafe void ZeroMatrixItemsCore(AlignedArray dst, int c, int ccol, int cfltRow, int[] indices, int cindices)
        {
            int ivLogMin = 0;
            int ivLogLim = ccol;
            int ivPhyMin = 0;

            for (int i = 0; i < cindices; i++)
            {
                int index = indices[i];
                Contracts.Assert(index >= 0);
                Contracts.Assert(index < c);

                int col = index - ivLogMin;
                if ((uint)col >= (uint)ccol)
                {
                    Contracts.Assert(index < ivLogMin || index >= ivLogLim);

                    int row = index / ccol;
                    ivLogMin = row * ccol;
                    ivLogLim = ivLogMin + ccol;
                    ivPhyMin = row * cfltRow;

                    Contracts.Assert(index >= ivLogMin);
                    Contracts.Assert(index < ivLogLim);
                    col = index - ivLogMin;
                }

                dst[ivPhyMin + col] = 0;
            }

            /* Original version calls the Ptr method, which is private in SseIntrinsics
            fixed (float* pDstStart = &dst.Items[0])
            fixed (int* pidx = &indices[0])
            {
                float* pdst = Ptr(dst, pDstStart);

                int ivLogMin = 0;
                int ivLogLim = ccol;
                int ivPhyMin = 0;

                for (int i = 0; i < cindices; ++i)
                {
                    int index = pidx[i];
                    Contracts.Assert(0 <= index && index < c);

                    int col = index - ivLogMin;
                    if ((uint)col >= (uint)ccol)
                    {
                        Contracts.Assert(ivLogMin > index || index >= ivLogLim);

                        int row = index / ccol;
                        ivLogMin = row * ccol;
                        ivLogLim = ivLogMin + ccol;
                        ivPhyMin = row * cfltRow;

                        Contracts.Assert(ivLogMin <= index && index < ivLogLim);
                        col = index - ivLogMin;
                    }

                    pdst[ivPhyMin + col] = 0;
                }
            }
            */
        }

Please let me know which approach I should follow (the former or latter).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're "latter" approach isn't using pointers at all.

I was thinking we wouldn't need to call Ptr at all. That method moves the array around in memory so it is guaranteed to be aligned. Since we aren't using SIMD instructions to set all the values to 0, we can just pin the arrays using

fixed (float* pDstStart = &dst.Items[0])
fixed (int* pidx = &indices[0])

And then continue with the logic without calling Ptr.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! I think this is doable, but pdst[index] = 0 in ZeroItemsU and pdst[ivPhyMin + col] = 0 will be affected if we do not call Ptr and increment pDstStart to get the pointer pdst that points to the base of dst. How should I proceed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just use the pointer you have when you pinned the array:

        internal static unsafe void ZeroItemsU(AlignedArray dst, int c, int[] indices, int cindices)
        {
            fixed (float* pdst = &dst.Items[0])
            fixed (int* pidx = &indices[0])
            {
                for (int i = 0; i < cindices; ++i)
                {
                    int index = pidx[i];
                    Contracts.Assert(0 <= index && index < c);
                    pdst[index] = 0;
                }
            }
        }


private float[] src, dst, original, src1, src2;
// Naming follows from SseIntrinsics.
private const int CbAlign = 16;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't appear to be used.

Copy link
Contributor Author

@briancylui briancylui Aug 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch - thanks! I originally implemented perf tests for the intrinsics that involved matrix operations and AlignedArray. It was a remnant of a clean-up of unused constants and variables.

Sse.StoreAligned(pDstCurrent, res0);

pDstCurrent += 4;
pMatCurrent += 3 * ccol;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you figure out why this is 3 * ccol?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe because here it is adding the ccol 3 times to pMatTemp which is a copy of pMatCurrent and at the end it needs to reset pMatCurrent to the right position?

https://github.com/dotnet/machinelearning/pull/668/files/02bfbe6c0f2f8d12ef7ef37006557d4f0616a054#diff-1212867e04546a8e692d4e263dcc390cR130

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree and thanks! We may get a slightly better view from the native code here:

EXPORT_API(void) MatMulA(bool add, _In_ const float * pmat, _In_ const float * psrc, _Inout_ float * pdst, int crow, int ccol)
{
const float * psLim = psrc + ccol;
const float * pdLim = pdst + crow;
const float * pm = pmat;
for (float * pd = pdst; pd < pdLim; pd += 4, pm += 3 * ccol)
{
__m128 res0 = _mm_setzero_ps();
__m128 res1 = res0;
__m128 res2 = res0;
__m128 res3 = res0;
for (const float * ps = psrc; ps < psLim; ps += 4, pm += 4)
{
const float * pmTmp;
__m128 x01 = _mm_load_ps(pmTmp = pm);
__m128 x11 = _mm_load_ps(pmTmp += ccol);
__m128 x21 = _mm_load_ps(pmTmp += ccol);
__m128 x31 = _mm_load_ps(pmTmp += ccol);
__m128 x02 = _mm_load_ps(ps);
x01 = _mm_mul_ps(x01, x02);
x11 = _mm_mul_ps(x11, x02);
x21 = _mm_mul_ps(x21, x02);
x31 = _mm_mul_ps(x31, x02);
res0 = _mm_add_ps(res0, x01);
res1 = _mm_add_ps(res1, x11);
res2 = _mm_add_ps(res2, x21);
res3 = _mm_add_ps(res3, x31);
}
// Add up the entries of each, with the 4 results in res0
res0 = _mm_hadd_ps(res0, res1);
res2 = _mm_hadd_ps(res2, res3);
res0 = _mm_hadd_ps(res0, res2);
if (add)
res0 = _mm_add_ps(res0, _mm_load_ps(pd));
_mm_store_ps(pd, res0);
}
}

pMatCurrent corresponds to pm in the native code in the window. For correctness, pm needs to be incremented by 4 * ccol in each iteration of the outer for loop. We see that at the end of such iteration, pm is incremented by 3 * ccol only - this is because the inner for loop increments pm by 4 in each of its own iteration. SInce the inner for loop runs for (psLim - psrc) / 4 = ccol / 4 iterations, the entire inner for loop increments pm by (ccol / 4) * 4 = ccol. Together with the increment by 3 * ccol done by the outer for loop, pm is incremented by 4 * ccol in each iteration of the outer for loop, which is the desired behavior.


float* pSrcEnd = psrc + ccol;
float* pDstEnd = pdst + crow;
float* pSrcCurrent = psrc;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need another set of pointers here? Can't we just use psrc?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think there is a bug here. Shouldn't pSrcCurrent get reset to psrc after each inner loop?

From the C++ code:

        for (const float * ps = psrc; ps < psLim; ps += 4, pm += 4)
        {
            const float * pmTmp;
            __m128 x01 = _mm_load_ps(pmTmp = pm);
            __m128 x11 = _mm_load_ps(pmTmp += ccol);
            __m128 x21 = _mm_load_ps(pmTmp += ccol);
            __m128 x31 = _mm_load_ps(pmTmp += ccol);
            __m128 x02 = _mm_load_ps(ps);

Maybe we need a few more unit tests to cover different shapes of matrices. We only have 4x4 and 4x8 tests.


In reply to: 208976581 [](ancestors = 208976581)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reply to the first comment:
In this case, yes, we can use psrc directly and feel free to increment it in the for loops without having a pSrcCurrent, since we don't have to keep track of the start of the array (adjusted for the base of the AlignedArray) in this function. The same goes for dst and mat.

In other functions that do not involve matrix multiplication, I have adopted the convention to only increment pXCurrent, while using pX to indicate a fixed pointer that indicates the start of the array X. Should I skip using pXCurrent just for these 4-6 functions that involve matrix operations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reply to the second comment:

Nice catch - thank you! I have looked at all 4 functions that involve matrix multiplications to ensure the same thing doesn't happen elsewhere in these 4 functions.

Since the dst array has a constant size of 4, the outer for loop never has a second iteration, and thus the 2 unit tests implemented for each matrix-op function did not manage to catch this bug. I will probably add a 8x4 test.

Note: These matrix-op functions have the suffix A, which means that the inputs are "aligned and padded for SSE operations," so both dimensions of the input matrix/vector have to be multiples of 4 (except 1).

fixed (float* pdst = dst)
{
float* pDstEnd = pdst + dst.Length;
float* pSrcCurrent = psrc;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't need pSrcCurrent and pDstCurrent, right? We can just use psrc and pdst here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This time, it may be different from the case in the matrix-op function (e.g. MatMulA). Since here psrc is fixed to point to src in the fixed statement, psrc cannot be modified or incremented inside the fixed block. I assigned a new variable pSrcCurrent the value of psrc, so that we can increment pSrcCurrent to achieve the desired behavior.

@@ -340,13 +850,148 @@ internal static unsafe float SumAbsU(Span<float> src)
pSrcCurrent += 4;
}

result = VectorSum(in result);
result = VectorSum(result);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be nice to keep the in here and below. It is good documentation for the reader.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not just good documentation, it can result in different code being emitted for certain cases.

You should always specify in for a variable that is passed that way, IMO.

Vector128<float> result = Sse.SetZeroVector128();
Vector128<float> mask;

if (Sse2.IsSupported)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe refactor this common chunk out into a method GetAbsMask or something similar.

pSrcCurrent += 4;
}

Vector128<float> x1 = Sse.Shuffle(result, result, 0xB1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 4 lines are duplicated with the other Max method. Can we extract a common method here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree and thanks!

Should I make SumAbsU call SumAbsDiffU and MaxAbsU call MaxAbsDiffU with mean == 0f?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I wouldn't do it that way. If you did it that way, you always would need to emit Subtract(0) calls, which would slow the function down.

What I mean is just these 4 duplicated lines that take the result, and get the MaxScalar value from it. Similar to the VectorSum method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Yes, I share the same thought, and have been on the same page with you. Wanted to see if it could be a way to reduce similar-looking codes :)

{
float* pdst = Ptr(dst, pDstStart);

// REVIEW NEEDED: This line expands to (void)(c); but is it necessary?
Copy link
Member

@eerhardt eerhardt Aug 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not necessary in C#. It is used in C++ to get the compiler to stop complaining that c is only used in an Assert. (It complains in Release builds becuase c is never used, because the assert gets compiled out.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - I will remove these two comments. Same for below.


Vector128<float> xPrimal = Sse.SetAllVector128(primalUpdate);

Vector128<float> signMask = Sse.SetAllVector128(-0.0f); // 1000 0000 ...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) I know this is ported from the C++ code, but I think the comment is slightly confusing/misleading. typically I would expect hexadecimal representation here 0x8000 0000. I guess this is binary representation, which is why it is using the ....

it may make sense to call this out as binary using the 0b prefix. Or my preference would be just to use hex - 0x8000 0000.

Vector128<float> xDst1 = Load4(pdst1, pIdxCurrent);
xDst1 = Sse.Add(xDst1, Sse.Multiply(xSrc, xPrimal));

Vector128<float> xSign = Sse.And(xDst1, signMask); // result = 10000... if xDst1 is negative or 00000 otherwise
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this chunk of code can be refactored out into a method, since it is duplicated with the above method.

@briancylui
Copy link
Contributor Author

Responded to all PR feedback so far. Relevant changes are included in the last two commits.

Now ready for a second round of review if any.

Copy link
Member

@eerhardt eerhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. Thanks for the good work, @briancylui.

I'd like to get @safern and @tannergooding to have a look as well. @safern and @tannergooding, when you approve, can you merge this? I will be out for the next 2 working days.

@helloguo
Copy link

@briancylui

I see the code usually uses while (pDstCurrent + 4 <= pDstEnd) for the loop bound checking both in this PR and PR #562, do you mind changing it to while (pDstCurrent <= pDstEnd - 4)?

Take DotU as an example, when using while (pSrcCurrent + 4 <= pEnd), the code gen looks like this:

0x7ffbc7fae07c		Block 10:										
0x7ffbc7fae07c		vmovups xmm0, xmmword ptr [rdx]	
0x7ffbc7fae081		vmovups xmm1, xmmword ptr [rax]	
0x7ffbc7fae086		vmulps xmm0, xmm0, xmm1	
0x7ffbc7fae08b		vaddps xmm6, xmm6, xmm0	
0x7ffbc7fae090		add rdx, 0x10	
0x7ffbc7fae094		add rax, 0x10	
0x7ffbc7fae098		lea r8, ptr [rdx+0x10]	
0x7ffbc7fae09c		cmp r8, rcx										
0x7ffbc7fae09f		jbe 0x7ffbc7fae07c <Block 10>	 

When using while (pSrcCurrent <= pEnd - 4), the code gen looks like this:

0x7ffbc7f8e07c		Block 10:										
0x7ffbc7f8e07c		vmovups xmm0, xmmword ptr [rdx]	
0x7ffbc7f8e081		vmovups xmm1, xmmword ptr [rax]	
0x7ffbc7f8e086		vmulps xmm0, xmm0, xmm1	
0x7ffbc7f8e08b		vaddps xmm6, xmm6, xmm0	
0x7ffbc7f8e090		add rdx, 0x10	
0x7ffbc7f8e094		add rax, 0x10	
0x7ffbc7f8e098		cmp rdx, r8	
0x7ffbc7f8e09b		jbe 0x7ffbc7f8e07c <Block 10>	

So the second one saves the instruction lea r8, ptr [rdx+0x10], which is redundant with add rdx, 0x10.

@tannergooding
Copy link
Member

@helloguo, I believe that is covered in the existing notes on briancylui#2

The initial port is basically a direct translation of the C++ code, and the refactorings/cleanup are going to come in a separate PR (for the most part).

@briancylui
Copy link
Contributor Author

@helloguo Thank you for your comment. At this point, we aim for a clean port of the original native code, from which we will obtain a baseline perf. All suggestions for future follow-up are documented in briancylui#2, where there are some suggestions about optimizing the loops. I have added your suggestion into the issue page. After implementing the new optimizations in a separate PR, we will be able to obtain new perf results and compare them against the baseline.

{
if (!tran)
{
Contracts.Assert(crun >= 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this assert is in all the paths, we can move it out to the common path, right?

{
if (!tran)
{
Contracts.Assert(crun >= 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

Add(a, new Span<float>(dst, 0, count));
}

// dst += a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think it is pretty clear what this is doing. Maybe this comment can be deleted.

for (int i = 0; i < cindices; ++i)
{
int index = pidx[i];
Contracts.Assert(0 <= index && index < c);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: more asserts using && that can be splitted in this file.

for (int i = 0; i < cindices; ++i)
{
int index = pidx[i];
Contracts.Assert(0 <= index && index < c);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

ivLogLim = ivLogMin + ccol;
ivPhyMin = row * cfltRow;

Contracts.Assert(ivLogMin <= index && index < ivLogLim);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

[MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
private static Vector128<float> GetAbsMask()
{
return (Sse2.IsSupported) ?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: you don't need to wrap Sse2.IsSupported in ()

Copy link
Member

@safern safern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants