Port all active C# hardware intrinsics APIs for SSE from SIMD native algorithms #668
…s, passing unit tests, and performance tests Note: Performance tests for functions that involve AlignedArray are not implemented.
test OSX10.13 Debug please
All 5 checks have passed. No merge conflicts.
@@ -55,7 +78,7 @@ private static unsafe void Store4(Vector128<float> x, float* dst, int* idx)
}

[MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
private static Vector128<float> VectorSum(in Vector128<float> vector)
Why drop `in`?
Nice catch - amended thanks!
{
    if (!tran)
    {
        Contracts.Assert(0 <= crun && crun <= dst.Size);
This feels odd to me; `crun >= 0` reads more naturally. Also, you could use one common `Contracts.Assert` for that, since every block seems to check `0 <= crun`.
It is also generally useful to split `&&` asserts into two separate asserts, as it allows easy diagnosis on failure.
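A minimal sketch of the split-assert suggestion, using `System.Diagnostics.Debug.Assert` as a stand-in for `Contracts.Assert` so the example is self-contained. The `CheckRun` helper and its parameter names are hypothetical and not part of the PR.

```csharp
using System;
using System.Diagnostics;

public static class AssertStyle
{
    // Hypothetical helper: validates a run count against a destination size.
    public static void CheckRun(int crun, int dstSize)
    {
        // Combined form: a failure does not say which half was violated.
        // Debug.Assert(0 <= crun && crun <= dstSize);

        // Split form: the failing line identifies the violated condition.
        Debug.Assert(crun >= 0, "crun must be non-negative");
        Debug.Assert(crun <= dstSize, "crun must not exceed the destination size");
    }

    public static void Main()
    {
        CheckRun(4, 8); // both conditions hold, so neither assert fires
        Console.WriteLine("ok");
    }
}
```

With the split form, the stack trace of a failing assert points at the exact condition that did not hold.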
Thank you for your feedback! Tanner also raised similar concerns about `Contracts.Assert` calls like `count > 0`. I have adopted these changes and documented them in briancylui#2.
{
    if (!tran)
    {
        Contracts.Assert(0 <= crun && crun <= dst.Size);
Same here, and in all the other places where this check can be unified; please change the comparison to `crun >= 0`.
public static void Add(float a, float[] dst, int count)
{
    Contracts.AssertNonEmpty(dst);
    Contracts.Assert(0 < count);
`count > 0`? I know it is just the order, but we usually read from left to right, so swapping the sides changes how it reads ("0 is smaller than count" versus "count is greater than 0").
// dst = a * src
public static void Scale(float a, float[] src, float[] dst, int count)
{
    Contracts.AssertNonEmpty(src);
If AssertNonEmpty just checks for the length to be greater than 0 then this doesn't feel necessary to me, since you're asserting below that 0 < count and then that count <= src.Length, which automatically achieves that. If count is greater than 0 then it means src.Length must be greater than 0 as well.
It is true that `0 < count && count <= src.Length` already takes care of checking whether `src` is empty, and `Contracts.AssertNonEmpty<T>` doesn't seem to print any useful error messages here:
machinelearning/src/Microsoft.ML.Core/Utilities/Contracts.cs
Lines 922 to 927 in d0664c1
[Conditional("DEBUG")]
public static void AssertNonEmpty<T>(ICollection<T> args)
{
    if (Size(args) == 0)
        DbgFail();
}
It seems to me that we may want to first check whether `src` is non-empty. If it is empty, the program does not even bother to do the two checks for `count`. `0 < count && count <= src.Length` sounds to me more like a restriction on `count` given the knowledge that `src` is a non-empty array. If `src` were really empty, with the `AssertNonEmpty`, users could see from the stack trace that the line `Contracts.AssertNonEmpty(src)` is the problem, but without the `AssertNonEmpty`, they would see from the stack trace that the line `Contracts.Assert(count <= src.Length)` is the problem, which might lead them to check both `count` and `src`.
Yes, let's first check whether `src` is non-empty and then just check that `count <= src.Length`.
Contracts.Assert(count > 0);
Contracts.AssertNonEmpty(src);
Contracts.Assert(count <= src.Length);
As @tannergooding stated in another comment, let's break out the && asserts.
{
    if (Sse.IsSupported)
    {
        if (mean == 0)
this whole block can be simplified to:
return mean == 0 ? SseIntrinsics.SumAbsU(src) : SseIntrinsics.SumAbsDiffU(mean, src);
Great minds think alike - your way is actually the preferred way from the original author:
machinelearning/src/Microsoft.ML.CpuMath/Sse.cs
Lines 1070 to 1081 in d0664c1
public static float SumAbs(float mean, float[] src, int offset, int count)
{
    Contracts.AssertNonEmpty(src);
    Contracts.Assert(0 < count);
    Contracts.Assert(0 <= offset && offset <= src.Length - count);
    unsafe
    {
        fixed (float* psrc = &src[offset])
            return (mean == 0 ? Thunk.SumAbsU(psrc, count) : Thunk.SumAbsDiffU(mean, psrc, count));
    }
}
I thought breaking down the `?:` operator into an if-else statement would make it a bit more readable, but I guess the `?:` operator looks better in this case. Thank you for bringing this up!
@@ -9,6 +9,174 @@ namespace Microsoft.ML.Runtime.Internal.CpuMath
{
    public static partial class CpuMathUtils
    {
        public static void MatTimesSrc(bool tran, bool add, AlignedArray mat, AlignedArray src, AlignedArray dst, int crun)
(Probably for future cleanup, like other naming suggestions.) `tran` doesn't really describe what this is. It should probably be `transpose`.
Agree and thanks - documented this in briancylui#2.
        SseIntrinsics.MatMulTranA(add, mat, src, dst, dst.Size, crun);
    }
}
else
I just realized we don't have any unit tests for these software fallback implementations. I think that is OK for this PR, but it is something to put on your TODO list.
{
    Vector128<float> x01 = Sse.LoadAlignedVector128(pSrcCurrent);
    // Replicate each slot of x01 into its own register.
    Vector128<float> x11 = Sse.Shuffle(x01, x01, 0x55);
A comment about what each of these is extracting would be useful:
`0x00` == X
`0x55` == Y
`0xAA` == Z
`0xFF` == W
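To illustrate why those four control bytes broadcast X, Y, Z and W, here is a scalar model of the `shufps` selection rule. It is only a sketch: `ShuffleDemo` and its array-based `Shuffle` are hypothetical stand-ins, not the `Sse.Shuffle` intrinsic itself.

```csharp
using System;

public static class ShuffleDemo
{
    // Scalar model of the SSE shufps selection rule: each 2-bit field of
    // `control` picks a source lane; the low two result lanes come from `a`,
    // the high two from `b`.
    public static float[] Shuffle(float[] a, float[] b, byte control)
    {
        return new[]
        {
            a[control & 3],
            a[(control >> 2) & 3],
            b[(control >> 4) & 3],
            b[(control >> 6) & 3],
        };
    }

    public static void Main()
    {
        float[] v = { 10f, 20f, 30f, 40f }; // lanes X, Y, Z, W
        foreach (byte control in new byte[] { 0x00, 0x55, 0xAA, 0xFF })
        {
            float[] r = Shuffle(v, v, control);
            Console.WriteLine($"0x{control:X2}: {string.Join(", ", r)}");
        }
    }
}
```

Since `0x55` is `01 01 01 01` in binary, every 2-bit field selects lane 1, so all four result lanes hold Y; the other three bytes work the same way for lanes 0, 2 and 3.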
Thanks, I adopted the change. May I know if there is any preference/convention for using `XYZW` instead of `ABCD`? I used `ABCD` in comments of private functions earlier, thinking that all users can always infer the correct elements from their alphabetical order, but I'm not sure if it's the best way.
{
    Contracts.Assert(0 < ccol && ccol <= cfltRow);

    // REVIEW NEEDED: Since the two methods below do not involve any SSE hardware intrinsics, no software fallback is needed.
I'm not understanding here. If there are no SIMD operations involved in these methods, then let's just have the software function in `CpuMathUtils` and not have the method at all in `SseIntrinsics`. `SseIntrinsics` should be only for when we use the `Sse` operations.
`ZeroItemsU` and `ZeroMatrixItemsCore` are two functions in `Sse.cpp`, like all other functions currently in `SseIntrinsics.cs`, but since even their native implementation does not involve Intel hardware intrinsics like `_mm_set_ps` at all, there is no need to implement software fallbacks. We require SSE support only to call those Intel hardware intrinsics, which are absent in `ZeroItemsU` and `ZeroMatrixItemsCore`.
While the implementations of these two functions are software-based, they do involve pointers and follow the C++ syntax in `Sse.cpp`. I can put these two functions in `CpuMathUtils` and rewrite them so that they follow the C# syntax.
Before I proceed to modify these functions, may I know if I have given relevant info, and if this change is still desired?
I think the right change here is just to move your current code (using pointers) from `SseIntrinsics` into `CpuMathUtils`. I assume the reason it is using pointers is for speed. Let's keep it that way until it can be proved we don't need pointers.
Agree - thanks! `ZeroItemsU` and `ZeroMatrixItemsCore` call the `Ptr` method, which is private in `SseIntrinsics`. Should I make `Ptr` in `SseIntrinsics` `internal` instead of `private`?
The above change could solve the issue. Another way is to remove the dependence on `Ptr` in `CpuMathUtils`:
private static unsafe void ZeroItemsU(AlignedArray dst, int c, int[] indices, int cindices)
{
for (int i = 0; i < cindices; i++)
{
int index = indices[i];
Contracts.Assert(index >= 0);
Contracts.Assert(index < c);
dst[index] = 0;
}
/* Original version calls the Ptr method, which is private in SseIntrinsics
fixed (float* pDstStart = &dst.Items[0])
fixed (int* pidx = &indices[0])
{
float* pdst = Ptr(dst, pDstStart);
for (int i = 0; i < cindices; ++i)
{
int index = pidx[i];
Contracts.Assert(0 <= index && index < c);
pdst[index] = 0;
}
}
*/
}
private static unsafe void ZeroMatrixItemsCore(AlignedArray dst, int c, int ccol, int cfltRow, int[] indices, int cindices)
{
int ivLogMin = 0;
int ivLogLim = ccol;
int ivPhyMin = 0;
for (int i = 0; i < cindices; i++)
{
int index = indices[i];
Contracts.Assert(index >= 0);
Contracts.Assert(index < c);
int col = index - ivLogMin;
if ((uint)col >= (uint)ccol)
{
Contracts.Assert(index < ivLogMin || index >= ivLogLim);
int row = index / ccol;
ivLogMin = row * ccol;
ivLogLim = ivLogMin + ccol;
ivPhyMin = row * cfltRow;
Contracts.Assert(index >= ivLogMin);
Contracts.Assert(index < ivLogLim);
col = index - ivLogMin;
}
dst[ivPhyMin + col] = 0;
}
/* Original version calls the Ptr method, which is private in SseIntrinsics
fixed (float* pDstStart = &dst.Items[0])
fixed (int* pidx = &indices[0])
{
float* pdst = Ptr(dst, pDstStart);
int ivLogMin = 0;
int ivLogLim = ccol;
int ivPhyMin = 0;
for (int i = 0; i < cindices; ++i)
{
int index = pidx[i];
Contracts.Assert(0 <= index && index < c);
int col = index - ivLogMin;
if ((uint)col >= (uint)ccol)
{
Contracts.Assert(ivLogMin > index || index >= ivLogLim);
int row = index / ccol;
ivLogMin = row * ccol;
ivLogLim = ivLogMin + ccol;
ivPhyMin = row * cfltRow;
Contracts.Assert(ivLogMin <= index && index < ivLogLim);
col = index - ivLogMin;
}
pdst[ivPhyMin + col] = 0;
}
}
*/
}
Please let me know which approach I should follow (the former or latter).
Your "latter" approach isn't using pointers at all.
I was thinking we wouldn't need to call `Ptr` at all. That method moves the array around in memory so it is guaranteed to be aligned. Since we aren't using SIMD instructions to set all the values to `0`, we can just pin the arrays using
fixed (float* pDstStart = &dst.Items[0])
fixed (int* pidx = &indices[0])
and then continue with the logic without calling `Ptr`.
Thank you! I think this is doable, but `pdst[index] = 0` in `ZeroItemsU` and `pdst[ivPhyMin + col] = 0` in `ZeroMatrixItemsCore` will be affected if we do not call `Ptr` and increment `pDstStart` to get the pointer `pdst` that points to the base of `dst`. How should I proceed?
You can just use the pointer you have when you pinned the array:
internal static unsafe void ZeroItemsU(AlignedArray dst, int c, int[] indices, int cindices)
{
fixed (float* pdst = &dst.Items[0])
fixed (int* pidx = &indices[0])
{
for (int i = 0; i < cindices; ++i)
{
int index = pidx[i];
Contracts.Assert(0 <= index && index < c);
pdst[index] = 0;
}
}
}

private float[] src, dst, original, src1, src2;
// Naming follows from SseIntrinsics.
private const int CbAlign = 16;
This doesn't appear to be used.
Nice catch - thanks! I originally implemented perf tests for the intrinsics that involved matrix operations and `AlignedArray`. It was a remnant of a clean-up of unused constants and variables.
Sse.StoreAligned(pDstCurrent, res0);

pDstCurrent += 4;
pMatCurrent += 3 * ccol;
Did you figure out why this is `3 * ccol`?
Maybe because here it is adding the ccol 3 times to pMatTemp which is a copy of pMatCurrent and at the end it needs to reset pMatCurrent to the right position?
It is following the same pattern in this method:
Agree and thanks! We may get a slightly better view from the native code here:
machinelearning/src/Native/CpuMathNative/Sse.cpp
Lines 108 to 146 in d0664c1
EXPORT_API(void) MatMulA(bool add, _In_ const float * pmat, _In_ const float * psrc, _Inout_ float * pdst, int crow, int ccol)
{
    const float * psLim = psrc + ccol;
    const float * pdLim = pdst + crow;
    const float * pm = pmat;
    for (float * pd = pdst; pd < pdLim; pd += 4, pm += 3 * ccol)
    {
        __m128 res0 = _mm_setzero_ps();
        __m128 res1 = res0;
        __m128 res2 = res0;
        __m128 res3 = res0;
        for (const float * ps = psrc; ps < psLim; ps += 4, pm += 4)
        {
            const float * pmTmp;
            __m128 x01 = _mm_load_ps(pmTmp = pm);
            __m128 x11 = _mm_load_ps(pmTmp += ccol);
            __m128 x21 = _mm_load_ps(pmTmp += ccol);
            __m128 x31 = _mm_load_ps(pmTmp += ccol);
            __m128 x02 = _mm_load_ps(ps);
            x01 = _mm_mul_ps(x01, x02);
            x11 = _mm_mul_ps(x11, x02);
            x21 = _mm_mul_ps(x21, x02);
            x31 = _mm_mul_ps(x31, x02);
            res0 = _mm_add_ps(res0, x01);
            res1 = _mm_add_ps(res1, x11);
            res2 = _mm_add_ps(res2, x21);
            res3 = _mm_add_ps(res3, x31);
        }
        // Add up the entries of each, with the 4 results in res0
        res0 = _mm_hadd_ps(res0, res1);
        res2 = _mm_hadd_ps(res2, res3);
        res0 = _mm_hadd_ps(res0, res2);
        if (add)
            res0 = _mm_add_ps(res0, _mm_load_ps(pd));
        _mm_store_ps(pd, res0);
    }
}
`pMatCurrent` corresponds to `pm` in the native code shown above. For correctness, `pm` needs to be incremented by `4 * ccol` in each iteration of the outer for loop. We see that at the end of each such iteration, `pm` is incremented by only `3 * ccol`; this is because the inner for loop increments `pm` by `4` in each of its own iterations. Since the inner for loop runs for `(psLim - psrc) / 4 = ccol / 4` iterations, the entire inner for loop increments `pm` by `(ccol / 4) * 4 = ccol`. Together with the increment by `3 * ccol` done by the outer for loop, `pm` is incremented by `4 * ccol` in each iteration of the outer for loop, which is the desired behavior.
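The arithmetic can be checked with a small index-based model of the two loops. This is a sketch only: `PointerStrideDemo` and `OuterLoopOffsets` are hypothetical names, integer offsets replace the pointers, and `crow` and `ccol` are assumed to be multiples of 4.

```csharp
using System;

public static class PointerStrideDemo
{
    // Returns the offset of the matrix cursor (pm in the native code) at the
    // start of every outer iteration, using plain indices instead of pointers.
    public static int[] OuterLoopOffsets(int crow, int ccol)
    {
        var offsets = new int[crow / 4];
        int pm = 0;
        for (int pd = 0, block = 0; pd < crow; pd += 4, pm += 3 * ccol, block++)
        {
            offsets[block] = pm;
            // The inner loop walks one row in steps of 4, advancing pm by ccol in total.
            for (int ps = 0; ps < ccol; ps += 4)
                pm += 4;
        }
        return offsets;
    }

    public static void Main()
    {
        // With crow = 8 and ccol = 12, each block of 4 rows spans 4 * 12 = 48 floats.
        Console.WriteLine(string.Join(", ", OuterLoopOffsets(8, 12))); // 0, 48
    }
}
```

The second block starts at offset 48, which is `ccol` from the inner loop plus `3 * ccol` from the outer loop's iterator.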

float* pSrcEnd = psrc + ccol;
float* pDstEnd = pdst + crow;
float* pSrcCurrent = psrc;
Why do we need another set of pointers here? Can't we just use `psrc`?
Actually, I think there is a bug here. Shouldn't `pSrcCurrent` get reset to `psrc` after each inner loop?
From the C++ code:
for (const float * ps = psrc; ps < psLim; ps += 4, pm += 4)
{
const float * pmTmp;
__m128 x01 = _mm_load_ps(pmTmp = pm);
__m128 x11 = _mm_load_ps(pmTmp += ccol);
__m128 x21 = _mm_load_ps(pmTmp += ccol);
__m128 x31 = _mm_load_ps(pmTmp += ccol);
__m128 x02 = _mm_load_ps(ps);
Maybe we need a few more unit tests to cover different shapes of matrices. We only have 4x4 and 4x8 tests.
Reply to the first comment:
In this case, yes, we can use `psrc` directly and feel free to increment it in the for loops without having a `pSrcCurrent`, since we don't have to keep track of the start of the array (adjusted for the base of the `AlignedArray`) in this function. The same goes for `dst` and `mat`.
In other functions that do not involve matrix multiplication, I have adopted the convention of only incrementing `pXCurrent`, while using `pX` as a fixed pointer that indicates the start of the array `X`. Should I skip using `pXCurrent` just for these 4-6 functions that involve matrix operations?
Reply to the second comment:
Nice catch - thank you! I have looked at all 4 functions that involve matrix multiplications to ensure the same thing doesn't happen elsewhere in these 4 functions.
Since the `dst` array has a constant size of 4, the outer for loop never has a second iteration, and thus the 2 unit tests implemented for each matrix-op function did not manage to catch this bug. I will probably add an 8x4 test.
Note: These matrix-op functions have the suffix `A`, which means that the inputs are "aligned and padded for SSE operations," so both dimensions of the input matrix/vector have to be multiples of 4 (except 1).
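To make the reset requirement concrete, here is a scalar model of the row-blocked loop, exercised with an 8x4 matrix so that the outer loop runs twice. It is a sketch under stated assumptions: `MatMulResetDemo` is a hypothetical name, plain arrays and indices replace `AlignedArray` and pointers, and no SSE intrinsics are used.

```csharp
using System;

public static class MatMulResetDemo
{
    // Scalar model of a row-blocked matrix-vector product: dst = mat * src.
    // mat is row-major with crow rows and ccol columns; crow is assumed to be a multiple of 4.
    public static float[] MatMul(float[] mat, float[] src, int crow, int ccol)
    {
        var dst = new float[crow];
        int matIndex = 0;
        for (int row = 0; row < crow; row += 4, matIndex += 3 * ccol)
        {
            // The source cursor must restart at the beginning for every block of rows.
            // Hoisting this declaration above the outer loop reproduces the bug:
            // the second block would then skip its inner loop entirely.
            int srcIndex = 0;
            float r0 = 0, r1 = 0, r2 = 0, r3 = 0;
            for (; srcIndex < ccol; srcIndex++, matIndex++)
            {
                float s = src[srcIndex];
                r0 += mat[matIndex] * s;
                r1 += mat[matIndex + ccol] * s;
                r2 += mat[matIndex + 2 * ccol] * s;
                r3 += mat[matIndex + 3 * ccol] * s;
            }
            dst[row] = r0; dst[row + 1] = r1; dst[row + 2] = r2; dst[row + 3] = r3;
        }
        return dst;
    }

    public static void Main()
    {
        const int crow = 8, ccol = 4;
        var mat = new float[crow * ccol];
        for (int i = 0; i < mat.Length; i++) mat[i] = i;
        float[] src = { 1f, 1f, 1f, 1f };
        // Each entry is the sum of one matrix row: 6, 22, 38, 54, 70, 86, 102, 118
        Console.WriteLine(string.Join(", ", MatMul(mat, src, crow, ccol)));
    }
}
```

A 4x4 or 4x8 input would pass even with the cursor hoisted out of the outer loop, which is why a test with more than 4 rows is needed to catch the bug.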
fixed (float* pdst = dst)
{
    float* pDstEnd = pdst + dst.Length;
    float* pSrcCurrent = psrc;
We shouldn't need `pSrcCurrent` and `pDstCurrent`, right? We can just use `psrc` and `pdst` here.
This time, it may be different from the case in the matrix-op functions (e.g. `MatMulA`). Since `psrc` is fixed to point to `src` in the `fixed` statement here, `psrc` cannot be modified or incremented inside the `fixed` block. I assigned the value of `psrc` to a new variable `pSrcCurrent`, so that we can increment `pSrcCurrent` to achieve the desired behavior.
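A minimal sketch of that pattern: the pointer declared in a `fixed` statement is read-only, so a separate cursor is copied from it and advanced instead. `FixedPointerDemo` and `Sum` are hypothetical names, and the code needs unsafe code enabled at compile time.

```csharp
using System;

public static class FixedPointerDemo
{
    // Sums an array through a pinned pointer. Compile with unsafe code enabled
    // (for example <AllowUnsafeBlocks>true</AllowUnsafeBlocks> in the project file).
    public static unsafe float Sum(float[] src)
    {
        float total = 0;
        fixed (float* psrc = src)
        {
            // psrc itself is read-only inside the fixed block; `psrc++` does not compile.
            float* pSrcCurrent = psrc;          // mutable cursor
            float* pSrcEnd = psrc + src.Length; // one past the last element
            while (pSrcCurrent < pSrcEnd)
            {
                total += *pSrcCurrent;
                pSrcCurrent++;
            }
        }
        return total;
    }

    public static void Main()
    {
        Console.WriteLine(Sum(new[] { 1f, 2f, 3f, 4f })); // 10
    }
}
```

When the pointer arrives as a method parameter rather than from a `fixed` statement in the same method, as in the matrix-op functions, it is an ordinary local and can be incremented directly.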
@@ -340,13 +850,148 @@ internal static unsafe float SumAbsU(Span<float> src)
         pSrcCurrent += 4;
     }

-    result = VectorSum(in result);
+    result = VectorSum(result);
It may be nice to keep the `in` here and below. It is good documentation for the reader.
It is not just good documentation, it can result in different code being emitted for certain cases.
You should always specify `in` for a variable that is passed that way, IMO.
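For readers unfamiliar with the modifier, here is a small sketch of an `in` parameter with and without `in` at the call site. `Quad` is a hypothetical stand-in for `Vector128<float>`, chosen so the example compiles without the hardware intrinsics package.

```csharp
using System;

// Hypothetical 16-byte struct standing in for Vector128<float>.
public readonly struct Quad
{
    public readonly float X, Y, Z, W;
    public Quad(float x, float y, float z, float w) { X = x; Y = y; Z = z; W = w; }
}

public static class InArgumentDemo
{
    // `in` passes the struct by read-only reference instead of copying it.
    public static float Sum(in Quad q) => q.X + q.Y + q.Z + q.W;

    public static void Main()
    {
        var q = new Quad(1f, 2f, 3f, 4f);
        float a = Sum(in q); // explicit: the reader sees that q is passed by reference
        float b = Sum(q);    // also compiles, but the call site no longer shows how q is passed
        Console.WriteLine(a == b);
    }
}
```

Both calls return the same value here; the review comments above are about readability and about the code the compiler may emit, not about the result.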
Vector128<float> result = Sse.SetZeroVector128();
Vector128<float> mask;

if (Sse2.IsSupported)
Maybe refactor this common chunk out into a method `GetAbsMask` or something similar.
    pSrcCurrent += 4;
}

Vector128<float> x1 = Sse.Shuffle(result, result, 0xB1);
These 4 lines are duplicated with the other Max method. Can we extract a common method here?
Agree and thanks!
Should I make `SumAbsU` call `SumAbsDiffU` and `MaxAbsU` call `MaxAbsDiffU` with `mean == 0f`?
No, I wouldn't do it that way. If you did, you would always need to emit `Subtract(0)` calls, which would slow the function down.
What I mean is just these 4 duplicated lines that take the `result` and get the MaxScalar value from it, similar to the `VectorSum` method.
Thanks! Yes, I share the same thought and am on the same page as you. I wanted to see if it could be a way to reduce similar-looking code :)
{
    float* pdst = Ptr(dst, pDstStart);

    // REVIEW NEEDED: This line expands to (void)(c); but is it necessary?
This is not necessary in C#. It is used in C++ to get the compiler to stop complaining that `c` is only used in an assert. (It complains in Release builds because `c` is never used, since the `assert` gets compiled out.)
Thanks - I will remove these two comments. Same for below.
Vector128<float> xPrimal = Sse.SetAllVector128(primalUpdate);

Vector128<float> signMask = Sse.SetAllVector128(-0.0f); // 1000 0000 ...
(nit) I know this is ported from the C++ code, but I think the comment is slightly confusing/misleading. Typically I would expect a hexadecimal representation here: `0x8000 0000`. I guess this is a binary representation, which is why it is using the `...`.
It may make sense to call this out as binary using the `0b` prefix. Or my preference would be just to use hex: `0x8000 0000`.
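The bit pattern in question can be inspected directly. The sketch below uses `BitConverter.SingleToInt32Bits` on scalars instead of the vector intrinsics; `SignMaskDemo` and `SignBit` are hypothetical names.

```csharp
using System;

public static class SignMaskDemo
{
    // -0.0f has only the sign bit set, which makes it usable as a sign mask.
    public static readonly int SignMask = BitConverter.SingleToInt32Bits(-0.0f);

    // Isolates the sign bit of a float by ANDing its raw bits with the mask.
    public static int SignBit(float value) => BitConverter.SingleToInt32Bits(value) & SignMask;

    public static void Main()
    {
        Console.WriteLine($"0x{SignMask:X8}");                               // hexadecimal form
        Console.WriteLine(Convert.ToString(SignMask, 2).PadLeft(32, '0'));   // binary form
        Console.WriteLine(SignBit(-3.5f) == SignMask);                       // True: negative input
        Console.WriteLine(SignBit(3.5f) == 0);                               // True: positive input
    }
}
```

The first line prints `0x80000000`, the hexadecimal spelling suggested in the comment above; the second prints a one followed by 31 zeros, which is what the original `1000 0000 ...` comment abbreviates.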
Vector128<float> xDst1 = Load4(pdst1, pIdxCurrent);
xDst1 = Sse.Add(xDst1, Sse.Multiply(xSrc, xPrimal));

Vector128<float> xSign = Sse.And(xDst1, signMask); // result = 10000... if xDst1 is negative or 00000 otherwise
I think this chunk of code can be refactored out into a method, since it is duplicated with the above method.
Responded to all PR feedback so far. Relevant changes are included in the last two commits. Now ready for a second round of review if any.
This looks good to me. Thanks for the good work, @briancylui.
I'd like to get @safern and @tannergooding to have a look as well. @safern and @tannergooding, when you approve, can you merge this? I will be out for the next 2 working days.
I see the code usually uses Take DotU as an example, when using
When using
So the second one saves the instruction |
@helloguo, I believe that is covered in the existing notes on briancylui#2. The initial port is basically a direct translation of the C++ code, and the refactorings/cleanup are going to come in a separate PR (for the most part).
@helloguo Thank you for your comment. At this point, we aim for a clean port of the original native code, from which we will obtain a baseline perf. All suggestions for future follow-up are documented in briancylui#2, where there are some suggestions about optimizing the loops. I have added your suggestion to the issue page. After implementing the new optimizations in a separate PR, we will be able to obtain new perf results and compare them against the baseline.
{
    if (!tran)
    {
        Contracts.Assert(crun >= 0);
Since this assert is in all the paths, we can move it out to the common path, right?
{
    if (!tran)
    {
        Contracts.Assert(crun >= 0);
Ditto.
    Add(a, new Span<float>(dst, 0, count));
}

// dst += a
Nit: I think it is pretty clear what this is doing. Maybe this comment can be deleted.
for (int i = 0; i < cindices; ++i)
{
    int index = pidx[i];
    Contracts.Assert(0 <= index && index < c);
Nit: there are more asserts using `&&` in this file that can be split.
for (int i = 0; i < cindices; ++i)
{
    int index = pidx[i];
    Contracts.Assert(0 <= index && index < c);
Ditto.
ivLogLim = ivLogMin + ccol;
ivPhyMin = row * cfltRow;

Contracts.Assert(ivLogMin <= index && index < ivLogLim);
Ditto.
[MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
private static Vector128<float> GetAbsMask()
{
    return (Sse2.IsSupported) ?
Nit: you don't need to wrap `Sse2.IsSupported` in `()`.
LGTM.
What's new:
`AlignedArray` as an argument
Description from #562:
Structure:
Changes to users:
Future follow-up for `CpuMath` enhancement
`CpuMath` enhancement in this PR scheduled for future follow-ups have been compiled into an issue page (Suggestions on CpuMath enhancement briancylui/machinelearning#2).
List of new SSE intrinsics implemented
• MatMulA
• MatMulTranA
• MatMulPA
• MatMulTranPA
• SdcaL1UpdateU
• SdcaL1UpdateSU
• AddScaleCopyU
• SumU
• AddScalarU
• SumSqDiffU
• SumAbsDiffU
• MaxAbsDiffU
• MaxAbsU
• ScaleSrcU
• ScaleAddU
• ZeroItemsU
• ZeroMatrixItemsCoreU
cc: @eerhardt @tannergooding @safern @danmosemsft