Skip to content

Change bound checking in SSE/AVX intrinsics to avoid pointer overflow #821

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from
Closed
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
20 changes: 12 additions & 8 deletions src/Microsoft.ML.CpuMath/AvxIntrinsics.cs
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,10 @@ internal static class AvxIntrinsics
{
private static readonly Vector256<float> _absMask256 = Avx.StaticCast<int, float>(Avx.SetAllVector256(0x7FFFFFFF));

// The count of 32-bit floats in Vector256<T>
private const int AvxAlignment = 8;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't alignment, but rather the number of 32-bit elements that Vector256 can hold...

Maybe Vector256SingleElementCount or something similar...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about Vector256FloatCount? Is there any preference for SingleElement?


// The count of bytes in Vector256<T>, corresponding to _cbAlign in AlignedArray
private const int Vector256Alignment = 32;

[MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
Expand Down Expand Up @@ -415,32 +419,32 @@ public static unsafe void AddScalarU(float scalar, Span<float> dst)
{
fixed (float* pdst = dst)
{
float* pDstEnd = pdst + dst.Length;
float* pDstCurrent = pdst;

Vector256<float> scalarVector256 = Avx.SetAllVector256(scalar);
int countAvx = Math.DivRem(dst.Length, AvxAlignment, out int remainderAvx);
float* pDstCurrent = pdst;

while (pDstCurrent + 8 <= pDstEnd)
for (int i = 0; i < countAvx; i++)
{
Vector256<float> dstVector = Avx.LoadVector256(pDstCurrent);
dstVector = Avx.Add(dstVector, scalarVector256);
Avx.Store(pDstCurrent, dstVector);

pDstCurrent += 8;
pDstCurrent += AvxAlignment;
}

Vector128<float> scalarVector128 = Sse.SetAllVector128(scalar);
int countSse = Math.DivRem(remainderAvx, SseIntrinsics.SseAlignment, out int remainderSse);

if (pDstCurrent + 4 <= pDstEnd)
if (countSse > 0)
{
Vector128<float> dstVector = Sse.LoadVector128(pDstCurrent);
dstVector = Sse.Add(dstVector, scalarVector128);
Sse.Store(pDstCurrent, dstVector);

pDstCurrent += 4;
pDstCurrent += SseIntrinsics.SseAlignment;
}

while (pDstCurrent < pDstEnd)
for (int i = 0; i < remainderSse; i++)
{
Vector128<float> dstVector = Sse.LoadScalarVector128(pDstCurrent);
dstVector = Sse.AddScalar(dstVector, scalarVector128);
Expand Down
14 changes: 8 additions & 6 deletions src/Microsoft.ML.CpuMath/SseIntrinsics.cs
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,9 @@ internal static class SseIntrinsics
Sse.StaticCast<int, float>(Sse2.SetAllVector128(0x7FFFFFFF)) :
Sse.SetAllVector128(BitConverter.Int32BitsToSingle(0x7FFFFFFF));

// The count of 32-bit floats in Vector128<T>
internal const int SseAlignment = 4;

// The count of bytes in Vector128<T>, corresponding to _cbAlign in AlignedArray
private const int Vector128Alignment = 16;

Expand Down Expand Up @@ -412,21 +415,20 @@ public static unsafe void AddScalarU(float scalar, Span<float> dst)
{
fixed (float* pdst = dst)
{
float* pDstEnd = pdst + dst.Length;
float* pDstCurrent = pdst;

Vector128<float> scalarVector = Sse.SetAllVector128(scalar);
int count = Math.DivRem(dst.Length, SseAlignment, out int remainder);
float* pDstCurrent = pdst;

while (pDstCurrent + 4 <= pDstEnd)
for (int i = 0; i < count; i++)
{
Vector128<float> dstVector = Sse.LoadVector128(pDstCurrent);
dstVector = Sse.Add(dstVector, scalarVector);
Sse.Store(pDstCurrent, dstVector);

pDstCurrent += 4;
pDstCurrent += SseAlignment;
}

while (pDstCurrent < pDstEnd)
for (int i = 0; i < remainder; i++)
{
Vector128<float> dstVector = Sse.LoadScalarVector128(pDstCurrent);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be way more readable as pDstCurrent[i] += scalar, and it should produce the same code.

The various scalar intrinsics are really meant for places where you have to interop between Vector and Scalar code, or where you need some scalar operations which aren't expressible in normal C# code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, perf test results show that this change improves the runtime significantly:

After all changes, including changing scalar intrinsics to indexed code:

BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17134.228 (1803/April2018Update/Redstone4)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-20180720-2
  [Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT

Toolchain=InProcessToolchain
Type Method Mean Error StdDev
AvxPerformanceTests AddScalarU 172.1 us 2.589 us 2.422 us
NativePerformanceTests AddScalarU 216.9 us 1.785 us 1.491 us
SsePerformanceTests AddScalarU 209.7 us 1.492 us 1.246 us

After partial changes (removing the 2nd time of Math.DivRem):

BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17134.228 (1803/April2018Update/Redstone4)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-20180720-2
  [Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT

Toolchain=InProcessToolchain
Type Method Mean Error StdDev
AvxPerformanceTests AddScalarU 193.3 us 3.6653 us 3.4285 us
NativePerformanceTests AddScalarU 215.4 us 0.7014 us 0.6218 us
SsePerformanceTests AddScalarU 238.2 us 4.1955 us 3.5034 us

Before the change:

BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17134.228 (1803/April2018Update/Redstone4)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-20180720-2
  [Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT

Toolchain=InProcessToolchain
Type Method Mean Error StdDev
AvxPerformanceTests AddScalarU 200.3 us 3.919 us 4.513 us
NativePerformanceTests AddScalarU 249.4 us 4.400 us 4.116 us
SsePerformanceTests AddScalarU 235.3 us 1.393 us 1.235 us

dstVector = Sse.AddScalar(dstVector, scalarVector);
Expand Down