Port native SIMD algorithms for SSE to managed code #552

Closed
37 tasks done
briancylui opened this issue Jul 18, 2018 · 2 comments

briancylui commented Jul 18, 2018

Summary (July 19)

  1. Finished preparation work to check in code to the ML.NET repo, with:
     1. Resolved the multi-targeting issue of building for two different frameworks: .NET Core App 3.0 and .NET Standard 2.0
     2. Added additional unit tests
     3. Link to working repo (forked): https://github.com/briancylui/machinelearning
     4. Link to original issue page for 12-week timeline: Progress on porting ML.NET native SIMD algorithms to managed code briancylui/machinelearning#1

Goals

  1. Port ML.NET C++ SIMD algorithms for SSE to C#
  2. Ensure C# Hardware Intrinsics feature for SSE meets the needs of ML.NET
  3. Unit test all functions and get performance benchmark numbers for before and after changes
  4. (Stretch) Provide software fallback implementations to support more architectures
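
As an illustration of goals 1 and 4, here is a minimal sketch of one such kernel (a dot product) with an SSE path and a software fallback. It is not the ML.NET code: the ported code discussed in this issue is C#, while this sketch uses C++ (the language of the native side), and the function names `dot_scalar` and `dot` are hypothetical. The SSE path is selected at compile time through the `__SSE__` macro, which is an assumption about the compiler (GCC and Clang define it; other compilers fall through to the scalar path).

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE__)
#include <xmmintrin.h>  // SSE intrinsics (_mm_loadu_ps, _mm_mul_ps, ...)
#endif

// Software fallback: plain scalar loop. Also used for the leftover tail.
float dot_scalar(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += a[i] * b[i];
    return sum;
}

// Dot product that uses 4-wide SSE vectors when available.
float dot(const float* a, const float* b, std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr));
#if defined(__SSE__)
    __m128 acc = _mm_setzero_ps();
    std::size_t i = 0;
    // Process four floats per iteration with unaligned loads.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }
    // Horizontal sum of the four accumulator lanes.
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
    // Handle the remaining 0 to 3 elements with the scalar loop.
    return sum + dot_scalar(a + i, b + i, n - i);
#else
    return dot_scalar(a, b, n);
#endif
}
```

Because the vector path sums in a different order than the scalar loop, the two results can differ in the last bits for general floating-point input, so a unit test comparing them would normally use a tolerance rather than exact equality.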

[Keeping only the relevant, high-level details below from original progress page to give a general sense of progress]

Progress

Week 2 (Jun 25-29): Port SIMD operations in .NET to managed code outside of ML.NET

  • Implement SSE support and software fallbacks in managed code for all key intrinsics
  • Comply with coding style standard
  • Implement working unit tests for all key intrinsics
  • Implement working performance tests for all key intrinsics using BenchmarkDotNet (slides and recording)
  • Present performance results in a table (SsePerf-report-github.pdf)
BenchmarkDotNet=v0.10.14, OS=Windows 10.0.15063.1155 (1703/CreatorsUpdate/Redstone2)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
Frequency=3515623 Hz, Resolution=284.4446 ns, Timer=TSC
.NET Core SDK=2.1.300
  [Host]     : .NET Core 2.1.0 (CoreCLR 4.6.26515.07, CoreFX 4.6.26515.06), 64bit RyuJIT
  DefaultJob : .NET Core 2.1.0 (CoreCLR 4.6.26515.07, CoreFX 4.6.26515.06), 64bit RyuJIT

Method | Mean | Error | StdDev
--- | --- | --- | ---
NativeDotUPerf | 363.2 us | 7.7293 us | 18.8143 us
MyDotUPerf | 340.2 us | 6.7218 us | 8.0018 us
NativeDotSUPerf | 2,178.3 us | 43.4641 us | 40.6563 us
MyDotSUPerf | 2,144.7 us | 19.1638 us | 16.0027 us
NativeSumSqUPerf | 540.6 us | 3.0299 us | 2.8342 us
MySumSqUPerf | 538.8 us | 2.5507 us | 2.3859 us
NativeAddUPerf | 313.9 us | 2.5163 us | 2.3537 us
MyAddUPerf | 303.3 us | 4.5125 us | 4.2210 us
NativeAddSUPerf | 2,691.8 us | 29.4588 us | 27.5558 us
MyAddSUPerf | 2,658.1 us | 51.3336 us | 64.9206 us
NativeAddScaleUPerf | 300.0 us | 5.5529 us | 5.1941 us
MyAddScaleUPerf | 309.8 us | 5.3974 us | 4.7846 us
NativeAddScaleSUPerf | 2,550.9 us | 21.8322 us | 20.4218 us
MyAddScaleSUPerf | 2,805.3 us | 20.5171 us | 19.1917 us
NativeScaleUPerf | 131.4 us | 0.6347 us | 0.5626 us
MyScaleUPerf | 130.7 us | 1.2159 us | 1.1373 us
NativeDist2Perf | 336.4 us | 2.0555 us | 1.9227 us
MyDist2Perf | 335.2 us | 8.3427 us | 11.4196 us
NativeSumAbsUPerf | 258.0 us | 1.6470 us | 1.5406 us
MySumAbsqUPerf | 258.9 us | 0.9447 us | 0.7889 us
NativeMulElementWiseUPerf | 466.4 us | 1.9625 us | 1.6388 us
MyMulElementWiseUPerf | 467.2 us | 4.3560 us | 4.0747 us

Week 3-5 (Jul 2-20): Port algorithms to C#, write unit tests and performance tests, check in code

  • Apply real data to test implemented managed code using BenchmarkDotNet
  • Integrate local code into ML.NET repo to prepare for checking in code, including:
      • C# implementations of intrinsics
      • Unit tests
      • Performance tests

Week 6 (Jul 23-27)

  • Participate in Microsoft Hackathon
  • Attend IEEE conference

Week 7 (Jul 30-Aug 3)

  • Respond to PR comments and Intel partners
  • Fix build issues in multi-targeting and disabling netcoreapp3.0 test projects
  • Hard-code unit tests
  • Introduce a custom random seed in perf tests, set through environment variables, for better testing
  • Major style changes to best utilize existing libraries and ensure aggressive inlining wherever needed
  • Document follow-up action items for performance enhancement in an issue page (Suggestions on CpuMath enhancement briancylui/machinelearning#2)
  • Fix perf issues of some SSE intrinsics in compliance with C# 7.3 updates
  • Fix merge conflicts and obtain green builds for PR
  • PR on SSE key intrinsics, as well as their unit tests and perf tests, with multi-targeting, is approved
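
The environment-variable seed mentioned in the list above can be sketched as follows. This is a C++ illustration under stated assumptions, not the ML.NET perf-test code (which is C#): the default seed value and the helper names `parse_seed`, `seed_from_environment`, and `make_test_data` are hypothetical, and the caller supplies whatever variable name it wants to read.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <random>
#include <vector>

// Hypothetical default; the issue does not state the value actually used.
constexpr std::uint32_t kDefaultSeed = 42;

// Parse a decimal seed. Returns `fallback` when the text is missing,
// empty, or contains anything other than a decimal number.
std::uint32_t parse_seed(const char* text, std::uint32_t fallback) {
    if (text == nullptr || *text == '\0') return fallback;
    char* end = nullptr;
    unsigned long value = std::strtoul(text, &end, 10);
    if (*end != '\0') return fallback;  // trailing garbage or no digits
    return static_cast<std::uint32_t>(value);
}

// Read the seed from the named environment variable, if it is set.
std::uint32_t seed_from_environment(const char* variable_name) {
    assert(variable_name != nullptr);
    return parse_seed(std::getenv(variable_name), kDefaultSeed);
}

// Fill a buffer with pseudo-random floats in [-1, 1) from a fixed seed,
// so that two runs with the same seed benchmark identical input.
std::vector<float> make_test_data(std::size_t n, std::uint32_t seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
    std::vector<float> data(n);
    for (float& x : data) x = dist(rng);
    return data;
}
```

The point of the design is that a surprising benchmark or test result can be reproduced by rerunning with the same seed, while the default keeps ordinary runs deterministic.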

Week 8-9 (Aug 6-17)

  • PR is merged
  • Scale up implementation, unit tests, and performance tests to cover all SSE intrinsics
  • Write AVX implementations
  • Performance test before and after. We should see some perf gains here.
  • Check in code to ML.NET (submitted PR)
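
To make the SSE-to-AVX step concrete, here is a hedged C++ sketch of a sum kernel that consumes eight floats per iteration with AVX, then four with SSE, then finishes with a scalar loop. It is an illustration only, not the ML.NET implementation (which is C#); the function name `sum` is hypothetical, and the instruction sets are chosen at compile time through the `__AVX__` and `__SSE__` macros, which is an assumption about the compiler and its flags.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__AVX__)
#include <immintrin.h>  // AVX intrinsics (also provides the SSE ones)
#elif defined(__SSE__)
#include <xmmintrin.h>  // SSE intrinsics only
#endif

// Sum of n floats: 8-wide AVX loop, then 4-wide SSE loop, then scalar tail.
float sum(const float* src, std::size_t n) {
    assert(n == 0 || src != nullptr);
    std::size_t i = 0;
    float total = 0.0f;

#if defined(__AVX__)
    __m256 acc8 = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8) {
        acc8 = _mm256_add_ps(acc8, _mm256_loadu_ps(src + i));
    }
    float lanes8[8];
    _mm256_storeu_ps(lanes8, acc8);
    for (float lane : lanes8) total += lane;
#endif

#if defined(__AVX__) || defined(__SSE__)
    // Without AVX this is the main loop; with AVX it runs at most once.
    __m128 acc4 = _mm_setzero_ps();
    for (; i + 4 <= n; i += 4) {
        acc4 = _mm_add_ps(acc4, _mm_loadu_ps(src + i));
    }
    float lanes4[4];
    _mm_storeu_ps(lanes4, acc4);
    for (float lane : lanes4) total += lane;
#endif

    // Scalar tail, and the whole computation when no SIMD is available.
    for (; i < n; ++i) total += src[i];
    return total;
}
```

Doubling the vector width halves the number of loop iterations, but it does not guarantee half the running time: kernels that are limited by memory bandwidth rather than arithmetic may gain much less.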

Perf test results for all active SSE hardware intrinsics:

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.17134
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-20180720-2
  [Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT

Toolchain=InProcessToolchain
Method | Mean | Error | StdDev | Median
--- | --- | --- | --- | ---
NativeAddScalarUPerf | 221.7 us | 4.323 us | 5.467 us | 220.8 us
ManagedAddScalarUPerf | 217.3 us | 4.207 us | 3.729 us | 215.5 us
NativeScaleUPerf | 219.0 us | 2.368 us | 2.215 us | 218.9 us
ManagedScaleUPerf | 182.2 us | 2.677 us | 2.504 us | 182.4 us
NativeScaleSrcUPerf | 252.4 us | 4.404 us | 3.904 us | 250.8 us
ManagedScaleSrcUPerf | 271.5 us | 5.357 us | 6.377 us | 272.0 us
NativeScaleAddUPerf | 230.6 us | 3.230 us | 3.021 us | 230.5 us
ManagedScaleAddUPerf | 232.3 us | 3.281 us | 2.908 us | 231.8 us
NativeAddScaleUPerf | 317.5 us | 4.360 us | 4.079 us | 316.0 us
ManagedAddScaleUPerf | 317.1 us | 4.778 us | 3.990 us | 317.5 us
NativeAddScaleSUPerf | 4,135.9 us | 66.596 us | 62.294 us | 4,126.9 us
ManagedAddScaleSUPerf | 4,812.6 us | 39.148 us | 34.704 us | 4,803.0 us
NativeAddScaleCopyUPerf | 505.4 us | 5.658 us | 4.725 us | 503.8 us
ManagedAddScaleCopyUPerf | 481.7 us | 9.140 us | 8.550 us | 480.0 us
NativeAddUPerf | 316.5 us | 5.698 us | 5.330 us | 314.7 us
ManagedAddUPerf | 335.2 us | 12.130 us | 23.944 us | 321.9 us
NativeAddSUPerf | 4,249.0 us | 58.001 us | 54.255 us | 4,254.0 us
ManagedAddSUPerf | 4,583.9 us | 78.739 us | 73.652 us | 4,556.6 us
NativeMulElementWiseUPerf | 552.5 us | 7.078 us | 5.911 us | 551.5 us
ManagedMulElementWiseUPerf | 507.9 us | 7.059 us | 6.258 us | 507.8 us
NativeSumUPerf | 289.2 us | 5.435 us | 5.084 us | 287.6 us
ManagedSumUPerf | 288.3 us | 2.815 us | 2.350 us | 287.8 us
NativeSumSqUPerf | 283.2 us | 1.572 us | 1.393 us | 283.3 us
ManagedSumSqUPerf | 289.8 us | 2.493 us | 2.210 us | 288.8 us
NativeSumSqDiffUPerf | 289.4 us | 3.621 us | 3.387 us | 289.4 us
ManagedSumSqDiffUPerf | 290.9 us | 2.772 us | 2.593 us | 290.0 us
NativeSumAbsUPerf | 289.2 us | 4.836 us | 4.524 us | 287.0 us
ManagedSumAbsUPerf | 293.1 us | 1.338 us | 1.186 us | 293.2 us
NativeSumAbsDiffUPerf | 290.7 us | 5.000 us | 4.677 us | 288.8 us
ManagedSumAbsDiffUPerf | 294.4 us | 5.242 us | 4.903 us | 293.0 us
NativeMaxAbsUPerf | 288.0 us | 3.924 us | 3.671 us | 285.8 us
ManagedMaxAbsUPerf | 290.1 us | 2.614 us | 2.317 us | 289.0 us
NativeMaxAbsDiffUPerf | 292.1 us | 4.805 us | 4.495 us | 289.6 us
ManagedMaxAbsDiffUPerf | 290.6 us | 2.083 us | 1.846 us | 290.3 us
NativeDotUPerf | 328.8 us | 3.844 us | 3.407 us | 328.6 us
ManagedDotUPerf | 333.8 us | 2.154 us | 1.910 us | 333.3 us
NativeDotSUPerf | 3,414.2 us | 67.058 us | 68.864 us | 3,393.7 us
ManagedDotSUPerf | 3,753.1 us | 37.440 us | 33.189 us | 3,737.5 us
NativeDist2Perf | 332.3 us | 3.152 us | 2.632 us | 332.0 us
ManagedDist2Perf | 333.7 us | 4.368 us | 3.647 us | 332.0 us
NativeSdcaL1UpdateUPerf | 607.5 us | 8.506 us | 7.957 us | 608.7 us
ManagedSdcaL1UpdateUPerf | 600.8 us | 12.003 us | 27.820 us | 591.3 us
NativeSdcaL1UpdateSUPerf | 13,445.5 us | 116.336 us | 108.821 us | 13,447.1 us
ManagedSdcaL1UpdateSUPerf | 13,824.3 us | 97.564 us | 86.488 us | 13,795.3 us

Perf test results for all managed intrinsics with AVX enhancement:

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.17134
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-20180720-2
  [Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT

Toolchain=InProcessToolchain
Method | Mean | Error | StdDev
--- | --- | --- | ---
ManagedAddScalarUPerf | 157.3 us | 1.3138 us | 1.1647 us
ManagedScaleUPerf | 177.0 us | 3.5143 us | 7.5649 us
ManagedScaleSrcUPerf | 260.5 us | 0.9317 us | 0.8715 us
ManagedScaleAddUPerf | 170.3 us | 1.6569 us | 1.5499 us
ManagedAddScaleUPerf | 272.5 us | 5.4200 us | 9.2035 us
ManagedAddScaleSUPerf | 5,253.6 us | 105.0419 us | 163.5375 us
ManagedAddScaleCopyUPerf | 448.2 us | 11.0005 us | 19.8362 us
ManagedAddUPerf | 263.4 us | 2.5347 us | 2.2469 us
ManagedAddSUPerf | 4,256.5 us | 38.0944 us | 33.7697 us
ManagedMulElementWiseUPerf | 441.7 us | 3.2423 us | 2.8742 us
ManagedSumUPerf | 161.0 us | 1.3688 us | 1.2134 us
ManagedSumSqUPerf | 165.0 us | 0.4772 us | 0.4230 us
ManagedSumSqDiffUPerf | 179.5 us | 1.1673 us | 1.0919 us
ManagedSumAbsUPerf | 174.9 us | 3.4667 us | 5.9799 us
ManagedSumAbsDiffUPerf | 178.7 us | 0.6264 us | 0.4529 us
ManagedMaxAbsUPerf | 168.2 us | 1.1892 us | 1.0542 us
ManagedMaxAbsDiffUPerf | 179.7 us | 1.9884 us | 1.7626 us
ManagedDotUPerf | 258.1 us | 2.6630 us | 2.2237 us
ManagedDotSUPerf | 3,297.7 us | 23.2337 us | 19.4012 us
ManagedDist2Perf | 258.8 us | 3.9883 us | 3.5355 us
ManagedSdcaL1UpdateUPerf | 545.0 us | 10.7959 us | 17.1234 us
ManagedSdcaL1UpdateSUPerf | 13,624.1 us | 34.6645 us | 32.4252 us

Week 10-11 (Stretch) (Aug 20-31)

  • Provide software fallback implementations (stretch goals)
  • Respond to PR feedback for AVX intrinsics
  • Streamline perf test layout
  • Report average improvement in running time of intrinsics: 17.78%
  • Report improvement in running time of end-to-end real-life user scenarios: 13.88%
  • Get ML.NET to run on Raspberry Pi
  • Present on August 31 (11am-12nn 25/3365, also on Skype)
BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17134.228 (1803/April2018Update/Redstone4)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=2.2.100-refac-20180613-1
  [Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT

Toolchain=InProcessToolchain
Type | Method | Mean | Error | StdDev
--- | --- | --- | --- | ---
AvxPerformanceTests | AddScalarU | 152.8 us | 3.200 us | 2.993 us
NativePerformanceTests | AddScalarU | 183.6 us | 1.962 us | 1.739 us
SsePerformanceTests | AddScalarU | 188.7 us | 2.526 us | 2.363 us
AvxPerformanceTests | ScaleU | 172.6 us | 3.406 us | 3.497 us
NativePerformanceTests | ScaleU | 185.1 us | 3.683 us | 3.941 us
SsePerformanceTests | ScaleU | 189.8 us | 5.175 us | 5.083 us
AvxPerformanceTests | ScaleSrcU | 260.7 us | 4.639 us | 5.156 us
NativePerformanceTests | ScaleSrcU | 273.6 us | 4.780 us | 4.237 us
SsePerformanceTests | ScaleSrcU | 275.8 us | 3.545 us | 3.142 us
AvxPerformanceTests | ScaleAddU | 153.8 us | 3.020 us | 2.522 us
NativePerformanceTests | ScaleAddU | 204.2 us | 2.024 us | 1.794 us
SsePerformanceTests | ScaleAddU | 201.3 us | 2.281 us | 2.133 us
AvxPerformanceTests | AddScaleU | 277.9 us | 5.266 us | 6.268 us
NativePerformanceTests | AddScaleU | 321.4 us | 9.161 us | 7.650 us
SsePerformanceTests | AddScaleU | 322.5 us | 8.266 us | 16.121 us
AvxPerformanceTests | AddScaleSU | 4,433.3 us | 80.711 us | 75.498 us
NativePerformanceTests | AddScaleSU | 4,129.7 us | 81.846 us | 76.559 us
SsePerformanceTests | AddScaleSU | 4,718.3 us | 59.922 us | 50.038 us
AvxPerformanceTests | AddScaleCopyU | 447.0 us | 8.758 us | 10.086 us
NativePerformanceTests | AddScaleCopyU | 479.4 us | 5.484 us | 4.861 us
SsePerformanceTests | AddScaleCopyU | 481.8 us | 3.736 us | 3.312 us
AvxPerformanceTests | AddU | 283.1 us | 4.842 us | 4.292 us
NativePerformanceTests | AddU | 345.2 us | 2.573 us | 2.281 us
SsePerformanceTests | AddU | 343.6 us | 2.210 us | 2.067 us
AvxPerformanceTests | AddSU | 4,220.7 us | 56.154 us | 52.526 us
NativePerformanceTests | AddSU | 4,099.2 us | 51.729 us | 45.856 us
SsePerformanceTests | AddSU | 4,582.3 us | 46.382 us | 41.116 us
AvxPerformanceTests | MulElementWiseU | 452.8 us | 4.688 us | 4.156 us
NativePerformanceTests | MulElementWiseU | 461.9 us | 7.896 us | 7.000 us
SsePerformanceTests | MulElementWiseU | 461.7 us | 2.374 us | 1.982 us
AvxPerformanceTests | SumU | 164.5 us | 3.516 us | 4.186 us
NativePerformanceTests | SumU | 285.2 us | 3.615 us | 3.205 us
SsePerformanceTests | SumU | 283.0 us | 3.259 us | 2.889 us
AvxPerformanceTests | SumSqU | 165.2 us | 1.658 us | 1.550 us
NativePerformanceTests | SumSqU | 262.4 us | 2.580 us | 2.413 us
SsePerformanceTests | SumSqU | 261.7 us | 1.935 us | 1.810 us
AvxPerformanceTests | SumSqDiffU | 177.9 us | 2.384 us | 1.991 us
NativePerformanceTests | SumSqDiffU | 289.5 us | 2.508 us | 2.095 us
SsePerformanceTests | SumSqDiffU | 290.5 us | 1.844 us | 1.725 us
AvxPerformanceTests | SumAbsU | 178.5 us | 2.198 us | 1.948 us
NativePerformanceTests | SumAbsU | 260.4 us | 2.086 us | 1.951 us
SsePerformanceTests | SumAbsU | 268.2 us | 4.085 us | 3.821 us
AvxPerformanceTests | SumAbsDiffU | 186.0 us | 1.551 us | 1.451 us
NativePerformanceTests | SumAbsDiffU | 289.1 us | 1.975 us | 1.649 us
SsePerformanceTests | SumAbsDiffU | 299.8 us | 2.306 us | 2.044 us
AvxPerformanceTests | MaxAbsU | 177.1 us | 2.103 us | 1.864 us
NativePerformanceTests | MaxAbsU | 263.5 us | 1.667 us | 1.560 us
SsePerformanceTests | MaxAbsU | 267.9 us | 4.266 us | 3.562 us
AvxPerformanceTests | MaxAbsDiffU | 185.6 us | 1.796 us | 1.592 us
NativePerformanceTests | MaxAbsDiffU | 289.0 us | 2.099 us | 1.963 us
SsePerformanceTests | MaxAbsDiffU | 301.7 us | 3.368 us | 2.986 us
AvxPerformanceTests | DotU | 264.1 us | 4.975 us | 5.323 us
NativePerformanceTests | DotU | 344.0 us | 1.875 us | 1.662 us
SsePerformanceTests | DotU | 350.5 us | 2.387 us | 2.116 us
AvxPerformanceTests | DotSU | 3,289.4 us | 36.279 us | 32.160 us
NativePerformanceTests | DotSU | 3,381.5 us | 41.831 us | 39.129 us
SsePerformanceTests | DotSU | 3,766.5 us | 32.342 us | 28.670 us
AvxPerformanceTests | Dist2 | 266.8 us | 5.161 us | 4.310 us
NativePerformanceTests | Dist2 | 357.5 us | 3.980 us | 3.722 us
SsePerformanceTests | Dist2 | 373.4 us | 7.129 us | 7.321 us
AvxPerformanceTests | SdcaL1UpdateU | 559.8 us | 8.247 us | 7.311 us
NativePerformanceTests | SdcaL1UpdateU | 616.0 us | 9.798 us | 8.685 us
SsePerformanceTests | SdcaL1UpdateU | 630.3 us | 18.576 us | 54.772 us
AvxPerformanceTests | SdcaL1UpdateSU | 13,510.0 us | 104.569 us | 97.814 us
NativePerformanceTests | SdcaL1UpdateSU | 12,786.0 us | 70.993 us | 66.407 us
SsePerformanceTests | SdcaL1UpdateSU | 12,874.8 us | 391.218 us | 401.752 us

// * Warnings *
MultimodalDistribution
  SsePerformanceTests.SdcaL1UpdateU: Toolchain=InProcessToolchain -> It seems that the distribution is multimodal (mValue = 4.25)

// * Legends *
  Mean   : Arithmetic mean of all measurements
  Error  : Half of 99.9% confidence interval
  StdDev : Standard deviation of all measurements
  1 us   : 1 Microsecond (0.000001 sec)
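
The issue reports average running-time improvements but does not say how they were aggregated. One plausible reading, shown here as a sketch only, is the relative reduction in mean time per method, averaged over methods. The `Timing` struct and both function names are hypothetical, and this is not claimed to reproduce the 17.78% or 13.88% figures.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One benchmark pair: mean time of the baseline and of the candidate.
struct Timing {
    double baseline_us;   // e.g. the native implementation
    double candidate_us;  // e.g. the managed AVX implementation
};

// Relative reduction in running time, as a percentage of the baseline.
// Positive means the candidate is faster; negative means it is slower.
double improvement_percent(const Timing& t) {
    assert(t.baseline_us > 0.0);
    return (t.baseline_us - t.candidate_us) / t.baseline_us * 100.0;
}

// Unweighted average of the per-method improvements.
double mean_improvement_percent(const std::vector<Timing>& timings) {
    assert(!timings.empty());
    double total = 0.0;
    for (const Timing& t : timings) total += improvement_percent(t);
    return total / static_cast<double>(timings.size());
}
```

Applied to rows of the table above, SumU (native 285.2 us, AVX 164.5 us) comes out at roughly a 42% reduction, while AddScaleSU (native 4,129.7 us, AVX 4,433.3 us) comes out at roughly -7%, that is, slower than native. An unweighted average treats a 150 us kernel and a 13,000 us kernel equally; weighting by baseline time would give a different number.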

Week 12 (Sept 3-7)

  • Write blog post on how ML.NET is taking advantage of .NET Core hardware intrinsics, and AVX vs SSE comparisons (both implementation and runtime perf)
  • Document future perf enhancement measures (e.g. optimizing loops and alignment issues at the assembly/instruction level) in Suggestions on CpuMath enhancement briancylui/machinelearning#2
  • Clean up, presentation, close out remaining issues

shauheen added the enhancement label on Jul 19, 2018
shauheen added this to the 0718 milestone on Jul 19, 2018
shauheen changed the milestone from 0718 to 0818 on Aug 4, 2018
shauheen changed the milestone from 0818 to 0918 on Aug 31, 2018

danmoseley (Member) commented

@eerhardt does this issue serve any remaining purpose?

eerhardt (Member) commented

Nope, this work is now done. Brian logged separate issues for remaining work. This can be closed now.

@ghost ghost locked as resolved and limited conversation to collaborators Mar 29, 2022