[ML] Ensure the performance critical data are 16 byte aligned for data frame analyses #1142
… derivatives (we always merge before remapping, but this way is safe)
LGTM except for a couple of minor things
> The speedup is fairly stable and around a 15% mean improvement for a range of regression and binary classification tasks.
It will be interesting to see whether the same magnitude of speedup is also observed on Linux and Windows. The default malloc
on 64 bit Linux and Windows already returns 16 byte aligned memory, so these changes will only help for padding within large arrays, not for the start of the arrays. I don't have a feel for what proportion of the changed alignments are due to extra padding within arrays as opposed to the start address.
Agreed, and I haven't tested this yet. However, while that is a tiny proportion, the key point is to tell Eigen that the memory is aligned, via the new alignment template parameter. If you specify unaligned, which is the default, Eigen won't exploit the fact that the memory is aligned.
Eigen uses explicit intrinsic instructions for various operations. Currently, we enable SSE 4.2 (although we should perhaps consider AVX as well longer term). These instructions benefit considerably from 16 byte alignment of the memory backing the vector/matrix. Eigen ensures this for memory it allocates itself, but in a number of performance critical pieces of code we use mapped matrices and vectors to cut down on allocations (see here for documentation of this type). It is possible to supply the alignment to the Map class if this is known, but before this change we were supplying unaligned here, so loads/stores were not optimised.
This change adds support for alignments up to 32 bytes and chooses 16 byte alignment by default. There are two areas where I now have to manage alignment as a result: in the data frame and in the accumulators of the loss function gradients. The high-level strategy is as follows:
Use Eigen::aligned_allocator to ensure that the start of vector storage is aligned to 32 bytes. (Note that it is cleaner to manage the alignment of Eigen allocations globally, and I've selected 32 bytes because this is more future proof.) The pads are calculated by rounding up the capacity of the row to a multiple of the alignment and by rounding up the start position of the derivatives as necessary.
I've tested this across a range of benchmark sets on my i9 laptop. The speedup is fairly stable, around a 15% mean improvement for a range of regression and binary classification tasks. This does not change the results.