[ML] Ensure the performance critical data are 16 byte aligned for data frame analyses #1142


Merged
merged 26 commits on Apr 16, 2020

Conversation

@tveasey (Contributor) commented Apr 15, 2020

Eigen uses explicit intrinsic instructions for various operations. Currently, we enable SSE 4.2 (although we should perhaps consider AVX as well in the longer term). These instructions benefit considerably from 16 byte alignment of the memory backing the vector/matrix. Eigen ensures this for memory it allocates itself, but in a number of performance critical pieces of code we use mapped matrices and vectors to cut down on allocations (see here for documentation of this type). It is possible to supply the alignment to the map class if it is known, but before this change we were supplying unaligned here, so loads/stores were not optimised.
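For context, here is a minimal sketch (not the project's actual code) of what the alignment template parameter on Eigen::Map controls. The type and option names are real Eigen API; the function and buffer names are made up for illustration:

```cpp
#include <Eigen/Core>

#include <cstddef>

// Eigen::Map wraps existing storage without allocating. The second template
// parameter tells Eigen what alignment it may assume for that storage;
// Eigen::Unaligned (the default) forces unaligned loads/stores even if the
// buffer happens to be 16 byte aligned.
using TAlignedMap = Eigen::Map<Eigen::VectorXd, Eigen::Aligned16>;
using TUnalignedMap = Eigen::Map<Eigen::VectorXd, Eigen::Unaligned>;

void addOne(double* storage, std::ptrdiff_t size) {
    // Only valid if 'storage' really is 16 byte aligned.
    TAlignedMap values{storage, size};
    values.array() += 1.0;
}
```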

This change adds support for alignments up to 32 bytes and chooses 16 byte alignment by default. There are two areas where I now have to manage alignment as a result: in the data frame and in the accumulators of the loss function gradients. The high-level strategy is as follows:

  1. Use the Eigen::aligned_allocator to ensure that the start of vector storage is aligned to 32 bytes. (Note that it is cleaner to manage the alignment of Eigen allocations globally, and I've selected 32 bytes because this is more future-proof.)
  2. Insert pads into the vectors so that we maintain the alignment of the addresses of the start of the rows and the start of the accumulated loss derivatives.

The pads are calculated by rounding up the capacity of the row to a multiple of the alignment and by rounding up the start position of derivatives as necessary.
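A minimal sketch of the two points above, using hypothetical names (ALIGNMENT, paddedCapacity and TAlignedDoubleVec are illustrative, not the names used in the change); Eigen::aligned_allocator is real Eigen API and aligns to EIGEN_MAX_ALIGN_BYTES:

```cpp
#include <Eigen/Core>

#include <cstddef>
#include <vector>

// Assumed 16 byte alignment, the default this change selects.
constexpr std::size_t ALIGNMENT{16};
constexpr std::size_t VALUES_PER_ALIGNMENT{ALIGNMENT / sizeof(double)};

// Round the row capacity up to a multiple of the alignment so that every row
// starts on an aligned address, provided the storage itself starts aligned.
constexpr std::size_t paddedCapacity(std::size_t numberColumns) {
    return VALUES_PER_ALIGNMENT *
           ((numberColumns + VALUES_PER_ALIGNMENT - 1) / VALUES_PER_ALIGNMENT);
}

// Eigen's aligned allocator guarantees the start of the storage is aligned
// (to EIGEN_MAX_ALIGN_BYTES); the per-row padding then preserves alignment
// for each subsequent row.
using TAlignedDoubleVec = std::vector<double, Eigen::aligned_allocator<double>>;

int main() {
    std::size_t numberRows{100};
    std::size_t numberColumns{11}; // 11 doubles pad to 12, i.e. 96 bytes.
    TAlignedDoubleVec storage(numberRows * paddedCapacity(numberColumns), 0.0);
    return 0;
}
```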

I've tested this across a range of benchmark sets on my i9 laptop. The speed-up is fairly stable, around a 15% mean improvement for a range of regression and binary classification tasks. This does not change the results.

@droberts195 (Contributor) left a comment
LGTM except for a couple of minor things

The speed up is fairly stable and around a 15% mean improvement for a range of regression and binary classification tasks.

It will be interesting to see whether the same magnitude of speedup is also observed on Linux and Windows. The default malloc on 64 bit Linux and Windows already returns 16 byte aligned memory, so these changes will only help for padding within large arrays, not for the start of the arrays. I don't have a feel for what proportion of the changed alignments are due to extra padding within arrays as opposed to the start address.

@tveasey (Contributor, Author) commented Apr 16, 2020

The default malloc on 64 bit Linux and Windows already returns 16 byte aligned memory, so these changes will only help for padding within large arrays, not for the start of the arrays.

Agreed, and I haven't tested this yet. However, that is a tiny proportion of it; more importantly, the key is to tell Eigen that the memory is aligned, via the new alignment template parameter. If you specify unaligned, the default, Eigen won't exploit the fact that the memory is aligned.
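To illustrate the point (an assumed helper, not code from this change): even if malloc already returned 16 byte aligned memory, Eigen only emits aligned loads/stores when the Map type declares the alignment, so it has to be stated at compile time:

```cpp
#include <Eigen/Core>

#include <cstddef>
#include <cstdint>

// Hypothetical check: is the address 16 byte aligned?
inline bool isAligned16(const double* address) {
    return reinterpret_cast<std::uintptr_t>(address) % 16 == 0;
}

double sum(const double* values, std::ptrdiff_t size) {
    // Only a Map declared Aligned16 lets Eigen use aligned loads; the runtime
    // alignment of 'values' alone is not enough.
    if (isAligned16(values)) {
        return Eigen::Map<const Eigen::VectorXd, Eigen::Aligned16>{values, size}.sum();
    }
    return Eigen::Map<const Eigen::VectorXd, Eigen::Unaligned>{values, size}.sum();
}
```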
