
Commit b5a6d65

fix some poor link descriptions

1 parent df95370 commit b5a6d65

8 files changed, +15 -419 lines changed

_data/comments/icelake-zero-opt/entry1614101482805.yml

+1 -1
@@ -1,7 +1,7 @@
 _id: f26ed410-75fc-11eb-b1b1-b1e90cd6ebd3
 _parent: 'https://travisdowns.github.io/blog/2020/05/18/icelake-zero-opt.html'
 replying_to_uid: ''
-message: "Hi Travis,\r\n\r\nRe: [Why am I seeing more RFO (Read For Ownership) requests using REP MOVSB than with vmovdqa](https://stackoverflow.com/questions/66274948/why-am-i-seeing-more-rfo-read-for-ownership-requests-using-rep-movsb-than-with?noredirect=1#comment117262448_66274948)\r\n\r\nYour explination for why the [RFO-ND](https://community.intel.com/t5/Software-Tuning-Performance/What-is-the-ItoM-IDI-opcode-in-the-performance-monitoring-events/m-p/1253627#M7797) optimization and 0-over-0 write elimination optimization for mutally exclusive makes sense. What I am still curious about is why ```zmm``` fill0 seem to get worst of both worlds; being unable to make the RFO-ND optimization and having a worse rate of 0-over-0 write elimination than ```ymm``` fill0.\r\n\r\nHere is a layout of some observations that I think might be useful for figuring this out:\r\n\r\n1. ```rep stosb``` writing in 64 byte chunks is able to make the RFO-ND optimization but is not able to get the 0-over-0 write elminiation optimization.\r\n2. ```vmovdqa ymm, (reg); vmovdqa ymm, 32(reg)``` (assumining reg is 64 byte aligned) is unable to make the RFO-ND optimization and gets a high rate of 0-over-0 elimination optimization\r\n3. ```vmovdqa ymm, (reg); vmovdqa ymm, 64(reg); vmovdqa ymm, 32(reg); vmovdqa ymm, 96(reg)``` (assumining reg is 64 byte aligned) is unable to make the RFO-ND optimization and gets a *slightly higher* rate of 0-over-0 elimination optimization than case 2.\r\n4. ```vmovdqa zmm, (reg)``` is not able to make the RFO-ND optimization and has the worse rate of 0-over-0 elminination than both case 2 and case 3 above.\r\n\r\nCase 4 is strange because it gets the worst of both worlds. Intuitively if it was the RFO-ND that prevents 0-over-0 write elimination (i.e whats happening with ```rep stosb```) we would either expect to see some RFO-ND requests when using ```vmovdqa zmm, (reg)``` but we don't. Likewise if we not seeing RFO-ND requests why would its 0-over-0 elimination rate be lower.\r\n\r\nCases 2 and 3 I think are really interesting (reproducible by changing that order of stores [here](https://github.com/travisdowns/zero-fill-bench/blob/master/algos.cpp#L52)) and I might be useful in understanding case 4. The only difference between cases 2 and 3 that I can think of is that [case 2 can write coalesce in the LFB whereas case 3 cannot](https://www.realworldtech.com/forum/?threadid=173441&curpostid=192262).\r\n\r\nSo as a possible explinination for why case 3 gets a higher 0-over-0 write elimination rate than case 2 I was thinking something along the following for relationship between RFO requests and LFB coalescing. \r\n\r\nFor Case 2:\r\n\r\n- ```vmovdqa ymm, (reg)``` (write A) graduates in the store buffer and goes to LFB.\r\n- ```vmovdqa ymm, (reg)``` (write B) graduates in the store buffer and goes to LFB.\r\n- Write B coalesces with tail of LFB write A\r\n- Write AB makes an RFO request w/ data\r\n- Write AB is 64 bytes so when RFO returns something special happens that might ignore the data returned by RFO request.\r\n - We know its not fully optimizing out the check because case 2 has a relatively high 0-over-0 write elminination rate and case 4 has a non-zero rate.\r\n\r\nFor Case 3:\r\n\r\n- ```vmovdqa ymm, (reg)``` (write A) graduates in the store buffer and goes to LFB.\r\n- Write A makes an RFO request w/ data\r\n- ```vmovdqa ymm, (reg)``` (write B) graduates in the store buffer and goes to LFB.\r\n- Write B makes an RFO request w/ data\r\n- Write A is 32 bytes so when RFO returns it checks the data\r\n- Write B is 32 bytes so when RFO returns it checks the data\r\n - Since A and B where not coalesced we will never see the \"special\" circumstances where the RFO data is ignored.\r\n\r\n```\r\nperf stat --all-user -e cpu/event=0x24,umask=0xef,name=l2_rqsts_references/,cpu/event=0x24,umask=0xe2,name=l2_rqsts_all_rfo/,cpu/event=0x24,umask=0xc2,name=l2_rqsts_rfo_hit/,cpu/event=0x24,umask=0x22,name=l2_rqsts_rfo_miss/\r\n```\r\n(The data has the same trend with prefetched events counted)\r\n\r\n| Case | l2_rqsts_references | l2_rqsts_all_rfo | l2_rqsts_rfo_hit | l2_rqsts_rfo_miss |\r\n|------|---------------------|------------------|------------------|-------------------|\r\n| 2 | 4,449,489 | 4,325,110 | 3,320,352 | 1,004,758 |\r\n| 3 | 11,832,291 | 11,704,843 | 7,624,891 | 4,079,952 |\r\n| 4 | 4,402,628 | 4,254,720 | 3,263,507 | 991,213 |\r\n\r\nWe see here that case 2 and case 4 have basically the same RFO behavior which would indicate that writes A and B merge before the RFO request is made. What this doesn't make clear is:\r\n- whether the difference in 0-over-0 elimination rate between cases 2 and 3 is because there is a difference in the RFO requests, or whether the difference is related to how the RFO requests are handled upon returning.\r\n- where the difference between cases 2 and 4 emerges (i.e we can't just say an RFO request for a full cache line write has its data part ignored).\r\n\r\n\r\nThe only explination I can think of is if RFO prefetching takes place (mentioned a bit [here](https://stackoverflow.com/questions/61129773/how-do-the-store-buffer-and-line-fill-buffer-interact-with-each-other)) its possible you could try and explain the data with the following assumptions:\r\n- That RFO prefetching does in fact take place\r\n- That whether the RFO data can be ignored is a function of RFO configuration, not the size of the entry in the LFB.\r\n- Prefetched RFO requests are always configured to use the return data (unknown what's different about ```rep stosb``` and why this would be case).\r\n- Prefetched RFO requests don't show up on event counters.\r\n\r\nThere are basically four assumptions here any of which being untrue would blowup the theory, but if they all happened to be the case then we could explain the difference in 0-over-0 write elmination between cases 2 and 4 by saying that case 2 has many more chances to prefetch and the difference between cases 2 and 3 by saying that when case 2 fails both prefetches the coalesced RFO from writes AB will optimize out the data check.\r\n\r\n\r\nOverall having trouble wrapping my head around all of this and wondering if you have an idea what distinguishes the 4 cases above."
+message: "Hi Travis,\r\n\r\nRe: [Why am I seeing more RFO (Read For Ownership) requests using REP MOVSB than with vmovdqa](https://stackoverflow.com/questions/66274948/why-am-i-seeing-more-rfo-read-for-ownership-requests-using-rep-movsb-than-with?noredirect=1#comment117262448_66274948)\r\n\r\nYour explination for why the [RFO-ND](https://community.intel.com/t5/Software-Tuning-Performance/What-is-the-ItoM-IDI-opcode-in-the-performance-monitoring-events/m-p/1253627#M7797) optimization and 0-over-0 write elimination optimization for mutally exclusive makes sense. What I am still curious about is why ```zmm``` fill0 seem to get worst of both worlds; being unable to make the RFO-ND optimization and having a worse rate of 0-over-0 write elimination than ```ymm``` fill0.\r\n\r\nHere is a layout of some observations that I think might be useful for figuring this out:\r\n\r\n1. ```rep stosb``` writing in 64 byte chunks is able to make the RFO-ND optimization but is not able to get the 0-over-0 write elminiation optimization.\r\n2. ```vmovdqa ymm, (reg); vmovdqa ymm, 32(reg)``` (assumining reg is 64 byte aligned) is unable to make the RFO-ND optimization and gets a high rate of 0-over-0 elimination optimization\r\n3. ```vmovdqa ymm, (reg); vmovdqa ymm, 64(reg); vmovdqa ymm, 32(reg); vmovdqa ymm, 96(reg)``` (assumining reg is 64 byte aligned) is unable to make the RFO-ND optimization and gets a *slightly higher* rate of 0-over-0 elimination optimization than case 2.\r\n4. ```vmovdqa zmm, (reg)``` is not able to make the RFO-ND optimization and has the worse rate of 0-over-0 elminination than both case 2 and case 3 above.\r\n\r\nCase 4 is strange because it gets the worst of both worlds. Intuitively if it was the RFO-ND that prevents 0-over-0 write elimination (i.e whats happening with ```rep stosb```) we would either expect to see some RFO-ND requests when using ```vmovdqa zmm, (reg)``` but we don't. Likewise if we not seeing RFO-ND requests why would its 0-over-0 elimination rate be lower.\r\n\r\nCases 2 and 3 I think are really interesting (reproducible by [changing that order of stores here](https://github.com/travisdowns/zero-fill-bench/blob/master/algos.cpp#L52)) and I might be useful in understanding case 4. The only difference between cases 2 and 3 that I can think of is that [case 2 can write coalesce in the LFB whereas case 3 cannot](https://www.realworldtech.com/forum/?threadid=173441&curpostid=192262).\r\n\r\nSo as a possible explinination for why case 3 gets a higher 0-over-0 write elimination rate than case 2 I was thinking something along the following for relationship between RFO requests and LFB coalescing. \r\n\r\nFor Case 2:\r\n\r\n- ```vmovdqa ymm, (reg)``` (write A) graduates in the store buffer and goes to LFB.\r\n- ```vmovdqa ymm, (reg)``` (write B) graduates in the store buffer and goes to LFB.\r\n- Write B coalesces with tail of LFB write A\r\n- Write AB makes an RFO request w/ data\r\n- Write AB is 64 bytes so when RFO returns something special happens that might ignore the data returned by RFO request.\r\n - We know its not fully optimizing out the check because case 2 has a relatively high 0-over-0 write elminination rate and case 4 has a non-zero rate.\r\n\r\nFor Case 3:\r\n\r\n- ```vmovdqa ymm, (reg)``` (write A) graduates in the store buffer and goes to LFB.\r\n- Write A makes an RFO request w/ data\r\n- ```vmovdqa ymm, (reg)``` (write B) graduates in the store buffer and goes to LFB.\r\n- Write B makes an RFO request w/ data\r\n- Write A is 32 bytes so when RFO returns it checks the data\r\n- Write B is 32 bytes so when RFO returns it checks the data\r\n - Since A and B where not coalesced we will never see the \"special\" circumstances where the RFO data is ignored.\r\n\r\n```\r\nperf stat --all-user -e cpu/event=0x24,umask=0xef,name=l2_rqsts_references/,cpu/event=0x24,umask=0xe2,name=l2_rqsts_all_rfo/,cpu/event=0x24,umask=0xc2,name=l2_rqsts_rfo_hit/,cpu/event=0x24,umask=0x22,name=l2_rqsts_rfo_miss/\r\n```\r\n(The data has the same trend with prefetched events counted)\r\n\r\n| Case | l2_rqsts_references | l2_rqsts_all_rfo | l2_rqsts_rfo_hit | l2_rqsts_rfo_miss |\r\n|------|---------------------|------------------|------------------|-------------------|\r\n| 2 | 4,449,489 | 4,325,110 | 3,320,352 | 1,004,758 |\r\n| 3 | 11,832,291 | 11,704,843 | 7,624,891 | 4,079,952 |\r\n| 4 | 4,402,628 | 4,254,720 | 3,263,507 | 991,213 |\r\n\r\nWe see here that case 2 and case 4 have basically the same RFO behavior which would indicate that writes A and B merge before the RFO request is made. What this doesn't make clear is:\r\n- whether the difference in 0-over-0 elimination rate between cases 2 and 3 is because there is a difference in the RFO requests, or whether the difference is related to how the RFO requests are handled upon returning.\r\n- where the difference between cases 2 and 4 emerges (i.e we can't just say an RFO request for a full cache line write has its data part ignored).\r\n\r\n\r\nThe only explination I can think of is if RFO prefetching takes place (mentioned a bit [here](https://stackoverflow.com/questions/61129773/how-do-the-store-buffer-and-line-fill-buffer-interact-with-each-other)) its possible you could try and explain the data with the following assumptions:\r\n- That RFO prefetching does in fact take place\r\n- That whether the RFO data can be ignored is a function of RFO configuration, not the size of the entry in the LFB.\r\n- Prefetched RFO requests are always configured to use the return data (unknown what's different about ```rep stosb``` and why this would be case).\r\n- Prefetched RFO requests don't show up on event counters.\r\n\r\nThere are basically four assumptions here any of which being untrue would blowup the theory, but if they all happened to be the case then we could explain the difference in 0-over-0 write elmination between cases 2 and 4 by saying that case 2 has many more chances to prefetch and the difference between cases 2 and 3 by saying that when case 2 fails both prefetches the coalesced RFO from writes AB will optimize out the data check.\r\n\r\n\r\nOverall having trouble wrapping my head around all of this and wondering if you have an idea what distinguishes the 4 cases above."
 name: Noah Goldstein
 email: 5c6c5e08ed042ab5db692956c8c768c2
 hp: ''

_data/comments/now-with-comments/entry1584922563808.yml

+1 -1
@@ -2,7 +2,7 @@ _id: 7b707640-6c9b-11ea-9683-37803f7596d9
 _parent: 'https://travisdowns.github.io/blog/2020/02/05/now-with-comments.html'
 replying_to: ''
 replying_to_uid: ''
-message: "First of all, thanks for the very details post!\r\nI want to mention my experience which went against footnote 8: I initially accepted the invitation from the bot account via the Github linked but kept invitation not found when trying to connect it. Then i came across [this](https://www.flyinggrizzly.io/2017/12/setting-up-staticman-server/) which says I should not accept it so after I tried removed the bot as a collaborator and tried to connect again it worked\r\n\r\nOnce again thanks"
+message: "First of all, thanks for the very details post!\r\nI want to mention my experience which went against footnote 8: I initially accepted the invitation from the bot account via the Github linked but kept invitation not found when trying to connect it. Then i came across [https://www.flyinggrizzly.io/2017/12/setting-up-staticman-server/](https://www.flyinggrizzly.io/2017/12/setting-up-staticman-server/) which says I should not accept it so after I tried removed the bot as a collaborator and tried to connect again it worked\r\n\r\nOnce again thanks"
 name: nazaal
 email: ''
 hp: ''

_data/comments/now-with-comments/entry1595821332796.yml

+1 -1
@@ -1,7 +1,7 @@
 _id: 280f0b50-cfbb-11ea-89f3-dfba9d852500
 _parent: 'https://travisdowns.github.io/blog/2020/02/05/now-with-comments.html'
 replying_to_uid: 731184d0-cf51-11ea-adfe-e944f0d58139
-message: "Right, so this is kind of the least standard part of the integration, because unlike the API bridge, the data format, the integration with GitHub and so on, how you want to do this depends a lot on how you generate your blog, and how you want the comments section to look, etc.\r\n\r\nStill, I tried to cover the case of integrating it with a typical Jekyll blog [here](https://travisdowns.github.io/blog/2020/02/05/now-with-comments.html#integrate-comments-into-site). This is the integration I used: you have to copy the supporting files to your repository, and the modify some part of your post template or foot or whatever to include the comments.\r\n\r\nIs there any particular part you are stuck on? Note that I'm neither an HTML, CSS or JavaScript expert (which probably shows in the relatively vanilla appearance of the integration)."
+message: "Right, so this is kind of the least standard part of the integration, because unlike the API bridge, the data format, the integration with GitHub and so on, how you want to do this depends a lot on how you generate your blog, and how you want the comments section to look, etc.\r\n\r\nStill, I tried to cover [the case of integrating it with a typical Jekyll blog](https://travisdowns.github.io/blog/2020/02/05/now-with-comments.html#integrate-comments-into-site). This is the integration I used: you have to copy the supporting files to your repository, and the modify some part of your post template or foot or whatever to include the comments.\r\n\r\nIs there any particular part you are stuck on? Note that I'm neither an HTML, CSS or JavaScript expert (which probably shows in the relatively vanilla appearance of the integration)."
 name: Travis Downs
 email: c6937532928911c0dae3c9c89b658c09
 hp: ''
