USE_TLS=0 not fork safe on Cygwin #2002
You seem to be conflating two options/issues here: one is the "new" TLS code from last year, and the other is the much older level 3 BLAS workload splitting that dates back to GotoBLAS2? It is entirely possible that additional threading from the level 3 code exposes bugs in the TLS code.
To clarify: #1765 corrects non-standard settings in Makefile.rule that were inadvertently imported from a local installation I had used to debug a particular problem earlier. USE_SIMPLE_THREADED_LEVEL3 is a fallback option that Kazushige Goto added 10+ years ago when he came up with the more elaborate level 3 threading code; it can be used when one suspects a race or other thread safety problem.
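To make the "simple" versus "elaborate" distinction concrete, here is a minimal sketch of the kind of split a simple threaded level 3 scheme performs: the columns of C are divided among threads and each thread runs an ordinary single-threaded GEMM on its own slice, so threads share no scratch buffers and synchronize only at the final join. Roughly speaking, the more elaborate scheme instead has threads share packed panels and wait on one another, which leaves more room for races. All names here (`gemm_job`, `gemm_columns`, `simple_threaded_gemm`) are hypothetical; OpenBLAS's real code uses packed buffers and optimized kernels, and this is not its implementation.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 8

/* Column-major storage: A is m x k, B is k x n, C is m x n. */
typedef struct {
    size_t m, n, k;
    const double *a, *b;
    double *c;
    size_t col_begin, col_end;  /* this thread's half-open slice of C's columns */
} gemm_job;

/* Ordinary single-threaded GEMM restricted to one column range:
 * C[:, j] = A * B[:, j] for col_begin <= j < col_end. */
static void *gemm_columns(void *arg) {
    const gemm_job *job = arg;
    for (size_t j = job->col_begin; j < job->col_end; j++) {
        for (size_t i = 0; i < job->m; i++) {
            double sum = 0.0;
            for (size_t p = 0; p < job->k; p++)
                sum += job->a[i + p * job->m] * job->b[p + j * job->k];
            job->c[i + j * job->m] = sum;
        }
    }
    return NULL;
}

/* Hypothetical "simple" split: each thread owns a disjoint block of columns,
 * no scratch buffers are shared, and the only synchronization is the join. */
static void simple_threaded_gemm(size_t m, size_t n, size_t k,
                                 const double *a, const double *b, double *c,
                                 int nthreads) {
    pthread_t tid[MAX_THREADS];
    gemm_job jobs[MAX_THREADS];
    if (nthreads < 1) nthreads = 1;
    if (nthreads > MAX_THREADS) nthreads = MAX_THREADS;
    size_t chunk = (n + (size_t)nthreads - 1) / (size_t)nthreads;
    for (int t = 0; t < nthreads; t++) {
        size_t begin = (size_t)t * chunk;
        if (begin > n) begin = n;                       /* surplus threads get empty ranges */
        size_t end = (begin + chunk < n) ? begin + chunk : n;
        jobs[t] = (gemm_job){m, n, k, a, b, c, begin, end};
        int rc = pthread_create(&tid[t], NULL, gemm_columns, &jobs[t]);
        assert(rc == 0);
        (void)rc;
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
}
```

For example, multiplying A = [[1, 3], [2, 4]] by B = [[5, 7], [6, 8]] (stored column-major as {1, 2, 3, 4} and {5, 6, 7, 8}) with three threads yields C stored as {23, 34, 31, 46}; the third thread simply receives an empty column range.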
Err, are you building with USE_TLS=1, or am I just jumping to conclusions based on the "new" in the original issue title and your earlier interest in helping to debug that code?
I guess what I should say is: I'm not conflating the two features; I understand that they are different. I only referred to "new" because the code that is used when … Some combination of the two might be it, though. Strangely, I just did a build with …
To clarify a bit more, the problem I'm having began when I upgraded OpenBLAS from v0.3.3 to v0.3.5. It was between those versions that the default build settings were changed. Interestingly, in v0.3.3 the default was …
Aha, and that still seems to be the magic combination. I rebuilt … The one combination I have not tried is just …
How does a build without USE_TLS fare? (Preferably without USE_SIMPLE_THREADED_LEVEL3 as well, and please do …
Okay, just compiling with … So I wonder if, in all the shuffling around of … It would still help if I knew exactly what the problem was, or at least had a simpler way to reproduce it...
Ooh yes, I see: you seem to have actually lost my patch to that file from #1450 in the process, so I was much closer to the mark in the first place than I realized. Relatedly, I am working on another patch to make OpenBLAS stop treating Cygwin as just another Windows (i.e. …
Oh crap. Sorry. The USE_TLS section has it in openblas_fork_handler, but the "legacy" section does not, as I "restored" that from an earlier snapshot. (I do wonder now what I picked there, as your #1450 must have been in develop for several months before the TLS patches landed. Time to run a few diffs...)
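Since the hangs come down to thread state carried across fork(), here is a minimal sketch of the general technique a fork handler relies on: a pthread_atfork "prepare" handler that shuts down and joins worker threads in the parent just before fork(), so the child starts single-threaded with consistent pool state and can restart the pool on demand. The pool and every name in it (`pool_start`, `pool_shutdown`, `fork_demo`) are hypothetical; this is not the code of openblas_fork_handler or of the #1450 patch.

```c
#include <assert.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NUM_WORKERS 2

/* Hypothetical worker pool standing in for a BLAS thread server. */
static pthread_t workers[NUM_WORKERS];
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t pool_cond = PTHREAD_COND_INITIALIZER;
static int pool_running = 0;  /* only touched by the thread that owns the pool */
static int pool_stop = 0;     /* read by workers under pool_lock */

/* Workers just park until told to stop; real ones would pick up jobs. */
static void *worker_main(void *arg) {
    (void)arg;
    pthread_mutex_lock(&pool_lock);
    while (!pool_stop)
        pthread_cond_wait(&pool_cond, &pool_lock);
    pthread_mutex_unlock(&pool_lock);
    return NULL;
}

static void pool_start(void) {
    if (pool_running) return;
    pool_stop = 0;  /* no workers exist yet, so no lock is needed here */
    for (int i = 0; i < NUM_WORKERS; i++) {
        int rc = pthread_create(&workers[i], NULL, worker_main, NULL);
        assert(rc == 0);
        (void)rc;
    }
    pool_running = 1;
}

/* Joins every worker, so afterwards the process is single-threaded and
 * no worker can still be holding pool_lock. */
static void pool_shutdown(void) {
    if (!pool_running) return;
    pthread_mutex_lock(&pool_lock);
    pool_stop = 1;
    pthread_cond_broadcast(&pool_cond);
    pthread_mutex_unlock(&pool_lock);
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(workers[i], NULL);
    pool_running = 0;
}

/* Forks once while the pool is live; returns the child's exit status. */
static int fork_demo(void) {
    static int handler_registered = 0;
    if (!handler_registered) {
        /* The prepare handler runs in the parent immediately before fork(). */
        if (pthread_atfork(pool_shutdown, NULL, NULL) != 0) return -1;
        handler_registered = 1;
    }
    pool_start();
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        /* Child: the pool was shut down cleanly, so restarting it is safe. */
        pool_start();
        pool_shutdown();
        _exit(0);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return -1;
    pool_start();    /* the parent restarts its own pool as well */
    pool_shutdown();
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The design choice is to make fork() happen only while the process is single-threaded; without the prepare handler, the child would inherit `pool_running == 1` with no worker threads behind it, and possibly a mutex locked by a thread that no longer exists, which is the kind of invalid state that shows up as a hang.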
It seems I used ba1f91f as the source of the "legacy" memory.c, so the current abomination lacks a few OpenMP optimizations (#1468), constructor attribute fixes for old compilers (#1501), ifdefs for building on BSD (#1504), and a minor fix for CPU miscounting on MIPS (#1520), in addition to your Cygwin patches.
It happens! And indeed, nothing serious. The problem I encountered is a relatively narrow case, and I'm just glad it's something that's already fixed and not some new, deep problem. The other good news is that setting … Though I would like to better understand exactly what that flag does. My broad understanding is that it moves some thread-specific, allocation-related data structures out of global variables and into TLS variables, so that their threads can access them more efficiently without going through locks. But I'm curious about the specifics. Nevertheless, I'm going to recommend for now that Sage enable it by default again, since AFAICT it works fine on the platforms we care about.
Yes, that was the general idea, but it was coupled with a reassessment of when and how much to allocate in different use cases (with and without OpenMP) and where locking would still be necessary. Not all of these assumptions were correct at first, and the "teething pains" were compounded by initial confusion over whether to prefer the glibc implementation of TLS or whatever the compiler offered. The current implementation is somewhat in limbo; the discussion on the unmerged PRs #1726 and #1739 (and the issues referenced therein) provides some background. It is probably not too far from the truth to sum it up as "it worked wonders for some but was completely broken for others, I never quite understood the new code, and oon3m0oo got a few nasty surprises from the old code calling into his new functions in unexpected ways". As the unexpected breakage affected packages ranging from Julia to Blender, I sort of threw in the towel after a crude attempt at "fixing" the latest iteration of #1739 and disabled the TLS code by default.
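The trade-off described in the last two comments can be illustrated with a minimal sketch, assuming a hypothetical scratch buffer that a computation needs temporarily. One variant keeps a single global buffer behind a mutex, so every caller serializes on the lock; the other gives each thread its own buffer through C11 `_Thread_local`, so no lock is needed. All names (`locked_sum`, `tls_sum`, `tls_demo`) are hypothetical, and this is not OpenBLAS's memory.c.

```c
#include <assert.h>
#include <pthread.h>

#define BUF_LEN 256
#define NUM_THREADS 4

/* Variant 1 (global + lock): one shared scratch buffer. A thread must hold
 * the lock for as long as it uses the buffer, so users are serialized. */
static double shared_buf[BUF_LEN];
static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;

static double locked_sum(const double *x, int n) {
    assert(n <= BUF_LEN);
    pthread_mutex_lock(&shared_lock);
    for (int i = 0; i < n; i++) shared_buf[i] = 2.0 * x[i];  /* scratch work */
    double s = 0.0;
    for (int i = 0; i < n; i++) s += shared_buf[i];
    pthread_mutex_unlock(&shared_lock);
    return s;
}

/* Variant 2 (TLS): every thread gets a private copy of this buffer, so the
 * same work needs no lock at all. */
static _Thread_local double tls_buf[BUF_LEN];

static double tls_sum(const double *x, int n) {
    assert(n <= BUF_LEN);
    for (int i = 0; i < n; i++) tls_buf[i] = 2.0 * x[i];
    double s = 0.0;
    for (int i = 0; i < n; i++) s += tls_buf[i];
    return s;
}

typedef struct { int id; double locked, tls; } result_t;

static void *worker(void *arg) {
    result_t *r = arg;
    double x[BUF_LEN];
    for (int i = 0; i < BUF_LEN; i++) x[i] = r->id + 1;  /* every element = id + 1 */
    r->locked = locked_sum(x, BUF_LEN);
    r->tls = tls_sum(x, BUF_LEN);
    return NULL;
}

/* Runs both variants concurrently; returns 1 if every thread got the
 * expected result from each. */
static int tls_demo(void) {
    pthread_t tid[NUM_THREADS];
    result_t res[NUM_THREADS];
    for (int t = 0; t < NUM_THREADS; t++) {
        res[t].id = t;
        int rc = pthread_create(&tid[t], NULL, worker, &res[t]);
        assert(rc == 0);
        (void)rc;
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(tid[t], NULL);
    for (int t = 0; t < NUM_THREADS; t++) {
        double expected = 2.0 * (t + 1) * BUF_LEN;
        if (res[t].locked != expected || res[t].tls != expected) return 0;
    }
    return 1;
}
```

Both variants compute the same thing; the difference is only where the scratch state lives and whether access must be serialized. The compiler keyword (`_Thread_local`, or GCC's `__thread`) is one way to get thread-local storage; the POSIX library route is `pthread_key_create` with `pthread_getspecific`/`pthread_setspecific`, which may be the kind of library-versus-compiler choice mentioned above.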
As discussed at https://trac.sagemath.org/ticket/27213, enabling the new (I understand now that it is not actually "new" at all) level 3 threading code, as done in #1765, has started to cause random hangs on Cygwin, particularly in code that forks new processes, likely because it leaves some thread-related primitives in an invalid state. This isn't the first time I've had issues like this (#1450), but it seems there's something in the new threading code that has bugs of this kind too.
I'm afraid I haven't had time yet to track down the exact bug, although I intend to. In the meantime, though, I think it would be best to set USE_SIMPLE_THREADED_LEVEL3=1 by default on Cygwin, as was previously the case.