Skip to content

USE_TLS=0 not fork safe on Cygwin #2002

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
embray opened this issue Feb 6, 2019 · 13 comments
Closed

USE_TLS=0 not fork safe on Cygwin #2002

embray opened this issue Feb 6, 2019 · 13 comments
Milestone

Comments

@embray
Copy link
Contributor

embray commented Feb 6, 2019

As discussed at https://trac.sagemath.org/ticket/27213, enabling the new (I understand now that it is not actually "new" at all) level 3 threading code as done in #1765 has started to cause random hangs on Cygwin, particularly in code that forks new processes, likely leaving some thread-related primitives in an invalid state.

This isn't the first time I've had issues like this (#1450) but it seems there's something in the new threading code that also has bugs like this.

I'm afraid I haven't had time yet to track down the exact bug, although I intend to. But in the meantime I think it would be best to set USE_SIMPLE_THREADED_LEVEL3=1 by default on Cygwin, as was previously the case.

@embray embray changed the title New level 3 multi-threading code not fork safe on Cygwin Non-simple level 3 multi-threading code not fork safe on Cygwin Feb 6, 2019
@martin-frbg
Copy link
Collaborator

You seem to be conflating two options/issues here - one is the "new" TLS code from last year and the other is the much older level 3 BLAS workload splitting that dates back to GotoBLAS2 ? It is entirely possible that additional threading from the level3 code exposes bugs in the TLS code.

@martin-frbg
Copy link
Collaborator

martin-frbg commented Feb 6, 2019

To clarify - #1765 corrects non-standard settings in Makefile.rule that were inadvertently imported from a local installation I had used to debug a particular problem earlier. USE_SIMPLE_THREADED_LEVEL3 is a fallback option that Kazushige Goto added 10+ years ago when he came up with the more elaborate level3 threading code, it can be used when one suspects a race or other thread safety problem.

@martin-frbg
Copy link
Collaborator

martin-frbg commented Feb 6, 2019

Err, are you building with USE_TLS=1 or am I just jumping to conclusions based on the "new" in the original issue title and your earlier interest in helping to debug that code ?

@embray
Copy link
Contributor Author

embray commented Feb 6, 2019

I guess, what I should say is, I'm not conflating the two features: I understand that they are different. I only referred to "new" because the code that is used when USE_SIMPLE_THEADED_LEVEL3=0 is new to me, since previously that flag was always 1 by default.

Some combination between the two might be it though. Strangely, I just did a build with USE_SIMPLE_THREADED_LEVEL3=1 and it did not fix the issue as I expected it might.

@embray
Copy link
Contributor Author

embray commented Feb 6, 2019

To clarify a bit more, whatever the problem is that I'm having began when upgrading openblas from v0.3.3 to v0.3.5. It was between those versions that the default build settings were changed.

Interestingly, in v0.3.3 the default was USE_SIMPLE_THREADED_LEVEL3=1 and USE_TLS=1 so that worked.

@embray
Copy link
Contributor Author

embray commented Feb 6, 2019

Ahah, and that is still the magic combination it seems. I rebuilt driver/others/memory.c with -DUSE_SIMPLE_THREADED_LEVEL3 -DUSE_TLS and now the problem has miraculously gone away.

The one combination I have not tried is just USE_TLS=1, without USE_SIMPLE_THREADED_LEVEL3=1 so I will try that as well just to see if it narrows things down any better.

@martin-frbg
Copy link
Collaborator

How does a build without USE_TLS fare ? (preferably without USE_SIMPLE_THREADED_LEVEL3 as well, and please do make clean between builds as the makefile logic is too involved to always do the right thing after some setting is changed). The TLS code cannot be considered stable yet, and the combination of settings that slipped into 0.3.3 came from my unsuccessful attempts to find a workaround. I take it you never used a version earlier than 0.3.3 ?

@embray embray changed the title Non-simple level 3 multi-threading code not fork safe on Cygwin USE_TLS=0 not fork safe on Cygwin Feb 6, 2019
@embray
Copy link
Contributor Author

embray commented Feb 6, 2019

Okay, just compiling with USE_TLS=1 seems to "fix" the hang I'm getting and level-3 threading doesn't seem to directly have anything to do with it.

So I wonder if, in all the shuffling around of memory.c, something actually broke in the old code. All the more reason to clean things up in that file.

It would still help if I knew exactly what the problem was or at least had a simpler way to reproduce it...

@embray
Copy link
Contributor Author

embray commented Feb 6, 2019

Ooh yes, I see, you seem to have actually lost my patch to that file from #1450 in the process, so I was much more on the nose with this in the first place than I originally knew.

Relatedly, I am working on another patch to make OpenBLAS stop treating Cygwin as just another Windows (i.e. OS_WINDOWS would not be defined on Cygwin). This is a little more complicated though and it isn't ready yet. This is because we do still use Windows x64 calling conventions on Cygwin.

@martin-frbg
Copy link
Collaborator

Oh crap. Sorry. The USE_TLS section has it in the openblas_fork_handler, but the "legacy" does not as I "restored" that from an earlier snapshot (I do wonder now what I picked there, as your #1450 must have been in develop for several months before the TLS patches landed. Time to run a few diffs...)

@martin-frbg
Copy link
Collaborator

Seems I used ba1f91f as the source of the "legacy" memory.c, so the current abomination lacks a few OpenMP optimizations (#1468), constructor attribute fixes for old compilers (#1501), ifdef's for building on BSD(#1504) and a minor fix for cpu miscounting on mips (#1520) in addition to your Cygwin patches.
What a mess I made, but typically nothing serious enough to make it blow up immediately...

@embray
Copy link
Contributor Author

embray commented Feb 7, 2019

It happens! And indeed, nothing serious. The problem I encountered is a relatively narrow case, and I'm just glad it's something that's already fixed and not some new, deep problem.

The other good news is that setting USE_TLS=1 works fine, and in fact despite being an "experimental" feature we've been using it implicitly for a while anyways and haven't found any problems.

Though I would like to better understand exactly what it is that that flag does. My broad understanding is that it moves some thread-specific memory allocation-related data structures into TLS variables and out of global variables so that they can be accessed more efficiently by their relevant threads without going through locks. But I'm curious about the specifics. Nevertheless, I'm going to recommend for now, for Sage, that we enable it by default again since AFAICT it works fine on platforms that we care about.

martin-frbg added a commit that referenced this issue Feb 7, 2019
* Restore dropped patches in the non-TLS branch of memory.c

As discovered in #2002, the reintroduction of the "original" non-TLS version of memory.c as an alternate branch had inadvertently used ba1f91f rather than a8002e2 , thereby dropping the commits for #1450, #1468, #1501, #1504 and #1520.
@martin-frbg
Copy link
Collaborator

Yes, that was the general idea, but it was coupled with a reassessment of when and how much to allocate in different use cases (with and without OpenMP) and where locking would still be necessary. Not all these assumptions were correct immediately, and the "teething pains" were compounded by initial confusion over whether to prefer the glibc implementation of TLS or whatever the compiler offered. The current implementation is somewhat in limbo with the discussion on the unmerged PRs #1726 and #1739 (and issues referenced therein) providing some background. It is probably not too far from the truth to sum it up as "it worked wonders for some but was completely broken for others, I never quite understood the new code and oon3m0oo got a few nasty surprises from the old code calling into his new functions in unexpected ways". As the unexpected breakage affected packages ranging from Julia to Blender I sort-of threw in the towel after a crude attempt at "fixing" the latest iteration of #1739 and disabled the TLS code by default.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants