
Non-deterministic Libblastrampoline issue (?) when using SpecialFunctions, MKL #845


Closed
pablosanjose opened this issue May 11, 2021 · 26 comments
Labels
system:mac Affects only macOS

Comments

@pablosanjose
Contributor

I've hit an issue on v1.7 (not in 1.6 or below) which I suspect has to do with the new libblastrampoline mechanism. The issue arises when using both SpecialFunctions.jl and MKL.jl, in that order. The symptom is that A\B gives results that are neither correct nor deterministic, as in the following minimal example.

julia> using SpecialFunctions

julia> using MKL

julia> A = [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -5 -5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0; 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -5 -5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0; 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -5 -5 0 0 0 0 0 0 0 0 0 0 0 0 0 0; 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -5 -5 0 0 0 0 0 0 0 0 0 0 0 0 0; 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -5 -5 0 0 0 0 0 0 0 0 0 0 0 0; 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -5 0 0 0 0 0 0 0 0 0 0 0 0; 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 -5 0 0 0 0 0 -5 -5 0 0 0 0 0 0 0 0 0 0; 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 -5 0 0 0 0 0 -5 -5 0 0 0 0 0 0 0 0 0; 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 -5 0 0 0 0 0 -5 -5 0 0 0 0 0 0 0 0; 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 -5 0 0 0 0 0 -5 -5 0 0 0 0 0 0 0; 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 -5 0 0 0 0 0 -5 -5 0 0 0 0 0 0; 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 -5 0 0 0 0 0 -5 0 0 0 0 0 0; 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 -5 0 0 0 0 0 -5 -5 0 0 0 0; 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 -5 0 0 0 0 0 -5 -5 0 0 0; 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 -5 0 0 0 0 0 -5 -5 0 0; 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 -5 0 0 0 0 0 -5 -5 0; 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 -5 0 0 0 0 0 -5 -5; 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 -5 0 0 0 0 0 -5; -5 0 0 0 0 0 -5 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0; -5 -5 0 0 0 0 0 -5 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0; 0 -5 -5 0 0 0 0 0 -5 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0; 0 0 -5 -5 0 0 0 0 0 -5 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0; 0 0 0 -5 -5 0 0 0 0 0 -5 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0; 0 0 0 0 -5 -5 0 0 0 0 0 -5 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0; 0 0 0 0 0 0 -5 0 0 0 0 0 -5 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0; 0 0 0 0 0 0 -5 -5 0 0 0 0 0 -5 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0; 0 0 0 0 0 0 0 -5 -5 0 
0 0 0 0 -5 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0; 0 0 0 0 0 0 0 0 -5 -5 0 0 0 0 0 -5 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0; 0 0 0 0 0 0 0 0 0 -5 -5 0 0 0 0 0 -5 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0; 0 0 0 0 0 0 0 0 0 0 -5 -5 0 0 0 0 0 -5 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0; 0 0 0 0 0 0 0 0 0 0 0 0 -5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0; 0 0 0 0 0 0 0 0 0 0 0 0 -5 -5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0; 0 0 0 0 0 0 0 0 0 0 0 0 0 -5 -5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0; 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -5 -5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0; 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -5 -5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0; 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -5 -5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1];

julia> B = [1 0 0 0 0; 0 1 0 0 0; 0 0 1 0 0; 0 0 0 1 0; 0 0 0 0 1; 0 0 0 0 0; 0 0 0 0 0; 0 0 0 0 0; 0 0 0 0 0; 0 0 0 0 0; 0 0 0 0 0; 0 0 0 0 0; 0 0 0 0 0; 0 0 0 0 0; 0 0 0 0 0; 0 0 0 0 0; 0 0 0 0 0; 0 0 0 0 0; 0 0 0 0 0; 0 0 0 0 0; 0 0 0 0 0; 0 0 0 0 0; 0 0 0 0 0; 0 0 0 0 0; 0 0 0 0 0; 0 0 0 0 0; 0 0 0 0 0; 0 0 0 0 0; 0 0 0 0 0; 0 0 0 0 0; 0 0 0 0 0; 0 0 0 0 0; 0 0 0 0 0; 0 0 0 0 0; 0 0 0 0 0; 0 0 0 0 0];

julia> sum(abs, A \ B)
64.17866365948649

julia> sum(abs, A \ B)
109.33456919955461

julia> sum(abs, A \ B)
83.00102106669063

The correct result is 24.772054506344052

The issue disappears if I remove either of the using lines, or if I load MKL before SpecialFunctions. Quite strange!

I haven't been able to make a more minimal example of the issue yet, so I'm not sure where the problem lies (perhaps it is not even in Base, I'm not sure).
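A minimal sketch of how the symptom above could be checked programmatically rather than by comparing sums by eye: solve the system several times, verify each solution against its residual, and compare the runs with each other. The helper names `residual_ok` and `solve_is_reproducible` and the tolerance values are hypothetical choices for illustration, not part of any package.

```julia
using LinearAlgebra

# Accept X as a solution of A*X = B if the residual is small relative to B.
residual_ok(A, B, X; rtol = 1e-8) = norm(A * X - B) <= rtol * max(norm(B), 1)

# Solve the same system `trials` times. Return true only if every run passes
# the residual check and agrees with the first run up to `rtol`.
function solve_is_reproducible(A, B; trials = 5, rtol = 1e-8)
    reference = A \ B
    residual_ok(A, B, reference; rtol = rtol) || return false
    for _ in 2:trials
        X = A \ B
        residual_ok(A, B, X; rtol = rtol) || return false
        isapprox(X, reference; rtol = rtol) || return false
    end
    return true
end
```

With the `A` and `B` defined above, `solve_is_reproducible(A, B)` would be expected to return `false` in a session that shows the problem and `true` in a healthy one.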

@ViralBShah
Member

cc @staticfloat

@staticfloat
Member

@pablosanjose Does this still happen for you? If so:

  • What operating system are you using?
  • Can you post a Manifest.toml with just the two packages (SpecialFunctions and MKL) added that reproduces the issue for you?
  • What gitsha of Julia are you using?

I can't reproduce this on x86_64 Linux with current master (58ffe7e3ed3a93a9d816097548e785284f57fbd4) and the following Manifest.toml/test file: https://gist.github.com/staticfloat/346f3786c1981f8714df93b7d7027d90

@ViralBShah
Member

ViralBShah commented Jun 2, 2021

I'm getting the non-deterministic behaviour on mac, in exactly the same way as reported. It does not happen without MKL or if I load MKL before SpecialFunctions.

I have MKL 0.4.0 and SpecialFunctions 1.4.2, and julia 28e30a3953

I also get the same exact incorrect answers for the first two evaluations, but then different ones (and then they all repeat)

@pablosanjose
Contributor Author

Hi, yes, still happening here, with latest nightly from the julialang webpage on macOS, and using your exact Manifest, @staticfloat. I also tried on Windows, and there the issue does not arise. My versioninfo:

Julia Version 1.7.0-DEV.1185
Commit d692b897be (2021-05-28 07:35 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.5.0)
  CPU: Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.0 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code

@ViralBShah
Member

I'm updating openspecfun_jll and SpecialFunctions to address JuliaPackaging/BinaryBuilderBase.jl#92. This is just a hunch, and it needed to be done anyway, but I haven't done any debugging to confirm whether this is the issue.

@ViralBShah
Member

Doesn't help, because that only comes into effect with strings in Fortran.

@staticfloat
Member

I tracked the root cause of this down to this issue: JuliaPackaging/BinaryBuilder.jl#700 (comment)

Essentially, there is a conflict between libiomp and libgomp, which both export the same set of symbols. The reason this "worked" on Julia 1.6 and earlier is that MKL.jl required you to rebuild the system image and then loaded libiomp (as part of MKL) at startup, so there was no way for you to load libgomp through SpecialFunctions.jl first. The "solutions" that we can enact to fix this are the same as described in that comment.

@ViralBShah
Member

Could you possibly use LBT to pick between libiomp and libgomp, or will everything freak out?

@ViralBShah
Member

Is loading MKL first a good and reliable solution? Is that something we should document in the MKL.jl README?

@staticfloat
Member

> Could you possibly use LBT to pick between libiomp and libgomp, or will everything freak out?

This issue has nothing to do with LBT; it is purely an incompatibility between libiomp (which is a dependency of MKL) and libgomp (which is a dependency of libopenspecfun).

> Is loading MKL first a good and reliable solution? Is that something we should document in the MKL.jl README?

We could... but as said in the comment I linked to, this only works because most software isn't using libgomp very thoroughly. As soon as you have a package that does use it thoroughly, this won't work and there will be no way to fix it. The libraries are just fundamentally incompatible. There's a possibility we can link things in such a way that they don't conflict as much, but I'm not that hopeful.

@giordano
Contributor

giordano commented Jun 2, 2021

Documenting the fundamental incompatibility may still be a good idea, since it pops up every now and then

@staticfloat
Member

Could have MKL.jl check to see if libgomp is loaded, and if it is, throw out a warning.
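A rough sketch of what such a check might look like, using only `Libdl.dllist()` from the standard library. The function names `loaded_libs_matching` and `warn_if_libgomp_loaded` are hypothetical, the match is a plain substring test on the file name, and this is not how MKL.jl actually implements anything.

```julia
using Libdl

# Paths of currently loaded shared libraries whose file name contains `needle`.
# `libs` defaults to everything the dynamic loader reports for this process.
function loaded_libs_matching(needle::AbstractString; libs = Libdl.dllist())
    return filter(path -> occursin(needle, basename(path)), libs)
end

# Warn if a GNU OpenMP runtime is already present; return true if one was found.
function warn_if_libgomp_loaded(; libs = Libdl.dllist())
    hits = loaded_libs_matching("libgomp"; libs = libs)
    if !isempty(hits)
        @warn "libgomp is already loaded; it may conflict with libiomp from MKL" hits
        return true
    end
    return false
end
```

The `libs` keyword exists only so the logic can be exercised with a fixed list of paths; in a real session one would call `warn_if_libgomp_loaded()` with no arguments.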

@vtjnash
Member

vtjnash commented Jun 3, 2021

Are we causing mac to use different RTLD_LOCAL flags from the default dyld behavior? It seems that 2-level linking should prevent either library from seeing that the other is loaded.

@giordano
Contributor

giordano commented Jun 3, 2021

The default flags in JLL packages are RTLD_LAZY | RTLD_DEEPBIND

@pablosanjose
Contributor Author

I don't know if it is related to this issue, but FWIW I keep seeing non-deterministic (wrong) behavior using MKL on 1.7rc1, regardless of whether SpecialFunctions is loaded. Everything works fine with OpenBLAS... I know this is not very useful, sorry, the issue is difficult to isolate.

@ViralBShah
Member

ViralBShah commented Nov 17, 2021

MKL has a libmkl_gf_ilp64 library. That is the version of MKL that is compatible with gfortran.

@staticfloat Would this be a straightforward issue of just using libmkl_gf_ilp64 instead of libmkl_rt in libblastrampoline?

@ViralBShah
Member

ViralBShah commented Nov 17, 2021

I only see the issue on mac, but there is no such gf version of MKL available on mac. On Linux, everything seems to work fine.

Given how grave the issue is, would it be wiser not to provide MKL on mac at all? Can someone also try this out on Windows?

@ViralBShah
Member

ViralBShah commented Dec 27, 2021

@pabloferz If I do MKL.set_threading_layer(MKL.THREADING_GNU), I get reliable results. Can you try? I don't know if that effectively turns off threading, but it at least works correctly.

@pablosanjose
Contributor Author

No, I get the same as before

@ViralBShah
Member

Hmm, that fixed it for me. What about THREADING_SEQUENTIAL? You may have to change it in MKL.jl to be certain. See:

https://github.com/JuliaLinearAlgebra/MKL.jl/pull/99/files

@pablosanjose
Contributor Author

Ah! It does seem that your PR fixes this! Care to explain why/how?

@ViralBShah
Member

Sequential MKL fixes the issue with threading discussed above in #845

To the best of my knowledge, we only need to do this for MKL on mac.

@pablosanjose
Contributor Author

Uhm, I see. But this has severe performance implications. With this patch we lose multithreaded matmul, for example, even if we don't use SpecialFunctions. I don't think that's an acceptable tradeoff.

@ViralBShah
Member

ViralBShah commented Dec 27, 2021

This is not in our control, unfortunately. Maybe there are other ways to fix this. Note that this is only a mac issue, and it is important for the default behaviour to be correct.

@ViralBShah ViralBShah added the system:mac Affects only macOS label Dec 27, 2021
@ViralBShah
Member

The solution is to always load MKL before everything else if you end up using Fortran libraries. It is now documented in https://github.com/JuliaLinearAlgebra/MKL.jl#usage

I have also emailed Intel to ask whether there are better ways for libgomp and Intel OMP to co-exist.
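For illustration, a small sketch of the load order described above, plus a way to confirm which BLAS backend libblastrampoline is currently forwarding to. It assumes Julia 1.7 or later (where `BLAS.get_config()` is available); `active_blas_libs` and `mkl_is_active` are hypothetical helper names, and the check is a simple file-name heuristic rather than an official API.

```julia
# Intended load order in a fresh session, per the advice above:
#   using MKL                # first, so libiomp is loaded before any libgomp user
#   using SpecialFunctions   # afterwards
using LinearAlgebra

# File names of the BLAS/LAPACK libraries libblastrampoline currently forwards to.
active_blas_libs() = [basename(lib.libname) for lib in BLAS.get_config().loaded_libs]

# True if any active backend looks like MKL, judged only by its file name.
mkl_is_active() = any(name -> occursin("mkl", lowercase(name)), active_blas_libs())
```

Note that this only reports which backend is active; it cannot tell whether MKL was loaded before or after a package that depends on libgomp.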

@ctkelley

The problem in #845 does not happen for me on 1.8.2, but it does on 1.9. I'm loading everything in the same order.

@KristofferC KristofferC transferred this issue from JuliaLang/julia Nov 26, 2024