OpenBLAS has issues solving triangular matrices on win32 #1270
Do I read that correctly: you are using a binary distribution of 0.2.19 from some third-party site (which could be one of the SourceForge mirror sites, but then again could be something else)? Could you retest with the current "develop" snapshot, which has a bunch of fixes vs. 0.2.19 (or at least build 0.2.19 yourself on your platform to see if it passes the built-in tests)? In any case I guess a minimal example run on a well-defined system would be preferable to a big unedited dump from an AppVeyor VM. |
That was built 12 days ago: scipy/scipy#7616 (comment) |
Is there any way to turn on debug output? |
Excuse me; the URL was apparently not correct. Retrying with 0.2.20 now. |
Not sure if 0.2.20 will build in your environment, depending on how and where you build. A cross-compile on Linux like the binaries provided on SourceForge will probably go okay; the issue is a botched commit that breaks things on non-glibc systems, waiting for xianyi to release 0.2.21. That is why I suggested taking a snapshot of "develop" (easily done by clicking on the green "Clone or download" button on the "Code" tab here). A normal build from source should run a few tests at the end automatically (though as a Linux guy I am not completely sure it does this on Windows as well). |
Thanks for the link to the scipy issue. From reading that, it seems your tests were (at least at some point in the history of that PR) already done with 0.2.20 (?) |
That's what I thought, but HEAD had 0.2.19. I updated to 0.2.20 just now.
That is my suspicion; @carlkl used a completely different build method and still had issues with OpenBLAS on win32 but not win_amd64. Of course it would be helpful to get more debug output in order to understand what is happening. |
Specifically he used cross compiling with gcc rather than building with MSVC. |
Cross compiling with gcc will currently get you the hand-optimized assembly functions (which use inline assembly and AT&T notation that MSVC does not support), building with MSVC will use generic C implementations instead (which are possibly still a bit faster than the netlib reference implementation). |
Specifically what lapack is imported by arpack? |
Lapack and BLAS both come from OpenBLAS. |
https://ci.appveyor.com/project/scipy/scipy/build/1.0.44/job/wix716oq76ft44r4 Looks like we still have issues:
It's not too difficult to reproduce: just download and install 32-bit Python, install the wheel from AppVeyor, and run the test. I can also instrument this if you want, just tell me what tooling I need to use. |
Yeah, I actually opened an issue on flang for that; if anyone has time, I would be greatly appreciative. |
Sure, and your binary numpy too, ten times. |
Sorry but I am not so familiar with debugging OpenBLAS. I am happy to provide a backtrace but I need assistance here. |
The problem is that I don't know how to debug an MSVC lib that has called into a gcc lib. |
@matthew-brett Perhaps you would be better at obtaining the backtrace? |
The OpenBLAS build is with a slightly aged Mingw variant over at https://anaconda.org/carlkl/mingwpy. See: https://mingwpy.github.io. The build runs on Windows (not a cross-compile). I've just set off a build from OpenBLAS master, in case that's useful: |
@xoviat - sadly I'm not experienced with this either. |
Excuse me? You pull binary numpy linked against unknown BLAS and say that specifically OpenBLAS is at fault. Without backtrace that sadly goes nowhere. |
Numpy is not being called here. We are calling OpenBLAS directly. |
Numpy is only used for numpy.distutils. |
I agree that this is impossible to solve without a backtrace. How can I get one? |
Usually you get debugger next to compiler. |
My guess is that this is going to be undebuggable: MSVC EXE (python) ==> MSVC DLL (*.pyd) ==> gcc DLL (OpenBLAS) |
Just to check we're on the same page:
Using this same rig, all running on 64-bit, we get no test failures or crashes. |
Can you break into the frozen process with a debugger? Is it reproducible? |
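Short of attaching a native debugger, one way to at least see where the test run hangs is Python's standard `faulthandler` module. The sketch below is only an illustration: `run_with_watchdog` is a hypothetical helper name, and the dump shows Python-level frames only, not the native stack inside the OpenBLAS DLL.

```python
import faulthandler
import sys


def run_with_watchdog(func, timeout_s, log_file=sys.stderr):
    """Run func(); if it is still running after timeout_s seconds,
    write the Python traceback of every thread to log_file.

    run_with_watchdog is a hypothetical helper, not part of scipy or OpenBLAS.
    log_file must be a real file object (it needs a file descriptor).
    """
    # Arm a watchdog thread; exit=False lets the process keep running
    # after the tracebacks have been written.
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=log_file)
    try:
        return func()
    finally:
        # Disarm the watchdog once func() has returned or raised.
        faulthandler.cancel_dump_traceback_later()
```

Wrapping the call that runs the hanging test in `run_with_watchdog` would show which Python line was executing when the timeout hit. Calling `faulthandler.enable()` at startup additionally dumps the Python traceback on a fatal error such as a segmentation fault.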
I think what would be ideal is a fortran test case linked against OpenBLAS with no numpy/scipy involved at all. This should also eliminate any risk of a conflict between mingw and msvc runtime libraries. |
If we are betting on buffer sizing, that sounds like a feasible theory. |
Thinking of #1141 here ? This definitely needs looking into. |
It seems that newer mingw-w64 versions can build Lapack without this error. This could be either a gcc bug or, more likely, a bug in mingw-w64 itself. Compiling OpenBLAS with a recent version of mingw-w64 and testing for this bug seems to be a reasonable workaround right now. |
@martin-frbg it needs backtrace and/or coretype, currently it is in state of trying to induce same testcase problem for whole day. |
For comparison I compiled openblas_v0.2.20_1e9247c (develop) with mingw-build gcc-7.1 for 32bit. The dpotrf testcase #1270 (comment) now runs without error, and all 32bit scipy wheels compiled with mingwpy also run without errors or failures. (These builds are different from those created by @xoviat and mentioned in the comment above.) It seems that mingw-w64 toolchains based on gcc-5.3 and gcc-4.9 have problems with Lapack. |
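A standalone check of the dpotrf case, with no numpy or scipy involved, could look roughly like the sketch below. It uses Python with ctypes rather than Fortran, so it still runs inside the Python process and does not remove the MSVC/mingw runtime mix. Several things here are assumptions: the library path passed in is hypothetical, the DLL is assumed to export `dpotrf_` with 32-bit integer arguments (the usual non-INTERFACE64 build), and the 3x3 matrix in the usage note is purely illustrative, not the matrix from the linked test case.

```python
import ctypes
import math


def reference_cholesky(a):
    """Pure-Python lower Cholesky factor of a symmetric positive
    definite matrix given as a list of rows."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for j in range(n):
        diag = a[j][j] - sum(low[j][k] ** 2 for k in range(j))
        low[j][j] = math.sqrt(diag)
        for i in range(j + 1, n):
            off = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = off / low[j][j]
    return low


def openblas_cholesky(lib_path, a):
    """Factor a with dpotrf_ from the shared library at lib_path.

    lib_path is a hypothetical path to an OpenBLAS DLL; c_int assumes
    a build with 32-bit integers.
    """
    n = len(a)
    lib = ctypes.CDLL(lib_path)
    # LAPACK expects column-major (Fortran) storage.
    flat = [a[i][j] for j in range(n) for i in range(n)]
    buf = (ctypes.c_double * (n * n))(*flat)
    uplo = ctypes.c_char(b"L")
    n_c = ctypes.c_int(n)
    lda = ctypes.c_int(n)
    info = ctypes.c_int(0)
    lib.dpotrf_(ctypes.byref(uplo), ctypes.byref(n_c), buf,
                ctypes.byref(lda), ctypes.byref(info))
    if info.value != 0:
        raise RuntimeError("dpotrf_ returned info=%d" % info.value)
    # Element (i, j) of the lower factor sits at buf[i + j * n].
    return [[buf[i + j * n] if i >= j else 0.0 for j in range(n)]
            for i in range(n)]


def max_abs_diff(x, y):
    """Largest absolute element-wise difference between two matrices."""
    return max(abs(p - q) for rx, ry in zip(x, y) for p, q in zip(rx, ry))
```

With a small matrix such as `[[4, 12, -16], [12, 37, -43], [-16, -43, 98]]`, comparing `openblas_cholesky(path, a)` against `reference_cholesky(a)` via `max_abs_diff` on a "good" and a "bad" 32-bit DLL would show whether the factorization itself goes wrong or the call hangs.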
It should be all fortran77 from netlib |
The mingw-w64 based builds mentioned in scipy/scipy#7616 (comment) do not exhibit this erroneous behaviour. Is there a need to experiment with netlib Lapack as well? I may find some time this weekend to debug the faulty build; let's see. |
we have almost no windows. Just to blame original build scripts or small adjustments made for building w openblas, or well windows compilers you use. |
OpenBLAS uses |
Guess we could add something like
in interface/lapack/potrf.c to work around the apparent miscompilation on affected platforms ? |
Actually all logs come from -march=native builds. Probably omitting all -O99 series would complete the build without a single problem |
I take it you are suggesting that compiling interface/lapack/potrf.c with -O0 instead of the default -O2 would solve the problem with older mingw versions ? Not sure how easy it would be to cater for that in the Makefiles (not sure if -march=native is actually what they are using or intending to use, if my understanding is correct they are preparing a distribution package of some sort). |
-march=pentium4 -msse2 -fpmath=sse ... |
Once again I am not sure I can follow your cryptic messages. My suggestion was to omit compilation of the problematic potrf interface if there is reason to believe the compiler is not up to that task. |
It compiles just fine with gcc 4.6 included with windows rtools (compilers for r packages). |
I'm not going to create a PR to OpenBLAS with a patch to something without further diagnosis and debugging. However, using mingw-w64 with gcc-7.1 seems to be a workaround for this issue right now. |
You don't even have the exact compiler options for each failed case. |
@brada4 alienating developers of other projects helps no one. Even if your gcc 4.6 from rtools happens to be identical to the mingw-w64 build of the same version number, this does nothing to refute the claim that at least the 4.9 and 5.3 versions miscompile potrf.c. If we really want to get to the bottom of this, I guess we would need the intermediate assembler (gcc -S) output for this file from both "good" and "bad" compilers, and someone who is capable of spotting the relevant differences. |
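Once both assembler listings exist, a first pass at comparing them can be automated. The sketch below is only an illustration using Python's standard `difflib`; the function names and the list of directives treated as noise are assumptions, and label numbering will still produce spurious differences that need a human eye.

```python
import difflib

# Directives assumed to differ between compilers without changing the
# generated code (an illustrative list, not an exhaustive one).
NOISE_PREFIXES = (".file", ".ident", ".def", ".loc")


def normalize_asm(text):
    """Return the listing as stripped lines, without blank lines,
    '#' comment lines, and the directives in NOISE_PREFIXES."""
    lines = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith(NOISE_PREFIXES):
            continue
        lines.append(line)
    return lines


def diff_asm(good_text, bad_text):
    """Unified diff between two normalized assembler listings."""
    return list(difflib.unified_diff(
        normalize_asm(good_text), normalize_asm(bad_text),
        fromfile="good", tofile="bad", lineterm=""))
```

The two inputs would be the contents of the `.s` files produced by `gcc -S` for interface/lapack/potrf.c with each compiler. The exact preprocessor and optimization flags the OpenBLAS Makefiles pass for that file are not shown in this thread, so they are not reproduced here.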
I don't have a compiler on Windows. I can try to cross-compile with an old and a new cross-compiler and run the test case on 64bit Windows... |
Please note that this issue appears to be specifically about mingw-w64 on Windows 32bit then. |
Yes, yes, 32bit dll+exe running on 64bit windows. |
I'm willing to dig into this. However, this will take some time, as there are some topics with higher priority for me right now. |
You are not alone. As a minimum find a linux distribution with good and recent gcc that produces production quality 32bit DLL |
Have there been any new insights since then? I tried to track any seemingly related numpy/scipy issues, but as far as I could tell the discussions there seem to have moved on to inadvertent use of the 80-bit extended precision Intel FPU mode causing problems in cdflib. So should we add a warning in the wiki to use at least version 7.1 of the mingw gcc for 32bit builds on Windows, or are there any indications that the problem may be bigger and in OpenBLAS itself? |
It turns out that we were compiling with a non-standard toolchain called mingwpy. We decided to drop that toolchain in favor of vanilla mingw. That seemed to resolve the problems. |
Using the 32 bit openblas causes timeouts while running the scipy tests: