Skip to content

Use of long double in OpenBLAS #1711

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
brianborchers opened this issue Aug 4, 2018 · 14 comments
Closed

Use of long double in OpenBLAS #1711

brianborchers opened this issue Aug 4, 2018 · 14 comments

Comments

@brianborchers
Copy link

Pull Request #1709 surprised me by showing that long double variables are used in some of the OpenBLAS code.

Although gcc on x86_64 uses 80 bit Intel extended precision for these, the performance is slow in comparison to AVX2 or AVX512 SIMD. Other x86_64 compilers might just use IEEE double precision for long double. On many other architectures there's no hardware support for an extended precision long double (and the compiler often turns long double into double precision.) It's also possible that some compiler might implement a quadruple precision long double in software, but that would be far slower than the x86_64 80 bit extended precision.

Thus there's no real way to count on this helping with precision, and furthermore, if you do get some kind of extended precision using long double it might have poor performance. A final concern is that users of BLAS/LAPACK routines don't expect extended precision and might be surprised by results which are too accurate.

Which of the BLAS/LAPACK functions in OpenBLAS make use of long double? How is it handled by the different architectures and compilers?

@martin-frbg
Copy link
Collaborator

From a quick grep, there appears to be an x86-specific reimplementation of sqrt() in common_x86.h that uses long double, and rotg.c/zrotg.c that were highlighted in #1709 (where the use of "long double" is confined to x86/x86_64 platforms). Possibly the motivation for using long doubles there may have been the extended data range rather than the extended precision, but as I wrote in the PR this code is unchanged from K.Goto's work.
(There are provisions for compiling OpenBLAS with EXPRECISION=1, which apparently turns on support for 128bit "long doubles", but this option appears to be off by default and again dates back to GotoBLAS2.)

@carlkl
Copy link

carlkl commented Aug 5, 2018

long double == double for the Microsoft MSVC compilers for x86/x86_64 platforms.
long double = extended precision for the GCC compilers for x86/x86_64 platforms (mingw-w64, cygwin) unless -mlong-double-64 is given.

Is there really a need to use extended precision in rotg.c/zrotg.c?

@martin-frbg
Copy link
Collaborator

Apparently Goto saw a need, though he may have been working around a compiler bug for all we know. The long double appears to be used only to calculate a scale factor from the sum of the absolute values of two doubles, but the code actually contains #ifdef'd sections for x86 (with long double) and "everything else".

@brianborchers
Copy link
Author

brianborchers commented Aug 5, 2018 via email

@carlkl
Copy link

carlkl commented Aug 5, 2018

I don't understand the (long)((double) usage in blas_set_parameter (driver/others/parameter.c) either:

  factor=openblas_block_factor();
  if (factor>0) {
    if (factor <  10) factor =  10;
    if (factor > 200) factor = 200;

    sgemm_p = ((long)((double)sgemm_p * (double)factor * 1.e-2)) & ~7L;
    dgemm_p = ((long)((double)dgemm_p * (double)factor * 1.e-2)) & ~7L;
    cgemm_p = ((long)((double)cgemm_p * (double)factor * 1.e-2)) & ~7L;
    zgemm_p = ((long)((double)zgemm_p * (double)factor * 1.e-2)) & ~7L;
#ifdef EXPRECISION
    qgemm_p = ((long)((double)qgemm_p * (double)factor * 1.e-2)) & ~7L;
    xgemm_p = ((long)((double)xgemm_p * (double)factor * 1.e-2)) & ~7L;
#endif

I guess all OpenBLAS should get rid of long double usage.

@martin-frbg
Copy link
Collaborator

@carlkl the usage in blas_set_parameter is something completely different I think - it does not involve "long double" at all, but scaling of an integer value by real numbers, with the result cast to long again.
(And I suspect neither the referece BLAS nor a blanket statement is going to be particularly helpful in
trying to interpret legacy code that does not appear to be demonstrably wrong

@carlkl
Copy link

carlkl commented Feb 7, 2019

@martin-frbg , another issue that could be closed now?

@martin-frbg
Copy link
Collaborator

martin-frbg commented Feb 7, 2019

I had been meaning to revisit this - the use of "long double" does appear to be confined to rot/rotg but it is still not clear to me if it was used to fix an actual problem on any (pre-Haswell) x86 hardware or just experimentally as a (then) "new toy" (x87 extended precision). I am still inclined to treat this as a harmless oddity inherited from the GotoBLAS2 codebase

@brianborchers
Copy link
Author

brianborchers commented Feb 7, 2019 via email

@carlkl
Copy link

carlkl commented Feb 7, 2019

ARM doesn't use 80 bit precision on long double either. I guess (but I cannot prove it) that 80 bit precision is meaningless for rot/rotg.

@carlkl
Copy link

carlkl commented Feb 7, 2019

there is i.e. the following code snippet in openblas_config_template.h:

#ifdef OPENBLAS_QUAD_PRECISION
typedef struct {
  unsigned long x[2];
}  xdouble;
#elif defined OPENBLAS_EXPRECISION
#define xdouble long double
#else
#define xdouble double
#endif

This place is the only occurence of these macros, so xdouble is always defined as double. My impression is, that OPENBLAS_EXPRECISION and OPENBLAS_QUAD_PRECISION is an artefact of some unfinished work from GOTOBLAS.

@martin-frbg
Copy link
Collaborator

@carlkl yes, I think we already arrived at that conclusion in #1769. But neither this snippet nor any ARM features appear to have much bearing on the specific few lines of x86-only code used in rot(g)

@carlkl
Copy link

carlkl commented Feb 19, 2019

@martin-frbg, there is another place with use of extended precision I don't fully understand:

common_x86.h:

static __inline long double sqrt_long(long double val) {
#if defined(_MSC_VER) && !defined(__clang__)
  return sqrt(val); // not sure if this will use fsqrt
#else
  long double result;

  __asm__ __volatile__ ("fldt %1\n"
		    "fsqrt\n"
		    "fstpt %0\n" : "=m" (result) : "m"(val));
  return result;
#endif
}
#define SQRT(a)  sqrt_long(a)

x86 architecture (32 bit): for Cygwin and for mingw-w64 SQRT is defined as sqrt_long with extended precision. Following LAPACK files: potf2_L.c, potf2_U.c, zpotf2_L.c, zpotf2_U.c, make use of SQRT.

This code mentioned here could be a potential explanation for the issue: #1270 :
The OpenBLAS used for scipy test build at this time was compiled with a mingwpy version, that itself was build by me with: long double = double defined (using -mlong-double-64). Using assembler code for extended loads or stores is obviously not compatible with such a configuration.

Maybe all this long double code in OpenBLAS should stay as it is right now. However potential implications should be kept in mind.

So what about simply close this issue?

@martin-frbg
Copy link
Collaborator

I had already mentioned that x86 sqrt in #1711 (comment) above - this is another GotoBLAS2 odditiy without documented explanation. (The comment about something being due to a gcc bug a few lines down refers to the cpuid prototype - it is already in GotoBLAS-1.00 while the sqrt_long only appeared in GotoBLAS2 as far as I can tell from my incomplete collection of old GotoBLAS versions).
It is entirely possible that this was just an idle experiment by K.Goto exploring x87 extended precision - the ifdef MSVC shortcut was added only a few years ago by hpanderson.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants