Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[OpenMP][OMPT] initial_device_num on initialize callback set incorrectly #134451

Open
Thyre opened this issue Apr 4, 2025 · 2 comments
Open

[OpenMP][OMPT] initial_device_num on initialize callback set incorrectly #134451

Thyre opened this issue Apr 4, 2025 · 2 comments
Labels

Comments

@Thyre
Copy link

Thyre commented Apr 4, 2025

We've recently released Score-P v9.0, having support for the OpenMP target via the OpenMP Tools Interface for the first time.

However, a user quickly noticed an issue, which boils down to a bug in LLVM.
When the OpenMP runtime initializes a connected tool, the following callback will be dispatched:

typedef int (*ompt_initialize_t) (ompt_function_lookup_t lookup, int initial_device_num, ompt_data_t *tool_data);

initial_device_num is defined as "the value that a call to omp_get_initial_device would return". This itself is defined as being "the value of the device number is the value of omp_initial_device or the value returned by the omp_get_num_devices routine".
Looking at the actual returned values however, this is not the case. Instead, LLVM always reports initial_device_num = 0. Since LLVM also does not implement ompt_get_num_devices correctly

return 1; // only one device (the current device) is available

a tool cannot safely determine if a device identifier in some target callback is actually the host.

LLVM typically uses -1 for this, but tools should not have to rely on guessing this, as it might or might not change in later versions. This also makes supporting different runtimes more complicated.


To reproduce the issue:

#include <omp.h>

#include <omp-tools.h>
#include <stdlib.h>
#include <assert.h>
#include <inttypes.h>
#include <string.h>

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdatomic.h>

/* MAIN */

int main( int argc, char** argv )
{
#pragma omp target
    {
        printf( "Hello from target region\n" );
    }
}

/* OMPT INTERFACE */

#if ( defined( __ppc64__ ) || defined( __powerpc64__ ) || defined( __PPC64__ ) )
#define OMPT_TOOL_CPU_RELAX ( ( void )0 )
#elif ( defined( __x86_64 ) || defined( __x86_64__ ) || defined( __amd64 ) || defined( _M_X64 ) )
#define OMPT_TOOL_CPU_RELAX __asm__ volatile ( "pause" )
#elif ( defined( __aarch64__ ) || defined( __ARM64__ ) || defined( _M_ARM64 ) )
#define OMPT_TOOL_CPU_RELAX __asm__ volatile ( "yield" )
#else
#define OMPT_TOOL_CPU_RELAX ( ( void )0 )
#endif


/* Test-and-test-and-set lock. Mutexes are of type bool */
#define OMPT_TOOL_LOCK( MUTEX ) \
    while ( true ) \
    { \
        if ( atomic_flag_test_and_set_explicit( &( MUTEX ), memory_order_acquire ) != true ) \
        { \
            break; \
        } \
        OMPT_TOOL_CPU_RELAX; \
    }
#define OMPT_TOOL_UNLOCK( MUTEX ) atomic_flag_clear_explicit( &( MUTEX ), memory_order_release );

#define OMPT_TOOL_GUARDED_PRINTF( ... ) \
    OMPT_TOOL_LOCK( ompt_tool_printf_mutex ) \
    printf( __VA_ARGS__ ); \
    OMPT_TOOL_UNLOCK( ompt_tool_printf_mutex )


_Thread_local int32_t ompt_tool_tid          = -1; /* thread counter. >= 1 after thread_begin  */
atomic_flag           ompt_tool_printf_mutex = ATOMIC_FLAG_INIT;


static int
my_initialize_tool( ompt_function_lookup_t lookup,
                    int                    initial_device_num,
                    ompt_data_t*           tool_data )
{
    OMPT_TOOL_GUARDED_PRINTF( "[%s] tid = %" PRId32 " | initial_device_num %d\n",
                              __FUNCTION__,
                              ompt_tool_tid,
                              initial_device_num );
    return 1; /* non-zero indicates success */
}

static void
my_finalize_tool( ompt_data_t* tool_data )
{
    OMPT_TOOL_GUARDED_PRINTF( "[%s] tid = %" PRId32 "\n",
                              __FUNCTION__,
                              ompt_tool_tid );
}

ompt_start_tool_result_t*
ompt_start_tool( unsigned int omp_version,
                 const char*  runtime_version )
{
    setbuf( stdout, NULL );
    OMPT_TOOL_GUARDED_PRINTF( "[%s] tid = %" PRId32 " | omp_version %d | runtime_version = \'%s\'\n",
                              __FUNCTION__,
                              ompt_tool_tid,
                              omp_version,
                              runtime_version );
    static ompt_start_tool_result_t tool = { &my_initialize_tool,
                                             &my_finalize_tool,
                                             ompt_data_none };
    return &tool;
}

Building the reproducer on a system with at least one GPU (maybe -fopenmp-targets=x86_64 works too), one will see the following result:

$ clang -fopenmp --offload-arch=sm_75 test.c
$ ./a.out
[ompt_start_tool] tid = -1 | omp_version 201611 | runtime_version = 'LLVM OMP version: 5.0.20140926'
[my_initialize_tool] tid = -1 | initial_device_num 0
Hello from target region
[my_finalize_tool] tid = -1

As one can see initial_device_num is 0. In my case, it should be either -1 or 1.

@llvmbot
Copy link
Member

llvmbot commented Apr 4, 2025

@llvm/issue-subscribers-openmp

Author: Jan André Reuter (Thyre)

We've recently released Score-P v9.0, having support for the OpenMP target via the OpenMP Tools Interface for the first time.

However, a user quickly noticed an issue, which boils down to a bug in LLVM.
When the OpenMP runtime initializes a connected tool, the following callback will be dispatched:

typedef int (*ompt_initialize_t) (ompt_function_lookup_t lookup, int initial_device_num, ompt_data_t *tool_data);

initial_device_num is defined as "the value that a call to omp_get_initial_device would return". This itself is defined as being "the value of the device number is the value of omp_initial_device or the value returned by the omp_get_num_devices routine".
Looking at the actual returned values however, this is not the case. Instead, LLVM always reports initial_device_num = 0. Since LLVM also does not implement ompt_get_num_devices correctly

return 1; // only one device (the current device) is available

a tool cannot safely determine if a device identifier in some target callback is actually the host.

LLVM typically uses -1 for this, but tools should not have to rely on guessing this, as it might or might not change in later versions. This also makes supporting different runtimes more complicated.


To reproduce the issue:

#include &lt;omp.h&gt;

#include &lt;omp-tools.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;assert.h&gt;
#include &lt;inttypes.h&gt;
#include &lt;string.h&gt;

#include &lt;stdbool.h&gt;
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdatomic.h&gt;

/* MAIN */

int main( int argc, char** argv )
{
#pragma omp target
    {
        printf( "Hello from target region\n" );
    }
}

/* OMPT INTERFACE */

#if ( defined( __ppc64__ ) || defined( __powerpc64__ ) || defined( __PPC64__ ) )
#define OMPT_TOOL_CPU_RELAX ( ( void )0 )
#elif ( defined( __x86_64 ) || defined( __x86_64__ ) || defined( __amd64 ) || defined( _M_X64 ) )
#define OMPT_TOOL_CPU_RELAX __asm__ volatile ( "pause" )
#elif ( defined( __aarch64__ ) || defined( __ARM64__ ) || defined( _M_ARM64 ) )
#define OMPT_TOOL_CPU_RELAX __asm__ volatile ( "yield" )
#else
#define OMPT_TOOL_CPU_RELAX ( ( void )0 )
#endif


/* Test-and-test-and-set lock. Mutexes are of type bool */
#define OMPT_TOOL_LOCK( MUTEX ) \
    while ( true ) \
    { \
        if ( atomic_flag_test_and_set_explicit( &amp;( MUTEX ), memory_order_acquire ) != true ) \
        { \
            break; \
        } \
        OMPT_TOOL_CPU_RELAX; \
    }
#define OMPT_TOOL_UNLOCK( MUTEX ) atomic_flag_clear_explicit( &amp;( MUTEX ), memory_order_release );

#define OMPT_TOOL_GUARDED_PRINTF( ... ) \
    OMPT_TOOL_LOCK( ompt_tool_printf_mutex ) \
    printf( __VA_ARGS__ ); \
    OMPT_TOOL_UNLOCK( ompt_tool_printf_mutex )


_Thread_local int32_t ompt_tool_tid          = -1; /* thread counter. &gt;= 1 after thread_begin  */
atomic_flag           ompt_tool_printf_mutex = ATOMIC_FLAG_INIT;


static int
my_initialize_tool( ompt_function_lookup_t lookup,
                    int                    initial_device_num,
                    ompt_data_t*           tool_data )
{
    OMPT_TOOL_GUARDED_PRINTF( "[%s] tid = %" PRId32 " | initial_device_num %d\n",
                              __FUNCTION__,
                              ompt_tool_tid,
                              initial_device_num );
    return 1; /* non-zero indicates success */
}

static void
my_finalize_tool( ompt_data_t* tool_data )
{
    OMPT_TOOL_GUARDED_PRINTF( "[%s] tid = %" PRId32 "\n",
                              __FUNCTION__,
                              ompt_tool_tid );
}

ompt_start_tool_result_t*
ompt_start_tool( unsigned int omp_version,
                 const char*  runtime_version )
{
    setbuf( stdout, NULL );
    OMPT_TOOL_GUARDED_PRINTF( "[%s] tid = %" PRId32 " | omp_version %d | runtime_version = \'%s\'\n",
                              __FUNCTION__,
                              ompt_tool_tid,
                              omp_version,
                              runtime_version );
    static ompt_start_tool_result_t tool = { &amp;my_initialize_tool,
                                             &amp;my_finalize_tool,
                                             ompt_data_none };
    return &amp;tool;
}

Building the reproducer on a system with at least one GPU (maybe -fopenmp-targets=x86_64 works too), one will see the following result:

$ clang -fopenmp --offload-arch=sm_75 test.c
$ ./a.out
[ompt_start_tool] tid = -1 | omp_version 201611 | runtime_version = 'LLVM OMP version: 5.0.20140926'
[my_initialize_tool] tid = -1 | initial_device_num 0
Hello from target region
[my_finalize_tool] tid = -1

As one can see initial_device_num is 0. In my case, it should be either -1 or 1.

@Thyre
Copy link
Author

Thyre commented Apr 4, 2025

Looking at the source code a bit, there are two places where OMPT is initialized:

ompt_fn_lookup, omp_get_initial_device(),

/* initial_device_num */ 0, /* tool_data */ nullptr);

The first case looks correct. However, I think we never see it in the case where offload is used, right?
The second case should never use 0, but also omp_get_initial_device. There are two possible values which can then be returned:

int omp_get_initial_device(void) { return -1; }

EXTERN int omp_get_initial_device(void) {

Here, I don't understand when which one is chosen...
Callbacks later on also use omp_get_initial_device, so we should stick with the same value everywhere. For sake of consistency with the non-tool side, having omp_get_num_devices would be nice. From a tool perspective however, I wouldn't really care.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants