Joint_inclusive_scan returns the wrong result #4336


Closed
Michoumichmich opened this issue Aug 13, 2021 · 2 comments · Fixed by #4337
Labels
bug Something isn't working cuda CUDA back-end

Michoumichmich commented Aug 13, 2021

Describe the bug
Calling joint_inclusive_scan with sycl::multiplies returns the wrong result.

To Reproduce

A small reproducer that performs a joint_inclusive_scan on a vector of 32 ones:

#include <iostream>
#include <vector>

#include <sycl/sycl.hpp>

int main() {
    sycl::queue q{sycl::gpu_selector{}};

    constexpr size_t size = 32;
    using T = unsigned int; // 'uint' is non-standard C++
    using op = sycl::multiplies<>;

    std::vector<T> v(size, T(1));
    std::vector<T> out(size, T(1));
    {
        sycl::buffer<T, 1> v_buffer(v.data(), sycl::range<1>(v.size()));
        sycl::buffer<T, 1> out_buffer(out.data(), sycl::range<1>(out.size()));
        q.submit([&](sycl::handler &cgh) {
            auto in = v_buffer.get_access<sycl::access::mode::read>(cgh);
            auto out = out_buffer.get_access<sycl::access::mode::write>(cgh);
            cgh.parallel_for(sycl::nd_range<1>(32, 32), [=](sycl::nd_item<1> it) {
                auto begin = in.get_pointer();
                sycl::joint_inclusive_scan(it.get_group(), begin, begin + in.size(), out.get_pointer(), op());
            });
        });
    }

    for (const T &e: out) {
        std::cout << e << " ";
    }
    std::cout << std::endl;
}

Results with the CUDA back-end

We should expect a vector of ones, but instead we get:

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 

(and why do we have a one at the end? I suspect a flip somewhere)

If we change the binary operator to sycl::plus<>, the code works as expected and prints:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 

Results with the OpenCL back-end

For sycl::plus<>, the result is correct, but for multiplies it won't compile and fails with:

Unsupported SPIR-V module
SPIRV module requires unsupported capability 63
Compilation failed
 -11 (CL_BUILD_PROGRAM_FAILURE)

It would be nice if someone could try the reproducer on OpenCL!

Environment (please complete the following information):

  • OS: rhel8.4
  • Target device and vendor: NVIDIA GPU
@Michoumichmich Michoumichmich added the bug Something isn't working label Aug 13, 2021

Michoumichmich commented Aug 13, 2021

Okay, this might be helpful: if I set the initial value to 3, I get, on the CUDA back-end:

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 

How did it get there? 🤔

@Michoumichmich

Solved on CUDA!

@bader bader added the cuda CUDA back-end label Aug 15, 2021
bader pushed a commit that referenced this issue Aug 16, 2021
This resolves issue #4336, a bug caused by 0 being used as the identity element in the CUDA back-end.

Signed-off-by: Michel Migdal [email protected]