
Initial support of SYCL CUTLASS for XPU backend through Inductor #2


Conversation

@OuadiElfarouki commented Apr 4, 2025

Summary: This patch enables an initial execution of torch.mm through an Inductor-generated SYCL CUTLASS kernel for Intel PVC.
Following the reference CUDA implementation, it implements the following key functionality:

  • For template generation & rendering:

    • SYCLTemplate, CUTLASSTemplate, CUTLASSGemmTemplate, CUTLASS3xGemmTemplate: handle generating the full C++ code, from the call to GeneratePVC to obtain the GemmOperations (exposed by cutlass_library), through filtering the operations, constructing the Manifest and extracting the GEMM instance from the emitter (exposed by cutlass_library), up to wrapping the C++ template code with runtime arguments and the final kernel launch.
    • cutlass_utils.py: utility file containing relevant functions used across the codegen process.
    • SYCLKernel, SYCLTemplateKernel: handle the higher-level kernel template and the kernel call from the host side. Used within the Template classes above.
  • For autotuning:

    • SYCLBenchmarkRequest: currently added as a near-dummy that does no real benchmarking, since we select a single generated configuration for this PoC.
  • For wrapping & triggering the above:

    • SYCLTemplateCaller: wrapper holding a ready-to-compile, execute, and benchmark SYCL template kernel. This is the higher-level construct added to the list of "choices" in the autotuning process for selecting the best configuration (see the sketch after this list).
  • For scheduling/execution:

    • SYCLCPPScheduling & SYCLCombinedScheduling: orchestrate kernel calls across nodes that may have different lowerings (e.g. Triton and SYCL CUTLASS). Only a few changes were made here compared to the original CUDA implementation.
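
As a rough illustration of where the SYCLTemplateCaller choices plug in (the sketch referenced above), the snippet below mirrors the way the CUDA backend registers CUTLASS choices in the mm lowering. The xpu module path and the exact signatures are assumptions based on this PR, not a verbatim excerpt.

# Hypothetical sketch, not actual PR code: how SYCL CUTLASS templates would join
# the autotuning choices for torch.mm, mirroring the CUDA lowering.
from torch._inductor.select_algorithm import autotune_select_algorithm
from torch._inductor.utils import use_cutlass_template  # heuristic gating CUTLASS usage


def tuned_mm_sketch(mat1, mat2, layout, m, n, k):
    choices = []  # ATen / Triton choices would normally be appended here as well
    if use_cutlass_template(layout, m, n, k):
        # Module path assumed from this PR; each filtered GemmOperation becomes
        # a SYCLTemplateCaller appended to `choices`.
        from torch._inductor.codegen.xpu.gemm_template import CUTLASS3xGemmTemplate

        CUTLASS3xGemmTemplate.add_cutlass_gemm_choices(choices, layout, [mat1, mat2])
    return autotune_select_algorithm("mm", choices, [mat1, mat2], layout)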

The current state was fine-tuned to support the only type configuration exposed by CUTLASS on PVC so far, i.e. bfloat16 inputs with fp32 accumulation, which forces some workarounds on the PyTorch side, mainly around the dtypes of D (layout/output node) and C (source/input_node[2]).

Unsupported or partially implemented features are highlighted with TODO (SYCL) comments.
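
For reference, here is a minimal end-to-end sketch of how this path would be exercised from user code, assuming the XPU backend is selected through the same Inductor config knobs as the CUDA CUTLASS backend (max_autotune_gemm_backends etc.); the exact knob spelling for XPU may differ.

# Minimal sketch under the assumptions above: run torch.mm on XPU with bf16 inputs
# and let max-autotune consider the SYCL CUTLASS template.
import torch
import torch._inductor.config as inductor_config

inductor_config.max_autotune_gemm_backends = "CUTLASS"  # restrict GEMM choices to CUTLASS

a = torch.randn(4096, 4096, device="xpu", dtype=torch.bfloat16)
b = torch.randn(4096, 4096, device="xpu", dtype=torch.bfloat16)

compiled_mm = torch.compile(torch.mm, mode="max-autotune")
out = compiled_mm(a, b)  # lowered through the Inductor-generated SYCL CUTLASS kernel
print(out.shape, out.dtype)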

@sommerlukas (Collaborator) left a comment:

Overall, the implementation looks good, also judging by the changes made in comparison to the CUDA version of these files.

I've added some small comments and a few questions inline.

As I've found multiple cases of trailing whitespace and wrong ordering of imports, I think these files have not been run through a formatter yet. It's important to do that before adding them to the repository, to avoid unrelated formatting changes in future PRs.

Comment on lines 42 to 44
assert os.path.islink(
    dst_link
), f"{dst_link} is not a symlink. Try to remove {dst_link} manually and try again."
Collaborator:

I noticed that formatting here deviates from the CUDA version of this file.
I don't think that's a problem in general, but it made me wonder whether you ran the PyTorch formatters on these files?

Author:

Sure, I'll run the proper PyTorch formatter next. I did run black once with its default config on these new files, which might explain the divergence.


args = CUTLASSArgs(
    architectures=arch,
    instantiation_level = "0", # TODO (SYCL) : Make it config param once enabled in cutlass_library/generator.py
Collaborator:

What does the instantiation level express?

Author:

The instantiation level is a 4-digit number used to control the number of randomly generated configurations on the cutlass_library side. It isn't used by GeneratePVC yet, since we do an explicit cartesian product over known configs, whereas it is used for SM90 etc. More about it here: https://github.com/codeplaysoftware/cutlass-fork/blob/041d78b4d8c30722b2c2e14e858114cca273b6d7/python/cutlass_library/manifest.py#L575
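
To illustrate the "explicit cartesian product" mentioned above, here is a toy, self-contained sketch; the parameter values are placeholders, not the actual PVC configurations generated by cutlass_library.

# Toy illustration only: enumerate GEMM configurations via an explicit cartesian
# product over fixed parameter sets, instead of deriving them from an
# instantiation level. The values below are placeholders.
import itertools

tile_shapes = [(256, 256, 32)]
dtype_combos = [("bf16", "bf16", "f32")]  # (A, B, accumulator)
layouts = [("RowMajor", "RowMajor")]

for tile, (a, b, acc), (layout_a, layout_b) in itertools.product(
    tile_shapes, dtype_combos, layouts
):
    print(f"GEMM config: tile={tile}, A={a}/{layout_a}, B={b}/{layout_b}, acc={acc}")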

Comment on lines 155 to 183
def torch_dtype_to_cutlass_type(
    torch_dtype: torch.dtype,
) -> "cutlass_library.library.DataType":  # type: ignore[name-defined]  # noqa: F821
    # Import cutlass python scripts.
    assert try_import_cutlass()
    import cutlass_library  # type: ignore[import]

    if torch_dtype == torch.bfloat16:
        return cutlass_library.library.DataType.bf16
    elif torch_dtype == torch.float:
        return cutlass_library.library.DataType.f32
    else:
        raise NotImplementedError(f"Unsupported data type: {torch_dtype}")


def dtype_match(
    torch_dtype: Optional[torch.dtype],
    cutlass_dtype: "cutlass_library.library.DataType",  # type: ignore[name-defined]  # noqa: F821
) -> bool:
    # Import cutlass python scripts.
    assert try_import_cutlass()
    import cutlass_library

    if torch_dtype == torch.bfloat16:
        return cutlass_dtype == cutlass_library.library.DataType.bf16
    elif torch_dtype == torch.float:
        return cutlass_dtype == cutlass_library.library.DataType.f32
    else:
        return False
Collaborator:

I think it would be nice to keep the same order of cases in the if as in the corresponding CUDA version of this file. We currently have fewer supported cases, but with the same order, comparability of the files is better.

Author:

Noted, I will restore the order & cases. It shouldn't hurt, I guess; we only need to be careful with the examples we're running.

    raise NotImplementedError(f"unsupported {torch_dtype=} for alignments")


def get_max_alignment(inductor_layout: Layout) -> int:
Collaborator:

This is currently identical to the CUDA version. Do we expect any differences here?

Author:

That's true. I didn't get the chance to check it further, but I will discuss this with @aacostadiaz and the team to make sure.

from . import cutlass_utils
from .sycl_kernel import SYCLTemplateKernel
from .sycl_template import CUTLASSTemplate
import torch
Collaborator:

Why do we need this import here? Couldn't we use relative imports instead?

gemm_size = V.graph.sizevars.size_hint(m * n * k, fallback=-1)
if gemm_size <= 0 or gemm_size < config.sycl.cutlass_backend_min_gemm_size:
    return False
from .codegen.xpu.cutlass_utils import try_import_cutlass
Collaborator:

Should we move this import closer to the actual use of try_import_cutlass?

@sommerlukas (Collaborator) left a comment:

A couple of additional minor comments; some of my previous comments are also still open.

Comment on lines +249 to +251
if not a_factor_of(size[contiguous_dim], alignment) or not a_factor_of(
    offset, alignment
):
Collaborator:

Nit: Formatting is a bit weird here.

Author:

I did run the linter on it, but it doesn't make any changes. The CUDA version has the same formatting.

*workspace_size = gemm_op.get_workspace_size(arguments);
return 0;
}
// check for null pointers after workspace size, since querying workspace size doesn't require valid data pointers
Collaborator:

Are we actually checking for null pointers here?

Author:

I went through the next block and didn't really find any "real" check of valid pointers, but it might be somewhere down the call stack. I guess the goal here is just to make sure the ordering of the if (workspace_size) block and the can_implement(arguments) call is not switched by mistake, even though it could be without causing problems when the if condition is false (which is always the case during execution, since None is passed here: https://github.com/OuadiElfarouki/pytorch/blob/cc172171ea1bafe2138ea741fe30b00f12609bd8/torch/_inductor/codegen/xpu/sycl_kernel.py#L305).
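
To make the intent concrete, here is a schematic stand-in (a pure Python stub, not the generated code) for the two-phase protocol described above: the first call only fills the workspace-size out-parameter, so data pointers need not be valid; the second call does the actual work.

# Schematic only: two-phase call pattern of a generated kernel entry point.
import ctypes


def generated_kernel_stub(x, w, y, workspace_size_ptr, workspace, queue):
    if workspace_size_ptr is not None:
        # Size-query phase: data pointers are not dereferenced here.
        workspace_size_ptr.value = 4096
        return 0
    # Execution phase: by now the data buffers must be valid.
    assert x is not None and w is not None and y is not None
    return 0


ws = ctypes.c_size_t()
generated_kernel_stub(None, None, None, ws, None, None)         # phase 1: query size
workspace = bytearray(ws.value)
generated_kernel_stub(b"X", b"W", b"Y", None, workspace, None)  # phase 2: run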

CUTLASS_CHECK(status);
}
{
auto status = gemm_op.run();
Collaborator:

I think the second point is still open: are there downsides to using run instead of operator()?


if op.gemm_kind not in self._get_supported_ops():
    return None

Collaborator:

This is still open.

    and self.layout_match(W.get_layout(), op.B.layout)
):
    return None

Collaborator:

This is still open.

"""
sizes = self.output_node.get_size()
if len(sizes) > 2:
    return "cutlass::gemm::GemmUniversalMode::kBatched"
Collaborator:

Does batched GEMM currently work with our CUTLASS and CollectiveBuilder?

Author:

I don't think we have it enabled yet, @aacostadiaz would confirm.

return False


@functools.lru_cache(8)
Comment:

Does this need a cache?

    and _use_autotune_backend("CUTLASS")
)

from .codegen.xpu.cutlass_utils import try_import_cutlass
Comment:

should this be after if res to avoid running it when not necessary?

Author:

Agreed. I just wanted to respect the usual way modules are imported in PyTorch, which seems to be at the top of function/class scope rather than within conditional blocks, even when they end up unused.

cutlass_max_profiling_configs: Optional[int] = None

# The L2 swizzle values to consider when profiling CUTLASS configs in max_autotune.
cutlass_max_profiling_swizzle_options: list[int] = [1] # TODO(SYCL): Currently set to 1 value until benchmarking is supported
Comment:

I believe we discussed that this is unlikely to make a difference since we are not using SLM; maybe you should add a comment about that?

Author:

Yes will do.

#endif
#endif
{
auto status = gemm_op.initialize(arguments, workspace);
Comment:

You can still add the queue now; it just might not be used yet. The interface takes a sycl::queue* cast to void*. The PR to fix that is just waiting for review.

import cutlass_library.library as cutlass_lib # noqa: F401

assert isinstance(op, cutlass_gemm_op.GemmOperation), (
    "op argument is required and has to be an instance of GemmOperation"
Comment:

why is it defaulted to None then?

Author:

Yes, it's quite confusing, but I believe it has to do with the inheritance from the SYCLTemplate class, which doesn't have an op argument in its render method. The idea is to keep the inherited & overridden interface (# type: ignore[override]) consistent in the declaration while enforcing the requirement at execution time (hence the assert on the op type).
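
A minimal sketch of that override pattern (class and method bodies are simplified stand-ins, not the actual PR code):

from typing import Any, Optional


class SYCLTemplate:
    # The base render() takes no `op` argument.
    def render(self, **kwargs: Any) -> str:
        raise NotImplementedError


class CUTLASS3xGemmTemplate(SYCLTemplate):
    # `op` keeps a default of None so the override stays signature-compatible
    # with the base class; its presence is enforced at call time instead.
    def render(self, op: Optional[Any] = None, **kwargs: Any) -> str:  # type: ignore[override]
        assert op is not None, "op argument is required"
        return f"// rendered GEMM code for {op}"


print(CUTLASS3xGemmTemplate().render(op="gemm_op_stub"))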

if len(A_size) < 2:
    A_size.insert(0, 1)
if len(B_size) < 2:
    A_size.insert(1, 1)
Comment:

is this supposed to be A_size again?

if self._workspace_size_updated:
    return
# Hardcoded temporarily for testing with known kernels
self.workspace_size = 4096  # Fixed size for PoC
Comment:

The PVC GEMM implementation doesn't use SLM, so this could be 0.

return None

if all(dtype == torch.bfloat16 for dtype in input_torch_dtypes):
    return torch.float
Comment:

add a TODO I think

"""
# TODO (SYCL): Extend for other types & double-check alignments
if torch_dtype == torch.bfloat16:
    return [8, 4, 2, 1]
Comment:

Is this in bytes? I would guess bf16 needs to be at least 2 and float at least 4.

@sommerlukas (Collaborator) left a comment:

Approving this PR. There are some open questions; please address them, or at least reply to them in the comments for future reference, when merging the PR.

@FMarno left a comment:

Thanks for all your work, Ouie

@sommerlukas merged commit a47e05c into codeplaysoftware:sycl-develop on Apr 17, 2025
47 checks passed