Experiment: coordinate LLVM commits between torch-mlir and onnx-mlir #1135
Based on last night's run, here's the green LLVM commit for this week: ec5def5. However, MHLO is just a tad behind this commit, causing the Torch-MLIR and ONNX-MLIR builds to break when I use that commit. Luckily, the patch to MHLO is short. @joker-eph is it possible to get this patch merged into MHLO?
I'm happy to offer you a branch in the mlir-hlo repo for managing this, but the HEAD of the repo is cadenced by Google internal integration: we bump it when it passes all of our tests internally.
This is one of the things I've struggled with in trying to define a cadence. If the cadence is short (say, daily), then you can proceed with a fix-forward approach: take fixes like this and push to head, then wait for tomorrow's build. With a monthly cadence, you've reached the point where it makes more sense to branch the dependent repos and patch in the branch, otherwise there's no chance of converging. The downside is more branch management. BTW, super-excited to see this experiment happen, we'll definitely be looking at aligning our work to these points.
@joker-eph Thanks, if you could create a branch, that'd be great. I'm a bit worried that managing an extra branch might get cumbersome, but I'm happy to try it and see how often we need to make these short-term patches.
Great to see this convergence @sstamenova! TOSA has common code used in the TorchToTosa and ONNXToTOSA efforts, and alignment here makes it much more reasonable to proceed with putting that code in the core MLIR llvm-project side so I can then pick it up from both Torch-MLIR and ONNX-MLIR.
One other thing to consider is whether we can take on an optional dependency on the onnx/onnx-mlir "importer" so our CIs can run simple exports to the onnx Dialect. I am not sure if that would require the deep dependency on LLVM/MHLO etc. Maybe eventually someone could wire up the onnx Dialect --> Linalg-on-tensors etc. in torch-mlir too.
@joker-eph For patches required to make MHLO work with new LLVM commits, in addition to pushing them to a branch in the MHLO repo, I'm happy to send PRs (that can be tested using your internal CI) if that'd mean less work for y'all during a subsequent LLVM update. For us, the benefit would be that these patches would live outside the master branch for a shorter time frame than otherwise. Let me know if that sounds like an idea worth pursuing.
Just in case people aren't aware: in order to have a submodule dependency, it is sufficient to have the commit reachable from any branch in the source repo. On the IREE side, when we create a cherrypick patch, we push it to a branch on the source repo (which we maintain mirrors of) with a date and index number in the name and then just don't overwrite them. A similar thing can be done here (i.e., push each week's patches to a dated branch in the mlir-hlo repo). Further, we always start fresh on the next integrate. If the cherrypicks haven't landed in the respective upstream, then they can still be retrieved from the prior branch and reapplied. But making it manual biases away from long-term divergence while still leaving a patch record that can be mined for future work.
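The dated-branch scheme above can be sketched roughly as follows. This is a minimal local simulation, not the actual IREE/MHLO tooling: the repo path and branch name are illustrative.

```shell
#!/bin/sh
# Sketch of the dated-branch scheme: push each integrate's cherrypicks
# to a dated, indexed branch and never overwrite it. Any commit
# reachable from any branch stays usable as a submodule pin.
set -e

rm -rf /tmp/mhlo-mirror
git init -q /tmp/mhlo-mirror
cd /tmp/mhlo-mirror
git -c user.name=ci -c user.email=ci@example.com \
    commit -q --allow-empty -m "upstream HEAD"

# This week's cherrypick lives on its own dated branch (hypothetical name).
git checkout -q -b cherrypicks-20220808.0
git -c user.name=ci -c user.email=ci@example.com \
    commit -q --allow-empty -m "cherrypick: MHLO fix for new LLVM"

# The cherrypick commit is now reachable and can be pinned by consumers.
git rev-parse cherrypicks-20220808.0
```

Next week's integrate would start a fresh `cherrypicks-20220815.0` branch from the new upstream HEAD, reapplying any cherrypicks that haven't landed upstream yet.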
Thanks @stellaraccident, that's a great idea, assuming the multitude of branches in the MHLO repo isn't a problem.
@ashay you should have an invite to join the TensorFlow organization, which will give you write access to the mlir-hlo repository: I created a `greencommit` branch.
Thanks @joker-eph! I think @stellaraccident's suggestion of keeping branches in the mlir-hlo repo to correspond to each week's merge (as necessary) makes sense. If we have a single branch that moves forward, we run the risk of breaking previous versions of torch-mlir which will no longer correspond to a branch that works for them.
At the risk of sounding ignorant, can't we link the MHLO submodule in Torch-MLIR to different commit hashes in the greencommit branch? Since the commits don't go away, the old versions will continue to work, wouldn't they?
Edit: I realized that if we were to run
Essentially you want to commit the MHLO greencommit roll into Torch-MLIR main (along with the corresponding greencommit LLVM)? As long as the CI passes, that should be OK (and if not, we should increase the CI coverage). But that doesn't prevent an MHLO SHA disappearing with
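For reference, pinning a submodule to a specific SHA reachable from a long-lived branch looks roughly like this. The repo paths are hypothetical stand-ins for torch-mlir and mlir-hlo, not the real layout:

```shell
#!/bin/sh
# Sketch: a superproject pins its submodule to an exact "green" SHA.
set -e

rm -rf /tmp/sub /tmp/super
git init -q /tmp/sub
git -C /tmp/sub -c user.name=ci -c user.email=ci@example.com \
    commit -q --allow-empty -m "green LLVM roll"
PIN=$(git -C /tmp/sub rev-parse HEAD)

git init -q /tmp/super
git -C /tmp/super -c user.name=ci -c user.email=ci@example.com \
    commit -q --allow-empty -m "init"
cd /tmp/super
# Newer git requires explicitly allowing file-protocol submodules.
git -c protocol.file.allow=always submodule --quiet add /tmp/sub externals/mhlo
git -C externals/mhlo checkout -q "$PIN"   # pin the exact green SHA
git add .gitmodules externals/mhlo
git -c user.name=ci -c user.email=ci@example.com \
    commit -q -m "Bump mhlo to green commit $PIN"
```

As long as `$PIN` stays reachable from some branch in the source repo, old superproject revisions keep building; the risk is only a force-push or branch deletion that makes the SHA unreachable.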
Thanks @powderluv, I see the reason for the branch now. The green LLVM commit for this week is 061e018.
I agree that it makes sense to follow a pattern such as
@joker-eph Since I don't have the permission to rename protected branches, could you rename the `greencommit` branch?
You should be able to push a new branch instead of renaming it, right?
Actually let me adjust the permission, seems like it is blocked...
Yes, pushing to a new branch works, but that'd change the commit hash, thus breaking commits from last week where we used the old commit hash.
I deleted `greencommit` so you can push under the namespace now.
This experiment is under way and quite successful in #1178. We can close this issue for now and reopen / create a new issue when we need to course-correct.
We have been seeing some breakages downstream in SHARK (https://github.com/nod-ai/SHARK/actions/runs/3671729600/jobs/6207221342) and iree-torch (https://github.com/iree-org/iree-torch/actions/runs/3578370372/jobs/6018456328) because IREE moves forward with LLVM rebases more aggressively than our current torch-mlir LLVM updates, and that causes binary compatibility issues for torch-mlir downstream projects like these. Since IREE is already in sync with the TF / MHLO green commit, would it be OK to pick the IREE LLVM SHA (https://github.com/iree-org/iree-llvm-fork) and try to run our "green commit" checks on it, then tag the commit? So we could pick the IREE LLVM SHA on Sunday night and try to get a green commit on Monday.
I don't think there would be an issue with picking the LLVM commits from the iree-llvm-fork, as long as the LLVM commit in that repo refreshes at least once daily. Right now, we have a scheduled job that runs once every day and picks the latest LLVM commit. But as far as I can tell, we won't have green commits until this patch is either reverted or fixed. I added a comment to that patch explaining the problem and how to reproduce it.
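The daily-pick step could look roughly like this. This is a hedged sketch: the local repo stands in for iree-llvm-fork, and the file path is hypothetical, not the actual job's layout.

```shell
#!/bin/sh
# Sketch of a daily job that records the tip of an LLVM fork as the
# candidate commit to run validation against.
set -e

rm -rf /tmp/llvm-fork
git init -q /tmp/llvm-fork
git -C /tmp/llvm-fork -c user.name=ci -c user.email=ci@example.com \
    commit -q --allow-empty -m "nightly rebase tip"

# A real job would query the remote (e.g. `git ls-remote <fork> HEAD`)
# and then kick off the validation builds; here we just record the SHA.
git -C /tmp/llvm-fork rev-parse HEAD > /tmp/candidate-sha.txt
cat /tmp/candidate-sha.txt
```

The recorded SHA would then feed the green-commit validation builds; only SHAs that pass get announced (or tagged).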
Btw, since iree-llvm-fork can contain patches that aren't upstreamed, I don't think the resulting green commit can be shared broadly, can it? Are you asking about running an additional set of green commit checks besides the ones that we run currently?
I was thinking we only need to sync to the root commit that iree-llvm-fork is rebased on (the local patches can be ignored).
How different is that from the weekly commit that we identify? How often does it update? Also, I'm trying to understand the ask - is it to run tests on that commit to see if it is also green? Or is the ask to use that commit for the llvm update as it would have already run through sufficient validation (including Windows)?
I think the requirements to bump LLVM would be more lightweight with the IREE / MHLO check. It doesn't roll every day, but the requirements may not be as strict as the current green commit roll. Currently we could roll to the IREE / MHLO LLVM and iree-torch and SHARK would all work OK (we have a local fork like that for Nod.ai customers), but it is held up because of the revert needed to get a green commit. So currently rolling the torch-mlir LLVM would break Windows in the ONNX flow, but instead we have broken iree-torch and SHARK (and our downstream customers who use IREE). Open to any suggestions on how to move this forward, so any thoughts / ideas welcome. We just can't be stuck on Dec 2nd LLVM if they don't revert that commit.
If I understand correctly, iree-torch and SHARK are broken because they are on a newer commit of LLVM than torch-mlir. But moving to any newer commit will break Windows. So your proposal is to move torch-mlir forward to unbreak iree-torch and SHARK and break Windows instead? This seems like a poor trade-off in general - situations like this with various platforms can happen at any time in the future and we don't really want to trade one failure for another. A better solution would be to address the issue in the commit that is causing the failure in the first place. Is there a reason we can't revert it?
I agree the tradeoff is not ideal. I think if we aggressively address the hold-up, the status quo should still work - it has mostly worked, and maybe we have to wait a few days for the next bump.
In this particular case, the root cause seems to be the upstream LLVM patch, and I do not see an immediate need to change our policy. That patch breaks existing upstream tests in the LLVM Core Support Tier (not to mention us downstream) and should be reverted immediately. We do not need to wait for the author and should perform the revert ourselves (of course, we should inform the author and provide repro instructions). It is in line with LLVM practices to "revert to green" in this case (see the LLVM patch reversion policy). One thing I am trying to understand is how this patch is still not reverted.
Stella discovered last night that the MSVC compiler version is different between the pre-merge checks (v14.29, which is from Visual Studio 2019) and our internal builds (v14.33 and v14.34, which are from Visual Studio 2022). I am about to try the older compiler version at my end to see if that is indeed the issue. I had heard of previous efforts to update the Windows build bots, but folks ran into issues that I fail to recollect. I will check whether that effort can be restarted now.
Should we revert though while we fix the bot? Or we can temporarily sync to a newer commit. It's hard for me to quantify in this message the hardship this is causing downstream. If we are not able to revert this today, we will have to create a temporary fork to bump LLVM to ship to our downstream users, which brings its own headaches.
Yes, let's please revert this patch immediately.
I'm in favor of reverting the patch. The alternative (creating a temporary fork) would be a lot of trouble.
Is the compiler version the problem here? Unless MSVC v14.29 is miscompiling LLVM, I think the
That's my hunch, but my local build is still running. I should be able to confirm soon. I realized that
The build took a while, but the tests ran successfully when compiled with the older compiler. The revert for the original patch is running pre-merge checks, and once it is reverted, either Stella or I should be able to run the green commit checks.
Hi, guys. I am doing the recent LLVM rebase. I have a problem on
I tried to reproduce it on my Mac, but it works. I can see the same clang version as in the test jobs. The output is as follows:
Meanwhile, looking into the logs of the other jobs, they both pass the cmake configuration phase successfully.
Following up on a couple of conversations we've had about coordinating LLVM commits across projects (in onnx-mlir and in LLVM), we had a follow up discussion in the Torch MLIR community meeting yesterday (see this comment for more details).
The short-term solution that we decided to run as an experiment is to use Torch MLIR as the "source of truth" for what commit to use for other projects with the caveat that when the commit of LLVM is chosen for Torch MLIR, it would have gone through some testing that would pretty much guarantee that the commit would satisfy any requirements that onnx-mlir has as well. Based on the discussion, it should be fairly trivial to find a version of MHLO that uses the same LLVM commit as well, though MHLO might need some fixes for onnx-mlir (to make the shared library build work, for example).
We are going to pick the commit based on some internal testing that we do at Microsoft when we pick the LLVM commit that we merge with. This includes several platforms (Ubuntu, Windows, CentOS) and build tools (clang, VS, gcc) and both static and shared library builds. We'll communicate the commit here and we'll use this commit to update Torch MLIR as we are able (though ideally after we communicate the commit someone else will do the update some of the time). Then when it is time for onnx-mlir to update to a newer LLVM commit (which could also be done by us, though ideally not always), that project (as well as any others that want to stay in sync) can choose any of the recent LLVM commits on this list and we'll have at least one point a month where Torch MLIR and onnx-mlir are synced to the same LLVM commit.
The goal is to improve the process for everyone for the short-term while also working on a longer-term solution in LLVM (for example, pre-commit checks, etc.) and this is a relatively cheap first step towards all of these projects being in sync with the same LLVM dependencies.
A second step would be to start tagging the "green" commits automatically (see https://discourse.llvm.org/t/coordinate-llvm-commits-for-different-project/63990/65) instead of "announcing" them manually, but that will require some additional work we need to evaluate.
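The tagging step could be sketched as follows. The tag format and repo are illustrative assumptions, not a decided convention:

```shell
#!/bin/sh
# Sketch: after a commit passes the green-commit validation, mark it
# with a dated tag so consumers can discover it mechanically instead
# of reading an announcement.
set -e

rm -rf /tmp/green-demo
git init -q /tmp/green-demo
cd /tmp/green-demo
git -c user.name=ci -c user.email=ci@example.com \
    commit -q --allow-empty -m "LLVM roll that passed all checks"

# Hypothetical tag name; downstream projects could then list green
# points with `git tag -l 'green-*'` and pick any of them.
git tag green-2022-08-08
git tag -l 'green-*'
```

Downstream projects (onnx-mlir, IREE, etc.) could then sync to any `green-*` tag and be guaranteed at least one shared LLVM commit per cadence window.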
@AlexandreEichenberger @sjarus @stellaraccident @silvasean @joker-eph @ashay
So to recap:
Let me know if I missed anything or if you have any other questions.