
feat: refactoring segmentation in partitioning #1067


Closed
wants to merge 10 commits

Conversation

bowang007
Collaborator

@bowang007 bowang007 commented May 14, 2022

Signed-off-by: Bo Wang [email protected]

Description

Fixes #1031
resolveNonTensorInput and resolveTensorListInput have become messy: many different issues trace back to these two functions, and each fix adds more complexity to our code base. This PR aims to solve these related issues cleanly and to simplify the code base at the same time.

Type of change

Please delete options that are not relevant and/or add your own.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Checklist:

  • My code follows the style guidelines of this project (You can use the linters)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas and hacks
  • I have made corresponding changes to the documentation
  • I have added tests to verify my fix or my feature
  • New and existing unit tests pass locally with my changes

@facebook-github-bot
Contributor

Hi @bowang007!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@facebook-github-bot
Contributor

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!


@github-actions github-actions bot added the component: core Issues re: The core compiler label Jun 2, 2022
@@ -75,7 +75,8 @@ void find_all_fallback_nodes(std::unordered_map<torch::jit::Node*, int>& fallbac
q.pop();
// for every node that produces this fallback node's NonTensor input, they should fallback too
for (auto input : cur_node->inputs()) {
if (!isTensor(input) && fallback_nodes.insert({input->node(), 4}).second) {
if (!isTensor(input) && input->node()->kind() != torch::jit::prim::Constant &&
Contributor


I think this input->node() traversal misses cases where an op modifies an input.

Ex.
%0 : ListConstruct()
%1 : aten::append(%0, %val)
%2 : aten::append(%1, %val)
%3 : aten::cat(%0)

Looking at just the input->node() of the cat skips over the appends, which could result in a change in model behavior.

Collaborator Author


Hi @mfeliz-cruise did you hit this kind of issue in your model?
I built a model locally with this graph:

graph(%x.1 : Tensor,
      %y.1 : Tensor):
  %2 : int = prim::Constant[value=0]() 
  %mod_list.1 : Tensor[] = prim::ListConstruct(%x.1)
  %z.1 : Tensor = aten::__getitem__(%mod_list.1, %2) 
  %5 : Tensor[] = aten::append(%mod_list.1, %x.1) 
  %6 : Tensor[] = aten::append(%mod_list.1, %y.1) 
  %7 : Tensor[] = aten::append(%mod_list.1, %z.1) 
  %8 : Tensor = aten::cat(%mod_list.1, %2) 
  return (%8)

With forced fallback on aten::cat, the model runs fine. It's true that the input->node() of aten::cat could skip over the aten::append calls. However, since aten::cat's dependency node is prim::ListConstruct, and prim::ListConstruct's output is used later by aten::append, aten::append will also fall back.

I think there are several cases, and they are all covered:

%0: nodeA
%1: nodeB_modifyinput(%0)
%2: nodeC(%1)

1: nodeC falls back; then there are 2 scenarios:
1.1: %1 is a Tensor. In this case nodeA should not fall back, and nodeB, which modifies the input, should not fall back either. We cannot make them fall back by traversing.
1.2: %1 is not a Tensor; it could be a TensorList, like what happens here. In this case nodeA will fall back, and since nodeA falls back, nodeB, which modifies %0, will eventually fall back as well, because nodeB is in nodeA's output list.

2: nodeB falls back; 2 cases too:
2.1: if the input that nodeB is modifying is a Tensor, then nodeA and nodeC will not fall back.
2.2: if the input that nodeB is modifying is not a Tensor, then nodeA will fall back since it produces nodeB's input, and since nodeA falls back, nodeC will also fall back since it is in nodeA's output list. I also did a local test with forced fallback on aten::append, which works fine as well.

Did I miss any case in this analysis?
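The two-direction propagation described above can be sketched with toy types (illustrative stand-ins, not the actual Torch-TensorRT structures): producers of a fallback node's non-Tensor inputs are pulled into the fallback set, and so are all consumers of a fallback node's outputs, which is what eventually catches nodeB in cases 1.2 and 2.2.

```cpp
#include <queue>
#include <set>
#include <vector>

// Toy stand-in for a graph node; names are hypothetical.
struct ToyNode {
  std::vector<ToyNode*> nontensor_inputs;  // producers of this node's non-Tensor inputs
  std::vector<ToyNode*> output_users;      // consumers of this node's outputs
};

// BFS from a seed fallback node, propagating in both directions.
std::set<ToyNode*> propagate_fallback(ToyNode* seed) {
  std::set<ToyNode*> fallback{seed};
  std::queue<ToyNode*> q;
  q.push(seed);
  while (!q.empty()) {
    ToyNode* cur = q.front();
    q.pop();
    // Producers of non-Tensor inputs must live in the same fallback segment.
    for (ToyNode* prod : cur->nontensor_inputs)
      if (fallback.insert(prod).second) q.push(prod);
    // Consumers of a fallback node's outputs must fall back too.
    for (ToyNode* user : cur->output_users)
      if (fallback.insert(user).second) q.push(user);
  }
  return fallback;
}
```

With nodeA -> nodeB (modifies %0) -> nodeC wired as in case 1.2, seeding from nodeC reaches nodeA through the non-Tensor input edge, and then nodeB through nodeA's output users.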

Contributor


I think you're right that these cases should be handled by the forward fallback propagation from ListConstruct. The cases where I've seen issues are caused by getDependencyNodes, where the ListConstruct node is identified as a dependency and copied into the segment of the cat without the appends, changing the behavior of the model.

Collaborator Author


Hi @mfeliz-cruise, thanks for your feedback.
I updated getDependencyNodes, and it should now find all the modifying nodes and take them as dependency nodes as well.
Please note that since the list of modifying nodes is not exhaustive, we may need to keep updating this list https://github.com/bowang007/TRTorch/blob/1df7cbb4f703d399a97901779fbc59034a8c8932/core/partitioning/partitioning.cpp#L43 as well.
Please check whether it works for you. Thanks!
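A toy sketch of the dependency walk being discussed (names are illustrative, not the actual Torch-TensorRT code): for each unresolved input, collect its producer and also any user of that value whose kind appears in a static modifying-ops list, so a copied ListConstruct brings its aten::append calls along.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical toy IR types.
struct ToyValue;
struct ToyOp {
  std::string kind;
  std::vector<ToyValue*> inputs;
};
struct ToyValue {
  ToyOp* producer = nullptr;
  std::vector<ToyOp*> users;
};

// Stand-in for the static (non-exhaustive) modifying-ops list in partitioning.cpp.
const std::set<std::string> kModifyingOps = {"aten::append"};

// Collect producers plus in-place writers for each unresolved input.
std::set<ToyOp*> get_dependency_ops(const std::vector<ToyValue*>& inputs_to_resolve) {
  std::set<ToyOp*> deps;
  for (ToyValue* v : inputs_to_resolve) {
    if (v->producer) deps.insert(v->producer);
    for (ToyOp* user : v->users)
      if (kModifyingOps.count(user->kind)) deps.insert(user);  // writer must travel with the value
  }
  return deps;
}
```

On the earlier example graph, resolving the list input of aten::cat would now return the prim::ListConstruct and both aten::append ops, instead of the ListConstruct alone.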

Contributor


Thanks Bo, could we use the AliasInfo embedded in the node->schema to find these ops on the fly rather than looking them up from a static list? That seems like it could be more robust.

I'm looking at what's done in AliasDB here to get AliasInfo for an input: https://github.com/pytorch/pytorch/blob/d26c575ff581e4df0e9c72b339a25999c6cae59e/torch/csrc/jit/ir/alias_analysis.cpp#L816

And then check if it's a write:
https://github.com/pytorch/pytorch/blob/d26c575ff581e4df0e9c72b339a25999c6cae59e/torch/csrc/jit/ir/alias_analysis.cpp#L847

https://caffe2.ai/doxygen-c/html/classc10_1_1_alias_info.html
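A minimal sketch of this schema-driven check, using toy stand-ins for the libtorch types (real code would read node->schema().arguments()[i].alias_info() and call c10::AliasInfo::isWrite(), as in the linked alias_analysis.cpp): an op writes to its i-th input exactly when that argument carries a write alias annotation, e.g. the "(a!)" on aten::append's self argument.

```cpp
#include <cstddef>
#include <vector>

// Toy stand-ins for c10::AliasInfo / c10::Argument / c10::FunctionSchema.
struct ToyAliasInfo { bool is_write; };
struct ToyArgument { const ToyAliasInfo* alias_info; };  // nullptr when unannotated
struct ToySchema { std::vector<ToyArgument> arguments; };

// True if the schema annotates the i-th input as written, i.e. the op
// mutates that input in place.
bool input_is_written(const ToySchema& schema, std::size_t i) {
  if (i >= schema.arguments.size()) return false;
  const ToyAliasInfo* info = schema.arguments[i].alias_info;
  return info != nullptr && info->is_write;
}
```

This answers the per-input question directly: only the annotated argument counts as modified, rather than every value that happens to be used by a modifying op.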

Contributor


This would also let you check which inputs are modified by an op, rather than assuming that any value used by a modifying op is modified.

Collaborator Author


Thanks @mfeliz-cruise! This is really helpful!
The method you proposed is much cleaner than the current solution.
Thanks for sharing the AliasInfo APIs; I'm going to take a look and try to use them to find the modifying nodes.

if (segmented_blocks[i].contain_raw_value(use.first)) {
use.second.produce_id = i;
if (!inputs_to_resolve.empty()) {
std::vector<torch::jit::Node*> dependency_nodes = getDependencyNodes(inputs_to_resolve);
Contributor

@mfeliz-cruise mfeliz-cruise Jun 2, 2022


getDependencyNodes will miss any modifying dependency ops as it is currently written.

Ex.
%0 : ListConstruct()
%1 : aten::append(%0, %val)
%2 : aten::append(%1, %val)
%3 : aten::cat(%0)

getDependencyNodes for the aten::cat will only return the ListConstruct.

Collaborator Author


Thanks @mfeliz-cruise!
Let me check whether there are any APIs we can use in TorchScript to find all the modifying ops as well.

@github-actions github-actions bot added the component: tests Issues re: Tests label Jun 16, 2022
@peri044
Collaborator

peri044 commented Jun 22, 2022

Closing in favor of #1140


Successfully merging this pull request may close these issues.

✨[Feature] Refactoring the Graph Segmentation in Partitioning