
✨[Feature] Perform constant folding of operations in the graph if nodes can be evaluated at compile time #1266


Closed
peri044 opened this issue Aug 15, 2022 · 11 comments

Comments

@peri044
Collaborator

peri044 commented Aug 15, 2022

Is your feature request related to a problem? Please describe.

%28 : int = aten::size(%22, %27)
...
%31 : Tensor[] = tensorrt::execute_engine(%30, %trt_engine_0x55cc735470f0)
%32 : Tensor, %33 : Tensor = prim::ListUnpack(%31)

%34 : int = prim::Constant[value=15]()
%35 : int = prim::Constant[value=7]()
%36 : int = prim::Constant[value=0]()
%37 : int[] = prim::ListConstruct(%28, %34, %34, %35)

...
%41 : Tensor[] = tensorrt::execute_engine(%40, %trt_engine_0x55cc73547290)
%bar : Tensor = aten::reshape(%foo, %37)

In the snippet above we have a reshape node %bar which is forced to fall back by min_block_size. Because %bar is forced to fall back, its input %37 = prim::ListConstruct is also forced to fall back along with %28 : int = aten::size. In both cases these ops are supported by evaluators and (at least with static shapes) should be resolved to constants during conversion. Currently these ops create unnecessary breaks between TRT regions, which will impact performance.

Describe the solution you'd like
Could evaluatable nodes be resolved to constants in the torchscript graph before partitioning to avoid this impact on the partition?

Describe alternatives you've considered
Could fallback nodes with no dependencies on active TRT nodes be consolidated to avoid breaking TRT regions? (In this case %37 is not used anywhere other than in the reshape node %bar and could be moved to just before its use)

@peri044 peri044 added the feature request New feature or request label Aug 15, 2022
@narendasan
Collaborator

Shouldn’t OpSupported report back that the node can be evaluated?

bool OpSupported(const torch::jit::Node* n) {
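For context, a rough paraphrase of what that check does (written from memory, so the exact helper names may differ from the current source): a node is reported as supported if either the evaluator registry or the converter registry can handle it.

bool OpSupported(const torch::jit::Node* n) {
  // Supported == "we can evaluate it at conversion time" OR "we can convert it to TRT layers".
  return evaluators::shouldEvalAtConversionTime(n) || converters::node_is_convertable(n);
}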

@narendasan
Collaborator

#1263: this PR aims to log every change in a node's target executor; it might help clear up what is happening.

@peri044
Collaborator Author

peri044 commented Aug 15, 2022

Shouldn’t OpSupported report back that the node can be evaluated?

bool OpSupported(const torch::jit::Node* n) {

Yes. But the request here is to remove these nodes altogether from the torchscript graph before the partitioning phase (i.e. make the list a constant which is passed to the reshape node directly) so that the ListConstruct won't obstruct partitioning and create additional segments.

%28 : int = aten::size(%22, %27)
%37 : int[] = prim::ListConstruct(%28, %34, %34, %35)

@narendasan
Collaborator

Is this because of the input signature changes?

@ncomly-nvidia ncomly-nvidia added the component: core Issues re: The core compiler label Aug 15, 2022
@mfeliz-cruise
Contributor

mfeliz-cruise commented Aug 15, 2022

My thinking here is that when an op is supported with an Evaluator, it is resolvable to a constant in TRT. If these ops are resolvable to constants in TRT they should also be resolvable to constants in TorchScript. If we implemented an evaluator pass which ran before partitioning and evaluated any evaluatable node (ex. aten::size with static shaped input) we could replace those nodes with constants in the graph. We could then rerun jit's constant propagation potentially simplifying the graph further.

I think this has a few benefits:

  1. Simplifies the graph before partitioning, avoids cases like the example in the original post where an evaluatable op is forced to fallback and breaks up a TRT segment
  2. Allows testing of evaluators independently of partitioning and conversion
  3. Improved debug, makes it clear and visible which ops are assumed to be constant
  4. May simplify torch fallback regions
  5. Allows support for the same op with either a converter or an evaluator (ex. aten::size). If the evaluator is contextually able to resolve a node to a constant (ex. static-shape aten::size), it will not be present in the graph at partitioning/conversion time. If it is still present, it will need to be handled in fallback or through a converter which can handle dynamic shapes.
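As a concrete illustration of the proposed pass, here is a minimal sketch. EvaluateConstantNodes and TryEvaluateNode are hypothetical names, not existing Torch-TensorRT APIs; TryEvaluateNode stands in for dispatching into the evaluator registry and only handles one schema here to keep the sketch self-contained.

#include <torch/csrc/jit/ir/constants.h>
#include <torch/csrc/jit/ir/ir.h>
#include <torch/csrc/jit/passes/constant_propagation.h>

// Hypothetical stand-in for the evaluator registry: returns a value if the
// node can be evaluated from constant inputs, nullopt otherwise.
c10::optional<c10::IValue> TryEvaluateNode(
    const torch::jit::Node* n,
    const std::vector<c10::IValue>& inputs) {
  if (n->kind() == c10::Symbol::fromQualString("aten::mul") &&
      inputs.size() == 2 && inputs[0].isInt() && inputs[1].isInt()) {
    return c10::IValue(inputs[0].toInt() * inputs[1].toInt());
  }
  return c10::nullopt;
}

// Sketch of the pre-partitioning pass: evaluate any node whose inputs are all
// constants and which the registry can handle, replace it with a
// prim::Constant, then let JIT constant propagation simplify further.
void EvaluateConstantNodes(std::shared_ptr<torch::jit::Graph>& g) {
  std::vector<torch::jit::Node*> folded;
  for (auto* n : g->nodes()) {  // top-level block only, for brevity
    if (n->outputs().size() != 1) {
      continue;
    }
    std::vector<c10::IValue> inputs;
    bool all_const = true;
    for (auto* in : n->inputs()) {
      auto ival = torch::jit::toIValue(in);
      if (!ival) {
        all_const = false;
        break;
      }
      inputs.push_back(*ival);
    }
    if (!all_const) {
      continue;
    }
    if (auto result = TryEvaluateNode(n, inputs)) {
      torch::jit::WithInsertPoint guard(n);
      n->output()->replaceAllUsesWith(g->insertConstant(*result));
      folded.push_back(n);
    }
  }
  for (auto* n : folded) {
    n->destroy();
  }
  torch::jit::ConstantPropagation(g);
}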

@narendasan
Collaborator

narendasan commented Aug 15, 2022

My thinking here is that when an op is supported with an Evaluator, it is resolvable to a constant in TRT. If these ops are resolvable to constants in TRT they should also be resolvable to constants in TorchScript. If we implemented an evaluator pass which ran before partitioning and evaluated any evaluatable node (ex. aten::size with static shaped input) we could replace those nodes with constants in the graph. We could then rerun jit's constant propagation potentially simplifying the graph further.

I think this has a few benefits:

1. Simplifies the graph before partitioning, avoids cases like the example in the original post where an evaluatable op is forced to fallback and breaks up a TRT segment

2. Allows testing of evaluators independently of partitioning and conversion

3. Improved debug, makes it clear and visible which ops are assumed to be constant

4. May simplify torch fallback regions

5. Allows support for the same op with either a converter or an evaluator (ex. aten::size). If the evaluator is contextually able to resolve a node to a constant (ex. static-shape aten::size), it will not be present in the graph at partitioning/conversion time. If it is still present, it will need to be handled in fallback or through a converter which can handle dynamic shapes.

This is true in some cases but not all. For example a prim::ListConstruct could take a number of tensors which are the result of preceding TRT converted nodes. In this case the nvinfer1::ITensors must exist to create the list.

Therefore, such a preprocessing system would need to be able to distinguish constant nodes using some characteristic other than node kind. All that the torchscript IR guarantees is that once a value is produced it will never change, but producing that value may depend on one or many preceding ops being executed, and not all of those dependencies are guaranteed to be evaluatable. You could potentially evaluate to constants any node that traces back solely to constants, but I think that's what constant-prop/pool already does. There is the slight detail that we have shape information, which could fold a few more ops, but then I would think we should fork the constant-prop/pool passes to take that into account rather than take other approaches.

2. We do have an evaluator test rig, but not tests for all evaluators

3. I think I need some more detail here

5. We currently support per-schema specialization, e.g. aten::add has many schemas, some of which are evaluators and some converters. But yes, we cannot currently handle having the same schema registered in both places and picking between them. #1130 could help enable this.

@mfeliz-cruise
Contributor

mfeliz-cruise commented Aug 15, 2022

Thanks @narendasan. Are there more examples beyond prim::ListConstruct and prim::ListUnpack? I agree these do not fit the resolve-to-constant pattern I'm trying to address here. Could these be considered as converters?

Therefore, such a preprocessing system would need to be able to distinguish constant nodes using some characteristic other than node kind. All that the torchscript IR guarantees is that once a value is produced it will never change, but producing that value may depend on one or many preceding ops being executed, and not all of those dependencies are guaranteed to be evaluatable. You could potentially evaluate to constants any node that traces back solely to constants, but I think that's what constant-prop/pool already does. There is the slight detail that we have shape information, which could fold a few more ops, but then I would think we should fork the constant-prop/pool passes to take that into account rather than take other approaches.

This addresses one of my concerns. I'm worried about cases where evaluators may resolve a value incorrectly to a constant or error out because the input is not constant. Ex. aten::mul.int of a value from an aten::size with dynamic input shapes should not be resolved to a constant. I think the case where we can resolve an aten::mul.int to a constant that is still in the Torch-TensorRT input graph after constant propagation is actually fairly rare, and would only come up because of constants that we can identify that torchscript cannot (ex. size with static input shapes). If we removed these kinds of evaluators and instead relied on constant propagation after resolving known-to-torch-trt constants (size with static input shapes, dtype), then I think they would be automatically removed by constant propagation in all the cases where it would be valid for torch-trt to do the same. In the remaining cases the aten::mul would still be in the graph and could fall back correctly or be handled by a future converter that could correctly capture the non-constant input.

Based on your input, I think this is what I'm proposing:

  1. Add a jit pass which explicitly resolves aten::size and similar ops that torch-tensorrt can prove are constants based on the input specification (ideally this would also be able to propagate cases where individual values in a shape are constant, ex. dynamic batch with other dimensions static; see the sketch at the end of this comment). Rerun jit constant propagation to eliminate any ops which now have constant inputs.
  2. Remove most evaluators that today resolve to constant values (if they are not resolvable by constant propagation, assume their inputs are non-constant and we must fall back or support them with converters). ListConstruct, ListUnpack, Constant and a small number of special cases would remain.

This would allow us to explicitly call out what torch-trt is assuming is constant in the initial jit pass, and give us a clear torchscript dump after constant propagation of what is still in the graph that will need to be converted or otherwise handled in partitioning. It removes the risk of evaluators incorrectly resolving a value to a constant, creates a path to preserve these ops in fallback regions without erroring out, and allows for the possibility of adding new converter support for nodes that remain in the graph with non-constant inputs (ex. aten::size outputting a shape tensor which could be consumed by a dynamic reshape).
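A hedged sketch of the check behind point 1 above, assuming the shapes from the input specification have already been written onto the graph's TensorTypes. tryGetStaticDim is a hypothetical helper, not existing Torch-TensorRT code; the idea is that it can prove a single dimension constant even when other dimensions (e.g. batch) are dynamic.

#include <torch/csrc/jit/ir/constants.h>
#include <torch/csrc/jit/ir/ir.h>

// Hypothetical helper: for an aten::size(Tensor self, int dim) node, return
// the size of the requested dimension if it can be proven static from the
// value's TensorType, even when other dimensions (e.g. batch) are dynamic.
c10::optional<int64_t> tryGetStaticDim(const torch::jit::Node* n) {
  if (n->kind() != c10::Symbol::fromQualString("aten::size") ||
      n->inputs().size() != 2) {
    return c10::nullopt;
  }
  auto tensor_type = n->input(0)->type()->cast<c10::TensorType>();
  auto dim_ival = torch::jit::toIValue(n->input(1));
  if (!tensor_type || !dim_ival || !dim_ival->isInt()) {
    return c10::nullopt;
  }
  auto sizes = tensor_type->sizes();  // per-dimension, possibly partial, shape info
  if (!sizes.size().has_value()) {
    return c10::nullopt;  // rank unknown
  }
  auto rank = static_cast<int64_t>(*sizes.size());
  auto dim = dim_ival->toInt();
  if (dim < 0) {
    dim += rank;  // normalize negative dims
  }
  if (dim < 0 || dim >= rank) {
    return c10::nullopt;
  }
  return sizes[dim];  // set only if this dimension is statically known
}

Folding the result of such a check into a prim::Constant and then rerunning jit constant propagation would remove the dependent ops exactly in the cases where torch-trt can actually prove them constant.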

@narendasan
Collaborator

narendasan commented Aug 16, 2022

Thanks @narendasan. Are there more examples beyond prim::ListConstruct and prim::ListUnpack? I agree these do not fit the resolve-to-constant pattern I'm trying to address here. Could these be considered as converters?

prim::TupleConstruct/Unpack, aten::append, aten::size, aten::clone, aten::copy are all evaluators which are designed to handle ITensors, i.e. potentially non-constant data. They are not converters since converters are designed to purely emit ITensors. Evaluators produce IValues, so types such as int, double, Tensor, Tuple, List (or wrapped ITensors). Some of these could be converters, but they handle more than just ITensors. We have in the past discussed unifying Evaluators and Converters to all emit IValues, but it did not seem like it added too much other than reducing the number of registries we have. Maybe we will do it some day but I don't think it helps too much here.

Re. 1. I think this sounds reasonable, although I'm not sure how many cases it will hit; as you point out, as soon as there is dynamic shape we are on equal footing with TorchScript lowering information-wise (give or take a few dimensions which are static across the input range). Regardless, for the effort required I don't see any harm in attempting to implement such a pass. The MVP I have in my head (roughly sketched below) is really just in-place replacing directly known (i.e. what we get from the input specs) aten::size, aten::shape with prim::Constant tuples or ints, then seeing what constant pool/prop does. But as soon as we need to go through intermediate ops and start executing subgraphs ourselves, I think that opens a can of worms I am not convinced is worth it yet.
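A minimal sketch of that MVP, assuming the graph's input TensorTypes already carry static shapes from the input specs (or from tracing). FoldStaticSizes is a hypothetical name, not an existing pass.

#include <torch/csrc/jit/ir/ir.h>
#include <torch/csrc/jit/passes/constant_pooling.h>
#include <torch/csrc/jit/passes/constant_propagation.h>

// Hypothetical pass: replace aten::size(Tensor self) -> int[] with a
// prim::Constant when the input's TensorType records a fully static shape,
// then let the stock JIT passes propagate and deduplicate the constants.
void FoldStaticSizes(std::shared_ptr<torch::jit::Graph>& g) {
  const auto size_sym = c10::Symbol::fromQualString("aten::size");
  std::vector<torch::jit::Node*> to_destroy;

  for (auto* n : g->nodes()) {  // top-level block only, for brevity
    if (n->kind() != size_sym || n->inputs().size() != 1) {
      continue;
    }
    auto tensor_type = n->input(0)->type()->cast<c10::TensorType>();
    if (!tensor_type) {
      continue;
    }
    auto concrete = tensor_type->sizes().concrete_sizes();
    if (!concrete) {
      continue;  // shape not fully static; leave the node for fallback/converters
    }
    torch::jit::WithInsertPoint guard(n);
    auto* folded = g->insertConstant(*concrete);  // prim::Constant int[]
    n->output()->replaceAllUsesWith(folded);
    to_destroy.push_back(n);
  }
  for (auto* n : to_destroy) {
    n->destroy();
  }

  // "See what constant pool/prop does": downstream ListConstructs, int math,
  // etc. with now-constant inputs get folded by TorchScript itself.
  torch::jit::ConstantPropagation(g);
  torch::jit::ConstantPooling(g);
}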

Re. 2. I don't really see why we need to remove evaluators; the combo of injecting more shape information into the graph and then running constant pooling won't necessarily resolve all ops that would be handled by evaluators. My understanding of the fundamental issue here, after talking to @peri044, was that data like shapes which Torch-TensorRT knows but TorchScript lowering doesn't is not taken into account. Seems like 1. solves that on its own. If evaluators just don't get called anymore, I guess that's a plus, but I don't really expect that to be the case.

Also, a while back when we were trying to start implementing shape analysis we were chatting with the TS folks, and they said they were looking at baking shape info into the graph themselves. I don't know what came of that or if it was moved to a new project like functorch or fx, but it would be worth taking a look at what they have these days that we can leverage.

@narendasan
Collaborator

Seems like tracing is now including shape information:

graph(%self : __torch__.GatherNDListModel,
      %x : Float(3, 4, 5, strides=[20, 5, 1], requires_grad=0, device=cuda:0)):
  %4 : Long(2, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value= 1  0 [ CPULongType{2} ]]() # 1274.py:19:0
  %5 : Device = prim::Constant[value="cuda:0"]() # 1274.py:19:0
  %6 : int = prim::Constant[value=4]() # 1274.py:19:0
  %7 : bool = prim::Constant[value=0]() # 1274.py:19:0
  %8 : bool = prim::Constant[value=0]() # 1274.py:19:0
  %9 : NoneType = prim::Constant()
  %10 : Long(2, strides=[1], requires_grad=0, device=cuda:0) = aten::to(%4, %5, %6, %7, %8, %9) # 1274.py:19:0
  %11 : int = prim::Constant[value=1]() # 1274.py:19:0
  %12 : int = prim::Constant[value=0]() # 1274.py:19:0
  %13 : int = prim::Constant[value=9223372036854775807]() # 1274.py:19:0
  %14 : int = prim::Constant[value=1]() # 1274.py:19:0
  %15 : Float(3, 4, 5, strides=[20, 5, 1], requires_grad=0, device=cuda:0) = aten::slice(%x, %11, %12, %13, %14) # 1274.py:19:0
  %16 : Long(2, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value= 1  0 [ CPULongType{2} ]]() # 1274.py:19:0
  %17 : Device = prim::Constant[value="cuda:0"]() # 1274.py:19:0
  %18 : int = prim::Constant[value=4]() # 1274.py:19:0
  %19 : bool = prim::Constant[value=0]() # 1274.py:19:0
  %20 : bool = prim::Constant[value=0]() # 1274.py:19:0
  %21 : NoneType = prim::Constant()
  %22 : Long(2, strides=[1], requires_grad=0, device=cuda:0) = aten::to(%16, %17, %18, %19, %20, %21) # 1274.py:19:0
  %23 : NoneType = prim::Constant()
  %24 : Tensor?[] = prim::ListConstruct(%10, %23, %22)
  %25 : Float(2, 4, strides=[4, 1], requires_grad=0, device=cuda:0) = aten::index(%15, %24) # 1274.py:19:0
  return (%25)

@mfeliz-cruise
Contributor

Would we hit issues with dynamic shapes if we rely on the traced shape values?

@narendasan
Collaborator

Potentially. Also, I think this would force people to use trace. We need to investigate this further.

@peri044 peri044 closed this as completed Sep 22, 2022