
✨[Feature] Print out Python file name, line number, and/or op for warnings, compilation failures #949


Closed
chaoz-dev opened this issue Mar 31, 2022 · 7 comments
Labels
feature request New feature or request

Comments

@chaoz-dev
Contributor

Is your feature request related to a problem? Please describe.
By default, when running Torch-TensorRT from Python, most errors encountered during compilation print a stack trace originating either from the Python compilation command (e.g. something like torch_tensorrt_pypi/torch_tensorrt/ts/_compiler.py", line 119, in compile compiled_cpp_mod = _C.compile_graph(module._c, _parse_compile_spec(spec)). RuntimeError: ...) or from the C++ file backing the implementation.

Warnings during runtime may be printed without any identifying information.

The current stack traces and output by themselves do not provide enough information to diagnose or pinpoint which PyTorch op is causing the offending behavior. This is especially egregious when applying Torch-TensorRT to large and complex models, where conversion failures are currently difficult to identify without unwinding and stepping through the model op by op to locate the root cause.

Describe the solution you'd like
Ideally, when an error or warning occurs, we should print output identifying the offending op (at minimum), as well as the file and line number at which the op occurs, in addition to other stack traces, errors, and warnings.

Describe alternatives you've considered
Currently, I've tried log printing at the graph level to better identify errors during the conversion process. However, there are issues with this approach:

  1. The error may be encountered before the node or subgraph where it occurs is printed;
  2. The output often appears correct and, in any case, does not identify the offending op.

Additional context
I think this ask is potentially complicated but somewhat possible... On the one hand, Torch-TensorRT converts using the JIT output, which means obtaining metadata could be more complex than just reading the model directly. On the other hand, looking at the Interpreting Graphs section of the documentation, it appears file and line number information should be accessible somewhere...
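For reference, TorchScript graph nodes do seem to retain the original Python source location; a minimal sketch (run on a toy scripted module, not through Torch-TensorRT) that prints each node's recorded source range might look like:

import torch

class Toy(torch.nn.Module):
    def forward(self, x):
        return torch.tril(x) + 1

scripted = torch.jit.script(Toy())

# Each node in the TorchScript graph records the Python location it was
# scripted from; sourceRange() returns that location (file and line) as text.
for node in scripted.graph.nodes():
    print(node.kind(), node.sourceRange())

If this information survives lowering, it could in principle be attached to conversion warnings and errors.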

@chaoz-dev chaoz-dev added the feature request New feature or request label Mar 31, 2022
@narendasan
Collaborator

We already print out the trace to the location of unsupported operators. In the case that you require full compilation, these will be printed as errors. In the case that you are using partial compilation, you will see these messages as Debug/Info messages. I guess you are likely talking about issues like TORCHTRT_CHECK failures, where these are likely issues with Torch-TensorRT itself, so IMO the source location is not especially relevant. Really, the key information here is the operator, at least in the UX model we envision, where we want to avoid model modification as much as possible, since the first thing we would expect users to try is disabling that operator. This information is provided in the debug-level messaging with messages like INFO: [TRTorch Conversion Context] - Adding Layer %2 : Tensor = aten::gt(%0, %1) (ctx.AddLayer) before entering any particular layer.

We could try to formalize this a bit so we can summarize the context in the converter, maybe something like:

Context: 
Converting: %2 : Tensor = aten::gt(%0, %1) 
Arguments: [%0: ..., %1: ...] 
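For context, the debug-level conversion messages mentioned above can already be surfaced from the Python API; a rough sketch, assuming the torch_tensorrt.logging helpers and a hypothetical already-scripted scripted_model:

import torch
import torch_tensorrt

# Raise the reportable log level so "Adding Layer ..." style conversion
# messages are printed while the module is being converted.
with torch_tensorrt.logging.debug():
    trt_mod = torch_tensorrt.compile(
        scripted_model,  # hypothetical: an already-scripted torch.jit.ScriptModule
        inputs=[torch_tensorrt.Input((1, 3, 224, 224))],
        enabled_precisions={torch.float},
    )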

@chaoz-dev
Contributor Author

chaoz-dev commented Apr 1, 2022

Yeah, when fallback occurs due to unsupported operators, this already prints out the unsupported operator as you mentioned, with the file and line number (e.g. #803). For this request, I was thinking more along the lines of the failures seen in something like #922:

Traceback (most recent call last):
  File "/scripts/tril.py", line 39, in <module>
    model_trt = torchtrt.compile(
  File "/opt/conda/lib/python3.8/site-packages/torch_tensorrt/_compile.py", line 97, in compile
    return torch_tensorrt.ts.compile(ts_mod, inputs=inputs, enabled_precisions=enabled_precisions, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch_tensorrt/ts/_compiler.py", line 119, in compile
    compiled_cpp_mod = _C.compile_graph(module._c, _parse_compile_spec(spec))
RuntimeError: [Error thrown at core/partitioning/shape_analysis.cpp:66] Expected ivalues_maps.count(input) to be true but got false
Could not find torch::jit::Value* 47 produced from %47 : int = prim::dtype(%52) in lowering graph for mini graph input.

Having better printouts here would be helpful for pinpointing the offending op in large models, if only for error isolation and bug reporting. I believe this is in line with your statement ("Really here the key information is the operator at least in the UX model we envision where we want to avoid model modification as much as possible, since the first thing we would expect users to try is disabling that operator."); currently, addressing any issue that does not print enough file/line-number metadata requires heavy modification of the original graph to isolate and understand the root cause, which necessarily modifies the lowered graph and may hide the original failure. The above message gives no indication as to which op (or Python construct) is causing the failure, so it is impossible to disable ops (or report bugs) based on this output alone.

Regarding debug-level messaging, I am opposed to using this as the default solution for addressing issues in general, since large and complex graphs (where Torch-TensorRT will frequently be used) output massive amounts of information that is irrelevant to the root issue; having to parse this adds friction to the debugging process when the user is frequently only looking for one offending op. It is also easy to change the lowered graph when searching for a problematic op, or combination of ops, especially when attempting to isolate errors, which may hide certain failures, so being able to print a failing op adds an additional layer of sanity checking. Additionally, the output here frequently looks successful and correct, so it is not apparent at a glance (if at all) where the issue lies. That said, I would still like to see your suggestion of formalizing the printout realized, as this may indeed already be sufficient (and I may just not be reading the debug output correctly or closely enough).

In summary, I'm in favor of giving these failures the same printout treatment that we currently have for unsupported ops, since the end result is that the user must either 1. explicitly tell Torch-TensorRT to ignore the op, 2. modify the graph to remove the op, or 3. report the op conversion failure as a bug or feature request. I would frame this as raising the most basic information (i.e. the op, file, and line number that may already be present) from the lowered (debug) output to regular output in all situations where a warning or error appears, because this is usually the best place to start debugging.

@narendasan
Collaborator

Yeah, in general I am not against raising the severity of messages and providing more context, but I do want to be purposeful in what we report to the user in these error messages and ideally make our errors actionable, i.e. "here is an issue we detected, here is a potential solution". What I want to avoid is dumping a ton of information on people in cases where it's misleading or there is no action to take. This would include your Could not find torch::jit::Value* 47 produced from %47 : int = prim::dtype(%52). This is likely an issue with the partitioning system itself, but at least for me, dumping a backtrace implies it is an issue with the user's code and not Torch-TensorRT, and may cause people to spend a lot of time doing blackbox debugging on their own code base. As an aside, the partitioning system just needs a lot of work on better logging in general.

Totally agree that in cases like this https://github.com/NVIDIA/Torch-TensorRT/blob/ba9f730fe049ffd31cd1d30e7f99a74c82086342/core/conversion/evaluators/aten.cpp#L220 a backtrace would be super useful.

In cases like this https://github.com/NVIDIA/Torch-TensorRT/blob/ba9f730fe049ffd31cd1d30e7f99a74c82086342/core/partitioning/shape_analysis.cpp#L67, the failure does not seem actionable by the user; if something goes wrong here, I'm not sure what options a user has. Perhaps dumping context at a lower level may help with bug reporting, but just throwing out the trace here obfuscates the issue.

Philosophically, I would prefer logging to be as close as possible to how people expect compiler logging to work: Warnings imply a potential risk, and Errors imply some incompatibility between the source code and the expectations of the compiler. In both cases the action is on the user to accept the risk or modify their code. The issue is really that, unlike GCC, Torch-TRT is still fairly new, so hitting bugs in the compiler happens more often, and really the action is on us; the user's code is fine.

One thing that might help here is to better distinguish between RuntimeErrors, which come from TORCHTRT_CHECK and the Torch-TRT exception system, and log-level "Errors", which are detected incompatibilities in the user's source code.

Regarding "therefore it is impossible to disable ops": I would say the expectation here is just to throw the offending op into torch_executed_ops.
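For illustration, a sketch of that workaround, assuming partial compilation and using aten::gt purely as a placeholder op name:

import torch
import torch_tensorrt

# Keep the suspect op running in PyTorch rather than converting it to TensorRT,
# so a conversion failure tied to that op can be isolated or worked around.
trt_mod = torch_tensorrt.compile(
    scripted_model,  # hypothetical: an already-scripted module
    inputs=[torch_tensorrt.Input((1, 3, 224, 224))],
    enabled_precisions={torch.float},
    torch_executed_ops=["aten::gt"],  # placeholder: the offending op
    require_full_compilation=False,
)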

TL;DR: Totally agree we can provide more information to make it easier to narrow down where failures are happening, but we want to be careful not to send people off on wild bug hunts for things they are not expected to fix. I think it's worth taking a full pass over the logging system and adding more information in general.

@chaoz-dev
Contributor Author

chaoz-dev commented Apr 5, 2022

Hmm, yeah, that's fair; I'm in agreement with what's stated here, especially with regard to not sending users off on a wild goose chase. That said, would you say that just logging an op and file location (not the full stack trace) is still too much here? Even if it's an underlying issue with the conversion process (i.e. the user's code is fine), this still seems like a good (and easier) way for the user to verify that their op usage is correct (because we're saying this op is failing, but on inspection its usage is correct) and that it's the compilation that's failing, which also makes it easier to report the bug.

@github-actions

github-actions bot commented Jul 5, 2022

This issue has not seen activity for 90 days. Remove the stale label or comment, or this will be closed in 10 days.

@ncomly-nvidia
Contributor

Bump @narendasan ^

@narendasan
Collaborator

Converting this to a discussion, as it is not a specific request but a general request for improvement.

@pytorch pytorch locked and limited conversation to collaborators Sep 2, 2022
@narendasan narendasan converted this issue into discussion #1331 Sep 2, 2022

