
Fix special scalar handling for addcdiv and addcmul #3953


Merged (2 commits merged on Sep 13, 2022)

Conversation

@ymwangg (Contributor) commented Aug 30, 2022

This PR fixes the issue described in #3942.
cc @JackCaoG

@JackCaoG (Collaborator) commented:

test_take_xla_uint8 also fails on my PR; let me quickly check if it also fails at head.

@JackCaoG (Collaborator) commented:

The test actually passed on master, so I guess a rebase should solve the issue here.

@ymwangg (Contributor, Author) commented Sep 1, 2022

The CI error appears to be unrelated. We've seen this failure in our internal tests before, but it's not easily reproducible. Any idea what's wrong here?

+ python3 /tmp/pytorch/xla/test/test_mp_sync_batch_norm.py
E0901 00:12:00.777439506  611072 server_chttp2.cc:40]        {"created":"@1661991120.777385369","description":"No address added out of total 1 resolved","file":"external/com_github_grpc_grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":395,"referenced_errors":[{"created":"@1661991120.777383508","description":"Failed to add any wildcard listeners","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_posix.cc","file_line":342,"referenced_errors":[{"created":"@1661991120.777362999","description":"Address family not supported by protocol","errno":97,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":420,"os_error":"Address family not supported by protocol","syscall":"socket","target_address":"[::]:59883"},{"created":"@1661991120.777383088","description":"Unable to configure socket","fd":30,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1661991120.777380660","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":189,"os_error":"Address already in use","syscall":"bind"}]}]}]}
2022-09-01 00:12:00.782695: E  611072 tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:580] UNKNOWN: Could not start gRPC server
Exception in device=GPU:1: tensorflow/compiler/xla/xla_client/xrt_local_service.cc:58 : Check failed: tensorflow::NewServer(server_def, &server_) == ::tensorflow::Status::OK() (UNKNOWN: Could not start gRPC server vs. OK)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch_xla-1.13-py3.7-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 330, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/opt/conda/lib/python3.7/site-packages/torch_xla-1.13-py3.7-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    _setup_replication()
  File "/opt/conda/lib/python3.7/site-packages/torch_xla-1.13-py3.7-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 315, in _setup_replication
    device = xm.xla_device()
  File "/opt/conda/lib/python3.7/site-packages/torch_xla-1.13-py3.7-linux-x86_64.egg/torch_xla/core/xla_model.py", line 244, in xla_device
    devices = get_xla_supported_devices(devkind=devkind)
  File "/opt/conda/lib/python3.7/site-packages/torch_xla-1.13-py3.7-linux-x86_64.egg/torch_xla/core/xla_model.py", line 138, in get_xla_supported_devices
    xla_devices = _DEVICES.value
  File "/opt/conda/lib/python3.7/site-packages/torch_xla-1.13-py3.7-linux-x86_64.egg/torch_xla/utils/utils.py", line 32, in value
    self._value = self._gen_fn()
  File "/opt/conda/lib/python3.7/site-packages/torch_xla-1.13-py3.7-linux-x86_64.egg/torch_xla/core/xla_model.py", line 20, in <lambda>
    _DEVICES = xu.LazyProperty(lambda: torch_xla._XLAC._xla_get_devices())
RuntimeError: tensorflow/compiler/xla/xla_client/xrt_local_service.cc:58 : Check failed: tensorflow::NewServer(server_def, &server_) == ::tensorflow::Status::OK() (UNKNOWN: Could not start gRPC server vs. OK)
Traceback (most recent call last):
  File "/tmp/pytorch/xla/test/test_mp_sync_batch_norm.py", line 146, in <module>
    xmp.spawn(_mp_fn, args=())
  File "/opt/conda/lib/python3.7/site-packages/torch_xla-1.13-py3.7-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 395, in spawn
    start_method=start_method)
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 154, in join
    exit_code=exitcode
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with exit code 17

Exited with code exit status 1

@JackCaoG (Collaborator) commented Sep 1, 2022

A rerun should solve this issue; let me trigger it.

@@ -741,6 +741,10 @@ absl::flat_hash_map<std::string, absl::variant<int>> ConvertDictToMap(
return map;
}

// Override some upstream torch::lazy env vars for better performance.
// Upstream lazy env vars defined in torch/csrc/lazy/core/config.h.
void SetDefaultLazyEnvVars() { FLAGS_torch_lazy_handle_special_scalars = true; }

Reviewer (Contributor) commented:

Thanks @ymwangg, do we know if it's always safe/better to use constants for special scalars?

Reviewer (Collaborator) commented:

This is what pytorch/xla has been doing; it is just that when we migrate to LTC, we also need to enable a flag to make this behavior consistent.
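
For context, a rough sketch of what "special scalar handling" means here (not the project's actual code): the lazy tracer can embed the scalar values 0 and 1 as inline compile-time constants instead of uploading every scalar as device data, so traces that differ only in such scalars (for example the value argument of addcdiv/addcmul) can reuse the same compiled graph. The type IrValue and the function GetIrValueForScalar below are hypothetical stand-ins; only the flag name mirrors the real FLAGS_torch_lazy_handle_special_scalars.

// Illustrative sketch only; IrValue and GetIrValueForScalar are stand-ins,
// not the torch::lazy or torch_xla API.
#include <iostream>

struct IrValue {
  double value;
  bool is_constant;  // true: baked into the graph; false: passed as device data
};

bool handle_special_scalars = true;  // mirrors FLAGS_torch_lazy_handle_special_scalars

IrValue GetIrValueForScalar(double value) {
  // With the flag on, 0 and 1 become compile-time constants, so graphs that
  // differ only in these common scalars can share one compilation.
  if (handle_special_scalars && (value == 0.0 || value == 1.0)) {
    return {value, /*is_constant=*/true};
  }
  // Otherwise the scalar is treated as a graph input (device data).
  return {value, /*is_constant=*/false};
}

int main() {
  std::cout << GetIrValueForScalar(1.0).is_constant << "\n";  // 1
  std::cout << GetIrValueForScalar(0.5).is_constant << "\n";  // 0
}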

@JackCaoG (Collaborator) commented Sep 2, 2022:

I guess the right thing to do here is FLAGS_torch_lazy_handle_special_scalars = !xla::sys_util::GetEnvBool("XLA_NO_SPECIAL_SCALARS", false), given that in
https://github.com/pytorch/xla/blob/master/torch_xla/csrc/tensor.cpp#L257
we also have an env var to control this behavior.
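
To make the suggestion concrete, here is a minimal, self-contained sketch of that behavior. It assumes xla::sys_util::GetEnvBool is a boolean environment lookup with a default; GetEnvBoolOrDefault below is an illustrative stand-in so the snippet compiles on its own, and the flag is declared locally rather than taken from torch/csrc/lazy/core/config.h.

// Sketch of the suggested behavior, not the merged implementation.
#include <cstdlib>
#include <cstring>
#include <iostream>

// Illustrative stand-in for xla::sys_util::GetEnvBool(name, default).
static bool GetEnvBoolOrDefault(const char* name, bool defval) {
  const char* raw = std::getenv(name);
  if (raw == nullptr || *raw == '\0') {
    return defval;
  }
  return std::strcmp(raw, "0") != 0 && std::strcmp(raw, "false") != 0;
}

// Local stand-in for the upstream flag declared in torch/csrc/lazy/core/config.h.
bool FLAGS_torch_lazy_handle_special_scalars = false;

// Override the upstream torch::lazy default while still honoring
// XLA_NO_SPECIAL_SCALARS, the env var tensor.cpp already checks.
void SetDefaultLazyEnvVars() {
  FLAGS_torch_lazy_handle_special_scalars =
      !GetEnvBoolOrDefault("XLA_NO_SPECIAL_SCALARS", false);
}

int main() {
  SetDefaultLazyEnvVars();
  std::cout << "handle_special_scalars = "
            << FLAGS_torch_lazy_handle_special_scalars << "\n";
}

With a change along these lines, setting XLA_NO_SPECIAL_SCALARS=1 would continue to disable the special-scalar handling, matching the env var already honored in tensor.cpp.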

Reviewer (Collaborator) commented:

I will merge this one to unblock the AWS folks, but will submit a new one to fix it.

@ymwangg (Contributor, Author) commented:

Thanks for the catch! You are right, I overlooked XLA_NO_SPECIAL_SCALARS. I can fix it here or in a future PR, whichever works best for you.

Reviewer (Collaborator) commented:

If you don't mind, can you update it in this PR?
