
Fix special scalar handling for addcdiv and addcmul #3953


Merged (2 commits merged on Sep 13, 2022)

Conversation

@ymwangg (Contributor) commented Aug 30, 2022

This PR fixes the issue described in #3942.
cc @JackCaoG

@JackCaoG (Collaborator) commented:

test_take_xla_uint8 also fails on my PR; let me quickly check if it also fails at head.

@JackCaoG (Collaborator) commented:

The test actually passed on master, so I guess a rebase should solve the issue here.

@ymwangg (Contributor, Author) commented Sep 1, 2022

The CI error appears to be unrelated. We've seen this failure in our internal tests before, but it's not easily reproducible. Any idea what's wrong here?

+ python3 /tmp/pytorch/xla/test/test_mp_sync_batch_norm.py
E0901 00:12:00.777439506  611072 server_chttp2.cc:40]        {"created":"@1661991120.777385369","description":"No address added out of total 1 resolved","file":"external/com_github_grpc_grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":395,"referenced_errors":[{"created":"@1661991120.777383508","description":"Failed to add any wildcard listeners","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_posix.cc","file_line":342,"referenced_errors":[{"created":"@1661991120.777362999","description":"Address family not supported by protocol","errno":97,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":420,"os_error":"Address family not supported by protocol","syscall":"socket","target_address":"[::]:59883"},{"created":"@1661991120.777383088","description":"Unable to configure socket","fd":30,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1661991120.777380660","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":189,"os_error":"Address already in use","syscall":"bind"}]}]}]}
2022-09-01 00:12:00.782695: E  611072 tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:580] UNKNOWN: Could not start gRPC server
Exception in device=GPU:1: tensorflow/compiler/xla/xla_client/xrt_local_service.cc:58 : Check failed: tensorflow::NewServer(server_def, &server_) == ::tensorflow::Status::OK() (UNKNOWN: Could not start gRPC server vs. OK)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch_xla-1.13-py3.7-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 330, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/opt/conda/lib/python3.7/site-packages/torch_xla-1.13-py3.7-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    _setup_replication()
  File "/opt/conda/lib/python3.7/site-packages/torch_xla-1.13-py3.7-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 315, in _setup_replication
    device = xm.xla_device()
  File "/opt/conda/lib/python3.7/site-packages/torch_xla-1.13-py3.7-linux-x86_64.egg/torch_xla/core/xla_model.py", line 244, in xla_device
    devices = get_xla_supported_devices(devkind=devkind)
  File "/opt/conda/lib/python3.7/site-packages/torch_xla-1.13-py3.7-linux-x86_64.egg/torch_xla/core/xla_model.py", line 138, in get_xla_supported_devices
    xla_devices = _DEVICES.value
  File "/opt/conda/lib/python3.7/site-packages/torch_xla-1.13-py3.7-linux-x86_64.egg/torch_xla/utils/utils.py", line 32, in value
    self._value = self._gen_fn()
  File "/opt/conda/lib/python3.7/site-packages/torch_xla-1.13-py3.7-linux-x86_64.egg/torch_xla/core/xla_model.py", line 20, in <lambda>
    _DEVICES = xu.LazyProperty(lambda: torch_xla._XLAC._xla_get_devices())
RuntimeError: tensorflow/compiler/xla/xla_client/xrt_local_service.cc:58 : Check failed: tensorflow::NewServer(server_def, &server_) == ::tensorflow::Status::OK() (UNKNOWN: Could not start gRPC server vs. OK)
Traceback (most recent call last):
  File "/tmp/pytorch/xla/test/test_mp_sync_batch_norm.py", line 146, in <module>
    xmp.spawn(_mp_fn, args=())
  File "/opt/conda/lib/python3.7/site-packages/torch_xla-1.13-py3.7-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 395, in spawn
    start_method=start_method)
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 154, in join
    exit_code=exitcode
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with exit code 17

Exited with code exit status 1

@JackCaoG (Collaborator) commented Sep 1, 2022

A rerun should solve this issue; let me trigger it.

@@ -741,6 +741,10 @@ absl::flat_hash_map<std::string, absl::variant<int>> ConvertDictToMap(
return map;
}

// Override some upstream torch::lazy env vars for better performance.
// Upstream lazy env vars defined in torch/csrc/lazy/core/config.h.
void SetDefaultLazyEnvVars() { FLAGS_torch_lazy_handle_special_scalars = true; }

Reviewer (Contributor) commented:

Thanks @ymwangg, do we know if it's always safe/better to use constants for special scalars?

Reviewer (Collaborator) commented:

This is what pytorch/xla has been doing; it is just that when we migrate to LTC, we also need to enable a flag to make this behavior consistent.
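
For context, a rough sketch of what "special scalar handling" means here (not the project's actual code): the lazy tracer can embed the scalar values 0 and 1 as inline compile-time constants instead of uploading every scalar as device data, so traces that differ only in such scalars (for example the value argument of addcdiv/addcmul) can reuse the same compiled graph. The type IrValue and the function GetIrValueForScalar below are hypothetical stand-ins; only the flag name mirrors the real FLAGS_torch_lazy_handle_special_scalars.

// Illustrative sketch only; IrValue and GetIrValueForScalar are stand-ins,
// not the torch::lazy or torch_xla API.
#include <iostream>

struct IrValue {
  double value;
  bool is_constant;  // true: baked into the graph; false: passed as device data
};

bool handle_special_scalars = true;  // mirrors FLAGS_torch_lazy_handle_special_scalars

IrValue GetIrValueForScalar(double value) {
  // With the flag on, 0 and 1 become compile-time constants, so graphs that
  // differ only in these common scalars can share one compilation.
  if (handle_special_scalars && (value == 0.0 || value == 1.0)) {
    return {value, /*is_constant=*/true};
  }
  // Otherwise the scalar is treated as a graph input (device data).
  return {value, /*is_constant=*/false};
}

int main() {
  std::cout << GetIrValueForScalar(1.0).is_constant << "\n";  // 1
  std::cout << GetIrValueForScalar(0.5).is_constant << "\n";  // 0
}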

@JackCaoG (Collaborator) commented Sep 2, 2022:

I guess the right thing to do here is FLAGS_torch_lazy_handle_special_scalars = !xla::sys_util::GetEnvBool("XLA_NO_SPECIAL_SCALARS", false), given that in
https://github.com/pytorch/xla/blob/master/torch_xla/csrc/tensor.cpp#L257
we also have an env var to control this behavior.
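
To make the suggestion concrete, here is a minimal, self-contained sketch of that behavior. It assumes xla::sys_util::GetEnvBool is a boolean environment lookup with a default; GetEnvBoolOrDefault below is an illustrative stand-in so the snippet compiles on its own, and the flag is declared locally rather than taken from torch/csrc/lazy/core/config.h.

// Sketch of the suggested behavior, not the merged implementation.
#include <cstdlib>
#include <cstring>
#include <iostream>

// Illustrative stand-in for xla::sys_util::GetEnvBool(name, default).
static bool GetEnvBoolOrDefault(const char* name, bool defval) {
  const char* raw = std::getenv(name);
  if (raw == nullptr || *raw == '\0') {
    return defval;
  }
  return std::strcmp(raw, "0") != 0 && std::strcmp(raw, "false") != 0;
}

// Local stand-in for the upstream flag declared in torch/csrc/lazy/core/config.h.
bool FLAGS_torch_lazy_handle_special_scalars = false;

// Override the upstream torch::lazy default while still honoring
// XLA_NO_SPECIAL_SCALARS, the env var tensor.cpp already checks.
void SetDefaultLazyEnvVars() {
  FLAGS_torch_lazy_handle_special_scalars =
      !GetEnvBoolOrDefault("XLA_NO_SPECIAL_SCALARS", false);
}

int main() {
  SetDefaultLazyEnvVars();
  std::cout << "handle_special_scalars = "
            << FLAGS_torch_lazy_handle_special_scalars << "\n";
}

With a change along these lines, setting XLA_NO_SPECIAL_SCALARS=1 would continue to disable the special-scalar handling, matching the env var already honored in tensor.cpp.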

Reviewer (Collaborator) commented:

I will merge this one to unblock the AWS folks, but will submit a new one to fix it.

@ymwangg (Contributor, Author) commented:

Thanks for the catch! You are right, I overlooked XLA_NO_SPECIAL_SCALARS. I can fix it here or in a future PR, whichever works best for you.

Reviewer (Collaborator) commented:

If you don't mind, can you update it in this PR?
