Use tfrt cpu client #3898

Merged
merged 4 commits into from
Aug 18, 2022

Conversation

darisoy
Collaborator

@darisoy darisoy commented Aug 16, 2022

  • Add the CPU_ASYNC_CLIENT env var
  • Change PjRtComputationClient to use GetTfrtCpuClient when testing multiple processes with a CPU
  • Add unit tests
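
As a rough illustration of how the two environment variables mentioned in this thread might be read before a CPU client is created, here is a minimal Python sketch. It is not the code from this PR: the helper name `read_cpu_client_config`, the accepted boolean spellings, and both default values are assumptions made for the example. Only the variable names `CPU_ASYNC_CLIENT` and `CPU_NUM_DEVICES` come from the conversation.

```python
import os


def read_cpu_client_config(environ=None):
    """Hypothetical helper: parse the CPU client settings from the environment.

    Assumptions (not confirmed by the PR): the async client defaults to on,
    "0", "false" and the empty string turn it off, and the device count
    defaults to 1.
    """
    env = os.environ if environ is None else environ
    raw_async = env.get("CPU_ASYNC_CLIENT", "1").strip().lower()
    use_async = raw_async not in ("0", "false", "")
    num_devices = int(env.get("CPU_NUM_DEVICES", "1"))
    return use_async, num_devices


if __name__ == "__main__":
    use_async, num_devices = read_cpu_client_config()
    print(f"async={use_async} devices={num_devices}")
```

Passing a plain dict instead of `os.environ` keeps the helper easy to unit test without mutating the process environment.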

@darisoy darisoy requested a review from will-cromar August 16, 2022 22:41
@will-cromar will-cromar requested a review from JackCaoG August 16, 2022 23:03
Collaborator

@will-cromar will-cromar left a comment

LGTM. Thanks!

@will-cromar will-cromar removed the request for review from JackCaoG August 17, 2022 17:59
@JackCaoG
Collaborator

The error seems real:

test (__main__.TestParallelTensorMNIST) ... 2022-08-17 02:23:06.009032: E  498776 ./tensorflow/compiler/xla/service/collective_ops_utils.h:224] This thread has been waiting for 5000ms for and may be stuck: participant AllReduceParticipantData{buffers=[{element_count=250},{element_count=10},{element_count=5000},{element_count=20},{element_count=16000},{element_count=50},{element_count=500},{element_count=10},{element_count=1}], rendezvous_key=RendezvousKey{run_id=RunId: 564, global_devices=[0,1,2,3], num_local_participants=4, collective_op_kind=cross_replica, op_id=246}, device_ordinal=0, stream=(nil)} waiting for all participants to arrive at rendezvous RendezvousKey{run_id=RunId: 564, global_devices=[0,1,2,3], num_local_participants=4, collective_op_kind=cross_replica, op_id=246}
2022-08-17 02:23:06.021322: E  498773 ./tensorflow/compiler/xla/service/collective_ops_utils.h:224] This thread has been waiting for 5000ms for and may be stuck: participant AllReduceParticipantData{buffers=[{element_count=250},{element_count=10},{element_count=5000},{element_count=20},{element_count=16000},{element_count=50},{element_count=500},{element_count=10},{element_count=1}], rendezvous_key=RendezvousKey{run_id=RunId: 566, global_devices=[0,1,2,3], num_local_participants=4, collective_op_kind=cross_replica, op_id=247}, device_ordinal=2, stream=(nil)} waiting for all participants to arrive at rendezvous RendezvousKey{run_id=RunId: 566, global_devices=[0,1,2,3], num_local_participants=4, collective_op_kind=cross_replica, op_id=247}
2022-08-17 02:23:06.037646: E  498775 ./tensorflow/compiler/xla/service/collective_ops_utils.h:224] This thread has been waiting for 5000ms for and may be stuck: participant AllReduceParticipantData{buffers=[{element_count=250},{element_count=10},{element_count=5000},{element_count=20},{element_count=16000},{element_count=50},{element_count=500},{element_count=10},{element_count=1}], rendezvous_key=RendezvousKey{run_id=RunId: 568, global_devices=[0,1,2,3], num_local_participants=4, collective_op_kind=cross_replica, op_id=248}, device_ordinal=3, stream=(nil)} waiting for all participants to arrive at rendezvous RendezvousKey{run_id=RunId: 568, global_devices=[0,1,2,3], num_local_participants=4, collective_op_kind=cross_replica, op_id=248}
2022-08-17 02:23:06.045691: E  498774 ./tensorflow/compiler/xla/service/collective_ops_utils.h:224] This thread has been waiting for 5000ms for and may be stuck: participant AllReduceParticipantData{buffers=[{element_count=250},{element_count=10},{element_count=5000},{element_count=20},{element_count=16000},{element_count=50},{element_count=500},{element_count=10},{element_count=1}], rendezvous_key=RendezvousKey{run_id=RunId: 570, global_devices=[0,1,2,3], num_local_participants=4, collective_op_kind=cross_replica, op_id=249}, device_ordinal=1, stream=(nil)} waiting for all participants to arrive at rendezvous RendezvousKey{run_id=RunId: 570, global_devices=[0,1,2,3], num_local_participants=4, collective_op_kind=cross_replica, op_id=249}

Collaborator

@will-cromar will-cromar left a comment

Collectives hang forever on CPU with multiple devices. This test passed before only because we ignored CPU_NUM_DEVICES. @darisoy is investigating whether there is an easy fix. Otherwise, we can implement a workaround and go back to testing on one CPU device.
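
To make the "waiting for all participants to arrive at rendezvous" message in the log above more concrete, here is a small Python analogy of a keyed rendezvous. It is a sketch only and does not reflect how XLA implements collectives: `RendezvousTable`, `run`, and the timeout value are hypothetical names and choices, and the keys are simply `(run_id, op_id)` pairs copied from the log.

```python
import threading


class RendezvousTable:
    """Hypothetical keyed rendezvous: one barrier per key (not XLA's code)."""

    def __init__(self, num_participants):
        self._lock = threading.Lock()
        self._barriers = {}
        self._num_participants = num_participants

    def arrive(self, key, timeout):
        """Block until all participants reach `key`; False on timeout."""
        with self._lock:
            barrier = self._barriers.get(key)
            if barrier is None:
                barrier = threading.Barrier(self._num_participants)
                self._barriers[key] = barrier
        try:
            barrier.wait(timeout=timeout)
            return True
        except threading.BrokenBarrierError:
            return False


def run(keys, timeout=1.0):
    """Start one thread per key and report which participants met."""
    table = RendezvousTable(len(keys))
    results = [None] * len(keys)

    def participant(index):
        results[index] = table.arrive(keys[index], timeout)

    threads = [threading.Thread(target=participant, args=(i,))
               for i in range(len(keys))]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    # All four participants use the same key, so the rendezvous completes.
    print(run([(564, 246)] * 4))
    # Each participant uses a different key, so every wait times out.
    print(run([(564, 246), (566, 247), (568, 248), (570, 249)]))
```

In the log, each of the four threads reports a different `run_id` and `op_id` while expecting four local participants. The second call mirrors that pattern, but this is only an observation about the log output, not a confirmed diagnosis of the hang.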

@darisoy darisoy force-pushed the use_tfrt_cpu_client branch from 46192a1 to 9f09c7e Compare August 18, 2022 18:27
Collaborator

@will-cromar will-cromar left a comment

Thanks!

@will-cromar will-cromar requested a review from JackCaoG August 18, 2022 19:42
@darisoy darisoy merged commit a6ea0f3 into master Aug 18, 2022
@darisoy darisoy deleted the use_tfrt_cpu_client branch August 18, 2022 23:07
JackCaoG added a commit that referenced this pull request Sep 3, 2022
JackCaoG added a commit that referenced this pull request Sep 5, 2022
3 participants