❓ Questions & Help
When running inference with BERT-large on a T4 GPU via bert-as-a-service, I could get well over 100 sentence pairs/s on sentence-pair classification. (I am aware that this relies on TF's graph freezing and pruning.)
When running inference with RoBERTa-large on a T4 GPU using native PyTorch and fairseq, I was able to get 70-80 pairs/s on sentence pairs.
Even with TorchScript JIT tracing, I am still only able to get 17 pairs/s on a T4 with the transformers implementation of BERT-large, using a batch size of 8 (which fills most of the GPU memory).
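For context, this is roughly the tracing and throughput setup I am using (a minimal sketch; the checkpoint name, batch size of 8, and sequence length of 512 are just the values described above, and the benchmark loop is illustrative):

```python
import time
import torch
from transformers import BertModel, BertTokenizer

# Load BERT-large with torchscript=True so the model can be traced
tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
model = BertModel.from_pretrained("bert-large-uncased", torchscript=True).cuda().eval()

# Dummy batch padded all the way to max sequence length (batch size 8, seq len 512)
input_ids = torch.randint(0, tokenizer.vocab_size, (8, 512), dtype=torch.long).cuda()
attention_mask = torch.ones_like(input_ids)

# Trace the model with fixed-shape example inputs
traced = torch.jit.trace(model, (input_ids, attention_mask))

# Rough throughput measurement
n_batches = 50
with torch.no_grad():
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_batches):
        traced(input_ids, attention_mask)
    torch.cuda.synchronize()
    elapsed = time.time() - start

print(f"{n_batches * input_ids.size(0) / elapsed:.1f} pairs/s")
```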
Training performance is similarly worse (about 40%-100% longer, even with apex, compared to my previous setup without apex).
One of the primary differences I can think of is that I am now padding every example up to the max sequence length, and decreasing that length improves throughput a lot. Is there a way to avoid padding in transformers and instead pass in a list of PyTorch tensors that can be dynamically sized? A sketch of what I mean is below.
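Something along these lines is what I have in mind, where each batch is padded only to its own longest sequence rather than to max-seq-length (purely illustrative; the `collate_dynamic` helper is just a name I made up for this example):

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")

def collate_dynamic(sentence_pairs):
    # Encode each sentence pair without padding, so lengths vary per example
    encoded = [
        torch.tensor(tokenizer.encode(a, b, add_special_tokens=True), dtype=torch.long)
        for a, b in sentence_pairs
    ]
    # Pad only up to the longest sequence in this batch
    input_ids = pad_sequence(encoded, batch_first=True,
                             padding_value=tokenizer.pad_token_id)
    attention_mask = (input_ids != tokenizer.pad_token_id).long()
    return input_ids, attention_mask

# Example: this batch is padded to its own max length, not to 512
batch = [("A first sentence.", "Its pair."),
         ("A somewhat longer first sentence here.", "And its pair.")]
input_ids, attention_mask = collate_dynamic(batch)
```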
Should I try the TensorFlow implementations instead?
Thank you!