❓ Questions & Help
When running inference with BERT-large on a T4 GPU via bert-as-a-service, I could get well over 100 sentence pairs/s on sentence-pair classification. (I am aware that this relies on TF's graph freezing and pruning.)
When running inference with RoBERTa-large on a T4 GPU using native PyTorch and fairseq, I was able to get 70-80 pairs/s on sentence pairs.
Even with TorchScript JIT tracing, I am still only able to get 17 pairs/s on a T4 with the transformers implementation of BERT-large, using a batch size of 8 (which fills most of the GPU memory).
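For context, this is roughly the tracing and throughput setup I am using (a minimal sketch; the checkpoint name, batch size of 8, and sequence length of 512 are just the values described above, and the benchmark loop is illustrative):

```python
import time
import torch
from transformers import BertModel, BertTokenizer

# Load BERT-large with torchscript=True so the model can be traced
tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
model = BertModel.from_pretrained("bert-large-uncased", torchscript=True).cuda().eval()

# Dummy batch padded all the way to max sequence length (batch size 8, seq len 512)
input_ids = torch.randint(0, tokenizer.vocab_size, (8, 512), dtype=torch.long).cuda()
attention_mask = torch.ones_like(input_ids)

# Trace the model with fixed-shape example inputs
traced = torch.jit.trace(model, (input_ids, attention_mask))

# Rough throughput measurement
n_batches = 50
with torch.no_grad():
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_batches):
        traced(input_ids, attention_mask)
    torch.cuda.synchronize()
    elapsed = time.time() - start

print(f"{n_batches * input_ids.size(0) / elapsed:.1f} pairs/s")
```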
Training performance is similarly worse (about 40%-100% longer, even with apex, compared to my previous setup without apex).
One of the primary differences I can think of is that I am now padding every example up to the max sequence length, and decreasing that length improves throughput a lot. Is there a way to avoid padding in transformers and instead pass in a list of PyTorch tensors that can be dynamically sized? A sketch of what I mean is below.
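Something along these lines is what I have in mind, where each batch is padded only to its own longest sequence rather than to max-seq-length (purely illustrative; the `collate_dynamic` helper is just a name I made up for this example):

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")

def collate_dynamic(sentence_pairs):
    # Encode each sentence pair without padding, so lengths vary per example
    encoded = [
        torch.tensor(tokenizer.encode(a, b, add_special_tokens=True), dtype=torch.long)
        for a, b in sentence_pairs
    ]
    # Pad only up to the longest sequence in this batch
    input_ids = pad_sequence(encoded, batch_first=True,
                             padding_value=tokenizer.pad_token_id)
    attention_mask = (input_ids != tokenizer.pad_token_id).long()
    return input_ids, attention_mask

# Example: this batch is padded to its own max length, not to 512
batch = [("A first sentence.", "Its pair."),
         ("A somewhat longer first sentence here.", "And its pair.")]
input_ids, attention_mask = collate_dynamic(batch)
```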
Should I try the TensorFlow implementations instead?
Thank you!