
Much slower for inference, even when traced? #1477

Closed

Description

@pertschuk

❓ Questions & Help

When running BERT-large inference on a T4 GPU with bert-as-service, I could get well over 100 sentence pairs/s on sentence pair classification. (I am aware that this relies on TF's graph freezing and pruning.)

When running RoBERTa-large inference on a T4 GPU with native PyTorch and fairseq, I was able to get 70-80 sentence pairs/s.

Even with TorchScript JIT tracing, I am still only able to get 17 pairs/s on a T4 with the transformers implementation of BERT-large, using a batch size of 8 (which fills most of the GPU memory).
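
For reference, the tracing setup looks roughly like this (a minimal sketch, not my exact code; the model name, max sequence length, batch size, and dummy inputs are placeholders):

```python
import torch
from transformers import BertForSequenceClassification

# Minimal sketch; model name, max length, and batch size are placeholders.
MODEL_NAME = "bert-large-uncased"
MAX_SEQ_LEN = 256
BATCH_SIZE = 8

# torchscript=True makes the model return plain tuples so it can be traced.
model = BertForSequenceClassification.from_pretrained(
    MODEL_NAME, torchscript=True).eval().cuda()

# Dummy inputs padded all the way to MAX_SEQ_LEN, which is what I do today.
dummy_ids = torch.ones(BATCH_SIZE, MAX_SEQ_LEN, dtype=torch.long, device="cuda")
dummy_mask = torch.ones(BATCH_SIZE, MAX_SEQ_LEN, dtype=torch.long, device="cuda")

traced = torch.jit.trace(model, (dummy_ids, dummy_mask))

with torch.no_grad():
    logits = traced(dummy_ids, dummy_mask)[0]
```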

Training performance is similarly worse: steps take roughly 40-100% longer, even though I am now using apex and was not before.

One of the main differences I can think of is that I am now padding every input up to the max sequence length, and lowering that length improves throughput a lot. Is there a way to avoid padding in transformers and instead pass a list of dynamically sized PyTorch tensors? (A sketch of the per-batch padding I have in mind is below.)
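
To make the question concrete, what I mean by dynamic sizing is padding each batch only to its longest member, roughly like this (a sketch; the sentence pairs are placeholders and token_type_ids are omitted for brevity):

```python
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")

# Placeholder sentence pairs; real batches come from my data loader.
pairs = [("a short question", "a short candidate answer"),
         ("a noticeably longer question about something",
          "a noticeably longer candidate answer about something")]

# Encode each pair, then pad only to the longest sequence in this batch
# rather than to a fixed global max-seq length.
encoded = [tokenizer.encode(a, b, add_special_tokens=True) for a, b in pairs]
batch_max = max(len(ids) for ids in encoded)

input_ids = torch.tensor(
    [ids + [tokenizer.pad_token_id] * (batch_max - len(ids)) for ids in encoded])
attention_mask = (input_ids != tokenizer.pad_token_id).long()
```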

Should I try the TensorFlow implementations?

Thank you!
