The google-cloud-bigquery-storage client's performance when parsing the Avro-encoded blocks of rows is disappointing, especially compared to the Go implementation.
- Download and parse the protos, but not the contained Avro bytes: 12.3s
- Download and parse the Avro into Python dictionaries (with fastavro): 42.1s
- Download and parse the Avro into a pandas DataFrame (needs to convert to dictionaries first): 73.4s
All three of the following benchmarks read data from the bigquery-public-data.usa_names.usa_1910_current table. They were run on an n1-standard-8 instance (though only a single stream is used) and share the following setup code:
# coding: utf-8
import concurrent.futures

from google.cloud import bigquery_storage_v1beta1

client = bigquery_storage_v1beta1.BigQueryStorageClient()
project_id = 'swast-scratch'

# Read from a public table.
table_ref = bigquery_storage_v1beta1.types.TableReference()
table_ref.project_id = 'bigquery-public-data'
table_ref.dataset_id = 'usa_names'
table_ref.table_id = 'usa_1910_current'

# Create a read session with a single stream and start reading
# from the beginning of that stream.
session = client.create_read_session(
    table_ref,
    'projects/{}'.format(project_id),
    requested_streams=1,
)
stream = session.streams[0]
position = bigquery_storage_v1beta1.types.StreamPosition(
    stream=stream,
)
rowstream = client.read_rows(position)
Where they differ is in what they do with the blocks.
Parse the proto, but not the Avro bytes:

print(sum([page.num_items for page in rowstream.rows(session).pages]))
swast@pandas-gbq-test:~/benchmark$ time python3 parse_proto_no_avro.py
5933561
real 0m12.278s
user 0m3.496s
sys 0m2.376s
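For comparison, a rough sketch of the same count taken straight off the ReadRowsResponse protos (assuming the v1beta1 response shape; an illustration, not part of the benchmark script):

total = 0
for response in client.read_rows(position):
    # AvroRows carries a row_count field next to the serialized bytes,
    # so counting rows never decodes the Avro payload itself.
    total += response.avro_rows.row_count
print(total)

This is why the proto-only run stays around 12s: the Avro bytes are downloaded but never decoded.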
Parse the Avro into rows:

print(len(list(rowstream.rows(session))))
swast@pandas-gbq-test:~/benchmark$ time python3 parse_avro.py
5933561
real 0m42.055s
user 0m37.784s
sys 0m3.504s
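Most of the extra ~30s presumably goes to fastavro decoding each block's serialized bytes into Python dictionaries. A minimal sketch of that per-block decode, assuming the v1beta1 message layout (rows_from_block is a hypothetical helper, not the library's internal API):

import io
import json

import fastavro

# The read session carries the writer schema as a JSON string.
avro_schema = fastavro.parse_schema(json.loads(session.avro_schema.schema))

def rows_from_block(block):
    # serialized_binary_rows holds concatenated Avro records with no
    # file header, so each record is read with schemaless_reader.
    raw = block.avro_rows.serialized_binary_rows
    buf = io.BytesIO(raw)
    while buf.tell() < len(raw):
        yield fastavro.schemaless_reader(buf, avro_schema)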
Parse the Avro bytes into a pandas DataFrame:

df = rowstream.rows(session).to_dataframe()
print(len(df.index))
swast@pandas-gbq-test:~/benchmark$ time python3 parse_avro_to_dataframe.py
5933561
real 1m13.449s
user 1m8.180s
sys 0m2.396s
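The DataFrame path pays for the dictionary decoding above and then again to pivot the list of dicts into columns. A hedged sketch of that two-step shape (not the actual to_dataframe implementation):

import pandas

# Each row comes back as a dict (the ~42s step); building the frame
# from a list of dicts accounts for the remaining ~30s.
df = pandas.DataFrame(list(rowstream.rows(session)))
print(len(df.index))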
CC @jadekler, since I'd like to track these metrics over time with the benchmarks project you're working on.