The google-cloud-bigquery-storage client's performance when parsing the Avro-encoded blocks of rows is disappointing, especially compared to the Go implementation.
- Download and parse the protos, but not the contained Avro bytes: 12.3s
- Download and parse the Avro into Python dictionaries (with fastavro): 42.1s
- Download and parse the Avro into a pandas DataFrame (needs to convert to dictionaries first): 73.4s
All three of the following benchmarks read data from the bigquery-public-data.usa_names.usa_1910_current table. They were run on an n1-standard-8 instance (though only a single stream is used) and share the following setup code:
# coding: utf-8
import concurrent.futures

from google.cloud import bigquery_storage_v1beta1

client = bigquery_storage_v1beta1.BigQueryStorageClient()
project_id = 'swast-scratch'

# Read from a public table.
table_ref = bigquery_storage_v1beta1.types.TableReference()
table_ref.project_id = 'bigquery-public-data'
table_ref.dataset_id = 'usa_names'
table_ref.table_id = 'usa_1910_current'

# Create a read session with a single stream and start reading
# from the beginning of that stream.
session = client.create_read_session(
    table_ref,
    'projects/{}'.format(project_id),
    requested_streams=1,
)
stream = session.streams[0]
position = bigquery_storage_v1beta1.types.StreamPosition(
    stream=stream,
)
rowstream = client.read_rows(position)
Where they differ is in what they do with the blocks.
Parse the proto, but not the Avro bytes:

print(sum([page.num_items for page in rowstream.rows(session).pages]))
swast@pandas-gbq-test:~/benchmark$ time python3 parse_proto_no_avro.py
5933561
real 0m12.278s
user 0m3.496s
sys 0m2.376s
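For comparison, a rough sketch of the same count taken straight off the ReadRowsResponse protos (assuming the v1beta1 response shape; an illustration, not part of the benchmark script):

total = 0
for response in client.read_rows(position):
    # AvroRows carries a row_count field next to the serialized bytes,
    # so counting rows never decodes the Avro payload itself.
    total += response.avro_rows.row_count
print(total)

This is why the proto-only run stays around 12s: the Avro bytes are downloaded but never decoded.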
Parse the Avro into rows:

print(len(list(rowstream.rows(session))))
swast@pandas-gbq-test:~/benchmark$ time python3 parse_avro.py
5933561
real 0m42.055s
user 0m37.784s
sys 0m3.504s
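Most of the extra ~30s presumably goes to fastavro decoding each block's serialized bytes into Python dictionaries. A minimal sketch of that per-block decode, assuming the v1beta1 message layout (rows_from_block is a hypothetical helper, not the library's internal API):

import io
import json

import fastavro

# The read session carries the writer schema as a JSON string.
avro_schema = fastavro.parse_schema(json.loads(session.avro_schema.schema))

def rows_from_block(block):
    # serialized_binary_rows holds concatenated Avro records with no
    # file header, so each record is read with schemaless_reader.
    raw = block.avro_rows.serialized_binary_rows
    buf = io.BytesIO(raw)
    while buf.tell() < len(raw):
        yield fastavro.schemaless_reader(buf, avro_schema)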
Parse the Avro bytes into a pandas DataFrame:

df = rowstream.rows(session).to_dataframe()
print(len(df.index))
swast@pandas-gbq-test:~/benchmark$ time python3 parse_avro_to_dataframe.py
5933561
real 1m13.449s
user 1m8.180s
sys 0m2.396s
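The DataFrame path pays for the dictionary decoding above and then again to pivot the list of dicts into columns. A hedged sketch of that two-step shape (not the actual to_dataframe implementation):

import pandas

# Each row comes back as a dict (the ~42s step); building the frame
# from a list of dicts accounts for the remaining ~30s.
df = pandas.DataFrame(list(rowstream.rows(session)))
print(len(df.index))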
CC @jadekler, since I'd like to track these metrics over time with the benchmarks project you're working on.