BigQuery Storage: Disappointing performance when parsing Avro blocks #7805

Closed

@tswast

Description

Parsing the Avro-encoded blocks of rows in google-cloud-bigquery-storage is disappointingly slow, especially compared to the Go implementation.

  • Download and parse the protos, but not the contained Avro bytes: 12.3s
  • Download and parse the Avro into Python dictionaries (with fastavro): 42.1s
  • Download and parse the Avro into a pandas DataFrame (which must convert the rows to dictionaries first): 73.4s

All three of the following benchmarks read data from the bigquery-public-data.usa_names.usa_1910_current table. They were run on an n1-standard-8 instance (though only a single stream is used).

# coding: utf-8
import concurrent.futures  # unused in these single-stream benchmarks
from google.cloud import bigquery_storage_v1beta1

client = bigquery_storage_v1beta1.BigQueryStorageClient()
project_id = 'swast-scratch'

# Table to read: bigquery-public-data.usa_names.usa_1910_current
table_ref = bigquery_storage_v1beta1.types.TableReference()
table_ref.project_id = 'bigquery-public-data'
table_ref.dataset_id = 'usa_names'
table_ref.table_id = 'usa_1910_current'

session = client.create_read_session(
    table_ref,
    'projects/{}'.format(project_id),
    requested_streams=1,
)

# Read the session's single stream from the beginning.
stream = session.streams[0]
position = bigquery_storage_v1beta1.types.StreamPosition(
    stream=stream,
)
rowstream = client.read_rows(position)

All three scripts share the setup above; they differ only in what they do with the downloaded blocks.

Parse the proto, but not Avro bytes: print(sum([page.num_items for page in rowstream.rows(session).pages]))

swast@pandas-gbq-test:~/benchmark$ time python3 parse_proto_no_avro.py 
5933561

real    0m12.278s
user    0m3.496s
sys     0m2.376s
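The proto-only benchmark is fast because num_items comes from the response's protobuf fields, so the serialized Avro payload is never decoded. A toy, self-contained illustration of that pattern (the Page class here is a hypothetical stand-in, not the library's type):

```python
from dataclasses import dataclass

@dataclass
class Page:
    # Stand-in for one ReadRowsResponse page: a row count from the
    # protobuf, plus the raw, still-encoded Avro payload.
    num_items: int
    payload: bytes

pages = [
    Page(num_items=3, payload=b"\x00" * 10),
    Page(num_items=5, payload=b"\x00" * 10),
]

# Summing row counts only touches the header field; the payload
# bytes stay encoded and unread.
total = sum(page.num_items for page in pages)
print(total)  # 8
```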

Parse the Avro into rows with print(len(list(rowstream.rows(session)))):

swast@pandas-gbq-test:~/benchmark$ time python3 parse_avro.py 
5933561

real    0m42.055s
user    0m37.784s
sys     0m3.504s

Parse the Avro bytes into a pandas DataFrame.

df = rowstream.rows(session).to_dataframe()
print(len(df.index))
swast@pandas-gbq-test:~/benchmark$ time python3 parse_avro_to_dataframe.py 
5933561

real    1m13.449s
user    1m8.180s
sys     0m2.396s
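Most of the gap between the 12s and 42s runs is per-row Python work: decoding each Avro record materializes one small dict, and doing that ~5.9 million times in the interpreter is costly regardless of the codec. A rough, stdlib-only illustration of that per-row overhead (not fastavro itself; timings vary by machine, and the column names mirror the usa_1910_current schema):

```python
import time

N = 1_000_000  # stand-in for millions of rows

start = time.perf_counter()
# Build one small dict per "row", as a schemaless Avro decoder must.
rows = [
    {"state": "TX", "gender": "F", "year": 1910 + (i % 100),
     "name": "Mary", "number": i}
    for i in range(N)
]
elapsed = time.perf_counter() - start

print("built {:,} dicts in {:.2f}s".format(len(rows), elapsed))
```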

CC @jadekler, since I'd like to track these metrics over time with the benchmarks project you're working on.

Labels: api: bigquerystorage (issues related to the BigQuery Storage API), type: feature request ("nice-to-have" improvement, new feature, or different behavior or design).
