Parquet Table Creation Issues with AWS Datawrangler (AWS Layer) #1420
Hi, I am writing an AWS Lambda function to create an AWS Glue database and table and write parquet files to S3 buckets based on an S3 event notification.
Lambda Layer: AWS Data Wrangler (AWS Layer) - AWS Datawrangler-Python39-Arm64
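For context, the handler reads the bucket name and object key out of the incoming S3 notification. A trimmed sketch of that event shape (the sample names below are placeholders, not real records from my account):

```python
# Trimmed example of the S3 put-event notification the handler receives.
# Bucket and key values are placeholders.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "serverlessmetadata"},
                "object": {"key": "uploads/example.pdf"},
            }
        }
    ]
}

for s3_record in sample_event["Records"]:
    bucket = s3_record["s3"]["bucket"]["name"]
    filekey = s3_record["s3"]["object"]["key"]
    print(bucket, filekey)  # -> serverlessmetadata uploads/example.pdf
```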
I tried the following approaches and ran into issues with both.
Approach 1: calling to_parquet with the database and table names

```python
wr.s3.to_parquet(
    df=df,
    path=path,
    dataset=True,
    database='metadata-database',
    table='serverlessmetadata_index',
)
```

Attaching the CloudWatch logs: to_parquet-datawrangler logs.csv. Neither the Glue database/table is created nor is the parquet file written to the target S3 bucket. Permissions are enabled and have been double checked.
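In case it is useful, these are the kinds of checks that can confirm whether anything was created at all (a sketch; the target path below is my assumption, matching the `<bucket>-index` path built in the handler further down):

```python
import awswrangler as wr

# Verify whether the Glue database, the table, and any parquet objects exist.
print(wr.catalog.databases())  # DataFrame of existing Glue databases
print(wr.catalog.does_table_exist(
    database="metadata-database",
    table="serverlessmetadata_index",
))  # True/False for the Glue table
print(wr.s3.list_objects("s3://serverlessmetadata-index/"))  # objects under the target path
```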
Approach 2: creating the database and table first, then writing the parquet
```python
import boto3
import pandas as pd
import awswrangler as wr

s3 = boto3.resource("s3")
# columns (schema), desc (table description), and comments (column comments) are defined elsewhere in the module

# Function to create the Glue database
def create_database():
    databases = wr.catalog.databases()
    if "metadata-database" not in databases.values:
        wr.catalog.create_database("metadata-database")
    else:
        print("Database metadata-database already exists")

# Function to create the Glue parquet table
def create_table():
    tableExists = wr.catalog.does_table_exist(database="metadata-database", table="serverlessmetadata_index")
    if not tableExists:
        wr.catalog.create_parquet_table(
            database="metadata-database",
            table="serverlessmetadata_index",
            path="s3://serverlessmetadata/",
            columns_types=columns,
            compression="snappy",
            description=desc,
            mode="append",
            columns_comments=comments,
        )
    else:
        print("Table serverlessmetadata_index already exists")

def lambda_handler(event, context):
    create_database()
    create_table()
    for s3_record in event["Records"]:
        # Extract the key and bucket names for the asset uploaded to S3
        filekey = s3_record["s3"]["object"]["key"]
        bucket = s3_record["s3"]["bucket"]["name"]
        # Retrieve the S3 object's metadata attributes
        obj = s3.Object(bucket, filekey)
        object_metadata = obj.metadata
        # Add filekey as a metadata attribute
        object_metadata["filekey"] = filekey
        # Build the target path for the metadata parquet file
        metadata_path = bucket + "-index"
        key_path = "index.parquet"
        path = f"s3://{metadata_path}/"
        df = pd.DataFrame(object_metadata, index=[0])
        # Dataset write to the target bucket (no database/table arguments passed here)
        wr.s3.to_parquet(df=df, path=path, dataset=True)
```
With this, the very first execution creates the database and table but does not write the parquet file to the S3 bucket. If the create_database and create_table calls are commented out after that first execution, the parquet files are written to the S3 bucket.
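To narrow it down further, this is the minimal sequence I would expect to reproduce the behaviour outside Lambda (a sketch only; the two-column schema and the sample row are placeholders, not my real object metadata):

```python
import pandas as pd
import awswrangler as wr

# Minimal reproduction of the handler's sequence, runnable outside Lambda.
# Schema and sample row are placeholders, not the real S3 object metadata.
if "metadata-database" not in wr.catalog.databases().values:
    wr.catalog.create_database("metadata-database")

if not wr.catalog.does_table_exist(database="metadata-database", table="serverlessmetadata_index"):
    wr.catalog.create_parquet_table(
        database="metadata-database",
        table="serverlessmetadata_index",
        path="s3://serverlessmetadata/",
        columns_types={"filekey": "string", "uploaded_by": "string"},
        compression="snappy",
    )

df = pd.DataFrame([{"filekey": "example.pdf", "uploaded_by": "tester"}])
# Same write as in the handler: a dataset write to the target bucket, no database/table passed.
result = wr.s3.to_parquet(df=df, path="s3://serverlessmetadata-index/", dataset=True)
print(result)  # dict with the list of written 'paths'
```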
I couldn't gather much from the logs. I would appreciate any guidance on this issue.