Parquet Table Creation Issues with AWS Datawrangler (AWS Layer) #1420
Hi, I am writing an AWS Lambda function to create an AWS Glue database and table and write parquet files to S3 buckets based on an S3 event notification.
Lambda Layer: AWS Data Wrangler (AWS Layer) - AWS Datawrangler-Python39-Arm64
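For context, the handler reads the bucket name and object key out of the incoming S3 notification. A trimmed sketch of that event shape (the sample names below are placeholders, not real records from my account):

```python
# Trimmed example of the S3 put-event notification the handler receives.
# Bucket and key values are placeholders.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "serverlessmetadata"},
                "object": {"key": "uploads/example.pdf"},
            }
        }
    ]
}

for s3_record in sample_event["Records"]:
    bucket = s3_record["s3"]["bucket"]["name"]
    filekey = s3_record["s3"]["object"]["key"]
    print(bucket, filekey)  # -> serverlessmetadata uploads/example.pdf
```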
I tried the following approaches and ran into issues with both.
Approach 1: calling to_parquet with the database and table names

```python
wr.s3.to_parquet(
    df=df,
    path=path,
    dataset=True,
    database='metadata-database',
    table='serverlessmetadata_index',
)
```

Attaching the CloudWatch logs: to_parquet-datawrangler logs.csv. Neither the Glue database/table is created nor is the parquet file written to the target S3 bucket. Permissions are enabled and have been double checked.
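In case it is useful, these are the kinds of checks that can confirm whether anything was created at all (a sketch; the target path below is my assumption, matching the `<bucket>-index` path built in the handler further down):

```python
import awswrangler as wr

# Verify whether the Glue database, the table, and any parquet objects exist.
print(wr.catalog.databases())  # DataFrame of existing Glue databases
print(wr.catalog.does_table_exist(
    database="metadata-database",
    table="serverlessmetadata_index",
))  # True/False for the Glue table
print(wr.s3.list_objects("s3://serverlessmetadata-index/"))  # objects under the target path
```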
Approach 2: creating the database and table first, then writing the parquet
```python
import boto3
import pandas as pd
import awswrangler as wr

s3 = boto3.resource("s3")
# columns (schema), desc (table description), and comments (column comments) are defined elsewhere in the module

# Function to create the Glue database
def create_database():
    databases = wr.catalog.databases()
    if "metadata-database" not in databases.values:
        wr.catalog.create_database("metadata-database")
    else:
        print("Database metadata-database already exists")

# Function to create the Glue parquet table
def create_table():
    tableExists = wr.catalog.does_table_exist(database="metadata-database", table="serverlessmetadata_index")
    if not tableExists:
        wr.catalog.create_parquet_table(
            database="metadata-database",
            table="serverlessmetadata_index",
            path="s3://serverlessmetadata/",
            columns_types=columns,
            compression="snappy",
            description=desc,
            mode="append",
            columns_comments=comments,
        )
    else:
        print("Table serverlessmetadata_index already exists")

def lambda_handler(event, context):
    create_database()
    create_table()
    for s3_record in event["Records"]:
        # Extract the key and bucket names for the asset uploaded to S3
        filekey = s3_record["s3"]["object"]["key"]
        bucket = s3_record["s3"]["bucket"]["name"]
        # Retrieve the S3 object's metadata attributes
        obj = s3.Object(bucket, filekey)
        object_metadata = obj.metadata
        # Add filekey as a metadata attribute
        object_metadata["filekey"] = filekey
        # Build the target path for the metadata parquet file
        metadata_path = bucket + "-index"
        key_path = "index.parquet"
        path = f"s3://{metadata_path}/"
        df = pd.DataFrame(object_metadata, index=[0])
        # Dataset write to the target bucket (no database/table arguments passed here)
        wr.s3.to_parquet(df=df, path=path, dataset=True)
```
With this, the very first execution creates the database and table but does not write the parquet file to the S3 bucket. If the create_database and create_table calls are commented out after that first execution, the parquet files are written to the S3 bucket.
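To narrow it down further, this is the minimal sequence I would expect to reproduce the behaviour outside Lambda (a sketch only; the two-column schema and the sample row are placeholders, not my real object metadata):

```python
import pandas as pd
import awswrangler as wr

# Minimal reproduction of the handler's sequence, runnable outside Lambda.
# Schema and sample row are placeholders, not the real S3 object metadata.
if "metadata-database" not in wr.catalog.databases().values:
    wr.catalog.create_database("metadata-database")

if not wr.catalog.does_table_exist(database="metadata-database", table="serverlessmetadata_index"):
    wr.catalog.create_parquet_table(
        database="metadata-database",
        table="serverlessmetadata_index",
        path="s3://serverlessmetadata/",
        columns_types={"filekey": "string", "uploaded_by": "string"},
        compression="snappy",
    )

df = pd.DataFrame([{"filekey": "example.pdf", "uploaded_by": "tester"}])
# Same write as in the handler: a dataset write to the target bucket, no database/table passed.
result = wr.s3.to_parquet(df=df, path="s3://serverlessmetadata-index/", dataset=True)
print(result)  # dict with the list of written 'paths'
```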
I couldn't gather much from the logs. I would appreciate any guidance on this issue.