GraphRAG supports using Amazon S3 as a storage backend for various components of the system, including input data, output artifacts, cache, reporting, and prompts. This document explains how to configure and use S3 storage in your GraphRAG projects.
S3 storage can be used for the following GraphRAG components:
- Input: Load input data from S3 buckets
- Output: Store output artifacts in S3 buckets
- Cache: Cache LLM invocation results in S3 buckets
- Reporting: Store reports in S3 buckets
- Prompts: Load prompt files from S3 buckets
You can configure S3 storage in your `settings.yml` file. Each component (input, output, cache, reporting) can be configured independently to use S3 storage.
All S3 storage configurations share these common parameters:
| Parameter | Description | Type | Required |
|---|---|---|---|
| `type` | Set to `s3` to use S3 storage | `str` | Yes |
| `bucket_name` | The name of the S3 bucket | `str` | Yes |
| `prefix` | The prefix to use for all keys in the bucket | `str` | No (default: `""`) |
| `encoding` | The encoding to use for text files | `str` | No (default: `"utf-8"`) |
| `aws_access_key_id` | The AWS access key ID | `str` | No* |
| `aws_secret_access_key` | The AWS secret access key | `str` | No* |
| `region_name` | The AWS region name | `str` | No |
\*Note: If `aws_access_key_id` and `aws_secret_access_key` are not provided, boto3's default credential chain is used, meaning AWS credentials are searched for in the following order (the snippet after this list shows how to check which source is in effect):
1. Environment variables (`AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`)
2. Shared credential file (`~/.aws/credentials`)
3. AWS config file (`~/.aws/config`)
4. IAM role for Amazon EC2 or ECS task role
5. Boto session (if running in AWS Lambda)
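If you're unsure which source boto3 will pick, you can check directly with boto3's public `Session` API. This is a small diagnostic sketch, not part of GraphRAG itself:

```python
import boto3

# Resolve credentials through boto3's default chain, as GraphRAG will.
session = boto3.Session()
credentials = session.get_credentials()

if credentials is None:
    print("No AWS credentials found in the credential chain.")
else:
    # 'method' names the source, e.g. "env", "shared-credentials-file", "iam-role".
    print(f"Credentials resolved from: {credentials.method}")
```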
To configure GraphRAG to read input data from an S3 bucket:
```yaml
input:
  type: s3
  bucket_name: my-input-bucket
  prefix: data/input
  file_type: csv # or text, json
  file_pattern: ".*\\.csv$"
  text_column: text
  title_column: title
  metadata:
    - author
    - date
  aws_access_key_id: ${AWS_ACCESS_KEY_ID} # Using environment variable
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: us-west-2
```
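Before running an indexing job, it can help to confirm the bucket and prefix actually contain files that your `file_pattern` will match. A minimal sketch using boto3 and the values from the example above:

```python
import re
import boto3

s3 = boto3.client("s3", region_name="us-west-2")
pattern = re.compile(r".*\.csv$")  # same regex as file_pattern above

# List objects under the configured prefix (first 1,000 keys) and apply
# the regex (matched here against the full key, for illustration).
response = s3.list_objects_v2(Bucket="my-input-bucket", Prefix="data/input/")
matches = [o["Key"] for o in response.get("Contents", []) if pattern.match(o["Key"])]
print(f"{len(matches)} matching input file(s):", *matches, sep="\n")
```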
To store output artifacts in an S3 bucket:
```yaml
output:
  type: s3
  bucket_name: my-output-bucket
  prefix: data/output
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: us-west-2
```
To use S3 for caching LLM invocation results:
```yaml
cache:
  type: s3
  bucket_name: my-cache-bucket
  prefix: graphrag/cache
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: us-west-2
```
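If you ever need to invalidate the cache (for example after changing prompts or models), one approach is to delete everything under the cache prefix. A minimal sketch using boto3's resource API, with the bucket and prefix from the example above:

```python
import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket("my-cache-bucket")

# Batch-delete every object under the cache prefix so the next run
# re-invokes the LLM instead of reading cached results.
bucket.objects.filter(Prefix="graphrag/cache/").delete()
```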
To store reports in an S3 bucket:
```yaml
reporting:
  type: s3
  bucket_name: my-reporting-bucket
  prefix: graphrag/logs
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: us-west-2
```
It's recommended to use environment variables for AWS credentials rather than hardcoding them in your configuration files. You can use the `${ENV_VAR}` syntax in your YAML configuration to reference environment variables:
```yaml
# settings.yml
output:
  type: s3
  bucket_name: ${S3_BUCKET_NAME}
  prefix: ${S3_PREFIX}
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: ${AWS_REGION}
```
Then, in your `.env` file:
```
AWS_ACCESS_KEY_ID=your_access_key_id
AWS_SECRET_ACCESS_KEY=your_secret_access_key
AWS_REGION=us-west-2
S3_BUCKET_NAME=my-graphrag-bucket
S3_PREFIX=data/output
```
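Before starting a run, you can fail fast if any of these variables are missing rather than hitting an S3 error mid-pipeline. A minimal sketch; the variable names match the `.env` file above:

```python
import os

required = [
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_REGION",
    "S3_BUCKET_NAME",
    "S3_PREFIX",
]

# Abort with a clear message instead of a confusing S3 failure later.
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
```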
If you're running GraphRAG on an AWS service that supports IAM roles (such as EC2, ECS, or Lambda), you can omit the `aws_access_key_id` and `aws_secret_access_key` parameters. GraphRAG will use the credentials provided by the IAM role attached to the service.
```yaml
output:
  type: s3
  bucket_name: my-output-bucket
  prefix: data/output
  region_name: us-west-2
  # No AWS credentials - will use IAM role
```
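To confirm which identity your process is actually running as (useful when multiple roles or profiles are in play), you can ask AWS STS directly. A minimal check, independent of GraphRAG:

```python
import boto3

# Returns the ARN of the IAM user or assumed role behind the active credentials.
identity = boto3.client("sts").get_caller_identity()
print(identity["Arn"])
```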
Here's a complete example of a GraphRAG configuration using S3 for all storage components:
```yaml
models:
  default_chat_model:
    api_key: ${OPENAI_API_KEY}
    type: openai_chat
    model: gpt-4o
    model_supports_json: true
  default_embedding_model:
    api_key: ${OPENAI_API_KEY}
    type: openai_embedding
    model: text-embedding-3-small

input:
  type: s3
  file_type: csv
  bucket_name: my-graphrag-bucket
  prefix: data/input
  file_pattern: ".*\\.csv$"
  text_column: content
  title_column: title
  metadata:
    - author
    - date
    - category
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: us-west-2

output:
  type: s3
  bucket_name: my-graphrag-bucket
  prefix: data/output
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: us-west-2

cache:
  type: s3
  bucket_name: my-graphrag-bucket
  prefix: data/cache
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: us-west-2

reporting:
  type: s3
  bucket_name: my-graphrag-bucket
  prefix: data/logs
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: us-west-2

# Other GraphRAG configuration...
```
GraphRAG now supports loading prompt files directly from S3 buckets. This allows you to store your custom prompts in S3 and reference them in your configuration.
To use prompts stored in S3, you need to provide the full S3 URI in your configuration. The format for S3 URIs is:
```
s3://bucket_name/path/to/prompt.txt
```
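Conceptually, GraphRAG splits such a URI into a bucket name and an object key. The helper below is purely illustrative (it is not GraphRAG's internal implementation) and uses only the standard library:

```python
from urllib.parse import urlparse

def parse_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3:// URI into (bucket_name, key). Illustrative helper only."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Invalid S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")

print(parse_s3_uri("s3://my-graphrag-bucket/prompts/extract_graph.txt"))
# ('my-graphrag-bucket', 'prompts/extract_graph.txt')
```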
For example, in your `settings.yml` file:
```yaml
extract_graph:
  model_id: default_chat_model
  prompt: s3://my-graphrag-bucket/prompts/extract_graph.txt
  entity_types:
    - person
    - organization
    - location
  max_gleanings: 3

extract_claims:
  enabled: true
  model_id: default_chat_model
  prompt: s3://my-graphrag-bucket/prompts/extract_claims.txt
  description: "Extract factual claims from the text."
  max_gleanings: 3

summarize_descriptions:
  model_id: default_chat_model
  prompt: s3://my-graphrag-bucket/prompts/summarize_descriptions.txt
  max_length: 100

community_reports:
  model_id: default_chat_model
  graph_prompt: s3://my-graphrag-bucket/prompts/community_reports_graph.txt
  text_prompt: s3://my-graphrag-bucket/prompts/community_reports_text.txt
  max_length: 500
  max_input_length: 4000
```
When accessing prompts from S3, GraphRAG uses the same authentication methods as other S3 operations: credentials are resolved through boto3's credential chain, in the same order described earlier in this document.
To access prompts from S3, your AWS credentials must have at least the `s3:GetObject` permission for the specified bucket and objects.
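One way to confirm a prompt object is readable with the current credentials is a `HeadObject` call, which succeeds only if the object exists and `s3:GetObject` is allowed. A minimal sketch, assuming the bucket and prompt key from the earlier example:

```python
import boto3

s3 = boto3.client("s3")
# Raises a ClientError (403/404) if the prompt is missing or unreadable.
s3.head_object(Bucket="my-graphrag-bucket", Key="prompts/extract_graph.txt")
print("Prompt is accessible.")
```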
- **Access Denied**: Ensure that the AWS credentials have the necessary permissions to access the S3 bucket. The IAM user or role should have at least `s3:GetObject`, `s3:PutObject`, `s3:ListBucket`, and `s3:DeleteObject` permissions for the specified bucket (a quick round-trip check is sketched after this list).
- **No Such Bucket**: Verify that the bucket exists in the specified region.
- **Credential Chain Errors**: If you're not providing explicit credentials, ensure that your environment has valid AWS credentials configured through one of the methods in boto3's credential chain.
- **Region Issues**: If you encounter region-related errors, explicitly specify the `region_name` parameter in your configuration.
- **Invalid S3 URI Format**: When using S3 for prompts, ensure that the URI follows the format `s3://bucket_name/path/to/file`. If the bucket name cannot be extracted from the URI, you'll receive an error.
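The sketch below exercises each of the four permissions listed above with a temporary object. The bucket name, region, and key are placeholders; replace them with your own values:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3", region_name="us-west-2")
bucket, key = "my-graphrag-bucket", "graphrag/_permission_check"

try:
    s3.head_bucket(Bucket=bucket)                      # needs s3:ListBucket
    s3.put_object(Bucket=bucket, Key=key, Body=b"ok")  # needs s3:PutObject
    s3.get_object(Bucket=bucket, Key=key)              # needs s3:GetObject
    s3.delete_object(Bucket=bucket, Key=key)           # needs s3:DeleteObject
    print("All required S3 permissions are available.")
except ClientError as err:
    print(f"S3 permission check failed: {err.response['Error']['Code']}")
```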
GraphRAG logs S3 operations at the INFO level. You can enable more verbose logging by configuring the Python logging system to show DEBUG level logs for the boto3 and botocore libraries:
```python
import logging

# Show GraphRAG's INFO-level S3 messages on the root logger...
logging.basicConfig(level=logging.INFO)

# ...and turn on verbose request/response logging from the AWS SDK.
logging.getLogger('boto3').setLevel(logging.DEBUG)
logging.getLogger('botocore').setLevel(logging.DEBUG)
```