
# Using Amazon S3 Storage with GraphRAG

GraphRAG supports using Amazon S3 as a storage backend for various components of the system, including input data, output artifacts, cache, reporting, and prompts. This document explains how to configure and use S3 storage in your GraphRAG projects.

## Overview

S3 storage can be used for the following GraphRAG components:

- **Input**: Load input data from S3 buckets
- **Output**: Store output artifacts in S3 buckets
- **Cache**: Cache LLM invocation results in S3 buckets
- **Reporting**: Store reports in S3 buckets
- **Prompts**: Load prompt files from S3 buckets

## Configuration

You can configure S3 storage in your `settings.yml` file. Each component (input, output, cache, reporting) can be configured independently to use S3 storage.

### Common S3 Configuration Parameters

All S3 storage configurations share these common parameters:

| Parameter | Description | Type | Required |
|-----------|-------------|------|----------|
| `type` | Set to `s3` to use S3 storage | `str` | Yes |
| `bucket_name` | The name of the S3 bucket | `str` | Yes |
| `prefix` | The prefix to use for all keys in the bucket | `str` | No (default: `""`) |
| `encoding` | The encoding to use for text files | `str` | No (default: `utf-8`) |
| `aws_access_key_id` | The AWS access key ID | `str` | No* |
| `aws_secret_access_key` | The AWS secret access key | `str` | No* |
| `region_name` | The AWS region name | `str` | No |

*Note: If `aws_access_key_id` and `aws_secret_access_key` are not provided, boto3's default credential chain is used, which searches for AWS credentials in the following order:

1. Environment variables (`AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`)
2. Shared credentials file (`~/.aws/credentials`)
3. AWS config file (`~/.aws/config`)
4. IAM role for Amazon EC2 or ECS task role
5. Boto session (if running in AWS Lambda)
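
If you're unsure which credentials boto3 will resolve, a quick standalone check like the sketch below can help. This is a generic boto3/STS snippet, not part of GraphRAG itself:

```python
import boto3

# Resolve credentials through boto3's default chain.
session = boto3.Session()
if session.get_credentials() is None:
    print("No AWS credentials found in the credential chain.")
else:
    # Ask STS who we are authenticated as.
    identity = session.client("sts").get_caller_identity()
    print(f"Authenticated as: {identity['Arn']}")
```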

## Example Configurations

### Input Configuration

To configure GraphRAG to read input data from an S3 bucket:

```yaml
input:
  type: s3
  bucket_name: my-input-bucket
  prefix: data/input
  file_type: csv  # or text, json
  file_pattern: ".*\\.csv$"
  text_column: text
  title_column: title
  metadata:
    - author
    - date
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}  # Using environment variable
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: us-west-2
```
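
Conceptually, the S3 input reader lists the objects under the configured prefix and keeps the keys matching `file_pattern`. The sketch below illustrates this selection with plain boto3; it is not GraphRAG's internal code, and the bucket and prefix values simply mirror the example above:

```python
import re

import boto3

# Values mirroring the example configuration above.
bucket_name = "my-input-bucket"
prefix = "data/input"
file_pattern = re.compile(r".*\.csv$")

s3 = boto3.client("s3", region_name="us-west-2")
paginator = s3.get_paginator("list_objects_v2")

# List every object under the prefix and keep keys matching the pattern.
for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
    for obj in page.get("Contents", []):
        if file_pattern.match(obj["Key"]):
            print(obj["Key"])
```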

### Output Configuration

To store output artifacts in an S3 bucket:

```yaml
output:
  type: s3
  bucket_name: my-output-bucket
  prefix: data/output
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: us-west-2
```

### Cache Configuration

To use S3 for caching LLM invocation results:

```yaml
cache:
  type: s3
  bucket_name: my-cache-bucket
  prefix: graphrag/cache
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: us-west-2
```

### Reporting Configuration

To store reports in an S3 bucket:

```yaml
reporting:
  type: s3
  bucket_name: my-reporting-bucket
  prefix: graphrag/logs
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: us-west-2
```

## Using Environment Variables

It's recommended to use environment variables for AWS credentials rather than hardcoding them in your configuration files. Use the `${ENV_VAR}` syntax in your YAML configuration to reference environment variables:

```yaml
# settings.yml
output:
  type: s3
  bucket_name: ${S3_BUCKET_NAME}
  prefix: ${S3_PREFIX}
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: ${AWS_REGION}
```

Then, in your `.env` file:

```
AWS_ACCESS_KEY_ID=your_access_key_id
AWS_SECRET_ACCESS_KEY=your_secret_access_key
AWS_REGION=us-west-2
S3_BUCKET_NAME=my-graphrag-bucket
S3_PREFIX=data/output
```
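
The substitution happens when GraphRAG loads the configuration; conceptually it behaves like the sketch below (illustrative only, not GraphRAG's actual loader):

```python
import os
import re

def substitute_env_vars(text: str) -> str:
    """Replace ${VAR} placeholders with values from the environment."""
    return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), text)

# Example: expand a placeholder in a raw settings.yml line.
os.environ["S3_BUCKET_NAME"] = "my-graphrag-bucket"
print(substitute_env_vars("bucket_name: ${S3_BUCKET_NAME}"))
# -> bucket_name: my-graphrag-bucket
```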

## Using AWS IAM Roles

If you're running GraphRAG on an AWS service that supports IAM roles (such as EC2, ECS, or Lambda), you can omit the `aws_access_key_id` and `aws_secret_access_key` parameters. GraphRAG will use the credentials provided by the IAM role attached to the service.

```yaml
output:
  type: s3
  bucket_name: my-output-bucket
  prefix: data/output
  region_name: us-west-2
  # No AWS credentials - will use IAM role
```

## Complete Example

Here's a complete example of a GraphRAG configuration using S3 for all storage components:

```yaml
models:
  default_chat_model:
    api_key: ${OPENAI_API_KEY}
    type: openai_chat
    model: gpt-4o
    model_supports_json: true
  default_embedding_model:
    api_key: ${OPENAI_API_KEY}
    type: openai_embedding
    model: text-embedding-3-small

input:
  type: s3
  file_type: csv
  bucket_name: my-graphrag-bucket
  prefix: data/input
  file_pattern: ".*\\.csv$"
  text_column: content
  title_column: title
  metadata:
    - author
    - date
    - category
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: us-west-2

output:
  type: s3
  bucket_name: my-graphrag-bucket
  prefix: data/output
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: us-west-2

cache:
  type: s3
  bucket_name: my-graphrag-bucket
  prefix: data/cache
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: us-west-2

reporting:
  type: s3
  bucket_name: my-graphrag-bucket
  prefix: data/logs
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: us-west-2

# Other GraphRAG configuration...
```

## Using S3 for Prompt Storage

GraphRAG supports loading prompt files directly from S3 buckets. This allows you to store your custom prompts in S3 and reference them in your configuration.

### Configuring S3 Prompts

To use prompts stored in S3, provide the full S3 URI in your configuration. S3 URIs have the form:

```
s3://bucket_name/path/to/prompt.txt
```

For example, in your `settings.yml` file:

```yaml
extract_graph:
  model_id: default_chat_model
  prompt: s3://my-graphrag-bucket/prompts/extract_graph.txt
  entity_types:
    - person
    - organization
    - location
  max_gleanings: 3

extract_claims:
  enabled: true
  model_id: default_chat_model
  prompt: s3://my-graphrag-bucket/prompts/extract_claims.txt
  description: "Extract factual claims from the text."
  max_gleanings: 3

summarize_descriptions:
  model_id: default_chat_model
  prompt: s3://my-graphrag-bucket/prompts/summarize_descriptions.txt
  max_length: 100

community_reports:
  model_id: default_chat_model
  graph_prompt: s3://my-graphrag-bucket/prompts/community_reports_graph.txt
  text_prompt: s3://my-graphrag-bucket/prompts/community_reports_text.txt
  max_length: 500
  max_input_length: 4000
```
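
For illustration, loading a prompt from such a URI amounts to splitting it into a bucket name and an object key, then fetching the object. The sketch below is a generic boto3 example, not GraphRAG's internal implementation:

```python
from urllib.parse import urlparse

import boto3

def load_prompt_from_s3(uri: str, encoding: str = "utf-8") -> str:
    """Fetch the text of a prompt file referenced by an s3:// URI."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Invalid S3 URI: {uri}")
    bucket, key = parsed.netloc, parsed.path.lstrip("/")
    response = boto3.client("s3").get_object(Bucket=bucket, Key=key)
    return response["Body"].read().decode(encoding)

prompt = load_prompt_from_s3("s3://my-graphrag-bucket/prompts/extract_graph.txt")
```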

### Authentication for S3 Prompts

When accessing prompts from S3, GraphRAG uses the same authentication methods as other S3 operations: credentials are resolved through boto3's credential chain, in the order described under Common S3 Configuration Parameters above.

### Required Permissions

To access prompts from S3, your AWS credentials must have at least the `s3:GetObject` permission for the specified bucket and objects.
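
A minimal IAM policy granting read access to the prompts used in the examples above might look like the following; adjust the resource ARN to your own bucket and key layout:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-graphrag-bucket/prompts/*"
    }
  ]
}
```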

## Troubleshooting

### Common Issues

1. **Access Denied**: Ensure that your AWS credentials have the necessary permissions to access the S3 bucket. The IAM user or role should have at least `s3:GetObject`, `s3:PutObject`, `s3:ListBucket`, and `s3:DeleteObject` permissions for the specified bucket. A quick standalone access check is sketched after this list.

2. **No Such Bucket**: Verify that the bucket exists in the specified region.

3. **Credential Chain Errors**: If you're not providing explicit credentials, ensure that your environment has valid AWS credentials configured through one of the methods in boto3's credential chain.

4. **Region Issues**: If you encounter region-related errors, explicitly specify the `region_name` parameter in your configuration.

5. **Invalid S3 URI Format**: When using S3 for prompts, ensure that the URI follows the format `s3://bucket_name/path/to/file`. If the bucket name cannot be extracted from the URI, you'll receive an error.
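
When debugging access problems, a quick standalone check with boto3 can confirm whether your credentials reach the bucket at all. This is a generic snippet, independent of GraphRAG:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3", region_name="us-west-2")

try:
    # head_bucket verifies that the bucket exists and that we may access it.
    s3.head_bucket(Bucket="my-graphrag-bucket")
    print("Bucket is reachable with the current credentials.")
except ClientError as e:
    # "403" indicates a permissions problem; "404" means the bucket doesn't exist.
    print(f"Cannot access bucket: {e.response['Error']['Code']}")
```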

### Logging

GraphRAG logs S3 operations at the INFO level. For more verbose output, configure the Python logging system to emit DEBUG-level logs for the `boto3` and `botocore` libraries:

```python
import logging

# Show INFO-level logs globally, but DEBUG-level logs for the AWS SDK.
logging.basicConfig(level=logging.INFO)
logging.getLogger("boto3").setLevel(logging.DEBUG)
logging.getLogger("botocore").setLevel(logging.DEBUG)
```