feat(aws): add s3 support to input, storage, output, cache, etc. #1830

Open · wants to merge 10 commits into main
4 changes: 4 additions & 0 deletions .semversioner/next-release/minor-20250320144125923710.json
@@ -0,0 +1,4 @@
{
    "type": "minor",
    "description": "Add s3 support and documentation"
}
4 changes: 4 additions & 0 deletions .semversioner/next-release/patch-20250320153221.json
@@ -0,0 +1,4 @@
{
    "description": "Fixed S3 configuration validation to use the correct enum types for each storage configuration.",
    "type": "patch"
}
1 change: 1 addition & 0 deletions docs/config/overview.md
@@ -8,4 +8,5 @@ The default configuration mode is the simplest way to get started with the Graph

- [Init command](init.md) (recommended)
- [Using YAML for deeper control](yaml.md)
- [Using Amazon S3 for storage](s3.md)
- [Purely using environment variables](env_vars.md) (not recommended)
283 changes: 283 additions & 0 deletions docs/config/s3.md
@@ -0,0 +1,283 @@
# Using Amazon S3 Storage with GraphRAG

GraphRAG supports using Amazon S3 as a storage backend for various components of the system, including input data, output artifacts, cache, reporting, and prompts. This document explains how to configure and use S3 storage in your GraphRAG projects.

## Overview

S3 storage can be used for the following GraphRAG components:

- **Input**: Load input data from S3 buckets
- **Output**: Store output artifacts in S3 buckets
- **Cache**: Cache LLM invocation results in S3 buckets
- **Reporting**: Store reports in S3 buckets
- **Prompts**: Load prompt files from S3 buckets

## Configuration

You can configure S3 storage in your `settings.yml` file. Each component (input, output, cache, reporting) can be configured independently to use S3 storage.

### Common S3 Configuration Parameters

All S3 storage configurations share these common parameters:

| Parameter | Description | Type | Required |
|-----------|-------------|------|----------|
| `type` | Set to `s3` to use S3 storage | `str` | Yes |
| `bucket_name` | The name of the S3 bucket | `str` | Yes |
| `prefix` | The prefix to use for all keys in the bucket | `str` | No (default: `""`) |
| `encoding` | The encoding to use for text files | `str` | No (default: `"utf-8"`) |
| `aws_access_key_id` | The AWS access key ID | `str` | No* |
| `aws_secret_access_key` | The AWS secret access key | `str` | No* |
| `region_name` | The AWS region name | `str` | No |

*Note: If `aws_access_key_id` and `aws_secret_access_key` are not provided, boto3's default credential chain is used, which searches for credentials in the following order (a quick way to check what boto3 resolves is sketched after the list):
1. Environment variables (`AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`)
2. Shared credential file (`~/.aws/credentials`)
3. AWS config file (`~/.aws/config`)
4. IAM role for Amazon EC2 or ECS task role
5. Boto session (if running in AWS Lambda)
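
If you are unsure which of these sources boto3 will end up using, you can check from a Python shell before running GraphRAG. The snippet below is a minimal sketch that calls boto3 directly; it is not part of GraphRAG, and it assumes boto3 is installed in your environment:

```python
import boto3

# Resolve credentials through boto3's default chain, the same chain GraphRAG
# falls back to when no keys are set in settings.yml.
session = boto3.Session()
credentials = session.get_credentials()

if credentials is None:
    print("No AWS credentials found in the credential chain.")
else:
    frozen = credentials.get_frozen_credentials()
    print(f"Access key in use: {frozen.access_key[:4]}*** (source: {credentials.method})")
    print(f"Default region: {session.region_name}")
```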

## Example Configurations

### Input Configuration

To configure GraphRAG to read input data from an S3 bucket:

```yaml
input:
  type: s3
  bucket_name: my-input-bucket
  prefix: data/input
  file_type: csv # or text, json
  file_pattern: ".*\\.csv$"
  text_column: text
  title_column: title
  metadata:
    - author
    - date
  aws_access_key_id: ${AWS_ACCESS_KEY_ID} # Using environment variable
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: us-west-2
```

### Output Configuration

To store output artifacts in an S3 bucket:

```yaml
output:
  type: s3
  bucket_name: my-output-bucket
  prefix: data/output
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: us-west-2
```

### Cache Configuration

To use S3 for caching LLM invocation results:

```yaml
cache:
  type: s3
  bucket_name: my-cache-bucket
  prefix: graphrag/cache
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: us-west-2
```

### Reporting Configuration

To store reports in an S3 bucket:

```yaml
reporting:
  type: s3
  bucket_name: my-reporting-bucket
  prefix: graphrag/logs
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: us-west-2
```

## Using Environment Variables

It's recommended to use environment variables for AWS credentials rather than hardcoding them in your configuration files. You can use the `${ENV_VAR}` syntax in your YAML configuration to reference environment variables:

```yaml
# settings.yml
output:
  type: s3
  bucket_name: ${S3_BUCKET_NAME}
  prefix: ${S3_PREFIX}
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: ${AWS_REGION}
```

Then, in your `.env` file:

```
AWS_ACCESS_KEY_ID=your_access_key_id
AWS_SECRET_ACCESS_KEY=your_secret_access_key
AWS_REGION=us-west-2
S3_BUCKET_NAME=my-graphrag-bucket
S3_PREFIX=data/output
```

## Using AWS IAM Roles

If you're running GraphRAG on an AWS service that supports IAM roles (such as EC2, ECS, or Lambda), you can omit the `aws_access_key_id` and `aws_secret_access_key` parameters. GraphRAG will use the credentials provided by the IAM role attached to the service.

```yaml
output:
  type: s3
  bucket_name: my-output-bucket
  prefix: data/output
  region_name: us-west-2
  # No AWS credentials - will use IAM role
```

## Complete Example

Here's a complete example of a GraphRAG configuration using S3 for all storage components:

```yaml
models:
  default_chat_model:
    api_key: ${OPENAI_API_KEY}
    type: openai_chat
    model: gpt-4o
    model_supports_json: true
  default_embedding_model:
    api_key: ${OPENAI_API_KEY}
    type: openai_embedding
    model: text-embedding-3-small

input:
  type: s3
  file_type: csv
  bucket_name: my-graphrag-bucket
  prefix: data/input
  file_pattern: ".*\\.csv$"
  text_column: content
  title_column: title
  metadata:
    - author
    - date
    - category
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: us-west-2

output:
  type: s3
  bucket_name: my-graphrag-bucket
  prefix: data/output
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: us-west-2

cache:
  type: s3
  bucket_name: my-graphrag-bucket
  prefix: data/cache
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: us-west-2

reporting:
  type: s3
  bucket_name: my-graphrag-bucket
  prefix: data/logs
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: us-west-2

# Other GraphRAG configuration...
```

## Using S3 for Prompt Storage

GraphRAG now supports loading prompt files directly from S3 buckets. This allows you to store your custom prompts in S3 and reference them in your configuration.

### Configuring S3 Prompts

To use prompts stored in S3, you need to provide the full S3 URI in your configuration. The format for S3 URIs is:

```
s3://bucket_name/path/to/prompt.txt
```
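
For reference, a URI in this form splits into a bucket name and an object key. The helper below is only an illustration of that split (it is not GraphRAG's internal parser):

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3:// URI into (bucket, key); illustrative only."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        msg = f"Not a valid S3 URI: {uri}"
        raise ValueError(msg)
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri("s3://my-graphrag-bucket/prompts/extract_graph.txt")
# bucket == "my-graphrag-bucket", key == "prompts/extract_graph.txt"
```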

For example, in your `settings.yml` file:

```yaml
extract_graph:
  model_id: default_chat_model
  prompt: s3://my-graphrag-bucket/prompts/extract_graph.txt
  entity_types:
    - person
    - organization
    - location
  max_gleanings: 3

extract_claims:
  enabled: true
  model_id: default_chat_model
  prompt: s3://my-graphrag-bucket/prompts/extract_claims.txt
  description: "Extract factual claims from the text."
  max_gleanings: 3

summarize_descriptions:
  model_id: default_chat_model
  prompt: s3://my-graphrag-bucket/prompts/summarize_descriptions.txt
  max_length: 100

community_reports:
  model_id: default_chat_model
  graph_prompt: s3://my-graphrag-bucket/prompts/community_reports_graph.txt
  text_prompt: s3://my-graphrag-bucket/prompts/community_reports_text.txt
  max_length: 500
  max_input_length: 4000
```

### Authentication for S3 Prompts

When accessing prompts from S3, GraphRAG uses the same authentication methods as other S3 operations. The AWS credentials will be searched in the following order:

1. Environment variables (`AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`)
2. Shared credential file (`~/.aws/credentials`)
3. AWS config file (`~/.aws/config`)
4. IAM role for Amazon EC2 or ECS task role
5. Boto session (if running in AWS Lambda)

### Required Permissions

To access prompts from S3, your AWS credentials must have at least the `s3:GetObject` permission for the specified bucket and objects.
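
If you want to verify access before running the pipeline, a quick check such as the one below can help. This is an illustrative sketch using boto3 directly; the bucket, key, and region are placeholders for your own values:

```python
import boto3
from botocore.exceptions import ClientError

# Placeholders: substitute your own bucket, prompt key, and region.
s3 = boto3.client("s3", region_name="us-west-2")
try:
    s3.head_object(Bucket="my-graphrag-bucket", Key="prompts/extract_graph.txt")
    print("Read access to the prompt object looks OK.")
except ClientError as err:
    code = err.response["Error"]["Code"]
    print(f"Access check failed ({code}); verify permissions and the object key.")
```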

## Troubleshooting

### Common Issues

1. **Access Denied**: Ensure that the AWS credentials have the necessary permissions to access the S3 bucket. The IAM user or role should have at least `s3:GetObject`, `s3:PutObject`, `s3:ListBucket`, and `s3:DeleteObject` permissions for the specified bucket.

2. **No Such Bucket**: Verify that the bucket exists in the specified region.

3. **Credential Chain Errors**: If you're not providing explicit credentials, ensure that your environment has valid AWS credentials configured through one of the methods in boto3's credential chain.

4. **Region Issues**: If you encounter region-related errors, explicitly specify the `region_name` parameter in your configuration.

5. **Invalid S3 URI Format**: When using S3 for prompts, ensure that the URI follows the format `s3://bucket_name/path/to/file`. If the bucket name cannot be extracted from the URI, you'll receive an error.

### Logging

GraphRAG logs S3 operations at the INFO level. You can enable more verbose logging by configuring the Python logging system to show DEBUG level logs for the boto3 and botocore libraries:

```python
import logging

# Default to INFO so GraphRAG's own S3 log messages are visible
logging.basicConfig(level=logging.INFO)

# Turn on verbose AWS SDK logging to trace individual S3 requests
logging.getLogger('boto3').setLevel(logging.DEBUG)
logging.getLogger('botocore').setLevel(logging.DEBUG)
```
10 changes: 5 additions & 5 deletions docs/config/yaml.md
@@ -75,7 +75,7 @@ Our pipeline can ingest .csv, .txt, or .json data from an input folder. See the

#### Fields

- `type` **file|blob** - The input type to use. Default=`file`
- `type` **file|blob|s3** - The input type to use. Default=`file`
- `file_type` **text|csv|json** - The type of input data to load. Default is `text`
- `base_dir` **str** - The base directory to read input from, relative to the root.
- `connection_string` **str** - (blob only) The Azure Storage connection string.
@@ -110,7 +110,7 @@ This section controls the storage mechanism used by the pipeline used for export

#### Fields

- `type` **file|memory|blob|cosmosdb** - The storage type to use. Default=`file`
- `type` **file|memory|blob|cosmosdb|s3** - The storage type to use. Default=`file`
- `base_dir` **str** - The base directory to write output artifacts to, relative to the root.
- `connection_string` **str** - (blob/cosmosdb only) The Azure Storage connection string.
- `container_name` **str** - (blob/cosmosdb only) The Azure Storage container name.
@@ -123,7 +123,7 @@ The section defines a secondary storage location for running incremental indexin

#### Fields

- `type` **file|memory|blob|cosmosdb** - The storage type to use. Default=`file`
- `type` **file|memory|blob|cosmosdb|s3** - The storage type to use. Default=`file`
- `base_dir` **str** - The base directory to write output artifacts to, relative to the root.
- `connection_string` **str** - (blob/cosmosdb only) The Azure Storage connection string.
- `container_name` **str** - (blob/cosmosdb only) The Azure Storage container name.
@@ -136,7 +136,7 @@ This section controls the cache mechanism used by the pipeline. This is used to

#### Fields

- `type` **file|memory|blob|cosmosdb** - The storage type to use. Default=`file`
- `type` **file|memory|blob|cosmosdb|s3** - The storage type to use. Default=`file`
- `base_dir` **str** - The base directory to write output artifacts to, relative to the root.
- `connection_string` **str** - (blob/cosmosdb only) The Azure Storage connection string.
- `container_name` **str** - (blob/cosmosdb only) The Azure Storage container name.
@@ -149,7 +149,7 @@ This section controls the reporting mechanism used by the pipeline, for common e

#### Fields

- `type` **file|console|blob** - The reporting type to use. Default=`file`
- `type` **file|console|blob|s3** - The reporting type to use. Default=`file`
- `base_dir` **str** - The base directory to write reports to, relative to the root.
- `connection_string` **str** - (blob only) The Azure Storage connection string.
- `container_name` **str** - (blob only) The Azure Storage container name.
6 changes: 6 additions & 0 deletions graphrag/cache/factory.py
@@ -11,6 +11,7 @@
from graphrag.storage.blob_pipeline_storage import create_blob_storage
from graphrag.storage.cosmosdb_pipeline_storage import create_cosmosdb_storage
from graphrag.storage.file_pipeline_storage import FilePipelineStorage
from graphrag.storage.s3_pipeline_storage import create_s3_storage

if TYPE_CHECKING:
from graphrag.cache.pipeline_cache import PipelineCache
@@ -56,6 +57,11 @@ def create_cache(
                return JsonPipelineCache(create_blob_storage(**kwargs))
            case CacheType.cosmosdb:
                return JsonPipelineCache(create_cosmosdb_storage(**kwargs))
            case CacheType.s3:
                storage = create_s3_storage(**kwargs)
                if "base_dir" in kwargs:
                    storage = storage.child(kwargs["base_dir"])
                return JsonPipelineCache(storage)
            case _:
                if cache_type in cls.cache_types:
                    return cls.cache_types[cache_type](**kwargs)
10 changes: 10 additions & 0 deletions graphrag/callbacks/reporting.py
@@ -11,6 +11,7 @@
from graphrag.callbacks.blob_workflow_callbacks import BlobWorkflowCallbacks
from graphrag.callbacks.console_workflow_callbacks import ConsoleWorkflowCallbacks
from graphrag.callbacks.file_workflow_callbacks import FileWorkflowCallbacks
from graphrag.callbacks.s3_workflow_callbacks import S3WorkflowCallbacks
from graphrag.config.enums import ReportingType
from graphrag.config.models.reporting_config import ReportingConfig

@@ -37,3 +38,12 @@ def create_pipeline_reporter(
                base_dir=config.base_dir,
                storage_account_blob_url=config.storage_account_blob_url,
            )
        case ReportingType.s3:
            if not config.bucket_name:
                msg = "No bucket name provided for S3 storage."
                raise ValueError(msg)
            return S3WorkflowCallbacks(
                bucket_name=config.bucket_name,
                base_dir=config.prefix or "",
                log_file_name=config.base_dir,
            )