feat(aws): add s3 support to input, storage, output, cache, etc. #1830

Open · wants to merge 10 commits into main
4 changes: 4 additions & 0 deletions .semversioner/next-release/minor-20250320144125923710.json
@@ -0,0 +1,4 @@
{
    "type": "minor",
    "description": "Add s3 support and documentation"
}
4 changes: 4 additions & 0 deletions .semversioner/next-release/patch-20250320153221.json
@@ -0,0 +1,4 @@
{
    "description": "Fixed S3 configuration validation to use the correct enum types for each storage configuration.",
    "type": "patch"
}
1 change: 1 addition & 0 deletions docs/config/overview.md
@@ -8,4 +8,5 @@ The default configuration mode is the simplest way to get started with the Graph

- [Init command](init.md) (recommended)
- [Using YAML for deeper control](yaml.md)
- [Using Amazon S3 for storage](s3.md)
- [Purely using environment variables](env_vars.md) (not recommended)
283 changes: 283 additions & 0 deletions docs/config/s3.md
@@ -0,0 +1,283 @@
# Using Amazon S3 Storage with GraphRAG

GraphRAG supports using Amazon S3 as a storage backend for various components of the system, including input data, output artifacts, cache, reporting, and prompts. This document explains how to configure and use S3 storage in your GraphRAG projects.

## Overview

S3 storage can be used for the following GraphRAG components:

- **Input**: Load input data from S3 buckets
- **Output**: Store output artifacts in S3 buckets
- **Cache**: Cache LLM invocation results in S3 buckets
- **Reporting**: Store reports in S3 buckets
- **Prompts**: Load prompt files from S3 buckets

## Configuration

You can configure S3 storage in your `settings.yml` file. Each component (input, output, cache, reporting) can be configured independently to use S3 storage.

### Common S3 Configuration Parameters

All S3 storage configurations share these common parameters:

| Parameter | Description | Type | Required |
|-----------|-------------|------|----------|
| `type` | Set to `s3` to use S3 storage | `str` | Yes |
| `bucket_name` | The name of the S3 bucket | `str` | Yes |
| `prefix` | The prefix to use for all keys in the bucket | `str` | No (default: `""`) |
| `encoding` | The encoding to use for text files | `str` | No (default: `"utf-8"`) |
| `aws_access_key_id` | The AWS access key ID | `str` | No* |
| `aws_secret_access_key` | The AWS secret access key | `str` | No* |
| `region_name` | The AWS region name | `str` | No |

*Note: If `aws_access_key_id` and `aws_secret_access_key` are not provided, boto3's default credential chain is used, which searches for credentials in the following order (a quick way to check what boto3 resolves is sketched after the list):
1. Environment variables (`AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`)
2. Shared credential file (`~/.aws/credentials`)
3. AWS config file (`~/.aws/config`)
4. IAM role for Amazon EC2 or ECS task role
5. Boto session (if running in AWS Lambda)
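
If you are unsure which of these sources boto3 will end up using, you can check from a Python shell before running GraphRAG. The snippet below is a minimal sketch that calls boto3 directly; it is not part of GraphRAG, and it assumes boto3 is installed in your environment:

```python
import boto3

# Resolve credentials through boto3's default chain, the same chain GraphRAG
# falls back to when no keys are set in settings.yml.
session = boto3.Session()
credentials = session.get_credentials()

if credentials is None:
    print("No AWS credentials found in the credential chain.")
else:
    frozen = credentials.get_frozen_credentials()
    print(f"Access key in use: {frozen.access_key[:4]}*** (source: {credentials.method})")
    print(f"Default region: {session.region_name}")
```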

## Example Configurations

### Input Configuration

To configure GraphRAG to read input data from an S3 bucket:

```yaml
input:
  type: s3
  bucket_name: my-input-bucket
  prefix: data/input
  file_type: csv # or text, json
  file_pattern: ".*\\.csv$"
  text_column: text
  title_column: title
  metadata:
    - author
    - date
  aws_access_key_id: ${AWS_ACCESS_KEY_ID} # Using environment variable
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: us-west-2
```

### Output Configuration

To store output artifacts in an S3 bucket:

```yaml
output:
  type: s3
  bucket_name: my-output-bucket
  prefix: data/output
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: us-west-2
```

### Cache Configuration

To use S3 for caching LLM invocation results:

```yaml
cache:
  type: s3
  bucket_name: my-cache-bucket
  prefix: graphrag/cache
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: us-west-2
```

### Reporting Configuration

To store reports in an S3 bucket:

```yaml
reporting:
  type: s3
  bucket_name: my-reporting-bucket
  prefix: graphrag/logs
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: us-west-2
```

## Using Environment Variables

It's recommended to use environment variables for AWS credentials rather than hardcoding them in your configuration files. You can use the `${ENV_VAR}` syntax in your YAML configuration to reference environment variables:

```yaml
# settings.yml
output:
  type: s3
  bucket_name: ${S3_BUCKET_NAME}
  prefix: ${S3_PREFIX}
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: ${AWS_REGION}
```

Then, in your `.env` file:

```
AWS_ACCESS_KEY_ID=your_access_key_id
AWS_SECRET_ACCESS_KEY=your_secret_access_key
AWS_REGION=us-west-2
S3_BUCKET_NAME=my-graphrag-bucket
S3_PREFIX=data/output
```

## Using AWS IAM Roles

If you're running GraphRAG on an AWS service that supports IAM roles (such as EC2, ECS, or Lambda), you can omit the `aws_access_key_id` and `aws_secret_access_key` parameters. GraphRAG will use the credentials provided by the IAM role attached to the service.

```yaml
output:
  type: s3
  bucket_name: my-output-bucket
  prefix: data/output
  region_name: us-west-2
  # No AWS credentials - will use IAM role
```

## Complete Example

Here's a complete example of a GraphRAG configuration using S3 for all storage components:

```yaml
models:
  default_chat_model:
    api_key: ${OPENAI_API_KEY}
    type: openai_chat
    model: gpt-4o
    model_supports_json: true
  default_embedding_model:
    api_key: ${OPENAI_API_KEY}
    type: openai_embedding
    model: text-embedding-3-small

input:
  type: s3
  file_type: csv
  bucket_name: my-graphrag-bucket
  prefix: data/input
  file_pattern: ".*\\.csv$"
  text_column: content
  title_column: title
  metadata:
    - author
    - date
    - category
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: us-west-2

output:
  type: s3
  bucket_name: my-graphrag-bucket
  prefix: data/output
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: us-west-2

cache:
  type: s3
  bucket_name: my-graphrag-bucket
  prefix: data/cache
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: us-west-2

reporting:
  type: s3
  bucket_name: my-graphrag-bucket
  prefix: data/logs
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: us-west-2

# Other GraphRAG configuration...
```

## Using S3 for Prompt Storage

GraphRAG now supports loading prompt files directly from S3 buckets. This allows you to store your custom prompts in S3 and reference them in your configuration.

### Configuring S3 Prompts

To use prompts stored in S3, you need to provide the full S3 URI in your configuration. The format for S3 URIs is:

```
s3://bucket_name/path/to/prompt.txt
```
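
For reference, a URI in this form splits into a bucket name and an object key. The helper below is only an illustration of that split (it is not GraphRAG's internal parser):

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3:// URI into (bucket, key); illustrative only."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        msg = f"Not a valid S3 URI: {uri}"
        raise ValueError(msg)
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri("s3://my-graphrag-bucket/prompts/extract_graph.txt")
# bucket == "my-graphrag-bucket", key == "prompts/extract_graph.txt"
```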

For example, in your `settings.yml` file:

```yaml
extract_graph:
  model_id: default_chat_model
  prompt: s3://my-graphrag-bucket/prompts/extract_graph.txt
  entity_types:
    - person
    - organization
    - location
  max_gleanings: 3

extract_claims:
  enabled: true
  model_id: default_chat_model
  prompt: s3://my-graphrag-bucket/prompts/extract_claims.txt
  description: "Extract factual claims from the text."
  max_gleanings: 3

summarize_descriptions:
  model_id: default_chat_model
  prompt: s3://my-graphrag-bucket/prompts/summarize_descriptions.txt
  max_length: 100

community_reports:
  model_id: default_chat_model
  graph_prompt: s3://my-graphrag-bucket/prompts/community_reports_graph.txt
  text_prompt: s3://my-graphrag-bucket/prompts/community_reports_text.txt
  max_length: 500
  max_input_length: 4000
```

### Authentication for S3 Prompts

When accessing prompts from S3, GraphRAG uses the same authentication methods as other S3 operations. The AWS credentials will be searched in the following order:

1. Environment variables (`AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`)
2. Shared credential file (`~/.aws/credentials`)
3. AWS config file (`~/.aws/config`)
4. IAM role for Amazon EC2 or ECS task role
5. Boto session (if running in AWS Lambda)

### Required Permissions

To access prompts from S3, your AWS credentials must have at least the `s3:GetObject` permission for the specified bucket and objects.
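
If you want to verify access before running the pipeline, a quick check such as the one below can help. This is an illustrative sketch using boto3 directly; the bucket, key, and region are placeholders for your own values:

```python
import boto3
from botocore.exceptions import ClientError

# Placeholders: substitute your own bucket, prompt key, and region.
s3 = boto3.client("s3", region_name="us-west-2")
try:
    s3.head_object(Bucket="my-graphrag-bucket", Key="prompts/extract_graph.txt")
    print("Read access to the prompt object looks OK.")
except ClientError as err:
    code = err.response["Error"]["Code"]
    print(f"Access check failed ({code}); verify permissions and the object key.")
```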

## Troubleshooting

### Common Issues

1. **Access Denied**: Ensure that the AWS credentials have the necessary permissions to access the S3 bucket. The IAM user or role should have at least `s3:GetObject`, `s3:PutObject`, `s3:ListBucket`, and `s3:DeleteObject` permissions for the specified bucket.

2. **No Such Bucket**: Verify that the bucket exists in the specified region.

3. **Credential Chain Errors**: If you're not providing explicit credentials, ensure that your environment has valid AWS credentials configured through one of the methods in boto3's credential chain.

4. **Region Issues**: If you encounter region-related errors, explicitly specify the `region_name` parameter in your configuration.

5. **Invalid S3 URI Format**: When using S3 for prompts, ensure that the URI follows the format `s3://bucket_name/path/to/file`. If the bucket name cannot be extracted from the URI, you'll receive an error.

### Logging

GraphRAG logs S3 operations at the INFO level. You can enable more verbose logging by configuring the Python logging system to show DEBUG level logs for the boto3 and botocore libraries:

```python
import logging

# Default to INFO so GraphRAG's own S3 log messages are visible
logging.basicConfig(level=logging.INFO)

# Turn on verbose AWS SDK logging to trace individual S3 requests
logging.getLogger('boto3').setLevel(logging.DEBUG)
logging.getLogger('botocore').setLevel(logging.DEBUG)
```
10 changes: 5 additions & 5 deletions docs/config/yaml.md
@@ -75,7 +75,7 @@ Our pipeline can ingest .csv, .txt, or .json data from an input folder. See the

#### Fields

- `type` **file|blob** - The input type to use. Default=`file`
- `type` **file|blob|s3** - The input type to use. Default=`file`
- `file_type` **text|csv|json** - The type of input data to load. Default is `text`
- `base_dir` **str** - The base directory to read input from, relative to the root.
- `connection_string` **str** - (blob only) The Azure Storage connection string.
@@ -110,7 +110,7 @@ This section controls the storage mechanism used by the pipeline used for export

#### Fields

- `type` **file|memory|blob|cosmosdb** - The storage type to use. Default=`file`
- `type` **file|memory|blob|cosmosdb|s3** - The storage type to use. Default=`file`
- `base_dir` **str** - The base directory to write output artifacts to, relative to the root.
- `connection_string` **str** - (blob/cosmosdb only) The Azure Storage connection string.
- `container_name` **str** - (blob/cosmosdb only) The Azure Storage container name.
@@ -123,7 +123,7 @@ The section defines a secondary storage location for running incremental indexin

#### Fields

- `type` **file|memory|blob|cosmosdb** - The storage type to use. Default=`file`
- `type` **file|memory|blob|cosmosdb|s3** - The storage type to use. Default=`file`
- `base_dir` **str** - The base directory to write output artifacts to, relative to the root.
- `connection_string` **str** - (blob/cosmosdb only) The Azure Storage connection string.
- `container_name` **str** - (blob/cosmosdb only) The Azure Storage container name.
@@ -136,7 +136,7 @@ This section controls the cache mechanism used by the pipeline. This is used to

#### Fields

- `type` **file|memory|blob|cosmosdb** - The storage type to use. Default=`file`
- `type` **file|memory|blob|cosmosdb|s3** - The storage type to use. Default=`file`
- `base_dir` **str** - The base directory to write output artifacts to, relative to the root.
- `connection_string` **str** - (blob/cosmosdb only) The Azure Storage connection string.
- `container_name` **str** - (blob/cosmosdb only) The Azure Storage container name.
@@ -149,7 +149,7 @@ This section controls the reporting mechanism used by the pipeline, for common e

#### Fields

- `type` **file|console|blob** - The reporting type to use. Default=`file`
- `type` **file|console|blob|s3** - The reporting type to use. Default=`file`
- `base_dir` **str** - The base directory to write reports to, relative to the root.
- `connection_string` **str** - (blob only) The Azure Storage connection string.
- `container_name` **str** - (blob only) The Azure Storage container name.
6 changes: 6 additions & 0 deletions graphrag/cache/factory.py
@@ -11,6 +11,7 @@
from graphrag.storage.blob_pipeline_storage import create_blob_storage
from graphrag.storage.cosmosdb_pipeline_storage import create_cosmosdb_storage
from graphrag.storage.file_pipeline_storage import FilePipelineStorage
from graphrag.storage.s3_pipeline_storage import create_s3_storage

if TYPE_CHECKING:
from graphrag.cache.pipeline_cache import PipelineCache
@@ -56,6 +57,11 @@ def create_cache(
                return JsonPipelineCache(create_blob_storage(**kwargs))
            case CacheType.cosmosdb:
                return JsonPipelineCache(create_cosmosdb_storage(**kwargs))
            case CacheType.s3:
                storage = create_s3_storage(**kwargs)
                if "base_dir" in kwargs:
                    storage = storage.child(kwargs["base_dir"])
                return JsonPipelineCache(storage)
            case _:
                if cache_type in cls.cache_types:
                    return cls.cache_types[cache_type](**kwargs)
10 changes: 10 additions & 0 deletions graphrag/callbacks/reporting.py
@@ -11,6 +11,7 @@
from graphrag.callbacks.blob_workflow_callbacks import BlobWorkflowCallbacks
from graphrag.callbacks.console_workflow_callbacks import ConsoleWorkflowCallbacks
from graphrag.callbacks.file_workflow_callbacks import FileWorkflowCallbacks
from graphrag.callbacks.s3_workflow_callbacks import S3WorkflowCallbacks
from graphrag.config.enums import ReportingType
from graphrag.config.models.reporting_config import ReportingConfig

@@ -37,3 +38,12 @@ def create_pipeline_reporter(
                base_dir=config.base_dir,
                storage_account_blob_url=config.storage_account_blob_url,
            )
        case ReportingType.s3:
            if not config.bucket_name:
                msg = "No bucket name provided for S3 storage."
                raise ValueError(msg)
            return S3WorkflowCallbacks(
                bucket_name=config.bucket_name,
                base_dir=config.prefix or "",
                log_file_name=config.base_dir,
            )