feat(bedrock): expose bda parsing strategy for data sources #1096

Merged
merged 4 commits into from
Apr 23, 2025

Conversation

krokoko
Collaborator

@krokoko krokoko commented Apr 23, 2025

Fixes #

  • Fix a typo
  • Expose BDA as a parsing strategy for data sources

Tested with the following code snippet:

// Imports needed by the snippet; it is expected to run inside a Stack constructor (`this`).
import * as cdk from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';
import { bedrock } from '@cdklabs/generative-ai-cdk-constructs';

// Bucket holding the source documents to ingest
const docBucket = new s3.Bucket(this, 'DocBucket', {
  enforceSSL: true,
  versioned: true,
  publicReadAccess: false,
  blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
  encryption: s3.BucketEncryption.S3_MANAGED,
  removalPolicy: cdk.RemovalPolicy.DESTROY,
  autoDeleteObjects: true,
  serverAccessLogsPrefix: 'inputsAssetsBucketLogs/',
});

// Bucket receiving the images extracted from documents during parsing
const supplementalStorage = new s3.Bucket(this, 'SupplementalStorage', {
  enforceSSL: true,
  versioned: true,
  publicReadAccess: false,
  blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
  encryption: s3.BucketEncryption.S3_MANAGED,
  removalPolicy: cdk.RemovalPolicy.DESTROY,
  autoDeleteObjects: true,
  serverAccessLogsPrefix: 'inputsAssetsBucketLogs/',
});

const supplementalStorageS3 = bedrock.SupplementalDataStorageLocation.s3({
  uri: `s3://${supplementalStorage.bucketName}/`,
});

const kb = new bedrock.VectorKnowledgeBase(this, 'KnowledgeBase', {
  embeddingsModel: bedrock.BedrockFoundationModel.TITAN_EMBED_TEXT_V2_1024,
  instruction:
    'Use this knowledge base to answer questions about books. ' +
    'It contains the full text of novels.',
  supplementalDataStorageLocations: [supplementalStorageS3],
});

// Grant the knowledge base role read/write access to the supplemental data storage bucket
supplementalStorage.grantReadWrite(kb.role);

const dataSource = new bedrock.S3DataSource(this, 'DataSource', {
  bucket: docBucket,
  knowledgeBase: kb,
  dataSourceName: 'books',
  chunkingStrategy: bedrock.ChunkingStrategy.fixedSize({
    maxTokens: 500,
    overlapPercentage: 20,
  }),
  // New in this PR: Bedrock Data Automation (BDA) as the parsing strategy
  parsingStrategy: bedrock.ParsingStrategy.bedrockDataAutomation(),
});
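
For readers unfamiliar with the `fixedSize` chunking options used above, the following is a rough, self-contained sketch of what `maxTokens: 500` and `overlapPercentage: 20` mean conceptually. It is not Bedrock's implementation: the function name `fixedSizeChunks`, the pre-split token array, and the interpretation of the overlap as a percentage of `maxTokens` are all assumptions made for illustration, and the service's real tokenizer and boundary handling may differ.

```typescript
// Illustrative only: a naive fixed-size chunker over pre-split "tokens".
// Hypothetical helper, NOT part of the Bedrock constructs or the Bedrock service.
function fixedSizeChunks(
  tokens: string[],
  maxTokens: number,
  overlapPercentage: number,
): string[][] {
  if (maxTokens <= 0) {
    throw new Error('maxTokens must be positive');
  }
  if (overlapPercentage < 0 || overlapPercentage >= 100) {
    throw new Error('overlapPercentage must be in [0, 100)');
  }

  // Assumption: overlap is measured as a percentage of maxTokens.
  const overlap = Math.floor((maxTokens * overlapPercentage) / 100);
  // Each new chunk starts this many tokens after the previous one.
  const step = maxTokens - overlap;

  const chunks: string[][] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + maxTokens));
    // Stop once the current chunk has reached the end of the input.
    if (start + maxTokens >= tokens.length) {
      break;
    }
  }
  return chunks;
}

// Usage: 1200 dummy tokens with the same settings as the data source above.
const sampleTokens = Array.from({ length: 1200 }, (_, i) => `t${i}`);
const sampleChunks = fixedSizeChunks(sampleTokens, 500, 20);
console.log(sampleChunks.map((c) => c.length)); // [ 500, 500, 400 ]
```

Under these assumptions, consecutive chunks share 100 tokens (20% of 500), so text near a chunk boundary appears in both neighbouring chunks.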

Deploys correctly:

[image]

Sync of the data source works as expected:

[image]

Images from documents are extracted and placed in the supplemental data storage:

[image: Screenshot 2025-04-23 at 11 36 44 AM]


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the project license.

@krokoko krokoko marked this pull request as ready for review April 23, 2025 16:39
@krokoko krokoko requested a review from a team as a code owner April 23, 2025 16:39
@krokoko krokoko enabled auto-merge (squash) April 23, 2025 16:48
Contributor

@MichaelWalker-git MichaelWalker-git left a comment

LGTM

@krokoko krokoko merged commit c50e62c into awslabs:main Apr 23, 2025
13 checks passed
4 participants