# Ingest Strategy

When building RAG applications, you often need to load and refresh content from multiple sources. This can involve:

- Expensive API calls
- Large document processing
- Concurrent embedding operations

We use [Prefect](https://docs.prefect.io) to handle these challenges, giving us:

- Automatic caching of expensive operations
- Concurrent processing with backpressure
- Observability and retries

Let's look at a real example that demonstrates these concepts.

## Building a Knowledge Base

```python
from datetime import timedelta

import httpx
from prefect import flow, task

from raggy.loaders.github import GitHubRepoLoader
from raggy.loaders.web import SitemapLoader
from raggy.vectorstores.tpuf import TurboPuffer


def get_last_modified(context, parameters):
    """Cache on content changes: only reload when the source reports a new Last-Modified."""
    urls = getattr(parameters["loader"], "urls", None)
    if not urls:
        return None  # no cache key, so the task always runs
    try:
        return httpx.head(urls[0]).headers.get("Last-Modified", "")
    except Exception:
        return None


@task(
    cache_key_fn=get_last_modified,
    cache_expiration=timedelta(hours=24),
    retries=2,
)
async def gather_documents(loader):
    return await loader.load()


@flow
async def refresh_knowledge():
    # Load from multiple sources
    documents = []
    for loader in [
        SitemapLoader(urls=["https://docs.prefect.io/sitemap.xml"]),
        GitHubRepoLoader(repo="PrefectHQ/prefect", include_globs=["README.md"]),
    ]:
        documents.extend(await gather_documents(loader))

    # Store efficiently with concurrent embedding
    with TurboPuffer(namespace="knowledge") as tpuf:
        await tpuf.upsert_batched(
            documents,
            batch_size=100,  # tune based on document size
            max_concurrent=8,  # tune based on rate limits
        )
```

This example shows key patterns:

1. Content-aware caching (`Last-Modified` headers, commit SHAs, etc.)
2. Automatic retries for resilience
3. Concurrent processing with backpressure
4. Efficient batching of embedding operations

See the [refresh examples](https://github.com/zzstoatzz/raggy/tree/main/examples/refresh_vectorstore) for complete implementations using both Chroma and TurboPuffer.
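
Not every source exposes a `Last-Modified` header or a commit SHA. A fallback for content-aware caching is to derive the cache key from the task's inputs themselves, so the cache invalidates whenever the source list changes. Here is a minimal sketch; the function name `content_cache_key` and the hashing scheme are illustrative, not part of raggy or Prefect:

```python
import hashlib


def content_cache_key(context, parameters):
    """Hypothetical cache_key_fn: hash the sorted URL list so the same
    sources (in any order) hit the same cache entry."""
    raw = "|".join(sorted(parameters["urls"]))
    return hashlib.sha256(raw.encode()).hexdigest()
```

Because the URLs are sorted before hashing, reordering the input list does not bust the cache; adding, removing, or editing a URL does.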

## Performance Tips

For production workloads:

```python
@task(
    retries=2,
    retry_delay_seconds=[3, 60],  # custom backoff: wait 3s, then 60s
    cache_expiration=timedelta(days=1),
    persist_result=True,  # save results to storage
)
async def gather_documents(loader):
    return await loader.load()
```

See [Prefect's documentation](https://docs.prefect.io/latest/concepts/tasks/) for more on task configuration and caching strategies.
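
The "concurrent processing with backpressure" behind `upsert_batched` boils down to splitting work into batches and capping how many are in flight at once. A rough sketch of that pattern, assuming nothing about TurboPuffer's internals (the helper name and signature here are illustrative):

```python
import asyncio


async def upsert_batched(items, batch_size, max_concurrent, upsert):
    """Split items into batches and run `upsert` on them concurrently,
    with a semaphore capping in-flight batches (backpressure)."""
    sem = asyncio.Semaphore(max_concurrent)

    async def worker(batch):
        async with sem:  # blocks when max_concurrent batches are in flight
            await upsert(batch)

    batches = [items[i : i + batch_size] for i in range(0, len(items), batch_size)]
    await asyncio.gather(*(worker(b) for b in batches))
```

`batch_size` controls how much work each request carries; `max_concurrent` keeps you under the embedding provider's rate limits even when there are many batches.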