Enable LLM Observability with agentless_enabled=True by default with a parsed API key #572


Merged (3 commits into main, Mar 10, 2025)

Conversation

sabrenner (Collaborator)

What does this PR do?

Enables LLM Observability with agentless_enabled=True to ensure seamless compatibility with the agent used in the DD Extension layer. Previously, LLM Observability would try to use an agent proxy endpoint which doesn't exist on the trace agent used in the next version of the DD Extension layer.

Since, outside of serverless, the proxy is really only used so that users don't have to restate their API key, it should be fine to default to agentless for serverless environments (LLM Observability still sends APM traces even when agentless_enabled=True is set).

To make the experience even smoother, I added a call to init_api to fetch DD_API_KEY from Secrets Manager if it lives there. Otherwise, we can require DD_API_KEY to be set in the Lambda function's env vars.
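The API-key resolution described above can be sketched roughly as follows. This is an illustrative approximation, not the layer's actual implementation: the real helper is `init_api` in `datadog_lambda.api`, and `DD_API_KEY_SECRET_ARN` is assumed here as the env var pointing at the Secrets Manager secret.

```python
import os


def resolve_api_key():
    """Return the Datadog API key, preferring a Secrets Manager secret.

    Sketch of the pattern init_api follows: if the key is stored in
    Secrets Manager (referenced by DD_API_KEY_SECRET_ARN), fetch it
    lazily; otherwise fall back to the DD_API_KEY env var.
    """
    secret_arn = os.environ.get("DD_API_KEY_SECRET_ARN")
    if secret_arn:
        import boto3  # imported lazily to keep cold starts fast when unused

        client = boto3.client("secretsmanager")
        return client.get_secret_value(SecretId=secret_arn)["SecretString"]
    return os.environ.get("DD_API_KEY")
```

Doing the lazy `boto3` import inside the branch keeps the common env-var path free of the import cost.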

Motivation

MLOB-2225

Testing Guidelines

Built the layer with the build_layers script and tested against our LLM Observability Lambda functions with only

  • DD_LLMOBS_ENABLED
  • DD_LLMOBS_ML_APP
  • DD_API_KEY

set, and verified traces showed up in the UI.

Additional Notes

Happy to remove the init_api part if it adds too much burden to the serverless code path. I saw it was available, though only used for metrics, and decided to reuse it. It can be revisited later if needed, and we can instead require DD_API_KEY to be set directly.

Types of Changes

  • Bug fix
  • New feature
  • Breaking change
  • Misc (docs, refactoring, dependency upgrade, etc.)

Check all that apply

  • This PR's description is comprehensive
  • This PR contains breaking changes that are documented in the description
  • This PR introduces new APIs or parameters that are documented and unlikely to change in the foreseeable future
  • This PR impacts documentation, and it has been updated (or a ticket has been logged)
  • This PR's changes are covered by the automated tests
  • This PR collects user input/sensitive content into Datadog
  • This PR passes the integration tests (ask a Datadog member to run the tests)

@sabrenner sabrenner marked this pull request as ready for review March 10, 2025 17:54
@sabrenner sabrenner requested a review from a team as a code owner March 10, 2025 17:54
```python
llmobs_env_var = os.environ.get("DD_LLMOBS_ENABLED", "false").lower() in ("true", "1")
if llmobs_env_var:
    from datadog_lambda.api import init_api
```
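The gating logic in the snippet above, as a self-contained sketch (the `llmobs_requested` helper name is mine, for illustration):

```python
import os


def llmobs_requested(environ=None):
    """Gate from the snippet above: DD_LLMOBS_ENABLED is truthy only for
    "true" or "1" (case-insensitive), defaulting to disabled."""
    environ = os.environ if environ is None else environ
    return environ.get("DD_LLMOBS_ENABLED", "false").lower() in ("true", "1")
```

When the gate passes, the layer lazily imports `init_api` to resolve the API key before enabling LLM Observability; in ddtrace that enablement goes through `LLMObs.enable(...)`, with `agentless_enabled=True` as this PR configures (call shape assumed from ddtrace's public API, not copied from the PR diff).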
Contributor:
nice use of lazy loading here!

@astuyve (Contributor) left a comment:

This looks pretty good to me, a few things to keep in mind:

  • if LLMObs ships data agentlessly via a nonblocking thread or async runtime, you should expect to drop or miss data as the CPU is throttled and resumed between invocations. You may want to check this by performing a few invokes, then waiting several minutes. You can kind of handle this by aggressively retrying, but it's not perfect.
  • Because you're doing this in the main function process and not the extension, data will be lost if it's not flushed before the sandbox shuts down – as the main process is not executed on the shutdown event (but extensions are).
  • You should also expect to lose data in the event of a timeout, OOM, or other unhandled exceptions which cause the python process to re-initialize.

I'm not a codeowner of this project any longer, so I'll let that team chime in.

@duncanista (Contributor) left a comment:

Looks good to me!

After pairing on this, we measured the overhead impact for LLMObs customers: the datadog.api import plus the Secrets Manager call add roughly 200ms for this use case, bringing init to approximately ~800ms instead of ~550ms. LLMObs itself adds ~60ms during init.

@sabrenner sabrenner requested a review from a team as a code owner March 10, 2025 21:06
@sabrenner (Collaborator, Author):

Thanks for the callouts @astuyve!

> if LLMObs ships data agentlessly via a nonblocking thread or async runtime, you should expect to drop or miss data as the CPU is throttled and resumed between invocations. You may want to check this by performing a few invokes, then waiting several minutes. You can kind of handle this by aggressively retrying, but it's not perfect.

Our LLMObs writer just reuses the periodic HTTP writer service in dd-trace-py, which I believe issues a synchronous, blocking HTTP request from a background thread (I haven't dug too deeply into the ddtrace internals). I don't think our customers using LLMObs in serverless environments have workloads large enough to hit that kind of throttling, but I'll keep a note of this in case it becomes an issue. I'll run the tests, though, and follow up if I can make any improvements!

> Because you're doing this in the main function process and not the extension, data will be lost if it's not flushed before the sandbox shuts down – as the main process is not executed on the shutdown event (but extensions are).

We force flush LLMObs in the datadog_lambda.wrapper._LambdaDecorator._after function, so I think we should be good in this case. This was something that got us in the past so we made sure to have that there 👍
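A generic sketch of that force-flush pattern (the real implementation lives in datadog_lambda.wrapper._LambdaDecorator._after; `with_flush` and its `flush` parameter are hypothetical names used here for illustration):

```python
import functools


def with_flush(flush):
    """Decorator sketch: run the handler, then always flush buffered
    telemetry before the invocation ends, so data is not stranded in
    the sandbox when Lambda freezes or reaps the process."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(event, context):
            try:
                return handler(event, context)
            finally:
                # Flush even when the handler raises, mirroring an
                # _after hook that runs on both success and error.
                flush()
        return wrapper
    return decorator
```

The `finally` block is the important part: it covers handled exceptions, though (as noted above) not timeouts or OOM kills that take down the whole process.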

> You should also expect to lose data in the event of a timeout, OOM, or other unhandled exceptions which cause the python process to re-initialize.

This is a good point, which I think falls outside of this PR. I'll make a note to follow up on investigating this!

@sabrenner sabrenner merged commit 6e2e5d5 into main Mar 10, 2025
60 checks passed
@sabrenner sabrenner deleted the sabrenner/force-llmobs-agentless branch March 10, 2025 23:53