Enable LLM Observability with `agentless_enabled=True` by default with a parsed API key #572
Conversation
datadog_lambda/wrapper.py (outdated diff):

```python
llmobs_env_var = os.environ.get("DD_LLMOBS_ENABLED", "false").lower() in ("true", "1")
if llmobs_env_var:
    from datadog_lambda.api import init_api
```
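For context, here is a minimal sketch of what this lazy-loaded enablement path could look like end to end. It assumes ddtrace's public `LLMObs.enable` API with its `agentless_enabled` parameter; the exact wiring in the PR may differ:

```python
import os

if os.environ.get("DD_LLMOBS_ENABLED", "false").lower() in ("true", "1"):
    # Lazy imports: non-LLMObs functions skip this cost entirely.
    from datadog_lambda.api import init_api
    from ddtrace.llmobs import LLMObs

    init_api()  # resolve DD_API_KEY (e.g. from Secrets Manager) before enabling
    LLMObs.enable(agentless_enabled=True)
```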
nice use of lazy loading here!
This looks pretty good to me; a few things to keep in mind (see the sketch after this list):
- If LLMObs ships data agentlessly via a nonblocking thread or async runtime, you should expect to drop or miss data as the CPU is throttled and resumed between invocations. You may want to check this by performing a few invokes, then waiting several minutes. You can partially handle this by aggressively retrying, but it's not perfect.
- Because you're doing this in the main function process and not the extension, data will be lost if it isn't flushed before the sandbox shuts down, since the main process does not run on the `shutdown` event (but extensions do).
- You should also expect to lose data in the event of a timeout, OOM, or other unhandled exception that causes the Python process to re-initialize.
I'm not a codeowner of this project any longer, so I'll let that team chime in.
Looks good to me!
After pairing on this, we measured the overhead impact for LLMObs customers. The ballpark overhead increase for this use case is ~200ms (the `datadog.api` import and the Secrets Manager call), so approximately ~800ms total, as opposed to ~550ms. LLMObs itself adds ~60ms during init.
Thanks for the callouts @astuyve!
Our LLMObs writer just re-uses the periodic HTTP writer service in
We force flush LLMObs in the
This is a good point, which I think falls outside of this PR. I'll make a note to follow up on investigating this!
What does this PR do?
Enables LLM Observability with `agentless_enabled=True` to ensure seamless compatibility with the agent used in the DD Extension layer. Previously, LLM Observability would try to use an agent proxy endpoint which doesn't exist on the trace agent used in the next version of the DD Extension layer. Since all we really use the proxy for outside of serverless is so that users don't have to re-state their API key, it should be fine to default to agentless mode in serverless environments (LLM Observability now still sends APM traces even if `agentless_enabled=True` is set).

To help enable this, I added a call to `init_api` to get the `DD_API_KEY` from Secrets Manager if it lives there, to make the experience even smoother. Otherwise, we can enforce `DD_API_KEY` in the Lambda function's env vars.
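For readers unfamiliar with that flow, a rough sketch of the kind of lookup an `init_api`-style helper performs is below. The `DD_API_KEY_SECRET_ARN` env var and the `resolve_api_key` name are illustrative assumptions, not the PR's exact code:

```python
import os

import boto3

def resolve_api_key():
    # Assumed convention: the secret's ARN is provided via an env var.
    secret_arn = os.environ.get("DD_API_KEY_SECRET_ARN")
    if secret_arn and not os.environ.get("DD_API_KEY"):
        client = boto3.client("secretsmanager")
        secret = client.get_secret_value(SecretId=secret_arn)
        # Expose the key where the tracer and LLMObs expect to find it.
        os.environ["DD_API_KEY"] = secret["SecretString"]
```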
Motivation
MLOB-2225
Testing Guidelines
Built the layer with the `build_layers` script, verified it against our LLM Observability Lambda functions with only `DD_LLMOBS_ENABLED`, `DD_LLMOBS_ML_APP`, and `DD_API_KEY` set, and confirmed traces showed up in the UI.
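As a rough illustration of that setup, a verification function might look like the sketch below; `run_llm_step` and the handler body are hypothetical, while `datadog_lambda_wrapper` and the `workflow` decorator are the public entry points of datadog-lambda-python and ddtrace:

```python
from datadog_lambda.wrapper import datadog_lambda_wrapper
from ddtrace.llmobs.decorators import workflow

@workflow
def run_llm_step(prompt):
    # Hypothetical stand-in for a real LLM call.
    return f"echo: {prompt}"

@datadog_lambda_wrapper
def handler(event, context):
    # With DD_LLMOBS_ENABLED, DD_LLMOBS_ML_APP, and DD_API_KEY set,
    # this workflow span should show up in LLM Observability.
    return {"result": run_llm_step(event.get("prompt", "hi"))}
```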
Additional Notes
Happy to remove the `init_api` part if it would be too much of a burden on the code path for a serverless env. I saw it was available and only used for metrics, but decided to re-use it. It can be revisited later if needed, and we can enforce `DD_API_KEY` being set directly instead.

Types of Changes
Check all that apply