Feature request: Calculate Log Sampling Per Request #3278
Hi @jasonwadsworth, thanks for opening this issue. This is an intended limitation that is documented in the docs; we designed the feature this way to avoid adding overhead for customers who don't use it. Up until now we have never had any demand to change this behavior, but it's something we could consider provided there's enough demand from multiple customers. In the meantime, you can call the refreshSampleRateCalculation() method manually on each request. |
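A minimal sketch of that manual workaround, assuming the refreshSampleRateCalculation() method and the sampleRateValue option discussed later in this thread:

```ts
import { Logger } from '@aws-lambda-powertools/logger';

const logger = new Logger({ sampleRateValue: 0.5 });

export const handler = async () => {
  // re-evaluate the sampling decision for this request instead of reusing
  // the one made when the execution environment was initialized
  logger.refreshSampleRateCalculation();
  // ... your business logic
};
```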
Hi @jasonwadsworth, are you by chance using the Middy.js middleware or the class method decorator, like:

```ts
import { Logger } from '@aws-lambda-powertools/logger';
import { injectLambdaContext } from '@aws-lambda-powertools/logger/middleware';
import middy from '@middy/core';

const logger = new Logger();

export const handler = middy(async () => {
  // your code
}).use(injectLambdaContext(logger));
```

or

```ts
import { Logger } from '@aws-lambda-powertools/logger';

const logger = new Logger();

class Lambda {
  @logger.injectLambdaContext()
  async handler(event, context) {
    // your code
  }
}

const lambda = new Lambda();
export const handler = lambda.handler.bind(lambda);
```

If yes, we could add an option to both of them to allow you to configure this behavior and opt in to recalculating the rate at every request, for example:

```ts
import { Logger } from '@aws-lambda-powertools/logger';
import { injectLambdaContext } from '@aws-lambda-powertools/logger/middleware';
import middy from '@middy/core';

const logger = new Logger();

export const handler = middy(async () => {
  // your code
}).use(injectLambdaContext(logger, { refreshSampleRateCalculation: true })); // same for the decorator
```

This would be an unobtrusive change that we could roll out easily at any time. |
I'm not, but I was going to try your first suggestion for now. |
Thank you for clarifying. As of now, unless you're using one of these two methods, the Logger has no way of knowing when a request starts/ends, so it'd be tricky to implement. |
Had a discussion about this with @am29d yesterday and I thought it'd be useful to bring the outcomes of that discussion here, primarily for future implementation & consideration.

The log sampling feature switches the log level of a Logger to debug for a percentage of requests, so that operators can collect debug logs for a sample of their traffic without enabling debug logging everywhere.

As it stands, the feature doesn't exactly behave as described above. This is because the percentage or ratio is not calculated at the request level, but rather when the Logger class is instantiated, which usually happens during the INIT phase of the execution environment:

```ts
import { Logger } from '@aws-lambda-powertools/logger';

const logger = new Logger({
  sampleRateValue: 0.5, // whether or not the log level is switched to `debug` is decided here only
});

export const handler = async () => {
  // ... your logic
};
```

This means that all the requests served by the same environment inherit the sampling decision that was made when the environment was initialized, which in turn results in a sampling rate different from the desired one. The degree of this difference depends on how many environments are spun up and how requests are distributed among them.

To explain what I mean, consider an example with 3 environments/sandboxes, a number of requests distributed across them, and - for the sake of simplicity - a log sampling rate of 0.5 (aka 50%). Assuming a truly random chance of 50%, one could end up in a situation where, say, two of the three environments switch to debug and happen to serve 85 out of 100 requests between them, which would result in a sample rate of ~85% rather than the expected 50%.

To get around this, and instead get a rate closer to the desired 50%, customers can call the refreshSampleRateCalculation() method in their handler:
```ts
import { Logger } from '@aws-lambda-powertools/logger';

const logger = new Logger({
  sampleRateValue: 0.5,
});

export const handler = async () => {
  // ... your logic
  logger.refreshSampleRateCalculation(); // make a new sampling decision
};
```

When called, this method essentially flips the coin again and decides whether the log level should be switched to debug.

With this in mind, we should consider easing this situation for customers by adding an optional flag to our class method decorator and Middy.js middleware so that, when this flag is enabled, we call the refreshSampleRateCalculation() method on their behalf at every request. The flag would be opt-in and off by default, preserving the current behavior.

Obviously, as mentioned above, this would work only if we're able to wrap the handler, so customers who are not using either of the two mechanisms just mentioned would have to continue calling the refreshSampleRateCalculation() method themselves. |
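For reference, the decorator variant of the proposed flag could look roughly like this; refreshSampleRateCalculation is the option name suggested in this thread, not necessarily the final API:

```ts
import { Logger } from '@aws-lambda-powertools/logger';
import type { Context } from 'aws-lambda';

const logger = new Logger({ sampleRateValue: 0.5 });

class Lambda {
  // opt in to making a fresh sampling decision on every request
  @logger.injectLambdaContext({ refreshSampleRateCalculation: true })
  async handler(_event: unknown, _context: Context) {
    logger.debug('emitted only when this request is sampled');
  }
}

const lambda = new Lambda();
export const handler = lambda.handler.bind(lambda);
```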
Thanks for diving deep into the investigation @dreamorosi! I created a small PoC in Python with our Logger and we get much more accurate debug lines than with the current implementation. To be honest, I think our current implementation is useless in most cases and doesn't add anything for customers, especially since it happens at the Logger constructor level and it can end up sampling 100% of the requests or 0% of the requests.

```python
from aws_lambda_powertools import Logger

sampling_rate = 0.5
total_runs = 1000
sampled_count = 0
actual_rate = 0

logger = Logger(sampling_rate=sampling_rate)

for i in range(0, total_runs):
    # logger.debug("sample", tentative=i)  # omitted so we don't print too much to the screen
    logger.reset_sampling()  # PoC helper that makes a new sampling decision

sampled_count = logger.total_sampled  # internal variable added for the PoC to count how many times the decision was True
actual_rate = sampled_count / total_runs

print(f"Sampling rate: {sampling_rate}")
print(f"Total runs: {total_runs}")
print(f"Times sampled: {sampled_count}")
print(f"Actual sampling rate: {actual_rate:.2f}")
```

Output:

(venv) ➜ leoleo python leo_logger.py
Sampling rate: 0.5
Total runs: 1000
Times sampled: 508
Actual sampling rate: 0.51
(venv) ➜ leoleo python leo_logger.py
Sampling rate: 0.5
Total runs: 1000
Times sampled: 504
Actual sampling rate: 0.50
(venv) ➜ leoleo python leo_logger.py
Sampling rate: 0.5
Total runs: 1000
Times sampled: 470
Actual sampling rate: 0.47
(venv) ➜ leoleo python leo_logger.py
Sampling rate: 0.5
Total runs: 1000
Times sampled: 491
Actual sampling rate: 0.49
(venv) ➜ leoleo python leo_logger.py
Sampling rate: 0.5
Total runs: 1000
Times sampled: 503
Actual sampling rate: 0.50
(venv) ➜ leoleo python leo_logger.py
Sampling rate: 0.5
Total runs: 1000
Times sampled: 493
Actual sampling rate: 0.49
I'm not sure I agree with this! Our current implementation is basically saying: try your luck! I think it makes sense to make this the default for anyone using the decorator (in Python too) and to change our documentation to say: this is the expected behavior, and we fixed it for those who use the decorator; if you're not using the decorator, use this function. I'm more than happy to go with "it's better to ask for forgiveness than permission" here. But of course I'll agree with whatever decision we make. |
After checking the Java and .NET runtimes with Henrique and Philipp, we determined that the other runtimes don't have this problem - only Python and Node.js do. For TypeScript and Python, my proposal is:
At this point I'd be happy to explore an option to add a new decorator, because |
So basically we're saying that you can't apply log sampling if you're not using a decorator or middleware? |
Hmm, yeah, I just noticed as well that we would rule out this option. I am on the fence. I want to avoid too many options. Setting sampling in the constructor forces you to use an additional method to make it work properly. Alternatively, we can keep the current behaviour; but then again, we would need to specifically say that if you use it without the decorator or middleware, you must call the additional method yourself. |
I'd be inclined to do this - so the feature is additive. But I agree with Leandro's proposal above that we should improve the documentation and basically say that if you're not using the decorator/middleware you must call refreshSampleRateCalculation() yourself. With that said, I don't have strong feelings. With the addition of log buffering I'd prefer to use that rather than log sampling of debug logs anyway. |
Can you also elaborate on how it is that Java / .NET don't have this issue? |
Ok, let's keep all the options. I agree that it doesn't make sense to not support a feature only because customers don't use the decorator or Middy. I will run some tests on Java and .NET to confirm. |
@am29d let me know if you need help testing .NET |
Ok then - let's move on to implementation next week (for TS). Alex, I changed the status of the issue - hope everything is clear on your side. |
This issue is now closed. Please be mindful that future comments are hard for our team to see. If you need more assistance, please either tag a team member or open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so. |
I am reverting the PR we merged because there's an edge case we should discuss.

In the current implementation (the broken one), when the sampling decision is successful we emit a debug log announcing that the log level was set to DEBUG due to the sampling rate. This log was emitted outside of the handler, during the INIT phase of the execution environment.

With the new implementation, depending on when we refresh the sampling rate decision, a couple of things can happen. For the sake of simplicity, let's assume a sample rate of 0.5 (aka 50%).

Case 1

If we refresh the sample rate before calling the handler when decorating it, then on a cold start the sampling decision is made twice: once when the Logger is instantiated during the INIT phase, and again right before the handler runs.

With this in mind, you could get into a situation where the first decision yields a positive result and the second one doesn't (or vice versa). As an operator, you'd see the announcement that the log level was set to DEBUG, i.e. expect the request to be sampled, and yet see no debug logs for it.

See function code:

```ts
import { Logger } from "@aws-lambda-powertools/logger";
import type { Context } from "aws-lambda";
const logger = new Logger({
sampleRateValue: 0.5,
});
class Lambda {
@logger.injectLambdaContext()
async handler(_event: unknown, _context: Context) {
logger.debug("test debug");
logger.info("test info");
}
}
const lambda = new Lambda();
export const handler = lambda.handler.bind(lambda);
```

Case 2

If we refresh the sample rate after calling the handler when decorating it, the problem above doesn't exist, but now we have a different problem at potentially every other request. Since we are calling the refresh at the end of a request, the announcement that the log level was switched to DEBUG can be emitted during a request that wasn't itself sampled. This is because refreshing the calculation at the end of a request essentially means deciding whether the next request will be sampled. As an operator, I'd both see that log in a request that was potentially not sampled, and also not see it in a request that was sampled.

With this in mind, I think my proposal is to remove the sampling decision completely from the Logger class initialization and refresh the sample rate calculation before every request. This ensures that the sampling decision, and the log announcing it, always belong to the request being served.

The problem with this is that customers who were using the feature without the decorator, the middleware, or a manual call to refreshSampleRateCalculation() would see a change in behavior: their requests would no longer be sampled at all. |
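A rough sketch of how the proposed behavior would look from the handler's point of view, with a hypothetical wrapper standing in for the decorator/middleware (the decision is no longer made in the constructor and is refreshed at the start of every request):

```ts
import { Logger } from '@aws-lambda-powertools/logger';

const logger = new Logger({ sampleRateValue: 0.5 });

// hypothetical wrapper standing in for the decorator/middleware
const withPerRequestSampling =
  <TEvent, TResult>(handler: (event: TEvent, context: unknown) => Promise<TResult>) =>
  async (event: TEvent, context: unknown): Promise<TResult> => {
    // flip the coin for this request, so the sampling decision and any
    // related announcement belong to the request being served
    logger.refreshSampleRateCalculation();
    return handler(event, context);
  };

export const handler = withPerRequestSampling(async (_event, _context) => {
  logger.debug('emitted only when this request was sampled');
  return { ok: true };
});
```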
Thank you for bringing these edge cases to light @dreamorosi.
I don't know about TypeScript, but in Python I can make sure the sampling decision is only evaluated once when it's a cold start, so it will never be made twice for the first invocation. Can you do the same?
In Python, I can refresh the sampling decision before each invocation, so the message "Setting log level to DEBUG due to sampling rate" will appear on the invocation that was sampled.

```python
from collections import namedtuple
import uuid

from aws_lambda_powertools import Logger

logger = Logger(sampling_rate=0.5, level="INFO")


@logger.inject_lambda_context
def lambda_handler(event, context):
    logger.debug("my debug")
    logger.debug("my debug2")
    logger.debug("my debug3")


def lambda_context():
    lambda_context = {
        "function_name": "test",
        "memory_limit_in_mb": 128,
        "invoked_function_arn": "arn:aws:lambda:eu-west-1:809313241:function:test",
        "aws_request_id": str(uuid.uuid4()),
    }
    return namedtuple("LambdaContext", lambda_context.keys())(*lambda_context.values())


for i in range(0, 4):
    lambda_handler(None, lambda_context())
```

Look at the output: the "Setting log level to DEBUG due to sampling rate" message carries the same function_request_id as the debug logs of the invocation that was sampled.

{"level":"DEBUG","location":"_configure_sampling:373","message":"Setting log level to DEBUG due to sampling rate","timestamp":"2025-02-25 14:49:28,603+0000","service":"service_undefined","sampling_rate":0.5,"cold_start":false,"function_name":"test","function_memory_size":128,"function_arn":"arn:aws:lambda:eu-west-1:809313241:function:test","function_request_id":"e215f669-5f19-4bd1-b8a7-069f9e013564"}
{"level":"DEBUG","location":"lambda_handler:10","message":"my debug","timestamp":"2025-02-25 14:49:28,603+0000","service":"service_undefined","sampling_rate":0.5,"cold_start":false,"function_name":"test","function_memory_size":128,"function_arn":"arn:aws:lambda:eu-west-1:809313241:function:test","function_request_id":"e215f669-5f19-4bd1-b8a7-069f9e013564"}
{"level":"DEBUG","location":"lambda_handler:12","message":"my debug2","timestamp":"2025-02-25 14:49:28,603+0000","service":"service_undefined","sampling_rate":0.5,"cold_start":false,"function_name":"test","function_memory_size":128,"function_arn":"arn:aws:lambda:eu-west-1:809313241:function:test","function_request_id":"e215f669-5f19-4bd1-b8a7-069f9e013564"}
{"level":"DEBUG","location":"lambda_handler:14","message":"my debug3","timestamp":"2025-02-25 14:49:28,603+0000","service":"service_undefined","sampling_rate":0.5,"cold_start":false,"function_name":"test","function_memory_size":128,"function_arn":"arn:aws:lambda:eu-west-1:809313241:function:test","function_request_id":"e215f669-5f19-4bd1-b8a7-069f9e013564"}
{"level":"DEBUG","location":"_configure_sampling:373","message":"Setting log level to DEBUG due to sampling rate","timestamp":"2025-02-25 14:49:28,603+0000","service":"service_undefined","sampling_rate":0.5,"cold_start":false,"function_name":"test","function_memory_size":128,"function_arn":"arn:aws:lambda:eu-west-1:809313241:function:test","function_request_id":"0452d41a-ece2-46f8-87ab-88622d95d742"}
{"level":"DEBUG","location":"lambda_handler:10","message":"my debug","timestamp":"2025-02-25 14:49:28,603+0000","service":"service_undefined","sampling_rate":0.5,"cold_start":false,"function_name":"test","function_memory_size":128,"function_arn":"arn:aws:lambda:eu-west-1:809313241:function:test","function_request_id":"0452d41a-ece2-46f8-87ab-88622d95d742"}
{"level":"DEBUG","location":"lambda_handler:12","message":"my debug2","timestamp":"2025-02-25 14:49:28,603+0000","service":"service_undefined","sampling_rate":0.5,"cold_start":false,"function_name":"test","function_memory_size":128,"function_arn":"arn:aws:lambda:eu-west-1:809313241:function:test","function_request_id":"0452d41a-ece2-46f8-87ab-88622d95d742"}
{"level":"DEBUG","location":"lambda_handler:14","message":"my debug3","timestamp":"2025-02-25 14:49:28,603+0000","service":"service_undefined","sampling_rate":0.5,"cold_start":false,"function_name":"test","function_memory_size":128,"function_arn":"arn:aws:lambda:eu-west-1:809313241:function:test","function_request_id":"0452d41a-ece2-46f8-87ab-88622d95d742"}
It's the same as I explained before, so I'll skip it.
Can you check whether it's a cold start and avoid refreshing the sampling decision twice via the decorator? That way we don't change the behavior of performing sampling at the constructor level. In Python it is something like this:

Please let me know if you need any additional clarification from my side. |
Hey, yes, we can do that, we have a method as well. I didn't think of that - let's do it. @am29d - whenever you're ready maybe we can reopen the PR with this change. |
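Sketched in TypeScript terms, the check could be a small hypothetical hook run before the handler; on a cold start the refresh is skipped because the constructor already made the decision during INIT:

```ts
import { Logger } from '@aws-lambda-powertools/logger';

// hypothetical before-handler hook, not the actual decorator/middleware code
const refreshUnlessColdStart = (logger: Logger, isColdStart: boolean): void => {
  if (!isColdStart) {
    // warm invocation: make a fresh sampling decision for this request
    logger.refreshSampleRateCalculation();
  }
  // cold start: keep the decision made when the Logger was instantiated
};
```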
I was looking into our implementation and there is a blocker atm that we need to resolve first. In our current implementation we have two methods; one of them is getColdStart():

```ts
public getColdStart(): boolean {
  if (this.coldStart) {
    this.coldStart = false;
    return true;
  }
  return false;
}
```

Besides the super weird design of adding a side effect to a getter, we can either change the coldStart field to protected, or add a new function to read the field (and not change it), or change the |
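One of the options above - adding a new function that reads the field without changing it - could look roughly like this, sketched on a stand-alone class; isColdStart is a hypothetical name used only for illustration:

```ts
class Utility {
  private coldStart = true;

  // existing behavior: reading the flag also flips it
  public getColdStart(): boolean {
    if (this.coldStart) {
      this.coldStart = false;
      return true;
    }
    return false;
  }

  // hypothetical addition: read the flag without the side effect, so the
  // Logger can check for a cold start without consuming it
  public isColdStart(): boolean {
    return this.coldStart;
  }
}
```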
This issue is now closed. Please be mindful that future comments are hard for our team to see. If you need more assistance, please either tag a team member or open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so. |
This is now released in version v2.16.0! |
Use case
The log sampling concept is great, but it's not very useful in practice if you have provisioned concurrency turned on, because the decision happens at initialization.
Solution/User Experience
Per-request sampling would be ideal, as opposed to per-log sampling. That would give the user the ability to see the entirety of a sampled request rather than a mix of logs from different requests.
Alternative solutions
No response
Acknowledgment
Future readers
Please react with 👍 and your use case to help us understand customer demand.