Sometimes context.current_utc_datetime intermittently evaluates to None #241
Hi @timtylin, thank you for reaching out! This seems like a serious problem, so I'll be prioritizing it. The relevant block is in azure-functions-durable-python, azure/durable_functions/orchestrator.py, lines 144 to 155 at commit 6c25a78.

That block of logic states that, if we cannot find an orchestration-started event with a timestamp older than the last orchestration-started timestamp, we assign None to current_utc_datetime.

In the meantime, it would help me understand this bug better if you could provide more context around how you're using this orchestrator. If you also have any time ranges and an app name we can use to search our internal logs, that'd help greatly! Below is the metadata we'd need to find your application in our logs.
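For context, that logic behaves roughly like the simplified sketch below (illustrative names only, not the exact SDK code):

```python
from datetime import datetime
from typing import List, Optional


def next_current_utc_datetime(
        history: List[dict],
        last_orchestrator_started: Optional[datetime]) -> Optional[datetime]:
    """Simplified sketch of the timestamp-update step described above."""
    new_time: Optional[datetime] = None
    for event in history:
        if event.get("EventType") != "OrchestratorStarted":
            continue
        timestamp: datetime = event["Timestamp"]
        # The comparison against the previously recorded OrchestratorStarted
        # timestamp is the piece of logic under suspicion here.
        if last_orchestrator_started is None or timestamp > last_orchestrator_started:
            new_time = timestamp
    # If no qualifying event is found, the value stays None, which is what
    # surfaces as context.current_utc_datetime evaluating to None.
    return new_time
```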
Thanks! Looking forward to investigating and fixing this bug :) ⚡ ⚡
Thanks very much for looking into this @davidmrdavid, really appreciate the prompt reply. What you said prompted me to go digging into the table storage holding the execution history of these functions, and I think I'm seeing almost exactly what you're describing.

Here are the history records for a relatively "normal" execution of the orchestrator (Run A), and here's a run that's stuck (still!) in the for loop waiting for current_utc_datetime to become non-None. I also have another run where I've attempted to shut it down with the Terminate Instances HTTP API. The common point is that there was never another orchestration-started event after the first one.

However, I now realise that I don't actually have any way of proving that the code is actually being executed and that it's hanging in the for loop. From my point of view, it's possible that the orchestrator never started again. Either way, it seems to me like something fishy might be happening in the Durable Functions runtime, rather than the Python library itself?

Here's the info requested:
Hi @timtylin, I'm looking at your logs internally and I've identified something that could explain some of this behavior. Your function site name is over 32 characters long, which exceeds our maximum allowed length. When this happens, it's possible for naming collisions to take place, which could explain some of the strange things I'm seeing in your logs. Could you please rename your function site to something shorter than 32 characters?

As for your other observations: after cross-checking various references, I believe there was a mistake in that timestamp-updating logic, so I'll be pushing out a fix shortly! I'll need a few hours.
I've created a hotfix in the PR linked above. You might still experience some errors given the long site name, but this should take care of the one you're currently seeing. I'll try to release a new version of the SDK today or tomorrow (PDT).
Thanks very much for this quick turnaround @davidmrdavid.
Is this documented anywhere? When naming this site I was going by the Naming rules and restrictions for Azure resources document. From a quick glance around the Durable Functions docs, I can't seem to find any reference to a 32-character limit, but after some searching I was able to find something related to deployment slots in this issue. Are these related?
I'm definitely very eager to remove that loop, which was added when I thought this might have been just a transient issue due to an inconsistent read or something. It was just something to keep development going, and had sort of "worked" until yesterday.

It makes sense that the Terminate Instance HTTP API can't do anything about it if it's relying on the orchestrator checking for a special event. I was kind of assuming there might have been some supervisor with the ability to kill the orchestrator process.

Thanks again, I'll re-try this with a site name <32 characters, and again when the new SDK is released. Appreciate everyone's help!
You got it! 😄 As for the 32-character limit, I just started an internal thread asking where this is documented; right now I only have a reference to the issue you linked. I'm sure there must be a good reason why this limit isn't automatically enforced, but at the very least I would like to see a big warning about it somewhere. I'll let you know when/if I hear back.

With respect to terminating orchestrators, I can't say with complete certainty that your while-loop is the reason why they seem stuck (since we're in public preview, our telemetry is limited at the moment), but it's certainly an anti-pattern for Durable Functions.

Also, I just merged the hotfix and should be able to make a release early tomorrow, PDT time. I'll update this thread!
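For reference on the termination point: terminations are requested through the orchestration client and, as discussed above, rely on the orchestrator getting a chance to observe the request. A minimal sketch (the function name, binding names, and HTTP route are illustrative):

```python
# terminate_instance/__init__.py -- hypothetical HTTP-triggered client
# function that requests termination of a running orchestration.
import azure.durable_functions as df
import azure.functions as func


async def main(req: func.HttpRequest, starter: str) -> func.HttpResponse:
    client = df.DurableOrchestrationClient(starter)
    instance_id = req.params.get("instance_id")
    # The terminate request is recorded for the instance; an orchestrator
    # spinning in a tight loop may never yield to process it.
    await client.terminate(instance_id, "terminated manually")
    return func.HttpResponse(f"Requested termination of {instance_id}")
```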
Alright, I have a quick update after trying with new function site names (and a new TaskHub for good measure) but still using the old service plan:
One thing to consider trying is to decrease maxConcurrentActivityFunctions.

In general, a lot of our default settings in host.json were selected with C# in mind. We are now realizing that these defaults may need to be tweaked on a per-language basis to make sure that the out-of-box experience for developers in each language is far more positive. These changes may take some time to roll out, but I would expect them in early 2021. In the meantime you can tweak these settings yourself to optimize performance.
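For reference, the concurrency knobs mentioned here live under the durableTask section of host.json; a minimal sketch (the values below are illustrative, not recommendations):

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "maxConcurrentActivityFunctions": 1,
      "maxConcurrentOrchestratorFunctions": 1
    }
  }
}
```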
That's actually quite a good point! It also makes me wonder if the Activity functions can be declared as async.

Just want to give a shoutout again to you guys for being so responsive and supportive. I know this isn't really a proper support channel, so I am definitely appreciative. Cheers!
I believe there is nothing stopping activity functions from being async, but @davidmrdavid would know better. In general, you would still need to increase the concurrency settings as well. I am trying to figure out how much that would help on a consumption plan, where all of the workers have one core, by talking with the internal Azure Functions Python folks, but in general you are probably going to need to run some tests to find the optimal settings for your app.

This drive to make the default settings work better for apps is an "Azure Functions"-wide initiative, as having to tweak all of these settings makes the platform feel less "serverless". However, this may take some time, as tweaking defaults may help some scenarios and hurt others, so we want to make sure we do it in a thoughtful way.
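To illustrate the async point, an activity can simply be declared with async def and the Python worker will await it on its event loop; a minimal sketch (the function name and its activityTrigger binding are hypothetical):

```python
# write_record/__init__.py -- hypothetical async activity function.
import asyncio


async def main(payload: str) -> str:
    # Stand-in for non-blocking I/O such as a database write; while this
    # awaits, the worker can service other invocations.
    await asyncio.sleep(0.1)
    return f"wrote: {payload}"
```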
Hi @timtylin, you can find the ✨ new release ✨, and its release notes, here:
As for the other questions, I'll follow up once I have more information. In the meantime, please let us know if the latest release addresses your issue!
Hi @timtylin, just wondering if you managed to try the latest version and if that appears to mitigate the issue? :)
I've been stress-testing over the weekend and so far I haven't seen it return None, so I'm happy to say that this no longer happens. I do wonder if this has uncovered some other underlying issue, as I'm still seeing some unexplained long gaps (>100s) between successive timestamps, and the only thing in between is an Activity that does a single CosmosDB write. At first I thought it was a concurrency issue (I've set max Activity concurrency to 1), but this is the only orchestrator running at that time, so I'm still a bit puzzled as to how these delays happen.

Thank you very much for resolving the original issue with such a quick turnaround. I just wish there were some way for me to leave you a great internal review 👏
Our whole team is fairly active on our GitHub repos, so your feedback about @davidmrdavid's work on this issue is noted 🥇. It sounds like we should close this issue, but I would recommend opening up a separate issue for those weird gaps you are seeing, and we can take a look at those. We should have our internal telemetry all wired up now, so if you give us a timestamp and ideally the orchestration instance id with those weird gaps (and as much information about the orchestration as you feel comfortable sharing publicly), that would help us diagnose that issue and see if there are some easy tweaks in the meantime.
Thanks for the kind words! And +1 to opening up a new issue about the delays; we've seen similar issues before, so I'd be interested in seeing whether this is related. Thanks!
To be honest, I'm not too sure how to replicate this reliably, so at this stage I'm mostly looking for some help on how to nail it down.

I'm currently facing an issue where deployed durable functions (I don't really see this locally) would sometimes return None when trying to evaluate context.current_utc_datetime. I have many places in the orchestrator where I record timestamps to a database entry, so this is something I use often. This can happen at any one of the many evaluations of context.current_utc_datetime throughout the run, and I can't seem to find rhyme or reason as to what causes it.

I've noticed that repeatedly evaluating context.current_utc_datetime in a for loop would eventually return a valid timestamp, so I've monkey-patched that hack into the orchestrator (roughly like the sketch below).

However, today I'm seeing several orchestrator runs where it's just blocking, seemingly forever, on this hack-y for loop, for over 30 minutes, which is much longer than I thought an orchestrator is allowed to run. I've had to manually stop the entire deployed Functions App in order to try again with another orchestrator instance, but so far it's all been hitting the same problem.
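Simplified, the hack looks something like this (the actual orchestrator code has more going on around it):

```python
# Illustrative sketch only: poll current_utc_datetime until it stops
# returning None. This was only ever meant as a temporary workaround,
# and busy-waiting inside an orchestrator is an anti-pattern.
def wait_for_timestamp(context, max_attempts=1_000_000):
    for _ in range(max_attempts):
        now = context.current_utc_datetime
        if now is not None:
            return now
    raise RuntimeError("current_utc_datetime never became available")
```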
This is all with beta 11, within the last couple of weeks.