
Sometimes context.current_utc_datetime intermittently evaluates to None #241


Closed
timtylin opened this issue Dec 17, 2020 · 15 comments

@timtylin

To be honest, I'm not too sure how to replicate this reliably, so at this stage I'm mostly looking for some help on how to nail it down.

I'm currently facing an issue where, in deployed durable functions (I don't really see this locally), context.current_utc_datetime sometimes evaluates to None. I have many places in the orchestrator where I record timestamps to a database entry, so this is something I use often. It can happen at any one of the many evaluations of context.current_utc_datetime throughout the run, and I can't find any rhyme or reason as to what causes it.

I've noticed that repeatedly evaluating context.current_utc_datetime in a loop eventually returns a valid timestamp, so I've monkey-patched this hack into the orchestrator:

    from datetime import datetime
    from time import sleep

    def get_current_utc_datetime() -> datetime:
        # `context` is the orchestration context captured from the enclosing orchestrator.
        curr_time = context.current_utc_datetime
        while not curr_time:
            sleep(0.05)
            curr_time = context.current_utc_datetime
        return curr_time

However, today I'm seeing several orchestrator runs that just block, seemingly forever, in this hacky loop, for over 30 minutes now, which is much longer than I thought an orchestrator is allowed to run. I've had to manually stop the entire deployed Functions App in order to try again with another orchestrator instance, but so far they've all hit the same problem.

This is all with beta 11, within the last couple of weeks.

@davidmrdavid davidmrdavid self-assigned this Dec 17, 2020
@davidmrdavid
Collaborator

Hi @timtylin, thank you for reaching out! This seems like a serious problem so I'll be prioritizing this.
Looking at the codebase, this right here is the place where we could possibly set current_utc_datetime to None:

    def _reset_timestamp(self):
        last_timestamp = self.durable_context.decision_started_event.timestamp
        decision_started_events = [e_ for e_ in self.durable_context.histories
                                   if e_.event_type == HistoryEventType.ORCHESTRATOR_STARTED
                                   and e_.timestamp > last_timestamp]
        if len(decision_started_events) == 0:
            self.durable_context.current_utc_datetime = None
        else:
            self.durable_context.decision_started_event = \
                decision_started_events[0]
            self.durable_context.current_utc_datetime = \
                self.durable_context.decision_started_event.timestamp

That block of logic says that, if we cannot find an orchestration-started event with a timestamp newer than the last orchestration-started timestamp, we assign current_utc_datetime to None. I could imagine this breaking if the current orchestration-started event has essentially the same timestamp as the last one, although I'm unsure whether that's even possible. I will ask my team-mates about this tomorrow (PDT) and see if it makes sense to account for that case.
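Just to sketch the kind of defensive adjustment I have in mind (this is only a sketch of the idea, not necessarily the eventual hotfix), the None branch could fall back to the previous decision-started timestamp instead of clearing the value:

    if len(decision_started_events) == 0:
        # Sketch: no newer ORCHESTRATOR_STARTED event was found in the history,
        # so keep the previous decision-started timestamp rather than setting
        # current_utc_datetime to None.
        self.durable_context.current_utc_datetime = \
            self.durable_context.decision_started_event.timestamp
    else:
        self.durable_context.decision_started_event = decision_started_events[0]
        self.durable_context.current_utc_datetime = \
            self.durable_context.decision_started_event.timestamp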

In the meantime, it would help me understand this bug better if you could provide more context around how you're using this orchestrator. If you also have any time ranges and an app name we can use to search our internal logs, that'd help greatly!

Below is the metadata we'd need to find your application in our logs

  • Timeframe issue observed:
  • Function App name:
  • Function name(s):
  • Region:
  • Orchestration instance ID(s):

Thanks! Looking forward to investigating and fixing this bug :)

⚡ ⚡

@timtylin
Author

Thanks very much for looking into this @davidmrdavid, really appreciate the prompt reply.

What you said prompted me to go digging into the table storage that holds the execution history of these functions, and I think I'm seeing almost exactly what you described. Here are the history records for a relatively "normal" execution of the orchestrator (Run A):

Screenshot 2020-12-17 at 4 09 31 pm

And here's a run that's stuck (still!) in the loop waiting for context.current_utc_datetime to return something other than None, at least according to its runtime status of "Running" (Run B):

Screenshot 2020-12-17 at 4 10 22 pm

I also have another run where I attempted to shut it down with the Terminate Instance API (instances/{instanceId}/terminate) without success (at 03:34:54), forcing me to manually stop the entire Functions App (at 03:39:55) (Run C):

Screenshot 2020-12-17 at 4 08 16 pm

The common point is that there was never another orchestration-started event after the first one. However, I now realise that I have no way of proving that the code is actually being executed and hanging in the loop; from my point of view, it's possible that the orchestrator simply never started again. Either way, it seems like something fishy might be happening in the Durable Functions runtime rather than in the Python library itself?

Here's the info requested:

  • Timeframe issue observed: 2020-12-17 03:00:00 UTC onwards
  • Function App name: TimTestXWIDeploy01-manager-functionApp-ueab5eom
  • Function name(s): XWIJobOrchastrator
  • Task Hub: XWIJobOrchastratorTaskHub
  • Region: North Europe
  • Orchestration instance ID(s):
    • (for Run A) 51bd7f4dd6d447eb952353c80966cb8d
    • (for Run B) ccbfa76a22d94955b4395db733192443
    • (for Run C) b102055657164c699c019c5b17aa2450

@davidmrdavid
Collaborator

Hi @timtylin, I'm looking at your logs internally and I've identified something that could explain some of this behavior.

Your function site name is over 32 characters long, which exceeds our maximum supported length. When that happens, naming collisions can occur, which could explain some of the strange things I'm seeing in your logs. Could you please rename your function app so that its site name is under 32 characters?

Then, with regard to your get_current_utc_datetime helper function, I suspect it might be part of the problem. During normal execution, orchestrators are stopped and restarted at regular intervals as asynchronous work completes and external signals arrive, but we cannot gracefully stop your orchestration if it's stuck inside an infinite loop.
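For reference, if you ever do need the orchestrator itself to wait, the replay-safe pattern is a durable timer rather than a sleep/poll loop. A minimal sketch (the function and activity names here are placeholders, not from your app):

    from datetime import timedelta
    import azure.durable_functions as df

    def orchestrator_function(context: df.DurableOrchestrationContext):
        # current_utc_datetime is the deterministic "now" for this replay.
        fire_at = context.current_utc_datetime + timedelta(seconds=30)
        # Yield a durable timer instead of busy-waiting or calling time.sleep.
        yield context.create_timer(fire_at)
        result = yield context.call_activity("SomeActivity", "some-input")
        return result

    main = df.Orchestrator.create(orchestrator_function)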

Regardless, after cross-checking various references, I believe there was a mistake in that timestamp-updating logic, so I'll be pushing out a fix shortly! I'll need a few hours.

@davidmrdavid
Collaborator

davidmrdavid commented Dec 17, 2020

I've created a hotfix in the PR shown above ^. You might still experience some errors given the long site name, but this should take care of the one you're currently seeing. I'll try to release a new version of the SDK today or tomorrow (PDT).

@timtylin
Author

Thanks very much for this quick turnaround @davidmrdavid.

> Your function site name is over 32 characters long, which exceeds our maximum supported length. When that happens, naming collisions can occur, which could explain some of the strange things I'm seeing in your logs.

Is this documented anywhere? When naming this site I was going by the Naming rules and restrictions for Azure resources document:

Screenshot 2020-12-18 at 12 16 18 am

From a quick glance around the Durable Functions site, I can't seem to find any reference to a 32-character limit, but after some searching I was able to find something related to deployment slots in this issue. Are these related?

> Then, with regard to your get_current_utc_datetime helper function, I suspect it might be part of the problem. During normal execution, orchestrators are stopped and restarted at regular intervals as asynchronous work completes and external signals arrive, but we cannot gracefully stop your orchestration if it's stuck inside an infinite loop.

I'm definitely very eager to remove that loop, which I added when I thought this might just be a transient issue caused by an inconsistent read or something. It was only there to keep development going, and it had sort of "worked" until yesterday. It makes sense that the Terminate Instance HTTP API can't do anything about it if it relies on the orchestrator checking for a special event; I had been assuming there might be some supervisor with the ability to kill the orchestrator process.

Thanks again, I'll re-try this with a site name <32 characters, and again when the new SDK is released. Appreciate everyone's help!

@davidmrdavid
Collaborator

You got it! 😄

As for the 32-character limit, I just started an internal thread asking where this is documented; so far I only have the issue you linked as a reference. I'm sure there must be a good reason why this limit isn't automatically enforced, but at the very least I'd like to see a big warning about it somewhere. I'll let you know when/if I hear back.

With respect to terminating orchestrators, I can't say with complete certainty that your while-loop is the reason why they seem stuck (due to being in public preview, our telemetry is limited at the moment), but it's certainly an anti-pattern for durable functions.

Also, I just merged the hotfix and should be able to make a release early tomorrow (PDT). I'll update this thread!

@timtylin
Author

timtylin commented Dec 18, 2020

Alright, I have a quick update after trying with new function site names (and a new TaskHub for good measure) but still using the old service plan:

  • The loop-forever problem seems to be gone (though I've also added a timeout that throws an exception; see the sketch below). Now I'd say about 70% of the jobs get a valid timestamp on the first try, but the other 30% only get one after a couple of seconds of retrying, so I'm still seeing the None issue (n=10)
  • Now that I'm running some of these in parallel as a stress test, I'm also noticing some large delays (upwards of minutes) between subsequent Activities in an orchestrator, even though the logic separating them should be instantaneous. The new timeout for get_current_utc_datetime is only 5 seconds, so that doesn't really explain it. The Activities themselves do take a long time (also several minutes), so I'm now wondering whether there are limits to how many Activities a particular TaskHub can run in parallel under a Linux consumption plan? I had assumed this might just be a limitation of the public beta, but either way I'm surprised by how poor the parallel scaling is.
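For reference, the bounded version of the helper now looks roughly like this (still just a stopgap inside the orchestrator body, where context is the orchestration context; I plan to delete it once the fix is confirmed):

    from datetime import datetime
    from time import monotonic, sleep

    def get_current_utc_datetime(timeout_s: float = 5.0) -> datetime:
        # Retry for a bounded time, then fail loudly instead of spinning forever.
        deadline = monotonic() + timeout_s
        curr_time = context.current_utc_datetime
        while not curr_time:
            if monotonic() > deadline:
                raise RuntimeError("current_utc_datetime stayed None past the timeout")
            sleep(0.05)
            curr_time = context.current_utc_datetime
        return curr_time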

@ConnorMcMahon

@timtylin

One thing to consider is decreasing maxConcurrentActivityFunctions to a smaller value (like 1-3). For C#, the default is a fairly reasonable value, but Python as a language does not handle concurrency quite as well. With the default we can fetch up to 10 activity queue messages per worker, but on Python those messages are likely processed one at a time. That means the remaining 9 messages just sit on the worker waiting to be processed, instead of remaining on the queue where our Scale Controller can see that there is pending work and scale out!
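For example, something along these lines in host.json would cap per-worker activity concurrency (the exact values here are just a starting point to experiment with, not a recommendation):

    {
        "version": "2.0",
        "extensions": {
            "durableTask": {
                "maxConcurrentActivityFunctions": 2,
                "maxConcurrentOrchestratorFunctions": 5
            }
        }
    }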

In general, a lot of our default settings for the host.json were selected with C# in mind. We are now realizing that these defaults may need to be tweaked on a per language basis to make sure that the out-of-box experience for developers on each language is far more positive. These changes may take some time to roll out, but I would expect them in early 2021. In the meantime you can tweak these settings yourself to optimize performance.

@timtylin
Author

Hi @ConnorMcMahon

That's actually quite a good point! It also makes me wonder whether Activity functions can be declared as async. A lot of the time spent in my Activity functions is essentially just coordinating Azure resources via the Python SDK and waiting for pollers to finish before descending further down an "implicit" dependency list. I've really wanted to make these async so I can also use the aio versions of the Azure Python SDKs, but I don't recall seeing an example of Activity functions being declared async, so I don't know whether it's safe to assume they can be.
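For context, this is roughly what I'd like to be able to write (just a sketch; main, the job_id binding name, and the asyncio.sleep stand-in for an aio SDK call are all placeholders):

    import asyncio

    async def main(job_id: str) -> str:
        # Hypothetical async Activity: await the aio variant of an Azure SDK
        # poller instead of blocking the worker while it completes.
        await asyncio.sleep(1)  # stand-in for e.g. `await poller.result()`
        return f"resources ready for {job_id}"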

Just want to give a shoutout again to you guys for being so responsive and supportive. I know this isn't really a proper support channel so I am definitely appreciative. Cheers!

@ConnorMcMahon

I believe there is nothing stopping activity functions from being async, but @davidmrdavid would know better. In general, I believe you would still need to increase FUNCTIONS_WORKER_PROCESS_COUNT to actually execute these asynchronous functions concurrently.

I am trying to figure out how much that would help on a consumption plan, where all of the workers have one core, by talking with the internal Azure Functions Python folks, but in general you are probably going to need to run some tests to find the optimal settings for your app.

This drive to make the default settings work better for apps is an "Azure Functions"-wide initiative, as having to tweak all of these settings makes the platform feel less "serverless". However, this may take some time, as tweaking defaults may help some scenarios and hurt others, so we want to make sure we do it in a thoughtful way.

@davidmrdavid
Collaborator

Hi @timtylin, you can find the ✨ new release ✨, and its release notes, here:

As for async activity functions, my understanding is that we should support them, but now that I think about it, we haven't thoroughly tested that yet. I'll add it as a ToDo :)

In the meantime, please let us know if the latest release addresses your issue!

@davidmrdavid
Collaborator

Hi @timtylin, just wondering if you managed to try the latest version and if that appears to mitigate the issue? :)

@timtylin
Author

Hi @davidmrdavid

I've been stress-testing over the weekend and so far I haven't seen it return None, so I'm happy to say that this no longer happens. I do wonder if this has uncovered some other underlying issue, as I'm still seeing some unexplained long gaps (>100s) between successive timestamps when the only thing in between is an Activity that does a single CosmosDB write. At first I thought it was a concurrency issue (I've set max Activity concurrency to 1), but this is the only orchestrator running at that time, so I'm still a bit puzzled about how these delays happen.

Thank you very much for resolving the original issue with such a quick turnaround. I just wish there were some way for me to leave you a great internal review 👏

@ConnorMcMahon

@timtylin,

Our whole team is fairly active on our GitHub repos, so your feedback about @davidmrdavid's work on this issue is noted 🥇.

It sounds like we should close this issue, but I would recommend opening a separate issue for those weird gaps you are seeing, and we can take a look at them. We should have our internal telemetry fully wired up now, so if you give us a timestamp and, ideally, the orchestration instance ID for those weird gaps (plus as much information about the orchestration as you feel comfortable sharing publicly), that would help us diagnose the issue and see whether there are some easy tweaks in the meantime.

@davidmrdavid
Collaborator

Thanks for the kind words! And +1 to opening a new issue about the delays; we've seen similar issues before, so I'd be interested to see whether this is related.
