Skip to content

[Bug] Timeout happening even though the function fully runs #1552

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
rokcarl opened this issue Jul 29, 2024 · 14 comments
Closed

[Bug] Timeout happening even though the function fully runs #1552

rokcarl opened this issue Jul 29, 2024 · 14 comments

Comments

@rokcarl
Copy link

rokcarl commented Jul 29, 2024

Expected Behavior

I expect the function to run without a timeout.

Actual Behavior

The function timeouts.

Steps to Reproduce

  1. Implement a timer trigger function.
  2. Run it.
  3. Hope that it times out even though all the lines have run.

Relevant code being tried

@app.function_name(name="salesforce_report")
@app.timer_trigger(schedule="0 */1 * * * *", arg_name="timer", run_on_startup=True)
def salesforce_report(timer: func.TimerRequest) -> None:
    report_id = config.get("salesforce.default_report_id")
    order_numbers = get_salesforce().get_column_entries_by_id(report_id)
    logging.info(f"Salesforce report {report_id}, found {len(order_numbers)} orders.")
    save_ttl = config.get("maltrix.order_processing_single_timeout_seconds")
    for order_number in order_numbers:
        if cache.check_and_save("orders", order_number, save_ttl):
            logging.info(f"Found order {order_number}, it's new, saving to db.")
            queue.enqueue("orders", order_number)
        else:
            logging.info(f"Found order {order_number}, it's old, continuing.")

Relevant log output

[2024-07-25T20:46:47.658Z] Executing 'Functions.salesforce_report' (Reason='Timer fired at 2024-07-25T22:46:47.5662450+02:00', Id=e7a3a19c-144c-4271-a7cc-8bf3aeaea000)
[2024-07-25T20:46:47.662Z] Trigger Details: UnscheduledInvocationReason: RunOnStartup
[2024-07-25T20:46:49.637Z] Salesforce report [report_id], found 8 orders.
[2024-07-25T20:46:49.910Z] Found order O1, it's new, saving to db.
[2024-07-25T20:46:50.323Z] Executing 'Functions.process_orders_queue_item' (Reason='New queue message detected on 'orders'.', Id=370812a9-dab8-446f-a791-b5e66d9150e9)
[2024-07-25T20:46:50.324Z] Trigger Details: MessageId: 16a64f49-2ddc-40b4-901d-594d38d2757c, DequeueCount: 1, InsertedOn: 2024-07-25T20:46:50.000+00:00
[2024-07-25T20:46:50.354Z] Processing order number O1.
[2024-07-25T20:46:50.354Z] process_orders_queue_item:
[2024-07-25T20:46:50.366Z] Found order O2, it's new, saving to db.
[2024-07-25T20:46:50.818Z] Executing 'Functions.process_orders_queue_item' (Reason='New queue message detected on 'orders'.', Id=3cea9e81-9f68-4581-9a37-0543e52fdcd9)
[2024-07-25T20:46:50.822Z] Trigger Details: MessageId: 99c84f4a-64b0-4a75-af85-fe8b60956149, DequeueCount: 1, InsertedOn: 2024-07-25T20:46:50.000+00:00
[2024-07-25T20:46:50.823Z] Found order O3, it's new, saving to db.
[2024-07-25T20:46:51.035Z] Executing 'Functions.process_orders_queue_item' (Reason='New queue message detected on 'orders'.', Id=58393ad4-4176-4b48-b7f8-01df9fd5ebc3)
[2024-07-25T20:46:51.035Z] Trigger Details: MessageId: f4989039-a10c-4ba1-8cff-6d4b21fdb669, DequeueCount: 1, InsertedOn: 2024-07-25T20:46:50.000+00:00
[2024-07-25T20:46:51.273Z] Found order O4, it's new, saving to db.
[2024-07-25T20:46:51.473Z] Executing 'Functions.process_orders_queue_item' (Reason='New queue message detected on 'orders'.', Id=0876aff5-cce5-43a8-96d7-20f14364c309)
[2024-07-25T20:46:51.473Z] Trigger Details: MessageId: ddf4bf8a-c687-48cb-a789-e57536ce620f, DequeueCount: 1, InsertedOn: 2024-07-25T20:46:51.000+00:00
[2024-07-25T20:46:51.738Z] Found order O5, it's new, saving to db.
[2024-07-25T20:46:51.848Z] Host lock lease acquired by instance ID '0000000000000000000000001DA5AB02'.
[2024-07-25T20:46:51.925Z] Executing 'Functions.process_orders_queue_item' (Reason='New queue message detected on 'orders'.', Id=b4f28b7c-a72b-416d-911c-e5ce95654904)
[2024-07-25T20:46:51.925Z] Trigger Details: MessageId: cae78e4d-a575-497a-96d2-6ff397b33cda, DequeueCount: 1, InsertedOn: 2024-07-25T20:46:51.000+00:00
[2024-07-25T20:46:52.187Z] Found order O6, it's new, saving to db.
[2024-07-25T20:46:52.356Z] Executing 'Functions.process_orders_queue_item' (Reason='New queue message detected on 'orders'.', Id=fbf28097-465c-44fc-bc77-40130f66c23a)
[2024-07-25T20:46:52.356Z] Trigger Details: MessageId: 7d60a12e-270d-4e10-aad3-7ee02e6c0ab9, DequeueCount: 1, InsertedOn: 2024-07-25T20:46:52.000+00:00
[2024-07-25T20:46:52.640Z] Found order O7, it's new, saving to db.
[2024-07-25T20:46:52.830Z] Processing order O1
[2024-07-25T20:46:53.107Z] Found order O8, it's new, saving to db.
[2024-07-25T20:46:53.138Z] Executing 'Functions.process_orders_queue_item' (Reason='New queue message detected on 'orders'.', Id=dd09b1c4-afa6-4d5e-b410-a8bda6b96100)
[2024-07-25T20:46:53.138Z] Trigger Details: MessageId: 113f8273-0f16-4ead-bee3-d1d99df37bd3, DequeueCount: 1, InsertedOn: 2024-07-25T20:46:52.000+00:00
...
[2024-07-25T20:56:47.702Z] Timeout value of 00:10:00 exceeded by function 'Functions.salesforce_report' (Id: 'e7a3a19c-144c-4271-a7cc-8bf3aeaea000'). Initiating cancellation.
[2024-07-25T20:56:47.763Z] Executed 'Functions.salesforce_report' (Failed, Id=e7a3a19c-144c-4271-a7cc-8bf3aeaea000, Duration=600196ms)
[2024-07-25T20:56:47.764Z] Microsoft.Azure.WebJobs.Host: Timeout value of 00:10:00 was exceeded by function: Functions.salesforce_report.
...

requirements.txt file

amqp==5.2.0
artifacts-keyring==0.3.4
asn1crypto==1.5.1
asttokens==2.4.0
async-timeout==4.0.3
attrs==23.1.0
Authlib==1.2.1
azure-ai-formrecognizer==3.3.1
azure-ai-vision-imageanalysis==1.0.0b1
azure-common==1.1.28
azure-core==1.29.5
azure-cosmos==4.5.1
azure-functions==1.20.0
azure-functions-durable==1.2.9
azure-identity==1.14.1
azure-storage-blob==12.18.3
azure-storage-queue==12.10.0
Babel==2.14.0
backcall==0.2.0
beautifulsoup4==4.12.2
billiard==4.2.0
bleach==6.1.0
celery==5.3.6
certifi==2023.7.22
cffi==1.16.0
cfgv==3.4.0
charset-normalizer==3.3.0
click==8.1.7
click-didyoumean==0.3.0
click-plugins==1.1.1
click-repl==0.3.0
colorama==0.4.6
comm==0.1.4
cryptography==41.0.7
dacite==1.8.1
debugpy==1.8.0
decorator==5.1.1
defusedxml==0.7.1
Delorean==1.0.0
distlib==0.3.7
docx==0.2.4
entsoe-py==0.5.12
exceptiongroup==1.2.0
executing==2.0.0
fastjsonschema==2.19.0
filelock==3.12.4
fuzzywuzzy==0.18.0
geographiclib==2.0
geopy==2.4.0
greenlet==3.0.2
humanize==4.9.0
identify==2.5.30
idna==3.4
importlib-metadata==6.8.0
importlib-resources==6.1.0
iniconfig==2.0.0
isodate==0.6.1
jaraco.classes==3.3.0
jedi==0.19.1
Jinja2==3.1.2
jsonschema==4.20.0
jsonschema-specifications==2023.11.2
keyring==24.2.0
kombu==5.3.4
Levenshtein==0.23.0
lxml==4.9.3
MarkupSafe==2.1.3
matplotlib-inline==0.1.6
mccabe==0.7.0
mistune==3.0.2
more-itertools==10.1.0
msal==1.24.1
msal-extensions==1.0.0
msrest==0.7.1
mypy-extensions==1.0.0
nbclient==0.9.0
nbconvert==7.13.0
nbformat==5.9.2
nest-asyncio==1.5.8
nodeenv==1.8.0
numpy==1.26.0
oauthlib==3.2.2
packaging==23.2
pandas==1.5.2
pandocfilters==1.5.0
parso==0.8.3
pathspec==0.11.2
pickleshare==0.7.5
Pillow==10.1.0
platformdirs==3.11.0
pluggy==1.3.0
portalocker==2.8.2
prompt-toolkit==3.0.41
psutil==5.9.5
pure-eval==0.2.2
pycodestyle==2.11.0
pycparser==2.21
pydantic==2.7.2
pyflakes==3.1.0
Pygments==2.16.1
PyJWT==2.8.0
pyOpenSSL==23.3.0
pytest==7.4.2
pytest-azurepipelines==1.0.5
pytest-nunit==1.0.4
python-dateutil==2.8.2
python-docx==0.8.11
python-Levenshtein==0.23.0
pytz==2023.3.post1
PyYAML==6.0.1
pyzmq==25.1.1
rapidfuzz==3.5.2
redis==5.0.7
referencing==0.32.0
reportforce==0.0.7
requests==2.31.0
requests-file==1.5.1
requests-mock==1.11.0
requests-oauthlib==1.3.1
requests-toolbelt==1.0.0
rpds-py==0.15.2
sentry-sdk==2.11.0
simple-salesforce==1.12.2
six==1.16.0
snowflake-connector-python[pandas]==3.6.0
snowflake-sqlalchemy==1.5.1
sortedcontainers==2.4.0
soupsieve==2.5
SQLAlchemy==1.4.50
stack-data==0.6.3
tinycss2==1.2.1
tomli==2.0.1
tomlkit==0.12.3
tornado==6.3.3
traitlets==5.13.0
typing_extensions==4.8.0
tzdata==2023.3
tzlocal==5.2
urllib3==2.0.6
vine==5.1.0
virtualenv==20.24.5
wcwidth==0.2.8
webencodings==0.5.1
zeep==4.2.1
zipp==3.17.0
msgraph-sdk
trio==0.24.0

Where are you facing this problem?

Local - Core Tools

Function app name

No response

Additional Information

As you can see from the log, the report found 8 orders and then 8 times saved it to the database, processing all the orders. This all happens in 6 seconds, then 10 minutes later it times out (have 10 min in hosts.json).

Curiously, the processing of the orders happens through a queue trigger and 3 of the 8 orders also timeout after 10 minutes, but I decided to show this example where it's pretty clear from the code that everything ran.

I would love to give a smaller example, but I don't see a way for me to debug this and figure out what is causing the timeout, so I'd be able to replicate it on a smaller scale.

@hallvictoria
Copy link
Contributor

Hi @rokcarl, thanks for reporting.

I tried replicating with a sample timer trigger function and don't encounter the timeout issue. The timeout issue can occur though if there's an exception that occurs somewhere within the function app code.

Looking at the logs you provided, the timeout error occurred for the first execution, the timer trigger function. Also, while the logs show 8 "'Functions.process_orders_queue_item'" messages, it doesn't show any "Executed" messages, meaning that none of those executions finished. You mentioned that 3 of the queue trigger executions timed out as well, so that makes sense, but that could mean an exception is also happening in that function.

Debugging is a little tricky here since there are multiple components involved, but here are a couple suggestions:
- Deploy the app. App Insights might provide some more logs, and you can also provide the app name here so we can take a closer look
- Adding some more log lines in the timer trigger function (before if statement in loop, after queue.enqueue, after the loop ends, etc.). This could help figure out where an exception might be occurring in this function

@rokcarl
Copy link
Author

rokcarl commented Jul 29, 2024

The timeout issue can occur though if there's an exception that occurs somewhere within the function app code.

This sounds like a bug? Why would a function time out, what is the runtime waiting for if the code breaks?

it doesn't show any "Executed" messages

There were, but I chose not to show every log line.

Deploy the app.

That sounds counterintuitive and definitely not the best way to debug an app. I would think I'd be able to debug locally?

Adding some more log lines

Thanks, I'll try that and report back.

@rokcarl
Copy link
Author

rokcarl commented Jul 30, 2024

Did what you suggested, it only solidified my assumptions.

new Python code with more logging, less orders
@app.function_name(name="salesforce_report")
@app.timer_trigger(schedule="0 */1 * * * *", arg_name="timer", run_on_startup=True)
def salesforce_report(timer: func.TimerRequest) -> None:
    report_id = config.get("salesforce.default_maltrix_report_id")
    order_numbers = ["DE000742510", "123"]
    logging.info(f"[TIMER] Salesforce report {report_id}, found {len(order_numbers)} orders.")
    save_ttl = config.get("maltrix.order_processing_single_timeout_seconds")
    for order_number in order_numbers:
        if cache.check_and_save("orders", order_number, save_ttl):
            logging.info(f"[TIMER] Found order {order_number}, it's new, saving to db.")
            queue.enqueue("orders", order_number)
        else:
            logging.info(f"[TIMER] Found order {order_number}, it's old, continuing.")
    logging.info("[TIMER] finished.")

@app.function_name(name="process_orders_queue_item")
@app.queue_trigger(arg_name="msg", queue_name="orders", connection="AZURE_STORAGE_CONNECTION_STRING")
async def process_orders_queue_item(msg: func.QueueMessage) -> None:
    logging.info("process_orders_queue_item:")
    order_number = msg.get_body().decode("utf-8")
    logging.info(f"Processing order number {order_number}.")
    try:
        app.run(order_number=order_number, run_type="SALESFORCE_REPORT")
    except Exception as e:
        capture_exception(e)
new logs
❯ func start
Found Python version 3.10.10 (python3).

Azure Functions Core Tools
Core Tools Version:       4.0.5801 Commit hash: N/A +5ac2f09758b98257e728dd1b5576ce5ea9ef68ff (64-bit)
Function Runtime Version: 4.34.1.22669

[2024-07-30T12:09:38.557Z] Worker process started and initialized.

Functions:

        ...

For detailed output, run func with --verbose flag.
[2024-07-30T12:09:39.309Z] Executing 'Functions.salesforce_report' (Reason='Timer fired at 2024-07-30T14:09:39.2098190+02:00', Id=66c3589f-c9e5-45ff-9de5-eafbb2d5afbb)
[2024-07-30T12:09:39.312Z] Trigger Details: UnscheduledInvocationReason: IsPastDue, OriginalSchedule: 2024-07-30T10:31:00.0000000+02:00
[2024-07-30T12:09:39.522Z] [TIMER] Salesforce report 00O9Y000000MbjJUAS, found 2 orders.
[2024-07-30T12:09:40.660Z] [TIMER] Found order DE000742510, it's new, saving to db.
[2024-07-30T12:09:41.320Z] [TIMER] Found order 123, it's new, saving to db.
[2024-07-30T12:09:41.411Z] Executing 'Functions.process_orders_queue_item' (Reason='New queue message detected on 'orders'.', Id=566be9d7-ffb0-4e5c-932e-a64347ce1852)
[2024-07-30T12:09:41.412Z] Trigger Details: MessageId: 5f8155d3-316c-497b-8d34-a7cf62ff0619, DequeueCount: 1, InsertedOn: 2024-07-30T12:09:41.000+00:00
[2024-07-30T12:09:41.457Z] process_orders_queue_item:
[2024-07-30T12:09:41.457Z] Processing order number DE000742510.
[2024-07-30T12:09:41.525Z] [TIMER] finished.
[2024-07-30T12:09:41.628Z] Executing 'Functions.process_orders_queue_item' (Reason='New queue message detected on 'orders'.', Id=67e29802-767b-483d-9eb2-3b926f1df7f6)
[2024-07-30T12:09:41.628Z] Trigger Details: MessageId: 394c8868-76c2-4804-8d7f-0f3a3ed927dd, DequeueCount: 1, InsertedOn: 2024-07-30T12:09:41.000+00:00
[2024-07-30T12:09:43.501Z] Host lock lease acquired by instance ID '0000000000000000000000001DA5AB02'.
[2024-07-30T12:09:44.521Z] Processing order DE000742510
[2024-07-30T12:09:45.674Z] Running tasks [redacted] with data {'order_number': 'DE000742510', 'run_id': 'ad025629-62fa-44ad-85ef-e8286bbcd109', 'run_type': 'SALESFORCE_REPORT'}
[2024-07-30T12:09:45.674Z] Running the sanity check on order DE000742510, run id ad025629-62fa-44ad-85ef-e8286bbcd109.
[2024-07-30T12:09:48.517Z] Opp: [redacted]
[2024-07-30T12:09:55.549Z] Customer does not have classic zeroing.
[2024-07-30T12:09:56.388Z] ClientSecretCredential.get_token succeeded
[2024-07-30T12:09:56.851Z] HTTP Request: GET https://graph.microsoft.com/[redacted] "HTTP/2 200 OK"
[2024-07-30T12:09:58.107Z] ClientSecretCredential.get_token succeeded
[2024-07-30T12:09:59.278Z] HTTP Request: GET https://graph.microsoft.com/[redacted] "HTTP/2 200 OK"
[2024-07-30T12:09:59.736Z] ClientSecretCredential.get_token succeeded
[2024-07-30T12:10:00.278Z] HTTP Request: GET https://graph.microsoft.com/[redacted] "HTTP/2 200 OK"
[2024-07-30T12:10:00.666Z] ClientSecretCredential.get_token succeeded
[2024-07-30T12:10:01.096Z] HTTP Request: GET https://graph.microsoft.com/[redacted] "HTTP/2 200 OK"
[2024-07-30T12:10:01.475Z] ClientSecretCredential.get_token succeeded
[2024-07-30T12:10:01.919Z] HTTP Request: GET https://graph.microsoft.com/[redacted] "HTTP/2 200 OK"
[2024-07-30T12:10:02.678Z] ClientSecretCredential.get_token succeeded
[2024-07-30T12:10:03.120Z] HTTP Request: GET https://graph.microsoft.com/[redacted] "HTTP/2 200 OK"
[2024-07-30T12:10:03.549Z] ClientSecretCredential.get_token succeeded
[2024-07-30T12:10:03.952Z] HTTP Request: GET https://graph.microsoft.com/[redacted] "HTTP/2 200 OK"
[2024-07-30T12:10:03.969Z] All problems: ['DC Date Problem: 2024-02-09 14:06:09.424000, FF Date: 2023-12-18']
[2024-07-30T12:10:03.969Z] CheckResults([redacted])
[2024-07-30T12:10:06.369Z] Running tasks [redacted] with data {'order_number': 'DE000742510', 'run_id': 'ad025629-62fa-44ad-85ef-e8286bbcd109', 'run_type': 'SALESFORCE_REPORT'}
[2024-07-30T12:10:06.372Z] Extract AP for order DE000742510, run id ad025629-62fa-44ad-85ef-e8286bbcd109.
[2024-07-30T12:10:39.338Z] Timeout value of 00:01:00 exceeded by function 'Functions.salesforce_report' (Id: '66c3589f-c9e5-45ff-9de5-eafbb2d5afbb'). Initiating cancellation.
[2024-07-30T12:10:39.457Z] Executed 'Functions.salesforce_report' (Failed, Id=66c3589f-c9e5-45ff-9de5-eafbb2d5afbb, Duration=60227ms)
[2024-07-30T12:10:39.457Z] Microsoft.Azure.WebJobs.Host: Timeout value of 00:01:00 was exceeded by function: Functions.salesforce_report.
[2024-07-30T12:10:41.420Z] Timeout value of 00:01:00 exceeded by function 'Functions.process_orders_queue_item' (Id: '566be9d7-ffb0-4e5c-932e-a64347ce1852'). Initiating cancellation.
[2024-07-30T12:10:41.436Z] Executed 'Functions.process_orders_queue_item' (Failed, Id=566be9d7-ffb0-4e5c-932e-a64347ce1852, Duration=60034ms)
[2024-07-30T12:10:41.436Z] Microsoft.Azure.WebJobs.Host: Timeout value of 00:01:00 was exceeded by function: Functions.process_orders_queue_item.
[2024-07-30T12:10:41.515Z] A function timeout has occurred. Restarting worker process executing invocationId '66c3589f-c9e5-45ff-9de5-eafbb2d5afbb'.
[2024-07-30T12:10:41.660Z] Timeout value of 00:01:00 exceeded by function 'Functions.process_orders_queue_item' (Id: '67e29802-767b-483d-9eb2-3b926f1df7f6'). Initiating cancellation.
[2024-07-30T12:10:41.676Z] Executed 'Functions.process_orders_queue_item' (Failed, Id=67e29802-767b-483d-9eb2-3b926f1df7f6, Duration=60047ms)
[2024-07-30T12:10:41.676Z] Microsoft.Azure.WebJobs.Host: Timeout value of 00:01:00 was exceeded by function: Functions.process_orders_queue_item.
[2024-07-30T12:10:42.099Z] Worker process started and initialized.
[2024-07-30T12:10:43.451Z] A function timeout has occurred. Restarting worker process executing invocationId '566be9d7-ffb0-4e5c-932e-a64347ce1852'.
[2024-07-30T12:10:43.452Z] Restart of language worker process(es) completed.
[2024-07-30T12:10:43.472Z] Message has reached MaxDequeueCount of 1. Moving message to queue 'orders-poison'.
[2024-07-30T12:10:43.711Z] A function timeout has occurred. Restarting worker process executing invocationId '67e29802-767b-483d-9eb2-3b926f1df7f6'.
[2024-07-30T12:10:43.711Z] Restart of language worker process(es) completed.
[2024-07-30T12:10:43.713Z] Message has reached MaxDequeueCount of 1. Moving message to queue 'orders-poison'.
[2024-07-30T12:10:52.113Z] Restart of language worker process(es) completed.

As you can clearly see, the following happens:

  1. salesforce_report runs, ID 66c3589f-c9e5-45ff-9de5-eafbb2d5afbb,
  2. Finds 2 orders, puts them on the Azure queue,
  3. The queue already triggers the process_orders_queue_item function,
  4. The salesforce_report function is done, last line runs, [TIMER] finished. is displayed in the logs.
  5. But for some reason, it keeps running, then times out after 1 minute (I changed the setting in host.json), same ID 66c3589f-c9e5-45ff-9de5-eafbb2d5afbb.

@hallvictoria
Copy link
Contributor

hallvictoria commented Jul 30, 2024

Thanks for trying this out. I tried to repro with a more simplified example, but I'm not getting the timeout error.

Code:
import logging
import os

import azure.functions as func
from azure.storage.queue import QueueClient, BinaryBase64EncodePolicy

queue_client = QueueClient.from_connection_string(os.getenv("AzureWebJobsStorage"), "v-queue")
queue_client.message_encode_policy = BinaryBase64EncodePolicy()

app = func.FunctionApp()

@app.function_name(name="salesforce_report")
@app.timer_trigger(schedule="0 */1 * * * *", arg_name="timer", run_on_startup=True)
def salesforce_report(timer: func.TimerRequest) -> None:
    order_numbers = ["DE000742510", "123"]
    logging.info(f"[TIMER] Salesforce report, found {len(order_numbers)} orders.")
    for order_number in order_numbers:
        logging.info(f"[TIMER] Found order {order_number}, it's new, saving to db.")
        order_bytes = order_number.encode('utf-8')
        queue_client.send_message(queue_client.message_encode_policy.encode(order_bytes))
    logging.info("[TIMER] finished.")


@app.function_name(name="process_orders_queue_item")
@app.queue_trigger(arg_name="msg", queue_name="v-queue", connection="AzureWebJobsStorage")
async def process_orders_queue_item(msg: func.QueueMessage) -> None:
    logging.info("process_orders_queue_item:")
    order_number = msg.get_body()
    logging.info(f"Processing order number {order_number}.")
    try:
        app.run(order_number=order_number, run_type="SALESFORCE_REPORT")
    except Exception as e:
        logging.error(f"Error processing order {order_number}: {e}")
Logs:
[2024-07-30T20:23:27.043Z] Executing 'Functions.salesforce_report' (Reason='Timer fired at 2024-07-30T15:23:26.8730618-05:00', Id=2a373ebb-29cc-4e3a-addf-47af026341b0)
[2024-07-30T20:23:27.058Z] Trigger Details: UnscheduledInvocationReason: IsPastDue, OriginalSchedule: 2024-07-30T15:22:00.0000000-05:00
[2024-07-30T20:23:27.177Z] Received FunctionInvocationRequest, request ID: dfaa5aae-c07d-4a3a-95c4-24f12c7aa315, function ID: 850b8097-953b-5309-b5c5-42d1a2c012dd, function name: salesforce_report, invocation ID: 2a373ebb-29cc-4e3a-addf-47af026341b0, function type: sync, timestamp (UTC): 2024-07-30 20:23:27.175628, sync threadpool max workers: 1000
[2024-07-30T20:23:27.188Z] [TIMER] Salesforce report, found 2 orders.
[2024-07-30T20:23:27.188Z] [TIMER] Found order DE000742510, it's new, saving to db.
[2024-07-30T20:23:28.049Z] [TIMER] Found order 123, it's new, saving to db.
[2024-07-30T20:23:28.107Z] Executing 'Functions.process_orders_queue_item' (Reason='New queue message detected on 'v-queue'.', Id=a1e115a1-c13b-4149-8ca3-8dd5f5e96e32)
[2024-07-30T20:23:28.108Z] Trigger Details: MessageId: 50e6fd9f-d9ec-4ceb-9a18-817ac5fb6e9e, DequeueCount: 1, InsertedOn: 2024-07-30T20:23:27.000+00:00
[2024-07-30T20:23:28.116Z] Received FunctionInvocationRequest, request ID: dfaa5aae-c07d-4a3a-95c4-24f12c7aa315, function ID: 29d049ba-79f4-5f8b-952f-1736cb468610, function name: process_orders_queue_item, invocation ID: a1e115a1-c13b-4149-8ca3-8dd5f5e96e32, function type: async, timestamp (UTC): 2024-07-30 20:23:28.113793
[2024-07-30T20:23:28.125Z] Processing order number b'DE000742510'.
[2024-07-30T20:23:28.125Z] Error processing order b'DE000742510': 'FunctionApp' object has no attribute 'run'
[2024-07-30T20:23:28.125Z] process_orders_queue_item:
[2024-07-30T20:23:28.129Z] [TIMER] finished.
[2024-07-30T20:23:28.189Z] Executing 'Functions.process_orders_queue_item' (Reason='New queue message detected on 'v-queue'.', Id=2d218ccf-6971-4604-b39b-53434c1f36a7)
[2024-07-30T20:23:28.192Z] Trigger Details: MessageId: 445f1594-210d-4997-97ee-7b366214dfa6, DequeueCount: 1, InsertedOn: 2024-07-30T20:23:27.000+00:00
[2024-07-30T20:23:28.214Z] Received FunctionInvocationRequest, request ID: dfaa5aae-c07d-4a3a-95c4-24f12c7aa315, function ID: 29d049ba-79f4-5f8b-952f-1736cb468610, function name: process_orders_queue_item, invocation ID: 2d218ccf-6971-4604-b39b-53434c1f36a7, function type: async, timestamp (UTC): 2024-07-30 20:23:28.207945
[2024-07-30T20:23:28.215Z] Processing order number b'123'.
**************************[2024-07-30T20:23:28.219Z] Executed 'Functions.salesforce_report' (Succeeded, Id=2a373ebb-29cc-4e3a-addf-47af026341b0, Duration=1294ms)
**************************[2024-07-30T20:23:28.228Z] Executed 'Functions.process_orders_queue_item' (Succeeded, Id=2d218ccf-6971-4604-b39b-53434c1f36a7, Duration=40ms)
[2024-07-30T20:23:28.214Z] process_orders_queue_item:
**************************[2024-07-30T20:23:28.220Z] Executed 'Functions.process_orders_queue_item' (Succeeded, Id=a1e115a1-c13b-4149-8ca3-8dd5f5e96e32, Duration=68ms)
[2024-07-30T20:23:28.215Z] Error processing order b'123': 'FunctionApp' object has no attribute 'run'

One thing that might be a bit different in this example:

  • I'm using a QueueClient from the azure-storage-queue SDK to put a message in the queue. If you're not doing that currently, could you try that?

One other thought:

  • Is the app variable overridden? Based on the @app.<statement> lines, I'm assuming you have app = func.FunctionApp(). The FunctionApp object does not have a run method, though, like app.run statement in process_orders_queue_item.

@rokcarl
Copy link
Author

rokcarl commented Jul 31, 2024

I'm using a QueueClient

I am using azure-storage-queue==12.10.0, so its the same. My queue.enqueue("orders", order_number) code runs the following:

def enqueue(queue_name: str, text: str) -> None:
    get_queue(queue_name).send_message(text)


def get_queue(queue_name) -> QueueClient:
    connection_str = config.get("AZURE_STORAGE_CONNECTION_STRING")
    return QueueClient.from_connection_string(conn_str=connection_str, queue_name=queue_name)

So similar to your code, except for encoding.

Is the app variable overridden?

Yes, my app is import azure.durable_functions as df; app = df.DFApp(http_auth_level=func.AuthLevel.FUNCTION). I've changed it to func.FunctionApp() (and also removed a bunch activity triggers as they're not supported) and the same timeout happened.


I tried creating a smaller example but failed:

function_app.py
import logging
import time

import azure.functions as func

from utils.helpers import queue

logger = logging.getLogger("azure.core.pipeline.policies.http_logging_policy")
logger.setLevel(logging.WARNING)

app = func.FunctionApp()

from azure.storage.queue import QueueClient, BinaryBase64EncodePolicy
from utils import config

queue_client = QueueClient.from_connection_string(config.get("AZURE_STORAGE_CONNECTION_STRING"), "orders")
queue_client.message_encode_policy = BinaryBase64EncodePolicy()


@app.function_name(name="salesforce_report")
@app.timer_trigger(schedule="0 */1 * * * *", arg_name="timer", run_on_startup=True)
def salesforce_report(timer: func.TimerRequest) -> None:
    order_numbers = ["DE000742510", "123"]
    logging.info(f"[TIMER] Salesforce report, found {len(order_numbers)} orders.")
    for order_number in order_numbers:
        logging.info(f"[TIMER] Found order {order_number}, it's new, saving to db.")
        order_bytes = order_number.encode('utf-8')
        queue_client.send_message(queue_client.message_encode_policy.encode(order_bytes))
    logging.info("[TIMER] finished.")


@app.function_name(name="process_orders_queue_item")
@app.queue_trigger(arg_name="msg", queue_name="orders", connection="AZURE_STORAGE_CONNECTION_STRING")
async def process_orders_queue_item(msg: func.QueueMessage) -> None:
    order_number = msg.get_body()
    logging.info(f"[ITEM] started processing {order_number}")
    time.sleep(72)
    logging.info("[ITEM] finished processing")

output
❯ func start
Found Python version 3.10.10 (python3).

Azure Functions Core Tools
Core Tools Version:       4.0.5801 Commit hash: N/A +5ac2f09758b98257e728dd1b5576ce5ea9ef68ff (64-bit)
Function Runtime Version: 4.34.1.22669

[2024-07-31T07:00:18.608Z] Worker process started and initialized.

Functions:

        process_orders_queue_item: queueTrigger

        salesforce_report: timerTrigger

For detailed output, run func with --verbose flag.
[2024-07-31T07:00:19.263Z] Executing 'Functions.salesforce_report' (Reason='Timer fired at 2024-07-31T09:00:19.1976890+02:00', Id=89daf718-bad6-4999-b05c-946259590a0b)
[2024-07-31T07:00:19.266Z] Trigger Details: UnscheduledInvocationReason: RunOnStartup
[2024-07-31T07:00:19.454Z] [TIMER] Salesforce report, found 2 orders.
[2024-07-31T07:00:19.454Z] [TIMER] Found order DE000742510, it's new, saving to db.
[2024-07-31T07:00:19.606Z] [TIMER] Found order 123, it's new, saving to db.
[2024-07-31T07:00:19.668Z] [TIMER] finished.
[2024-07-31T07:00:19.719Z] Executed 'Functions.salesforce_report' (Succeeded, Id=89daf718-bad6-4999-b05c-946259590a0b, Duration=504ms)
[2024-07-31T07:00:19.750Z] Executing 'Functions.process_orders_queue_item' (Reason='New queue message detected on 'orders'.', Id=54287cf9-4fda-48c3-b9fd-2da94286400e)
[2024-07-31T07:00:19.751Z] Trigger Details: MessageId: b601bcf7-c5f5-43cc-9f8f-00126deb8f46, DequeueCount: 1, InsertedOn: 2024-07-31T07:00:19.000+00:00
[2024-07-31T07:00:19.766Z] Executing 'Functions.process_orders_queue_item' (Reason='New queue message detected on 'orders'.', Id=c3f0483c-f738-4846-a9b4-cbb67461a039)
[2024-07-31T07:00:19.767Z] Trigger Details: MessageId: d002fb8f-f89d-4298-8783-94b638523937, DequeueCount: 1, InsertedOn: 2024-07-31T07:00:19.000+00:00
[2024-07-31T07:00:19.861Z] [ITEM] started processing b'REUwMDA3NDI1MTA='
[2024-07-31T07:00:23.506Z] Host lock lease acquired by instance ID '0000000000000000000000001DA5AB02'.
[2024-07-31T07:01:00.024Z] Executing 'Functions.salesforce_report' (Reason='Timer fired at 2024-07-31T09:01:00.0175580+02:00', Id=f50e99b7-325e-49dd-b8ca-15bbc438073b)
[2024-07-31T07:01:19.761Z] Timeout value of 00:01:00 exceeded by function 'Functions.process_orders_queue_item' (Id: '54287cf9-4fda-48c3-b9fd-2da94286400e'). Initiating cancellation.
[2024-07-31T07:01:19.769Z] Timeout value of 00:01:00 exceeded by function 'Functions.process_orders_queue_item' (Id: 'c3f0483c-f738-4846-a9b4-cbb67461a039'). Initiating cancellation.
[2024-07-31T07:01:19.825Z] Executed 'Functions.process_orders_queue_item' (Failed, Id=c3f0483c-f738-4846-a9b4-cbb67461a039, Duration=60057ms)[2024-07-31T07:01:19.825Z] Executed 'Functions.process_orders_queue_item' (Failed, Id=54287cf9-4fda-48c3-b9fd-2da94286400e, Duration=60085ms)
[2024-07-31T07:01:19.825Z] Microsoft.Azure.WebJobs.Host: Timeout value of 00:01:00 was exceeded by function: Functions.process_orders_queue_item.

[2024-07-31T07:01:19.825Z] Microsoft.Azure.WebJobs.Host: Timeout value of 00:01:00 was exceeded by function: Functions.process_orders_queue_item.
[2024-07-31T07:01:21.847Z] A function timeout has occurred. Restarting worker process executing invocationId 'c3f0483c-f738-4846-a9b4-cbb67461a039'.
[2024-07-31T07:01:21.847Z] A function timeout has occurred. Restarting worker process executing invocationId '54287cf9-4fda-48c3-b9fd-2da94286400e'.
[2024-07-31T07:01:21.863Z] Restart of language worker process(es) completed.
[2024-07-31T07:01:21.899Z] Message has reached MaxDequeueCount of 1. Moving message to queue 'orders-poison'.
[2024-07-31T07:01:22.244Z] Worker process started and initialized.
[2024-07-31T07:01:32.262Z] Restart of language worker process(es) completed.
[2024-07-31T07:01:32.266Z] Message has reached MaxDequeueCount of 1. Moving message to queue 'orders-poison'.

@rokcarl
Copy link
Author

rokcarl commented Jul 31, 2024

The thing that scares me the most about all of this is that we're in the "let's try this or maybe that" mode of debugging. Imagine having a production workload, a complex app, and then a bug that is debugged using this approach.

@hallvictoria
Copy link
Contributor

So similar to your code, except for encoding

When I didn't encode the message, I saw an error. The execution failed though; it wasn't a timeout. I'm assuming you've formatted the input differently in your code than my simple example so you're not hitting this but wanted to bring that up just in case.

[2024-07-31T14:46:01.082Z] Executed 'Functions.salesforce_report' (Failed, Id=3cd92f72-0646-4db8-b59b-d89ced05781c, Duration=974ms)
[2024-07-31T14:46:01.679Z] System.Private.CoreLib: Exception while executing function: Functions.salesforce_report. System.Private.CoreLib: Result: Failure
Exception: TypeError: Message content must not be bytes. Use the BinaryBase64EncodePolicy to send bytes.

I tried creating a smaller example but failed

From the logs provided, the salesforce_report timer function executed as expected. The process_orders_queue_item queue trigger function failed in a timeout, but that seems to be due to the time.sleep(72) call. What is the failure you're referring to here?

I'm still a little confused about the app.run(order_number=order_number, run_type="SALESFORCE_REPORT") line you had in your previous code. Even if app is a DFApp object, DFApp objects also do not have a run method AFAIK. Is run a method you defined in your code, or do you have multiple variables called app? The reason I'm asking is because that could also be causing an issue.

@rokcarl
Copy link
Author

rokcarl commented Aug 1, 2024

encode

To make encoding work, I checked a bunch of Github issues and Stack Overflow. In the end, the only thing that fixed it for me was to set the message encoding to none in host.json, so it works fine.

What is the failure you're referring to here?

Sorry I wasn't clear, I meant I failed to reproduce the problem, i.e. everything was working as it should, the timeout in that case is expected, since I sleep for more than the timeout in the queue trigger function.

I'm still a little confused about the app.run...

Whoops, sorry, this was meant to be [some_process].run(order_number=order_number, run_type="SALESFORCE_REPORT") with [some_process] being business related, I renamed (redacted) it to app when posting, forgetting that app already exists. So it's not actually calling anything on the DFApp object.

@hallvictoria
Copy link
Contributor

Since the timeout isn't reproduced with the simpler example, I'm leaning towards the issue being caused from one of the other functions being called inside. The timer trigger function itself isn't returning anything, but it will only finish executing when all of the code inside also finishes. The last log line can still execute even if the function as a whole times out, and that would be more of a python related question than functions. Are any of the other methods async, concurrent, blocking, relying on external dependencies, or something similar?

For debugging, I would recommend incrementally adding back in the more complicated logic to identify after which step the timeout occurs.

@rokcarl
Copy link
Author

rokcarl commented Aug 2, 2024

issue being caused from one of the other functions being called inside

As we already saw, the timer trigger function does two things:

  1. Gets the order numbers.
  2. Enqueues each order number in the Azure Queue.

From that, I don't see a mechanism where some "other functions being called" would cause a timeout in this timer trigger functions. You mention any other async, blocking, etc. methods. But the logging.info() call is executed (as I can see the result of it in the output), and then the function should end, if we're talking about a python related question.

For debugging, I would recommend incrementally adding back in the more complicated logic to identify after which step the timeout occurs.

I did already try this. But going with some form of bisection on the complex codebase without basic debugging tools, I might as well move to FastAPI + AKS and call it a day. I've already moved away from the promise of Azure Durable Functions as I've had so many bugs that I've encountered (one, two, three).

I would expect to be able to debug such issues of timeouts some way. If I spend days now on hunting this one down using bisection, then I know this will happen again, but at that point, we'll be in production and this is basically an outage.

@hallvictoria
Copy link
Contributor

I was referring to the get_salesforce() and cache.check_and_save() lines and checking if those methods could be causing timeouts. The simpler function app provided follows the same general scenario and doesn't hit the timeout issue. My suggestion was to take the simpler function app that was confirmed to work as expected and add back in those lines incrementally to end up with the initial function app code provided at the beginning, apologies if that wasn't clear.

There are other methods for debugging that might provide assistance as well:

@rokcarl
Copy link
Author

rokcarl commented Aug 2, 2024

I was referring to the get_salesforce() and cache.check_and_save()

get_salesforce() was not there in the latest code that broke. cache.check_and_save() was, but even then, what would be the mechanism that such a call would cause a problem? We're talking blocking Python code that ran through and lines further down already ran.

But just to indulge, I did try with the reduced code, timeout set to 30s, timeouts still occurring.

function_app.py
import logging
import os

import azure.functions as func

from api import companyapp
from utils import config
from utils.helpers import cache, queue

app = func.FunctionApp()

@app.function_name(name="salesforce_report")
@app.timer_trigger(schedule="0 */1 * * * *", arg_name="timer", run_on_startup=True)
def salesforce_report(timer: func.TimerRequest) -> None:
    report_id = config.get("salesforce.default_companyapp_report_id")
    order_numbers = ["DE000742510", "DEPV50000003001", "123"]
    logging.info(f"[TIMER] Salesforce report {report_id}, found {len(order_numbers)} orders.")
    for order_number in order_numbers:
        logging.info(f"[TIMER] Found order {order_number}, it's new, saving to db.")
        queue.enqueue("orders", order_number)
    logging.info("[TIMER] finished.")

@app.function_name(name="process_orders_queue_item")
@app.queue_trigger(arg_name="msg", queue_name="orders", connection="AZURE_STORAGE_CONNECTION_STRING")
async def process_orders_queue_item(msg: func.QueueMessage) -> None:
    logging.info("process_orders_queue_item:")
    order_number = msg.get_body().decode("utf-8")
    logging.info(f"Processing order number {order_number}.")
    try:
        companyapp.run(order_number=order_number, run_type="SALESFORCE_REPORT")
    except Exception as e:
        logging.error("EXCEPTION")
        logging.error(e)
log
❯ func start
Found Python version 3.10.10 (python3).

Azure Functions Core Tools
Core Tools Version:       4.0.5801 Commit hash: N/A +5ac2f09758b98257e728dd1b5576ce5ea9ef68ff (64-bit)
Function Runtime Version: 4.34.1.22669

[2024-08-02T20:42:18.682Z] Worker process started and initialized.

Functions:

        process_orders_queue_item: queueTrigger

        salesforce_report: timerTrigger

For detailed output, run func with --verbose flag.
[2024-08-02T20:42:19.382Z] Executing 'Functions.salesforce_report' (Reason='Timer fired at 2024-08-02T22:42:19.3106840+02:00', Id=7eb4e6f1-4dff-4e4b-90c7-73bbb0081cd7)
[2024-08-02T20:42:19.385Z] Trigger Details: UnscheduledInvocationReason: RunOnStartup
[2024-08-02T20:42:19.562Z] [TIMER] Salesforce report 00O9Y000000MbjJUAS, found 3 orders.
[2024-08-02T20:42:19.733Z] [TIMER] Found order DEPV50000003001, it's new, saving to db.
[2024-08-02T20:42:19.901Z] [TIMER] Found order 123, it's new, saving to db.
[2024-08-02T20:42:20.071Z] [TIMER] finished.
[2024-08-02T20:42:20.109Z] Executing 'Functions.process_orders_queue_item' (Reason='New queue message detected on 'orders'.', Id=d8edd5e8-05be-48cd-85ca-e541f8265498)
[2024-08-02T20:42:20.109Z] Trigger Details: MessageId: b0a702e4-1384-4bd5-b148-97706df2ed58, DequeueCount: 1, InsertedOn: 2024-08-02T20:42:20.000+00:00
[2024-08-02T20:42:21.577Z] Processing order DE000742510
[2024-08-02T20:42:22.461Z] Running tasks ['sanity_check'] with data {'order_number': 'DE000742510', 'run_id': '800bdefb-37f4-4bc1-978b-01680ef40805', 'run_type': 'SALESFORCE_REPORT'}
[2024-08-02T20:42:22.461Z] Running the sanity check on order DE000742510, run id 800bdefb-37f4-4bc1-978b-01680ef40805.
[2024-08-02T20:42:23.569Z] Host lock lease acquired by instance ID '0000000000000000000000001DA5AB02'.
[2024-08-02T20:42:27.459Z] Customer does not have classic zeroing.
[2024-08-02T20:42:28.066Z]     'Content-Length': '1753'
[2024-08-02T20:42:28.220Z] ClientSecretCredential.get_token succeeded
[2024-08-02T20:42:34.749Z] CompanyappCheckResults(remarks_salesforce_check=>DC Date Problem: 2024-02-09 14:06:09.424000, FF Date: 2023-12-18, order_number=DE000742510, run_id=800bdefb-37f4-4bc1-978b-01680ef40805, updated_at=2024-08-02 21:42:34, created_at=2024-08-02 21:42:34, salesforce_data_check=Declined, planning_check=Approved, other_information_check=Approved, meter_merging_check=Approved, legal_check=Approved)
[2024-08-02T20:42:36.483Z] Extract AP for order DE000742510, run id 800bdefb-37f4-4bc1-978b-01680ef40805.
[2024-08-02T20:42:49.490Z] Executed 'Functions.salesforce_report' (Failed, Id=7eb4e6f1-4dff-4e4b-90c7-73bbb0081cd7, Duration=30165ms)
[2024-08-02T20:42:49.490Z] Microsoft.Azure.WebJobs.Host: Timeout value of 00:00:30 was exceeded by function: Functions.salesforce_report.
[2024-08-02T20:42:49.601Z] ClientSecretCredential.get_token succeeded
[2024-08-02T20:42:49.853Z] Timeout value of 00:00:30 exceeded by function 'Functions.process_orders_queue_item' (Id: '10188f24-7113-4580-b54d-f1083afad3c8'). Initiating cancellation.
[2024-08-02T20:42:49.865Z] Executed 'Functions.process_orders_queue_item' (Failed, Id=10188f24-7113-4580-b54d-f1083afad3c8, Duration=30029ms)
[2024-08-02T20:42:49.865Z] Microsoft.Azure.WebJobs.Host: Timeout value of 00:00:30 was exceeded by function: Functions.process_orders_queue_item.
[2024-08-02T20:42:49.914Z] Timeout value of 00:00:30 exceeded by function 'Functions.process_orders_queue_item' (Id: 'd76a4747-65d3-42de-ade2-24bcc4c602ea'). Initiating cancellation.
[2024-08-02T20:42:49.918Z] Executed 'Functions.process_orders_queue_item' (Failed, Id=d76a4747-65d3-42de-ade2-24bcc4c602ea, Duration=30005ms)
[2024-08-02T20:42:49.918Z] Microsoft.Azure.WebJobs.Host: Timeout value of 00:00:30 was exceeded by function: Functions.process_orders_queue_item.
[2024-08-02T20:42:49.937Z] HTTP Request: GET https://graph.microsoft.com/v1.0/drives/[redacted] "HTTP/2 200 OK"
[2024-08-02T20:42:49.955Z] Request URL: 'https://login.microsoftonline.com/[redacted]/v2.0/.well-known/openid-configuration'

@hallvictoria
Copy link
Contributor

Apologies for the delay. I was able to repro this scenario.

The main issue here is the mixing of and interaction between sync (timer) and async (queue) functions, which isn't recommended. I'll see if we can get this highlighted in our documentation more.

Sync and async executions are handled differently. Sync calls are handled through the threadpool, and async are handled through a loop. The worker runs in the loop and only on one thread. When the worker receives an invocation request for the queue trigger, which is async, it serves as a blocking call and control switches to that call. Control isn't returned to the sync invocation to complete, so while the worker is waiting on the queue trigger functions to finish executing, the timer trigger function can't finish and return. This results in the timeout.

To address this issue, I would recommend converting the timer function to be async. azure-queue-storage-aio will help with uploading messages to the queue asynchronously. If this function needs to be synchronous, another solution would be to create a separate function app for the timer trigger.

@rokcarl
Copy link
Author

rokcarl commented Aug 28, 2024

The team did try with both async and the problem persisted.

While I don't consider this solved and I think this is an active problem you have, we are in the process of moving to Azure Container Apps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants