Exclusive tasks running without a corresponding lock #6668

GitHK · 2024-11-05T12:35:05Z

What happened?

It was detected that the resource-usage-tracker service did not have a lock inside redis and it was reporting the following error: redis.exceptions.LockNotOwnedError: Cannot reacquire a lock that's no longer owned.

What triggered this situation?

The cause is currently unknown.

What could trigger this situation?

Lock is removed manually or somehow expires.
Redis database becoming unavailable for a short amount of time.

Why is this bad?

Immagine having multiple instances relying on this lock, if somehow it is freed (and the lock_context does nothing), a different instance will acquire it and use it as if the resource (protected by the lock) was free to be used.

How does it present itself?

Consider the following code:

async with lock_context(...):
    # some user defined
    # code here

Two situations compete to producing the issue:

lock_context creates a task which raises redis.exceptions.LockNotOwnedError but only logs the issue without handling it.
lock_context (context manager) has no way of stopping the execution of the code defined by the user.

Is there a way to reproduce it in a test?

Yes, apply the following changes in your repo changes.diff.zip

Then run one of the following tests to make the issue appear:

tests/test_redis_utils.py::test_possible_regression_lock_extension_fails_if_key_is_missing
tests/test_redis_utils.py::test_possible_regression_first_extension_delayed_and_expires_key

What was already tried?

To mitigate the issue it was proposed to try and stop the user defined code.
The only possible solution for a context manger is using signals (which are only available on linux).

When running test tests/test_redis_utils.py::test_context_manager_timing_out the code hangs unexpectedly inside the signal's handler functions. This means the entire asyncio runtime will be halted.

This approach cannot be used.

tests/test_redis_utils.py::test_context_manager_timing_out which produces a very unexpected result.

    def handler(signum, frame):
>       raise RuntimeError(f"Operation timed out after {after} seconds")
E       RuntimeError: Operation timed out after 10 seconds

tests/test_redis_utils.py:190: RuntimeError
> /home/silenthk/work/pr-osparc-redis-lock-issues/packages/service-library/tests/test_redis_utils.py(190)handler()
-> raise RuntimeError(f"Operation timed out after {after} seconds")

Different proposals

Feel free to suggest more below.

@GitHK: I would do the following. The locking mechanism should receive a task (which runs the user defined code) which it can cancel if something goes wrong. This solution no longer uses a context manger, but uses a battle tested method of stopping the running user defined code.

The text was updated successfully, but these errors were encountered:

matusdrobuliak66 · 2024-11-05T14:01:25Z

Additional notes/observations:

During the same period, a similar issue was observed in storage — the same Cannot reacquire a lock error was logged. The storage background task was running properly, and the lock key was present in the Redis lock DB table. (This is strange, the Resource Tracker issue differed because the key was missing there. However, why this error was logged in storage is unknown)
OpenTelemetry tracing was introduced recently; could this be causing some blocking behavior?
I would suggest not rushing but instead taking two steps:
- Analyze if there might be any blocking issues that could have caused this.
- To make the platform more robust, implement a mechanism to handle this situation.
  - Easiest solution: Consider marking the application as unhealthy (when it restarts, it should gracefully resolve the issue by itself).

GitHK · 2024-11-05T14:10:36Z

Additional notes/observations:

During the same period, a similar issue was observed in storage — the same Cannot reacquire a lock error was logged. The storage background task was running properly, and the lock key was present in the Redis lock DB table. (This is strange, the Resource Tracker issue differed because the key was missing there. However, why this error was logged in storage is unknown)

@matusdrobuliak66 can we please make a different issue out of this one? It is something different and I don't want to mix them.

matusdrobuliak66 · 2024-11-05T14:25:02Z

Additional notes/observations:

During the same period, a similar issue was observed in storage — the same Cannot reacquire a lock error was logged. The storage background task was running properly, and the lock key was present in the Redis lock DB table. (This is strange, the Resource Tracker issue differed because the key was missing there. However, why this error was logged in storage is unknown)

@matusdrobuliak66 can we please make a different issue out of this one? It is something different and I don't want to mix them.

Yes, no problem. I just wanted to make a note of it, because since we don't fully understand the issue yet, we can't be certain that those two are not somehow interconnected.

matusdrobuliak66 · 2024-12-11T10:56:27Z

Happened in the sim4life.io in EFS guardian , started 7.12., noticed 10.12. (restarting service was quick fix)

matusdrobuliak66 · 2025-01-06T14:03:06Z

Currently happening in osparc.io (Storage, RUT, dynamic scheduler) and sim4life.io (Efs guardian).

sanderegg · 2025-01-07T13:12:51Z

Investigation

storage, dynamic-schldr, resource-usage-tracker, efs-guardian that use the exclusive_task_starter complain about
efs-guardian issue is due to synchronous code (fixed in 🐛 efs - deletion of data run in executor #7013)
in osparc.io, they all started complaining at the same date at about the same time (+/- 5 minutes)

Local tests

make up-devel,
docker pause redis container
looking at storage logs, nothing is shown until a while when one can see time-outs in the redis client, but the background task does not show any error
docker unpause redis container
looking at storage logs, it shows the same error as above and the lock disappeared from the redis database

sanderegg · 2025-01-07T20:48:38Z

Decision(s)/Action(s)

a test was designed to show case the problem by hard-removing the lock from redis
After brainstorming and also together with @pcrespov , indeed the lock_context context manager is not the way to go and a decorator shall be designed that will cancel the work task and therefore properly ensure that the exclusive task do not continue running as if nothing happened.

FI @GitHK @matusdrobuliak66

GitHK added the bug buggy, it does not work as expected label Nov 5, 2024

GitHK assigned GitHK, pcrespov, sanderegg, matusdrobuliak66, giancarloromeo and bisgaard-itis Nov 5, 2024

GitHK changed the title ~~lock_context does not realise lock is no longer owned~~ lock_context does not release lock is no longer owned Nov 5, 2024

GitHK changed the title ~~lock_context does not release lock is no longer owned~~ lock_context does not release lock when it no longer owns the lock Nov 5, 2024

GitHK changed the title ~~lock_context does not release lock when it no longer owns the lock~~ lock_context does not release lock when it no longer owns it Nov 5, 2024

GitHK added this to the Event Horizon milestone Nov 5, 2024

sanderegg unassigned giancarloromeo, pcrespov and bisgaard-itis Nov 29, 2024

sanderegg mentioned this issue Nov 29, 2024

periodic_check_of_running_services_task lock does not seem to expire #6237

Closed

matusdrobuliak66 added the High Priority a totally crucial bug/feature to be fixed asap label Dec 10, 2024

sanderegg unassigned GitHK and matusdrobuliak66 Jan 7, 2025

This was referenced Jan 7, 2025

lock.reacquire() logs LockNotOwedError but probably should be silenced #5910

Closed

Replace all usage of reds locks with lock_context #4112

Closed

sanderegg changed the title ~~lock_context does not release lock when it no longer owns it~~ Exclusive tasks running without a corresponding lock Jan 7, 2025

This was referenced Jan 8, 2025

♻️Maintenance: Refactoring of redis client structure #7015

Merged

🐛Redis locks disappearing and fixup weird usage #7020

Merged

sanderegg closed this as completed in #7020 Jan 14, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Exclusive tasks running without a corresponding lock #6668

Exclusive tasks running without a corresponding lock #6668

GitHK commented Nov 5, 2024

matusdrobuliak66 commented Nov 5, 2024 •

edited

Loading

Uh oh!

GitHK commented Nov 5, 2024

Additional notes/observations:

Uh oh!

matusdrobuliak66 commented Nov 5, 2024

Additional notes/observations:

Uh oh!

matusdrobuliak66 commented Dec 11, 2024

Uh oh!

matusdrobuliak66 commented Jan 6, 2025 •

edited

Loading

Uh oh!

sanderegg commented Jan 7, 2025

Uh oh!

sanderegg commented Jan 7, 2025

Uh oh!

Exclusive tasks running without a corresponding lock #6668

Exclusive tasks running without a corresponding lock #6668

Comments

GitHK commented Nov 5, 2024

What happened?

What triggered this situation?

What could trigger this situation?

Why is this bad?

How does it present itself?

Is there a way to reproduce it in a test?

What was already tried?

Different proposals

matusdrobuliak66 commented Nov 5, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Additional notes/observations:

Uh oh!

GitHK commented Nov 5, 2024

Additional notes/observations:

Uh oh!

matusdrobuliak66 commented Nov 5, 2024

Additional notes/observations:

Uh oh!

matusdrobuliak66 commented Dec 11, 2024

Uh oh!

matusdrobuliak66 commented Jan 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sanderegg commented Jan 7, 2025

Investigation

Local tests

Uh oh!

sanderegg commented Jan 7, 2025

Decision(s)/Action(s)

Uh oh!

matusdrobuliak66 commented Nov 5, 2024 •

edited

Loading

matusdrobuliak66 commented Jan 6, 2025 •

edited

Loading