asyncio duplicated instrumentation causes memory leak #3383
Comments
@yonathan-wolloch-lendbuzz I guess we still have references to something, are you able to sort out the culprit with the tool you suggested? |
@xrmx I don't want to mistakenly focus the discussion on it, but there seems to be an endless increase in objects from the path /python3.13/site-packages/opentelemetry/instrumentation/asyncio/__init__.py |
@xrmx it also seems that using getone() within a while loop instead of anext() causes a constant linear increase in memory utilization. |
Also, setting the env var OTEL_PYTHON_DISABLED_INSTRUMENTATIONS=asyncio avoids the issue, but we don't want to rely on that workaround. |
@xrmx are there any additional checks or things I can do to help solve this issue? |
@bourbonkk @dimastbk tagging you as it's probably related to the asyncio and aiokafka instrumentations 🙏 |
@yonathan-wolloch-lendbuzz I'm currently investigating this from the instrumentation side (@bourbonkk here 👋), and I wanted to add a few technical points that may help with narrowing things down. Span creation in the asyncio instrumentation is gated by configuration. The relevant environment variables are:
OTEL_PYTHON_ASYNCIO_COROUTINE_NAMES_TO_TRACE
OTEL_PYTHON_ASYNCIO_TO_THREAD_FUNCTION_NAMES_TO_TRACE
OTEL_PYTHON_ASYNCIO_FUTURE_TRACE_ENABLED
So I'm curious whether, in your setup, any of these environment variables were set, even unintentionally (perhaps via a Docker ENV or a deployment config). Also, since the memory increase seems tied to span or metric creation, it would be helpful to know how and where spans and metrics are being created in your service.
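For example, a quick check like the following inside the running container will show whether any of them are set (a minimal sketch; the variable names come from the asyncio instrumentation's configuration):

import os

# Print the asyncio-instrumentation-related variables so we can confirm
# whether any of them are set, even unintentionally.
for name in (
    "OTEL_PYTHON_ASYNCIO_COROUTINE_NAMES_TO_TRACE",
    "OTEL_PYTHON_ASYNCIO_TO_THREAD_FUNCTION_NAMES_TO_TRACE",
    "OTEL_PYTHON_ASYNCIO_FUTURE_TRACE_ENABLED",
):
    print(name, "=", os.environ.get(name))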
We've run stress tests on our side as well. Any more insight into the environment would be very helpful 🙏 |
@bourbonkk None of these env vars are defined. Regarding the stress testing, we didn't see a correlation between the number of requests and memory usage, just a constant linear increase of memory over time. Here are the OTEL env vars we do use:
and the packages we use:
Please let me know if there is anything else I can do or provide in order to solve this. 🙏 |
👋 I tested this issue by setting up a real FastAPI application using AIOKafkaConsumer with AsyncioInstrumentor enabled.

✅ Test Setup
🧪 Test Code Snippet

# Imports and app/tracer setup assumed by the snippet (added here for context):
import asyncio
import gc
import tracemalloc

from aiokafka import AIOKafkaConsumer
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.instrumentation.asyncio import AsyncioInstrumentor

app = FastAPI()
tracer = trace.get_tracer(__name__)

@app.get("/")
async def root():
    tracemalloc.start()
    AsyncioInstrumentor().instrument()
    asyncio.create_task(kafka_consumer_task())
    return {"message": "Kafka consumer running with FastAPI"}

async def kafka_consumer_task():
    snapshot1 = tracemalloc.take_snapshot()
    consumer = AIOKafkaConsumer(...)
    await consumer.start()
    try:
        count = 0
        with tracer.start_as_current_span("root-kafka-span"):
            async for msg in consumer:
                count += 1
                if count >= 1_000_000:
                    break
    finally:
        await consumer.stop()
        gc.collect()
        snapshot2 = tracemalloc.take_snapshot()
        # Compare and log memory diff

📈 Test Result

INFO: 📊 Top memory diffs:
INFO: aiokafka/protocol/types.py:161: +2048 KiB
INFO: aiokafka/conn.py:281: +30.4 KiB
INFO: aiokafka/conn.py:526: +5.2 KiB
INFO: asyncio/streams.py:132: +5.2 KiB
...
INFO: 📈 Total increased memory: 2.25 MB

✅ Conclusion
Let me know if you'd like the full source or logs. I'd be happy to share more! |
👋 Thank you so much for taking the time to thoroughly test and investigate this issue; I greatly appreciate your detailed response and the effort you've invested. Based on your results, the memory growth you're seeing is minimal, which is reassuring.

In our case, the memory leak seems unrelated to the number of messages consumed but rather to the total runtime of the service, as memory usage gradually increases over time. Similar issues have also been reported against aiokafka itself. We are still actively investigating the exact cause of the leak on our end.

It would be very helpful if you could share the full source code and logs from your testing. Having your detailed setup might help us pinpoint differences that contribute to our ongoing memory issues. Thanks again for your invaluable assistance! 🙏 |
📌 Memory Diff Results (After 4 Hours, 2 Kafka Messages)

We've observed significant memory growth after just 2 messages consumed over a period of 4 hours since deployment. Below is the detailed memory diff we captured.

Top 10 Memory Differences:
|
Final Issue Description

After extensive investigation, we've identified a memory leak involving the asyncio instrumentation and long-lived aiokafka futures: these futures continuously accumulate callbacks due to the interaction with AsyncioInstrumentor.trace_future, which registers a new done-callback every time the same pending future passes through an instrumented asyncio call.

Debugging Snippet

Here's a sanitized excerpt from our debug session highlighting the accumulated callbacks:

[
<Future pending cb=[AsyncioInstrumentor.trace_future.<locals>.callback() at /path/to/project/.venv/lib/python3.13/site-packages/opentelemetry/instrumentation/asyncio/__init__.py:306, <70066 more callbacks>]>,
<Future finished result=None>,
<Future pending cb=[AsyncioInstrumentor.trace_future.<locals>.callback() at /path/to/project/.venv/lib/python3.13/site-packages/opentelemetry/instrumentation/asyncio/__init__.py:306, <70066 more callbacks>]>,
<Future pending cb=[AsyncioInstrumentor.trace_future.<locals>.callback() at /path/to/project/.venv/lib/python3.13/site-packages/opentelemetry/instrumentation/asyncio/__init__.py:306, <70066 more callbacks>]>,
<Task pending coro=<GroupCoordinator._heartbeat_routine() running at /path/to/project/.venv/lib/python3.13/site-packages/aiokafka/consumer/group_coordinator.py:773> wait_for=<Future pending cb=[Task.task_wakeup()]>>,
<Task pending coro=<GroupCoordinator._commit_refresh_routine() running at /path/to/project/.venv/lib/python3.13/site-packages/aiokafka/consumer/group_coordinator.py:910> wait_for=<Future pending cb=[Task.task_wakeup()]>>
]

We'd greatly appreciate assistance or suggestions on resolving this issue.
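For reference, here is a rough sketch of one way to collect callback counts from a running process (not necessarily our exact method; Future._callbacks is a private CPython attribute, so this is for debugging only):

import asyncio
import gc

def dump_future_callback_counts(top_n: int = 10) -> None:
    # Scan the heap for asyncio futures/tasks and report the ones with the
    # most accumulated done-callbacks; a long-lived pending future whose
    # callback count keeps growing is a strong leak signal.
    futures = [obj for obj in gc.get_objects() if isinstance(obj, asyncio.Future)]
    futures.sort(key=lambda f: len(getattr(f, "_callbacks", None) or []), reverse=True)
    for fut in futures[:top_n]:
        count = len(getattr(fut, "_callbacks", None) or [])
        print(count, repr(fut)[:120])
|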
MRE (run with an async-capable test runner):

import asyncio

from opentelemetry.instrumentation.asyncio import AsyncioInstrumentor

async def test_intrument():
    AsyncioInstrumentor().instrument()
    fut = asyncio.Future()
    for _ in range(10000):
        await asyncio.wait(
            [fut, asyncio.create_task(asyncio.sleep(0))],
            return_when=asyncio.FIRST_COMPLETED,
        )

Result:
Patch:

diff --git a/instrumentation/opentelemetry-instrumentation-asyncio/src/opentelemetry/instrumentation/asyncio/__init__.py b/instrumentation/opentelemetry-instrumentation-asyncio/src/opentelemetry/instrumentation/asyncio/__init__.py
index 9905d91d..950c7d5a 100644
--- a/instrumentation/opentelemetry-instrumentation-asyncio/src/opentelemetry/instrumentation/asyncio/__init__.py
+++ b/instrumentation/opentelemetry-instrumentation-asyncio/src/opentelemetry/instrumentation/asyncio/__init__.py
@@ -90,6 +90,7 @@ import sys
 from asyncio import futures
 from timeit import default_timer
 from typing import Collection
+import weakref
 from wrapt import wrap_function_wrapper as _wrap
@@ -125,6 +126,8 @@ class AsyncioInstrumentor(BaseInstrumentor):
         "run_coroutine_threadsafe",
     ]
+    _future_callbacks = weakref.WeakValueDictionary()
+
     def instrumentation_dependencies(self) -> Collection[str]:
         return _instruments
@@ -303,6 +306,9 @@ class AsyncioInstrumentor(BaseInstrumentor):
             self.record_process(start, attr, span, exception)
     def trace_future(self, future):
+        if future in self._future_callbacks:
+            return future
+
         start = default_timer()
         span = (
             self._tracer.start_span(f"{ASYNCIO_PREFIX} future")
@@ -324,6 +330,7 @@
         )
         future.add_done_callback(callback)
+        self._future_callbacks[future] = callback
         return future
     def record_process(

Result:
|
@yonathan-wolloch-lendbuzz |
@yonathan-wolloch-lendbuzz Respect! |
Is the issue title still accurate, or is it actually related to aiokafka as well? It sounds like the leak is only in the asyncio instrumentation. |
Describe your environment
OS: Ubuntu
Python version: 3.13.1
Package version: 0.51b0
Aiokafka[lz4] version: 0.12.0
What happened?
We have a memory leak caused by the asyncio instrumentation when using AIOKafkaConsumer in our FastAPI app, consuming exactly as documented in the aiokafka documentation.
We think that using getone() within a while loop instead of anext() avoids the issue, but we want to follow aiokafka best practices (see the sketch below).
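For context, a sketch of the two consumption patterns being compared (the consumer construction and the handle() helper are placeholders):

from aiokafka import AIOKafkaConsumer

def handle(msg) -> None:
    # Placeholder for application-specific message handling.
    pass

async def consume_with_async_for(consumer: AIOKafkaConsumer) -> None:
    # Pattern recommended by the aiokafka docs: iterate the consumer,
    # which drives __anext__ under the hood.
    async for msg in consumer:
        handle(msg)

async def consume_with_getone(consumer: AIOKafkaConsumer) -> None:
    # The alternative we experimented with: an explicit getone() loop.
    while True:
        msg = await consumer.getone()
        handle(msg)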
Steps to Reproduce
Add the following code as part of the FastAPI app startup lifespan:
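Roughly, the lifespan wiring looks like this (a sketch; the topic, broker address, and group id below are placeholders):

import asyncio
from contextlib import asynccontextmanager, suppress

from aiokafka import AIOKafkaConsumer
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Start the consumer and the background consuming task on startup.
    consumer = AIOKafkaConsumer(
        "example-topic",                     # placeholder topic
        bootstrap_servers="localhost:9092",  # placeholder broker
        group_id="example-group",            # placeholder group id
    )
    await consumer.start()
    task = asyncio.create_task(consume(consumer))
    try:
        yield
    finally:
        # Shut the task and the consumer down cleanly on application exit.
        task.cancel()
        with suppress(asyncio.CancelledError):
            await task
        await consumer.stop()

async def consume(consumer: AIOKafkaConsumer) -> None:
    # Consumption loop as recommended by the aiokafka documentation.
    async for msg in consumer:
        pass  # application-specific handling goes here

app = FastAPI(lifespan=lifespan)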
Then trigger the consumer once with a message, and the memory will scale up exponentially.
You can check the heap using guppy3 and tracemalloc; the best way is just to measure the memory utilization of the process.
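For example, a minimal sketch of both kinds of measurement (psutil here is an assumption, used only as a convenient way to read the process RSS):

import tracemalloc

import psutil  # optional helper, used here only to read the process RSS

tracemalloc.start()
snapshot1 = tracemalloc.take_snapshot()

# ... let the consumer run for a while ...

snapshot2 = tracemalloc.take_snapshot()
for stat in snapshot2.compare_to(snapshot1, "lineno")[:10]:
    print(stat)

print("RSS MiB:", psutil.Process().memory_info().rss / (1024 * 1024))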
Expected Result
stable memory utilization when using aiokafka's best practices.
Actual Result
exponentially increasing memory utilization.
Additional context
No response
Would you like to implement a fix?
No