
Extremely slow deserialization due to lock contention and logging module #1867


Open

aantn opened this issue Jul 28, 2022 · 24 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@aantn

aantn commented Jul 28, 2022

We are experiencing performance issues when making API calls from many threads at the same time. The issue only occurs on certain clusters, and we suspect it happens when a high number of API calls occur simultaneously.

When looking at stack traces, we see contention around a lock in the logging module.

Thread XYZ (most recent call first):
  File \"/usr/local/lib/python3.9/logging/__init__.py\", line 225 in _acquireLock
  File \"/usr/local/lib/python3.9/logging/__init__.py\", line 1298 in getLogger
  File \"/usr/local/lib/python3.9/logging/__init__.py\", line 2042 in getLogger
  File \"/usr/local/lib/python3.9/site-packages/kubernetes/client/configuration.py\", line 112 in __init__
  File \"/usr/local/lib/python3.9/site-packages/kubernetes/client/models/v1_container_image.py\", line 48 in __init__
  File \"/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py\", line 641 in __deserialize_model
  File \"/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py\", line 303 in __deserialize
  File \"/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py\", line 280 in <listcomp>
  File \"/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py\", line 280 in __deserialize
  File \"/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py\", line 639 in __deserialize_model
  File \"/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py\", line 303 in __deserialize
  File \"/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py\", line 639 in __deserialize_model
  File \"/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py\", line 303 in __deserialize
  File \"/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py\", line 280 in <listcomp>
  File \"/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py\", line 280 in __deserialize
  File \"/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py\", line 639 in __deserialize_model
  File \"/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py\", line 303 in __deserialize
  File \"/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py\", line 264 in deserialize
  File \"/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py\", line 192 in __call_api
  File \"/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py\", line 348 in call_api
  File \"/usr/local/lib/python3.9/site-packages/kubernetes/client/api/core_v1_api.py\", line 16517 in list_node_with_http_info
  File \"/usr/local/lib/python3.9/site-packages/hikaru/model/rel_1_16/v1/v1.py\", line 46011 in listNode
  ... application specific stuff

The above is from a single thread. Each dump contains about 20 threads like this, all stuck on different API calls in the client library, all waiting in _acquireLock. (There are several paths to _acquireLock - it's not just getLogger but also functions like _clear_cache(), setLevel(), etc.)
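
For reference, thread dumps in this format can be produced with the standard library's faulthandler module (a minimal sketch; the choice of signal is arbitrary):

import faulthandler
import signal

# Send SIGUSR1 to the process (kill -USR1 <pid>) to print every thread's
# current stack to stderr, which makes the pile-up in _acquireLock visible.
faulthandler.register(signal.SIGUSR1, all_threads=True)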

Is there a fundamental reason why deserializing a Kubernetes object must involve a call to getLogger() and other logging functions?

If not, it would be great to improve performance here by avoiding calls to lock-taking logging functions during deserialization.
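
For context, here is a paraphrased sketch of the pattern the stack trace points at (this is not the literal generated code): every deserialized model builds a Configuration when none is passed in, and Configuration.__init__ asks the logging module for loggers, which takes the logging module's global lock that all the threads then contend on.

import logging

class Configuration:
    def __init__(self):
        # logging.getLogger() acquires the logging module's global lock
        self.logger = {
            "package_logger": logging.getLogger("kubernetes.client"),
            "urllib3_logger": logging.getLogger("urllib3"),
        }

class V1ContainerImage:
    def __init__(self, names=None, local_vars_configuration=None):
        # A fresh Configuration is created for every deserialized object
        # when the caller does not supply one.
        if local_vars_configuration is None:
            local_vars_configuration = Configuration()
        self.local_vars_configuration = local_vars_configuration
        self.names = names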

@aantn aantn added the kind/bug Categorizes issue or PR as related to a bug. label Jul 28, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 26, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 25, 2022
@aantn
Author

aantn commented Nov 25, 2022

/remove-lifecycle rotten

This is a fairly major issue for us!

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Nov 25, 2022
@CastixGitHub

py-spy screenshot: [Screenshot from 2022-11-30 18-40-35]

from os import getenv

import urllib3
from kubernetes import client

# Bearer-token auth against the target cluster; TLS verification disabled for the test.
configuration = client.Configuration()
configuration.api_key_prefix['authorization'] = 'Bearer'
configuration.api_key['authorization'] = getenv('TOKEN') or ''
configuration.host = getenv('HOST') or ''
configuration.verify_ssl = False
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Hammer the API: list and deserialize every secret 20,000 times in a row.
with client.ApiClient(configuration) as api_client:
    core_v1 = client.CoreV1Api(api_client)
    for i in range(1, 20001):
        secrets = core_v1.list_secret_for_all_namespaces().items

(The cluster has ~1000 TLS secrets.)

@mdrakiburrahman

mdrakiburrahman commented Dec 15, 2022

Here's a workaround: in your Dockerfile, during the image build, comment out the logging lines in rest.py that are the root of all evil:

find /usr/local/lib/python*/dist-packages/kubernetes/client/rest.py -type f -exec sed -i 's:^\(.*logger.*\)$:#\1:' {} \;
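
A less invasive alternative to the same end (a sketch; the logger name is an assumption based on the module path, so verify it against your installed version) is to raise the client logger's level at runtime so the debug records are never created in the first place:

import logging

# Silence the generated client's chatty debug logging (including the
# "response body: %s" calls in rest.py) without editing site-packages.
logging.getLogger("kubernetes.client.rest").setLevel(logging.WARNING)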

@mdrakiburrahman

mdrakiburrahman commented Dec 15, 2022

Hey @roycaihw - we ran a benchmark on our pod to quantify the memory leak from this bug. Running an infinite loop of K8s API calls for 5 minutes, our Pod went from a 200 MB baseline to 38 GB of memory:

t=0: [memory usage screenshot]

t=5 mins: [memory usage screenshot]

This was Sonobuoy running on an AKS cluster, but I'm sure it's reproducible on GKE as well.

This pretty much makes the Python client unusable.

If you run an infinite loop inside a Kubernetes Pod and track allocations with tracemalloc, you will see the same behavior.

Here's a quick script you can run to pinpoint the culprit in rest.py:

import time
import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames per allocation traceback
start = time.time()
while True:
    print(f"Duration elapsed: {time.time() - start:.0f} seconds.")
    # ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    # >>>>>>>>>>>  Make some K8s API calls here <<<<<<<<<<<<<<
    # ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    snapshot = tracemalloc.take_snapshot()

    # Top allocation sites by line
    top_stats = snapshot.statistics('lineno')
    print("[ Top 10 ]")
    for stat in top_stats[:10]:
        print(stat)

    # Full traceback for the single largest allocation site
    top_stats = snapshot.statistics('traceback')
    stat = top_stats[0]
    print("%s memory blocks: %.1f KiB" % (stat.count, stat.size / 1024))
    for line in stat.traceback.format():
        print(line)

This is what you'll see - the Python Kubernetes client starts guzzling up RAM:

========================================================================
[ Top 10 ]
/usr/lib/python3.8/logging/__init__.py:1088: size=268 MiB, count=1373, average=200 KiB                          # <-------------------- CULPRIT
/usr/lib/python3.8/logging/__init__.py:373: size=134 MiB, count=240, average=571 KiB                            # <-------------------- CULPRIT
/usr/local/lib/python3.8/dist-packages/kubernetes/client/rest.py:229: size=134 MiB, count=240, average=571 KiB  # <-------------------- CULPRIT
/usr/lib/python3.8/linecache.py:137: size=785 KiB, count=7484, average=107 B
/usr/lib/python3.8/logging/__init__.py:294: size=267 KiB, count=1366, average=200 B
/usr/lib/python3.8/tracemalloc.py:185: size=225 KiB, count=1001, average=230 B
/usr/lib/python3.8/tracemalloc.py:479: size=55.2 KiB, count=1002, average=56 B
/usr/lib/python3.8/logging/__init__.py:608: size=48.9 KiB, count=686, average=73 B
/usr/lib/python3.8/tracemalloc.py:472: size=39.5 KiB, count=1002, average=40 B
/usr/lib/python3.8/posixpath.py:145: size=38.4 KiB, count=685, average=57 B
========================================================================
24 memory blocks: 36462.6 KiB
    monitorstacks = get_latest_custom_resource_list(
  File "/tests/util.py", line 851
    latest_crd = get_latest_custom_resource_definition(group, kind)
  File "/tests/util.py", line 873
    for crd in get_latest_custom_resource_definitions()
  File "/tests/util.py", line 893
    for x in api.list_custom_resource_definition().to_dict()["items"]
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api/apiextensions_v1_api.py", line 624
    return self.list_custom_resource_definition_with_http_info(**kwargs)  # noqa: E501
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api/apiextensions_v1_api.py", line 731
    return self.api_client.call_api(
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api_client.py", line 348
    return self.__call_api(resource_path, method,
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api_client.py", line 180
    response_data = self.request(
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api_client.py", line 373
    return self.rest_client.GET(url,
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/rest.py", line 241
    return self.request("GET", url,
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/rest.py", line 232
    logger.debug("response body: %s", r.data)                                                                           # <-------------------- CULPRIT
  File "/usr/lib/python3.8/logging/__init__.py", line 1434
    self._log(DEBUG, msg, args, **kwargs)
  File "/usr/lib/python3.8/logging/__init__.py", line 1589
    self.handle(record)
  File "/usr/lib/python3.8/logging/__init__.py", line 1599
    self.callHandlers(record)
  File "/usr/lib/python3.8/logging/__init__.py", line 1661
    hdlr.handle(record)
  File "/usr/lib/python3.8/logging/__init__.py", line 954
    self.emit(record)
  File "/usr/local/lib/python3.8/dist-packages/_pytest/logging.py", line 342
    super().emit(record)
  File "/usr/lib/python3.8/logging/__init__.py", line 1088
    stream.write(msg + self.terminator)
========================================================================

Not leaking memory should be table stakes for a widely adopted community tool. We have the one-liner hack I mentioned above to comment out the buggy debug logging in rest.py, but ideally this should be fixed in the source code.
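
As an illustration of one direction a source-level fix could take (this is not the client's actual code, and the cap below is a made-up value), the response body could be bounded before it is handed to the logger, so handlers that retain records never hold multi-megabyte bodies:

import logging

logger = logging.getLogger("kubernetes.client.rest")
MAX_LOGGED_BODY = 2048  # hypothetical cap, for illustration only

def log_response_body(data: bytes) -> None:
    # Skip formatting entirely unless DEBUG is actually enabled, and truncate
    # large bodies so capturing handlers do not retain them in full.
    if logger.isEnabledFor(logging.DEBUG):
        shown = data if len(data) <= MAX_LOGGED_BODY else data[:MAX_LOGGED_BODY] + b"...(truncated)"
        logger.debug("response body: %s", shown)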

@CastixGitHub

The issue was reported back in 2019 in #836 and went rotten without any reply.
I also looked into this because of the memory issue; it's not AKS-specific, and it's not tied to the Python version used.

@mdrakiburrahman

Tagging @yliaog @kariya-mitsuru @palnabarun as original authors of rest.py

@raufabr

raufabr commented Feb 3, 2023

It'd be useful for us if this were fixed as well.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 4, 2023
@tzhu-yelp

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 4, 2023
@aantn
Author

aantn commented Jul 13, 2023

@mdrakiburrahman what's the memory usage if you apply that patch vs without it?

@mdrakiburrahman

@mdrakiburrahman what's the memory usage if you apply that patch vs without it?

With the hack patch: about 200 MB, steady, as expected for my pod.

Without the hack patch: memory starts climbing; I've seen it go up to 40 GB+ (I removed limits to see how bad the damage could get, and it's bad).

@mdrakiburrahman

With hack: [memory usage screenshot]

Without hack: [memory usage screenshot]

@yliaog
Contributor

yliaog commented Jul 18, 2023

kubernetes/client/rest.py is generated; any chance to look into fixing the upstream generator?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 2, 2024
@tzhu-yelp

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 2, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 1, 2024
@tzhu-yelp

/remove-lifecycle stale

@gibsondan

The linked issue (#2284) has a workaround for this issue by patching the ApiClient to pass through a Configuration object to each model that it creates.
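
A rough sketch of the underlying idea (not the exact patch from #2284; whether the default is actually consulted during deserialization depends on the generated model code in your installed client version): construct one Configuration up front and register it as the process-wide default, so it can be reused instead of building a new Configuration per model.

from kubernetes import client

# Build a single Configuration once and register it as the default that the
# rest of the client can fall back to, rather than constructing one per model.
shared_config = client.Configuration()
client.Configuration.set_default(shared_config)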

@aantn
Author

aantn commented Sep 22, 2024

Thanks. We patched a related issue with something based on this PR (see patch and find commands at the end) and I suspect it includes the fix for this too. (For now we're also patching this as detailed here, which I suspect is redundant.)

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 21, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 20, 2025
@CastixGitHub

/remove-lifecycle rotten
/lifecycle waiting_for_mantainer_action

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 23, 2025