Skip to content

Windows: RabbitMQ spawns wmic periodically and wmiprvse leaks resources #1343

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
lasfromhell opened this issue Aug 25, 2017 · 41 comments
Closed
Assignees
Labels
bug effort-low pkg-windows An issue specific to Windows
Milestone

Comments

@lasfromhell
Copy link

RabbitMQ spawns wmic in infinite loop and wmiprvse "eats" CPU and memory is slowly leaking. When it is going, cpu load is up to 50% (4 cores i5) and memory very slowly leaking from the system.
Screenshot of tracing and finding out who spanws process is attached.
When rabbitmq is off, everything is ok.

Installed version of rabbitmq - 3.6.11 with enabled management plugin and default configuration (OS Win 10 x64). No consumers/producers connected to mq.

wmi

On screenshot above you can see that processes spawned all the time, executing wmic request takes some time and cpu

@michaelklishin michaelklishin added the pkg-windows An issue specific to Windows label Aug 25, 2017
@michaelklishin
Copy link
Collaborator

@lukebakken this was introduced in #1223. Do you know why the sub-process may be sticking around?

@michaelklishin michaelklishin changed the title Windows: RabbitMQ spawns wmic in infinite loop and wmiprvse "eats" CPU and memory is slowly leaking Windows: RabbitMQ spawns wmic periodically and wmiprvse leaks resources Aug 25, 2017
@lukebakken
Copy link
Collaborator

No idea, we'll have to investigate.

@michaelklishin
Copy link
Collaborator

@lasfromhell you can set rabbit.vm_memory_calculation_strategy to erlang in your config file and wmic won't be used.

@lukebakken
Copy link
Collaborator

Commit with change to wmic - b5b529d

@lukebakken
Copy link
Collaborator

@lasfromhell thanks for that output. I just installed 3.6.11 on Windows 8.1 and don't see this behavior. I will try on Win 10 next.

What version of Erlang are you using?

@lasfromhell
Copy link
Author

@lukebakken Erlang 8.3 (OTP 19.3) x64
@michaelklishin thanks, that helps

@lukebakken
Copy link
Collaborator

@lasfromhell - wmic is used to calculate the memory used by the Erlang VM process and is run once per second. It is more accurate than using the erlang memory calculation strategy.

I have 3.6.11 running on OTP 19.3 on Windows 10 running in a VirtualBox VM with 2GB ram and 2 vCPUs. As reported by Task Manager, total system memory used remains constant at 1.2GB, and CPU usage doesn't exceed 20%. I'll keep this VM running a while to see if that changes.

rabbitmq-server-1343

@lukebakken
Copy link
Collaborator

My environment has remained stable as shown in the above screenshot. I'll close this issue now but if any additional reproduction information can be provided please re-open it.

@michaelklishin michaelklishin added not-enough-information Please provide more information by starting a mailing list thread. and removed bug labels Aug 29, 2017
@michaelklishin
Copy link
Collaborator

@lasfromhell if you can reproduce this in a minimalistic Windows VM, perhaps you can share it with us?

@atroxes
Copy link

atroxes commented Sep 5, 2017

Recently upgraded to RabbitMQ 3.6.11 (Erlang OTP 20.0) and experienced this issue as well. In a 2-node cluster, Host A and B spent a lot more CPU resources right after the upgrade, and after about a week, Host A suddenly went into a downward spiral, using 100% CPU resources. Host B did not experience this though, only higher avg. CPU usage, caused by WmiPrvSE.exe.

These two images show Host A from just before the upgrade on the morning of August 31st, to about a week later. Especially "Number of Threads" seems interesting.

Host A CPU Utilization - https://i.imgur.com/7k3i1kN.png
Host A Number of Threads - https://i.imgur.com/tGmpXYM.png

The issue has been temporarily fixed for us by using the suggestion of @michaelklishin and setting rabbit.vm_memory_calculation_strategy to erlang in the config file.

Edit: The higher avg. CPU usage of WmiPrvSE.exe was also present using Erlang OTP 19.3.

@lukebakken
Copy link
Collaborator

@atroxes - which version of Windows and what patch level? Thanks!

@lukebakken lukebakken reopened this Sep 5, 2017
@atroxes
Copy link

atroxes commented Sep 5, 2017

@lukebakken - Windows 2012 R2 - Last rollup installed is KB4022726 - It's a test system :)

Would you suggest fully updating and trying again?

@michaelklishin
Copy link
Collaborator

@atroxes thank you, we will try reproducing it some more. FWIW this is not something that's commonly reported (and Windows is a surprisingly popular deployment target for us), so any help with reproducing it would be greatly appreciated. Any chance you can share an image of your test env with us privately, for example?

@atroxes
Copy link

atroxes commented Sep 6, 2017

@michaelklishin - I'm afraid I'm not at liberty to do that, sorry. I will however attempt to fully update the same host and see if this resolves the issue.

Update: Same behaviour on a fully updated Windows 2012 R2. Excessive CPU usage from WmiPrvSE.exe as soon as RabbitMQ service is started.

@hairyhum
Copy link
Contributor

hairyhum commented Sep 7, 2017

Observation:
When running a lot of wmic processes in concurrently, it takes longer to return and a single WmiPrvSE.exe process consumes most CPU. Same for tasklist and get-process (powershell)
Assumption: All this commands call a blocking API.
Hypothesis: The get_system_process_resident_memory is called via get_memory_use from many queues checking if it's the time to page messages to disk.
Proposed experiment: start a rabbitmq windows server, create many queues and start publishing transient messages to hit the paging ratio. Monitor CPU usage of WmiPrvSE.exe and a number of cmd processes.
Proposed fix: memoize the get_system_process_resident_memory function with ets and fixed update period.

@michaelklishin
Copy link
Collaborator

michaelklishin commented Sep 7, 2017 via email

@hairyhum
Copy link
Contributor

hairyhum commented Sep 8, 2017

The experiment failed. Simple queues workload does not create much pressure to memory collection. The queues are using cached values from rabbit_node_monitor.
@atroxes @lasfromhell can you please post more information here about the workload you have?

  • number of queues, channels, connections
  • publish/consume rates
  • message types (persistent/transient) and if queues are lazy
  • message avg sizes
  • frequency of new cmd and wmic processes creation/completion

@atroxes
Copy link

atroxes commented Sep 8, 2017

@hairyhum

Connections: 7
Channels: 24
Exchanges: 33
Queues: 24
Consumers: 24

Regarding the rest: It's a test setup, it's literally doing next to nothing, as in, I barely see any messages at all. All rates/s are bascially 0.

Sorry I can't provide more detailed information, I'm pretty new to RabbitMQ :-)

@michaelklishin
Copy link
Collaborator

wmic activity is no in way related to message rates. @atroxes can you monitor how often new cmd and wmic processes are launched and for those that do not terminate, what they may be doing? (e.g. run a tracing tool similar to strace on them?)

@hairyhum
Copy link
Contributor

hairyhum commented Sep 8, 2017

file_handle_cache is calling vm_memory_monitor:get_memory_use on (some) reads, and since reads are performed by the message store the queue index, it can related to workload.

@hairyhum
Copy link
Contributor

hairyhum commented Sep 8, 2017

Or we should not use system memory reporting in hot code paths like file_handle_cache at all, because it's not parallelizable.

@michaelklishin
Copy link
Collaborator

@hairyhum good find. file_handle_cache should then periodically update its known memory used value or cache it another way.

@michaelklishin
Copy link
Collaborator

michaelklishin commented Sep 11, 2017

@atroxes the issue were are trying to resolve with the one off build above is is whether we effectively "leak" wmic processes by starting them "too often" (we figured it uses a shared resource that requires OS-wide synchronisation), not whether CPU usage is 0. It's expected that not starting wmic processes will consume less resources than not starting them.

@michaelklishin
Copy link
Collaborator

michaelklishin commented Sep 11, 2017

According to a team member, using wmic adds ~ 4% to CPU utilisation even when it is executed several times a minute, so, not on the hot code path. For cases where the difference is significant it is possible to avoid using wmic entirely by relying on the erlang strategy for memory usage calculation.

We also learned that memory usage calculation calls on the hot path can get problematic on UNIX systems as well, so the change in rabbitmq/rabbitmq-common#221 would help there as well. Good job, @hairyhum 👍

@michaelklishin michaelklishin added bug effort-low and removed not-enough-information Please provide more information by starting a mailing list thread. labels Sep 12, 2017
@michaelklishin michaelklishin added this to the 3.6.13 milestone Sep 12, 2017
@michaelklishin
Copy link
Collaborator

@lasfromhell @atroxes we believe this is addressed. We will have a snapshot build later today that includes a fix: would you be interested in verifying it?

@atroxes
Copy link

atroxes commented Sep 18, 2017

@michaelklishin Sure thing! The test server is still online :)

@michaelklishin
Copy link
Collaborator

@atroxes I think this snapshot should no longer exhibit this behavior. Please give it a try :)

@atroxes
Copy link

atroxes commented Sep 20, 2017

@michaelklishin Behaviour changed with 3.6.12-alpha.44.

Now I see 'WMIC.exe' spawning every 2-3 seconds and while it's running, 'WmiPrvSE.exe' eats around 3-7% on my, admittedly, low spec'ed VM. I'm not sure if the resource usage is significant on larger systems, but it's definitely noticeable.

@michaelklishin
Copy link
Collaborator

@atroxes asking the OS about how much RAM a node uses every 2-3 seconds seems pretty reasonable to me. If you have suggestions on a more efficient tool that would return the same result and is shipped with Windows 7+, please let us know on the mailing list.

@atroxes
Copy link

atroxes commented Sep 20, 2017

@michaelklishin Unfortunately I know nothing about the capabilities of Erlang, but maybe it's able to query Resource Monitor? That should be available in Windows 7+.

Thanks for looking into this though, it's not a huge issue for us luckily :)

@michaelklishin
Copy link
Collaborator

@atroxes it's not Erlang that queries wmic, it's RabbitMQ itself. We need a way to query the kernel about how much RAM the OS process uses because #1223. wmic is the most portable and straightforward (least fragile) way of doing that on the command line that we know. The 2-3 second frequency is about right and seems reasonable to our team.

@jdahl
Copy link

jdahl commented Sep 22, 2017

We see the same behavior. On my local Win 10 machine it started to consume 25% of the cpu so a complete core on its own. Using Rabbit 3.6.12 and Erlang 19.3. On windows you really shouldn't call a command prompt every second since its a quite CPU intensive thing to do. The one I see that is spawned is

image

Suggestions would be to either find another way to ask for the memory usage or possibly revert to the old system on windows.

@michaelklishin
Copy link
Collaborator

@jdahl see above. You can revert to the previous strategy without us doing anything. It is already fixed in https://bintray.com/rabbitmq/all-dev/rabbitmq-server/3.6.12-alpha.44#files/rabbitmq-server/3.6.12-alpha.44, so feel free to give it a try.

@Gmcourtney
Copy link

Hello,

Just wanted to contribute that we have noticed a similar issue with the spawned cmd.exe and WMIC.exe processes.

Our initial test environment is configured as below:
OS- Windows Server 2008R2
RabbitMQ- 3.6.11
Erlang- OTP20

We noticed that upon starting the RabbitMQ service, Windows Resource Monitor showed system memory in use growing at a rate of approx 200K per hour, however when looking at running processes nothing appeared to be growing out of control. On a small test environment with 4GB RAM, our machine would make it 12 hours or so before before it crashed.

Using RamMap.exe we identified that there were thousands of cmd.exe and WMIC.exe processes, each still holding 4K of private memory and 16K in the page file. These zombie process handles were growing at an extremely rapid rate, which appeared to be occuring every time a process was spawned/destroyed. This finding lead me to this bug/issue discussion thread.

I applied the update provided by @hairyhum (rabbitmq-server-3.6.12.rc3+2.g9086607). The behaviour is still being seen, however at a SIGNIFICANTLY reduced rate. Since applying this package yesterday evening, our environment has run for approx 18 hours. Running RamMap to check for abandoned processes, I see approximately 30-40 cmd.exe and WMIC.exe processes. So it's still a problem that leads to a memory leak, although much less impactful.

rammap

powershell

I then switched the configuration to use {vm_memory_calculation_strategy, erlang}. Obviously this solved my memory issue... if the processes don't spawn, there's nothing to hang.

I have two questions:

  1. Is this problem isolated to specific versions of Windows? I've found threads regarding similar symptoms/behaviour leaving hanging process references to terminated processes with other software running on Windows Server. On my local development machine, which is running Windows 10 (build 1607), I see no such behavior. This machine has been running for 3-4 days with RabbitMQ service running (although precious few messages have actually been processed). I can watch the processes being spawned in Process Explorer, and they are all terminated and cleaned up properly. RamMap shows no evidence of any remnants of these processes on my local developer machine.

  2. What am I giving up by switching the vm_memory_calculation_strategy to use the old approach? This change resolves the memory pressure that results from the spawned processes, RabbitMQ appears to be stable, and my application still appears to function properly interacting with the bus. Obviously the new approach to use WMI to get the statistics was done for a purpose; so I'm looking for some guidance with regards to any downside to using the old approach in a productive environment.

Thank you!

@lukebakken lukebakken reopened this Sep 22, 2017
@lukebakken
Copy link
Collaborator

lukebakken commented Sep 22, 2017

@Gmcourtney - thanks for the detailed report.

  1. Would you mind sharing links to discussions around this issue? None of the RabbitMQ team members have been able to duplicate this on their workstations. Personally I have tried both Windows 8.1 and 10 Pro, and see processes cleaned up correctly like you have. I don't have access to any Windows server versions at this time.

  2. If you use the old approach for memory calculation, it is likely that the amount of memory RabbitMQ thinks it is using is less than what it actually is using. This could lead to scenarios where RMQ exhausts memory before its internal memory alarms are triggered.

@michaelklishin @hairyhum - It's starting to look like we should revert back to the previous method on Windows.

@michaelklishin
Copy link
Collaborator

@lukebakken @gerhard @hairyhum we can bump memory monitor refresh rate on Windows but if wmic.exe turns out to be completely unsuitable for periodic runs (once a second or two is hardly a high rate for what we ask it to do), we can change the default for Windows.

@jdahl
Copy link

jdahl commented Sep 24, 2017

As a windows developer I can say that you really shouldn't start any process once per second. It's quite resource intensive on windows to do. If you needs to ask WMI you should do it in process but I assume your problem is that you can't do this though erlang. The best solution is probably to revert this behavior on windows until a better solution is found.

I personally didn't get the stuck command process as other has reported but I see the cmd executed and closing. What however took 25% cpu usage was the WMI window services. Partly because every time you asked this it among other things loaded in a resource dll from the window system. A bit tricky to track down that rabbit was the cause since the process had closed before I could look at who called the WMI service. Because of this behavior rabbit isn't seen as the cause since a number of other services relies on the WMI service that consumes all cpu. Rabbit isn't among them but update services and security services where on my machine. Might be part of the reason why you dont get many reports on this. Users thinks something else is the cause and not rabbit.

@michaelklishin
Copy link
Collaborator

@jdahl ok, thanks. We will default to the runtime strategy on Windows as of 3.6.13 then.

@michaelklishin
Copy link
Collaborator

michaelklishin commented Oct 6, 2017

An update: due to what we've learned in this issue and a few other places, as of 3.6.13 there are further adjustments to the strategy: as of rabbitmq/rabbitmq-common#225 it now uses runtime's allocators stats (which supposedly track every single malloc performed). This means no external tools are invoked. In addition, as of rabbitmq/rabbitmq-common#221 we avoid frequent calls to the function in question: it is now invoked once a second.

Because existing strategy names no longer make sense with these changes, we renamed them to allocated (née rss) and legacy (née erlang). rss and erlang are still supported for backwards compatibility, allocated/rss is still the default.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug effort-low pkg-windows An issue specific to Windows
Projects
None yet
Development

No branches or pull requests

7 participants