
TimerEventSource and InformerEventSource use separate thread pools respectively to improve reconcile speed #2293


Closed
ysymi opened this issue Mar 18, 2024 · 7 comments

Comments

@ysymi

ysymi commented Mar 18, 2024

This is not a bug report, this is a discussion about performance optimization

What did you do?

I am a user of the Flink Kubernetes Operator, and I found a potential performance issue when there is a huge number of resources.
The details:
The Flink Kubernetes Operator uses UpdateControl to update each custom resource's status with a 1-minute reschedule interval. When the number of custom resources reaches 10000, the worker threads and the task queue of the reconcile thread pool are filled up by these periodic UpdateControl tasks (from TimerEventSource).
At that point, when we create or delete a CR, the operator cannot handle the event in a timely manner; it may take several tens of seconds before the event is processed.
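The contention described above can be reproduced with a small, self-contained simulation (plain Java, not java-operator-sdk code; all class and method names here are illustrative): a bounded pool is saturated by periodic status-update tasks, so a freshly arriving "CR created" event has to wait behind them in the queue.

```java
import java.util.concurrent.*;

// Illustrative simulation: a small shared executor whose queue is filled
// by "timer" tasks, so a newly arriving "informer" event waits behind them.
public class SharedPoolDelay {
    public static long informerEventDelayMillis() throws Exception {
        // 2 workers, standing in for a reconcile pool sized far below the CR count.
        ExecutorService reconcilePool = Executors.newFixedThreadPool(2);

        // Enqueue many periodic status-update tasks (stand-ins for the
        // TimerEventSource reschedules produced by UpdateControl).
        for (int i = 0; i < 20; i++) {
            reconcilePool.submit(() -> {
                try { Thread.sleep(50); } catch (InterruptedException ignored) {}
            });
        }

        // A "CR created" event arrives now; it must wait for the queue to drain.
        long submitted = System.nanoTime();
        Future<Long> createEvent =
            reconcilePool.submit(() -> System.nanoTime() - submitted);

        long delayMillis = createEvent.get() / 1_000_000;
        reconcilePool.shutdown();
        return delayMillis;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("create event waited ~" + informerEventDelayMillis() + " ms");
    }
}
```

With 20 queued 50 ms tasks and 2 workers, the create event waits roughly 500 ms; scale the task count up to the 10000-CR case and the tens-of-seconds delay follows.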

What did you expect to see?

TimerEventSource and InformerEventSource should be independent of each other so that the operator can handle CR create, update, and delete events promptly.

What did you see instead? Under which circumstances?

A newly created CR takes a long time to be reconciled when there are many timer tasks scheduled via UpdateControl.

Environment

Kubernetes cluster type:

java-operator-sdk version (from pom.xml):

4.4.4

$ java -version

openjdk 11

Possible Solution

TimerEventSource and InformerEventSource should use separate thread pools, and each pool's size should be configurable independently.
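A minimal sketch of what this proposal could look like (a hypothetical design for illustration, not an existing java-operator-sdk API): timer-driven reschedules and informer-driven events are dispatched to independently sized pools, so a flood of periodic tasks cannot starve create/update/delete handling.

```java
import java.util.concurrent.*;

// Hypothetical split-pool dispatcher: timer work and informer work
// never compete for the same worker threads or queue.
public class SplitPools {
    final ScheduledExecutorService timerPool;   // TimerEventSource work
    final ExecutorService informerPool;         // InformerEventSource work

    SplitPools(int timerThreads, int informerThreads) {
        timerPool = Executors.newScheduledThreadPool(timerThreads);
        informerPool = Executors.newFixedThreadPool(informerThreads);
    }

    void scheduleStatusRecheck(Runnable reconcile, long delayMillis) {
        timerPool.schedule(reconcile, delayMillis, TimeUnit.MILLISECONDS);
    }

    void onResourceEvent(Runnable reconcile) {
        informerPool.submit(reconcile); // never queued behind timer tasks
    }

    void shutdown() {
        timerPool.shutdown();
        informerPool.shutdown();
    }
}
```

The trade-off, raised later in this thread, is that two pools complicate sizing and the framework's code for a problem that better event-source design can often avoid entirely.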

@csviri
Collaborator

csviri commented Mar 18, 2024

Hi @ysymi, do I understand correctly that the problem is: when there is a huge number of custom resources with timed reconciliation, and a new custom resource is created or updated, there is a significant delay until it is processed?

Note that ideally there should not be a timed reschedule at all; reconciliation should always be triggered by event sources. I also see other optimization possibilities here, like gradually increasing the re-schedule interval (if the use case allows it).

Introducing "priority events" or a similar approach would add potentially huge complexity.

cc @gyfora

@ysymi
Author

ysymi commented Mar 19, 2024

Yes, @csviri, your understanding is correct, but I want to emphasize that I don't care about the delay of timed reconciliation; I care about the delay of handling events from k8s, since the latter affects the submission speed of jobs.
So I want to use some priority control, or separate thread pools, to avoid mutual interference between the two types of event handling.
I just want to bring up this issue to see if there are any reasonable solutions.
If "priority events" is too complex, how about separate thread pools?

cc @gyfora

@csviri
Collaborator

csviri commented Mar 19, 2024

I see. So there are a couple of options without going to separate thread pools, priority events, or other such solutions.

Again, timed reconciliation is usually an issue in itself. It is in the framework because in some simple cases it might be easy to start with, but in general it should be avoided. Could you describe what it is used for in the Flink Operator?
Could it be replaced by an event source? Consider also the polling event sources, which should do local state checks and trigger events for reconciliation only when needed. In general this part can usually be optimized very nicely, so the actual delay problem would disappear.
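A concept sketch of the polling-with-local-state-check idea (plain Java for illustration; the real java-operator-sdk PollingEventSource API differs, and all names here are hypothetical): poll the external system, compare against a local cache, and fire a reconcile event only when the observed state actually changed.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;
import java.util.function.Function;

// Polls external state and triggers reconciliation only on change,
// so unchanged resources put no load on the reconcile thread pool.
public class StatePoller<K, S> {
    private final Map<K, S> cache = new ConcurrentHashMap<>();
    private final Function<K, S> fetchState;     // e.g. query the Flink job's status
    private final Consumer<K> triggerReconcile;  // enqueue a reconcile event

    public StatePoller(Function<K, S> fetchState, Consumer<K> triggerReconcile) {
        this.fetchState = fetchState;
        this.triggerReconcile = triggerReconcile;
    }

    /** Poll one resource; trigger reconciliation only on a state change. */
    public boolean poll(K resourceId) {
        S fresh = fetchState.apply(resourceId);
        S previous = cache.put(resourceId, fresh);
        if (previous == null || !previous.equals(fresh)) {
            triggerReconcile.accept(resourceId);
            return true;
        }
        return false; // unchanged: no reconcile, no thread-pool pressure
    }
}
```

For the Flink case described below, a healthy long-running job would then consume almost no executor capacity between state transitions.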

@ysymi
Author

ysymi commented Mar 19, 2024

When bringing up this issue, I thought of two solutions: one is to separate the thread pools, and the other is to use a priority queue. But as you said, a priority queue is more complex, so I think separating the thread pools may be more reasonable. Of course, if there were not so many UpdateControl calls, this problem would not exist in the first place.

As for the Flink Operator: first, a Flink stream-processing job is a long-running workload. After a Flink job is submitted into a k8s cluster (the operator handles this submission), the operator needs to check the Flink job's state so that, if the job crashes, it can re-submit the job to recover it. However, most of the time the operator is executing the logic of periodically checking state and triggering the corresponding maintenance operations.

This is my personal understanding. If there are any inaccuracies or missing details, @gyfora please fill them in.

@csviri
Collaborator

csviri commented Mar 19, 2024

@ysymi I discussed this briefly with @gyfora as well. To my understanding, some parts could be optimized in a future Flink Operator (using the polling event sources for some state caching, for example).

But regarding performance: since reconciliation in Flink can take a bit longer because of the HTTP requests involved, as opposed to the ideal case where everything is in memory, wouldn't it be a solution to simply increase the thread pool size of the executor?

When bringing up this issue, I thought of two solutions: one is to separate the thread pools, and the other is to use a priority queue. But as you said, a priority queue is more complex, so I think separating the thread pools may be more reasonable. Of course, if there were not so many UpdateControl calls, this problem would not exist in the first place.

Both of these were just ideas, meant as brainstorming. Both of them would complicate the code. Note that no such concept currently exists even in the Go counterpart (controller-runtime).
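For reference, the "simply increase the thread pool size" option mentioned above can be expressed as a configuration sketch. This assumes java-operator-sdk 4.x; the overrider method name should be verified against the SDK version in use, and the reconciler class is a placeholder.

```java
import io.javaoperatorsdk.operator.Operator;

// Configuration sketch: widen the shared reconcile executor so slow,
// HTTP-bound reconciliations do not back up the event queue.
public class OperatorMain {
    public static void main(String[] args) {
        Operator operator = new Operator(
            overrider -> overrider.withConcurrentReconciliationThreads(50));
        // operator.register(new FlinkDeploymentReconciler()); // hypothetical reconciler
        operator.start();
    }
}
```

This keeps a single pool (matching controller-runtime's model) while raising its capacity, at the cost of more threads rather than more framework complexity.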

@ysymi
Author

ysymi commented Mar 19, 2024

OK, thanks for discussing this problem @csviri. I am going to try some Flink Operator optimizations first.

@csviri
Collaborator

csviri commented Mar 19, 2024

thx!

will close this issue now, feel free to reopen or create a new one when it comes to that point.
