[FLINK-31215] [autoscaler] Backpropagate processing rate limits from non-scalable bottlenecks to upstream operators #847
base: main
Conversation
Hi, @gyfora, could you review the code and run the workflows, please?
flink-autoscaler/src/main/java/org/apache/flink/autoscaler/JobVertexScaler.java (outdated review thread, resolved)
@@ -58,6 +58,15 @@ private static ConfigOptions.OptionBuilder autoScalerConfig(String key) {
                .withDescription(
                        "Enable vertex scaling execution by the autoscaler. If disabled, the autoscaler will only collect metrics and evaluate the suggested parallelism for each vertex but will not upgrade the jobs.");

    public static final ConfigOption<Boolean> PROCESSING_RATE_BACKPROPAGATION_ENABLED =
            autoScalerConfig("processing.rate.backpropagation.enabled")
We could call this simply bottleneck-propagation.enabled and, to control the scaling, bottleneck-propagation.allow-scale-down.
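For illustration, the proposed names would be set roughly like this (a sketch only; the job.autoscaler. key prefix and the exact spellings are assumptions, not the final names adopted in this PR):

import org.apache.flink.configuration.Configuration;

class BottleneckPropagationConfigSketch {
    static Configuration example() {
        Configuration conf = new Configuration();
        // Enable backpropagation of processing-rate limits from bottlenecks.
        conf.setString("job.autoscaler.bottleneck-propagation.enabled", "true");
        // Control whether the propagated limit may scale vertices down.
        conf.setString("job.autoscaler.bottleneck-propagation.allow-scale-down", "false");
        return conf;
    }
}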
Done
The 2 config names are not in sync now
Thanks for the PR @aplyusnin! I'll take a look.
flink-autoscaler/src/main/java/org/apache/flink/autoscaler/JobVertexScaler.java (outdated review thread, resolved)
flink-autoscaler/src/main/java/org/apache/flink/autoscaler/JobVertexScaler.java (outdated review thread, resolved)
flink-autoscaler/src/test/java/org/apache/flink/autoscaler/ScalingExecutorTest.java (outdated review thread, resolved)
flink-autoscaler/src/main/java/org/apache/flink/autoscaler/JobVertexScaler.java (outdated review thread, resolved)
flink-autoscaler/src/main/java/org/apache/flink/autoscaler/VertexScalingResult.java (outdated review thread, resolved)
flink-autoscaler/src/main/java/org/apache/flink/autoscaler/VertexScalingResult.java (outdated review thread, resolved)
void addBottleneckVertex(JobVertexID bottleneck, double factor) {
    bottlenecks.add(bottleneck);
    backpropagationScaleFactor = Math.min(backpropagationScaleFactor, factor);
Why is the scaling factor not kept on a per-vertex level? If there are two vertices within different branches, they will influence each other, e.g. a propagation factor of 0.1 will override another with 0.9. I think we need to account for it per input, similarly to how we propagate the actual rates.
The scale factor works as follows: the minimum possible value is picked and pushed to the sources, lowering the target capacity of each vertex by this factor. This approach works fine for most jobs:
- Starting from a bottleneck vertex, the capacity of all of its upstream vertices is reduced by the factor
- During propagation, source operators are reached and their capacity is reduced
- This affects vertices that may not be directly connected to the initial bottleneck
- Repeating steps 2 and 3 adjusts all vertices in the connected component
I think the case with two or more connected components (e.g. the graph source1 -> op1 -> sink1; source2 -> op2 -> sink2) is rare.
As an alternative, bottlenecks can be iterated in order of decreasing bottleneck factor, making propagation more accurate, but it takes more time for scaling (O(N^2) vs. O(N)), is harder to maintain, and is less predictable.
What do you think? Should we use more complex logic for propagation?
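A minimal sketch of the approach described above (illustrative names, not the PR code): every bottleneck reports a factor, the smallest one wins, and it is applied uniformly to all target rates.

import java.util.HashMap;
import java.util.Map;

class GlobalFactorPropagationSketch {

    // Smallest factor reported by any non-scalable bottleneck; 1.0 means no bottleneck.
    private double backpropagationScaleFactor = 1.0;

    void addBottleneckFactor(double factor) {
        backpropagationScaleFactor = Math.min(backpropagationScaleFactor, factor);
    }

    // Lower every vertex's target rate by the single global factor.
    Map<String, Double> applyToTargetRates(Map<String, Double> targetRates) {
        Map<String, Double> adjusted = new HashMap<>();
        targetRates.forEach((vertex, rate) -> adjusted.put(vertex, rate * backpropagationScaleFactor));
        return adjusted;
    }
}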
I think we might also have to consider the output ratios when propagating the bottleneck backwards.
So, technically speaking, if we want to be completely precise, we can do this in a single pass if we start computing the target rates from the sinks. Once the actual scaled rate is computed, we have to propagate the diff compared to the original one back to the upstream tasks, based on the output ratio.
I think Gyula has a point. The current implementation works for simple jobs, but there are many jobs with more complex, uncorrelated branches, which would lead to unnecessary scale-downs or prevent upscales (if scale-down is disabled). Using the output ratios would allow us to precisely feed back the bottleneck ratios and avoid any accidental backpropagation.
Hi, @mxm, @gyfora. I rewrote the code for the processing rate backpropagation. Unfortunately, the backpropagation results depend on the order in which vertices are visited during backpropagation, and Flink's default topological order may not be the best. Also, I decided to update the target data rate metrics during backpropagation processing to make the code more compact. Is that OK?
I think the backpropagation has to go in reverse topological order (i.e. from sinks to sources) and then it should be stable. Am I missing something, @aplyusnin?
Yes, @gyfora, you are right. Now, backpropagation considers vertices in reverse topological order. |
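For reference, a reverse topological order (sinks first, sources last) can be produced with Kahn's algorithm run against the reversed edges. The sketch below uses an illustrative adjacency-map representation, not the PR's actual topology classes.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ReverseTopologicalOrderSketch {

    /** @param downstream vertex id -> ids of its direct downstream vertices */
    static List<String> compute(Map<String, List<String>> downstream) {
        // Number of direct downstream vertices each vertex still has to wait for.
        Map<String, Integer> remaining = new HashMap<>();
        downstream.forEach((v, ds) -> {
            remaining.merge(v, ds.size(), Integer::sum);
            ds.forEach(d -> remaining.putIfAbsent(d, 0));
        });

        // Start from the sinks (no downstream vertices) and peel the graph backwards.
        Deque<String> ready = new ArrayDeque<>();
        remaining.forEach((v, c) -> { if (c == 0) ready.add(v); });

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String vertex = ready.poll();
            order.add(vertex);
            // Removing this vertex may make its direct upstream vertices ready.
            downstream.forEach((up, ds) -> {
                if (ds.contains(vertex) && remaining.merge(up, -1, Integer::sum) == 0) {
                    ready.add(up);
                }
            });
        }
        return order; // sinks first, sources last
    }
}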
I started to review the code but it's very hard for me to understand the actual logic just by looking at it.
I am a bit puzzled by the concept of back propagation factors, scale factors etc. per vertex and things like that.
In my head I am looking for a much simpler logic such as:
actualTargetProcessingRate = min(targetProcessingRate, max(downstream_target_rate / output_ratio))
Basically, for each vertex, we check whether it has any downstream vertex with a target capacity that would backpressure it, and then adjust the target rate.
I don't see why we need factors/multipliers, etc.
Thank you for your reply. Now, the backpropagation logic for a single vertex is the following:
For example, take a look at operator 3. Initially, its target data rate was 250; it is lowered by the upstream factor 0.8 and becomes 200. Handling this rate would require twice the vertex's maximum parallelism (40 / 20 = 2), so the backpressure factor to propagate is 0.8 (from upstream) * 20 / 40 (the vertex is a bottleneck) = 0.4.
Now it is time to propagate the factor to the direct upstream (operator 1 and operator 2). Note that operator 1 has already been adjusted by some other vertices. First, the data rate arriving from the direct upstream is evaluated (target data rate * output ratio * backpressure factor): 100 * 2 * 0.5 = 100 from operator 1 and 50 * 1 * 1 = 50 from operator 2, summing up to 150. Since the adjusted target data rate of operator 3 is 100 and the upstream provides 150, all direct upstream operators should be lowered. To do this, their backpressure factors are multiplied by 100 / 150 = 2/3 (target data rate / data rate from the upstream).
This process repeats for all vertices in reverse topological order. Then the target data rates are updated using the scale factors propagated to the sources. There are also some extra checks to prevent aggressive scaling down.
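Spelling the worked example out as plain arithmetic may help; the numbers below come straight from the description above, and the code is only an illustration, not the PR implementation.

class BackpropagationFactorArithmetic {
    public static void main(String[] args) {
        // Operator 3: initial target rate 250, already lowered by an upstream factor of 0.8.
        double upstreamFactor = 0.8;
        double initialTargetRate = 250;
        double loweredRate = initialTargetRate * upstreamFactor; // 200

        // The vertex is a bottleneck: per the example, it can only handle half of the
        // lowered rate (20 available vs. the 40 that would be needed).
        double bottleneckFactor = 20.0 / 40.0;                               // 0.5
        double backpressureFactor = upstreamFactor * bottleneckFactor;       // 0.4
        double adjustedTargetRate = initialTargetRate * backpressureFactor;  // 100

        // Rate arriving from the direct upstream: target rate * output ratio * its factor.
        double fromOperator1 = 100 * 2 * 0.5; // 100
        double fromOperator2 = 50 * 1 * 1.0;  // 50
        double upstreamRate = fromOperator1 + fromOperator2; // 150

        // Upstream provides 150 but operator 3 can only take 100, so every direct
        // upstream operator's backpressure factor is multiplied by 100 / 150 = 2/3.
        double upstreamAdjustment = adjustedTargetRate / upstreamRate;
        System.out.println("lowered rate: " + loweredRate
                + ", multiply upstream factors by " + upstreamAdjustment);
    }
}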
Thanks for the update to the PR @aplyusnin! Could you explain why the logic you described is required over the simpler logic Gyula outlined? It does not seem necessary to propagate the load factors.
In my eyes, it would be sufficient to do a single pass in reverse topological order (i.e. from sinks to sources), where we limit upstream vertices by the rate limit established downstream via the initial scaling logic. This could be done in a recursive fashion.
It doesn't matter to the upstream vertices what the downstream backpropagation factors are, because the rates dictate how much the vertex will be scaled. The backpropagation factor is only relevant locally to the vertex, to apply a limit to its rate.
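A rough sketch of such a single reverse-topological pass (the graph and metric map shapes are assumptions for illustration, not the PR's data structures): each vertex's target rate is capped by the most restrictive of its downstream consumers, scaled by the edge's output ratio.

import java.util.Collections;
import java.util.List;
import java.util.Map;

class ReverseTopologicalLimiterSketch {

    /**
     * @param reverseTopologicalOrder vertices from sinks to sources
     * @param downstream              vertex -> its direct downstream vertices
     * @param outputRatios            "vertex->downstream" edge -> downstream records per input record
     * @param targetRates             per-vertex target processing rates, adjusted in place
     *                                (assumed to contain an entry for every vertex)
     */
    static void limitByDownstream(
            List<String> reverseTopologicalOrder,
            Map<String, List<String>> downstream,
            Map<String, Double> outputRatios,
            Map<String, Double> targetRates) {

        for (String vertex : reverseTopologicalOrder) {
            double limit = Double.POSITIVE_INFINITY;
            for (String next : downstream.getOrDefault(vertex, Collections.emptyList())) {
                double ratio = outputRatios.getOrDefault(vertex + "->" + next, 1.0);
                // The vertex cannot usefully emit more than this downstream can absorb.
                limit = Math.min(limit, targetRates.get(next) / ratio);
            }
            // Keep the original target rate unless a downstream limit is tighter.
            targetRates.merge(vertex, limit, Math::min);
        }
    }
}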
double averageTrueProcessingRate =
        evaluatedMetrics
                .getVertexMetrics()
                .get(vertex)
                .get(TRUE_PROCESSING_RATE)
                .getAverage();
Why is this metric required? Isn't TARGET_DATA_RATE, the relevant metric, sufficient to apply the backpropagation factor?
Thank you for your replies! I don't understand how we can determine whether a vertex is a bottleneck without evaluating its parallelism. This is why TRUE_PROCESSING_RATE is used. Also, I think that using the simpler approach is not accurate enough. Suppose we have a window join operator with two upstreams. Its target_data_rate is calculated as:
If the join operator is a bottleneck, then its … This is where problems with accuracy appear. The … Also, the … In both cases, the upstream_1 and upstream_2 operators will remain blocked after scaling. This is why the simpler approach may not be good enough.
I think it can work if we apply the same logic that we used to determine …
Consequently, we would need to satisfy the following equation for the backpropagation:
That would mean that each input vertex gets the following limit applied:
where … Do you think that would work? The benefit of this approach is that we leverage all the available information without having to add and backfeed additional factors.
force-pushed from cc0479a to f25d484
Hi, sorry for the big delay. I've tried to simplify the logic of the backpropagation. Every vertex has its initial target data rate, the rate limited by the downstream, and its potential performance after scaling. There is a pitfall with negative limited data rates appearing during backpropagation (check ScalingExecutorBackpropagationTest#testDisconnectedJobMetricsUpdate) because of the "even" backpropagation of the data rate to upstream operators, so it is still not very clear how to resolve such situations. I also decided to replace the bottleneck-propagation.allow-scale-down option with processing.rate.backpropagation.impact, because I think it is very unclear how to use the previous one properly. The new option represents a "weight" of the backpropagation: 0 means backpropagation is ignored, and 1.0 means the new values will be used further.
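One plausible reading of the new "impact" weight described above is a linear interpolation between the original and the backpropagated target rate; this is an assumption for illustration, not necessarily how the PR applies the option.

class BackpropagationImpactSketch {
    static double applyImpact(double originalRate, double backpropagatedRate, double impact) {
        // impact = 0.0 -> backpropagation is ignored, keep the original rate;
        // impact = 1.0 -> the backpropagated rate is used as-is.
        return originalRate * (1.0 - impact) + backpropagatedRate * impact;
    }
}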
@aplyusnin could we have a quick call on this maybe? You can ping me via email/Slack so we can wrap this up quicker :)
force-pushed from 7345d39 to efa204e
force-pushed from c9ab70e to 4200319
…bottlenecks to upstream operators
What is the purpose of the change
This pull request adds logic for backpropagating the processing rate from non-scalable bottlenecks to upstream operators, potentially reducing the parallelism of backpressured vertices after scaling.
Brief change log
Verifying this change
This change added tests and can be verified as follows:
Does this pull request potentially affect one of the following parts:
CustomResourceDescriptors: no
Documentation