merge-ort: add code to check for whether cached renames can be reused

newren · newren · commit e123d1325567 · 2021-02-13T19:36:24.000-08:00
We need to know when renames detected in a previous merge operation can
be reused in a later merge operation.  Consider the following setup
(from the git-rebase manpage):

                     A---B---C topic
                    /
               D---E---F---G master

After rebasing, this will appear as:

                             A'--B'--C' topic
                            /
               D---E---F---G master

Further, let's say that 'oldfile' was renamed to 'newfile' between E
and G.  The rebase or cherry-pick of A onto G will involve a three-way
merge between E (as the merge base) and G and A.  After detecting the
rename between E:oldfile and G:newfile, there will be a three-way
content merge of the following:
    E:oldfile
    G:newfile
    A:oldfile
and produce a new result:
    A':newfile

Now, when we want to pick B onto A', we will need to do a three-way
merge between A (as the merge-base) and A' and B.  This will involve
a three-way content merge of
    A:oldfile
    A':newfile
    B:oldfile
but only if we can detect that A:oldfile is similar enough to A':newfile
to be used together in a three-way content merge, i.e. only if we can
detect that A:oldfile and A':newfile are a rename.  But we already know
that A:oldfile and A':newfile are similar enough to be used in a
three-way content merge, because that is precisely where A':newfile came
from in the previous merge.

Note that A &amp; A' both appear in both merges.  That gives us the
condition under which we can reuse renames.

There are a couple important points about this optimization:
  - If the rebase or cherry-pick halts for user conflicts, these caches
    are NOT saved anywhere.  Thus, resuming a halted rebase or
    cherry-pick will result in no reused renames for the next commit.
    This is intentional, as user resolution can change files
    significantly and in ways that violate the similarity assumptions
    here.
  - Technically, this might give different results for rename detection.
    In particular, looking at the first merge above for oldfile and
    newfile, if both A:oldfile and G:newfile were each 40% different
    from E:oldfile, and different from each other, and they still merged
    cleanly, then A':newfile could be more than 50% different from
    A:oldfile.  This would mean that traditionally the next step of the
    rebase operation, moving B to B', would not detect the rename
    between A:oldfile and A':newfile.  This most likely would have
    resulted in a modify/delete conflict and the rebase operation making
    the user resolve the problem.  With this optimization, the rename
    would be detected and the operation would continue...but the odds of
    the successful merging with such different files seems somewhat low
    and thus seems likely that it would also halt and tell the user to
    resolve the conflict.  The odds of this happening in practice are
    small...and even if it did occur, there's a pedantic question about
    whether this would be considered a regression or a bugfix.  I am not
    sure which it would be considered, but given that rename detection
    has always been heuristics and the original rules started with a
    weird dipole with funny corner cases:
      - files with the exact same name are considered the same file,
        no matter how dissimilar the content
      - any other two files are considered the same file only if they
        are the best possible content match (highest similarity)
        regardless of filename similarity
    I don't see this optimization and minor change in behavior as out of
    place.  Much like the above two cases, even if a corner case is hit,
    it is at least fairly easy for people to understand why the
    algorithm did what it did -- somewhere in the sequence those files
    were related.  Much like the technical difference in rename detection
    provided by basename-guided matching, this theoretical difference in
    results is totally worth the potential time savings.

Signed-off-by: Elijah Newren &lt;newren@gmail.com&gt;
diff --git a/merge-ort.c b/merge-ort.c
@@ -138,6 +138,30 @@ struct rename_info {
 	int callback_data_nr, callback_data_alloc;
 	char *callback_data_traverse_path;
 
+	/*
+	 * merge_trees: trees passed to the merge algorithm for the merge
+	 *
+	 * merge_trees records the trees passed to the merge algorithm.  But,
+	 * this data also is stored in merge_result->priv.  If a sequence of
+	 * merges are being done (such as when cherry-picking or rebasing),
+	 * the next merge can look at this and re-use information from
+	 * previous merges under certain cirumstances.
+	 *
+	 * See also all the cached_* variables.
+	 */
+	struct tree *merge_trees[3];
+
+	/*
+	 * cached_pairs_valid_side: which side's cached info can be reused
+	 *
+	 * See the description for merge_trees.  For repeated merges, at most
+	 * only one side's cached information can be used.  Valid values:
+	 *   MERGE_SIDE2: cached data from side2 can be reused
+	 *   MERGE_SIDE1: cached data from side1 can be reused
+	 *   0:           no cached data can be reused
+	 */
+	int cached_pairs_valid_side;
+
 	/*
 	 * cached_pairs: Caching of renames and deletions.
 	 *
@@ -447,6 +471,8 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
 		strmap_func(&renames->cached_pairs[i], 1);
 		strset_func(&renames->cached_irrelevant[i]);
 	}
+	renames->cached_pairs_valid_side = 0;
+	renames->dir_rename_mask = 0;
 
 	if (!reinitialize) {
 		struct hashmap_iter iter;
@@ -469,8 +495,6 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
 		strmap_clear(&opti->output, 0);
 	}
 
-	renames->dir_rename_mask = 0;
-
 	/* Clean out callback_data as well. */
 	FREE_AND_NULL(renames->callback_data);
 	renames->callback_data_nr = renames->callback_data_alloc = 0;
@@ -3615,6 +3639,35 @@ static void merge_start(struct merge_options *opt, struct merge_result *result)
 	trace2_region_leave("merge", "allocate/init", opt->repo);
 }
 
+static void merge_check_renames_reusable(struct merge_options *opt,
+					 struct merge_result *result,
+					 struct tree *merge_base,
+					 struct tree *side1,
+					 struct tree *side2)
+{
+	struct rename_info *renames;
+	struct tree **merge_trees;
+	struct merge_options_internal *opti = result->priv;
+
+	if (!opti)
+		return;
+
+	renames = &opti->renames;
+	merge_trees = renames->merge_trees;
+	/* merge_trees[0..2] will only be NULL if opti is */
+	assert(merge_trees[0] && merge_trees[1] && merge_trees[2]);
+
+	/* Check if we meet a condition for re-using cached_pairs */
+	if (     oideq(&merge_base->object.oid, &merge_trees[2]->object.oid) &&
+		 oideq(     &side1->object.oid, &result->tree->object.oid))
+		renames->cached_pairs_valid_side = MERGE_SIDE1;
+	else if (oideq(&merge_base->object.oid, &merge_trees[1]->object.oid) &&
+		 oideq(     &side2->object.oid, &result->tree->object.oid))
+		renames->cached_pairs_valid_side = MERGE_SIDE2;
+	else
+		renames->cached_pairs_valid_side = 0; /* neither side valid */
+}
+
 /*** Function Grouping: merge_incore_*() and their internal variants ***/
 
 /*
@@ -3755,7 +3808,16 @@ void merge_incore_nonrecursive(struct merge_options *opt,
 
 	trace2_region_enter("merge", "merge_start", opt->repo);
 	assert(opt->ancestor != NULL);
+	merge_check_renames_reusable(opt, result, merge_base, side1, side2);
 	merge_start(opt, result);
+	/*
+	 * Record the trees used in this merge, so if there's a next merge in
+	 * a cherry-pick or rebase sequence it might be able to take advantage
+	 * of the cached_pairs in that next merge.
+	 */
+	opt->priv->renames.merge_trees[0] = merge_base;
+	opt->priv->renames.merge_trees[1] = side1;
+	opt->priv->renames.merge_trees[2] = side2;
 	trace2_region_leave("merge", "merge_start", opt->repo);
 
 	merge_ort_nonrecursive_internal(opt, merge_base, side1, side2, result);