[ML] Multiclass maximise minimum recall #1113
Conversation
}
doReduce(frame.readRows(numberThreads, 0, frame.numberRows(),
                        readCategoryCounts, &rowMask),
         copyCategoryCounts, reduceCategoryCounts, result);
The check was unnecessary here since readCategoryCounts can't fail.
Good work altogether. I just have one comment wrt. the initialization of w0 in the optimization procedure.
// We want to solve max_w{min_j{recall(class_j)}} = max_w{min_j{c_j(w) / n_j}}
// where c_j(w) and n_j are the number of correct predictions for weight w and
// the count of class_j in the sample set, respectively. We use the equivalent
// formulation
//
//   min_w{max_j{f_j(w)}} = min_w{max_j{1 - c_j(w) / n_j}}
//
// We can write f_j(w) as
//
//   f_j(w) = sum_i{1 - 1{argmax_k(w_k p_k(i)) == j}} / n_j                  (1)
//
// where the sum runs over the samples of class_j, 1{.} denotes the indicator
// function and p_k(i) is the predicted probability of class_k for sample i.
// (1) has a smooth relaxation obtained by replacing the indicator with
// softmax_j(w_1 p_1(i), ..., w_K p_K(i)). Note that the resulting objective
// isn't convex so we use multiple restarts.
nice! 👍
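For context (not part of the diff), the following is a minimal sketch of how the smoothed objective max_j{f_j(w)} from the comment above could be evaluated. The names smoothObjective, probabilities and classOf are illustrative assumptions, not the actual ml-cpp API.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch: evaluate max_j{f_j(w)} with the softmax relaxation, where f_j(w) is
// the sum over samples of class j of (1 - softmax_j(weighted probabilities)),
// divided by n_j. probabilities[i] holds the predicted class probabilities for
// sample i and classOf[i] its true class; both names are assumed for illustration.
double smoothObjective(const std::vector<double>& w,
                       const std::vector<std::vector<double>>& probabilities,
                       const std::vector<std::size_t>& classOf,
                       std::size_t numberClasses) {
    std::vector<double> loss(numberClasses, 0.0);
    std::vector<double> counts(numberClasses, 0.0);
    for (std::size_t i = 0; i < probabilities.size(); ++i) {
        // Softmax of the weighted probabilities, evaluated at the true class.
        double norm = 0.0;
        for (std::size_t k = 0; k < numberClasses; ++k) {
            norm += std::exp(w[k] * probabilities[i][k]);
        }
        std::size_t j = classOf[i];
        loss[j] += 1.0 - std::exp(w[j] * probabilities[i][j]) / norm;
        counts[j] += 1.0;
    }
    double result = 0.0;
    for (std::size_t j = 0; j < numberClasses; ++j) {
        if (counts[j] > 0.0) {
            result = std::max(result, loss[j] / counts[j]);
        }
    }
    return result;
}

Minimising this over w (subject to the caveat that it is non-convex, hence the restarts) pushes up the worst per-class recall.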
lib/maths/CDataFrameUtils.cc (outdated)
for (std::size_t j = 0; j < numberClasses; ++j) {
    interpolate(j) = CSampling::uniformSample(rng, 0.0, 1.0);
}
w0 = (a + interpolate.cwiseProduct(b - a)).array().exp();
It seems to me that this should be at the beginning of the for-loop. Otherwise, you try with w0 = (1, 1, ..., 1) first, and this can be outside of your bounds.
I actually intended to do this. My thought was that we should always include the best solution in the vicinity of doing nothing, i.e. all weights being equal. Since we use line search with backtracking, we then guarantee that we never do worse than not reweighting at all, which is a nice property since the optimisation objective is complex. WDYT?
I reworked the initialisation slightly and added a comment.
I still think it is worth trying the all-ones vector for the reason outlined above. I also now bake in the fact that we expect the weights to be (roughly) a monotonically decreasing function of the class recalls. (This isn't guaranteed because it depends on how close the probabilities are and on what the predicted classes are for the error cases.)
I also reduced the number of restarts: trying a wider variety of class counts and recall ranges, I didn't see evidence that we needed as many restarts after this change. See this commit.
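To make the initialisation strategy being discussed concrete, here is a hedged sketch: the first candidate is the all-ones weight vector, so with a backtracking line search the result is never worse than not reweighting, and subsequent restarts are sampled inside the log-space bounds as in the diff above. The names objective, minimise, a and b are assumptions for illustration, not the real ml-cpp routines.

#include <Eigen/Dense>
#include <cstddef>
#include <functional>
#include <random>

// Sketch of a multi-restart search for the class weights. "minimise" stands in
// for a local optimisation with backtracking line search starting from w0, and
// "objective" for the smoothed min-recall objective; a and b are lower and
// upper bounds on the log-weights.
Eigen::VectorXd bestWeights(const std::function<double(const Eigen::VectorXd&)>& objective,
                            const std::function<Eigen::VectorXd(const Eigen::VectorXd&)>& minimise,
                            const Eigen::VectorXd& a,
                            const Eigen::VectorXd& b,
                            std::size_t numberRestarts) {
    std::mt19937 rng;
    std::uniform_real_distribution<double> u01{0.0, 1.0};

    // First candidate: equal weights, i.e. leave the predicted classes unchanged.
    Eigen::VectorXd w0 = Eigen::VectorXd::Ones(a.size());
    Eigen::VectorXd best = minimise(w0);
    double fbest = objective(best);

    for (std::size_t restart = 1; restart < numberRestarts; ++restart) {
        // Random restart: interpolate uniformly between the bounds in log space.
        Eigen::VectorXd interpolate(a.size());
        for (Eigen::Index j = 0; j < a.size(); ++j) {
            interpolate(j) = u01(rng);
        }
        w0 = (a + interpolate.cwiseProduct(b - a)).array().exp().matrix();
        Eigen::VectorXd candidate = minimise(w0);
        double f = objective(candidate);
        if (f < fbest) {
            fbest = f;
            best = candidate;
        }
    }
    return best;
}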
Thank you for explaining. LGTM
This implements maximising the minimum class recall for multiclass classification when assigning class labels, which is often a better objective when the classes are imbalanced in the training data.