Add cancellation and resource monitoring to IMonitor (used for AutoML experiments) #6471

andrasfuchs · 2022-11-25T09:24:17Z

Fixes #6320, #6465, #6426, #6425 and helps investigate further problems with AutoML trials.

This PR lets the user skip trials based on various performance metrics. It significantly improved my experience with AutoML experiments, because I regularly had crashes and failed trials when I ran experiments for a long time. With this modification I could implement my own IMonitor and react to changes in memory demand or disk space, or skip a trial if it was running unexpectedly long.
Before the modifications, my experiments usually stopped with an error within a few hours, after 15-20 trials, but now I have just had my longest successful run: 12 hours and 729 trials!
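
To illustrate the idea only, here is a minimal, language-agnostic sketch in Python of a monitor that receives periodic resource reports and decides whether a trial should be skipped. It is not the code in this PR and not the ML.NET API: `ResourceSample`, `ThresholdMonitor`, `report_resource_usage`, and the threshold values are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceSample:
    """One resource report for a running trial (hypothetical shape)."""
    trial_id: int
    memory_gb: float
    cpu_percent: float
    elapsed_minutes: float


@dataclass
class ThresholdMonitor:
    """Flags a trial for skipping when it exceeds a memory or time budget."""
    max_memory_gb: float
    max_minutes: float
    skipped: set = field(default_factory=set)

    def report_resource_usage(self, sample: ResourceSample) -> bool:
        """Return True if the caller should cancel the trial."""
        over_memory = sample.memory_gb > self.max_memory_gb
        over_time = sample.elapsed_minutes > self.max_minutes
        if over_memory or over_time:
            self.skipped.add(sample.trial_id)
            return True
        return False


# Example budgets (assumed values, not taken from the PR).
monitor = ThresholdMonitor(max_memory_gb=30.0, max_minutes=5.0)

# A sample within budget: the trial keeps running.
keep_running = monitor.report_resource_usage(ResourceSample(726, 26.3, 94.5, 0.3))

# A sample that has run too long: the trial is flagged for skipping.
too_long = monitor.report_resource_usage(ResourceSample(727, 26.4, 93.0, 6.0))
```

In this sketch the monitor only reports a decision; actually stopping the trial is left to whatever runs it, which is the part this PR tries to address on the AutoML side.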

I include my logs that were generated by my custom IMonitor implementation as an example:

03:41:00.5 info:   Resources Trial # 725 - Pipeline: LightGbmRegression                  -   Memory:       26,3 GB -      CPU:         94,5% - Duration:   2,0 minutes
03:41:03.9 info:   Completed Trial # 725 - Pipeline: LightGbmRegression                  -   Metric:        0,7367 - Duration:   2,1 minutes
03:41:03.9 info:  ------------------------------------------------------------------------------------------------------------------------------------------------------
03:41:03.9 info:     Running Trial # 726
03:41:10.0 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:        5,0 GB -      CPU:         45,9% - Duration:   0,1 minutes
03:41:15.9 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,3 GB -      CPU:         88,1% - Duration:   0,2 minutes
03:41:21.9 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,3 GB -      CPU:         94,5% - Duration:   0,3 minutes
03:41:27.9 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,3 GB -      CPU:         94,7% - Duration:   0,4 minutes
03:41:33.9 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,3 GB -      CPU:         93,3% - Duration:   0,5 minutes
03:41:39.9 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,3 GB -      CPU:         94,0% - Duration:   0,6 minutes
03:41:45.9 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,3 GB -      CPU:         94,7% - Duration:   0,7 minutes
03:41:51.9 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,3 GB -      CPU:         93,5% - Duration:   0,8 minutes
03:41:58.0 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,3 GB -      CPU:         93,7% - Duration:   0,9 minutes
03:42:03.9 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         93,0% - Duration:   1,0 minutes
03:42:09.9 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         91,6% - Duration:   1,1 minutes
03:42:16.0 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         91,5% - Duration:   1,2 minutes
03:42:21.9 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         93,5% - Duration:   1,3 minutes
03:42:27.9 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         92,6% - Duration:   1,4 minutes
03:42:33.9 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         93,2% - Duration:   1,5 minutes
03:42:40.0 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         92,6% - Duration:   1,6 minutes
03:42:45.9 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         93,2% - Duration:   1,7 minutes
03:42:51.9 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         93,6% - Duration:   1,8 minutes
03:42:58.0 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         93,9% - Duration:   1,9 minutes
03:43:03.9 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         94,7% - Duration:   2,0 minutes
03:43:10.0 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       21,0 GB -      CPU:         80,9% - Duration:   2,1 minutes
03:43:10.1 info:   Completed Trial # 726 - Pipeline: LightGbmRegression                  -   Metric:        0,7325 - Duration:   2,1 minutes
03:43:10.1 info:  ------------------------------------------------------------------------------------------------------------------------------------------------------
03:43:10.1 info:     Running Trial # 727
03:43:14.7 info:   Completed Trial # 727 - Pipeline: LbfgsPoissonRegressionRegression    -   Metric:        0,2601 - Duration:   0,1 minutes
03:43:14.7 info:  ------------------------------------------------------------------------------------------------------------------------------------------------------
03:43:14.7 info:     Running Trial # 728
03:43:20.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:        4,9 GB -      CPU:         46,6% - Duration:   0,1 minutes
03:43:26.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         92,4% - Duration:   0,2 minutes
03:43:32.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         85,5% - Duration:   0,3 minutes
03:43:38.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         92,4% - Duration:   0,4 minutes
03:43:44.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         92,9% - Duration:   0,5 minutes
03:43:50.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         93,2% - Duration:   0,6 minutes
03:43:56.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         94,1% - Duration:   0,7 minutes
03:44:02.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         93,2% - Duration:   0,8 minutes
03:44:08.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         91,5% - Duration:   0,9 minutes
03:44:14.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         92,8% - Duration:   1,0 minutes
03:44:20.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         93,7% - Duration:   1,1 minutes
03:44:26.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         93,6% - Duration:   1,2 minutes
03:44:32.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         91,7% - Duration:   1,3 minutes
03:44:38.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         94,7% - Duration:   1,4 minutes
03:44:44.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         91,0% - Duration:   1,5 minutes
03:44:50.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         93,7% - Duration:   1,6 minutes
03:44:56.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         93,0% - Duration:   1,7 minutes
03:45:02.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         93,9% - Duration:   1,8 minutes
03:45:08.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         93,3% - Duration:   1,9 minutes
03:45:14.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         92,3% - Duration:   2,0 minutes
03:45:17.2 info:   Completed Trial # 728 - Pipeline: LightGbmRegression                  -   Metric:        0,7247 - Duration:   2,0 minutes
03:45:17.2 info:  ------------------------------------------------------------------------------------------------------------------------------------------------------
03:45:17.2 info:     Running Trial # 729
03:45:23.3 info:   Resources Trial # 729 - Pipeline: LightGbmRegression                  -   Memory:        5,5 GB -      CPU:         44,8% - Duration:   0,1 minutes
03:45:29.2 info:   Resources Trial # 729 - Pipeline: LightGbmRegression                  -   Memory:       26,8 GB -      CPU:         93,0% - Duration:   0,2 minutes
03:45:35.2 info:   Resources Trial # 729 - Pipeline: LightGbmRegression                  -   Memory:       26,8 GB -      CPU:         93,5% - Duration:   0,3 minutes
03:45:41.2 info:   Resources Trial # 729 - Pipeline: LightGbmRegression                  -   Memory:       26,8 GB -      CPU:         92,4% - Duration:   0,4 minutes
03:45:47.2 info:   Resources Trial # 729 - Pipeline: LightGbmRegression                  -   Memory:       26,8 GB -      CPU:         93,8% - Duration:   0,5 minutes
03:45:53.2 info:   Resources Trial # 729 - Pipeline: LightGbmRegression                  -   Memory:       26,8 GB -      CPU:         92,1% - Duration:   0,6 minutes
03:45:59.2 info:   Resources Trial # 729 - Pipeline: LightGbmRegression                  -   Memory:       26,8 GB -      CPU:         94,1% - Duration:   0,7 minutes
03:46:05.2 info:   Resources Trial # 729 - Pipeline: LightGbmRegression                  -   Memory:       26,8 GB -      CPU:         94,5% - Duration:   0,8 minutes
03:46:11.2 info:   Resources Trial # 729 - Pipeline: LightGbmRegression                  -   Memory:       26,8 GB -      CPU:         93,9% - Duration:   0,9 minutes
03:46:16.5 info:  AutoML result: 0,7441799613917761. Saving model as 'o:\Work\BBD.BodyMonitor\BBD_20221122__TrainingData.Sleep.MLP12__MLP12_0p25Hz-250Hz__Session_SegmentedData_Sleep_Level__15782rows.zip'


andrasfuchs · 2022-11-26T10:49:14Z

Update: this solves the cancellation problem only partially (see #6465), so I am closing this PR for now.
I'll resubmit it if I find a proper fix.

andrasfuchs added 4 commits November 25, 2022 09:48

Fix a typo

9647027

Fix trial cancellation bug

269b1bd

Move performance related properties to TrialPerformanceMetrics and add ReportTrialResourceUsage event to IMonitor

a2c5781

Add new class and property explanations

e3fd992

ghost added the community-contribution label Nov 25, 2022

andrasfuchs closed this Nov 26, 2022

ghost locked as resolved and limited conversation to collaborators Jan 6, 2023
