Add cancellation and resource monitoring to IMonitor (used for AutoML experiments) #6471

andrasfuchs
Contributor

Fixes #6320, #6465, #6426, #6425 and helps investigate further problems with AutoML trials.

This PR lets the user skip trials based on various performance metrics. It improved my experience with AutoML experiments significantly, because I regularly had crashes and failed trials when I ran experiments for a long time. With this modification I could implement my own IMonitor and react to changes in memory demand or disk space, or skip a trial that was running unexpectedly long.
Before the modifications, my experiments usually stopped with an error within a few hours, after 15-20 trials. Now I have just had my longest successful run: 12 hours and 729 trials!
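
To illustrate the kind of decision a custom monitor can make, here is a minimal, language-neutral sketch in Python. It is not the ML.NET API and not the code in this PR: `ResourceSample`, `TrialLimits`, `reason_to_cancel` and all threshold values are hypothetical names and numbers chosen only for illustration. It assumes the monitor receives a periodic resource reading while a trial runs and must decide whether to skip that trial.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResourceSample:
    """One periodic reading taken while a trial runs (hypothetical shape)."""
    memory_gb: float
    cpu_percent: float
    elapsed_minutes: float
    free_disk_gb: float


@dataclass
class TrialLimits:
    """User-chosen thresholds; the defaults here are illustrative only."""
    max_memory_gb: float = 28.0
    max_minutes: float = 10.0
    min_free_disk_gb: float = 5.0


def reason_to_cancel(sample: ResourceSample, limits: TrialLimits) -> Optional[str]:
    """Return a reason to skip the running trial, or None to let it continue."""
    if sample.memory_gb > limits.max_memory_gb:
        return f"memory {sample.memory_gb:.1f} GB exceeds {limits.max_memory_gb:.1f} GB"
    if sample.elapsed_minutes > limits.max_minutes:
        return f"duration {sample.elapsed_minutes:.1f} min exceeds {limits.max_minutes:.1f} min"
    if sample.free_disk_gb < limits.min_free_disk_gb:
        return f"free disk {sample.free_disk_gb:.1f} GB below {limits.min_free_disk_gb:.1f} GB"
    return None
```

In a real monitor, a non-`None` result would be the point at which the running trial is cancelled and the experiment moves on to the next one.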

As an example, here are the logs generated by my custom IMonitor implementation:

03:41:00.5 info:   Resources Trial # 725 - Pipeline: LightGbmRegression                  -   Memory:       26,3 GB -      CPU:         94,5% - Duration:   2,0 minutes
03:41:03.9 info:   Completed Trial # 725 - Pipeline: LightGbmRegression                  -   Metric:        0,7367 - Duration:   2,1 minutes
03:41:03.9 info:  ------------------------------------------------------------------------------------------------------------------------------------------------------
03:41:03.9 info:     Running Trial # 726
03:41:10.0 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:        5,0 GB -      CPU:         45,9% - Duration:   0,1 minutes
03:41:15.9 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,3 GB -      CPU:         88,1% - Duration:   0,2 minutes
03:41:21.9 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,3 GB -      CPU:         94,5% - Duration:   0,3 minutes
03:41:27.9 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,3 GB -      CPU:         94,7% - Duration:   0,4 minutes
03:41:33.9 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,3 GB -      CPU:         93,3% - Duration:   0,5 minutes
03:41:39.9 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,3 GB -      CPU:         94,0% - Duration:   0,6 minutes
03:41:45.9 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,3 GB -      CPU:         94,7% - Duration:   0,7 minutes
03:41:51.9 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,3 GB -      CPU:         93,5% - Duration:   0,8 minutes
03:41:58.0 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,3 GB -      CPU:         93,7% - Duration:   0,9 minutes
03:42:03.9 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         93,0% - Duration:   1,0 minutes
03:42:09.9 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         91,6% - Duration:   1,1 minutes
03:42:16.0 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         91,5% - Duration:   1,2 minutes
03:42:21.9 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         93,5% - Duration:   1,3 minutes
03:42:27.9 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         92,6% - Duration:   1,4 minutes
03:42:33.9 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         93,2% - Duration:   1,5 minutes
03:42:40.0 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         92,6% - Duration:   1,6 minutes
03:42:45.9 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         93,2% - Duration:   1,7 minutes
03:42:51.9 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         93,6% - Duration:   1,8 minutes
03:42:58.0 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         93,9% - Duration:   1,9 minutes
03:43:03.9 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         94,7% - Duration:   2,0 minutes
03:43:10.0 info:   Resources Trial # 726 - Pipeline: LightGbmRegression                  -   Memory:       21,0 GB -      CPU:         80,9% - Duration:   2,1 minutes
03:43:10.1 info:   Completed Trial # 726 - Pipeline: LightGbmRegression                  -   Metric:        0,7325 - Duration:   2,1 minutes
03:43:10.1 info:  ------------------------------------------------------------------------------------------------------------------------------------------------------
03:43:10.1 info:     Running Trial # 727
03:43:14.7 info:   Completed Trial # 727 - Pipeline: LbfgsPoissonRegressionRegression    -   Metric:        0,2601 - Duration:   0,1 minutes
03:43:14.7 info:  ------------------------------------------------------------------------------------------------------------------------------------------------------
03:43:14.7 info:     Running Trial # 728
03:43:20.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:        4,9 GB -      CPU:         46,6% - Duration:   0,1 minutes
03:43:26.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         92,4% - Duration:   0,2 minutes
03:43:32.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         85,5% - Duration:   0,3 minutes
03:43:38.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         92,4% - Duration:   0,4 minutes
03:43:44.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         92,9% - Duration:   0,5 minutes
03:43:50.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         93,2% - Duration:   0,6 minutes
03:43:56.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         94,1% - Duration:   0,7 minutes
03:44:02.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         93,2% - Duration:   0,8 minutes
03:44:08.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         91,5% - Duration:   0,9 minutes
03:44:14.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         92,8% - Duration:   1,0 minutes
03:44:20.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         93,7% - Duration:   1,1 minutes
03:44:26.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         93,6% - Duration:   1,2 minutes
03:44:32.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         91,7% - Duration:   1,3 minutes
03:44:38.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         94,7% - Duration:   1,4 minutes
03:44:44.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         91,0% - Duration:   1,5 minutes
03:44:50.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         93,7% - Duration:   1,6 minutes
03:44:56.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         93,0% - Duration:   1,7 minutes
03:45:02.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         93,9% - Duration:   1,8 minutes
03:45:08.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         93,3% - Duration:   1,9 minutes
03:45:14.7 info:   Resources Trial # 728 - Pipeline: LightGbmRegression                  -   Memory:       26,4 GB -      CPU:         92,3% - Duration:   2,0 minutes
03:45:17.2 info:   Completed Trial # 728 - Pipeline: LightGbmRegression                  -   Metric:        0,7247 - Duration:   2,0 minutes
03:45:17.2 info:  ------------------------------------------------------------------------------------------------------------------------------------------------------
03:45:17.2 info:     Running Trial # 729
03:45:23.3 info:   Resources Trial # 729 - Pipeline: LightGbmRegression                  -   Memory:        5,5 GB -      CPU:         44,8% - Duration:   0,1 minutes
03:45:29.2 info:   Resources Trial # 729 - Pipeline: LightGbmRegression                  -   Memory:       26,8 GB -      CPU:         93,0% - Duration:   0,2 minutes
03:45:35.2 info:   Resources Trial # 729 - Pipeline: LightGbmRegression                  -   Memory:       26,8 GB -      CPU:         93,5% - Duration:   0,3 minutes
03:45:41.2 info:   Resources Trial # 729 - Pipeline: LightGbmRegression                  -   Memory:       26,8 GB -      CPU:         92,4% - Duration:   0,4 minutes
03:45:47.2 info:   Resources Trial # 729 - Pipeline: LightGbmRegression                  -   Memory:       26,8 GB -      CPU:         93,8% - Duration:   0,5 minutes
03:45:53.2 info:   Resources Trial # 729 - Pipeline: LightGbmRegression                  -   Memory:       26,8 GB -      CPU:         92,1% - Duration:   0,6 minutes
03:45:59.2 info:   Resources Trial # 729 - Pipeline: LightGbmRegression                  -   Memory:       26,8 GB -      CPU:         94,1% - Duration:   0,7 minutes
03:46:05.2 info:   Resources Trial # 729 - Pipeline: LightGbmRegression                  -   Memory:       26,8 GB -      CPU:         94,5% - Duration:   0,8 minutes
03:46:11.2 info:   Resources Trial # 729 - Pipeline: LightGbmRegression                  -   Memory:       26,8 GB -      CPU:         93,9% - Duration:   0,9 minutes
03:46:16.5 info:  AutoML result: 0,7441799613917761. Saving model as 'o:\Work\BBD.BodyMonitor\BBD_20221122__TrainingData.Sleep.MLP12__MLP12_0p25Hz-250Hz__Session_SegmentedData_Sleep_Level__15782rows.zip'
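
Logs in this format can be summarized after a run, for example to find the peak memory of each trial. The following is a rough Python sketch, assuming every resource line follows the layout shown above, including the decimal comma; `peak_memory_by_trial` is a hypothetical helper and not part of this PR.

```python
import re
from typing import Dict, Iterable

# Matches the "Resources Trial" lines from the log excerpt above.
RESOURCE_LINE = re.compile(
    r"Resources Trial #\s*(\d+) - Pipeline: (\S+)\s+-\s+"
    r"Memory:\s+([\d,]+) GB\s+-\s+CPU:\s+([\d,]+)%\s+-\s+"
    r"Duration:\s+([\d,]+) minutes"
)


def to_float(text: str) -> float:
    """Convert a number written with a decimal comma, e.g. '26,3'."""
    return float(text.replace(",", "."))


def peak_memory_by_trial(lines: Iterable[str]) -> Dict[int, float]:
    """Map each trial number to the highest memory reading (GB) seen for it."""
    peaks: Dict[int, float] = {}
    for line in lines:
        match = RESOURCE_LINE.search(line)
        if match is None:
            continue  # skip "Running", "Completed" and separator lines
        trial = int(match.group(1))
        memory_gb = to_float(match.group(3))
        peaks[trial] = max(peaks.get(trial, 0.0), memory_gb)
    return peaks
```

Lines that do not match the resource layout are ignored, so the whole log file can be passed in unchanged.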

@andrasfuchs
Contributor Author

andrasfuchs commented Nov 26, 2022

Update: this solves the cancellation problem only partially (see #6465), so I'm closing this PR for now.
I'll resubmit it if I find a proper fix.

@ghost ghost locked as resolved and limited conversation to collaborators Jan 6, 2023
Successfully merging this pull request may close these issues.

Add resource (CPU,RAM,GPU,thread count) monitoring to AutoML experiments