[ML] Repeated info in log message from all ML nodes leads to long message #29950


Closed
elasticmachine opened this issue Jun 7, 2017 · 8 comments
Assignees
Labels
>bug :ml Machine learning v6.4.0

Comments

@elasticmachine
Collaborator

Original comment by @sophiec20:

Found in 5.5 "build" : { "hash" : "62e486b", "date" : "2017-06-06T13:54:18.605Z" }

23 node cluster, of which 3 are master and 20 are ML and data nodes.
Once I reached the limit of max open jobs, the following error occurred when trying to open a job.

The message repeats the same per-node error from all 20 ML nodes. This will only get worse as the number of nodes increases.

Not a priority, considering we are only recommending a small number of dedicated nodes.

== Opening job for streamingtv299...
{"error":{"root_cause":[{"type":"status_exception","reason":"Could not open job because no suitable nodes were found, allocation explanation [Not opening job [streamingtv299] on node [{ip-10-0-3-237}{iVQhSWkEQV2QBcNgAbry5Q}{L2IqCUuAQLymdOw47osyVg}{10.0.3.237}{10.0.3.237:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-50}{HhByQF_ORUqD0NiClqPArg}{pmkQnJAhRBeUUx6pCZzgtQ}{10.0.3.50}{10.0.3.50:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-231}{cA5e1J2yS4-E96bBlaRJ7w}{XBoP_nupR6K5qB9ve6V3KA}{10.0.3.231}{10.0.3.231:9300}], because this node isn't a ml node.|Not opening job [streamingtv299] on node [{ip-10-0-3-56}{9nywi9T5S5mz0MXEibFaCw}{wqon-rXpStm2SuIYfga7zA}{10.0.3.56}{10.0.3.56:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-162}{pvnUofCaRwOiO8LUiJ4TbQ}{Rwwf5uc0SKG9IS28nifBhw}{10.0.3.162}{10.0.3.162:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-196}{gZ36z6RYQfqKUaLqMTjiKw}{NFRpcdtPTYerbM5C1Hj3jQ}{10.0.3.196}{10.0.3.196:9300}], because this node isn't a ml node.|Not opening job [streamingtv299] on node [{ip-10-0-3-110}{VntXc0TpQKGzhK6OprukSw}{fam8-p92SWCf4QNg5I3dpA}{10.0.3.110}{10.0.3.110:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-126}{bkigpzP5SsW4j4Z0ox5g-Q}{bR8YYWd7TfGrGjv0sTsZ7Q}{10.0.3.126}{10.0.3.126:9300}{ml.enabled=true}], because this node is full. 
Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-209}{9rluv4g5RXKbzM9clOaWGA}{bvM7WC3DQhm9VLBzIwCSQg}{10.0.3.209}{10.0.3.209:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-203}{J8r3z4gpQBW7DjAhLaMdnA}{Fq7BKus3T02JSJt7DoA5Cg}{10.0.3.203}{10.0.3.203:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-58}{c-jxUMWFR9S5HUasmhppaw}{RpTzjgLfTnag4kt__Qd1UQ}{10.0.3.58}{10.0.3.58:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-57}{x1u2qSaZQ9mSuwzsc9uS5w}{akMho_T0TJSttYCrbiolVw}{10.0.3.57}{10.0.3.57:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-197}{cWYjfUh8R2Gyut_sQGOG3Q}{qjZUBsfdRsuTazL_Lc2f6Q}{10.0.3.197}{10.0.3.197:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-215}{-eE-JaQMQgqYdRB4FJY5KQ}{WYfKVgAdT4yC2I7cpIa1Cg}{10.0.3.215}{10.0.3.215:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-41}{oyVlA-MBRW6dzjhWNwDAWg}{QUbWs3F4SUOOYx3bMDdXAA}{10.0.3.41}{10.0.3.41:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-68}{A4G6q0_QR7WIjuCCr0zfMQ}{1jJwyj94QoKLSKRW4LQjlw}{10.0.3.68}{10.0.3.68:9300}{ml.enabled=true}], because this node is full. 
Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-228}{VIxQSGk-SGuhcoPjYnMlgg}{lqz7p1NLTnWpsJyfG92-QA}{10.0.3.228}{10.0.3.228:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-44}{_lVGfCsVQ6eZlRc7SKqQWw}{Lwepw_9tQnefUECnO11fpQ}{10.0.3.44}{10.0.3.44:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-32}{wqVpvt-cT1aHdB8ASGh_Rg}{7rqe4GuqTAW7iGrvc3uaUw}{10.0.3.32}{10.0.3.32:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-182}{5nZNKrlKTdOt1aZW75AkYQ}{kDOS7T6cS8KWtKnDjRSqpA}{10.0.3.182}{10.0.3.182:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-47}{MQlVJm70SFSu3ZElRf2Hlg}{2TWEPfQQQQaefdIvSyx2MA}{10.0.3.47}{10.0.3.47:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-10}{C2y3cPk_QjyyZDZb-GHKwQ}{ePrCTvllTLaMiIqRjOJVWA}{10.0.3.10}{10.0.3.10:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-33}{GhXTuf3hT5WDUEYciix-yw}{gONvo5mLRL2_VGWOugPfJA}{10.0.3.33}{10.0.3.33:9300}], because this node isn't a ml node.]"}],"type":"status_exception","reason":"Could not open job because no suitable nodes were found, allocation explanation [Not opening job [streamingtv299] on node [{ip-10-0-3-237}{iVQhSWkEQV2QBcNgAbry5Q}{L2IqCUuAQLymdOw47osyVg}{10.0.3.237}{10.0.3.237:9300}{ml.enabled=true}], because this node is full. 
Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-50}{HhByQF_ORUqD0NiClqPArg}{pmkQnJAhRBeUUx6pCZzgtQ}{10.0.3.50}{10.0.3.50:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-231}{cA5e1J2yS4-E96bBlaRJ7w}{XBoP_nupR6K5qB9ve6V3KA}{10.0.3.231}{10.0.3.231:9300}], because this node isn't a ml node.|Not opening job [streamingtv299] on node [{ip-10-0-3-56}{9nywi9T5S5mz0MXEibFaCw}{wqon-rXpStm2SuIYfga7zA}{10.0.3.56}{10.0.3.56:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-162}{pvnUofCaRwOiO8LUiJ4TbQ}{Rwwf5uc0SKG9IS28nifBhw}{10.0.3.162}{10.0.3.162:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-196}{gZ36z6RYQfqKUaLqMTjiKw}{NFRpcdtPTYerbM5C1Hj3jQ}{10.0.3.196}{10.0.3.196:9300}], because this node isn't a ml node.|Not opening job [streamingtv299] on node [{ip-10-0-3-110}{VntXc0TpQKGzhK6OprukSw}{fam8-p92SWCf4QNg5I3dpA}{10.0.3.110}{10.0.3.110:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-126}{bkigpzP5SsW4j4Z0ox5g-Q}{bR8YYWd7TfGrGjv0sTsZ7Q}{10.0.3.126}{10.0.3.126:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-209}{9rluv4g5RXKbzM9clOaWGA}{bvM7WC3DQhm9VLBzIwCSQg}{10.0.3.209}{10.0.3.209:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-203}{J8r3z4gpQBW7DjAhLaMdnA}{Fq7BKus3T02JSJt7DoA5Cg}{10.0.3.203}{10.0.3.203:9300}{ml.enabled=true}], because this node is full. 
Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-58}{c-jxUMWFR9S5HUasmhppaw}{RpTzjgLfTnag4kt__Qd1UQ}{10.0.3.58}{10.0.3.58:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-57}{x1u2qSaZQ9mSuwzsc9uS5w}{akMho_T0TJSttYCrbiolVw}{10.0.3.57}{10.0.3.57:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-197}{cWYjfUh8R2Gyut_sQGOG3Q}{qjZUBsfdRsuTazL_Lc2f6Q}{10.0.3.197}{10.0.3.197:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-215}{-eE-JaQMQgqYdRB4FJY5KQ}{WYfKVgAdT4yC2I7cpIa1Cg}{10.0.3.215}{10.0.3.215:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-41}{oyVlA-MBRW6dzjhWNwDAWg}{QUbWs3F4SUOOYx3bMDdXAA}{10.0.3.41}{10.0.3.41:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-68}{A4G6q0_QR7WIjuCCr0zfMQ}{1jJwyj94QoKLSKRW4LQjlw}{10.0.3.68}{10.0.3.68:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-228}{VIxQSGk-SGuhcoPjYnMlgg}{lqz7p1NLTnWpsJyfG92-QA}{10.0.3.228}{10.0.3.228:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-44}{_lVGfCsVQ6eZlRc7SKqQWw}{Lwepw_9tQnefUECnO11fpQ}{10.0.3.44}{10.0.3.44:9300}{ml.enabled=true}], because this node is full. 
Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-32}{wqVpvt-cT1aHdB8ASGh_Rg}{7rqe4GuqTAW7iGrvc3uaUw}{10.0.3.32}{10.0.3.32:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-182}{5nZNKrlKTdOt1aZW75AkYQ}{kDOS7T6cS8KWtKnDjRSqpA}{10.0.3.182}{10.0.3.182:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-47}{MQlVJm70SFSu3ZElRf2Hlg}{2TWEPfQQQQaefdIvSyx2MA}{10.0.3.47}{10.0.3.47:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-10}{C2y3cPk_QjyyZDZb-GHKwQ}{ePrCTvllTLaMiIqRjOJVWA}{10.0.3.10}{10.0.3.10:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-33}{GhXTuf3hT5WDUEYciix-yw}{gONvo5mLRL2_VGWOugPfJA}{10.0.3.33}{10.0.3.33:9300}], because this node isn't a ml node.]"},"status":429}
@elasticmachine
Collaborator Author

Original comment by @sophiec20:

These jobs were created using a script; however, nothing appeared in Job Messages.

If I try to open the 201st job from the console, the red error message is somewhat overwhelming.
[screenshot of the error message]

@elasticmachine
Collaborator Author

Original comment by @dimitris-athanasiou:

I guess we can have the full explanation shown in the job-stats but when it comes to the error message we can truncate.
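A minimal sketch of that idea (hypothetical helper, not the actual Elasticsearch code; the limit and method names are assumptions): keep the full allocation explanation available via job stats, but cap what goes into the error message.

```java
// Hypothetical sketch: truncate the per-node allocation explanation before
// putting it into an error message, while the full text remains available
// elsewhere (e.g. job stats).  MAX_ERROR_EXPLANATION_CHARS is an assumed limit.
public class ExplanationTruncation {
    static final int MAX_ERROR_EXPLANATION_CHARS = 1024;

    static String truncateForError(String fullExplanation) {
        if (fullExplanation.length() <= MAX_ERROR_EXPLANATION_CHARS) {
            return fullExplanation;
        }
        // Keep the head of the explanation and point the user at the full version
        return fullExplanation.substring(0, MAX_ERROR_EXPLANATION_CHARS)
            + "... (truncated; see job stats for the full allocation explanation)";
    }

    public static void main(String[] args) {
        String full = "Not opening job [x] on node [...], because this node is full. ".repeat(40);
        System.out.println(truncateForError(full).length());
    }
}
```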

@elasticmachine
Collaborator Author

Original comment by @droberts195:

This has also been seen by @richcollier, so we might want to reconsider the priority.

Since the impact is biggest in the UI, and the full message could actually be useful in the back-end log file in certain circumstances, I think the solution is to use Elasticsearch's root_cause logic: the full exception report contains the detailed reason, but not in the most deeply nested ElasticsearchException. The most deeply nested ElasticsearchException is considered the root_cause, and is what gets displayed in the UI.
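A simplified stand-in illustrating this structure (not the real Elasticsearch classes; the class and method names here are assumptions): the short message goes in an outer ElasticsearchException, the long per-node explanation in an inner IllegalStateException. Because root_cause is defined as the innermost ElasticsearchException in the cause chain, the UI would show only the short message.

```java
// Sketch of the proposed exception nesting.  "ElasticsearchException" below is
// a minimal stand-in for org.elasticsearch.ElasticsearchException.
public class RootCauseSketch {
    static class ElasticsearchException extends RuntimeException {
        ElasticsearchException(String msg, Throwable cause) { super(msg, cause); }
    }

    static ElasticsearchException noSuitableNodes(String jobId, String detailedExplanation) {
        // Detail lives in a non-ElasticsearchException, so it cannot be the root_cause
        Exception detail = new IllegalStateException(detailedExplanation);
        return new ElasticsearchException(
            "Could not open job [" + jobId + "] because no suitable ML nodes were found",
            detail);
    }

    // root_cause = the innermost ElasticsearchException in the cause chain
    static Throwable rootCause(Throwable t) {
        Throwable root = t;
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof ElasticsearchException) {
                root = c;
            }
        }
        return root;
    }

    public static void main(String[] args) {
        ElasticsearchException e =
            noSuitableNodes("streamingtv299", "Not opening job ... (one entry per node)");
        // The outer exception is the innermost ElasticsearchException,
        // so the UI-facing root cause carries only the short message.
        System.out.println(rootCause(e).getMessage());
    }
}
```

This matches the shape of the eventual fix described in the commit below, where the detailed explanation sits in an inner IllegalStateException.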

@elasticmachine
Collaborator Author

Original comment by @sophiec20:

After the 6.1 additions for checking established model memory, I now experience this error message more often. This is because I open and close a lot of jobs repeatedly. These may or may not have an established model memory, and all my scripts currently use the default 1GB. These conditions should be less likely in a production environment, although there may be evaluation users who are also impacted.

This is an example of the error message when there is no available node to open the job on, and this is only a 2-node cluster. The end-user should really only see a limited amount of information saying there is no node available to open the job on at the moment.

This looks much worse than it really is. For starters, do we need the node ID and IP details?

[screenshot of the error message]

@elasticmachine
Collaborator Author

Original comment by @droberts195:

We should try to get this into 6.3 as ML in Cloud will also be affected by this. It is particularly bad in clusters with large numbers of nodes but small numbers of ML nodes.

@elasticmachine
Collaborator Author

Original comment by @davidkyle:

This looks much worse than it really is. For starters, do we need the node ID and IP details?

I raised LINK REDACTED to remove these fields

@elasticmachine
Collaborator Author

Original comment by @sophiec20:

This looks much worse than it really is. For starters, do we need the node ID and IP details?

They clutter up GET job too.

Once they are removed from the message, I have a hunch it will look just as silly, as the repeats will be more obvious.

@skearns64
Contributor

I expect this to become a lot more common once ML is available on Cloud, and we see users running with smaller ML nodes.

Ideally, this would be a simple, easy-to-understand error message, highlighting the fact that the job couldn't be opened because there were no ML-capable nodes with enough memory. I believe the full detail of the allocation explanation is valuable, but it may belong in the job messages tab?

@droberts195 droberts195 self-assigned this Jul 9, 2018
droberts195 added a commit to droberts195/elasticsearch that referenced this issue Jul 10, 2018
When an ML job cannot be allocated to a node the exception
contained an explanation of why the job couldn't be
allocated to each node in the cluster.  For large clusters
this was not particularly easy to read and made the error
displayed in the UI look very scary.

This commit changes the structure of the error to an outer
ElasticsearchException with a high level message and an
inner IllegalStateException containing the detailed
explanation.  Because the definition of root cause is the
innermost ElasticsearchException the detailed explanation
will not be the root cause (which is what Kibana displays).

Fixes elastic#29950
droberts195 added a commit that referenced this issue Jul 13, 2018
droberts195 added a commit that referenced this issue Jul 13, 2018