[ML] Repeated info in log message from all ML nodes leads to long message #29950


Closed
elasticmachine opened this issue Jun 7, 2017 · 8 comments
Assignees
Labels
>bug :ml Machine learning v6.4.0

Comments

@elasticmachine
Collaborator

Original comment by @sophiec20:

Found in 5.5 "build" : { "hash" : "62e486b", "date" : "2017-06-06T13:54:18.605Z" }

23 node cluster, of which 3 are master and 20 are ML and data nodes.
Once I reached the limit of max open jobs, the following error occurred when trying to open a job.

The message repeats the same per-node error from all 20 ML nodes. This will only get worse as the number of nodes increases.

Not a priority, considering we are only recommending a small number of dedicated nodes.

== Opening job for streamingtv299...
{"error":{"root_cause":[{"type":"status_exception","reason":"Could not open job because no suitable nodes were found, allocation explanation [Not opening job [streamingtv299] on node [{ip-10-0-3-237}{iVQhSWkEQV2QBcNgAbry5Q}{L2IqCUuAQLymdOw47osyVg}{10.0.3.237}{10.0.3.237:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-50}{HhByQF_ORUqD0NiClqPArg}{pmkQnJAhRBeUUx6pCZzgtQ}{10.0.3.50}{10.0.3.50:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-231}{cA5e1J2yS4-E96bBlaRJ7w}{XBoP_nupR6K5qB9ve6V3KA}{10.0.3.231}{10.0.3.231:9300}], because this node isn't a ml node.|Not opening job [streamingtv299] on node [{ip-10-0-3-56}{9nywi9T5S5mz0MXEibFaCw}{wqon-rXpStm2SuIYfga7zA}{10.0.3.56}{10.0.3.56:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-162}{pvnUofCaRwOiO8LUiJ4TbQ}{Rwwf5uc0SKG9IS28nifBhw}{10.0.3.162}{10.0.3.162:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-196}{gZ36z6RYQfqKUaLqMTjiKw}{NFRpcdtPTYerbM5C1Hj3jQ}{10.0.3.196}{10.0.3.196:9300}], because this node isn't a ml node.|Not opening job [streamingtv299] on node [{ip-10-0-3-110}{VntXc0TpQKGzhK6OprukSw}{fam8-p92SWCf4QNg5I3dpA}{10.0.3.110}{10.0.3.110:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-126}{bkigpzP5SsW4j4Z0ox5g-Q}{bR8YYWd7TfGrGjv0sTsZ7Q}{10.0.3.126}{10.0.3.126:9300}{ml.enabled=true}], because this node is full. 
Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-209}{9rluv4g5RXKbzM9clOaWGA}{bvM7WC3DQhm9VLBzIwCSQg}{10.0.3.209}{10.0.3.209:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-203}{J8r3z4gpQBW7DjAhLaMdnA}{Fq7BKus3T02JSJt7DoA5Cg}{10.0.3.203}{10.0.3.203:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-58}{c-jxUMWFR9S5HUasmhppaw}{RpTzjgLfTnag4kt__Qd1UQ}{10.0.3.58}{10.0.3.58:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-57}{x1u2qSaZQ9mSuwzsc9uS5w}{akMho_T0TJSttYCrbiolVw}{10.0.3.57}{10.0.3.57:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-197}{cWYjfUh8R2Gyut_sQGOG3Q}{qjZUBsfdRsuTazL_Lc2f6Q}{10.0.3.197}{10.0.3.197:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-215}{-eE-JaQMQgqYdRB4FJY5KQ}{WYfKVgAdT4yC2I7cpIa1Cg}{10.0.3.215}{10.0.3.215:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-41}{oyVlA-MBRW6dzjhWNwDAWg}{QUbWs3F4SUOOYx3bMDdXAA}{10.0.3.41}{10.0.3.41:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-68}{A4G6q0_QR7WIjuCCr0zfMQ}{1jJwyj94QoKLSKRW4LQjlw}{10.0.3.68}{10.0.3.68:9300}{ml.enabled=true}], because this node is full. 
Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-228}{VIxQSGk-SGuhcoPjYnMlgg}{lqz7p1NLTnWpsJyfG92-QA}{10.0.3.228}{10.0.3.228:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-44}{_lVGfCsVQ6eZlRc7SKqQWw}{Lwepw_9tQnefUECnO11fpQ}{10.0.3.44}{10.0.3.44:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-32}{wqVpvt-cT1aHdB8ASGh_Rg}{7rqe4GuqTAW7iGrvc3uaUw}{10.0.3.32}{10.0.3.32:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-182}{5nZNKrlKTdOt1aZW75AkYQ}{kDOS7T6cS8KWtKnDjRSqpA}{10.0.3.182}{10.0.3.182:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-47}{MQlVJm70SFSu3ZElRf2Hlg}{2TWEPfQQQQaefdIvSyx2MA}{10.0.3.47}{10.0.3.47:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-10}{C2y3cPk_QjyyZDZb-GHKwQ}{ePrCTvllTLaMiIqRjOJVWA}{10.0.3.10}{10.0.3.10:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-33}{GhXTuf3hT5WDUEYciix-yw}{gONvo5mLRL2_VGWOugPfJA}{10.0.3.33}{10.0.3.33:9300}], because this node isn't a ml node.]"}],"type":"status_exception","reason":"Could not open job because no suitable nodes were found, allocation explanation [Not opening job [streamingtv299] on node [{ip-10-0-3-237}{iVQhSWkEQV2QBcNgAbry5Q}{L2IqCUuAQLymdOw47osyVg}{10.0.3.237}{10.0.3.237:9300}{ml.enabled=true}], because this node is full. 
Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-50}{HhByQF_ORUqD0NiClqPArg}{pmkQnJAhRBeUUx6pCZzgtQ}{10.0.3.50}{10.0.3.50:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-231}{cA5e1J2yS4-E96bBlaRJ7w}{XBoP_nupR6K5qB9ve6V3KA}{10.0.3.231}{10.0.3.231:9300}], because this node isn't a ml node.|Not opening job [streamingtv299] on node [{ip-10-0-3-56}{9nywi9T5S5mz0MXEibFaCw}{wqon-rXpStm2SuIYfga7zA}{10.0.3.56}{10.0.3.56:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-162}{pvnUofCaRwOiO8LUiJ4TbQ}{Rwwf5uc0SKG9IS28nifBhw}{10.0.3.162}{10.0.3.162:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-196}{gZ36z6RYQfqKUaLqMTjiKw}{NFRpcdtPTYerbM5C1Hj3jQ}{10.0.3.196}{10.0.3.196:9300}], because this node isn't a ml node.|Not opening job [streamingtv299] on node [{ip-10-0-3-110}{VntXc0TpQKGzhK6OprukSw}{fam8-p92SWCf4QNg5I3dpA}{10.0.3.110}{10.0.3.110:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-126}{bkigpzP5SsW4j4Z0ox5g-Q}{bR8YYWd7TfGrGjv0sTsZ7Q}{10.0.3.126}{10.0.3.126:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-209}{9rluv4g5RXKbzM9clOaWGA}{bvM7WC3DQhm9VLBzIwCSQg}{10.0.3.209}{10.0.3.209:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-203}{J8r3z4gpQBW7DjAhLaMdnA}{Fq7BKus3T02JSJt7DoA5Cg}{10.0.3.203}{10.0.3.203:9300}{ml.enabled=true}], because this node is full. 
Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-58}{c-jxUMWFR9S5HUasmhppaw}{RpTzjgLfTnag4kt__Qd1UQ}{10.0.3.58}{10.0.3.58:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-57}{x1u2qSaZQ9mSuwzsc9uS5w}{akMho_T0TJSttYCrbiolVw}{10.0.3.57}{10.0.3.57:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-197}{cWYjfUh8R2Gyut_sQGOG3Q}{qjZUBsfdRsuTazL_Lc2f6Q}{10.0.3.197}{10.0.3.197:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-215}{-eE-JaQMQgqYdRB4FJY5KQ}{WYfKVgAdT4yC2I7cpIa1Cg}{10.0.3.215}{10.0.3.215:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-41}{oyVlA-MBRW6dzjhWNwDAWg}{QUbWs3F4SUOOYx3bMDdXAA}{10.0.3.41}{10.0.3.41:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-68}{A4G6q0_QR7WIjuCCr0zfMQ}{1jJwyj94QoKLSKRW4LQjlw}{10.0.3.68}{10.0.3.68:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-228}{VIxQSGk-SGuhcoPjYnMlgg}{lqz7p1NLTnWpsJyfG92-QA}{10.0.3.228}{10.0.3.228:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-44}{_lVGfCsVQ6eZlRc7SKqQWw}{Lwepw_9tQnefUECnO11fpQ}{10.0.3.44}{10.0.3.44:9300}{ml.enabled=true}], because this node is full. 
Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-32}{wqVpvt-cT1aHdB8ASGh_Rg}{7rqe4GuqTAW7iGrvc3uaUw}{10.0.3.32}{10.0.3.32:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-182}{5nZNKrlKTdOt1aZW75AkYQ}{kDOS7T6cS8KWtKnDjRSqpA}{10.0.3.182}{10.0.3.182:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-47}{MQlVJm70SFSu3ZElRf2Hlg}{2TWEPfQQQQaefdIvSyx2MA}{10.0.3.47}{10.0.3.47:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-10}{C2y3cPk_QjyyZDZb-GHKwQ}{ePrCTvllTLaMiIqRjOJVWA}{10.0.3.10}{10.0.3.10:9300}{ml.enabled=true}], because this node is full. Number of opened jobs [10], max_running_jobs [10]|Not opening job [streamingtv299] on node [{ip-10-0-3-33}{GhXTuf3hT5WDUEYciix-yw}{gONvo5mLRL2_VGWOugPfJA}{10.0.3.33}{10.0.3.33:9300}], because this node isn't a ml node.]"},"status":429}
@elasticmachine
Collaborator Author

Original comment by @sophiec20:

These jobs were created using a script; however, nothing appeared in Job Messages.

If I try to open the 201st job from the console, the red error message is somewhat overwhelming.
[screenshot of the error message]

@elasticmachine
Collaborator Author

Original comment by @dimitris-athanasiou:

I guess we can have the full explanation shown in the job-stats but when it comes to the error message we can truncate.
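A minimal sketch of that idea (hypothetical helper, not the actual Elasticsearch code; the limit and method names are assumptions): keep the full allocation explanation available via job stats, but cap what goes into the error message.

```java
// Hypothetical sketch: truncate the per-node allocation explanation before
// putting it into an error message, while the full text remains available
// elsewhere (e.g. job stats).  MAX_ERROR_EXPLANATION_CHARS is an assumed limit.
public class ExplanationTruncation {
    static final int MAX_ERROR_EXPLANATION_CHARS = 1024;

    static String truncateForError(String fullExplanation) {
        if (fullExplanation.length() <= MAX_ERROR_EXPLANATION_CHARS) {
            return fullExplanation;
        }
        // Keep the head of the explanation and point the user at the full version
        return fullExplanation.substring(0, MAX_ERROR_EXPLANATION_CHARS)
            + "... (truncated; see job stats for the full allocation explanation)";
    }

    public static void main(String[] args) {
        String full = "Not opening job [x] on node [...], because this node is full. ".repeat(40);
        System.out.println(truncateForError(full).length());
    }
}
```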

@elasticmachine
Collaborator Author

Original comment by @droberts195:

This has also been seen by @richcollier, so we might want to reconsider the priority.

Since the impact is biggest in the UI, and the full message could actually be useful in the back-end log file in certain circumstances, I think the solution is to use Elasticsearch's root_cause logic: the full exception report contains the detailed reason, but not in the most deeply nested ElasticsearchException. The most deeply nested ElasticsearchException is considered the root_cause, and is what gets displayed in the UI.
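A simplified stand-in illustrating this structure (not the real Elasticsearch classes; the class and method names here are assumptions): the short message goes in an outer ElasticsearchException, the long per-node explanation in an inner IllegalStateException. Because root_cause is defined as the innermost ElasticsearchException in the cause chain, the UI would show only the short message.

```java
// Sketch of the proposed exception nesting.  "ElasticsearchException" below is
// a minimal stand-in for org.elasticsearch.ElasticsearchException.
public class RootCauseSketch {
    static class ElasticsearchException extends RuntimeException {
        ElasticsearchException(String msg, Throwable cause) { super(msg, cause); }
    }

    static ElasticsearchException noSuitableNodes(String jobId, String detailedExplanation) {
        // Detail lives in a non-ElasticsearchException, so it cannot be the root_cause
        Exception detail = new IllegalStateException(detailedExplanation);
        return new ElasticsearchException(
            "Could not open job [" + jobId + "] because no suitable ML nodes were found",
            detail);
    }

    // root_cause = the innermost ElasticsearchException in the cause chain
    static Throwable rootCause(Throwable t) {
        Throwable root = t;
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof ElasticsearchException) {
                root = c;
            }
        }
        return root;
    }

    public static void main(String[] args) {
        ElasticsearchException e =
            noSuitableNodes("streamingtv299", "Not opening job ... (one entry per node)");
        // The outer exception is the innermost ElasticsearchException,
        // so the UI-facing root cause carries only the short message.
        System.out.println(rootCause(e).getMessage());
    }
}
```

This matches the shape of the eventual fix described in the commit below, where the detailed explanation sits in an inner IllegalStateException.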

@elasticmachine
Collaborator Author

Original comment by @sophiec20:

After the 6.1 additions for checking established model memory, I now experience this error message more often. This is because I open and close a lot of jobs repeatedly. These may or may not have an established model memory, and all my scripts currently use the default 1GB. These conditions should be less likely in a production environment, although there may be evaluation users who are also impacted.

This is an example of the error message when there is no available node to open the job on, and this is only a 2-node cluster. The end-user should really only see a limited amount of information saying there is no node available to open the job on at the moment.

This looks much worse than it really is. For starters, do we need the node ID and IP details?

[screenshot of the error message]

@elasticmachine
Collaborator Author

Original comment by @droberts195:

We should try to get this into 6.3 as ML in Cloud will also be affected by this. It is particularly bad in clusters with large numbers of nodes but small numbers of ML nodes.

@elasticmachine
Collaborator Author

Original comment by @davidkyle:

This looks much worse than it really is. For starters, do we need the node ID and IP details?

I raised LINK REDACTED to remove these fields

@elasticmachine
Collaborator Author

Original comment by @sophiec20:

This looks much worse than it really is. For starters, do we need the node ID and IP details?

They clutter up GET job too.

Once they are removed from the message, I have a hunch it will look just as silly, as the repeats will be more obvious.

@skearns64
Contributor

I expect this to become a lot more common once ML is available on Cloud, and we see users running with smaller ML nodes.

Ideally, this would be a simple, easy-to-understand error message, highlighting the fact that the job couldn't be opened because there were no ML-capable nodes with enough memory. I believe the full detail of the allocation explanation is valuable, but it may belong in the job messages tab?

@droberts195 droberts195 self-assigned this Jul 9, 2018
droberts195 added a commit to droberts195/elasticsearch that referenced this issue Jul 10, 2018
When an ML job cannot be allocated to a node the exception
contained an explanation of why the job couldn't be
allocated to each node in the cluster.  For large clusters
this was not particularly easy to read and made the error
displayed in the UI look very scary.

This commit changes the structure of the error to an outer
ElasticsearchException with a high level message and an
inner IllegalStateException containing the detailed
explanation.  Because the definition of root cause is the
innermost ElasticsearchException the detailed explanation
will not be the root cause (which is what Kibana displays).

Fixes elastic#29950
droberts195 added a commit that referenced this issue Jul 13, 2018
droberts195 added a commit that referenced this issue Jul 13, 2018