[ML] Repeated info in log message from all ML nodes leads to long message #29950
Original comment by @sophiec20: These jobs were created using a script, however nothing appeared in Job Messages. If I try to open the 201st job from the console, the red error message is somewhat overwhelming.
Original comment by @dimitris-athanasiou: I guess we can have the full explanation shown in the job-stats, but when it comes to the error message we can truncate.
Original comment by @droberts195: This has also been seen by @richcollier, so we might want to reconsider the priority. Since the impact is biggest in the UI and the full message could actually be useful in certain circumstances in the back end log file, I think the solution is to use Elasticsearch's …
Original comment by @sophiec20: After the 6.1 additions for checking established model memory, I now experience this error message more often. This is because I open and close a lot of jobs repeatedly. These may or may not have an established model memory, and all my scripts currently use the default 1GB. These conditions should be less likely in a production environment, although there may be evaluation users who are also impacted. This is an example of an error message when there is no available node to open the job on. This is only a 2 node cluster. The end-user should really only see a limited amount of info saying that there is currently no available node to open the job on. This looks much worse than it really is. For starters, do we need the node ID and IP details?
Original comment by @droberts195: We should try to get this into 6.3 as ML in Cloud will also be affected by this. It is particularly bad in clusters with large numbers of nodes but small numbers of ML nodes. |
Original comment by @davidkyle:
I raised LINK REDACTED to remove these fields.
Original comment by @sophiec20:
They clutter up the message. Once they are removed from the message, I have a hunch it will look just as silly, as the repeats will be more obvious.
I expect this to become a lot more common once ML is available on Cloud and we see users running with smaller ML nodes. Ideally, this would be a simple, easy-to-understand error message, highlighting the fact that the job couldn't be opened because there were no ML-capable nodes with enough memory. I believe that the full detail of the allocation explanation is valuable, but it may belong in the job messages tab?
When an ML job cannot be allocated to a node, the exception contains an explanation of why the job couldn't be allocated to each node in the cluster. For large clusters this is not particularly easy to read and makes the error displayed in the UI look very scary. This commit changes the structure of the error to an outer ElasticsearchException with a high-level message and an inner IllegalStateException containing the detailed explanation. Because the root cause is defined as the innermost ElasticsearchException, the detailed explanation will not be the root cause (which is what Kibana displays). Fixes #29950
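A minimal Java sketch of the error structure described above (not the actual Elasticsearch ML source; the method and message names here are hypothetical). The concise, user-facing text goes in the outer ElasticsearchException, while the lengthy per-node allocation detail is carried by an inner IllegalStateException, so the root cause that Kibana displays stays short:

```java
import org.elasticsearch.ElasticsearchException;

public class JobAllocationErrors {

    // Hypothetical helper illustrating the nesting; the real ML plugin code differs.
    static ElasticsearchException noSuitableNodeError(String jobId, String perNodeExplanation) {
        // The detailed, per-node explanation lives in a plain IllegalStateException.
        // It is preserved in logs and stack traces, but because it is not an
        // ElasticsearchException it will never be reported as the root cause.
        IllegalStateException detail = new IllegalStateException(
                "No ML node with sufficient capacity was found for job [" + jobId + "]; "
                        + "allocation explanation per node: " + perNodeExplanation);

        // The outer ElasticsearchException carries the short, user-facing message
        // and wraps the detail as its cause.
        return new ElasticsearchException(
                "Could not open job [" + jobId + "] because no suitable nodes were found", detail);
    }
}
```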
Original comment by @sophiec20:
Found in 5.5
"build" : { "hash" : "62e486b", "date" : "2017-06-06T13:54:18.605Z" }
23 node cluster, of which 3 are master and 20 are ML and data nodes.
Once I reached the limit of max open jobs, the following error occurred when trying to open a job.
This is repeating the error message from 20 nodes. This will only get worse as the number of nodes increases.
Not a priority, considering we are only recommending a small number of dedicated nodes.