[ML] Adds feature importance option to inference processor #52218
Conversation
Pinging @elastic/ml-core (:ml)
@elasticmachine update branch
…benwtrent/elasticsearch into feature/ml-inference-feature-importance
Looks good.

My only real comment is that the thread safety of `Tree` needs thinking about. If in fact these objects are only ever used from one thread then just remove the TODO, the `volatile`s, and add a class Javadoc comment stating they must only be used from one thread at a time. But if they can be used from multiple threads then your TODO is correct and extra work to ensure thread safety is required.
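As an illustration of the single-thread option described above, a hypothetical sketch (field types are assumptions; only the class and field names appear elsewhere in this review):

```
/**
 * NOTE: instances of this class are not thread safe and must only be
 * used from one thread at a time.
 */
class Tree {
    // Plain fields: with the single-thread contract documented in the class
    // Javadoc, the volatile modifiers (and the TODO) can simply be removed.
    private double[][] nodeEstimates; // type is an assumption for illustration
    private int maxDepth;
}
```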
(Optional, integer) If set, feature importance for the top most important features will be computed. Importance is calculated using the SHAP (SHapley Additive exPlanations) method as described in …
You might be able to use the macro from https://github.com/elastic/elasticsearch/pull/52283/files#diff-876579125ed69329aeee2dff00706206R909 to avoid duplicating this text.
.collect(Collectors.toMap(featureNames::get, i -> featureImportance[i])));
}

//TODO synchronized?
If it is possible that this can be called from two different threads around the same time then there is a bigger problem that making this `synchronized` will not solve. In the private `featureImportance(List<Double>, Map<String, String> featureDecoder)` method above it would be possible for a simultaneous call to `calculateNodeEstimatesIfNeeded()` to have changed `maxDepth` but not `nodeEstimates` at the time when these members are being read in `featureImportance()`.

If multithreaded access to `Tree` objects is possible then the `featureImportance(List<Double>, Map<String, String> featureDecoder)` method needs to be synchronized and this method needs a comment saying it must only be called from `featureImportance(List<Double>, Map<String, String> featureDecoder)`. Then `nodeEstimates` and `maxDepth` wouldn't need to be volatile.
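For illustration, a minimal sketch of the locking scheme suggested here. The field types, the signatures beyond the names quoted above, and the placeholder bodies are assumptions; only the member and method names come from this review thread.

```
import java.util.List;
import java.util.Map;

class Tree {

    private double[][] nodeEstimates; // no longer needs to be volatile
    private int maxDepth;             // no longer needs to be volatile

    // Every read of nodeEstimates/maxDepth happens while holding the monitor,
    // so a caller can never observe one field updated without the other.
    synchronized Map<String, Double> featureImportance(List<Double> fields,
                                                       Map<String, String> featureDecoder) {
        calculateNodeEstimatesIfNeeded();
        // ... walk the tree using maxDepth and nodeEstimates to compute SHAP values ...
        return Map.of();
    }

    // Must only be called from featureImportance(List, Map); the caller holds
    // the monitor, so both fields are updated atomically with respect to readers.
    private void calculateNodeEstimatesIfNeeded() {
        if (nodeEstimates != null) {
            return;
        }
        maxDepth = 0;                    // placeholder computation
        nodeEstimates = new double[0][]; // placeholder computation
    }
}
```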
@droberts195 I do think this will be accessible via multiple threads. Multiple docs can come into different pipelines on the same node referencing the same model. So, they would all hit the same instance in cache.
I am going to get this synchronized, and also add a note to the classes saying that they must be MT safe.
LGTM
@elasticmachine update branch
run elasticsearch-ci/docs
…enwtrent/elasticsearch into feature/ml-inference-feature-importance
This adds machine learning model feature importance calculations to the inference processor.

The new flag in the configuration matches the analytics parameter name: `num_top_feature_importance_values`

Example:

```
"inference": {
  "field_mappings": {},
  "model_id": "my_model",
  "inference_config": {
    "regression": {
      "num_top_feature_importance_values": 3
    }
  }
}
```

This will write to the document as follows:

```
"inference" : {
  "feature_importance" : {
    "FlightTimeMin" : -76.90955548511226,
    "FlightDelayType" : 114.13514762158526,
    "DistanceMiles" : 13.731580450792187
  },
  "predicted_value" : 108.33165831875137,
  "model_id" : "my_model"
}
```

This is done through calculating the [SHAP values](https://arxiv.org/abs/1802.03888). It requires that models have populated `number_samples` for each tree node. This is not available to models that were created before 7.7.

Additionally, if the inference config is requesting feature_importance and not all nodes have been upgraded yet, it will not allow the pipeline to be created. This is to safeguard a mixed-version environment where only some ingest nodes have been upgraded.
NOTE: the algorithm is a Java port of the one laid out in ml-cpp: https://github.com/elastic/ml-cpp/blob/master/lib/maths/CTreeShapFeatureImportance.cc
usability blocked by: elastic/ml-cpp#991
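Purely as an illustration of what "top most important features" means in the output above, here is a hedged sketch of keeping the `num_top_feature_importance_values` entries with the largest absolute SHAP value. The helper name, the ordering-by-magnitude rule, and the extra `OriginWeather` feature are assumptions for the example, not the processor's actual code.

```
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;

class TopFeatureImportance {

    // Hypothetical helper: keep only the numTopFeatures entries with the
    // largest absolute SHAP value, preserving their signed values.
    static Map<String, Double> topN(Map<String, Double> featureImportance, int numTopFeatures) {
        Map<String, Double> result = new LinkedHashMap<>();
        featureImportance.entrySet().stream()
            .sorted(Comparator.comparingDouble((Map.Entry<String, Double> e) -> Math.abs(e.getValue())).reversed())
            .limit(numTopFeatures)
            .forEach(e -> result.put(e.getKey(), e.getValue()));
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> importance = Map.of(
            "FlightTimeMin", -76.90955548511226,
            "FlightDelayType", 114.13514762158526,
            "DistanceMiles", 13.731580450792187,
            "OriginWeather", 0.5); // made-up extra feature for the example
        // With num_top_feature_importance_values = 3 the smallest-magnitude
        // feature (OriginWeather) is dropped.
        System.out.println(topN(importance, 3));
    }
}
```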