ML.Net Ranking Project #648
Hi @gangasahu - I read through your questions and wanted to provide you with the information below. Please let me know if this still doesn't answer your questions. Using your example where a user enters the new query "testing tool", two things need to happen:

1.) Your app that is consuming the model is first responsible for determining the query results themselves, and it must group these results with an identifier, known as the group id -- the key here is that this is the responsibility of the app that is consuming the model. For example, your app would need to provide the query results similar to this (note: 'etc.' represents additional feature columns that were used in training the model and should be included):
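As an illustrative sketch (the values are elided, and the class and property names are assumptions, not the sample's actual schema), the grouped results could look like this:

| GroupId | UrlId | Feature1 | Feature2 | etc. |
|---------|-------|----------|----------|------|
| 100     | 1     | ...      | ...      | ...  |
| 100     | 2     | ...      | ...      | ...  |
| 100     | 3     | ...      | ...      | ...  |

```csharp
// Hypothetical input row for the consuming app's query results. Every result
// returned for the "testing tool" query carries the same GroupId (e.g. 100)
// so the model ranks those rows against one another. Feature1/Feature2 stand
// in for whatever feature columns ("etc.") were used to train the model.
public class SearchResultInput
{
    public string GroupId { get; set; } // shared id for all results of one query
    public uint UrlId { get; set; }     // identifies the individual result
    public float Feature1 { get; set; } // must match the training feature columns
    public float Feature2 { get; set; }
}
```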
2.) Once the query results are determined and grouped according to a common group id, the model can then be used to rank these results for the consuming app. Continuing with the "testing tool" example, the data shown in the table above would be passed to the model to have it rank all of the results that share the same group id. Here are the lines of code from the sample that do this step:

```csharp
// Load the model to perform predictions with it.
DataViewSchema predictionPipelineSchema;
ITransformer predictionPipeline = mlContext.Model.Load(modelPath, out predictionPipelineSchema);

// Predict rankings, passing in the data that has the query results grouped by an id.
IDataView predictions = predictionPipeline.Transform(data);
```

Thanks,
Thanks, Nicole, for your response and explanation. It is really helpful and now I understand better. However, I have another question. For the query "Testing tool", the application will build the list of results ("Test Tool A", "Test Tool B", ...) from the historical data, assign them the group id 100, and then pass them to the model to rank. But when the query to rank ("Testing tool") is not in the historical data, how will that list be prepared to pass to the model? In that case, the list passed to the model will be empty and nothing can be ranked.

My use case for ranking is: I will have a query that is a combination of source city and destination city (e.g. "ChicagoNewyork"). What happens if this query string is not in the historical / training data? There might be queries like "ChicagoHouston", "ChicagoDallas", "BostonNewyork", "MemphisNewyork", etc. in the historical / training data. How do I prepare the query list data for the model to rank?

For the additional input (feature vector) besides the label column, the input to the model should include the feature vector for the new query, right? What about the case where we do not know all the features of the feature vector for the query?

Basically, the question is: how do I rank for a query that is not in the historical / training data? I appreciate your input on the above questions.
Hi, @gangasahu - I should have clarified that in the example I gave in my previous response, I am referring to the case where you have a new query that does not exist in the historical data used for training. When you train the model initially with historical training data, you include the following as parameters: the Label column (the known rank of each result), the group id column, and the feature columns. With this in mind, let's take the example that you mentioned involving Source/Destination city. Let's assume we have some training data for flights that looks like this:
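For illustration (every value below is invented for this sketch), such training data might look like:

| GroupId        | Source City | Destination City | Departure Time | Arrival Time | Label |
|----------------|-------------|------------------|----------------|--------------|-------|
| ChicagoHouston | Chicago     | Houston          | 8.00           | 10.30        | 3     |
| ChicagoHouston | Chicago     | Houston          | 13.15          | 15.45        | 1     |
| ChicagoDallas  | Chicago     | Dallas           | 9.30           | 11.10        | 4     |

And here is a hedged sketch of how those three parameters are wired into the training call (LightGBM is the trainer the ranking sample uses; the column names are assumptions for this thread):

```csharp
using Microsoft.ML;

public static class TrainRankingModel
{
    // 'trainingData' is assumed to be an IDataView loaded from historical
    // flight data shaped like the table above.
    public static ITransformer Train(MLContext mlContext, IDataView trainingData)
    {
        var pipeline =
            // Hash the group id into a key column; hashing also copes with group
            // ids that never appeared in training, which matters for new queries.
            mlContext.Transforms.Conversion.Hash("GroupId")
            // Combine the chosen feature columns into the single "Features" vector.
            .Append(mlContext.Transforms.Concatenate("Features", "DepartureTime", "ArrivalTime"))
            // The three ranking parameters: label, group id, and features.
            .Append(mlContext.Ranking.Trainers.LightGbm(
                labelColumnName: "Label",
                featureColumnName: "Features",
                rowGroupColumnName: "GroupId"));

        return pipeline.Fit(trainingData);
    }
}
```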
Note that with the above data, suppose that we decide that the feature columns are the departure and arrival time columns. This is because, in our example, we find that these column values are most important in determining the ideal rank of the results. Once the model is trained, let's now assume that a user enters a new query of "Boston to New York". Here's what needs to happen:

1.) Your app determines the candidate query results itself - e.g. the available Boston to New York flights.
2.) The app groups all of these results under a common group id; a new id is fine, since the group id only ties the results of one query together.
3.) The app includes the same feature columns used in training (here, the departure and arrival times) and passes this data to the model.
When this data is passed to the model, the model will look at the feature columns that exist in this data (e.g. departure/arrival time) and use these as the basis for predicting the rank of each result. The key here is that the predicted rankings are based on the feature columns that you decide to select when training your model. These same feature columns must also be present in the data used when making a prediction with the model. Here is a hedged sketch of that prediction step (the class, column names, and sample values are illustrative assumptions):
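```csharp
using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical row type for the new query's candidate results; the property
// names mirror the assumed training columns from the sketch above.
public class FlightResult
{
    public string GroupId { get; set; }      // same (new) id for every candidate
    public string FlightId { get; set; }
    public float DepartureTime { get; set; } // the feature columns used in training
    public float ArrivalTime { get; set; }
}

public static class RankNewQuery
{
    // 'mlContext' and 'predictionPipeline' are assumed: the MLContext and the
    // model loaded with mlContext.Model.Load in the earlier snippet.
    public static void Run(MLContext mlContext, ITransformer predictionPipeline)
    {
        // The app gathers the candidate "Boston to New York" results itself and
        // tags them all with one shared group id that training never saw.
        var candidates = new[]
        {
            new FlightResult { GroupId = "BostonNewyork", FlightId = "F100", DepartureTime = 8.0f,  ArrivalTime = 9.5f  },
            new FlightResult { GroupId = "BostonNewyork", FlightId = "F200", DepartureTime = 13.0f, ArrivalTime = 14.4f },
        };

        IDataView data = mlContext.Data.LoadFromEnumerable(candidates);

        // The loaded pipeline applies the same hashing/featurization as training,
        // then scores each row.
        IDataView predictions = predictionPipeline.Transform(data);

        // Row order is preserved, so scores can be zipped back to the candidates
        // and sorted descending to get the final ranking.
        var ranked = candidates
            .Zip(predictions.GetColumn<float>("Score"), (flight, score) => (flight.FlightId, score))
            .OrderByDescending(pair => pair.score);

        foreach (var (flightId, score) in ranked)
            Console.WriteLine($"{flightId}: {score}");
    }
}
```

Let me know if you still have questions.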
Thanks, Nicole, for taking the time to explain everything that goes on behind the scenes. A few questions:

1.) What happens if the query results I pass in are for cities that do not exist in the training data - will the model still rank them well?
2.) What exactly do the Label column and the weights represent?
3.) Sometimes the Score values are negative - does that matter? I assume I should sort the scores in descending order. And how do I measure the ranking ability of the model?
Whenever you have time, I would appreciate your input on the above questions. I am still working on preparing the ML data for our use case, which is very similar to this. I will update you once it is ready. Thank you.
Hi, @gangasahu - Here are answers to your questions above:

1.) Your model's ranking ability is going to depend on the quality of the data that you train it with. Let's assume that you decide to train the model with the following feature columns: Source City, Destination City, Departure Time, Arrival Time. If your query results (i.e. the data you provide to the model to rank) are for a Source City and Destination City that do not exist in the training data, the model will still return rankings based on all of the feature column values that it was trained with. However, you may find that the NDCG value (see bullet 3 below) is low and that the model doesn't rank these results as desired. As a result, you may need to consider expanding your training dataset to include these additional cities.

2.) I'm unsure of the exact question that you're asking here, but the Label column exists in your training data to specify the actual rank for a set of query results. The weight (also referred to as custom gains) allows you to say how much each label value is worth. For example, suppose you label your data with the following values: terrible = 0, bad = 1, ok = 2, good = 3, and perfect = 4. If you decide that you want greater emphasis on perfect results being ranked higher in your search query, you could specify custom gains of {0, 1, 2, 10, 100}, which makes a perfect result 10x more important than a good result and 50x more important than an ok result.

3.) It does not matter if the Scores are sometimes negative - you should sort the scores in descending order, as you mentioned. To measure the ranking ability, you should rely on the NDCG value that is returned, based on the number of results that you are looking to evaluate. For example, NDCG@10 measures the likelihood that the top 10 results are ranked correctly - this score ranges from 0.0 to 1.0, and the closer to 1.0, the better the ranking ability of your model. To continue increasing this score, you would need to experiment with the feature columns you select, the custom gains assigned to each label value, the trainer's hyperparameters, and the size and quality of your training data.
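To ground points 2 and 3 in code, here is a hedged sketch using ML.NET's LightGBM ranking options (the gains are the example values above; 'predictions' stands for held-out data scored by the trained model):

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Trainers.LightGbm;

public static class CustomGainsExample
{
    // Builds a LightGBM ranking trainer with the custom gains from the example
    // above: a perfect result (gain 100) counts 10x a good one (10) and 50x an
    // ok one (2).
    public static IEstimator<ITransformer> MakeTrainer(MLContext mlContext) =>
        mlContext.Ranking.Trainers.LightGbm(new LightGbmRankingTrainer.Options
        {
            LabelColumnName = "Label",
            FeatureColumnName = "Features",
            RowGroupColumnName = "GroupId",
            CustomGains = new[] { 0, 1, 2, 10, 100 },
        });

    // 'predictions' is assumed to be held-out data scored by the trained model.
    // Entry k-1 of the NDCG list is NDCG@k; each value is between 0.0 and 1.0.
    public static double NdcgAt3(MLContext mlContext, IDataView predictions)
    {
        RankingMetrics metrics = mlContext.Ranking.Evaluate(predictions);
        return metrics.NormalizedDiscountedCumulativeGains[2];
    }
}
```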
Hope these answers help - your questions have been useful to me as well, in that I recognize there are areas in the sample where more detail should be provided. Let me know if you still have questions. ~Nicole
Adding @ebarsoumMS and @justinormont from the team to help on the Ranking Model questions above, as well.
Hello Nicole, @ebarsoumMS, or @justinormont - any update on the questions I have asked? I am in the middle of a project and waiting eagerly for your answers. Thank you.
Hi @gangasahu,
You can also change some other advanced hyperparameters using this API:
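As an illustration of that kind of tuning (a sketch with arbitrary values, not recommendations), the LightGBM ranking trainer exposes its common hyperparameters directly on the training call:

```csharp
using Microsoft.ML;

public static class HyperparameterExample
{
    // Illustrative hyperparameter values only; tune these against NDCG on
    // held-out data rather than copying them verbatim.
    public static IEstimator<ITransformer> MakeTrainer(MLContext mlContext) =>
        mlContext.Ranking.Trainers.LightGbm(
            labelColumnName: "Label",
            featureColumnName: "Features",
            rowGroupColumnName: "GroupId",
            numberOfLeaves: 30,
            minimumExampleCountPerLeaf: 20,
            learningRate: 0.1,
            numberOfIterations: 200);
}
```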
Hope this helps - let me know if you have more questions.
Hello @yaeldekel, Thank you very much for your answers - they really clarify some of my questions. I was busy with other aspects of operationalizing the model, so I am late in responding. Here are some of my other questions:

1.) Is the ranking task available in AutoML?
2.) The sample evaluates the model with cross-validation / train-test splits and then trains again - which model should be deployed to production?
Please advise. Thank you very much.
The ranking task is not currently available in AutoML. If you file an issue asking for it, it will bring it to people's attention. /cc @JakeRadMSFT
The earlier CV/TrainTest modes are for getting metrics, which estimate how well the model will do in production. The last step is training the model to deploy to production; this final model is trained on all available data. More info: #549 (comment)
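A hedged sketch of that flow ('pipeline' and 'allData' stand for the training pipeline and the full dataset from the sketches above):

```csharp
using System.Linq;
using Microsoft.ML;

public static class FinalModelExample
{
    // Cross-validation estimates production quality (here: mean NDCG@3 across
    // folds); the deployed model is then trained on ALL available data.
    public static ITransformer TrainFinalModel(
        MLContext mlContext, IEstimator<ITransformer> pipeline, IDataView allData)
    {
        var cvResults = mlContext.Ranking.CrossValidate(allData, pipeline, numberOfFolds: 5);
        double meanNdcg3 = cvResults
            .Select(fold => fold.Metrics.NormalizedDiscountedCumulativeGains[2])
            .Average();
        System.Console.WriteLine($"Mean NDCG@3 across folds: {meanNdcg3:F4}");

        // The model that ships is fit on everything, then saved for deployment.
        ITransformer finalModel = pipeline.Fit(allData);
        mlContext.Model.Save(finalModel, allData.Schema, "rankingModel.zip");
        return finalModel;
    }
}
```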
In the ML.Net Ranking sample project, the consumption of the model is not clear. For example: after the model is trained and saved, it has to be used with a new query to rank the web search results. The example provided is not clear. Suppose we want to use the web query "testing tool" - how does this have to be passed to the model so that the model will return a list of URL ids with proper scores that can be ranked? A bigger challenge: what happens when the query "?????" has not been queried before, or there is no group id for it in the test data - how will it be ranked? Calling the prediction might return null.
A concrete example would be helpful. Maybe another sample, e.g. hotel selection using ML.Net, would also help.
I have been working on this for quite some time and am having trouble figuring out how to consume the model. I appreciate any help / samples / tips.
Email : [email protected]
RR Donnelley