Ranking Sample - Hotel Search Results (changed to Bing Search Engine result ranking) #533
Conversation
.../csharp/getting-started/Ranking_PersonalizedSort/PersonalizedRanking/Common/ConsoleHelper.cs
```CSharp
static void PrepDatasets(MLContext mlContext, string assetPath, string originalDatasetPath, string trainDatasetPath, string testDatasetPath)
{
    const string DatasetUrl = "https://www.kaggle.com/c/expedia-personalized-sort/download/data.zip";
```
File is not public. Is requiring a user to sign up for a kaggle account too onerous?
Yes you do need a Kaggle account for this - I had assumed that all Kaggle datasets required you have an account. But, based on your comment, that must not be the case, so this may be more overhead than we can expect for a user. You probably noticed but I have sent an email to CELA to learn if there is a possibility of using this dataset. If it turns out I can't, then I will look at using the benchmark dataset. Overall, I feel that the hotel data is a very compelling scenario which is why I decided to try pursuing it.
I'd recommend all demos have datasets that can be downloaded automatically (no user action needed). Will CELA let us host the dataset in our CDN?
Granted, Kaggle is pretty awesome and folks should hone their data science skills there. I think forcing our users to sign up for an external account before trying demo code is a bit rough.
We will have more ranking samples, of course. For instance, a short sample that teaches fewer aspects but is easy to adapt to their own datasets. And perhaps a sample which does string comparisons (query vs. documents) for feature engineering.
I have now switched the dataset to use the one you recommended for ranking search engine results.
#resolved
```CSharp
static void PrepDatasets(MLContext mlContext, string assetPath, string originalDatasetPath, string trainDatasetPath, string testDatasetPath)
{
    const string DatasetUrl = "https://www.kaggle.com/c/expedia-personalized-sort/download/data.zip";
```
I'd recommend using the MSLR-WEB10K (or WEB30K) dataset from our CDN:
- https://aka.ms/mlnet-resources/benchmarks/MSLRWeb10KTrain720kRows.tsv
- https://aka.ms/mlnet-resources/benchmarks/MSLRWeb10KValidate240kRows.tsv
- https://aka.ms/mlnet-resources/benchmarks/MSLRWeb10KTest240kRows.tsv
As zip:
- MSLR-WEB10K: https://express-tlcresources.azureedge.net/datasets/MSLR-WEB10K/MSLR-WEB10K.zip (README, LICENSE)
- MSLR-WEB30K: https://express-tlcresources.azureedge.net/datasets/MSLR-WEB30K/MSLR-WEB30K.zip (README, LICENSE)

We use the MSLR-WEB10K dataset for our ranking benchmarks. The dataset is auto-downloaded from the above CDN aka.ms URLs when you run `build.cmd -- /t:DownloadExternalTestFiles /p:IncludeBenchmarkData=true` to download/run the benchmark datasets.
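For a samples repo outside the ML.NET build system, the same no-user-action download can be sketched in a few lines of C#. This is a sketch, not code from the PR: the helper class, method name, and local file layout are assumptions; only the URL comes from the CDN list above.

```csharp
using System.IO;
using System.Net;

public static class DatasetDownloader
{
    // CDN URL from the list above; the local file name mirrors the URL.
    const string TrainDatasetUrl = "https://aka.ms/mlnet-resources/benchmarks/MSLRWeb10KTrain720kRows.tsv";

    public static string EnsureTrainDataset(string assetsPath)
    {
        string trainDatasetPath = Path.Combine(assetsPath, "MSLRWeb10KTrain720kRows.tsv");

        // Download once, on first run -- no Kaggle account or manual step needed.
        if (!File.Exists(trainDatasetPath))
        {
            Directory.CreateDirectory(assetsPath);
            using (var client = new WebClient())
            {
                client.DownloadFile(TrainDatasetUrl, trainDatasetPath);
            }
        }

        return trainDatasetPath;
    }
}
```

A sample's `Main` can then call `DatasetDownloader.EnsureTrainDataset(AssetsPath)` before loading data, so the demo runs end-to-end with no user action.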
See comment above - I will use this as a substitution if I can't get approval from CELA on the expedia dataset.
Same comment as above - am now using this dataset. #resolved
...tting-started/Ranking_PersonalizedSort/PersonalizedRanking/DataStructures/HotelPrediction.cs
samples/csharp/getting-started/Ranking_PersonalizedSort/PersonalizedRanking/Program.cs
samples/csharp/getting-started/Ranking_PersonalizedSort/README.md
* Information on similar competitor hotel offerings.

## ML Task - Ranking
As previously mentioned, this sample uses the LightGbm Lambdarank algorithm which is applied using a supervised learning technique known as "Learning to Rank". This technique requires that train/test datasets contain groups of data instances that are labeled with their ideal ranking value. The label is a numerical\ordinal value, such as {4, 3, 2, 1, 0} or a text value {"Perfect", "Excellent", "Good", "Fair", or "Bad"}. The process for labeling these data instances with their ideal ranking value can be done manually by subject matter experts. Or, the labels can be determined using other metrics, such as the number of clicks on a given search result. This sample uses the latter approach.
Fixed:
- LightGbm => LightGBM
- Ordering of `{4, 3, 2, 1, 0}` to `{0, 1, 2, 3, 4}`, to encourage users to use more/fewer ordinals should their data map to more/fewer relevance values.
- "ideal ranking value" => "relevance scores"
- Linked: Learning to Rank
```suggestion
As previously mentioned, this sample uses the LightGBM LambdaRank algorithm which is applied using a supervised learning technique known as "[Learning to Rank](https://en.wikipedia.org/wiki/Learning_to_rank)". This technique requires that train/test datasets contain groups of data instances that are each labeled with their relevance scores. The label is a numerical\ordinal value, such as {0, 1, 2, 3, 4}, or a text value, such as {"Bad", "Fair", "Good", "Excellent", or "Perfect"}. The process for labeling these data instances with their relevance scores can be done manually by subject matter experts. Or, the labels can be determined using other metrics, such as the number of clicks on a given search result. This sample uses the latter approach.
```
We may also want to note that it's expected to have many more "Bad" relevance scores than "Perfect". This helps users avoid converting a ranked list directly into equally sized bins of {0, 1, 2, 3, 4}.
The relevance scores are re-used. Generally, you'll have many items per group which are labeled 0, which means the result is "bad"; and only one or a few labeled 4, which means that result is "perfect".
I replaced all references to "ideal ranking value" to "relevance score".
For your point on "This helps to avoid converting a ranked list directly into equally sized bins of {0, 1, 2, 3, 4}" - can you explain a bit more why this needs to be avoided? I'd like to understand this point and then also explain it in the sample. For example, would the model potentially end up predicting ranks that are nearly the same for each result?
Good point about reusing scores - I inadvertently left this out and will add it.
It is a form of leakage -- I think I'd group the various styles of leakage in the following way...
Leakage styles:
- Feature availability (columns which aren't available at prediction time -- e.g. a MonthlySalary column when predicting YearlySalary; or MinutesLate when predicting IsLate; or, more subtly, NumOfLatePayments when predicting ShouldGiveLoan)
- Non-iid data:
  - Data leakage (e.g. splitting a time-series dataset randomly instead of putting newer data in the test set; or MinMax normalizing a dataset and then splitting)
  - Duplicate rows between train/validation/test (e.g. oversampling a dataset to pad its size before splitting; different rotations of a single image; bootstrap sampling before splitting; or duplicating rows to upsample the minority class)
  - Group leakage -- not including a grouping split column (e.g. Andrew Ng's team had 100k x-rays of 30k patients, meaning ~3 images per patient. They did random splitting instead of ensuring that all images of a patient were in the same split. Hence the model partially memorized the patients instead of learning to recognize pneumonia in chest x-rays. The revised paper had a noticeable drop in scores.)

The first level is whether it's column-based or row-based leakage. These leakages cause you to misestimate how well your model will perform in production.
This style of leakage is a form of row-based leakage, where the lack of a samplingKeyColumnName column causes group leakage, which may also show up as data leakage or duplicate rows. I'm uncertain which I'd call it; or if my ontology of leakage styles is sufficient.
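Concretely, in ML.NET this comes down to one argument of `TrainTestSplit`. A minimal sketch, assuming the Microsoft.ML package and the `SearchResultData` schema class used elsewhere in this PR (file name and separator here are illustrative):

```csharp
using Microsoft.ML;

var mlContext = new MLContext(seed: 0);

// Assumed: SearchResultData declares Label, GroupId, and the feature columns.
IDataView data = mlContext.Data.LoadFromTextFile<SearchResultData>(
    "search-results.tsv", separatorChar: '\t', hasHeader: false);

// Without samplingKeyColumnName, rows from the same query group can land in
// both splits, so the model partially memorizes queries (group leakage).
// With it, every row sharing a GroupId value stays on one side of the split.
var split = mlContext.Data.TrainTestSplit(
    data,
    testFraction: 0.2,
    samplingKeyColumnName: nameof(SearchResultData.GroupId));

IDataView trainSet = split.TrainSet;
IDataView testSet = split.TestSet;
```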
#resolved
* Information on similar competitor hotel offerings.

## ML Task - Ranking
As previously mentioned, this sample uses the LightGbm Lambdarank algorithm which is applied using a supervised learning technique known as "Learning to Rank". This technique requires that train/test datasets contain groups of data instances that are labeled with their ideal ranking value. The label is a numerical\ordinal value, such as {4, 3, 2, 1, 0} or a text value {"Perfect", "Excellent", "Good", "Fair", or "Bad"}. The process for labeling these data instances with their ideal ranking value can be done manually by subject matter experts. Or, the labels can be determined using other metrics, such as the number of clicks on a given search result. This sample uses the latter approach.
Has someone checked if {"Perfect", "Excellent", "Good", "Fair", or "Bad"} still works as a set of Labels? This works from the MAML command-line, but I'm uncertain that the estimators API accepts anything besides {4, 3, 2, 1, 0}.
I haven't tried it (I lifted this point from the TLC documentation and assumed it was still applicable). I will verify it to confirm.
I followed up on this by changing my label values to strings ("Perfect", "Fair", "Bad") - when I attempt to train the model I get an InvalidOperationException saying the following, so it looks like this isn't supported. Does a bug need to be logged for this?

> Splitter/consolidator worker encountered exception while consuming source data.
> Could not parse value Bad in line 62, column Label
Based on my above comment, I have removed the mention of using strings as label values. #resolved
```CSharp
// When splitting the data, 20% is held for the test dataset.
// To avoid label leakage, the GroupId (e.g. search\query id) is specified as the samplingKeyColumnName.
```
```suggestion
// To avoid data leakage, the GroupId (e.g. search\query id) is specified as the samplingKeyColumnName.
```
Note that the actual documentation provided for this parameter says the following: "This can be used to ensure no label leakage from the train to the test set."
This is obviously a minor issue in the doc, but it sounds like the way it's currently written is incorrect?
#resolved
```xml
  <TargetFramework>netcoreapp2.2</TargetFramework>
</PropertyGroup>

<ItemGroup>
```
All the elements in this ItemGroup shouldn't be necessary. This whole group can be deleted.
oh, I see you have a .csproj nested under another .csproj. This isn't typical. Can they be in sibling folders instead? Or can we just have a single .csproj?
Yes, that was my mistake when I was copying and pasting some things around while cleaning up my solution. I have deleted the extra .csproj file.
#resolved
```CSharp
IDataView trainData = mlContext.Data.LoadFromTextFile<HotelData>(trainDatasetPath, separatorChar: ',', hasHeader: true);

// Specify the columns to include in the feature input data.
var featureCols = trainData.Schema.AsQueryable()
```
Could this simply be:

```CSharp
var featureCols = new[] { nameof(HotelData.Price_USD), nameof(HotelData.Promotion_Flag), nameof(HotelData.Prop_Id), nameof(HotelData.Prop_Brand), nameof(HotelData.Prop_Review_Score) };
```

?
I have since moved to a new dataset where there are 136 columns - as a result, this code is now:

```CSharp
var featureCols = trainData.Schema.AsQueryable()
    .Select(s => s.Name)
    .Where(c =>
        c != nameof(SearchResultData.Label) &&
        c != nameof(SearchResultData.GroupId))
    .ToArray();
```

#resolved
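For context, a hedged sketch of how a `featureCols` array like this typically feeds the ranking pipeline. The trainer call assumes the Microsoft.ML.LightGbm package; the `SearchResultData` column names come from the snippet above, and `mlContext`/`trainData` are the objects built earlier in the sample:

```csharp
// Concatenate the selected columns into the single "Features" vector that the
// trainer consumes, then append the LightGBM ranking (LambdaRank) trainer.
// Note: in practice the GroupId column may first need a MapValueToKey
// transform so it is a key type, as the ranking trainer expects.
var pipeline = mlContext.Transforms.Concatenate("Features", featureCols)
    .Append(mlContext.Ranking.Trainers.LightGbm(
        labelColumnName: nameof(SearchResultData.Label),
        featureColumnName: "Features",
        rowGroupColumnName: nameof(SearchResultData.GroupId)));

ITransformer model = pipeline.Fit(trainData);
```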
```CSharp
{
    const string DatasetUrl = "https://www.kaggle.com/c/expedia-personalized-sort/download/data.zip";

    if (!File.Exists(trainDatasetPath) || !File.Exists(testDatasetPath))
```
(nit) the whitespace is off here.
#resolved
```CSharp
Console.WriteLine("===== Prepare the testing/training datasets =====");

// Load dataset using TextLoader by specifying the type name that holds the data's schema to be mapped with datasets.
IDataView data = mlContext.Data.LoadFromTextFile<HotelData>(originalDatasetPath, separatorChar: ',', hasHeader: true);
```
Where do HotelData and HotelPrediction come from? I don't see them in this PR.
They are in the DataStructures directory - I see them in the PR...maybe you overlooked it? Anyway, since moving to the new dataset, these files have been removed and replaced with new types.
#resolved
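For readers following along, a schema class like the ones under DataStructures is typically declared with `LoadColumn` attributes so `LoadFromTextFile<T>` can map the file onto it. This is a sketch of the shape only -- the column indices and property names below are illustrative assumptions, not the actual MSLR file layout:

```csharp
using Microsoft.ML.Data;

// Schema class mapped onto the text file by LoadFromTextFile<SearchResultData>.
// Each LoadColumn attribute gives the zero-based column index in the file.
public class SearchResultData
{
    [LoadColumn(0)]
    public uint Label { get; set; }       // Relevance score, e.g. 0-4.

    [LoadColumn(1)]
    public uint GroupId { get; set; }     // Query/search id used for grouping.

    [LoadColumn(2, 137), VectorType(136)]
    public float[] Features { get; set; } // The remaining feature columns as one vector.
}
```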
```xml
</PropertyGroup>

<ItemGroup>
  <Compile Remove="DataStructures\HotelData.cs" />
```
These can all be removed, can't they? I don't see a file Program_old.cs or Mapper.cs.
yep - i missed that - thanks! #resolved
#resolved
```CSharp
using System.Collections.Generic;
using System.Linq;

namespace PersonalizedRanking.Common
```
Renaming:
I might recommend a title for this sample: "MSLR-WEB10K Ranking" or, more simply, "Web Ranking".
I don't think there is personalization in this dataset. Personalization in the ranking area generally means that each user gets individualized search results. In this dataset, there is no information about the user. This information would generally include { topics of interest of the user, demographics of the user, current location of the user, etc. }.
#resolved
```CSharp
{
    const string AssetsPath = @"../../../Assets";
    const string TrainDatasetUrl = "https://aka.ms/mlnet-resources/benchmarks/MSLRWeb10KTrain720kRows.tsv";
    const string TestDatasetUrl = "https://aka.ms/mlnet-resources/benchmarks/MSLRWeb10KTest240kRows.tsv";
```
Doesn't hurt to list the validation dataset. This would let users experiment on the dataset by iterating using the scores on the validation set, then testing their final solution, exactly once, on the test dataset.
Generally the DS pattern for ML should look like:
1. Train on the training set, get metrics from the validation set.
2. Iterate many times until happy with the pipeline -- go to 1.
3. Train the found pipeline on the combined train+validate set, get metrics on the test set (exactly once) -- these are your final metrics.
4. Retrain the pipeline on all data (train+validate+test). Send this newly created model to production.

The final estimate of how well your model will do in production is the metrics from step (3). The final model for production, trained on all available data, is trained at step (4).
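Sketched in ML.NET terms -- here `pipeline`, `trainSet`, `validationSet`, `trainPlusValidateSet`, `allDataSet`, and `mlContext` are placeholders for the estimator and IDataViews built as elsewhere in this sample, not names from the PR:

```csharp
// (1)+(2) Iterate: fit on train, judge each candidate pipeline on validation only.
ITransformer candidate = pipeline.Fit(trainSet);
RankingMetrics valMetrics = mlContext.Ranking.Evaluate(candidate.Transform(validationSet));

// (3) Exactly once, after the pipeline is frozen: train on train+validate
//     (e.g. load both files into one IDataView) and evaluate on the test set.
//     These are the metrics you report.
ITransformer final = pipeline.Fit(trainPlusValidateSet);
RankingMetrics testMetrics = mlContext.Ranking.Evaluate(final.Transform(testSet));

// (4) Retrain on all available data and ship that model to production.
ITransformer productionModel = pipeline.Fit(allDataSet);
mlContext.Model.Save(productionModel, allDataSet.Schema, "model.zip");
```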
#Resolved
In this sample, we show how to apply ranking to search engine results. To perform ranking, there are two algorithms currently available - FastTree Boosting (FastRank) and Light Gradient Boosting Machine (LightGBM). We use LightGBM's LambdaRank implementation in this sample to automatically build an ML model to predict ranking.

## Dataset
The training and testing data used by this sample is based on a public [dataset provided by Microsoft](https://www.microsoft.com/en-us/research/project/mslr/) originally provided Microsoft Bing.
Added note about the licensing of the dataset:

```suggestion
The training and testing data used by this sample is based on a public [dataset provided by Microsoft](https://www.microsoft.com/en-us/research/project/mslr/) originally provided by Microsoft Bing. The dataset is released under a [CC-by 4.0](https://creativecommons.org/licenses/by/4.0/) license.
```

This information is also available in the dataset's readme in the main repo -- https://github.com/dotnet/machinelearning/blob/056c60479304a3b5dbdf129c9bc6e853322bb090/test/data/README.md#mslr-web10k-mslr-web30k

May want to take the citation from there also.
#resolved
```CSharp
public uint Label { get; set; }

// Prediction made by the model that is used to indicate the relative ranking of the benchmark data instances.
```
```suggestion
// Prediction made by the model that is used to indicate the relative ranking of candidate search results.
```
#resolved
Ranking Sample - Hotel Search Results (changed to Bing Search Engine result ranking) (#533, merged via #549)

* Created ranking sample
* removed todo
* Fixed wording in ReadMe
* Fixed typos
* Modified RankingMetric code
* Incorporated Justin's feedback
* Fixed minor inconsistencies
* Converted to new dataset
* Changed code to download dataset since its zip is too large
* fixed using statement
* Removed unneeded license info for dataset
* Renamed solution and minor changes
* minor fixes
* Justin's feedback for PR into master
* fixed period and spacing inconsistencies
Please review the sample that I have created to show how to do ranking using LightGBM.
Note that I have changed this sample from using the Expedia dataset to instead use Bing's search result dataset.