
Ranking Sample - Hotel Search Results (changed to Bing Search Engine result ranking) #533


Merged

Conversation

@nicolehaugen (Contributor) commented Jun 25, 2019

Please review the sample that I have created to show how to do ranking using Light GBM.
Note that I have changed this sample from using the Expedia dataset to instead use Bing's search result dataset.

@dnfclas commented Jun 25, 2019

CLA assistant check
All CLA requirements met.

@nicolehaugen nicolehaugen changed the base branch from master to features/ranking-sample June 25, 2019 00:59

```CSharp
static void PrepDatasets(MLContext mlContext, string assetPath, string originalDatasetPath, string trainDatasetPath, string testDatasetPath)
{
    const string DatasetUrl = "https://www.kaggle.com/c/expedia-personalized-sort/download/data.zip";
```
@justinormont (Contributor) commented Jun 26, 2019

File is not public. Is requiring a user to sign up for a Kaggle account too onerous?

@nicolehaugen (Contributor Author) commented Jun 26, 2019

Yes, you do need a Kaggle account for this - I had assumed that all Kaggle datasets required you to have an account. But, based on your comment, that must not be the case, so this may be more overhead than we can expect of a user. You probably noticed, but I have sent an email to CELA to learn whether there is a possibility of using this dataset. If it turns out I can't, then I will look at using the benchmark dataset. Overall, I feel that the hotel data is a very compelling scenario, which is why I decided to try pursuing it.

@justinormont (Contributor):

I'd recommend all demos have datasets that can be downloaded automatically (no user action needed). Will CELA let us host the dataset in our CDN?

Granted, Kaggle is pretty awesome and folks should hone their data science skills there. I think forcing our users to sign up for an external account before trying demo code is a bit rough.

We will have more ranking samples, of course. For instance, a short sample that teaches fewer aspects but is easy to adapt to their own datasets. And perhaps a sample which does string comparisons (query vs. documents) for feature engineering.

@nicolehaugen (Contributor Author):

I have now switched the dataset to use the one you recommended for ranking search engine results.

#resolved


```CSharp
static void PrepDatasets(MLContext mlContext, string assetPath, string originalDatasetPath, string trainDatasetPath, string testDatasetPath)
{
    const string DatasetUrl = "https://www.kaggle.com/c/expedia-personalized-sort/download/data.zip";
```
@justinormont (Contributor):

I'd recommend using the MSLR-WEB10K (or WEB30K) dataset from our CDN:

We use the MSLR-WEB10K dataset for our ranking benchmarks. The dataset is auto-downloaded from the above CDN aka.ms URLs when you run `build.cmd -- /t:DownloadExternalTestFiles /p:IncludeBenchmarkData=true` to download/run the benchmark datasets.

@nicolehaugen (Contributor Author):

See comment above - I will use this as a substitute if I can't get approval from CELA on the Expedia dataset.

@nicolehaugen (Contributor Author):

Same comment as above - I am now using this dataset. #resolved

* Information on similar competitor hotel offerings.

## ML Task - Ranking
As previously mentioned, this sample uses the LightGbm Lambdarank algorithm which is applied using a supervised learning technique known as "Learning to Rank". This technique requires that train/test datasets contain groups of data instances that are labeled with their ideal ranking value. The label is a numerical\ordinal value, such as {4, 3, 2, 1, 0} or a text value {"Perfect", "Excellent", "Good", "Fair", or "Bad"}. The process for labeling these data instances with their ideal ranking value can be done manually by subject matter experts. Or, the labels can be determined using other metrics, such as the number of clicks on a given search result. This sample uses the latter approach.
@justinormont (Contributor) commented Jun 26, 2019

Fixed:

  • LightGbm => LightGBM
  • Ordering of {4, 3, 2, 1, 0} changed to {0, 1, 2, 3, 4} to encourage users to use more/fewer ordinals should their data map better to more/fewer relevance values.
  • "ideal ranking value" => "relevance scores"
  • Linked: Learning to Rank
Suggested change
As previously mentioned, this sample uses the LightGbm Lambdarank algorithm which is applied using a supervised learning technique known as "Learning to Rank". This technique requires that train/test datasets contain groups of data instances that are labeled with their ideal ranking value. The label is a numerical\ordinal value, such as {4, 3, 2, 1, 0} or a text value {"Perfect", "Excellent", "Good", "Fair", or "Bad"}. The process for labeling these data instances with their ideal ranking value can be done manually by subject matter experts. Or, the labels can be determined using other metrics, such as the number of clicks on a given search result. This sample uses the latter approach.
As previously mentioned, this sample uses the LightGBM LambdaRank algorithm which is applied using a supervised learning technique known as "[Learning to Rank](https://en.wikipedia.org/wiki/Learning_to_rank)". This technique requires that train/test datasets contain groups of data instances that are each labeled with their relevance scores. The label is a numerical\ordinal value, such as {0, 1, 2, 3, 4} or a text value {"Bad", "Fair", "Good", "Excellent", or "Perfect"}. The process for labeling these data instances with their relevance scores can be done manually by subject matter experts. Or, the labels can be determined using other metrics, such as the number of clicks on a given search result. This sample uses the latter approach.

@justinormont (Contributor) commented Jun 26, 2019

We may also want to note that it's expected to have many more "Bad" relevance scores than "Perfect". This helps users avoid converting a ranked list directly into equally sized bins of {0, 1, 2, 3, 4}.

The relevance scores are re-used. Generally, you'll have many items per group which are labeled 0, which means the result is "bad"; and only one or a few labeled 4, which means that result is "perfect".

@nicolehaugen (Contributor Author):

I replaced all references to "ideal ranking value" to "relevance score".

For your point on "This helps to avoid converting a ranked list directly into equally sized bins of {0, 1, 2, 3, 4}" - can you explain a bit more why this needs to be avoided? I'd like to understand this point and then also explain it in the sample. For example, would the model potentially end up predicting ranks that are nearly the same for each result?

Good point about reusing scores - I inadvertently left this out and will add it.

@justinormont (Contributor):

It is a form of leakage -- I think I'd group the various styles of leakage in the following way...

Leakage styles:

  • Feature availability (columns which aren't available at prediction time -- e.g. a MonthlySalary column when predicting YearlySalary; or MinutesLate when predicting IsLate; or, more subtly, NumOfLatePayments when predicting ShouldGiveLoan)
  • Non-iid data:
    • Data leakage (e.g. splitting a time-series dataset randomly instead of putting newer data in the test set; or MinMax normalizing a dataset and then splitting)
    • Duplicate rows between train/validation/test (e.g. oversampling a dataset to pad its size before splitting; different rotations of a single image; bootstrap sampling before splitting; or duplicating rows to upsample a minority class)
    • Group leakage -- not including a grouping split column (e.g. Andrew Ng's team had 100k x-rays of 30k patients, meaning ~3 images per patient. They did random splitting instead of ensuring that all images of a patient were in the same split. Hence the model partially memorized the patients instead of learning to recognize pneumonia in chest x-rays. The revised paper had a noticeable drop in scores.)

The first level of the grouping is whether the leakage is column-based or row-based. Either way, these leakages cause you to misestimate how well your model will perform in production.

This one is a form of row-based leakage, where the lack of a samplingKeyColumnName column causes group leakage, which may also show up as data leakage or duplicate rows. I'm uncertain which I'd call it, or whether my ontology of leakage styles is sufficient.
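To make the grouped split concrete, here is a minimal sketch (assuming the sample's SearchResultData type and an already-loaded data view) of the 80/20 split used later in this PR, where samplingKeyColumnName keeps every row of a query group on the same side of the split:

```CSharp
// Sketch only: SearchResultData and data are assumed from the sample's context.
// Rows sharing a GroupId (query id) stay together, preventing group leakage.
var split = mlContext.Data.TrainTestSplit(
    data,
    testFraction: 0.2,
    samplingKeyColumnName: nameof(SearchResultData.GroupId));

IDataView trainData = split.TrainSet;
IDataView testData = split.TestSet;
```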

@nicolehaugen (Contributor Author):

#resolved

* Information on similar competitor hotel offerings.

## ML Task - Ranking
As previously mentioned, this sample uses the LightGbm Lambdarank algorithm which is applied using a supervised learning technique known as "Learning to Rank". This technique requires that train/test datasets contain groups of data instances that are labeled with their ideal ranking value. The label is a numerical\ordinal value, such as {4, 3, 2, 1, 0} or a text value {"Perfect", "Excellent", "Good", "Fair", or "Bad"}. The process for labeling these data instances with their ideal ranking value can be done manually by subject matter experts. Or, the labels can be determined using other metrics, such as the number of clicks on a given search result. This sample uses the latter approach.
@justinormont (Contributor):

Has someone checked whether {"Perfect", "Excellent", "Good", "Fair", or "Bad"} still works as labels? This works from the MAML command-line, but I'm uncertain whether the estimators API accepts anything besides {4, 3, 2, 1, 0}.

@nicolehaugen (Contributor Author):

I haven't tried it (I lifted this point from the TLC documentation and assumed it was still applicable). I will verify it to confirm.

@nicolehaugen (Contributor Author):

I followed up on this by changing my label values to strings ("Perfect", "Fair", "Bad") - when I attempt to train the model, I get an InvalidOperationException with the following message, so it looks like this isn't supported. Does a bug need to be logged for this?

"Splitter/consolidator worker encountered exception while consuming source data.
"Could not parse value Bad in line 62, column Label".

@nicolehaugen (Contributor Author):

Based on my above comment, I have removed the mention of using strings as label values. #resolved
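For reference, a minimal sketch of the numeric schema this implies (column indices are illustrative; the real type also declares the feature columns):

```CSharp
using Microsoft.ML.Data;

public class SearchResultData
{
    // Relevance score; must be numeric (e.g. {0, 1, 2, 3, 4}) -- string labels
    // such as "Bad"/"Perfect" fail to parse, per the exception above.
    [LoadColumn(0)]
    public uint Label { get; set; }

    // Query/group id; all results for one search share a GroupId.
    [LoadColumn(1)]
    public uint GroupId { get; set; }
}
```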


```CSharp
// When splitting the data, 20% is held for the test dataset.
// To avoid label leakage, the GroupId (e.g. search\query id) is specified as the samplingKeyColumnName.
```
@justinormont (Contributor):

Suggested change
// To avoid label leakage, the GroupId (e.g. search\query id) is specified as the samplingKeyColumnName.
// To avoid data leakage, the GroupId (e.g. search\query id) is specified as the samplingKeyColumnName.

@nicolehaugen (Contributor Author):

Note that the actual documentation provided for this parameter says the following: "This can be used to ensure no label leakage from the train to the test set."

This obviously is a minor issue in the doc, but it sounds like the way it's currently written is incorrect?

@nicolehaugen (Contributor Author):

#resolved

```xml
<TargetFramework>netcoreapp2.2</TargetFramework>
</PropertyGroup>

<ItemGroup>
```
@eerhardt (Member):

None of the elements in this ItemGroup should be necessary. This whole group can be deleted.

@eerhardt (Member) commented Jun 27, 2019

Oh, I see you have a .csproj nested under another .csproj. This isn't typical. Can they be in sibling folders instead? Or can we just have a single .csproj?



@nicolehaugen (Contributor Author):

Yes, that was my mistake when I was copying and pasting some things around while cleaning up my solution. I have deleted the extra .csproj file.

@nicolehaugen (Contributor Author):

#resolved

```CSharp
IDataView trainData = mlContext.Data.LoadFromTextFile<HotelData>(trainDatasetPath, separatorChar: ',', hasHeader: true);

// Specify the columns to include in the feature input data.
var featureCols = trainData.Schema.AsQueryable()
```
@eerhardt (Member):

Could this simply be:

```CSharp
var featureCols = new[] { nameof(HotelData.Price_USD), nameof(HotelData.Promotion_Flag), nameof(HotelData.Prop_Id), nameof(HotelData.Prop_Brand), nameof(HotelData.Prop_Review_Score) };
```

?

@nicolehaugen (Contributor Author):

I have since moved to a new dataset where there are 136 columns - as a result, this code is now:

```CSharp
var featureCols = trainData.Schema.AsQueryable()
    .Select(s => s.Name)
    .Where(c =>
        c != nameof(SearchResultData.Label) &&
        c != nameof(SearchResultData.GroupId))
    .ToArray();
```

#resolved
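For context, a sketch of where featureCols goes next - the GroupId key conversion shown is an assumption (the sample may use hashing for the group column instead):

```CSharp
// Sketch only: concatenate the selected columns into one "Features" vector,
// convert GroupId to a key type (required by the ranking trainer), then train.
var pipeline = mlContext.Transforms.Concatenate("Features", featureCols)
    .Append(mlContext.Transforms.Conversion.MapValueToKey(nameof(SearchResultData.GroupId)))
    .Append(mlContext.Ranking.Trainers.LightGbm(
        labelColumnName: nameof(SearchResultData.Label),
        featureColumnName: "Features",
        rowGroupColumnName: nameof(SearchResultData.GroupId)));

var model = pipeline.Fit(trainData);
```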

```CSharp
{
const string DatasetUrl = "https://www.kaggle.com/c/expedia-personalized-sort/download/data.zip";

if (!File.Exists(trainDatasetPath) || !File.Exists(testDatasetPath))
```
@eerhardt (Member):

(nit) the whitespace is off here.

@nicolehaugen (Contributor Author):

#resolved

```CSharp
Console.WriteLine("===== Prepare the testing/training datasets =====");

// Load dataset using TextLoader by specifying the type name that holds the data's schema to be mapped with datasets.
IDataView data = mlContext.Data.LoadFromTextFile<HotelData>(originalDatasetPath, separatorChar: ',', hasHeader: true);
```
@eerhardt (Member):

Where do HotelData and HotelPrediction come from? I don't see them in this PR.

@nicolehaugen (Contributor Author):

They are in the DataStructures directory - I see them in the PR... maybe you overlooked them? Anyway, since moving to the new dataset, these files have been removed and replaced with new types.

@nicolehaugen (Contributor Author):

#resolved

@nicolehaugen nicolehaugen changed the title Ranking Sample - Hotel Search Results Ranking Sample - Hotel Search Results (changed to Bing Search Engine result ranking) Jun 28, 2019
```xml
</PropertyGroup>

<ItemGroup>
  <Compile Remove="DataStructures\HotelData.cs" />
```
@eerhardt (Member):

These can all be removed, can't they? I don't see a file Program_old.cs or Mapper.cs

@nicolehaugen (Contributor Author):

Yep - I missed that - thanks! #resolved

@nicolehaugen (Contributor Author):

#resolved

```CSharp
using System.Collections.Generic;
using System.Linq;

namespace PersonalizedRanking.Common
```
@justinormont (Contributor):

Renaming:
I might recommend a title for this sample: "MSLR-WEB10K Ranking" or, more simply, "Web Ranking".

I don't think there is personalization in this dataset. Personalization in the ranking area generally means that each user gets individualized search results. In this dataset, there is no information about the user. This information would generally include { topics of interest of the user, demographics of the user, current location of the user, etc }.

@nicolehaugen (Contributor Author):

#resolved

```CSharp
{
const string AssetsPath = @"../../../Assets";
const string TrainDatasetUrl = "https://aka.ms/mlnet-resources/benchmarks/MSLRWeb10KTrain720kRows.tsv";
const string TestDatasetUrl = "https://aka.ms/mlnet-resources/benchmarks/MSLRWeb10KTest240kRows.tsv";
```
@justinormont (Contributor):

Doesn't hurt to list the validation dataset. This would let users experiment on the dataset by iterating using the scores on the validation set, then testing their final solution, exactly once, on the test dataset.

Generally the DS pattern for ML should look like:

  1. Train on training set, get metrics from validation set
  2. (iterate many times until happy with pipeline -- goto 1)
  3. Train the found pipeline on the combined train+validate set, get metrics on the test set (exactly once) -- this is your final metrics
  4. Retrain the pipeline on all data (train+validate+test). Send this newly created model to production.

The final estimate of how well your model will do in production is the metrics from step (3). The final model for production, trained on all available data, is trained at step (4).
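A sketch of that flow (validationDatasetPath is hypothetical; pipeline, trainData, and the loader settings are assumed from the sample):

```CSharp
// Sketch only: validationDatasetPath is hypothetical; pipeline/trainData come
// from the sample, and the loader settings (separator, header) are assumed.
IDataView validationData = mlContext.Data.LoadFromTextFile<SearchResultData>(
    validationDatasetPath, separatorChar: '\t', hasHeader: false);

// Step 1: fit on the training set; read metrics from the validation set.
var model = pipeline.Fit(trainData);
var validationMetrics = mlContext.Ranking.Evaluate(model.Transform(validationData));

// Step 2: adjust the pipeline and repeat step 1 until happy.
// Step 3: fit the chosen pipeline on train+validation; evaluate exactly once on the test set.
// Step 4: refit on train+validation+test and send that model to production.
```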

@nicolehaugen (Contributor Author):

#Resolved
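For reference, a sketch of fetching these files automatically at startup (constant names reused from the snippet above; the sample's actual download code may differ):

```CSharp
// Sketch only: download the train/test TSVs if missing, so no user action is needed.
if (!File.Exists(trainDatasetPath) || !File.Exists(testDatasetPath))
{
    using (var client = new System.Net.WebClient())
    {
        client.DownloadFile(TrainDatasetUrl, trainDatasetPath);
        client.DownloadFile(TestDatasetUrl, testDatasetPath);
    }
}
```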

In this sample, we show how to apply ranking to search engine results. To perform ranking, there are two algorithms currently available - FastTree Boosting (FastRank) and Light Gradient Boosting Machine (LightGBM). We use LightGBM's LambdaRank implementation in this sample to automatically build an ML model to predict ranking.

## Dataset
The training and testing data used by this sample is based on a public [dataset provided by Microsoft](https://www.microsoft.com/en-us/research/project/mslr/) originally provided by Microsoft Bing.
@justinormont (Contributor):

Added note about the licensing of the dataset:

Suggested change
The training and testing data used by this sample is based on a public [dataset provided by Microsoft](https://www.microsoft.com/en-us/research/project/mslr/) originally provided by Microsoft Bing.
The training and testing data used by this sample is based on a public [dataset provided by Microsoft](https://www.microsoft.com/en-us/research/project/mslr/) originally provided by Microsoft Bing. The dataset is released under a [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) license.
This information is also available in the dataset's readme in the main repo -- https://github.com/dotnet/machinelearning/blob/056c60479304a3b5dbdf129c9bc6e853322bb090/test/data/README.md#mslr-web10k-mslr-web30k
You may want to take the citation from there also.

@nicolehaugen (Contributor Author):

#resolved


```CSharp
public uint Label { get; set; }

// Prediction made by the model that is used to indicate the relative ranking of the benchmark data instances.
```
@justinormont (Contributor):

Suggested change
// Prediction made by the model that is used to indicate the relative ranking of the benchmark data instances.
// Prediction made by the model that is used to indicate the relative ranking of candidate search results.

@nicolehaugen (Contributor Author):

#resolved
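For reference, a sketch of where the prediction Score feeds evaluation (model and testData assumed from the sample; index 2 of the NDCG list is NDCG@3):

```CSharp
// Sketch only: model and testData are assumed from the sample's training code.
IDataView predictions = model.Transform(testData);

// Defaults expect columns named Label, GroupId, and Score.
var metrics = mlContext.Ranking.Evaluate(predictions);
Console.WriteLine($"NDCG@3: {metrics.NormalizedDiscountedCumulativeGains[2]:F4}");
```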

@nicolehaugen nicolehaugen merged commit 7b5dbe0 into dotnet:features/ranking-sample Jun 29, 2019
CESARDELATORRE pushed a commit that referenced this pull request Jul 16, 2019
(#549)

* Ranking Sample - Hotel Search Results (changed to Bing Search Engine result ranking) (#533)

* Created ranking sample

* removed todo

* Fixed wording in ReadMe

* Fixed typos

* Modified RankingMetric code

* Incorporated Justin's feedback

* Fixed minor inconsistencies

* Converted to new dataset

* Changed code to download dataset since its zip is too large

* fixed using statement

* Removed unneeded license info for dataset

* Renamed solution and minor changes

* minor fixes

* Justin's feedback for PR into master

* fixed period and spacing inconsistencies