
Ranking Sample - Hotel Search Results (changed to Bing Search Engine result ranking) #533


Merged

Conversation

@nicolehaugen (Contributor) commented Jun 25, 2019

Please review the sample that I have created to show how to do ranking using Light GBM.
Note that I have changed this sample from using the Expedia dataset to instead use Bing's search result dataset.

@dnfclas commented Jun 25, 2019

CLA assistant check
All CLA requirements met.

@nicolehaugen nicolehaugen changed the base branch from master to features/ranking-sample June 25, 2019 00:59

```CSharp
static void PrepDatasets(MLContext mlContext, string assetPath, string originalDatasetPath, string trainDatasetPath, string testDatasetPath)
{
    const string DatasetUrl = "https://www.kaggle.com/c/expedia-personalized-sort/download/data.zip";
```
@justinormont (Contributor) commented Jun 26, 2019

File is not public. Is requiring a user to sign up for a Kaggle account too onerous?

@nicolehaugen (Contributor Author) commented Jun 26, 2019

Yes, you do need a Kaggle account for this - I had assumed that all Kaggle datasets required you to have an account. But, based on your comment, that must not be the case, so this may be more overhead than we can expect of a user. You probably noticed, but I have sent an email to CELA to learn whether there is a possibility of using this dataset. If it turns out I can't, then I will look at using the benchmark dataset. Overall, I feel that the hotel data is a very compelling scenario, which is why I decided to try pursuing it.

@justinormont (Contributor):

I'd recommend all demos have datasets that can be downloaded automatically (no user action needed). Will CELA let us host the dataset in our CDN?

Granted, Kaggle is pretty awesome and folks should hone their data science skills there. I think forcing our users to sign up for an external account before trying demo code is a bit rough.

We will have more ranking samples, of course. For instance, a short sample that teaches fewer aspects but is easy to adapt to their own datasets. And perhaps a sample which does string comparisons (query vs. documents) for feature engineering.

@nicolehaugen (Contributor Author):

I have now switched the dataset to use the one you recommended for ranking search engine results.

#resolved


```CSharp
static void PrepDatasets(MLContext mlContext, string assetPath, string originalDatasetPath, string trainDatasetPath, string testDatasetPath)
{
    const string DatasetUrl = "https://www.kaggle.com/c/expedia-personalized-sort/download/data.zip";
```
@justinormont (Contributor):

I'd recommend using the MSLR-WEB10K (or WEB30K) dataset from our CDN:

We use the MSLR-WEB10K dataset for our ranking benchmarks. The dataset is auto-downloaded from the above CDN aka.ms URLs when you run `build.cmd -- /t:DownloadExternalTestFiles /p:IncludeBenchmarkData=true` to download/run the benchmark datasets.

@nicolehaugen (Contributor Author):

See comment above - I will use this as a substitute if I can't get approval from CELA on the Expedia dataset.

@nicolehaugen (Contributor Author):

Same comment as above - I am now using this dataset. #resolved

* Information on similar competitor hotel offerings.

## ML Task - Ranking
As previously mentioned, this sample uses the LightGbm Lambdarank algorithm which is applied using a supervised learning technique known as "Learning to Rank". This technique requires that train/test datasets contain groups of data instances that are labeled with their ideal ranking value. The label is a numerical\ordinal value, such as {4, 3, 2, 1, 0} or a text value {"Perfect", "Excellent", "Good", "Fair", or "Bad"}. The process for labeling these data instances with their ideal ranking value can be done manually by subject matter experts. Or, the labels can be determined using other metrics, such as the number of clicks on a given search result. This sample uses the latter approach.
@justinormont (Contributor) commented Jun 26, 2019

Fixed:

  • LightGbm => LightGBM
  • Ordering of {4, 3, 2, 1, 0} changed to {0, 1, 2, 3, 4} to encourage users to use more/fewer ordinals should their data map better to more/fewer relevance values.
  • "ideal ranking value" => "relevance scores"
  • Linked: Learning to Rank
Suggested change
As previously mentioned, this sample uses the LightGbm Lambdarank algorithm which is applied using a supervised learning technique known as "Learning to Rank". This technique requires that train/test datasets contain groups of data instances that are labeled with their ideal ranking value. The label is a numerical\ordinal value, such as {4, 3, 2, 1, 0} or a text value {"Perfect", "Excellent", "Good", "Fair", or "Bad"}. The process for labeling these data instances with their ideal ranking value can be done manually by subject matter experts. Or, the labels can be determined using other metrics, such as the number of clicks on a given search result. This sample uses the latter approach.
As previously mentioned, this sample uses the LightGBM LambdaRank algorithm which is applied using a supervised learning technique known as "[Learning to Rank](https://en.wikipedia.org/wiki/Learning_to_rank)". This technique requires that train/test datasets contain groups of data instances that are each labeled with their relevance scores. The label is a numerical\ordinal value, such as {0, 1, 2, 3, 4} or a text value {"Bad", "Fair", "Good", "Excellent", or "Perfect"}. The process for labeling these data instances with their relevance scores can be done manually by subject matter experts. Or, the labels can be determined using other metrics, such as the number of clicks on a given search result. This sample uses the latter approach.

@justinormont (Contributor) commented Jun 26, 2019

We may also want to note that it's expected to have many more "Bad" relevance scores than "Perfect". This helps users avoid converting a ranked list directly into equally sized bins of {0, 1, 2, 3, 4}.

The relevance scores are re-used. Generally, you'll have many items per group which are labeled 0, which means the result is "bad"; and only one or a few labeled 4, which means that result is "perfect".

@nicolehaugen (Contributor Author):

I replaced all references to "ideal ranking value" to "relevance score".

For your point on "This helps to avoid converting a ranked list directly into equally sized bins of {0, 1, 2, 3, 4}" - can you explain a bit more why this needs to be avoided? I'd like to understand this point and then also explain it in the sample. For example, would the model potentially end up predicting ranks that are nearly the same for each result?

Good point about reusing scores - I inadvertently left this out and will add it.

@justinormont (Contributor):

It is a form of leakage -- I think I'd group the various styles of leakage in the following way...

Leakage styles:

  • Feature availability (columns which aren't available at prediction time -- e.g. a MonthlySalary column when predicting YearlySalary; or MinutesLate when predicting IsLate; or, more subtly, NumOfLatePayments when predicting ShouldGiveLoan)
  • Non-iid data:
    • Data leakage (e.g. splitting a time-series dataset randomly instead of putting newer data in the test set; or MinMax normalizing a dataset and then splitting)
    • Duplicate rows between train/validation/test (e.g. oversampling a dataset to pad its size before splitting; different rotations of a single image; bootstrap sampling before splitting; or duplicating rows to upsample a minority class)
    • Group leakage -- not including a grouping split column (e.g. Andrew Ng's team had 100k x-rays of 30k patients, meaning ~3 images per patient. They did random splitting instead of ensuring that all images of a patient were in the same split. Hence the model partially memorized the patients instead of learning to recognize pneumonia in chest x-rays. The revised paper had a noticeable drop in scores.)

The first level of the grouping is whether the leakage is column-based or row-based. Either way, these leakages cause you to misestimate how well your model will perform in production.

This one is a form of row-based leakage, where the lack of a samplingKeyColumnName column causes group leakage, which may also show up as data leakage or duplicate rows. I'm uncertain which I'd call it, or whether my ontology of leakage styles is sufficient.
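To make the grouped split concrete, here is a minimal sketch (assuming the sample's SearchResultData type and an already-loaded data view) of the 80/20 split used later in this PR, where samplingKeyColumnName keeps every row of a query group on the same side of the split:

```CSharp
// Sketch only: SearchResultData and data are assumed from the sample's context.
// Rows sharing a GroupId (query id) stay together, preventing group leakage.
var split = mlContext.Data.TrainTestSplit(
    data,
    testFraction: 0.2,
    samplingKeyColumnName: nameof(SearchResultData.GroupId));

IDataView trainData = split.TrainSet;
IDataView testData = split.TestSet;
```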

@nicolehaugen (Contributor Author):

#resolved

* Information on similar competitor hotel offerings.

## ML Task - Ranking
As previously mentioned, this sample uses the LightGbm Lambdarank algorithm which is applied using a supervised learning technique known as "Learning to Rank". This technique requires that train/test datasets contain groups of data instances that are labeled with their ideal ranking value. The label is a numerical\ordinal value, such as {4, 3, 2, 1, 0} or a text value {"Perfect", "Excellent", "Good", "Fair", or "Bad"}. The process for labeling these data instances with their ideal ranking value can be done manually by subject matter experts. Or, the labels can be determined using other metrics, such as the number of clicks on a given search result. This sample uses the latter approach.
@justinormont (Contributor):

Has someone checked whether {"Perfect", "Excellent", "Good", "Fair", or "Bad"} still works as labels? This works from the MAML command-line, but I'm uncertain whether the estimators API accepts anything besides {4, 3, 2, 1, 0}.

@nicolehaugen (Contributor Author):

I haven't tried it (I lifted this point from the TLC documentation and assumed it was still applicable). I will verify it to confirm.

@nicolehaugen (Contributor Author):

I followed up on this by changing my label values to strings ("Perfect", "Fair", "Bad") - when I attempt to train the model, I get an InvalidOperationException with the following message, so it looks like this isn't supported. Does a bug need to be logged for this?

"Splitter/consolidator worker encountered exception while consuming source data.
"Could not parse value Bad in line 62, column Label".

@nicolehaugen (Contributor Author):

Based on my above comment, I have removed the mention of using strings as label values. #resolved
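For reference, a minimal sketch of the numeric schema this implies (column indices are illustrative; the real type also declares the feature columns):

```CSharp
using Microsoft.ML.Data;

public class SearchResultData
{
    // Relevance score; must be numeric (e.g. {0, 1, 2, 3, 4}) -- string labels
    // such as "Bad"/"Perfect" fail to parse, per the exception above.
    [LoadColumn(0)]
    public uint Label { get; set; }

    // Query/group id; all results for one search share a GroupId.
    [LoadColumn(1)]
    public uint GroupId { get; set; }
}
```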


```CSharp
// When splitting the data, 20% is held for the test dataset.
// To avoid label leakage, the GroupId (e.g. search\query id) is specified as the samplingKeyColumnName.
```
@justinormont (Contributor):

Suggested change
// To avoid label leakage, the GroupId (e.g. search\query id) is specified as the samplingKeyColumnName.
// To avoid data leakage, the GroupId (e.g. search\query id) is specified as the samplingKeyColumnName.

@nicolehaugen (Contributor Author):

Note that the actual documentation provided for this parameter says the following: "This can be used to ensure no label leakage from the train to the test set."

This obviously is a minor issue in the doc, but it sounds like the way it's currently written is incorrect?

@nicolehaugen (Contributor Author):

#resolved

```xml
<TargetFramework>netcoreapp2.2</TargetFramework>
</PropertyGroup>

<ItemGroup>
```
@eerhardt (Member):

None of the elements in this ItemGroup should be necessary. This whole group can be deleted.

@eerhardt (Member) commented Jun 27, 2019

Oh, I see you have a .csproj nested under another .csproj. This isn't typical. Can they be in sibling folders instead? Or can we just have a single .csproj?



@nicolehaugen (Contributor Author):

Yes, that was my mistake when I was copying and pasting some things around while cleaning up my solution. I have deleted the extra .csproj file.

@nicolehaugen (Contributor Author):

#resolved

```CSharp
IDataView trainData = mlContext.Data.LoadFromTextFile<HotelData>(trainDatasetPath, separatorChar: ',', hasHeader: true);

// Specify the columns to include in the feature input data.
var featureCols = trainData.Schema.AsQueryable()
```
@eerhardt (Member):

Could this simply be:

```CSharp
var featureCols = new[] { nameof(HotelData.Price_USD), nameof(HotelData.Promotion_Flag), nameof(HotelData.Prop_Id), nameof(HotelData.Prop_Brand), nameof(HotelData.Prop_Review_Score) };
```

?

@nicolehaugen (Contributor Author):

I have since moved to a new dataset where there are 136 columns - as a result, this code is now:

```CSharp
var featureCols = trainData.Schema.AsQueryable()
    .Select(s => s.Name)
    .Where(c =>
        c != nameof(SearchResultData.Label) &&
        c != nameof(SearchResultData.GroupId))
    .ToArray();
```

#resolved
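For context, a sketch of where featureCols goes next - the GroupId key conversion shown is an assumption (the sample may use hashing for the group column instead):

```CSharp
// Sketch only: concatenate the selected columns into one "Features" vector,
// convert GroupId to a key type (required by the ranking trainer), then train.
var pipeline = mlContext.Transforms.Concatenate("Features", featureCols)
    .Append(mlContext.Transforms.Conversion.MapValueToKey(nameof(SearchResultData.GroupId)))
    .Append(mlContext.Ranking.Trainers.LightGbm(
        labelColumnName: nameof(SearchResultData.Label),
        featureColumnName: "Features",
        rowGroupColumnName: nameof(SearchResultData.GroupId)));

var model = pipeline.Fit(trainData);
```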

```CSharp
{
const string DatasetUrl = "https://www.kaggle.com/c/expedia-personalized-sort/download/data.zip";

if (!File.Exists(trainDatasetPath) || !File.Exists(testDatasetPath))
```
@eerhardt (Member):

(nit) the whitespace is off here.

@nicolehaugen (Contributor Author):

#resolved

```CSharp
Console.WriteLine("===== Prepare the testing/training datasets =====");

// Load dataset using TextLoader by specifying the type name that holds the data's schema to be mapped with datasets.
IDataView data = mlContext.Data.LoadFromTextFile<HotelData>(originalDatasetPath, separatorChar: ',', hasHeader: true);
```
@eerhardt (Member):

Where do HotelData and HotelPrediction come from? I don't see them in this PR.

@nicolehaugen (Contributor Author):

They are in the DataStructures directory - I see them in the PR... maybe you overlooked them? Anyway, since moving to the new dataset, these files have been removed and replaced with new types.

@nicolehaugen (Contributor Author):

#resolved

@nicolehaugen nicolehaugen changed the title Ranking Sample - Hotel Search Results Ranking Sample - Hotel Search Results (changed to Bing Search Engine result ranking) Jun 28, 2019
```xml
</PropertyGroup>

<ItemGroup>
  <Compile Remove="DataStructures\HotelData.cs" />
```
@eerhardt (Member):

These can all be removed, can't they? I don't see a file Program_old.cs or Mapper.cs

@nicolehaugen (Contributor Author):

Yep - I missed that - thanks! #resolved

@nicolehaugen (Contributor Author):

#resolved

```CSharp
using System.Collections.Generic;
using System.Linq;

namespace PersonalizedRanking.Common
```
@justinormont (Contributor):

Renaming:
I might recommend a title for this sample: "MSLR-WEB10K Ranking" or, more simply, "Web Ranking".

I don't think there is personalization in this dataset. Personalization in the ranking area generally means that each user gets individualized search results. In this dataset, there is no information about the user. This information would generally include { topics of interest of the user, demographics of the user, current location of the user, etc }.

@nicolehaugen (Contributor Author):

#resolved

```CSharp
{
const string AssetsPath = @"../../../Assets";
const string TrainDatasetUrl = "https://aka.ms/mlnet-resources/benchmarks/MSLRWeb10KTrain720kRows.tsv";
const string TestDatasetUrl = "https://aka.ms/mlnet-resources/benchmarks/MSLRWeb10KTest240kRows.tsv";
```
@justinormont (Contributor):

Doesn't hurt to list the validation dataset. This would let users experiment on the dataset by iterating using the scores on the validation set, then testing their final solution, exactly once, on the test dataset.

Generally the DS pattern for ML should look like:

  1. Train on training set, get metrics from validation set
  2. (iterate many times until happy with pipeline -- goto 1)
  3. Train the found pipeline on the combined train+validate set, get metrics on the test set (exactly once) -- this is your final metrics
  4. Retrain the pipeline on all data (train+validate+test). Send this newly created model to production.

The final estimate of how well your model will do in production is the metrics from step (3). The final model for production, trained on all available data, is trained at step (4).
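A sketch of that flow (validationDatasetPath is hypothetical; pipeline, trainData, and the loader settings are assumed from the sample):

```CSharp
// Sketch only: validationDatasetPath is hypothetical; pipeline/trainData come
// from the sample, and the loader settings (separator, header) are assumed.
IDataView validationData = mlContext.Data.LoadFromTextFile<SearchResultData>(
    validationDatasetPath, separatorChar: '\t', hasHeader: false);

// Step 1: fit on the training set; read metrics from the validation set.
var model = pipeline.Fit(trainData);
var validationMetrics = mlContext.Ranking.Evaluate(model.Transform(validationData));

// Step 2: adjust the pipeline and repeat step 1 until happy.
// Step 3: fit the chosen pipeline on train+validation; evaluate exactly once on the test set.
// Step 4: refit on train+validation+test and send that model to production.
```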

@nicolehaugen (Contributor Author):

#Resolved
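For reference, a sketch of fetching these files automatically at startup (constant names reused from the snippet above; the sample's actual download code may differ):

```CSharp
// Sketch only: download the train/test TSVs if missing, so no user action is needed.
if (!File.Exists(trainDatasetPath) || !File.Exists(testDatasetPath))
{
    using (var client = new System.Net.WebClient())
    {
        client.DownloadFile(TrainDatasetUrl, trainDatasetPath);
        client.DownloadFile(TestDatasetUrl, testDatasetPath);
    }
}
```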

In this sample, we show how to apply ranking to search engine results. To perform ranking, there are two algorithms currently available - FastTree Boosting (FastRank) and Light Gradient Boosting Machine (LightGBM). We use LightGBM's LambdaRank implementation in this sample to automatically build an ML model to predict ranking.

## Dataset
The training and testing data used by this sample is based on a public [dataset provided by Microsoft](https://www.microsoft.com/en-us/research/project/mslr/) originally provided by Microsoft Bing.
@justinormont (Contributor):

Added note about the licensing of the dataset:

Suggested change
The training and testing data used by this sample is based on a public [dataset provided by Microsoft](https://www.microsoft.com/en-us/research/project/mslr/) originally provided by Microsoft Bing.
The training and testing data used by this sample is based on a public [dataset provided by Microsoft](https://www.microsoft.com/en-us/research/project/mslr/) originally provided by Microsoft Bing. The dataset is released under a [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) license.
This information is also available in the dataset's readme in the main repo -- https://github.com/dotnet/machinelearning/blob/056c60479304a3b5dbdf129c9bc6e853322bb090/test/data/README.md#mslr-web10k-mslr-web30k
You may want to take the citation from there also.

@nicolehaugen (Contributor Author):

#resolved


```CSharp
public uint Label { get; set; }

// Prediction made by the model that is used to indicate the relative ranking of the benchmark data instances.
```
@justinormont (Contributor):

Suggested change
// Prediction made by the model that is used to indicate the relative ranking of the benchmark data instances.
// Prediction made by the model that is used to indicate the relative ranking of candidate search results.

@nicolehaugen (Contributor Author):

#resolved
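For reference, a sketch of where the prediction Score feeds evaluation (model and testData assumed from the sample; index 2 of the NDCG list is NDCG@3):

```CSharp
// Sketch only: model and testData are assumed from the sample's training code.
IDataView predictions = model.Transform(testData);

// Defaults expect columns named Label, GroupId, and Score.
var metrics = mlContext.Ranking.Evaluate(predictions);
Console.WriteLine($"NDCG@3: {metrics.NormalizedDiscountedCumulativeGains[2]:F4}");
```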

@nicolehaugen nicolehaugen merged commit 7b5dbe0 into dotnet:features/ranking-sample Jun 29, 2019
CESARDELATORRE pushed a commit that referenced this pull request Jul 16, 2019
(#549)

* Ranking Sample - Hotel Search Results (changed to Bing Search Engine result ranking) (#533)

* Created ranking sample

* removed todo

* Fixed wording in ReadMe

* Fixed typos

* Modified RankingMetric code

* Incorporated Justin's feedback

* Fixed minor inconsistencies

* Converted to new dataset

* Changed code to download dataset since its zip is too large

* fixed using statement

* Removed unneeded license info for dataset

* Renamed solution and minor changes

* minor fixes

* Justin's feedback for PR into master

* fixed period and spacing inconsistencies