RandomizedPCA Anomaly Detection fraud detection sample #589


Merged: 15 commits, Aug 2, 2019

Conversation

colbylwilliams (Member)

@bamurtaugh (Member)

Looks good when I run it.

However, a couple of weeks ago we completed the migration to ML.NET version 1.2.0 and preview version 0.14.0. I noticed the Directory.Build.props file in this project still uses MicrosoftMLVersion 1.0.0 and MicrosoftMLPreviewVersion 0.12.0. Can you please verify that everything also works with the latest versions?
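
For reference, a minimal sketch of what the updated version properties in Directory.Build.props might look like, assuming the property names mentioned above and omitting whatever else the file contains:

<Project>
  <PropertyGroup>
    <!-- Bump to the versions the repo migrated to (1.2.0 stable, 0.14.0 preview). -->
    <MicrosoftMLVersion>1.2.0</MicrosoftMLVersion>
    <MicrosoftMLPreviewVersion>0.14.0</MicrosoftMLPreviewVersion>
  </PropertyGroup>
</Project>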

@CESARDELATORRE (Contributor)

Hey @colbylwilliams, when possible, can you switch to the latest version of ML.NET (v1.2) and make sure the sample works properly? Then we can do a final test and merge it. :)

if (!File.Exists(Path.Combine(trainOutput, "testData.csv")) ||
    !File.Exists(Path.Combine(trainOutput, "randomizedPca.zip")))
{
    Console.WriteLine("***** YOU NEED TO RUN THE TRAINING PROJECT FIRST *****");
Member

Is there a reason we don't already include a copy of randomizedPca.zip for the user (i.e., is the file too large to upload to GitHub)? For the other samples, I believe we include a copy of the model/files produced by training, so the user doesn't have to train first to get the predictor to work.

Member Author

I based this on how it was done in the BinaryClassification version of this sample. It actually looks like that sample does commit the model into the input directory of its Predictor project. However, as mentioned above, both the test data CSV and the model are generated by the Trainer project (in both samples), and the BinaryClassification version doesn't commit that CSV either. So even in the BinaryClassification sample, the Trainer project has to be run before the Predictor project will work.

I'm happy to follow your guidance here. If we want to commit the model and test dataset into the Predictor project, I can definitely make the change, but we should probably do it for both samples. I believe the longer-term goal is to have all these samples run as a single project rather than separate Trainer/Predictor projects. But again, that's a change that should be made in both CreditCardFraudDetection samples and may be outside the scope of this PR.

Let me know how you want to move forward.

Member

@CESARDELATORRE Would love to get your thoughts here!

@CESARDELATORRE (Contributor), Aug 2, 2019

This concrete sample has two projects. The predictor/scoring project should have the model .zip file plus the test dataset for doing multiple predictions, so it works out of the box without running the training project.

But the training project doesn't need to have the model .zip file, since it will generate it after training; and since the code is split into two projects, it is even better that the training project doesn't have it.

About the dataset: we're only committing/pushing a dataset .zip file (instead of the .csv files directly) because this concrete dataset is larger than 100 MB, which GitHub doesn't allow; the dataset .zip file is a bit smaller than 100 MB.

But for the predictor/scoring project, I think we could include both the model .zip and the test dataset, so a user could just try predictions, if desired, and it would work out of the box instead of raising an error saying that you first need to run the training project.

In any case, we can merge it as it is now and change those details while reviewing it further. It is not critical. :)
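
For context, a minimal sketch of that out-of-the-box flow in the predictor project, assuming the model .zip and test dataset are committed under an Input folder. The TransactionObservation/TransactionPrediction classes, column layout, and file paths here are illustrative placeholders, not the sample's actual names:

using System;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical input/output classes; the sample's real schema has more columns.
public class TransactionObservation
{
    [LoadColumn(0)] public float Time { get; set; }
    [LoadColumn(29)] public float Amount { get; set; }
}

public class TransactionPrediction
{
    public bool PredictedLabel { get; set; }
    public float Score { get; set; }
}

class Program
{
    static void Main()
    {
        var mlContext = new MLContext(seed: 1);

        // Load the committed model instead of requiring a prior training run.
        ITransformer model = mlContext.Model.Load("Input/randomizedPca.zip", out var inputSchema);

        // Score the committed test dataset in bulk...
        IDataView testData = mlContext.Data.LoadFromTextFile<TransactionObservation>(
            "Input/testData.csv", separatorChar: ',', hasHeader: true);
        IDataView scored = model.Transform(testData);

        // ...or predict a single observation.
        var engine = mlContext.Model.CreatePredictionEngine<TransactionObservation, TransactionPrediction>(model);
        var prediction = engine.Predict(new TransactionObservation { Time = 0f, Amount = 0f });
        Console.WriteLine($"Anomaly: {prediction.PredictedLabel}, score: {prediction.Score}");
    }
}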

Contributor

Btw, never mind. For this case it might be better if the scoring/client app copies the model so it picks up the latest training. We'll change the code so it only copies the model .zip and the test dataset; the scoring project doesn't need the training dataset, and git ignores the files that are currently being copied. But this is a minor improvement for clarity.

Thanks! 👍
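
As a rough sketch of that narrower copy step (the folder names and paths are assumptions, not the sample's actual layout), the scoring app would copy only the two files it needs from the training output:

using System;
using System.IO;

class CopyTrainingArtifacts
{
    static void Main()
    {
        // Illustrative paths only; the sample's real folder layout may differ.
        string trainOutput = Path.Combine("..", "Trainer", "assets", "output");
        string scoringInput = Path.Combine("assets", "input");
        Directory.CreateDirectory(scoringInput);

        // Copy only the trained model and the test dataset; the scoring
        // project doesn't need the (much larger) training dataset.
        foreach (string fileName in new[] { "randomizedPca.zip", "testData.csv" })
        {
            string source = Path.Combine(trainOutput, fileName);
            if (File.Exists(source))
            {
                File.Copy(source, Path.Combine(scoringInput, fileName), overwrite: true);
            }
            else
            {
                Console.WriteLine($"Missing training output: {source}. Run the Trainer project first.");
            }
        }
    }
}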
