-
Notifications
You must be signed in to change notification settings - Fork 2.7k
RandomizedPCA Anomaly Detection fraud detection sample #589
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
colbylwilliams
commented
Jul 29, 2019
- RandomizedPCA sample based on this Binary Classification sample
- Note that final completion of this sample is waiting on fix of this issue
…tCardFraudDetection
…editCardFraudDetection
Looks good when I run it. However, a couple weeks ago, we completed migration to ML.NET version 1.2.0 and preview version 0.14.0. I noticed the Directory.Build.props file in this project is using MicrosoftMLVersion 1.0.0 and MicrosoftMLPreviewVersion 0.12.0. Can you please verify that everything works with the most updated versions as well? |
Hey @colbylwilliams , when possible for you can you switch to the latest versions of ML.NET (v1.2) and make sure the sample is working properly, then we can make a final test and merge it? :) |
...rted/AnomalyDetection_CreditCardFraudDetection/CreditCardFraudDetection.Predictor/Program.cs
Show resolved
Hide resolved
if (!File.Exists(Path.Combine(trainOutput, "testData.csv")) || | ||
!File.Exists(Path.Combine(trainOutput, "randomizedPca.zip"))) | ||
{ | ||
Console.WriteLine("***** YOU NEED TO RUN THE TRAINING PROJECT FIRST *****"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason we don't include a copy of randomizedPca.zip for the user already (i.e. is it too large of a file to upload to GitHub)? For the other samples, I believe we include a copy of the model/files produced from training, and the user doesn't have to train first themselves to get the predictor to work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I based this on how it was done in the BinaryClassification version of this sample. Although, it actually looks like that sample does commit the model into the input directory of the sample's Predictor project. However, as mentioned above, both the test data csv and the model are generated by the Trainer project (in both samples) and looks like the BinaryClassification version of this sample doesn't commit that csv either. So even in the BinaryClassification sample, the Trainer project will need to be run before the Predictor sample will work.
I'm happy to follow you guidance here. If we want to commit the model and test data set into the Predictor project, I can definitely make the change, but we should probably do it for both samples. I believe the longer-term goal is to have all these samples run as a single project vs. separate Trainer/Predictor projects. But again, that's a change that should be made in both the CreditCardFraudDetection samples, and may be outside the scope of this PR.
Let me know how you want to move forward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@CESARDELATORRE Would love to get your thoughts here!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This concrete sample has two projects. The predictor/scoring project should have the model.ZIP file plus the Test dataset for doing multiple predictions, so it works out-of-the-box without running the training project.
But the training project doesn't need to have the model.zip file since it will generate it after training and since the code is split in two projects it is even better that the training project doesn't have it.
About the dataset, we're only committing/pushing a dataset .zip file (instead of directly the .csv files) because this concrete dataset is larger than 100MB and that's not allowed by GitHub, therefore we have the .zip file for the dataset which is a bit smaller than 100MB.
But for the predictor/scored I think we could include both the model .zip and the test dataset so a user could just try predictions, if desired, and it'll work out-of-the-box instead of raising an error and saying that you first need to run the training project.
In any case we can merge as it is now and we can change those details while reviewing it further. It is not critical. . :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, never mind. For this case it might be better if the scoring/client app copies the model so it takes the latest training. We'll change the code so it is only copying the .zip model and the test dataset. The scoring project doesn't need the training dataset and git ignore files that are being copied currently. But this is a minor improvement for clarity.
Thanks! 👍
samples/csharp/getting-started/AnomalyDetection_CreditCardFraudDetection/Readme.md
Outdated
Show resolved
Hide resolved
samples/csharp/getting-started/AnomalyDetection_CreditCardFraudDetection/Readme.md
Outdated
Show resolved
Hide resolved
…dDetection/Readme.md Co-Authored-By: Brigit Murtaugh <[email protected]>