RandomizedPCA Anomaly Detection fraud detection sample #589
Is there a reason we don't already include a copy of randomizedPca.zip for the user (i.e. is the file too large to upload to GitHub)? For the other samples, I believe we include a copy of the model/files produced by training, so the user doesn't have to run training themselves to get the predictor to work.
I based this on how it was done in the BinaryClassification version of this sample. That sample does actually commit the model into the input directory of its Predictor project. However, as mentioned above, both the test data CSV and the model are generated by the Trainer project (in both samples), and it looks like the BinaryClassification sample doesn't commit that CSV either. So even in the BinaryClassification sample, the Trainer project has to be run before the Predictor sample will work.
I'm happy to follow your guidance here. If we want to commit the model and the test dataset into the Predictor project, I can definitely make that change, but we should probably do it for both samples. I believe the longer-term goal is to have all these samples run as a single project rather than separate Trainer/Predictor projects. But again, that's a change that should be made in both CreditCardFraudDetection samples, and it may be outside the scope of this PR.
Let me know how you want to move forward.
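For reference, here's a minimal sketch of the guard the Predictor effectively needs today while those artifacts aren't committed; the paths and file names are assumptions based on this thread, not the sample's actual layout:

```csharp
using System;
using System.IO;

internal static class Program
{
    private static void Main()
    {
        // Artifacts produced by the Trainer project (hypothetical paths/names).
        const string modelPath = "assets/input/randomizedPca.zip";
        const string testDataPath = "assets/input/testData.csv";

        if (!File.Exists(modelPath) || !File.Exists(testDataPath))
        {
            // Until the model and test CSV are committed, all the Predictor can do
            // is point the user back at the Trainer project.
            Console.WriteLine("Model or test data not found. Please run the Trainer project first.");
            return;
        }

        // ...load the model and run predictions here...
    }
}
```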
@CESARDELATORRE Would love to get your thoughts here!
This particular sample has two projects. The predictor/scoring project should have the model .zip file plus the test dataset for doing multiple predictions, so it works out of the box without running the training project.
The training project, on the other hand, doesn't need the model .zip file, since it generates it after training; with the code split into two projects, it's actually better that the training project doesn't include it.
As for the dataset, we're only committing/pushing a .zip of the dataset (instead of the .csv files directly) because this particular dataset is larger than 100 MB, which GitHub doesn't allow; the .zip is a bit smaller than 100 MB.
But for the predictor/scoring project, I think we could include both the model .zip and the test dataset, so a user could just try predictions, if desired, and it would work out of the box instead of raising an error saying that you first need to run the training project.
In any case, we can merge it as it is now and change those details while reviewing it further. It's not critical. :)
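To make the out-of-the-box scenario concrete, here's a minimal sketch of what the Predictor could do once the model .zip and test dataset are committed. It assumes a recent ML.NET API surface (MLContext.Model.Load / CreatePredictionEngine); the file names, paths, and the Transaction/FraudPrediction classes are placeholders rather than the sample's actual types:

```csharp
using System;
using System.IO;
using System.IO.Compression;
using Microsoft.ML;

// Placeholder schema classes -- the real sample defines its own observation/prediction types.
public class Transaction
{
    public float Amount { get; set; }
    // ...remaining feature columns...
}

public class FraudPrediction
{
    public bool PredictedLabel { get; set; }
    public float Score { get; set; }
}

internal static class Program
{
    private static void Main()
    {
        // Assumed locations of the committed artifacts.
        const string inputDir = "assets/input";
        string modelPath = Path.Combine(inputDir, "randomizedPca.zip");
        string testDataZip = Path.Combine(inputDir, "testData.zip");
        string testDataCsv = Path.Combine(inputDir, "testData.csv");

        // If the test data is committed as a .zip (to stay under GitHub's 100 MB
        // file-size limit), extract it on first run.
        if (!File.Exists(testDataCsv) && File.Exists(testDataZip))
            ZipFile.ExtractToDirectory(testDataZip, inputDir);

        var mlContext = new MLContext(seed: 1);

        // Load the committed model and score a sample transaction -- no Trainer run needed.
        ITransformer model = mlContext.Model.Load(modelPath, out _);
        var engine = mlContext.Model.CreatePredictionEngine<Transaction, FraudPrediction>(model);

        FraudPrediction prediction = engine.Predict(new Transaction { Amount = 42.0f });
        Console.WriteLine($"Predicted fraud: {prediction.PredictedLabel} (score: {prediction.Score})");
    }
}
```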
Btw, never mind. For this case it might be better if the scoring/client app copies the model so that it picks up the latest training. We'll change the code so it only copies the model .zip and the test dataset. The scoring project doesn't need the training dataset, and the files that are currently being copied are git-ignored anyway. But this is a minor improvement for clarity.
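A minimal sketch of that copy step, assuming hypothetical Trainer output and Predictor input paths (the real sample's layout and file names may differ):

```csharp
using System.IO;

internal static class ModelCopier
{
    // Copy only the latest model .zip and the test dataset from the Trainer's
    // output into the scoring project's input folder; the training dataset is
    // not needed here, and the copied files stay git-ignored.
    public static void CopyLatestArtifacts(string trainerOutputDir, string predictorInputDir)
    {
        Directory.CreateDirectory(predictorInputDir);

        File.Copy(Path.Combine(trainerOutputDir, "randomizedPca.zip"),
                  Path.Combine(predictorInputDir, "randomizedPca.zip"),
                  overwrite: true);

        File.Copy(Path.Combine(trainerOutputDir, "testData.csv"),
                  Path.Combine(predictorInputDir, "testData.csv"),
                  overwrite: true);
    }
}
```

That keeps the scoring project free of the training dataset while still picking up whatever model the Trainer produced last.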
Thanks! 👍