New example notebook for auto-imputation aka handle missing values with a simple dataset and full workflow #722
Conversation
+ ran pre-commit checks etc
+ remove recently deprecated sym kwarg in seaborn boxplot + improved a couple of explanations + reran precommit etc checks
ReviewNB conversation:
ricardoV94 (2024-11-09): Change title to …
jonsedar (2024-11-09): Sure, that makes sense - I've updated the text to mention covariates specifically.
ReviewNB conversation:
ricardoV94 (2024-11-09): What is the reason behind emphasizing the "numeric"? This should work fine for missing categorical predictors as well?
jonsedar (2024-11-09): Interesting point, and possibly my lack of knowledge. I've used hierarchical priors …
ricardoV94 (2024-11-11): No need to extend the example, but if you have a categorical predictor you would conceivably have a …
jonsedar (2024-11-12): Aha, yes, nice idea! As you suggest, let's leave this one here, but that seems like a really useful thing to demonstrate, and would potentially be a nice companion to my other new ordinal-features example (#717). When I get a minute I'll create a new NB (based on this one) that includes categorical features, and I'll omit the out-of-sample forecast to keep it lean.
ReviewNB conversation:
ricardoV94 (2024-11-09): Line #14. warnings.simplefilter(action="ignore", category=FutureWarning)  # isort:skip - Don't? The FutureWarning filter is pretty broad.
jonsedar (2024-11-09): Haha, yeah, good point - I tend to leave it in because seaborn v12 (which I still use because they massively changed the API in >12) gets really annoying with all the warnings... But there don't seem to be any here after I remove it, so, removed :)
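As a hedged aside on the exchange above, a narrower alternative to the blanket filter would be to silence FutureWarnings only when they are raised from the offending module (seaborn is assumed to be the source here, as in the discussion); this is a sketch, not what the notebook does:

```python
import warnings

# Minimal sketch: ignore FutureWarning only when it originates from seaborn,
# so deprecation notices from other libraries still surface.
warnings.filterwarnings("ignore", category=FutureWarning, module="seaborn")
```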
ReviewNB conversation:
ricardoV94 (2024-11-09): Line #11. # set target_accept quite high to minimise divergences in mdlb - target_accept is not being changed.
jonsedar (2024-11-09): Good catch, that's a cut-and-paste from elsewhere.
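For reference, the code comment presumably intended something like the following sketch (the model name `mdlb` is taken from the notebook; the particular target_accept value is illustrative, not the notebook's):

```python
import pymc as pm

# Minimal sketch: raise target_accept above the 0.8 default so NUTS takes
# smaller steps and produces fewer divergences in mdlb.
with mdlb:
    idata_b = pm.sample(target_accept=0.9, random_seed=42)
```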
ReviewNB conversation:
ricardoV94 (2024-11-09): This plot_univariate is not that great? What would a histogram/kdeplot look like?
jonsedar (2024-11-09): I quite like violins :D I'll make it taller, then it's basically a kdeplot.
ReviewNB conversation:
ricardoV94 (2024-11-09): Why is the observed off scale? How is it reasonable?
jonsedar (2024-11-09): FWIW I wouldn't expect the prior to be too close, but I've added a clarification.
ReviewNB conversation:
ricardoV94 (2024-11-09): Confusing helper name? Also should be done before calling …
jonsedar (2024-11-09): Fair call, I've renamed it to …
ReviewNB conversation:
ricardoV94 (2024-11-09): Remove section?
jonsedar (2024-11-09): I quite like leaving the placeholder, but sure, it could be confusing - removed!
ReviewNB conversation:
ricardoV94 (2024-11-09): Line #3. mdl0.set_data("y", dfrawx_holdout[ft_y].values, coords=COORDS_F) - No need to set dummy data for posterior_predictive.
jonsedar (2024-11-09): Good point, thanks!
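To illustrate the reviewer's point, a minimal sketch of sampling the posterior predictive for a hold-out set while only swapping in new predictor data and leaving the observed y untouched. The names `mdl0`, `dfrawx_holdout`, `FTS_XJ`, and `COORDS_F` echo the notebook, but `idata0` and the pm.Data name "xj" are assumptions made for this sketch:

```python
import pymc as pm

with mdl0:
    # Only the predictors need new values; no dummy y is required because
    # sample_posterior_predictive draws y from the model given the posterior.
    pm.set_data({"xj": dfrawx_holdout[FTS_XJ].values}, coords=COORDS_F)
    ppc_holdout = pm.sample_posterior_predictive(
        idata0, predictions=True, random_seed=42
    )
```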
ReviewNB conversation:
ricardoV94 (2024-11-09): This picture needs a legend, only the green dot is explained.
jonsedar (2024-11-09): Legends get hairy with seaborn... I've added more explanation in the title!
ReviewNB conversation:
ricardoV94 (2024-11-09): There is little cost to "rebuilding" the model compared to reusing an existing model for sample_posterior_predictive, because none of the random functions are cached anyway across calls.
jonsedar (2024-11-09): Thanks for the clarification - I've changed the language to "re-specify" (only a minor concern to my mind because it's not DRY).
ReviewNB conversation:
ricardoV94 (2024-11-09): Line #10. # NOTE: ... but there's no way to put a nan-containing array into a pm.Data, - This is slightly incorrect. You can put …
jonsedar (2024-11-09): FWIW I've found that this isn't possible... e.g. if I were to put this line into the model spec (which tries to create a …) xkk = pm.Data("xkk", dfx_holdout[FTS_XK].values, dims=("oid", "xj_nm")) I get an error: …
ricardoV94 (2024-11-11): Ahh, pm.Data is trying to be helpful, …
jonsedar (2024-11-12): Ah, interesting - so is it worthwhile for me to drop down to use a pt.shared instead? I'll give that a try.
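For reference, a minimal sketch of the workaround floated above: wrapping the NaN-containing covariate array in a raw `pytensor.shared` variable instead of `pm.Data`, which skips pm.Data's extra conveniences. The frame `dfx_holdout` and column list `FTS_XK` echo the notebook; the variable name is made up here:

```python
import numpy as np
import pytensor

# A shared variable stores the raw float array, NaNs and all, and can be
# updated in place later for new hold-out data via .set_value().
xkk_shared = pytensor.shared(
    np.asarray(dfx_holdout[FTS_XK].values, dtype="float64"), name="xkk"
)
# e.g. later: xkk_shared.set_value(new_array_with_nans)
```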
ReviewNB conversation:
ricardoV94 (2024-11-09): Not sure these sorts of table outputs are useful? If so, perhaps discuss what can be gathered from them? If not, omit?
jonsedar (2024-11-09): Yeah, they can go - I was just trying to show the build-up of the dataframe, but people can introspect if they want.
ReviewNB conversation:
ricardoV94 (2024-11-09): Mention a multivariate prior for the missing x, perhaps link to Junpeng's talk? https://discourse.pymc.io/t/partial-missing-multivariate-observation-and-what-to-do-with-them-by-junpeng-lao/6050
jonsedar (2024-11-09): Aha yes, I was looking for @junpenglao's old blogposts on this technique, esp. w.r.t. the hierarchical prior.
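As a rough, hedged illustration of the multivariate-prior idea mentioned above (toy data and made-up names, not the notebook's model nor a verbatim transcription of the linked talk): the missing covariate cells can be treated as free parameters spliced back into the covariate matrix, with an MvNormal prior over each row tying observed and missing cells together so correlations inform the imputation.

```python
import numpy as np
import pymc as pm
import pytensor.tensor as pt

# Toy data: 2 covariates, a few cells missing.
x_raw = np.array([[1.0, 2.0], [np.nan, 1.5], [0.5, np.nan], [1.2, 2.2]])
mask = np.isnan(x_raw)
x_filled = np.nan_to_num(x_raw)  # zeros as placeholders where missing

with pm.Model() as mdl_mv:
    mu = pm.Normal("mu", 0.0, 1.0, shape=2)
    chol, corr, stds = pm.LKJCholeskyCov(
        "chol", n=2, eta=2.0, sd_dist=pm.Exponential.dist(1.0)
    )
    # Free parameters standing in for the missing cells; the MvNormal
    # Potential below supplies their effective prior.
    x_missing = pm.Flat("x_missing", shape=int(mask.sum()))
    # Splice the imputed values back into the full covariate matrix.
    x_full = pt.set_subtensor(pt.as_tensor(x_filled)[mask.nonzero()], x_missing)
    # Joint multivariate prior over each row of covariates.
    pm.Potential("x_prior", pm.logp(pm.MvNormal.dist(mu=mu, chol=chol), x_full).sum())
    # ... the linear predictor / likelihood would then be built from x_full.
```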
@jonsedar seems like a great addition. I left some comments above. I hope we can make the prediction parts nicer in the future, with a helper that accepts a shared mask.
Thanks for the review! I'll go through and make changes etc. now.
* + added new notebook GLM-ordinal-features.ipynb + already complete and created in another env
* Created using Colab
* minor updates for latest seaborn
* Tweaks for Colab and included authors
* added header
* + added new reference article and online + cited inside Notebook + adjusted errata
* + ran black again...
* + rep Burkner as B{\"u}rkner
* + maybe fixed readthedocs complaint pybtex.database.InvalidNameString: Too many commas in 'B{\\"u}rkner, P., & Charpentier, E.'
* + ran black-jupyter, let's see
* + another run of black-jupyter
* + installed local pymc_examples env + ran pre-commit in full + autocreated *.myst file
* + update tags
* + possibly persuaded the precommits to all pass
* + rerun on colab to confirm all good post new pre-commit process
* + okay, reran in colab again... let's see if this passes
* + added (again) the myst.md
* + minor updates post review + new work to positively include a coeff in mdlb for d450 = c4
* + reran precommit and readded myst.md
* + rerun locally e2e
* + added myst.md again
* + reran again to ensure cell execution order even for markdown cells
* + reran again again to ensure order
* + minor update: forced additional level c4 into d450 categorical feature + fixed a couple of typos
* + changed rating to intermediate
* Draft update of BNN notebook
* Pre-commit fixes
* Address reviewer comments
* Additional edits
* Removed bivariate example; updated formatting
* Updated GEV
* Changed pymc_experimental to pymc_extras

Co-authored-by: Chris Fonnesbeck <[email protected]>
ReviewNB conversation:
fonnesbeck (2024-12-15): There is a difference between missing at random (MAR) and missing completely at random (MCAR). With the latter you can do complete-case analysis and only suffer loss of power via reduction in sample size; the former requires knowledge of a model of missingness. But, given your disclaimer, you can feel free to ignore this distinction.
jonsedar (2024-12-16): Would it be more correct, then, to replace where I've written "Missing at Random" with "Missing Completely at Random"? It's a weird nuance and unhelpful terminology for sure.
jonsedar (2024-12-16): Okay, got it: https://stats.stackexchange.com/questions/23090/distinguishing-missing-at-random-mar-from-missing-completely-at-random-mcar - I'll replace with "Missing Completely at Random".
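To make the MCAR/MAR distinction above concrete, a small sketch with toy, made-up data: under MCAR the probability that a value is missing is constant, while under MAR it depends on another observed covariate.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000
a = rng.normal(size=n)  # fully observed covariate
b = rng.normal(size=n)  # covariate that will receive missing values

# MCAR: each value of b is missing with the same probability, regardless of the data.
b_mcar = b.copy()
b_mcar[rng.random(n) < 0.2] = np.nan

# MAR: the probability that b is missing depends on the *observed* covariate a.
p_missing = 1 / (1 + np.exp(-(a - 1)))  # larger a -> more likely to be missing
b_mar = b.copy()
b_mar[rng.random(n) < p_missing] = np.nan

print(f"MCAR missing fraction: {np.isnan(b_mcar).mean():.2f}")
print(f"MAR  missing fraction: {np.isnan(b_mar).mean():.2f}")
```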
ReviewNB conversation:
fonnesbeck (2024-12-15): Correlation may be easier to see (for a and c at least) if they were standardized?
jonsedar (2024-12-16): I'll set sharex=False, which will achieve the same thing in the plot :)
ReviewNB conversation:
fonnesbeck (2024-12-15): I'm getting a little confused by the cryptic (to me) variable names. For example, what is "oid" for? As I write this I assume it means "observation ID". Recommend using human-readable variable names to lower the cognitive overhead.
jonsedar (2024-12-16): I'll include a comment explanation. Typically I like to vectorise the feature names in Bx using e.g. FTS_XJ, and yep, oid is observation ID.
ReviewNB conversation:
fonnesbeck (2024-12-15): Don't undersell -- I think the marginal energy is about as good as you can expect!
jonsedar (2024-12-16): Haha, well, I say apparently reasonable but it's clearly reasonable :) I'll replace > with >>.
ReviewNB conversation:
fonnesbeck (2024-12-15): Plot is a little hard to read.
jonsedar (2024-12-16): Agreed, I'll make it taller.
ReviewNB conversation:
fonnesbeck (2024-12-15): You should not need a masked array -- just passing a numpy array with missing values will trigger imputation. Might as well skip the extra step.
jonsedar (2024-12-16): Ah, that's interesting - when did that functionality change? Just tested, it works the same, so I'll update the model spec and related markdown discussion - thanks!
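For illustration of the point above, a minimal sketch (toy data and names, not the notebook's model) of PyMC's automatic imputation: passing a NumPy array that contains NaNs as `observed` makes PyMC treat the missing entries as unobserved random variables, with no explicit masked array needed.

```python
import numpy as np
import pymc as pm

# Toy observed data with missing values.
y_obs = np.array([1.2, np.nan, 0.7, 2.1, np.nan, 1.5])

with pm.Model() as mdl_impute:
    mu = pm.Normal("mu", 0.0, 2.0)
    sigma = pm.HalfNormal("sigma", 1.0)
    # NaNs in `observed` trigger automatic imputation: the missing entries
    # become free RVs (named along the lines of y_unobserved / y_missing,
    # depending on the PyMC version) and are sampled alongside mu and sigma.
    y = pm.Normal("y", mu=mu, sigma=sigma, observed=y_obs)
    idata = pm.sample(random_seed=42)
```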
+ ran pre-commit checks etc
+ remove recently deprecated sym kwarg in seaborn boxplot + improved a couple of explanations + reran precommit etc checks
…, formatting, explanatory text + also changed notebook title and filename to be more clear + also added new reference to jungpenlao's talk + reran precommit and created new myst file
+ fixed a couple of typos
…ub.com/jonsedar/pymc-examples into new-example-glm-missing-numeric-values
Urgh, FML, I made the stupid mistake of trying to rebase this branch on top of the most recent branches... Now it's a total mess. I'll address your points above and resubmit a new PR!
…values in covariates + update NB with various minor improvements following PR review in the old branch pymc-devs#722 + additional typos etc
Closing this in favor of the newer, better, shinier PR #753.
…ymc-devs#753)
* + take a new up-to-date branch off master and dump in new NB missing values in covariates + update NB with various minor improvements following PR review in the old branch pymc-devs#722 + additional typos etc
* + ran pre-commit autoupdate
* + ran precommit on new NB, created myst file
PR to implement a response to #721
📚 Documentation preview 📚: https://pymc-examples--722.org.readthedocs.build/en/722/