New example notebook for auto-imputation aka handle missing values with a simple dataset and full workflow #722


Closed

Conversation

Contributor

@jonsedar jonsedar commented Nov 9, 2024

PR to implement a response to #721


📚 Documentation preview 📚: https://pymc-examples--722.org.readthedocs.build/en/722/


+ remove recently deprecated sym kwarg in seaborn boxplot
+ improved a couple of explanations
+ reran precommit etc checks
review-notebook-app bot commented Nov 9, 2024

ricardoV94 commented on 2024-11-09T13:05:45Z
----------------------------------------------------------------

Change the title to mention missing covariates, and emphasize that unlike other examples (namely the coal mining disaster) in this case we're missing covariates in addition to y. That should make it more discoverable and clarify the focus?


jonsedar commented on 2024-11-09T13:30:53Z
----------------------------------------------------------------

Sure, that makes sense - I've updated the text to mention covariates specifically

review-notebook-app bot commented Nov 9, 2024

ricardoV94 commented on 2024-11-09T13:05:46Z
----------------------------------------------------------------

What is the reason behind emphasizing the "numeric"? This should work fine for missing categorical predictors as well?


jonsedar commented on 2024-11-09T13:37:06Z
----------------------------------------------------------------

Interesting point, and possibly my lack of knowledge. I've used hierarchical priors xk_mu on xk_unobserved and assumed that xk is a continuous (and z-scored) value. How would you suggest transforming this to allow a categorical index? I'm totally open to extending the example to categoricals!

ricardoV94 commented on 2024-11-11T14:31:54Z
----------------------------------------------------------------

No need to extend the example, but if you have a categorical predictor you would conceivably have a pm.Categorical prior on it, and have it partially observed just the same as you did with pm.Normal

jonsedar commented on 2024-11-12T05:38:13Z
----------------------------------------------------------------

Aha, yes nice idea! As you suggest, let's leave this one here, but that seems like a really useful thing to demonstrate, and would be potentially a nice companion to my other new ordinal-features example (#717).

When I get a minute I'll create a new NB (based on this one) that includes Categorical features, and I'll omit the out-of-sample forecast to keep it lean

review-notebook-app bot commented Nov 9, 2024

ricardoV94 commented on 2024-11-09T13:05:47Z
----------------------------------------------------------------

Line #14.    warnings.simplefilter(action="ignore", category=FutureWarning)  # isort:skip

Don't? FutureWarning filter is pretty broad


jonsedar commented on 2024-11-09T14:00:23Z
----------------------------------------------------------------

Haha yeah, good point - I tend to leave it in because seaborn v0.12 (which I still use because they massively changed the API in later versions) gets really annoying with all the warnings... But there doesn't seem to be any here after I remove it, so, removed :)
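If the seaborn noise does come back, a narrower filter than the blanket `FutureWarning` ignore is possible. A sketch (the `module` regex is an assumption about where the warnings originate): scoping the ignore to seaborn's modules lets FutureWarnings from pymc/arviz still surface.

```python
import warnings

# Ignore FutureWarnings raised from within seaborn only; warnings
# raised from any other module are unaffected
warnings.filterwarnings("ignore", category=FutureWarning, module="seaborn")
```

This is usually preferable to `warnings.simplefilter(action="ignore", category=FutureWarning)`, which silences every library at once.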

review-notebook-app bot commented Nov 9, 2024

ricardoV94 commented on 2024-11-09T13:05:48Z
----------------------------------------------------------------

Line #11.    # set target_accept quite high to minimise divergences in mdlb

target_accept is not being changed


jonsedar commented on 2024-11-09T13:38:22Z
----------------------------------------------------------------

Good catch, that's a copy-paste from elsewhere

review-notebook-app bot commented Nov 9, 2024

ricardoV94 commented on 2024-11-09T13:05:48Z
----------------------------------------------------------------

This plot_univariate is not that great? How would a histogram/kdeplot look?


jonsedar commented on 2024-11-09T13:39:35Z
----------------------------------------------------------------

I quite like violins :D I'll make it taller, then it's basically a kdeplot

review-notebook-app bot commented Nov 9, 2024

ricardoV94 commented on 2024-11-09T13:05:49Z
----------------------------------------------------------------

Why is the observed off scale? How is it reasonable?


jonsedar commented on 2024-11-09T13:45:18Z
----------------------------------------------------------------

FWIW I wouldn't expect the prior to be too close, but I've added a clarification

review-notebook-app bot commented Nov 9, 2024

ricardoV94 commented on 2024-11-09T13:05:50Z
----------------------------------------------------------------

Confusing helper name? plot_distribution?

Also should be done before calling pm.sample?


jonsedar commented on 2024-11-09T13:48:20Z
----------------------------------------------------------------

Fair call, I've renamed it to plot_krushke, since that's what it's doing :) FWIW arviz calls it plot_posterior

review-notebook-app bot commented Nov 9, 2024

ricardoV94 commented on 2024-11-09T13:05:50Z
----------------------------------------------------------------

Remove section?


jonsedar commented on 2024-11-09T13:48:54Z
----------------------------------------------------------------

I quite like leaving the placeholder, but sure, it could be confusing - removed!

review-notebook-app bot commented Nov 9, 2024

ricardoV94 commented on 2024-11-09T13:05:51Z
----------------------------------------------------------------

Line #3.    mdl0.set_data("y", dfrawx_holdout[ft_y].values, coords=COORDS_F)

No need to set dummy data for posterior_predictive


jonsedar commented on 2024-11-09T13:49:57Z
----------------------------------------------------------------

Good point, thanks!

review-notebook-app bot commented Nov 9, 2024

ricardoV94 commented on 2024-11-09T13:05:52Z
----------------------------------------------------------------

This picture needs a legend, only the green dot is explained


jonsedar commented on 2024-11-09T13:52:13Z
----------------------------------------------------------------

Legends get hairy with seaborn... I've added more explanation in the title!

review-notebook-app bot commented Nov 9, 2024

ricardoV94 commented on 2024-11-09T13:05:52Z
----------------------------------------------------------------

There is little cost to "rebuilding" the model compared to reusing an existing model for sample_posterior_predictive, because none of the random functions are cached anyway across calls


jonsedar commented on 2024-11-09T14:37:42Z
----------------------------------------------------------------

Thanks for the clarification - I've changed the language to "re-specify" (only a minor concern to my mind because it's not DRY)

review-notebook-app bot commented Nov 9, 2024

ricardoV94 commented on 2024-11-09T13:05:53Z
----------------------------------------------------------------

Line #10.        # NOTE: ... but there's no way to put a nan-containing array into a pm.Data,

This is slightly incorrect. You can put nan but they won't trigger the special behavior that passing a numpy with nan to observed does. And you'll have nan in the logp. Not sure it matters to be precise


jonsedar commented on 2024-11-09T14:07:39Z
----------------------------------------------------------------

FWIW I've found that this isn't possible... e.g. if I were to put this line into the model spec (which tries to create a pm.Data with an array that contains nans):

xkk = pm.Data("xkk", dfx_holdout[FTS_XK].values, dims=("oid", "xj_nm"))  

I get error:

NotImplementedError: Masked arrays or arrays with nan entries are not supported. Pass them directly to observed if you want to trigger auto-imputation

ricardoV94 commented on 2024-11-11T14:33:30Z
----------------------------------------------------------------

Ahh, pm.Data is trying to be helpful; pytensor.shared definitely doesn't mind numpy arrays with nan

jonsedar commented on 2024-11-12T04:48:05Z
----------------------------------------------------------------

Ah, interesting - so is it worthwhile for me to drop down to use a pt.shared instead? I'll give that a try

review-notebook-app bot commented Nov 9, 2024

ricardoV94 commented on 2024-11-09T13:05:54Z
----------------------------------------------------------------

Not sure these sorts of table outputs are useful? If so, perhaps discuss what can be gathered from them? If not, omit?


jonsedar commented on 2024-11-09T14:08:59Z
----------------------------------------------------------------

Yeah they can go - I was just trying to show the build-up of the dataframe, but people can introspect if they want

review-notebook-app bot commented Nov 9, 2024

ricardoV94 commented on 2024-11-09T13:05:54Z
----------------------------------------------------------------

Mention multivariate prior for missing x, perhaps link to Junpeng talk? https://discourse.pymc.io/t/partial-missing-multivariate-observation-and-what-to-do-with-them-by-junpeng-lao/6050


jonsedar commented on 2024-11-09T14:20:21Z
----------------------------------------------------------------

Aha yes, I was looking for @junpenglao's old blogposts on this technique, esp. w.r.t. the hierarchical prior xk_mu. I'll make a reference!

@ricardoV94
Member

@jonsedar seems like a great addition. I left some comments above. I hope we can make the prediction parts nicer in the future, with a helper that accepts a shared mask.

@jonsedar
Contributor Author

jonsedar commented Nov 9, 2024

Thanks for the review! I'll go through and make changes etc now

@jonsedar jonsedar marked this pull request as draft November 9, 2024 13:22

jonsedar and others added 2 commits December 14, 2024 13:54
* + added new notebook GLM-ordinal-features.ipynb
+ already complete and created in another env

* Created using Colab

* minor updates for latest seaborn

* Tweaks for collab and included authors

* added header

* + added new reference article and online
+ cited inside Notebook
+ adjusted errata

* + ran black again...

* + rep Burkner as B{\"u}rkner

* + maybe fixed readthedocs complaint pybtex.database.InvalidNameString: Too many commas in 'B{\\"u}rkner, P., & Charpentier, E.'

* + ran black-jupyter, let's see

* + another run of black-jupyter

* + installed local pymc_examples env
+ ran pre-commit in full
+ autocreated *.myst file

* + update tags

* + possibly persuaded the precommits to all pass

* + rerun on colab to confirm all good post new pre-commit process

* + okay, reran in colab again... lets see if this passes

* + added (again) the myst.md

* + minor updates post review
+ new work to positively include a coeff in mdlb for d450 = c4

* + reran precommit and readded myst.md

* + rerun localyl e2e

* + added myst.md again

* + reran again to ensure cell execution order even for markdown cells

* + reran again again to ensure order

* + minor update: forced addtional level c4 into d450 categorical feature
+ fixed a couple of typos

* + changed rating to intermediate
* Draft update of BNN notebook

* Pre-commit fixes

* Address reviewer comments

* Additional edits

* Removed bivariate example; updated formatting

* Updated GEV

* Changed pymc_experimental to pymc_extras

---------

Co-authored-by: Chris Fonnesbeck <[email protected]>
review-notebook-app bot commented Dec 15, 2024

fonnesbeck commented on 2024-12-15T15:34:08Z
----------------------------------------------------------------

There is a difference between missing at random (MAR) and missing completely at random (MCAR). With the latter you can do complete case analysis and only suffer loss of power via reduction in sample size; the former requires knowledge of a model of missingness.

But, given your disclaimer, you can feel free to ignore this distinction.


jonsedar commented on 2024-12-16T06:35:48Z
----------------------------------------------------------------

Would it be more correct, then, to replace where I've written "Missing at Random" with "Missing Completely at Random"? It's a weird nuance and unhelpful terminology for sure

jonsedar commented on 2024-12-16T06:57:05Z
----------------------------------------------------------------

Okay, got it. https://stats.stackexchange.com/questions/23090/distinguishing-missing-at-random-mar-from-missing-completely-at-random-mcar

I'll replace with "Missing Completely at Random"

review-notebook-app bot commented Dec 15, 2024

fonnesbeck commented on 2024-12-15T15:34:09Z
----------------------------------------------------------------

Correlation may be easier to see (for a and c at least) if they were standardized?


jonsedar commented on 2024-12-16T06:38:02Z
----------------------------------------------------------------

I'll set sharex=False, which will achieve the same thing in the plot :)

review-notebook-app bot commented Dec 15, 2024

fonnesbeck commented on 2024-12-15T15:34:10Z
----------------------------------------------------------------

I'm getting a little confused by the cryptic (to me) variable names. For example, what is "oid" for? As I write this I assume it means "observation ID". Recommend using human-readable variable names to lower the cognitive overhead.


jonsedar commented on 2024-12-16T06:41:32Z
----------------------------------------------------------------

I'll include a comment explanation. Typically I like to vectorise the feature names in Bx using e.g. FTS_XJ, and yep oid is observation ID

review-notebook-app bot commented Dec 15, 2024

fonnesbeck commented on 2024-12-15T15:34:10Z
----------------------------------------------------------------

Don't undersell -- I think the marginal energy is about as good as you can expect!


jonsedar commented on 2024-12-16T06:43:50Z
----------------------------------------------------------------

Haha well, I say apparently reasonable but it's clearly reasonable :)

I'll replace > with >>

review-notebook-app bot commented Dec 15, 2024

fonnesbeck commented on 2024-12-15T15:34:11Z
----------------------------------------------------------------

Plot is a little hard to read.


jonsedar commented on 2024-12-16T06:51:18Z
----------------------------------------------------------------

Agreed, I'll make it taller

review-notebook-app bot commented Dec 15, 2024

fonnesbeck commented on 2024-12-15T15:34:12Z
----------------------------------------------------------------

You should not need a masked array -- just passing a numpy array with missing values will trigger imputation. Might as well skip the extra step.


jonsedar commented on 2024-12-16T07:26:40Z
----------------------------------------------------------------

Ah that's interesting - when did that functionality change?

Just tested working the same, so I'll update the modelspec and related markdown discussion - thanks!

…, formatting, explanatory text

+ also changed notebook title and filename to be more clear
+ also added new reference to junpenglao's talk
+ reran precommit and created new myst file
+ fixed a couple of typos
@jonsedar
Contributor Author

Urgh FML, I made the stupid mistake of trying to rebase this branch on top of the most recent branches... Now it's a total mess.

I'll address your points above and resubmit a new PR!


jonsedar added a commit to jonsedar/pymc-examples that referenced this pull request Dec 16, 2024
…values in covariates

+ update NB with various minor improvements following PR review in the old branch pymc-devs#722
+ additional typos etc
@jonsedar
Contributor Author

Closing this in favor of the newer, better, shinier PR #753

@jonsedar jonsedar closed this Dec 16, 2024
fonnesbeck pushed a commit that referenced this pull request Dec 16, 2024
…753)

* + take a new up-to-date branch off master and dump in new NB missing values in covariates
+ update NB with various minor improvements following PR review in the old branch #722
+ additional typos etc

* + ran pre-commit autoupdate

* + ran precommit on new NB, created myst file