Skip to content

POC: pass date_unit to values_for_json #54198

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

jbrockmendel
Copy link
Member

@WillAyd this isn't working (or even compiling), just putting it here to get your thoughts on passing date_unit to values_for_json instead of handling it in the C code. The relevant questions are 1) would this allow for non-trivial simplification of the C code? and 2) perf impact?

@jbrockmendel jbrockmendel requested a review from WillAyd as a code owner July 19, 2023 20:48
@jbrockmendel
Copy link
Member Author

xref #28180

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah this is an interesting take. I'm still a little lukewarm on the idea but this is more feasible than I originally thought, especially if values_for_json is expected to take kwargs

NPY_DATETIMEUNIT dunit = (PyObjectEncoder *)tc->datetimeUnit;
PyObject *date_unit;
if (dunit == NPY_FR_s) {
date_unit = "s";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you want to do here is create a Python object from the literal string, so PyUnicode_FromString("s")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignore this comment if you change L730 to PyObject_CallMethod - in that case you can just declare date_unit as a char * and call it a day


PyObject *mgr = PyObject_GetAttrString(obj, "_mgr");
PyObject *name = "column_arrays";
arrays = PyObject_CallMethodOneArg(mgr, name, date_unit);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PyObject_CallMethodOneArg expects Python objects as arguments, but you are working with a const char[] with the literal "column_arrays"

What you should do instead is PyObject_CallMethod(mgr, "column_arrays", "%s", date_unit) to pass the date_unit as a unicode object to the column_arrays method of the mgr object

PyObject *mgr = PyObject_GetAttrString(obj, "_mgr");
PyObject *name = "column_arrays";
arrays = PyObject_CallMethodOneArg(mgr, name, date_unit);

if (!arrays) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After you are done with mgr you will want to Py_DECREF(mgr) to avoid memory leaks

@jbrockmendel
Copy link
Member Author

Applied suggestions and now it compiles but segfaults. Best guess is the NPY_DATETIMEUNIT dunit = ((PyObjectEncoder *)tc)->datetimeUnit; bit.

Ah this is an interesting take. I'm still a little lukewarm on the idea but this is more feasible than I originally thought, especially if values_for_json is expected to take kwargs

I spent a bunch of yesterday working through objToJSON.c and think I have a handle on the big picture. Among other things, I think that something like this would make a big difference for pyarrow dtypes.

@@ -713,7 +713,22 @@ void PdBlock_iterBegin(JSOBJ _obj, JSONTypeContext *tc) {
return;
}

arrays = get_sub_attr(obj, "_mgr", "column_arrays");
NPY_DATETIMEUNIT dunit = ((PyObjectEncoder *)tc)->datetimeUnit;
PyObject *date_unit;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should just declare this as a char * - by making it a PyObject you have to manage the lifecycle and DECREF, which is currently missing. It also adds unnecessary overhead

@lithomas1
Copy link
Member

What's the big picture here? Is the idea to do the string conversions before the JSON C code.

It would feel weird to only have dates/datetimes and floats(?, there is also a float_format argument) casted to string.

Not formally objecting to this, just curious about the thought process.

date_unit = "ns";
}

PyObject *mgr = PyObject_GetAttrString(obj, "_mgr");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be some code segments that hit this but don't have a _mgr attr. This would return NULL and could explain the segfaults

So you will want to check if the result is equal to NULL and return an appropriate error message

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will update, but we only get here with a DataFrame.

@jbrockmendel
Copy link
Member Author

It would feel weird to only have dates/datetimes and floats(?, there is also a float_format argument) casted to string.

If we go down this route I'd be OK with adding float_format to values_for_json too.

What's the big picture here? Is the idea to do the string conversions before the JSON C code.

I have a few motivations in mind, many of which are less motivations for this than just gripes about the current implementation.

  1. This code is very hard to maintain. I consider it a win to get logic out of objToJSON.c.
  2. ATM EAs cannot specify to_json behavior (xref to_json of Series with period dtype results in AttributeError #31917)
  3. I want to get REF: move values_for_json to EA #53680 in bc that logic doesn't belong in Block; am reticent about pushing harder on that bc of lack of tests, which in turn I'm not ready to write because of a lack of clarity in the desired API.
  4. e.g. pyarrow arrays we currently cast to object, which would be nice to avoid.

On the flip side, casting to string in values_for_json also means creating an object-dtype ndarray, which might not be an improvement.

So ATM I'm just looking to see if we can get this working and determine if that would allow any simplifications of the C code.

@WillAyd
Copy link
Member

WillAyd commented Jul 28, 2023

On the topic of simplifying the C code, I think one of the things that makes it hard to maintain is the sheer amount of formats we offer. I've often wondered if there is value in doing that, or if we should consider deprecating some formats. I haven't come across other libraries that offer things like the split format. values is very lossy and in a weird spot (if you are trying to save space, JSON just isn't a good format to begin with)

@jbrockmendel
Copy link
Member Author

or if we should consider deprecating some formats

I wouldn't complain, have never had a use case for to_json myself.

On the topic of simplifying the C code, I think one of the things that makes it hard to maintain is the sheer amount of formats we offer

I'm focused on the fact that ujson is effectively re-implementing the entire stdlib json module, whereas we really only care about Series/DataFrame. Most of the complexity is driven by pinning of callbacks and pass-throughs, which in turn is made necessary by the need to handle recursive structures. The only non-test place where we use ujson_dumps is with obj_to_write in io.json._json. If we ruled out e.g. dicts held inside object-dtype cases, I think we could simplify a ton.

@gfyoung gfyoung added the IO JSON read_json, to_json, json_normalize label Aug 17, 2023
@jbrockmendel jbrockmendel deleted the ref-values_for_json-2 branch August 22, 2023 15:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO JSON read_json, to_json, json_normalize
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants