POC: pass date_unit to values_for_json #54198
Conversation
xref #28180
Ah, this is an interesting take. I'm still a little lukewarm on the idea, but this is more feasible than I originally thought, especially if values_for_json is expected to take kwargs.
NPY_DATETIMEUNIT dunit = ((PyObjectEncoder *)tc)->datetimeUnit;
PyObject *date_unit;
if (dunit == NPY_FR_s) {
    date_unit = "s";
What you want to do here is create a Python object from the literal string, so PyUnicode_FromString("s")
Ignore this comment if you change L730 to PyObject_CallMethod - in that case you can just declare date_unit as a char * and call it a day.
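For illustration, here is a minimal sketch of the two options described above (variable names are hypothetical, not from the patch):

```c
/* Option 1: create a real Python str from the literal and manage its
 * refcount explicitly. date_unit_obj can then be passed anywhere a
 * PyObject * is expected. */
PyObject *date_unit_obj = PyUnicode_FromString("s");
if (date_unit_obj == NULL) {
    return;  /* allocation failed; the error is already set */
}
/* ... use date_unit_obj ... */
Py_DECREF(date_unit_obj);

/* Option 2: keep a plain C string and let the call site convert it later
 * via a format code; no refcounting needed. */
const char *date_unit = "s";
```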
PyObject *mgr = PyObject_GetAttrString(obj, "_mgr");
PyObject *name = "column_arrays";
arrays = PyObject_CallMethodOneArg(mgr, name, date_unit);
PyObject_CallMethodOneArg expects Python objects as arguments, but here you are working with a const char[] literal ("column_arrays"). What you should do instead is PyObject_CallMethod(mgr, "column_arrays", "s", date_unit), which passes date_unit as a unicode object to the column_arrays method of the mgr object.
if (!arrays) {
After you are done with mgr you will want to Py_DECREF(mgr) to avoid memory leaks.
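Taken together, the last two suggestions amount to something like the following sketch (assuming mgr has already been fetched and NULL-checked, and that date_unit is a plain char *; this is illustrative, not the patch as merged):

```c
/* Call _mgr.column_arrays(date_unit); the "s" format code converts the
 * C string into a Python str argument. */
arrays = PyObject_CallMethod(mgr, "column_arrays", "s", date_unit);
Py_DECREF(mgr);  /* done with mgr whether or not the call succeeded */
if (!arrays) {
    return;  /* the Python-level error from the failed call is already set */
}
```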
Applied suggestions and now it compiles but segfaults. Best guess is the […]
I spent a bunch of yesterday working through objToJSON.c and think I have a handle on the big picture. Among other things, I think that something like this would make a big difference for pyarrow dtypes.
@@ -713,7 +713,22 @@ void PdBlock_iterBegin(JSOBJ _obj, JSONTypeContext *tc) {
    return;
}

arrays = get_sub_attr(obj, "_mgr", "column_arrays");
NPY_DATETIMEUNIT dunit = ((PyObjectEncoder *)tc)->datetimeUnit;
PyObject *date_unit;
I think you should just declare this as a char * - by making it a PyObject you have to manage the lifecycle and DECREF, which is currently missing. It also adds unnecessary overhead.
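A sketch of the char * version (assuming the writer keeps the usual "s"/"ms"/"us"/"ns" unit names; only NPY_FR_s and the "ns" fallback appear in the quoted hunks, so the other branches are illustrative):

```c
/* Map the encoder's datetime unit to a plain C string; no refcounting needed. */
NPY_DATETIMEUNIT dunit = ((PyObjectEncoder *)tc)->datetimeUnit;
const char *date_unit;
if (dunit == NPY_FR_s) {
    date_unit = "s";
} else if (dunit == NPY_FR_ms) {
    date_unit = "ms";
} else if (dunit == NPY_FR_us) {
    date_unit = "us";
} else {
    date_unit = "ns";
}
```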
What's the big picture here? Is the idea to do the string conversions before the JSON C code? It would feel weird to only have dates/datetimes and floats (?, there is also a […]). Not formally objecting to this, just curious about the thought process.
    date_unit = "ns";
}
PyObject *mgr = PyObject_GetAttrString(obj, "_mgr");
There might be some code segments that hit this but don't have a _mgr attr. This would return NULL and could explain the segfaults. So you will want to check if the result is equal to NULL and return an appropriate error message.
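A minimal sketch of that guard, for illustration only (the real code may want to surface a more specific error):

```c
PyObject *mgr = PyObject_GetAttrString(obj, "_mgr");
if (mgr == NULL) {
    /* PyObject_GetAttrString has already set an AttributeError for objects
     * without _mgr; return instead of dereferencing NULL below. */
    return;
}
```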
will update, but we only get here with a DataFrame.
If we go down this route I'd be OK with adding float_format to values_for_json too.
I have a few motivations in mind, many of which are less motivations for this than just gripes about the current implementation.
On the flip side, casting to string in […]. So ATM I'm just looking to see if we can get this working and determine if that would allow any simplifications of the C code.
On the topic of simplifying the C code, I think one of the things that makes it hard to maintain is the sheer number of formats we offer. I've often wondered if there is value in doing that, or if we should consider deprecating some formats. I haven't come across other libraries that offer things like the […]
I wouldn't complain, have never had a use case for to_json myself.
I'm focused on the fact that ujson is effectively re-implementing the entire stdlib json module, whereas we really only care about Series/DataFrame. Most of the complexity is driven by pinning of callbacks and pass-throughs, which in turn is made necessary by the need to handle recursive structures. The only non-test place where we use ujson_dumps is with obj_to_write in io.json._json. If we ruled out e.g. dicts held inside object-dtype cases, I think we could simplify a ton.
@WillAyd this isn't working (or even compiling), just putting it here to get your thoughts on passing date_unit to values_for_json instead of handling it in the C code. The relevant questions are 1) would this allow for non-trivial simplification of the C code? and 2) perf impact?