-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
read_sql_table bug on columns with dots #10279
Comments
This is the result of sqlalchemy returning this as the key of the result set:
But, in the Table object it has the correct column name:
@zzzeek Is there a reason for this difference? |
in the former case you're looking at sqlite3's cursor.description, and I believe we have some workarounds in place for when sqlite3 does the wrong thing with UNION statements that is specifically truncating off dotted names. periods inside of table and column names are a really bad idea. see item #1 here: https://periscope.io/blog/better-sql-schema.html |
call the "sqlite_raw_colnames" execution option to turn this off:
or
|
here's that bug in case you're curious
output:
see the problem? add a UNION, the driver totally blows the cursor.description. those break SQLAlchemy queries, so we filter out the dots. Feel free to report upstream to the sqlite3/pysqlite driver. |
This seems like a questionable hack. The select union doesn't blow up, but just gives too much information which the user still can use. Whereas cutting out random parts of names loses information and should strictly speaking be documented. Moreover, this isn't even common to write the union with a superfluous |
Your defense of the sqlite driver in asserting that this is not a bug at all, just that the driver is in this case deciding to "give too much information that the user can still use", suggests that SQLAlchemy is merely recognizing that a column name of the form "x.y" is just the SQLite driver giving us "too much information", which for SQLAlchemy's purposes we necessarily truncate, because it confuses other parts of the system. If this was in fact only "extra" information that's not needed, then it should be safe to truncate. This is of course totally wrong, because this is not "extra" information; it's garbage which without context we cannot distinguish between a table-qualified column name and a standalone column name with a dot in it.
This isn't a "random" part of the name, it is explicitly the part preceding the dot which in the vast majority of cases is the table or alias name. The lack of documentation was a bug and is resolved as of yesterday: http://docs.sqlalchemy.org/en/rel_1_0/dialects/sqlite.html#dotted-column-names
It is absolutely common if your query happens to contain more than one table/alias/subquery, and further is fully necessary if there is any kind of column name overlap between those relations. Vastly more common than the "questionable" practice of putting dots in column names. The SQLAlchemy Core and ORM querying features, which account for the overwhelming vast majority of queries performed under normal use of SQLAlchemy, render all SELECT statements with the leading table name. Historically, the Core also has relied upon a predictable set of column names coming back from cursor.description in order to match result columns to the Core table constructs which they correspond towards. Calling this "too much information" is not accurate at all because it is entirely ambiguous and without context (short of parsing the SQL statement to see if it has the word "UNION" in it - now that would be a hack). We have no idea if the "x." prefix is part of the column name or is part of a table/alias name. That's not information, it's noise. Very recent SQLAlchemy releases no longer rely as heavily on the names in cursor.description and we could likely just switch this behavior off at this point, however it is very likely that real-world applications are themselves relying upon consistency in these names; Pandas itself is one such example. If SQLAlchemy did not do what it does here, then the user whose Pandas application ends up calling upon a UNION against multiple tables/aliases would report the same issue from the other end. It's a no-win situation as long as sqlite3 has this issue. Certainly the sqlite3 driver is most responsible here, as it has a lot of silly SQL-parsing issues like these which often go on for years without fixes (see http://bugs.python.org/issue21718 for another example). Pandas itself is also not without blame, as it naively constructs SQL schemas using the raw naming schemes from an entirely different domain instead of applying reasonable escaping patterns to avoid poor SQL practices such as putting dots / spaces / other things in column names.
To my knowledge, this is not even acknowledged as a bug in sqlite. I hardly bother reporting them myself because in ten years I've hardly seen the pysqlite driver ever fix any of the bugs we're reported. Bugs like these in sqlite and the driver linger on for years. I can assure you the moment a fixed is released, SQLAlchemy will immediately release an accommodation for this - on my end there's a lot of ugly crap that had to happen to make these use cases work. It is by nature that database abstraction libraries have dependencies on the specific behaviors of the drivers, so accommodating continued change in those libraries is necessary in any case; there are hundreds of critical behaviors in SQLAlchemy dialects that would classify as "dead/buggy" if the upstream driver made a change in behavior; something as simple as a new version of the psycopg2 driver suddenly deciding to parse JSON result sets into json rather than returning them as strings, for example. SQLAlchemy clearly had to parse the JSON beforehand and the psycopg2 driver clearly benefits by building in this parsing themselves and releasing it. That's just how it goes in this domain. Without this name-truncation at some level, common SQLAlchemy queries like query(X).union(query(Y)) would historically simply fail. We felt it was appropriate to keep the public-facing keys() consistent with this. Because using dots in column names is vastly less common than emitting a SELECT/UNION with table-qualified column names. |
Thanks for the explanation. |
Using the original snippet, I get the expected output
That's probably because the
So this has been fixed upstream, so I guess we can close this? Or is there a need for a test to avoid regressions? |
Yes, as also noted in http://docs.sqlalchemy.org/en/latest/dialects/sqlite.html#dotted-column-names, this seems to be fixed now. So closing this (anyhow, it was not a pandas issues, but rather sqlite/sqlalchemy interaction). Thank @kokes for noting |
read_sql_table handles columns with dots in the name incorrectly (removes name before dot)
->
The text was updated successfully, but these errors were encountered: