
BUG: GH3468 Fix assigning a new index to a duplicate index in a DataFrame would fail #3483


Closed · wants to merge 4 commits

Conversation

@jreback (Contributor) commented Apr 29, 2013

partially fixes #3468

This would previously raise (same-dtype assignment to a single-dtype frame with duplicate column indices):

In [6]: df = DataFrame([[1,2]], columns=['a','a'])

In [7]: df.columns = ['a','a.1']

In [8]: df
Out[8]: 
   a  a.1
0  1    2
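The `'a'` → `'a'`, `'a.1'` suffixing shown above can be sketched in plain Python. This is an illustration only; the function name below is hypothetical, not a pandas API:

```python
def mangle_dupe_names(names):
    """Append '.1', '.2', ... to repeated names, mimicking the
    'a' -> 'a', 'a.1' de-duplication shown above."""
    counts = {}
    out = []
    for name in names:
        n = counts.get(name, 0)
        out.append(name if n == 0 else f"{name}.{n}")
        counts[name] = n + 1
    return out

print(mangle_dupe_names(["a", "a", "b", "a"]))  # ['a', 'a.1', 'b', 'a.2']
```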

Construction of a multi-dtype frame with a duplicate column index (#2194) is fixed:

In [1]: DataFrame([[1,2,1.,2.,3.,'foo','bar']], columns=list('aaaaaaa'))
Out[1]: 
   a  a  a  a  a    a    a
0  1  2  3  1  2  foo  bar

This also would previously raise:

In [2]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:        df_float  = DataFrame(np.random.randn(10, 3),dtype='float64')
:        df_int    = DataFrame(np.random.randn(10, 3),dtype='int64')
:        df_bool   = DataFrame(True,index=df_float.index,columns=df_float.columns)
:        df_object = DataFrame('foo',index=df_float.index,columns=df_float.columns)
:        df_dt     = DataFrame(Timestamp('20010101'),index=df_float.index,columns=df_float.columns)
:        df        = pd.concat([ df_float, df_int, df_bool, df_object, df_dt ], axis=1)
:--

In [3]: df
Out[3]: 
      0     1     2                   0                   1  \
0  True  True  True 2001-01-01 00:00:00 2001-01-01 00:00:00   
1  True  True  True 2001-01-01 00:00:00 2001-01-01 00:00:00   
2  True  True  True 2001-01-01 00:00:00 2001-01-01 00:00:00   
3  True  True  True 2001-01-01 00:00:00 2001-01-01 00:00:00   
4  True  True  True 2001-01-01 00:00:00 2001-01-01 00:00:00   
5  True  True  True 2001-01-01 00:00:00 2001-01-01 00:00:00   
6  True  True  True 2001-01-01 00:00:00 2001-01-01 00:00:00   
7  True  True  True 2001-01-01 00:00:00 2001-01-01 00:00:00   
8  True  True  True 2001-01-01 00:00:00 2001-01-01 00:00:00   
9  True  True  True 2001-01-01 00:00:00 2001-01-01 00:00:00   

                    2    0    1    2  0  1  2         0         1         2  
0 2001-01-01 00:00:00  foo  foo  foo  0  0  0  0.431857 -0.131747 -1.039563  
1 2001-01-01 00:00:00  foo  foo  foo  0  0 -1  0.516910 -0.683163  0.736468  
2 2001-01-01 00:00:00  foo  foo  foo  0  0  0 -0.147417 -0.305452  0.006213  
3 2001-01-01 00:00:00  foo  foo  foo  0 -1  0  1.443031  0.082710 -0.335054  
4 2001-01-01 00:00:00  foo  foo  foo  1  0  0 -1.349293  0.645316  0.305524  
5 2001-01-01 00:00:00  foo  foo  foo -1  0  0  0.571095  0.756571 -0.773880  
6 2001-01-01 00:00:00  foo  foo  foo  0  0  0 -0.285091  1.196018  0.882786  
7 2001-01-01 00:00:00  foo  foo  foo  2  0  0  0.003610  0.549072 -0.823217  
8 2001-01-01 00:00:00  foo  foo  foo -1  1  0 -0.348279 -0.728958 -0.397435  
9 2001-01-01 00:00:00  foo  foo  foo -1  0  0  0.363489  2.154132  0.494673  

For those of you interested, here is the new ref_loc indexer for duplicate columns.
It is by necessity a block-oriented indexer: it returns the column map (by column number) to a tuple of the block and the index within that block. It is only created when needed (e.g. when fetching a column via iget and the index is non-unique), and the results are cached. This is #3092.

In [1]: df = pd.DataFrame(np.random.randn(8,4),columns=['a']*4)

In [2]: df._data.blocks
Out[2]: [FloatBlock: [a, a, a, a], 4 x 8, dtype float64]

In [3]: df._data.blocks[0]._ref_locs

In [4]: df._data.set_ref_locs()
Out[4]: 
array([(FloatBlock: [a, a, a, a], 4 x 8, dtype float64, 0),
       (FloatBlock: [a, a, a, a], 4 x 8, dtype float64, 1),
       (FloatBlock: [a, a, a, a], 4 x 8, dtype float64, 2),
       (FloatBlock: [a, a, a, a], 4 x 8, dtype float64, 3)], dtype=object)
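As an illustration only (not the actual pandas internals), a map like the one above — global column number to (block, position within the block) — can be built by inverting each block's list of global column positions. The block names and helper below are hypothetical:

```python
# Each "block" records which global column positions it holds.
blocks = [
    ("FloatBlock", [0, 2]),  # holds columns 0 and 2
    ("IntBlock",   [1, 3]),  # holds columns 1 and 3
]

def build_ref_locs(blocks, ncols):
    # Invert: global column number -> (block name, index within block).
    # This works even when column labels are duplicated, because it
    # keys on position, not on label.
    ref_locs = [None] * ncols
    for name, cols in blocks:
        for i, col in enumerate(cols):
            ref_locs[col] = (name, i)
    return ref_locs

print(build_ref_locs(blocks, 4))
# [('FloatBlock', 0), ('IntBlock', 0), ('FloatBlock', 1), ('IntBlock', 1)]
```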

Fixed the #2786 / #3230 bug that caused applymap not to work with duplicate columns (we temporarily worked around it by raising a ValueError; that check is now removed):

In [3]: df = pd.DataFrame(np.random.random((3,4)))

In [4]: cols = pd.Index(['a','a','a','a'])

In [5]: df.columns = cols

In [6]: df.applymap(str)
Out[6]: 
                a                a               a               a
0  0.494204195164   0.534601503195  0.471870025143  0.880092879641
1  0.860369768954  0.0472931994392  0.775532754792  0.822046777859
2  0.478775855962   0.623584943227  0.932012693593  0.739502590395
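The reason applymap need not care about duplicate labels is that it is purely elementwise and never looks a column up by name. A toy sketch in plain Python (not the pandas implementation):

```python
def elementwise_map(rows, func):
    # Apply func to every element; column labels are never consulted,
    # so duplicate labels cannot cause a failure here.
    return [[func(v) for v in row] for row in rows]

print(elementwise_map([[1, 2], [3, 4]], str))  # [['1', '2'], ['3', '4']]
```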

@jreback commented Apr 30, 2013

@y-p when you have a chance... I think I figured out the duplicate-columns-across-dtypes issue. I non-trivially create the indexer, and it seems to fix this problem (I added a few tests). Pretty sure it also fixes the whole to_csv mess that we had, so some of your #3458 can revert. Also, I seem to have lost the issue that was created at the very end of the to_csv work last month, when we punted on fixing dup columns (it failed fast in certain cases). The other dup-columns issues should also now be reviewed.

@jreback commented Apr 30, 2013

@y-p thanks, that looks right... still some issues remaining, but making progress.

@ghost commented Apr 30, 2013

Finger slipped. related #2194, #3230, and I think you meant #3092.

@jreback commented Apr 30, 2013

@y-p what class of issues/PRs are we doing for 0.11.1? Mostly just bug fixes, or minor features too? Sort of like a mini-release, just shorter in time?

@ghost commented Apr 30, 2013

bug fixes and small-medium enhancements that are back-compatible. no sudden moves.
good transition period to close some of the niggling viz bugs that have been piling up,
which get no love during the major release cycles.

@jreback commented Apr 30, 2013

k...great....hopefully will have most of the dup columns issues resolved very shortly

@ghost commented Apr 30, 2013

I'll add the config option to control dupe col name mangling tomorrow, will decide on a default before the release.

@jreback commented Apr 30, 2013

@y-p I think I got all the dup issues; the only one remaining to fix is to_csv. This will work (I pseudo-tested it), but I need to remove the exception that is raised when the columns are duped. You can do the same type of indexing on a dup column as on a non-dup column (I think that was the issue). Still can't figure out which issue it was though?

@ghost commented Apr 30, 2013

hmmm. #3095?

@jreback commented Apr 30, 2013

that looks right....

@jreback commented Apr 30, 2013

that is closed, don't we have an open one? (or did we just shove the exception in that one and punt?)

@ghost commented Apr 30, 2013

Yeah, I think we did, the block positional indexer issue was all I remember.

@jreback commented Apr 30, 2013

@wesm any thoughts before we merge this?

@jreback commented Apr 30, 2013

I added #3495 to cover the to_csv stuff

@jreback commented May 2, 2013

closed in favor of #3509, which is cleaner and has to_csv support

Successfully merging this pull request may close these issues.

Pandas inconsistently handles identically named columns in csv export and merging