
BUG: GH3468 Fix assigning a new index to a duplicate index in a DataFrame would fail #3483


Closed · wants to merge 4 commits

Conversation

@jreback (Contributor) commented Apr 29, 2013

partially fixes #3468

This would previously raise (same-dtype assignment to a single-dtype frame with duplicate column indices):

In [6]: df = DataFrame([[1,2]], columns=['a','a'])

In [7]: df.columns = ['a','a.1']

In [8]: df
Out[8]: 
   a  a.1
0  1    2
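The `'a'` → `'a'`, `'a.1'` suffixing shown above can be sketched in plain Python. This is an illustration only; the function name below is hypothetical, not a pandas API:

```python
def mangle_dupe_names(names):
    """Append '.1', '.2', ... to repeated names, mimicking the
    'a' -> 'a', 'a.1' de-duplication shown above."""
    counts = {}
    out = []
    for name in names:
        n = counts.get(name, 0)
        out.append(name if n == 0 else f"{name}.{n}")
        counts[name] = n + 1
    return out

print(mangle_dupe_names(["a", "a", "b", "a"]))  # ['a', 'a.1', 'b', 'a.2']
```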

Construction of a multi-dtype frame with a duplicate column index (#2194) is fixed:

In [1]: DataFrame([[1,2,1.,2.,3.,'foo','bar']], columns=list('aaaaaaa'))
Out[1]: 
   a  a  a  a  a    a    a
0  1  2  3  1  2  foo  bar

This also would previously raise:

In [2]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:        df_float  = DataFrame(np.random.randn(10, 3),dtype='float64')
:        df_int    = DataFrame(np.random.randn(10, 3),dtype='int64')
:        df_bool   = DataFrame(True,index=df_float.index,columns=df_float.columns)
:        df_object = DataFrame('foo',index=df_float.index,columns=df_float.columns)
:        df_dt     = DataFrame(Timestamp('20010101'),index=df_float.index,columns=df_float.columns)
:        df        = pd.concat([ df_float, df_int, df_bool, df_object, df_dt ], axis=1)
:--

In [3]: df
Out[3]: 
      0     1     2                   0                   1  \
0  True  True  True 2001-01-01 00:00:00 2001-01-01 00:00:00   
1  True  True  True 2001-01-01 00:00:00 2001-01-01 00:00:00   
2  True  True  True 2001-01-01 00:00:00 2001-01-01 00:00:00   
3  True  True  True 2001-01-01 00:00:00 2001-01-01 00:00:00   
4  True  True  True 2001-01-01 00:00:00 2001-01-01 00:00:00   
5  True  True  True 2001-01-01 00:00:00 2001-01-01 00:00:00   
6  True  True  True 2001-01-01 00:00:00 2001-01-01 00:00:00   
7  True  True  True 2001-01-01 00:00:00 2001-01-01 00:00:00   
8  True  True  True 2001-01-01 00:00:00 2001-01-01 00:00:00   
9  True  True  True 2001-01-01 00:00:00 2001-01-01 00:00:00   

                    2    0    1    2  0  1  2         0         1         2  
0 2001-01-01 00:00:00  foo  foo  foo  0  0  0  0.431857 -0.131747 -1.039563  
1 2001-01-01 00:00:00  foo  foo  foo  0  0 -1  0.516910 -0.683163  0.736468  
2 2001-01-01 00:00:00  foo  foo  foo  0  0  0 -0.147417 -0.305452  0.006213  
3 2001-01-01 00:00:00  foo  foo  foo  0 -1  0  1.443031  0.082710 -0.335054  
4 2001-01-01 00:00:00  foo  foo  foo  1  0  0 -1.349293  0.645316  0.305524  
5 2001-01-01 00:00:00  foo  foo  foo -1  0  0  0.571095  0.756571 -0.773880  
6 2001-01-01 00:00:00  foo  foo  foo  0  0  0 -0.285091  1.196018  0.882786  
7 2001-01-01 00:00:00  foo  foo  foo  2  0  0  0.003610  0.549072 -0.823217  
8 2001-01-01 00:00:00  foo  foo  foo -1  1  0 -0.348279 -0.728958 -0.397435  
9 2001-01-01 00:00:00  foo  foo  foo -1  0  0  0.363489  2.154132  0.494673  

For those of you interested, here is the new ref_loc indexer for duplicate columns.
It is by necessity a block-oriented indexer: it returns the column map (by column number) to a tuple of the block and the index within that block. It is only created when needed (e.g. when fetching a column via iget and the index is non-unique), and the results are cached. This is #3092.

In [1]: df = pd.DataFrame(np.random.randn(8,4),columns=['a']*4)

In [2]: df._data.blocks
Out[2]: [FloatBlock: [a, a, a, a], 4 x 8, dtype float64]

In [3]: df._data.blocks[0]._ref_locs

In [4]: df._data.set_ref_locs()
Out[4]: 
array([(FloatBlock: [a, a, a, a], 4 x 8, dtype float64, 0),
       (FloatBlock: [a, a, a, a], 4 x 8, dtype float64, 1),
       (FloatBlock: [a, a, a, a], 4 x 8, dtype float64, 2),
       (FloatBlock: [a, a, a, a], 4 x 8, dtype float64, 3)], dtype=object)
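As an illustration only (not the actual pandas internals), a map like the one above — global column number to (block, position within the block) — can be built by inverting each block's list of global column positions. The block names and helper below are hypothetical:

```python
# Each "block" records which global column positions it holds.
blocks = [
    ("FloatBlock", [0, 2]),  # holds columns 0 and 2
    ("IntBlock",   [1, 3]),  # holds columns 1 and 3
]

def build_ref_locs(blocks, ncols):
    # Invert: global column number -> (block name, index within block).
    # This works even when column labels are duplicated, because it
    # keys on position, not on label.
    ref_locs = [None] * ncols
    for name, cols in blocks:
        for i, col in enumerate(cols):
            ref_locs[col] = (name, i)
    return ref_locs

print(build_ref_locs(blocks, 4))
# [('FloatBlock', 0), ('IntBlock', 0), ('FloatBlock', 1), ('IntBlock', 1)]
```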

Fixed the #2786 / #3230 bug that caused applymap not to work with duplicate columns (we temporarily worked around it by raising a ValueError; that check is now removed):

In [3]: df = pd.DataFrame(np.random.random((3,4)))

In [4]: cols = pd.Index(['a','a','a','a'])

In [5]: df.columns = cols

In [6]: df.applymap(str)
Out[6]: 
                a                a               a               a
0  0.494204195164   0.534601503195  0.471870025143  0.880092879641
1  0.860369768954  0.0472931994392  0.775532754792  0.822046777859
2  0.478775855962   0.623584943227  0.932012693593  0.739502590395
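The reason applymap need not care about duplicate labels is that it is purely elementwise and never looks a column up by name. A toy sketch in plain Python (not the pandas implementation):

```python
def elementwise_map(rows, func):
    # Apply func to every element; column labels are never consulted,
    # so duplicate labels cannot cause a failure here.
    return [[func(v) for v in row] for row in rows]

print(elementwise_map([[1, 2], [3, 4]], str))  # [['1', '2'], ['3', '4']]
```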

@jreback commented Apr 30, 2013

@y-p when you have a chance... I think I figured out the duplicate-columns-across-dtypes issue. I non-trivially create the indexer, and it seems to fix this problem (I added a few tests). Pretty sure it also fixes the whole to_csv mess that we had, so some of your #3458 can revert. Also, I seem to have lost the issue that was created at the very end of the to_csv work last month, when we punted on fixing dup columns (it failed fast in certain cases). The other dup-columns issues should also now be reviewed.

@jreback commented Apr 30, 2013

@y-p thanks, that looks right... still some issues remaining, but making progress.

@ghost commented Apr 30, 2013

Finger slipped. related #2194, #3230, and I think you meant #3092.

@jreback commented Apr 30, 2013

@y-p what class of issues/PRs are we doing for 0.11.1? Mostly just bug fixes, or minor features too? Sort of like a mini-release, just shorter in time?

@ghost commented Apr 30, 2013

bug fixes and small-medium enhancements that are back-compatible. no sudden moves.
good transition period to close some of the niggling viz bugs that have been piling up,
which get no love during the major release cycles.

@jreback commented Apr 30, 2013

k...great....hopefully will have most of the dup columns issues resolved very shortly

@ghost commented Apr 30, 2013

I'll add the config option to control dupe col name mangling tomorrow, will decide on a default before the release.

@jreback commented Apr 30, 2013

@y-p I think I got all the dup issues; the only one remaining to fix is to_csv. This will work (I pseudo-tested it), but I need to remove the exception that is raised when the columns are duped. You can do the same type of indexing on a dup column as on a non-dup column (I think that was the issue). Still can't figure out which issue it was though?

@ghost commented Apr 30, 2013

hmmm. #3095?

@jreback commented Apr 30, 2013

that looks right....

@jreback commented Apr 30, 2013

that is closed, don't we have an open one? (or did we just shove the exception in that one and punt?)

@ghost commented Apr 30, 2013

Yeah, I think we did, the block positional indexer issue was all I remember.

@jreback commented Apr 30, 2013

@wesm any thoughts before we merge this?

@jreback commented Apr 30, 2013

I added #3495 to cover the to_csv stuff

@jreback commented May 2, 2013

closed in favor of #3509, which is cleaner and has to_csv support

Successfully merging this pull request may close these issues.

Pandas inconsistently handles identically named columns in csv export and merging