spark-issues mailing list archives

From "Natu Lauchande (JIRA)" <>
Subject [jira] [Commented] (SPARK-16896) Loading csv with duplicate column names
Date Thu, 18 Aug 2016 11:01:20 GMT


Natu Lauchande commented on SPARK-16896:

Hi, I did start, but I couldn't make much progress over the last couple of days. Feel
free to grab it. I can try to find another, easier and less critical beginner task.

> Loading csv with duplicate column names
> ---------------------------------------
>                 Key: SPARK-16896
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Aseem Bansal
> It would be great if the library allowed us to load CSV files with duplicate column names. I
> understand that having duplicate columns in the data is odd, but sometimes we receive upstream
> data like that. We may choose to ignore the duplicates, but currently there is no way to drop
> them because we are not able to load the file at all. As a workaround, I pre-processed the data
> in R, changed the column names, and wrote a fixed version that the Spark Java API can work with.
> As for other options, R's read.csv handles this situation automatically by appending a number
> to the column name.
> Case sensitivity in column names can also cause problems. For example, if we have the columns
> ColumnName, columnName
> I may want to keep them as separate columns, but the option to do this is not documented.
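
The renaming workaround described in the issue can be sketched in a few lines. This is only an illustration of the idea of appending a number to repeated header names. It is plain Python, not part of Spark, and the function name `dedupe_column_names`, the `.N` suffix format, and the `case_sensitive` flag are all assumptions made for this example. It does not claim to reproduce the exact names R would generate.

```python
def dedupe_column_names(names, case_sensitive=True):
    """Return a copy of `names` in which repeated names get a numeric suffix.

    Hypothetical helper for illustration only: the first occurrence keeps
    its name, later occurrences become "name.1", "name.2", and so on.
    With case_sensitive=False, names that differ only by case are treated
    as duplicates of each other.
    """
    def norm(s):
        # Comparison key: the raw name, or its lowercase form.
        return s if case_sensitive else s.lower()

    used = set()        # comparison keys already handed out
    next_suffix = {}    # next suffix number to try, per original key
    result = []
    for name in names:
        key = norm(name)
        n = next_suffix.get(key, 1)
        candidate = name
        # Keep trying suffixes until the candidate does not collide with
        # any name already emitted (including earlier suffixed names).
        while norm(candidate) in used:
            candidate = f"{name}.{n}"
            n += 1
        next_suffix[key] = n
        used.add(norm(candidate))
        result.append(candidate)
    return result


if __name__ == "__main__":
    print(dedupe_column_names(["id", "name", "id", "id"]))
    print(dedupe_column_names(["ColumnName", "columnName"], case_sensitive=False))
```

The header row would have to be rewritten with names like these before loading, which is the pre-processing step the reporter currently performs in R. With `case_sensitive=True` the pair ColumnName, columnName is left untouched, matching the reporter's wish to keep them as separate columns.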

This message was sent by Atlassian JIRA
