spark-issues mailing list archives

From "Sean Owen (JIRA)" <>
Subject [jira] [Commented] (SPARK-16896) Loading csv with duplicate column names
Date Tue, 09 Aug 2016 08:50:20 GMT


Sean Owen commented on SPARK-16896:

No, the code is in Spark now. 

> Loading csv with duplicate column names
> ---------------------------------------
>                 Key: SPARK-16896
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Aseem Bansal
> It would be great if the library allowed us to load CSV files with duplicate column names. I
> understand that having duplicate columns in the data is odd, but sometimes upstream data
> arrives that way. We may choose to ignore the duplicates, but currently there is no way to
> drop them because the file cannot be loaded at all. As a workaround, I currently pre-process
> the data in R: I load it, rename the columns, and write a fixed version that the Spark Java
> API can work with.
> Other tools already handle this, e.g. R's read.csv automatically takes care of this
> situation by appending a number to the column name.
> Case sensitivity in column names can also cause problems. For example, if we have the columns
> ColumnName, columnName
> I may want to keep them as separate columns, but the option to do this is not documented.
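
For illustration, the renaming behaviour the reporter describes could be sketched as a pre-processing step on the header row. This is a minimal Python sketch under stated assumptions, not Spark's or R's implementation: the function name dedupe_column_names, the case_sensitive flag, and the ".N" suffix format are hypothetical choices made for this example. It defaults to case-insensitive comparison so that ColumnName and columnName count as duplicates unless the caller asks otherwise.

```python
def dedupe_column_names(names, case_sensitive=False):
    """Return a new list in which repeated column names get a numeric suffix.

    Hypothetical helper for illustration only. The first occurrence of a
    name is kept unchanged; later occurrences become "name.1", "name.2", ...
    """
    def key(name):
        # Normalise the name for comparison when case is ignored.
        return name if case_sensitive else name.lower()

    used = set()        # comparison keys of names already emitted
    next_suffix = {}    # next suffix number to try, per original key
    result = []
    for name in names:
        candidate = name
        # Keep trying suffixes until the candidate is not already taken,
        # which also avoids clashing with a literal "name.1" in the input.
        while key(candidate) in used:
            n = next_suffix.get(key(name), 1)
            next_suffix[key(name)] = n + 1
            candidate = f"{name}.{n}"
        used.add(key(candidate))
        result.append(candidate)
    return result


header = ["id", "value", "value", "Value"]
print(dedupe_column_names(header))
# ['id', 'value', 'value.1', 'Value.2']
print(dedupe_column_names(header, case_sensitive=True))
# ['id', 'value', 'value.1', 'Value']
```

A header cleaned this way could then be written back to the file, or applied as column names after loading, which is roughly what the R workaround in the description does by hand.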
