spark-issues mailing list archives

From "natalya (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-6189) Pandas to DataFrame conversion should check field names for periods
Date Thu, 21 May 2015 16:30:17 GMT

    [ https://issues.apache.org/jira/browse/SPARK-6189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554609#comment-14554609 ]

natalya commented on SPARK-6189:
--------------------------------

Figuring out what is wrong is not the difficulty.  The current error message, while confusing
and humorous, provides sufficient information to track down the issue.

However, if Spark simply returns an error, it will remain incompatible with certain data sets,
for example those built around URLs, server names, IP addresses, and e-mail addresses.  All of
these necessarily contain a period, and a small subset also contain underscores.  Both solutions
would prohibit using this type of data directly in field names, which seems like a significant
restriction, even more so when you factor in the additional restriction on compatibility with
R and SQL.

Wouldn't it be better to fix the problem and allow periods?

> Pandas to DataFrame conversion should check field names for periods
> -------------------------------------------------------------------
>
>                 Key: SPARK-6189
>                 URL: https://issues.apache.org/jira/browse/SPARK-6189
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>            Priority: Minor
>
> Issue I ran into:  I imported an R dataset in CSV format into a Pandas DataFrame and
then used toDF() to convert that into a Spark DataFrame.  The R dataset had a column with a
period in it (column "GNP.deflator" in the "longley" dataset).  When I tried to select it
using the Spark DataFrame DSL, I could not because the DSL thought the period was selecting
a field within GNP.
> Also, since "GNP" is another field's name, it gives an error which could be obscure to
users, complaining:
> {code}
> org.apache.spark.sql.AnalysisException: GetField is not valid on fields of type DoubleType;
> {code}
> We should either handle periods in column names or check during loading and warn/fail
gracefully.
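
For illustration only, here is a minimal sketch of what a "check during loading and warn/fail" step could look like on the Python side before the conversion. It is not Spark's implementation. The function names and the `on_error` switch are hypothetical, and it works on a plain list of column names so that it does not depend on any Spark or Pandas API. The rename helper also shows why substituting underscores is not free: it has to refuse when the new name collides with an existing one.

```python
import warnings


def find_dotted_columns(columns):
    """Return the column names that contain a period, in original order."""
    return [name for name in columns if "." in name]


def check_columns(columns, on_error="warn"):
    """Warn or fail when any column name contains a period.

    `on_error` is a hypothetical switch: "warn" emits a UserWarning,
    "raise" raises ValueError. The names are returned unchanged.
    """
    dotted = find_dotted_columns(columns)
    if dotted:
        message = "column names contain periods: " + ", ".join(
            repr(name) for name in dotted
        )
        if on_error == "raise":
            raise ValueError(message)
        warnings.warn(message)
    return list(columns)


def rename_dotted_columns(columns, replacement="_"):
    """Replace periods in column names, refusing if two names would collide."""
    renamed = [name.replace(".", replacement) for name in columns]
    if len(set(renamed)) != len(renamed):
        raise ValueError("renaming would produce duplicate column names")
    return renamed
```

With a Pandas DataFrame held in a hypothetical variable `pdf`, the input would be `list(pdf.columns)`. For the columns named in this issue, `find_dotted_columns(["GNP.deflator", "GNP"])` flags only "GNP.deflator", while `rename_dotted_columns(["a.b", "a_b"])` raises because both names would become "a_b".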



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

