spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hyukjin Kwon (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-15522) DataFrame Column Names That are Numbers aren't referenced correctly in SQL
Date Mon, 06 Mar 2017 16:52:33 GMT

     [ https://issues.apache.org/jira/browse/SPARK-15522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Hyukjin Kwon resolved SPARK-15522.
----------------------------------
    Resolution: Invalid

> DataFrame Column Names That are Numbers aren't referenced correctly in SQL
> --------------------------------------------------------------------------
>
>                 Key: SPARK-15522
>                 URL: https://issues.apache.org/jira/browse/SPARK-15522
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Jason Pohl
>
> The following code is run:
> val pre_piv_df_a = sqlContext.sql("""
> SELECT
> CASE WHEN Gender = 'M' Then 1 ELSE 0 END AS has_male,
> CASE WHEN Gender = 'F' Then 1 ELSE 0 END AS has_female,
> CAST(StartAge AS Double) AS StartAge_dbl,
> CAST(EndAge AS Double) AS EndAge_dbl,
> *
> FROM alldem_union_curr
> """)
> .withColumn("JavaStartTimestamp", create_ts($"StartTimestamp"))
> .drop("StartTimestamp").withColumnRenamed("JavaStartTimestamp", "StartTimestamp")
> .drop("StartAge").drop("EndAge")
> .withColumnRenamed("StartAge_dbl", "StartAge").withColumnRenamed("EndAge_dbl", "EndAge")
> val pre_piv_df_b = pre_piv_df_a
> .withColumn("media_month_cc", media_month_cc($"MediaMonth"))
> .withColumn("media_week_cc", media_week_sts_cc($"StartTimestamp"))
> .withColumn("media_day_cc", media_day_sts_cc($"StartTimestamp"))
> .withColumn("week_day", week_day($"StartTimestamp"))
> .withColumn("week_end", week_end($"StartTimestamp"))
> .join(sqlContext.table("cad_nets"), $"Network" === $"nielsen_network", "inner")
> .withColumnRenamed("cad_network", "norm_net_code_a")
> .withColumn("norm_net_code", reCodeNets($"norm_net_code_a"))
> pre_piv_df_b.registerTempTable("pre_piv_df")
> val piv_qhID_df = pre_piv_df_b.groupBy("Network", "Audience", "StartDate", "rating_category_cd")
> .pivot("qaID").agg("rating" -> "mean")
> The pivot creates a lot of columns (96) with names that are like ‘01’,’02’,…,’96’
as a result of pivoting a table that has quarter hour IDs.
> In the below SQL the highlighted section causes problems. If I rename the columns to
‘col01’,’col02’,…,’col96’ I can run the SQL correctly and get the expected results.
> select * from piv_qhID where 82 is NULL limit 20
> And I am getting no rows even though there are nulls.
> On the other hand the query:
> select * from piv_qhID where 82 is NOT NULL limit 20
> Returns all rows (even those with nulls)
> Renaming the columns fixes this, but it would be nice if the columns were referenced
correctly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message