spark-issues mailing list archives

From "Hyukjin Kwon (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-15127) Column names are handled incorrectly when they originate from a single Dataframe
Date Tue, 21 May 2019 04:14:15 GMT

     [ https://issues.apache.org/jira/browse/SPARK-15127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-15127.
----------------------------------
    Resolution: Incomplete

> Column names are handled incorrectly when they originate from a single Dataframe
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-15127
>                 URL: https://issues.apache.org/jira/browse/SPARK-15127
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core, SQL
>    Affects Versions: 1.6.1, 2.0.0
>         Environment: Mac OS X 10.11.4 And Ubuntu Linux 16.04 LTS
>            Reporter: Jurriaan Pruis
>            Priority: Major
>              Labels: bulk-closed
>
> I think I found a bug in the way columns are handled in (Py)Spark.
> h3. How to reproduce
> {code}
> df = sc.parallelize([[1, 'A', 'Not B'], [1, 'Not A', 'B']]).toDF(['id', 'a', 'b'])
> example = sc.parallelize([[1],[2]]).toDF(['id'])
> df_a = df.filter('a = "A"').alias('df_a')
> df_b = df.filter('b = "B"').alias('df_b')
> example.join(df_a, 'id').drop(df_a['id']).join(df_b, 'id').drop(df_b['id']).select('id', df_a['a'], df_b['b']).show()
> {code}
> Results in:
> {code}
> +---+---+-----+
> | id|  a|    b|
> +---+---+-----+
> |  1|  A|Not B|
> +---+---+-----+
> {code}
> Expected result:
> {code}
> +---+---+---+
> | id|  a|  b|
> +---+---+---+
> |  1|  A|  B|
> +---+---+---+
> {code}
> When using the aliases in the select statement, it does work properly:
> {code}
> example.join(df_a, 'id').join(df_b, 'id').select('id', 'df_a.a', 'df_b.b').show()
> {code}
> Results in the expected result:
> {code}
> +---+---+---+
> | id|  a|  b|
> +---+---+---+
> |  1|  A|  B|
> +---+---+---+
> {code}
> I'm not sure if this is how you're supposed to select columns from this kind of DataFrame, but I think the first example should have worked just as well.
> I did some other experiments with this:
> It also works when creating a new DataFrame using toDF():
> {code}
> df_a = df.filter('a = "A"').alias('df_a')
> df_b = df.filter('b = "B"').alias('df_b')
> df_a = df_a.toDF(*df_a.columns)
> df_b = df_b.toDF(*df_b.columns)
> example.join(df_a, 'id').drop(df_a['id']).join(df_b, 'id').drop(df_b['id']).select('id', df_a['a'], df_b['b']).show()
> {code}
> Results in the expected result:
> {code}
> +---+---+---+
> | id|  a|  b|
> +---+---+---+
> |  1|  A|  B|
> +---+---+---+
> {code}
> But not when doing this with a select (which, according to the docs, should return a *new* DataFrame):
> {code}
> df_a = df.filter('a = "A"').alias('df_a')
> df_b = df.filter('b = "B"').alias('df_b')
> df_a = df_a.select(*df_a.columns)
> df_b = df_b.select(*df_b.columns)
> example.join(df_a, 'id').drop(df_a['id']).join(df_b, 'id').drop(df_b['id']).select('id', df_a['a'], df_b['b']).show()
> {code}
> Results in:
> {code}
> +---+---+-----+
> | id|  a|    b|
> +---+---+-----+
> |  1|  A|Not B|
> +---+---+-----+
> {code}
> At least something is unclear in the documentation here, and maybe this is a Column handling bug too.
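
One possible reading of the behaviour described above, which is not confirmed anywhere in this issue: column references such as df_b['b'] may be resolved by an internal column identity rather than by alias, and df_a and df_b inherit the same identities from df. The pure-Python sketch below is only a toy model of that assumption. Attr, Frame, fresh() and the join/select logic are hypothetical names invented for illustration; none of this is Spark code or Spark's actual implementation.

```python
from dataclasses import dataclass
from itertools import count

_next_id = count(1)  # source of unique attribute identities (toy model only)


@dataclass(frozen=True)
class Attr:
    """A column reference: a display name plus a unique identity."""
    name: str
    attr_id: int


class Frame:
    """Hypothetical stand-in for a DataFrame: attributes plus rows of tuples."""

    def __init__(self, attrs, rows):
        self.attrs = list(attrs)
        self.rows = [tuple(r) for r in rows]

    @classmethod
    def create(cls, names, rows):
        # Every column of a newly created frame gets a fresh identity.
        return cls([Attr(n, next(_next_id)) for n in names], rows)

    def __getitem__(self, name):
        # Like df['a']: return the first attribute with that name.
        return next(a for a in self.attrs if a.name == name)

    def filter(self, predicate):
        # Assumption: filtering keeps the SAME attribute identities.
        names = [a.name for a in self.attrs]
        kept = [r for r in self.rows if predicate(dict(zip(names, r)))]
        return Frame(self.attrs, kept)

    def fresh(self):
        # Assumption: roughly the effect of toDF(*columns),
        # i.e. same names but new identities.
        return Frame.create([a.name for a in self.attrs], self.rows)

    def join(self, other, key):
        # Equi-join on `key`, keeping only the left-hand key column.
        left_ids = {a.attr_id for a in self.attrs}
        k_left = self.attrs.index(self[key])
        k_right = other.attrs.index(other[key])
        # Right-hand attributes that clash with the left get new identities,
        # so an old reference to them now matches the left-hand column instead.
        right_attrs = [
            Attr(a.name, next(_next_id)) if a.attr_id in left_ids else a
            for i, a in enumerate(other.attrs) if i != k_right
        ]
        rows = [
            lrow + tuple(v for i, v in enumerate(rrow) if i != k_right)
            for lrow in self.rows for rrow in other.rows
            if lrow[k_left] == rrow[k_right]
        ]
        return Frame(self.attrs + right_attrs, rows)

    def select(self, *refs):
        # Resolve each reference by identity, not by name.
        ids = [a.attr_id for a in self.attrs]
        idx = [ids.index(ref.attr_id) for ref in refs]
        return [tuple(row[i] for i in idx) for row in self.rows]


df = Frame.create(['id', 'a', 'b'], [(1, 'A', 'Not B'), (1, 'Not A', 'B')])
example = Frame.create(['id'], [(1,), (2,)])
df_a = df.filter(lambda r: r['a'] == 'A')
df_b = df.filter(lambda r: r['b'] == 'B')

# df_a['b'] and df_b['b'] share one identity, so df_b['b'] hits df_a's column.
shared = (example.join(df_a, 'id').join(df_b, 'id')
          .select(example['id'], df_a['a'], df_b['b']))

# With fresh identities on each side, the references no longer collide.
fa, fb = df_a.fresh(), df_b.fresh()
distinct = (example.join(fa, 'id').join(fb, 'id')
            .select(example['id'], fa['a'], fb['b']))

print(shared)
print(distinct)
```

In this toy model, shared comes out as [(1, 'A', 'Not B')] and distinct as [(1, 'A', 'B')], which mirrors the unexpected and expected tables in the report. Whether select(*columns) really preserves identities while toDF(*columns) renews them in Spark is an assumption that would need to be checked against the query plans.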



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

