spark-issues mailing list archives

From "Jurriaan Pruis (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-15127) Column names are handled incorrectly when they originate from a single Dataframe
Date Wed, 04 May 2016 18:00:16 GMT

     [ https://issues.apache.org/jira/browse/SPARK-15127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jurriaan Pruis updated SPARK-15127:
-----------------------------------
    Description: 
I think I found a bug in the way columns are handled in (py)Spark.

h3. How to reproduce
{code}
df = sc.parallelize([[1, 'A', 'Not B'], [1, 'Not A', 'B']]).toDF(['id', 'a', 'b'])

example = sc.parallelize([[1],[2]]).toDF(['id'])

df_a = df.filter('a = "A"').alias('df_a')
df_b = df.filter('b = "B"').alias('df_b')

example.join(df_a, 'id').drop(df_a['id']).join(df_b, 'id').drop(df_b['id']).select('id', df_a['a'], df_b['b']).show()
{code}
Results in:

{code}
+---+---+-----+
| id|  a|    b|
+---+---+-----+
|  1|  A|Not B|
+---+---+-----+
{code}

Expected result:

{code}
+---+---+---+
| id|  a|  b|
+---+---+---+
|  1|  A|  B|
+---+---+---+
{code}
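A plausible reading of this behaviour (an assumption on my part, not verified against Spark's internals) is that a column reference is resolved by an internal expression ID rather than by the alias it was obtained through, and that {{filter()}}/{{alias()}} keep the parent frame's attributes unchanged. A toy model in plain Python, with all names illustrative:

```python
import itertools

_next_id = itertools.count()

class Column:
    """Toy stand-in for a Catalyst attribute: identity is an expression ID."""
    def __init__(self, name):
        self.name = name
        self.expr_id = next(_next_id)

class Frame:
    """Toy DataFrame whose filter()/alias() keep the parent's attributes."""
    def __init__(self, columns):
        self.columns = columns
    def filter(self, _cond):
        return Frame(self.columns)   # filtering does not rewrite attributes
    def alias(self, _name):
        return Frame(self.columns)   # aliasing only renames the relation
    def __getitem__(self, name):
        return next(c for c in self.columns if c.name == name)

df = Frame([Column('id'), Column('a'), Column('b')])
df_a = df.filter('a = "A"').alias('df_a')
df_b = df.filter('b = "B"').alias('df_b')

# Both aliases hand back the very same attribute objects, so a resolver
# that goes by expression ID cannot tell df_a['b'] apart from df_b['b'].
assert df_a['a'].expr_id == df_b['a'].expr_id
assert df_a['b'] is df_b['b']
```

Under that model, `df_b['b']` in the repro above is indistinguishable from `df_a['b']`, and the select ends up reading the column from the first join.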

When using the aliases in the select statement, it does work properly:
{code}
example.join(df_a, 'id').join(df_b, 'id').select('id', 'df_a.a', 'df_b.b').show()
{code}

Results in expected result:

{code}
+---+---+---+
| id|  a|  b|
+---+---+---+
|  1|  A|  B|
+---+---+---+
{code}

I'm not sure if this is how you're supposed to select columns from this kind of DataFrame,
but I think the first example should have worked just as well.


I did some other experiments with this:

It also works when creating a new DataFrame using toDF():
{code}
df_a = df.filter('a = "A"').alias('df_a')
df_b = df.filter('b = "B"').alias('df_b')
df_a = df_a.toDF(*df_a.columns)
df_b = df_b.toDF(*df_b.columns)
example.join(df_a, 'id').drop(df_a['id']).join(df_b, 'id').drop(df_b['id']).select('id', df_a['a'], df_b['b']).show()
{code}

Results in expected result:
{code}
+---+---+---+
| id|  a|  b|
+---+---+---+
|  1|  A|  B|
+---+---+---+
{code}

But it does not work when doing the same with a select (which, according to the docs, should also return a *new*
DataFrame):

{code}
df_a = df.filter('a = "A"').alias('df_a')
df_b = df.filter('b = "B"').alias('df_b')
df_a = df_a.select(*df_a.columns)
df_b = df_b.select(*df_b.columns)
example.join(df_a, 'id').drop(df_a['id']).join(df_b, 'id').drop(df_b['id']).select('id', df_a['a'], df_b['b']).show()
{code}

Results in:

{code}
+---+---+-----+
| id|  a|    b|
+---+---+-----+
|  1|  A|Not B|
+---+---+-----+
{code}

At the very least the documentation is unclear here, and this may be a column handling
bug as well.
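
One way to read the toDF()-versus-select() difference (again an assumption about the resolution model, not confirmed from Spark's source) is that a projection re-uses the expression IDs of the attributes it selects, while toDF() mints fresh attributes and so gives the new frame a distinguishable lineage. A minimal sketch with illustrative names:

```python
import itertools

_next_id = itertools.count()

class Attr:
    """Toy attribute: identity is an expression ID."""
    def __init__(self, name):
        self.name = name
        self.expr_id = next(_next_id)

def select(attrs, names):
    # A projection keeps the selected input attributes (same IDs) ...
    return [a for a in attrs if a.name in names]

def to_df(attrs, names):
    # ... while toDF() assigns brand-new attributes (fresh IDs).
    return [Attr(n) for n in names]

base = [Attr('id'), Attr('a'), Attr('b')]
projected = select(base, ['id', 'a', 'b'])
renamed = to_df(base, ['id', 'a', 'b'])

# select(*columns) stays tied to the original lineage; toDF() does not.
assert [a.expr_id for a in projected] == [a.expr_id for a in base]
assert all(r.expr_id != b.expr_id for r, b in zip(renamed, base))
```

That would explain why the toDF() variant above produces the expected result while the select() variant reproduces the bug.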



> Column names are handled incorrectly when they originate from a single Dataframe
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-15127
>                 URL: https://issues.apache.org/jira/browse/SPARK-15127
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core, SQL
>    Affects Versions: 1.6.1, 2.0.0
>         Environment: Mac OS X 10.11.4 And Ubuntu Linux 16.04 LTS
>            Reporter: Jurriaan Pruis
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
