spark-issues mailing list archives

From "Eric Doi (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-8152) Dataframe Join Ignores Condition
Date Mon, 08 Jun 2015 01:33:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Doi updated SPARK-8152:
----------------------------
    Attachment: side-by-side.png

In this screenshot, the dataframes "purchasedRecItems" and "grouped_B" should be identical,
and the output of show() supports this.  Each is unique on itemRecordId and has 9 rows.

We join each of these with "countUsersDF".  This dataframe is unique on itemRecordId, and
has 227 rows.

The result of each join should be 9 rows, since "itemRecordId" is unique in each dataframe.
However, when using "purchasedRecItems", the result has 2043 = 227 * 9 rows.  Output from show()
reveals that each row of "countUsersDF" has been matched with every row of "purchasedRecItems",
regardless of the itemRecordId join condition.
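
To make the row-count arithmetic concrete, here is a minimal plain-Python sketch (not Spark code,
and not the actual data from the screenshot).  The dictionaries, their values, and the helper
names inner_join and cross_join are made up for illustration.  It assumes all 9 itemRecordId
values also occur among the 227, which is what the expected result of 9 rows implies.

```python
# Hypothetical stand-in data: the real rows are only visible in the screenshot.
# Keys play the role of itemRecordId and are unique within each table.
count_users = {item_id: item_id * 10 for item_id in range(227)}   # 227 rows
purchased = {item_id: f"item-{item_id}" for item_id in range(9)}  # 9 rows


def inner_join(left, right):
    """Equi-join on the key: keep a pair only when the keys are equal."""
    return [(key, value, right[key]) for key, value in left.items() if key in right]


def cross_join(left, right):
    """Pair every left row with every right row, ignoring the keys entirely."""
    return [
        (lkey, lvalue, rkey, rvalue)
        for lkey, lvalue in left.items()
        for rkey, rvalue in right.items()
    ]


expected = inner_join(purchased, count_users)      # what the join condition should give
observed_like = cross_join(purchased, count_users)  # what the reported row count looks like

print(len(expected))       # 9
print(len(observed_like))  # 2043
```

A result of exactly 227 * 9 rows is the signature of a Cartesian product, so the observed
count is consistent with the join condition not being applied at all.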

> Dataframe Join Ignores Condition
> --------------------------------
>
>                 Key: SPARK-8152
>                 URL: https://issues.apache.org/jira/browse/SPARK-8152
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Eric Doi
>         Attachments: side-by-side.png
>
>
> When joining two tables A and B, on condition that A.X = B.X, in some cases that condition
> is not fulfilled in the result.
> Suspect it might be due to duplicate column names in the source tables causing confusion.
> Is it possible for there to exist hidden fields in a dataframe?
> Will attach a screenshot for more details.  The bug is reproducible but hard to pinpoint.




