spark-issues mailing list archives

From "Eric Doi (JIRA)" <>
Subject [jira] [Updated] (SPARK-8152) Dataframe Join Ignores Condition
Date Mon, 08 Jun 2015 01:33:00 GMT


Eric Doi updated SPARK-8152:
    Attachment: side-by-side.png

In this screenshot, the dataframes "purchasedRecItems" and "grouped_B" should be identical,
and the output of show() supports this.  Each is unique on itemRecordId and has 9 rows.

We join each of these with "countUsersDF".  This dataframe is unique on itemRecordId, and
has 227 rows.

The result of each join should be 9 rows, since "itemRecordId" is unique in each dataframe.
However, when using "purchasedRecItems", the result has 2043 = 227 * 9 rows, which is the size
of the full cross product.  Output from show() reveals that every row of "countUsersDF" has
been matched to every row of "purchasedRecItems", regardless of the itemRecordId join condition.
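
To make the row counts concrete, here is a minimal sketch in plain Python (not Spark) that
contrasts a keyed inner join with a cross product.  The data is a hypothetical stand-in, not
the data from the screenshot: the ids, the "userCount" field, and the helper names inner_join
and cross_join are all assumptions made for illustration.  It only shows the arithmetic, not
the cause of the bug.

```python
from itertools import product

# Hypothetical stand-in data: 227 unique ids on one side, 9 of those ids on the other.
count_users = [{"itemRecordId": i, "userCount": i % 5} for i in range(227)]
purchased = [{"itemRecordId": i} for i in range(9)]


def inner_join(left, right, key):
    """Inner join on `key`, assuming `key` is unique within `right`."""
    index = {row[key]: row for row in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]


def cross_join(left, right):
    """Pair every left row with every right row, ignoring any condition."""
    return list(product(left, right))


expected = inner_join(purchased, count_users, "itemRecordId")
observed_like = cross_join(purchased, count_users)

print(len(expected))       # 9
print(len(observed_like))  # 2043
```

A result of 2043 rows is what a join produces when its condition is effectively always true,
which matches the count reported above.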

> Dataframe Join Ignores Condition
> --------------------------------
>                 Key: SPARK-8152
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Eric Doi
>         Attachments: side-by-side.png
> When joining two tables A and B, on condition that A.X = B.X, in some cases that condition
> is not fulfilled in the result.
> Suspect it might be due to duplicate column names in the source tables causing confusion.
> Is it possible for there to exist hidden fields in a dataframe?
> Will attach a screenshot for more details.  The bug is reproducible but hard to pinpoint.

This message was sent by Atlassian JIRA
