spark-reviews mailing list archives

From marmbrus <>
Subject [GitHub] spark pull request: [SPARK-6550][SQL] Use analyzed plan in DataFra...
Date Thu, 26 Mar 2015 21:44:39 GMT
GitHub user marmbrus opened a pull request:

    [SPARK-6550][SQL] Use analyzed plan in DataFrame

    This is based on the bug and test case proposed by @viirya.  See #5203 for an excellent description
of the problem.
    TL;DR: The problem occurs because the function `groupBy(String)` calls `resolve`, which
returns an `AttributeReference`.  However, this `AttributeReference` is resolved against an analyzed
plan that is then thrown away.  At execution time, we analyze the plan once again.  In
the case of self-joins, each call to analyze produces a new tree for the left side of
the join, rendering the previously returned `AttributeReference` invalid.
    As a fix, I propose we keep the analyzed plan instead of the unanalyzed logical plan inside of a
`DataFrame`.
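
The failure mode can be sketched with a toy model in plain Python. This is not Spark code: `analyze`, `ReanalyzingFrame`, and `AnalyzedFrame` are hypothetical names invented for illustration, and the only behavior assumed from the description above is that each analysis pass hands out fresh attribute identities (as happens for one side of a self-join).

```python
import itertools
from dataclasses import dataclass

# Hypothetical stand-in for the analyzer's source of unique expression IDs.
_next_id = itertools.count()


@dataclass(frozen=True)
class AttributeReference:
    """Toy attribute: equal only if both the name and the ID match."""
    name: str
    expr_id: int


def analyze(column_names):
    """Toy analyzer: every call produces attributes with fresh IDs."""
    return [AttributeReference(n, next(_next_id)) for n in column_names]


class ReanalyzingFrame:
    """Keeps only the unanalyzed plan and re-analyzes on every use."""

    def __init__(self, columns):
        self.columns = columns

    def resolve(self, name):
        plan = analyze(self.columns)  # this analyzed plan is thrown away
        return next(a for a in plan if a.name == name)

    def can_execute_group_by(self, attr):
        plan = analyze(self.columns)  # second, independent analysis
        return attr in plan


class AnalyzedFrame:
    """Analyzes once and keeps the analyzed plan, as the fix proposes."""

    def __init__(self, columns):
        self.analyzed = analyze(columns)

    def resolve(self, name):
        return next(a for a in self.analyzed if a.name == name)

    def can_execute_group_by(self, attr):
        return attr in self.analyzed


stale = ReanalyzingFrame(["key", "value"])
stale_ok = stale.can_execute_group_by(stale.resolve("key"))

fixed = AnalyzedFrame(["key", "value"])
fixed_ok = fixed.can_execute_group_by(fixed.resolve("key"))
```

In this sketch `stale_ok` is `False` because the attribute returned by `resolve` belongs to a plan that no longer exists, while `fixed_ok` is `True` because resolution and execution share one analyzed plan.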

You can merge this pull request into a Git repository by running:

    $ git pull preanalyzer

Alternatively you can review and apply these changes as the patch at:

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5217

commit 089c52e5b5fc44e1f75b9156146ce649317e2375
Author: Michael Armbrust <>
Date:   2015-03-26T19:13:55Z


commit dd4dec1194272c84a71095f889e529d0a7970f65
Author: Michael Armbrust <>
Date:   2015-03-26T21:14:10Z

    Use the analyzed plan in DataFrame

