spark-issues mailing list archives

From "Nicholas Chammas (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-23945) Column.isin() should accept a single-column DataFrame as input
Date Tue, 08 May 2018 22:23:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-23945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16468045#comment-16468045 ]

Nicholas Chammas edited comment on SPARK-23945 at 5/8/18 10:22 PM:
-------------------------------------------------------------------

{quote}So in the grand scheme of things I'd expect DataFrames to be able to do everything
that SQL can and vice versa
{quote}
Since writing this, I realized that the DataFrame API can express `IN` and `NOT IN`
via an inner join and a left anti join, respectively. I now suspect that most other cases
where I thought the DataFrame API was less powerful than SQL were simply mistaken: the
various DataFrame join types cover most of what you'd want to do with subqueries.

So I'd actually be fine with closing this out as "Won't Fix" and instructing users to
express the particular example I provided above as follows:
{code:java}
# NOT IN: keep only the rows of table1 whose name has no match in table2
(table1
    .join(
        table2,
        on='name',
        how='left_anti',
    )
){code}
This is equivalent to the SQL query I posted, and does not require that anything be collected
locally, so it scales just as well.
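
For the `IN` direction, a minimal sketch under the same assumptions (both `table1` and `table2`
have a `name` column): a left semi join keeps the rows of `table1` whose name appears in `table2`,
without the row duplication that a plain inner join could introduce when names repeat in `table2`.
{code:java}
# IN: keep only the rows of table1 whose name also appears in table2
(table1
    .join(
        table2,
        on='name',
        how='left_semi',
    )
){code}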

[~hvanhovell] - Does this make sense?



> Column.isin() should accept a single-column DataFrame as input
> --------------------------------------------------------------
>
>                 Key: SPARK-23945
>                 URL: https://issues.apache.org/jira/browse/SPARK-23945
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Nicholas Chammas
>            Priority: Minor
>
> In SQL you can filter rows based on the result of a subquery:
> {code:java}
> SELECT *
> FROM table1
> WHERE name NOT IN (
>     SELECT name
>     FROM table2
> );{code}
> In the Spark DataFrame API, the equivalent would probably look like this:
> {code:java}
> (table1
>     .where(
>         ~col('name').isin(
>             table2.select('name')
>         )
>     )
> ){code}
> However, .isin() currently [only accepts a local list of values|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.isin].
> I imagine making this enhancement would happen as part of a larger effort to support correlated subqueries in the DataFrame API.
> Or perhaps there is no plan to support this style of query in the DataFrame API, and queries like this should instead be written in a different way? How would we write a query like the one I have above in the DataFrame API, without needing to collect values locally for the NOT IN filter?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

