Date: Wed, 12 Jul 2017 16:40:00 +0000 (UTC)
From: "Dongjoon Hyun (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Commented] (SPARK-21380) Join with Columns thinks inner join is cross join even when aliased

    [ https://issues.apache.org/jira/browse/SPARK-21380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16084274#comment-16084274 ]

Dongjoon Hyun commented on SPARK-21380:
---------------------------------------

I see. I agree with your point that the warning is misleading here.
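For anyone blocked on this in the meantime, one blunt interim workaround is to disable the check itself. A hedged sketch only (it suppresses the misleading error rather than fixing the misclassification, and it also lets genuine cross joins through; `left` and `right` here are the Datasets from the quoted report below):

{noformat}
// Workaround sketch, assuming a Spark 2.1.x SparkSession named `spark` and the
// `left`/`right` Datasets built exactly as in the example below. Disabling the
// cartesian-product check applies to the whole session, so genuine cross joins
// will also be allowed through; use with care.
spark.conf().set("spark.sql.crossJoin.enabled", "true");

// The (actually inner) join then runs normally.
Dataset<Row> result = left.join(
    right,
    left.col("l_name").equalTo(right.col("r_name")),
    "inner");
result.show();
{noformat}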
> Join with Columns thinks inner join is cross join even when aliased
> -------------------------------------------------------------------
>
>                 Key: SPARK-21380
>                 URL: https://issues.apache.org/jira/browse/SPARK-21380
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer
>    Affects Versions: 2.1.0, 2.1.1
>            Reporter: Everett Anderson
>              Labels: correctness
>
> While this seemed to work in Spark 2.0.2, it fails in 2.1.0 and 2.1.1.
> Even after aliasing both the table names and all the columns, joining Datasets on a condition assembled from Columns, rather than with the join(..., usingColumns) method variants, fails with an error complaining that the join is a cross join / cartesian product even when it isn't.
> Example:
> {noformat}
> Dataset<Row> left = spark.sql("select 'bob' as name, 23 as age");
> left = left
>     .alias("l")
>     .select(
>         left.col("name").as("l_name"),
>         left.col("age").as("l_age"));
> Dataset<Row> right = spark.sql("select 'bob' as name, 'bobco' as company");
> right = right
>     .alias("r")
>     .select(
>         right.col("name").as("r_name"),
>         right.col("company").as("r_age"));
> Dataset<Row> result = left.join(
>     right,
>     left.col("l_name").equalTo(right.col("r_name")),
>     "inner");
> result.show();
> {noformat}
> Results in
> {noformat}
> org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans
> Project [bob AS l_name#22, 23 AS l_age#23]
> +- OneRowRelation$
> and
> Project [bob AS r_name#33, bobco AS r_age#34]
> +- OneRowRelation$
> Join condition is missing or trivial.
> Use the CROSS JOIN syntax to allow cartesian products between these relations.;
>   at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1067)
>   at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1064)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:267)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:273)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:273)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:257)
>   at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts.apply(Optimizer.scala:1064)
>   at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts.apply(Optimizer.scala:1049)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>   at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>   at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
>   at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>   at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78)
>   at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78)
>   at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84)
>   at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80)
>   at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89)
>   at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89)
>   at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2814)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:2127)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:2342)
>   at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:638)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:597)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:606)
>   at com.nuna.platform.common.spark.util.JoinBuilderIntegrationTest.testSimpleJoin(JoinBuilderIntegrationTest.java:129)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:239)
>   at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
>   at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>   at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
>   at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:51)
>   at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
>   at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
> {noformat}
> This is related to other issues like SPARK-14854.
> I feel like in many of these cases, Spark shouldn't be considering these joins as Cartesian products. Usually, I run across this when one table is derived from another, but in this case it happens even when the two tables have fully distinct lineages.
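Since the report notes that the join(..., usingColumns) variants do not trip the check, another possible interim path is to keep a shared join-column name and use the usingColumn overload instead of a Column-based condition. A self-contained hedged sketch (illustrative only, not from this thread; the class and app names are made up, and this has not been verified against 2.1.x):

{noformat}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class UsingColumnWorkaround {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[*]")
        .appName("SPARK-21380 workaround sketch")
        .getOrCreate();

    // Same data as the report's example, but both sides keep the shared
    // column name `name` instead of renaming to l_name / r_name.
    Dataset<Row> left = spark.sql("select 'bob' as name, 23 as age");
    Dataset<Row> right = spark.sql("select 'bob' as name, 'bobco' as company");

    // join(right, usingColumn) builds the equi-join on the shared `name`
    // column internally, so no Column-based condition is involved and the
    // cartesian-product check reportedly stays quiet.
    Dataset<Row> joined = left.join(right, "name");
    joined.show();

    spark.stop();
  }
}
{noformat}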