Delivered-To: mailing list dev@spark.apache.org
Date: Wed, 11 May 2016 21:55:08 +0800
Subject: dataframe udf function will be executed twice when filter on new column created by withColumn
From: Tony Jin
To: spark users, dev
archived-at: Wed, 11 May 2016 13:55:18 -0000

Hi guys,

I have a problem with Spark DataFrames. My Spark version is 1.6.1.

Basically, I used a udf and df.withColumn to create a "new" column, then filtered on the values of this new column and called show (an action). I see that the udf function (which withColumn uses to create the new column) is called twice (duplicated). If I filter on the "old" column instead, the udf runs only once, which is expected. I have attached the example code; the filteredOnNewColumnDF.show call and its output (lines 30~38 of the original listing) show the problem.

Does anyone know the internal reason? Can you give me any advice? Thank you very much.

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> val df = sc.parallelize(Seq(("a", "b"), ("a1", "b1"))).toDF("old","old1")
df: org.apache.spark.sql.DataFrame = [old: string, old1: string]

scala> val udfFunc = udf((s: String) => {println(s"running udf($s)"); s })
udfFunc: org.apache.spark.sql.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,List(StringType))

scala> val newDF = df.withColumn("new", udfFunc(df("old")))
newDF: org.apache.spark.sql.DataFrame = [old: string, old1: string, new: string]

scala> newDF.show
running udf(a)
running udf(a1)
+---+----+---+
|old|old1|new|
+---+----+---+
|  a|   b|  a|
| a1|  b1| a1|
+---+----+---+

scala> val filteredOnNewColumnDF = newDF.filter("new <> 'a1'")
filteredOnNewColumnDF: org.apache.spark.sql.DataFrame = [old: string, old1: string, new: string]

scala> val filteredOnOldColumnDF = newDF.filter("old <> 'a1'")
filteredOnOldColumnDF: org.apache.spark.sql.DataFrame = [old: string, old1: string, new: string]

scala> filteredOnNewColumnDF.show
running udf(a)
running udf(a)
running udf(a1)
+---+----+---+
|old|old1|new|
+---+----+---+
|  a|   b|  a|
+---+----+---+

scala> filteredOnOldColumnDF.show
running udf(a)
+---+----+---+
|old|old1|new|
+---+----+---+
|  a|   b|  a|
+---+----+---+

Best wishes.
By Linbo
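One way to look into this further is to compare the query plans of the two filtered DataFrames with explain(true), which prints the parsed, analyzed, optimized and physical plans. The following is only a minimal sketch, assuming Spark 1.6.x is on the classpath and a local master is acceptable; the object name UdfPlanCheck, the application name and the local[1] master are illustrative choices and not part of the session above. It prints the plans and nothing more; where the udf call appears in each plan has to be read from the output.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf

// Hypothetical standalone driver; the names here are illustrative only.
object UdfPlanCheck {
  def main(args: Array[String]): Unit = {
    // Assumption: a single-threaded local master is enough for this check.
    val conf = new SparkConf().setAppName("udf-plan-check").setMaster("local[1]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._ // enables toDF on an RDD of tuples

    // Same data and udf as in the spark-shell session.
    val df = sc.parallelize(Seq(("a", "b"), ("a1", "b1"))).toDF("old", "old1")
    val udfFunc = udf((s: String) => { println(s"running udf($s)"); s })
    val newDF = df.withColumn("new", udfFunc(df("old")))

    // Print the extended plans for the filter on the derived column ...
    newDF.filter("new <> 'a1'").explain(true)
    // ... and for the filter on the original column, for comparison.
    newDF.filter("old <> 'a1'").explain(true)

    sc.stop()
  }
}
```

In spark-shell the setup lines are unnecessary: calling filteredOnNewColumnDF.explain(true) and filteredOnOldColumnDF.explain(true) on the existing values should give the same information.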