spark-issues mailing list archives

From "Luis Guerra (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-9131) UDFs change data values
Date Fri, 17 Jul 2015 10:45:04 GMT

     [ https://issues.apache.org/jira/browse/SPARK-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Luis Guerra updated SPARK-9131:
-------------------------------
    Description: 
I am having trouble using a custom UDF on DataFrames with PySpark 1.4.

I have rewritten the UDF to simplify the problem, and it gets even weirder. The UDFs I am using
do absolutely nothing: they just receive a value and return the same value in the same
format.

My code is below:

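from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import DateType

# Join the two DataFrames on their ID columns
c = a.join(b, a['ID'] == b['ID_new'], 'inner')

c.filter(c['ID'] == 'XX').show()

# Identity UDFs: each returns its input unchanged, declared as DateType
udf_A = UserDefinedFunction(lambda x: x, DateType())
udf_B = UserDefinedFunction(lambda x: x, DateType())
udf_C = UserDefinedFunction(lambda x: x, DateType())

# 'ta' is a plain column; 'tb', 'tc' and 'td' pass through the identity UDFs
d = c.select(c['ID'],
             c['t1'].alias('ta'),
             udf_A(c['t2']).alias('tb'),
             udf_B(c['t1']).alias('tc'),
             udf_C(c['t2']).alias('td'))

d.filter(d['ID'] == 'XX').show()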

Here are the outputs of the two show() calls:

+----------------+----------------+----------+----------+
|              ID|          ID_new|        t1|        t2|
+----------------+----------------+----------+----------+
|6000000002698917|6000000002698917|2012-02-28|2014-02-28|
|6000000002698917|6000000002698917|2012-02-20|2013-02-20|
|6000000002698917|6000000002698917|2012-02-28|2014-02-28|
|6000000002698917|6000000002698917|2012-02-20|2013-02-20|
|6000000002698917|6000000002698917|2012-02-20|2013-02-20|
|6000000002698917|6000000002698917|2012-02-28|2014-02-28|
|6000000002698917|6000000002698917|2012-02-28|2014-02-28|
|6000000002698917|6000000002698917|2012-02-20|2013-02-20|
+----------------+----------------+----------+----------+

+----------------+----------+----------+----------+----------+
|              ID|        ta|        tb|        tc|        td|
+----------------+----------+----------+----------+----------+
|6000000002698917|2012-02-28|2007-03-05|2003-03-05|2014-02-28|
|6000000002698917|2012-02-20|2007-02-15|2002-02-15|2013-02-20|
|6000000002698917|2012-02-28|2007-03-10|2005-03-10|2014-02-28|
|6000000002698917|2012-02-20|2007-03-05|2003-03-05|2013-02-20|
|6000000002698917|2012-02-20|2013-08-02|2013-01-02|2013-02-20|
|6000000002698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
|6000000002698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
|6000000002698917|2012-02-20|2014-01-02|2013-01-02|2013-02-20|
+----------------+----------+----------+----------+----------+

The problem here is that the values in columns 'tb', 'tc' and 'td' of dataframe 'd' are completely
different from the values of 't1' and 't2' in dataframe 'c', even though my UDFs do nothing. It
seems as if the values were somehow taken from other records (or just invented). The results
differ between executions (apparently at random).
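
To make the mismatch concrete, here is a minimal sketch in plain Python (no Spark needed) that checks the outputs above against what identity UDFs should produce: 'ta' and 'tc' should equal 't1', and 'tb' and 'td' should equal 't2'. The values are transcribed from the two outputs shown in this report. The helper name `count_mismatches` and the assumption that row i of the first output corresponds to row i of the second are mine, not part of the original code.

```python
from datetime import date

# (t1, t2) for each row of DataFrame c, transcribed from the first output.
c_rows = [
    ("2012-02-28", "2014-02-28"),
    ("2012-02-20", "2013-02-20"),
    ("2012-02-28", "2014-02-28"),
    ("2012-02-20", "2013-02-20"),
    ("2012-02-20", "2013-02-20"),
    ("2012-02-28", "2014-02-28"),
    ("2012-02-28", "2014-02-28"),
    ("2012-02-20", "2013-02-20"),
]

# (ta, tb, tc, td) for each row of DataFrame d, transcribed from the second output.
d_rows = [
    ("2012-02-28", "2007-03-05", "2003-03-05", "2014-02-28"),
    ("2012-02-20", "2007-02-15", "2002-02-15", "2013-02-20"),
    ("2012-02-28", "2007-03-10", "2005-03-10", "2014-02-28"),
    ("2012-02-20", "2007-03-05", "2003-03-05", "2013-02-20"),
    ("2012-02-20", "2013-08-02", "2013-01-02", "2013-02-20"),
    ("2012-02-28", "2007-02-15", "2002-02-15", "2014-02-28"),
    ("2012-02-28", "2007-02-15", "2002-02-15", "2014-02-28"),
    ("2012-02-20", "2014-01-02", "2013-01-02", "2013-02-20"),
]

# Which source column (index into (t1, t2)) each output column should equal.
EXPECTED_SOURCE = {"ta": 0, "tb": 1, "tc": 0, "td": 1}
OUTPUT_COLUMNS = ["ta", "tb", "tc", "td"]


def count_mismatches(source_rows, output_rows):
    """Count, per output column, the rows whose date differs from its source.

    Assumes (hypothetically) that the two row lists are aligned by position.
    """
    mismatches = {name: 0 for name in OUTPUT_COLUMNS}
    for src, out in zip(source_rows, output_rows):
        for i, name in enumerate(OUTPUT_COLUMNS):
            expected = date.fromisoformat(src[EXPECTED_SOURCE[name]])
            actual = date.fromisoformat(out[i])
            if expected != actual:
                mismatches[name] += 1
    return mismatches


mismatches = count_mismatches(c_rows, d_rows)
print(mismatches)  # {'ta': 0, 'tb': 8, 'tc': 8, 'td': 0}
```

Under that row-alignment assumption, 'tb' and 'tc' are wrong in all eight rows of this particular run, while 'ta' and 'td' agree with their sources. Since the results vary between executions, another run may show a different pattern.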

Thanks in advance


> UDFs change data values
> -----------------------
>
>                 Key: SPARK-9131
>                 URL: https://issues.apache.org/jira/browse/SPARK-9131
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.4.0, 1.4.1
>         Environment: Pyspark 1.4 and 1.4.1, Redhat 6.6
>            Reporter: Luis Guerra
>            Priority: Critical


