spark-issues mailing list archives

From "Dongjoon Hyun (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-18709) Automatic null conversion bug (instead of throwing error) when creating a Spark DataFrame with incompatible types for fields.
Date Mon, 05 Dec 2016 19:31:58 GMT

     [ https://issues.apache.org/jira/browse/SPARK-18709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-18709.
-----------------------------------
    Resolution: Fixed

Sure. [~zsxwing]
I think the issue reporter, [~amogh.91], will agree to close this.

> Automatic null conversion bug (instead of throwing error) when creating a Spark DataFrame with incompatible types for fields.
> ----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-18709
>                 URL: https://issues.apache.org/jira/browse/SPARK-18709
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.2, 1.6.3
>            Reporter: Amogh Param
>              Labels: bug
>             Fix For: 2.0.0
>
>
> When converting an RDD with a `float` field to a Spark DataFrame whose schema declares that
> field as `IntegerType` or `LongType`, Spark 1.6.2 and 1.6.3 silently convert the field values
> to nulls instead of throwing an error like `LongType can not accept object ___ in type <type 'float'>`.
> However, this seems to be fixed in Spark 2.0.2.
> The following example should make the problem clear:
> {code}
> from pyspark.sql.types import StructField, StructType, LongType, DoubleType
> schema = StructType([
>         StructField("0", LongType(), True),
>         StructField("1", DoubleType(), True),
>     ])
> data = [[1.0, 1.0], [float("nan"), 2.0]]
> spark_df = sqlContext.createDataFrame(sc.parallelize(data), schema)
> spark_df.show()
> {code}
> Instead of throwing an error like:
> {code}
> LongType can not accept object 1.0 in type <type 'float'>
> {code}
> Spark converts all the values in the first column to nulls.
> Running `spark_df.show()` gives:
> {code}
> +----+---+
> |   0|  1|
> +----+---+
> |null|1.0|
> |null|1.0|
> +----+---+
> {code}
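Until an upgrade is possible, one way to get the fail-fast behaviour described above is to check the rows in plain Python before they reach `createDataFrame`. The following is only a sketch under stated assumptions: `ACCEPTED` and `check_rows` are hypothetical names, the mapping covers only the two types used in the example, and it is not Spark's own verification logic. It assumes Python 3, where integers are plain `int`.

```python
# Hypothetical mapping from a Spark type name to the Python types we choose
# to accept for it. This is an illustration, not Spark's internal table.
ACCEPTED = {"LongType": (int,), "DoubleType": (float,)}


def check_rows(rows, type_names):
    """Return rows unchanged, or raise TypeError on the first mismatched value."""
    for row in rows:
        for value, type_name in zip(row, type_names):
            if value is None:
                continue  # nullable fields accept None
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, ACCEPTED[type_name]):
                raise TypeError(
                    "%s can not accept object %r in type %s"
                    % (type_name, value, type(value))
                )
    return rows
```

Calling `check_rows(data, ["LongType", "DoubleType"])` on the `data` from the example raises a `TypeError` for the first `1.0`, because a `float` is offered for the `LongType` column.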
> In my computation, I call `mapPartitions` on a Spark DataFrame. Each partition is converted
> to a pandas DataFrame, a few computations run on it, and the result is returned as a list of
> lists, so `mapPartitions` yields an RDD covering all partitions. This RDD is then converted
> into a Spark DataFrame as in the example above, using `sqlContext.createDataFrame(rdd, schema)`.
> One column of the RDD should become a `LongType` field, but because it has missing values,
> its values are of type `float`. When Spark creates the DataFrame, it converts every value in
> that column to null instead of throwing a type-mismatch error.
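For the workflow described above, a possible workaround is to normalise the rows returned from each partition so that the intended `LongType` column holds `int` or `None` rather than `float`. This is a minimal sketch, not part of the original report: `to_long_or_none` and `normalize_rows` are hypothetical helpers, and treating NaN as a missing value is an assumption about the reporter's data.

```python
import math


def to_long_or_none(value):
    """Map a missing value to None and a whole-number float to int."""
    if value is None:
        return None
    if isinstance(value, float):
        if math.isnan(value):
            return None  # assumption: NaN marks a missing value
        if value.is_integer():
            return int(value)
        raise ValueError("cannot store %r in an integer column" % (value,))
    if isinstance(value, int):
        return value
    raise TypeError("unsupported type %s" % type(value))


def normalize_rows(rows, long_columns):
    """Apply to_long_or_none to the column positions listed in long_columns."""
    return [
        [to_long_or_none(v) if i in long_columns else v for i, v in enumerate(row)]
        for row in rows
    ]
```

With the example data, `normalize_rows(data, {0})` turns the first column into `1` and `None`, so the rows should match a schema whose first field is a nullable `LongType`.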



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

