spark-dev mailing list archives

From Dean Wampler <deanwamp...@gmail.com>
Subject Re: [sql] Dataframe how to check null values
Date Thu, 02 Apr 2015 12:51:34 GMT
I'm afraid you're a little stuck. In Scala, the types Int, Long, Float,
Double, Byte, and Boolean look like reference types in source code, but
they are compiled to the corresponding JVM primitive types, which can't be
null. That's why you get the warning about ==.
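To make the point concrete, here is a minimal sketch in plain Scala (no Spark) of two ways to get a value that can actually be tested for "missing": a boxed java.lang.Double, which is a reference type, or an Option[Double]. The object and method names are made up for illustration, and I haven't verified whether callUDF in your Spark version will pass a null through to a boxed parameter, so treat that part as an assumption.

```scala
object NullableDoubleSketch {
  // A Scala Double compiles to a JVM primitive, so `val d: Double = null`
  // does not compile. A boxed java.lang.Double is a reference type and
  // can hold null, so this comparison is meaningful.
  def orElseMean(value: java.lang.Double, meanValue: Double): Double =
    if (value == null) meanValue else value.doubleValue()

  // Option makes the missing case explicit without using null at all.
  def orElseMeanOpt(value: Option[Double], meanValue: Double): Double =
    value.getOrElse(meanValue)
}
```

Either form avoids the "will always yield false" warning, because the comparison no longer involves a primitive.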

Your best option may be to use NaN as the placeholder for null, then create
a second DF using a filter that removes those values. Use that DF to compute
the mean, then apply a map step to the original DF to replace the NaNs with
the mean.
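
The filter / mean / map steps can be sketched on an ordinary Scala collection. This is only an illustration of the shape of the computation, not DataFrame code; the object and method names are hypothetical, and the same three steps would have to be expressed with whatever DataFrame or RDD operations your Spark version provides.

```scala
object NaNMeanImputeSketch {
  // Replace every NaN in `values` with the mean of the non-NaN values.
  def imputeWithMean(values: Seq[Double]): Seq[Double] = {
    // Step 1: filter out the placeholders. Use isNaN here, because
    // `v == Double.NaN` is always false.
    val present = values.filterNot(_.isNaN)
    // Step 2: compute the mean of what is left (NaN if nothing is left).
    val meanValue =
      if (present.isEmpty) Double.NaN else present.sum / present.size
    // Step 3: map over the original data, translating NaN to the mean.
    values.map(v => if (v.isNaN) meanValue else v)
  }
}
```

With the values from your example, 1.0, 2.0, 3.0, and 5.0 average to 2.75, so each NaN would become 2.75.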

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Thu, Apr 2, 2015 at 7:54 AM, Peter Rudenko <petro.rudenko@gmail.com>
wrote:

> Hi, I need to implement a MeanImputor, which imputes missing values with
> the mean. If I set missing values to null, DataFrame aggregation works
> properly, but a UDF treats null values as 0.0. Here's an example:
>
>     val df = sc.parallelize(Array(1.0,2.0, null, 3.0, 5.0, null)).toDF
>     df.agg(avg("_1")).first //res45: org.apache.spark.sql.Row = [2.75]
>     df.withColumn("d2", callUDF({(value: Double) => value}, DoubleType, df("d"))).show()
>
>     d     d2
>     1.0   1.0
>     2.0   2.0
>     null  0.0
>     3.0   3.0
>     5.0   5.0
>     null  0.0
>
>     val df = sc.parallelize(Array(1.0,2.0, Double.NaN, 3.0, 5.0, Double.NaN)).toDF
>     df.agg(avg("_1")).first //res46: org.apache.spark.sql.Row = [Double.NaN]
>
> In the UDF I cannot compare Scala's Double to null:
>
>     comparing values of types Double and Null using `==' will always yield false
>     [warn] if (value==null) meanValue else value
>
> With Double.NaN instead of null I can compare in the UDF, but aggregation
> doesn't work properly. Maybe it's related to:
> https://issues.apache.org/jira/browse/SPARK-6573
>
> Thanks,
> Peter Rudenko
