spark-user mailing list archives

From Olivier Girardot <ssab...@gmail.com>
Subject Re: [SparkSQL 1.4.0] groupBy columns are always nullable?
Date Mon, 11 May 2015 21:11:58 GMT
I'll look into it - not sure yet what I can get out of exprs :p

On Mon, May 11, 2015 at 10:35 PM, Reynold Xin <rxin@databricks.com> wrote:

> Thanks for catching this. I didn't read carefully enough.
>
> It'd make sense to have the udaf result be non-nullable, if the exprs are
> indeed non-nullable.
>
> On Mon, May 11, 2015 at 1:32 PM, Olivier Girardot <ssaboum@gmail.com>
> wrote:
>
>> Hi Haopu,
>> actually, `key` is nullable here because that is your input's schema:
>>
>> scala> result.printSchema
>> root
>> |-- key: string (nullable = true)
>> |-- SUM(value): long (nullable = true)
>>
>> scala> df.printSchema
>> root
>> |-- key: string (nullable = true)
>> |-- value: long (nullable = false)
>>
>> I tried it with a schema where the key is not flagged as nullable, and
>> the schema is actually respected. What you can argue, however, is that
>> SUM(value) should also be non-nullable, since value is not nullable.
>>
>> @rxin do you think it would be reasonable to flag the Sum aggregation
>> function as nullable (or not) depending on the input expression's schema?
>>
>> Regards,
>>
>> Olivier.
>> On Mon, May 11, 2015 at 10:07 PM, Reynold Xin <rxin@databricks.com> wrote:
>>
>>> Not by design. Would you be interested in submitting a pull request?
>>>
>>> On Mon, May 11, 2015 at 1:48 AM, Haopu Wang <HWang@qilinsoft.com> wrote:
>>>
>>>> I am trying to get the result schema of aggregate functions using the
>>>> DataFrame API.
>>>>
>>>> However, I find that the result fields for groupBy columns are always
>>>> nullable, even when the source field is not nullable.
>>>>
>>>> I would like to know whether this is by design. Thank you! Below is some
>>>> simple code that shows the issue.
>>>>
>>>> ======
>>>>
>>>>   import sqlContext.implicits._
>>>>   import org.apache.spark.sql.functions._
>>>>   case class Test(key: String, value: Long)
>>>>   val df = sc.makeRDD(Seq(Test("k1",2),Test("k1",1))).toDF
>>>>
>>>>   val result = df.groupBy("key").agg($"key", sum("value"))
>>>>
>>>>   // From the output, you can see the "key" column is nullable, why??
>>>>   result.printSchema
>>>> //    root
>>>> //     |-- key: string (nullable = true)
>>>> //     |-- SUM(value): long (nullable = true)
>>>>
>>>>
>>>
>
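
To make the suggestion in the thread concrete, here is a minimal sketch of nullability propagating from an input expression to an aggregate. It is a toy model in plain Scala, not Spark's Catalyst code: `NullabilitySketch`, `Expr`, `Column` and `Sum` are hypothetical names invented for illustration. It also assumes every group holds at least one row, as in a groupBy; a SUM over an empty input yields NULL in SQL, and the sketch ignores that case.

```scala
object NullabilitySketch {
  // Hypothetical stand-in for an expression tree node; not a Spark class.
  sealed trait Expr { def nullable: Boolean }

  // A leaf column whose nullability comes straight from the input schema.
  final case class Column(name: String, nullable: Boolean) extends Expr

  // Assumption: the sum is nullable only when its input expression is.
  final case class Sum(child: Expr) extends Expr {
    def nullable: Boolean = child.nullable
  }

  def main(args: Array[String]): Unit = {
    val value = Column("value", nullable = false)
    val optionalValue = Column("optionalValue", nullable = true)
    println(s"SUM(value) nullable = ${Sum(value).nullable}")
    println(s"SUM(optionalValue) nullable = ${Sum(optionalValue).nullable}")
  }
}
```

With a rule like this, the `value` column from the example (a non-nullable `Long`) would give a non-nullable `SUM(value)`, while any nullable input would keep the aggregate nullable.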
