spark-dev mailing list archives

From Ted Malaska <ted.mala...@cloudera.com>
Subject Re: countByValue on dataframe with multiple columns
Date Tue, 21 Jul 2015 14:39:13 GMT
100% I would love to do it.  Who is a good person to review the design with?
All I need is a quick chat about the design and approach and I'll create
the jira and push a patch.

Ted Malaska

On Tue, Jul 21, 2015 at 10:19 AM, Olivier Girardot <
o.girardot@lateral-thoughts.com> wrote:

> Hi Ted,
> The TopNList would be great to see directly in the Dataframe API and my
> wish would be to be able to apply it on multiple columns at the same time
> and get all these statistics.
> the .describe() function is close to what we want to achieve, maybe we
> could try to enrich its output.
> Anyway, even as a spark-package, if you could package your code for
> Dataframes, that would be great.
>
> Regards,
>
> Olivier.
>
> 2015-07-21 15:08 GMT+02:00 Jonathan Winandy <jonathan.winandy@gmail.com>:
>
>> Ha ok !
>>
>> Then the generic part would have this signature:
>>
>> def countColsByValue(df: DataFrame): Map[String /* colname */, DataFrame]
>>
>>
>> +1 for more work (blog / api) for data quality checks.
>>
>> Cheers,
>> Jonathan
>>
>>
>> TopCMSParams and some other monoids from Algebird are really cool for
>> that :
>>
>> https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L590
>>
>>
>> On 21 July 2015 at 13:40, Ted Malaska <ted.malaska@cloudera.com> wrote:
>>
>>> I'm guessing you want something like what I put in this blog post.
>>>
>>>
>>> http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/
>>>
>>> This is a very common use case.  If there is a +1 I would love to add it
>>> to dataframes.
>>>
>>> Let me know
>>> Ted Malaska
>>>
>>> On Tue, Jul 21, 2015 at 7:24 AM, Olivier Girardot <
>>> o.girardot@lateral-thoughts.com> wrote:
>>>
>>>> Yop,
>>>> actually the generic part does not do what I need. countByValue on one
>>>> column gives you the count for each value seen in that column.
>>>> I would like a generic (multi-column) countByValue to give me the same
>>>> kind of output for each column, rather than treating each n-tuple of column
>>>> values as the key (which is what groupBy does by default).
>>>>
>>>> Regards,
>>>>
>>>> Olivier
>>>>
>>>> 2015-07-20 14:18 GMT+02:00 Jonathan Winandy <jonathan.winandy@gmail.com>:
>>>>
>>>>> Ahoy !
>>>>>
>>>>> Maybe you can get countByValue by using sql.GroupedData :
>>>>>
>>>>> // some DF
>>>>> val df: DataFrame = sqlContext.createDataFrame(sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)), StructType(List(StructField("n", StringType))))
>>>>>
>>>>>
>>>>> df.groupBy("n").count().show()
>>>>>
>>>>>
>>>>> // generic
>>>>> def countByValueDf(df:DataFrame) = {
>>>>>
>>>>>   val (h :: r) = df.columns.toList
>>>>>
>>>>>   df.groupBy(h, r:_*).count()
>>>>> }
>>>>>
>>>>> countByValueDf(df).show()
>>>>>
>>>>>
>>>>> Cheers,
>>>>> Jon
>>>>>
>>>>> On 20 July 2015 at 11:28, Olivier Girardot <
>>>>> o.girardot@lateral-thoughts.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>> Is there any plan to add the countByValue function to Spark SQL
>>>>>> Dataframe ?
>>>>>> Even
>>>>>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78
>>>>>> is using the RDD part right now, but for ML purposes, being able to get the
>>>>>> most frequent categorical value on multiple columns would be very useful.
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>>
>>>>>> --
>>>>>> *Olivier Girardot* | Associé
>>>>>> o.girardot@lateral-thoughts.com
>>>>>> +33 6 24 09 17 94
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> *Olivier Girardot* | Associé
>>>> o.girardot@lateral-thoughts.com
>>>> +33 6 24 09 17 94
>>>>
>>>
>>>
>>
>
>
> --
> *Olivier Girardot* | Associé
> o.girardot@lateral-thoughts.com
> +33 6 24 09 17 94
>
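
The difference discussed in the thread, between counting values per column and counting whole rows, can be sketched in plain Scala. This is only an illustration under stated assumptions: `Table`, `CountByValue`, `countRowsByValue` and `mostFrequent` are hypothetical names, and the code deliberately avoids Spark so that it runs standalone. On a real DataFrame the per-column variant would presumably issue one `df.groupBy(c).count()` per column, as in the single-column example quoted above.

```scala
// Hypothetical in-memory stand-in for a DataFrame: column names plus rows of values.
final case class Table(columns: Seq[String], rows: Seq[Seq[String]])

object CountByValue {
  // Per-column counts: one "value -> count" map for each column.
  // This is the output shape requested in the thread.
  def countColsByValue(t: Table): Map[String, Map[String, Int]] =
    t.columns.zipWithIndex.map { case (name, i) =>
      name -> t.rows.groupBy(row => row(i)).map { case (v, rs) => v -> rs.size }
    }.toMap

  // Whole-row counts: each tuple of column values is the key.
  // This mirrors what grouping by all columns at once produces.
  def countRowsByValue(t: Table): Map[Seq[String], Int] =
    t.rows.groupBy(identity).map { case (row, rs) => row -> rs.size }

  // Most frequent value per column; columns with no rows are skipped.
  // Ties are broken arbitrarily.
  def mostFrequent(t: Table): Map[String, String] =
    countColsByValue(t).collect {
      case (name, counts) if counts.nonEmpty => name -> counts.maxBy(_._2)._1
    }
}
```

With a two-column table, `countColsByValue` returns one small count map per column, while `countRowsByValue` returns a single map keyed by full rows, which is why grouping by every column does not answer the original question.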
