spark-user mailing list archives

From Patrick <titlibat...@gmail.com>
Subject Re: Collecting Multiple Aggregation query result on one Column as collectAsMap
Date Mon, 28 Aug 2017 18:54:30 GMT
OK, I see there is a describe() function that computes statistics on a
Dataset, similar to StatCounter. However, I don't want to restrict my
aggregations to the standard mean, stddev, etc. I want to generate some
custom stats, or run only a subset of the predefined stats on a
particular column.
I was thinking that if we need to write some custom code that does this in
one action (job), that would work for me.
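
A minimal sketch of what such custom code could look like, in plain Java with no Spark dependency. The class name ColumnStats, its methods, and the "range" stat are hypothetical and not part of any Spark API. The add/merge pair is shaped so that it could in principle serve as the per-record and per-partition functions of an RDD aggregate() call, but that wiring is an assumption and is not shown here.

```java
import java.io.Serializable;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical mergeable accumulator: one pass over the data, many stats. */
final class ColumnStats implements Serializable {
    private long count = 0;
    private double sum = 0.0;
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;

    /** Fold one value into the running state (the per-record step). */
    ColumnStats add(double v) {
        count++;
        sum += v;
        min = Math.min(min, v);
        max = Math.max(max, v);
        return this;
    }

    /** Combine two partial results (the per-partition merge step). */
    ColumnStats merge(ColumnStats other) {
        count += other.count;
        sum += other.sum;
        min = Math.min(min, other.min);
        max = Math.max(max, other.max);
        return this;
    }

    /** Report only the requested stats, keyed by operation name, in request order. */
    Map<String, String> toMap(List<String> ops) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String op : ops) {
            switch (op) {
                case "min":   out.put(op, String.valueOf(min)); break;
                case "max":   out.put(op, String.valueOf(max)); break;
                case "mean":  out.put(op, String.valueOf(sum / count)); break;
                case "range": out.put(op, String.valueOf(max - min)); break; // example custom stat
                default: throw new IllegalArgumentException("unknown op: " + op);
            }
        }
        return out;
    }
}
```

Because the result map is keyed by operation name, there is no need to know the order in which the stats were computed, and stats that were not requested are simply left out.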



On Tue, Aug 29, 2017 at 12:02 AM, Georg Heiler <georg.kf.heiler@gmail.com>
wrote:

> RDD only
> Patrick <titlibatali@gmail.com> wrote on Mon, 28 Aug 2017 at 20:13:
>
>> Ah, does it work with the Dataset API, or do I need to convert it to an RDD first?
>>
>> On Mon, Aug 28, 2017 at 10:40 PM, Georg Heiler <georg.kf.heiler@gmail.com> wrote:
>>
>>> What about the RDD StatCounter?
>>> https://spark.apache.org/docs/0.6.2/api/core/spark/util/StatCounter.html
>>>
>>> Patrick <titlibatali@gmail.com> wrote on Mon, 28 Aug 2017 at 16:47:
>>>
>>>> Hi
>>>>
>>>> I have two lists:
>>>>
>>>>
>>>>    - List one: contains names of columns on which I want to do
>>>>    aggregate operations.
>>>>    - List two: contains the aggregate operations I want to perform on
>>>>    each column, e.g. (min, max, mean)
>>>>
>>>> I am trying to use the Spark 2.0 Dataset API to achieve this. Spark
>>>> provides an agg() to which you can pass a Map<String,String> (of column
>>>> name to aggregate operation) as input. However, I want to perform
>>>> different aggregate operations on the same column of the data and
>>>> collect the result in a Map<String,String> where the key is the aggregate
>>>> operation and the value is the result for that column. If I add
>>>> different agg() operations for the same column, the key gets updated
>>>> with the latest value.
>>>>
>>>> Also, I can't find any collectAsMap() operation that returns a map with
>>>> the aggregated column name as key and the result as value. I get
>>>> collectAsList(), but I don't know the order in which those agg()
>>>> operations are run, so how do I match which list value corresponds to
>>>> which agg operation? I am able to see the result using .show(), but how
>>>> can I collect the result in this case?
>>>>
>>>> Is it possible to do different aggregations on the same column in one
>>>> job (i.e. only one collect operation) using the agg() operation?
>>>>
>>>>
>>>> Thanks in advance.
>>>>
>>>>
>>
