asterixdb-dev mailing list archives

From Preston Carman <prest...@apache.org>
Subject Re: Creating aggregate functions
Date Tue, 25 Jul 2017 01:24:24 GMT
When dealing with aggregates and query plans, I find it helpful to
think about how the aggregate will work in a distributed environment.
The AsterixDB compiler makes optimizations based on how the data is
partitioned. If the data is unpartitioned, then a single aggregate
operator and function can calculate the result. If the data is
partitioned, then sending all the data to a single node for
processing is not very efficient. Instead, the aggregate process
can be split into two steps: AsterixDB optimizes the query by
running a process on each partition locally and then sending an
intermediate result to a single node to create the final aggregate
result.

COUNT
In the case of count, the local process is COUNT, but the global
aggregate process is SUM. We do not want to count the responses from
the partitions, but rather sum the local count values.
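
As a rough illustration of the COUNT case, here is a minimal Python
sketch (not AsterixDB code). The names local_count and global_count
and the sample partitions are made up for this example; they only
mirror the local COUNT / global SUM split described above.

```python
from typing import Iterable, List

def local_count(partition: Iterable[object]) -> int:
    """Local step: count the records held by one partition."""
    return sum(1 for _ in partition)

def global_count(local_counts: Iterable[int]) -> int:
    """Global step: SUM the per-partition counts (do not count them)."""
    return sum(local_counts)

# Hypothetical data spread over three partitions; one partition is empty.
partitions: List[List[str]] = [["a", "b", "c"], ["d"], []]

local_results = [local_count(p) for p in partitions]
total = global_count(local_results)
```

Counting the local responses instead of summing them would return the
number of partitions (3 here) rather than the number of records (4).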

AVG
In the count case, we use a completely separate aggregate function for
the global step. Now consider AVG: to compute the average you need to
know both the count and the sum. In this case the local functions find
both the count and the sum. These values are then passed to a global
aggregate function, which uses these local results to calculate the
final average.
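
The same idea for AVG, again as a hedged Python sketch rather than the
actual AsterixDB implementation. The function names below loosely echo
the LOCAL_AVG, INTERMEDIATE_AVG and GLOBAL_AVG identifiers mentioned
in this thread, but the signatures and the (sum, count) tuple
representation are assumptions made for illustration.

```python
from typing import Iterable, List, Optional, Tuple

# Partial state carried between steps: (sum, count). This shape is assumed.
Partial = Tuple[float, int]

def local_avg(partition: Iterable[float]) -> Partial:
    """Local step: compute the sum and count for one partition."""
    total, count = 0.0, 0
    for value in partition:
        total += value
        count += 1
    return total, count

def intermediate_avg(partials: Iterable[Partial]) -> Partial:
    """Intermediate step: merge partial (sum, count) pairs into one pair."""
    total, count = 0.0, 0
    for part_sum, part_count in partials:
        total += part_sum
        count += part_count
    return total, count

def global_avg(partials: Iterable[Partial]) -> Optional[float]:
    """Global step: turn the merged sum and count into the final average."""
    total, count = intermediate_avg(partials)
    return total / count if count else None

# Hypothetical data on two partitions of different sizes.
partitions: List[List[float]] = [[1.0, 2.0, 3.0], [10.0]]
partials = [local_avg(p) for p in partitions]
average = global_avg(partials)
```

Passing the sum and count forward matters because averaging the
per-partition averages (2.0 and 10.0) would give 6.0, while the
average over all four values is 4.0.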

Take a look at the query plans for a COUNT and AVG query. The
optimized query plan will show you the two aggregate operators.

As you look at the code, AVG would probably be more informative about
the full aggregation workflow.


On Mon, Jul 24, 2017 at 8:28 AM, Riyafa Abdul Hameed
<riyafa.12@cse.mrt.ac.lk> wrote:
> On 23 July 2017 at 22:59, Yingyi Bu <buyingyi@gmail.com> wrote:
>
>> >> I see AVG, LOCAL_AVG, INTERMEDIATE_AVG and GLOBAL_AVG.
>>
>> AVG:  that's the local function in the local plan.
>> LOCAL_AVG, INTERMEDIATE_AVG and GLOBAL_AVG:   think about distributed
>> computation of average.  LOCAL_AVG aggregates the sum/count at the local
>> data source, INTERMEDIATE_AVG aggregates the sum/count over partially
>> aggregated sums/counts, and GLOBAL_AVG computes the final average value
>> from intermediate sums/counts.
>>
>
> How do we decide if we need these descriptors? COUNT seems to have only
> one descriptor
>
>
>>
>> Best,
>> Yingyi
>>
>>
>> On Sat, Jul 22, 2017 at 9:43 PM, Riyafa Abdul Hameed <
>> riyafa.12@cse.mrt.ac.lk> wrote:
>>
>> > Hi,
>> >
>> > Thanks for the explanation.
>> > But there are so many things I still don't understand. One of them is for
>> > the avg function itself there are several FunctionIdentifiers. What do they
>> > all mean?
>> >
>> > I see AVG, LOCAL_AVG, INTERMEDIATE_AVG and GLOBAL_AVG.
>> >
>> > What do they all mean?
>> > Please help
>> >
>> > On 19 July 2017 at 21:56, Yingyi Bu <buyingyi@gmail.com> wrote:
>> >
>> > > Hi Riyafa,
>> > >
>> > >    >> ScalarCountAggregateDescriptor
>> > >   It's used for counting a scalar array that appears inside a tuple.
>> > >   For example:
>> > >   SELECT u.id, array_count(u.friends)
>> > >   FROM users u;
>> > >
>> > >    >> SerializableCountAggregateDescriptor
>> > >    Serialized aggregation descriptor implementations are only used in
>> > > hash-based group-by.
>> > >    For example:
>> > >    SELECT u.city, count(*)
>> > >    FROM users u
>> > >    /*+ hash */
>> > >    GROUP BY u.city;
>> > >
>> > >   If your aggregation function doesn't have a fixed-byte-sized state, you
>> > > don't need to worry about that or implement that.
>> > >
>> > >    >> CountAggregateDescriptor
>> > >    This is used in group-by or global aggregate:
>> > >    For example:
>> > >    SELECT u.city, count(*)
>> > >    FROM users u
>> > >    GROUP BY u.city;
>> > >
>> > >    SELECT count(*) FROM users;
>> > >
>> > >
>> > > Best,
>> > > Yingyi
>> > >
>> > >
>> > > On Wed, Jul 19, 2017 at 7:55 AM, Riyafa Abdul Hameed <
>> riyafa@apache.org>
>> > > wrote:
>> > >
>> > > > Hi again,
>> > > >
>> > > > Any suggestions on this? Or anyone I can reach to who are not on this list
>> > > > or not active on the list?
>> > > >
>> > > > Thank you.
>> > > >
>> > > > On 17 July 2017 at 17:18, Riyafa Abdul Hameed <riyafa@apache.org>
>> > wrote:
>> > > >
>> > > > > Hi again,
>> > > > >
>> > > > > I think I can understand how to write the descriptor in the packages:
>> > > > > org.apache.asterix.runtime.aggregates.std and
>> > > > > org.apache.asterix.runtime.aggregates.scalar.
>> > > > > But I am not sure I understand how to write the descriptor in the package:
>> > > > > org.apache.asterix.runtime.aggregates.serializable.std because it
>> > > > > requires setting a state in the init function that doesn't seem to have a
>> > > > > pattern in the other descriptors.
>> > > > > Also I don't seem to understand the reasons for implementing each of these
>> > > > > descriptors for the aggregate functions.
>> > > > >
>> > > > > On 17 July 2017 at 16:56, Riyafa Abdul Hameed <
>> > riyafa.12@cse.mrt.ac.lk
>> > > >
>> > > > > wrote:
>> > > > >
>> > > > >> Hi all,
>> > > > >>
>> > > > >> I meant any explanation on the implementation of aggregate functions in
>> > > > >> AsterixDB would be highly appreciated.
>> > > > >>
>> > > > >> Thank you.
>> > > > >> Yours sincerely,
>> > > > >> Riyafa
>> > > > >>
>> > > > >> On 16 July 2017 at 08:01, Riyafa Abdul Hameed <riyafa@apache.org>
>> > > > wrote:
>> > > > >>
>> > > > >>> Dear all,
>> > > > >>>
>> > > > >>> I am trying to create aggregate functions and I see there are more than
>> > > > >>> one function descriptors for one single function.
>> > > > >>> For example the function array_count(collection) has the following
>> > > > >>> descriptors:
>> > > > >>>
>> > > > >>>
>> > > > >>>    - ScalarCountAggregateDescriptor
>> > > > >>>    - SerializableCountAggregateDescriptor
>> > > > >>>    - CountAggregateDescriptor
>> > > > >>>
>> > > > >>> I am not sure I understand the difference between each of these. Can you
>> > > > >>> please provide an example or point me to a documentation entry to learn
>> > > > >>> how to properly implement aggregate functions?
>> > > > >>>
>> > > > >>> The function I am trying to implement is ST_Extent.
>> > > > >>> <https://postgis.net/docs/manual-1.4/ST_Extent.html>
>> > > > >>>
>> > > > >>> Thank you.
>> > > > >>>
>> > > > >>> Yours sincerely,
>> > > > >>>
>> > > > >>> Riyafa
>> > > > >>>
>> > > > >>
>> > > > >>
>> > > > >>
>> > > > >> --
>> > > > >> Riyafa Abdul Hameed
>> > > > >> Undergraduate, University of Moratuwa
>> > > > >>
>> > > > >> Email: riyafa.12@cse.mrt.ac.lk
>> > > > >> Website: https://riyafa.wordpress.com/ <
>> > http://riyafa.wordpress.com/>
>> > > > >> <http://facebook.com/riyafa.ahf>  <http://lk.linkedin.com/in/
>> riyafa
>> > >
>> > > > >> <http://twitter.com/Riyafa1>
>> > > > >>
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>> >
>> >
>> > --
>> > Riyafa Abdul Hameed
>> > Undergraduate, University of Moratuwa
>> >
>> > Email: riyafa.12@cse.mrt.ac.lk
>> > Website: https://riyafa.wordpress.com/ <http://riyafa.wordpress.com/>
>> > <http://facebook.com/riyafa.ahf>  <http://lk.linkedin.com/in/riyafa>
>> > <http://twitter.com/Riyafa1>
>> >
>>
>
>
>
> --
> Riyafa Abdul Hameed
> Undergraduate, University of Moratuwa
>
> Email: riyafa.12@cse.mrt.ac.lk
> Website: https://riyafa.wordpress.com/ <http://riyafa.wordpress.com/>
> <http://facebook.com/riyafa.ahf>  <http://lk.linkedin.com/in/riyafa>
> <http://twitter.com/Riyafa1>
