tajo-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Atri Sharma <atri.j...@gmail.com>
Subject Re: Parallel Aggregates
Date Fri, 19 Jun 2015 04:27:48 GMT
Thanks!
On 18 Jun 2015 21:24, "Jaehwa Jung" <blrunner@apache.org> wrote:

> Hi Atri
>
> I think that following articles would be helpful for you to understand tajo
> architecture.
>
> - Tajo query execution and scheduling sequence : http://jaso.co.kr/501
> - Tajo stage flow and StorageManager: http://jaso.co.kr/503
>
> For the reference, above articles had been written by Hyoungjun Kim who is
> Tajo PMC. :)
>
> Cheers
> Jaehwa
>
> 2015-06-18 20:21 GMT+09:00 Jihoon Son <jihoonson@apache.org>:
>
> > It seems that there aren't ongoing issues for distinct aggregation.
> >
> > 2015년 6월 18일 (목) 오전 10:50, Atri Sharma <atri.jiit@gmail.com>님이
작성:
> >
> > > And for DISTINCT issues?
> > > On 18 Jun 2015 15:16, "Jihoon Son" <ghoonson@gmail.com> wrote:
> > >
> > > > As far as I know, TAJO-256, TAJO-259, and TAJO-1420 are issues for
> data
> > > > cube and grouping sets.
> > > > You can create any issues if you want. Even though some issues can be
> > > > duplicated, it's ok.
> > > > 2015년 6월 18일 (목) 오전 10:39, Atri Sharma <atri.jiit@gmail.com>님이
작성:
> > > >
> > > > > Do we have a ticket around that?
> > > > > On 18 Jun 2015 15:07, "Jihoon Son" <jihoonson@apache.org> wrote:
> > > > >
> > > > > > It looks good to start.
> > > > > > Any questions welcome!
> > > > > >
> > > > > > Jihoon
> > > > > >
> > > > > > 2015년 6월 18일 (목) 오전 3:39, Atri Sharma <atri.jiit@gmail.com>님이
> 작성:
> > > > > >
> > > > > > > So distinct aggregation is one area, thanks.
> > > > > > >
> > > > > > > I am trying to get enough knowledge of Internals of aggregation
> > > > engine
> > > > > > and
> > > > > > > query planner to be able to work on rollup and cube so
picking
> > > > smaller
> > > > > > > tickets first.
> > > > > > > On 18 Jun 2015 02:42, "Jihoon Son" <jihoonson@apache.org>
> wrote:
> > > > > > >
> > > > > > > > As far as I know, there aren't any plans for improvement
> except
> > > in
> > > > > > > distinct
> > > > > > > > aggregation. I think that our code for distinct aggregation
> is
> > > too
> > > > > > > > complicated, and the performance also should be improved.
> > > > > > > >
> > > > > > > > So, when you design the implementation of your algorithm
on
> > Tajo,
> > > > you
> > > > > > > don't
> > > > > > > > have to consider distinct aggregation part, I think.
> > > > > > > >
> > > > > > > > 2015년 6월 18일 (목) 오전 2:16, Atri Sharma
<atri.jiit@gmail.com
> >님이
> > > 작성:
> > > > > > > >
> > > > > > > > > Thank you.
> > > > > > > > >
> > > > > > > > > Is there any improvement in aggregates that we
are looking
> at
> > > > > please?
> > > > > > > > > On 16 Jun 2015 17:07, "Jihoon Son" <jihoonson@apache.org>
> > > wrote:
> > > > > > > > >
> > > > > > > > > > In Tajo, aggregation is very similar to
that in Hadoop
> > > > MapReduce.
> > > > > > > > > > Let me consider an example. Given a query
of "select *k*,
> > > > > count(*)
> > > > > > > from
> > > > > > > > > *t*
> > > > > > > > > > group by *k*", Tajo generates a LogicalPlan
as follows.
> > > > > > > > > >
> > > > > > > > > > group by (k)
> > > > > > > > > >        |
> > > > > > > > > >    scan (t)
> > > > > > > > > >
> > > > > > > > > > This LogicalPlan is translated into a MasterPlan
as
> > follows.
> > > > > > > > > >
> > > > > > > > > > -----------------
> > > > > > > > > >      Stage2
> > > > > > > > > >   group by *k*
> > > > > > > > > > -----------------
> > > > > > > > > >           |
> > > > > > > > > > shuffle tuples with *k*
> > > > > > > > > >           |
> > > > > > > > > > -----------------
> > > > > > > > > >      Stage1
> > > > > > > > > >   group by *k*
> > > > > > > > > >          |
> > > > > > > > > >     scan *t*
> > > > > > > > > > -----------------
> > > > > > > > > >
> > > > > > > > > > As you can see in this example, the query
plan consists
> of
> > 2
> > > > > > stages.
> > > > > > > > Each
> > > > > > > > > > stage is executed subsequently because the
result of
> Stage
> > 1
> > > is
> > > > > > used
> > > > > > > as
> > > > > > > > > the
> > > > > > > > > > input of Stage 2. Each stage is divided
into multiple
> tasks
> > > for
> > > > > > each
> > > > > > > > > input
> > > > > > > > > > split as follows.
> > > > > > > > > >
> > > > > > > > > > Stage1
> > > > > > > > > >
> > > > > > > > > > Task 1
> > > > > > > > > > group by *k*
> > > > > > > > > >        |
> > > > > > > > > >   scan *t* (0 - 99)
> > > > > > > > > >
> > > > > > > > > > Task 2
> > > > > > > > > > group by *k*
> > > > > > > > > >        |
> > > > > > > > > >   scan *t* (100 - 199)
> > > > > > > > > > ...
> > > > > > > > > >
> > > > > > > > > > Each task is executed by a TajoWorker. As
you can see,
> > tasks
> > > of
> > > > > the
> > > > > > > > first
> > > > > > > > > > stage execute a local aggregation after
scanning input
> > split.
> > > > > This
> > > > > > > > local
> > > > > > > > > > aggregation result is shuffled among TajoWorkers
with the
> > > > > > aggregation
> > > > > > > > key
> > > > > > > > > > *k*. Then, the final aggregation is computed
at the
> second
> > > > stage.
> > > > > > > > > >
> > > > > > > > > > Stage1 and Stage2 are similar to Map and
Reduce of
> > MapReduce.
> > > > The
> > > > > > > local
> > > > > > > > > > aggregation of Stage1 is similar to the
Combiner of
> Hadoop
> > > > > > MapReduce.
> > > > > > > > > >
> > > > > > > > > > I hope that this will be helpful to you.
> > > > > > > > > > If you have any further questions, please
feel free to
> ask.
> > > > > > > > > > Jihoon
> > > > > > > > > >
> > > > > > > > > > 2015년 6월 16일 (화) 오전 7:28, Atri
Sharma <
> atri.jiit@gmail.com
> > > >님이
> > > > > 작성:
> > > > > > > > > >
> > > > > > > > > > Thanks.
> > > > > > > > > > >
> > > > > > > > > > > What are your thoughts on parallel
aggregation?
> > Generating
> > > > > query
> > > > > > > > plans
> > > > > > > > > > that
> > > > > > > > > > > allow states to be generated which
can be executed
> > > > > independently
> > > > > > > and
> > > > > > > > > then
> > > > > > > > > > > states recombined?
> > > > > > > > > > > On 16 Jun 2015 05:25, "Jihoon Son"
<
> jihoonson@apache.org
> > >
> > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Atri, thanks for your question.
> > > > > > > > > > > >
> > > > > > > > > > > > First of all, maybe you already
did, I recommend that
> > you
> > > > > read
> > > > > > > this
> > > > > > > > > > > article
> > > > > > > > > > > > <
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://www.hadoopsphere.com/2015/02/technical-deep-dive-into-apache-tajo.html
> > > > > > > > > > > > >
> > > > > > > > > > > > before you start implementation.
This is written by
> > > > Hyunsik,
> > > > > > and
> > > > > > > > > > contains
> > > > > > > > > > > > the description of Tajo's overall
infrastructure.
> > > > > Afterwards, I
> > > > > > > > think
> > > > > > > > > > > that
> > > > > > > > > > > > you may ask more detailed question.
> > > > > > > > > > > >
> > > > > > > > > > > > Here, I'll roughly list some important
classes for
> > > > aggregate
> > > > > > > > > > > > implementation.
> > > > > > > > > > > >
> > > > > > > > > > > >    - SQLParser.g4 contains our
SQL parsing rules. It
> is
> > > > > written
> > > > > > > in
> > > > > > > > > > antlr.
> > > > > > > > > > > >    - SQLAnalyzer is our parser
based on rules defined
> > at
> > > > > > > > > SQLParser.g4.
> > > > > > > > > > > >    - SQLAnalyzer translates a
SQL query into a tree
> of
> > > Expr
> > > > > > which
> > > > > > > > > > > >    represents an algebraic expression.
> > > > > > > > > > > >    - LogicalPlanner translates
the Expr tree into a
> > > > > LogicalPlan
> > > > > > > > that
> > > > > > > > > > > >    logically describes how the
given query will be
> > > > executed.
> > > > > > > > > > > >    - GlobalPlanner translates
the LogicalPlan into a
> > > > > MasterPlan
> > > > > > > > > > > >    (distributed query execution
plan) that describes
> > how
> > > > the
> > > > > > > given
> > > > > > > > > > query
> > > > > > > > > > > > will
> > > > > > > > > > > >    be executed in distributed
cluster.
> > > > > > > > > > > >    - Once a MasterPlan is created,
QueryMaster starts
> > to
> > > > > > execute
> > > > > > > > > query
> > > > > > > > > > > >    processing. A query consists
of multiple stages,
> > which
> > > > are
> > > > > > > > > > > individually
> > > > > > > > > > > >    processed in some order.
> > > > > > > > > > > >       - For example, a simple
aggregation query is
> > > executed
> > > > > in
> > > > > > > two
> > > > > > > > > > > stages,
> > > > > > > > > > > >       each of which is for parallel
aggregation and
> > > > combining
> > > > > > > > > > aggregates.
> > > > > > > > > > > > These
> > > > > > > > > > > >       stages are executed sequentially.
> > > > > > > > > > > >    - A stage is concurrently processed
by multiple
> > tasks,
> > > > and
> > > > > > is
> > > > > > > > > > executed
> > > > > > > > > > > >    by TajoWorker.
> > > > > > > > > > > >    - Each task contains meta information
for input
> data
> > > > and a
> > > > > > > > > > LogicalPlan
> > > > > > > > > > > >    of the stage. This LogicalPlan
is translated into
> > > > > > PhysicalExec
> > > > > > > > by
> > > > > > > > > > > >    PhysicalPlanner.
> > > > > > > > > > > >    - PhysicalExec describes how
the query is actually
> > > > > executed.
> > > > > > > > > > > >       - For example, there are
two types of
> > > > AggregationExec,
> > > > > > > > > > > >       i.e., HashAggregateExec
and SortAggregateExec,
> > for
> > > > > > > hash-based
> > > > > > > > > > > > aggregation
> > > > > > > > > > > >       and sort-based aggregation,
respectively.
> > > > > > > > > > > >
> > > > > > > > > > > > Best regards,
> > > > > > > > > > > > Jihoon
> > > > > > > > > > > >
> > > > > > > > > > > > 2015년 6월 15일 (월) 오후
11:32, Atri Sharma <
> > > > atri.jiit@gmail.com
> > > > > >님이
> > > > > > > 작성:
> > > > > > > > > > > >
> > > > > > > > > > > > > Folks,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I am looking into parallel
aggregates/combining
> > > > > aggregates. I
> > > > > > > > have
> > > > > > > > > a
> > > > > > > > > > > plan
> > > > > > > > > > > > > around it which I think can
work.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Please update me on current
infrastructure and
> point
> > me
> > > > > > around
> > > > > > > > the
> > > > > > > > > > > > existing
> > > > > > > > > > > > > code base. Also, ideas would
be most welcome around
> > it.
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Regards,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Atri
> > > > > > > > > > > > > *l'apprenant*
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message