tajo-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jihoon Son <ghoon...@gmail.com>
Subject Re: Parallel Aggregates
Date Thu, 18 Jun 2015 09:46:10 GMT
As far as I know, TAJO-256, TAJO-259, and TAJO-1420 are issues for data
cube and grouping sets.
You can create any issues if you want. Even though some issues can be
duplicated, it's ok.
2015년 6월 18일 (목) 오전 10:39, Atri Sharma <atri.jiit@gmail.com>님이 작성:

> Do we have a ticket around that?
> On 18 Jun 2015 15:07, "Jihoon Son" <jihoonson@apache.org> wrote:
>
> > It looks good to start.
> > Any questions welcome!
> >
> > Jihoon
> >
> > 2015년 6월 18일 (목) 오전 3:39, Atri Sharma <atri.jiit@gmail.com>님이
작성:
> >
> > > So distinct aggregation is one area, thanks.
> > >
> > > I am trying to get enough knowledge of Internals of aggregation engine
> > and
> > > query planner to be able to work on rollup and cube so picking smaller
> > > tickets first.
> > > On 18 Jun 2015 02:42, "Jihoon Son" <jihoonson@apache.org> wrote:
> > >
> > > > As far as I know, there aren't any plans for improvement except in
> > > distinct
> > > > aggregation. I think that our code for distinct aggregation is too
> > > > complicated, and the performance also should be improved.
> > > >
> > > > So, when you design the implementation of your algorithm on Tajo, you
> > > don't
> > > > have to consider distinct aggregation part, I think.
> > > >
> > > > 2015년 6월 18일 (목) 오전 2:16, Atri Sharma <atri.jiit@gmail.com>님이
작성:
> > > >
> > > > > Thank you.
> > > > >
> > > > > Is there any improvement in aggregates that we are looking at
> please?
> > > > > On 16 Jun 2015 17:07, "Jihoon Son" <jihoonson@apache.org> wrote:
> > > > >
> > > > > > In Tajo, aggregation is very similar to that in Hadoop MapReduce.
> > > > > > Let me consider an example. Given a query of "select *k*,
> count(*)
> > > from
> > > > > *t*
> > > > > > group by *k*", Tajo generates a LogicalPlan as follows.
> > > > > >
> > > > > > group by (k)
> > > > > >        |
> > > > > >    scan (t)
> > > > > >
> > > > > > This LogicalPlan is translated into a MasterPlan as follows.
> > > > > >
> > > > > > -----------------
> > > > > >      Stage2
> > > > > >   group by *k*
> > > > > > -----------------
> > > > > >           |
> > > > > > shuffle tuples with *k*
> > > > > >           |
> > > > > > -----------------
> > > > > >      Stage1
> > > > > >   group by *k*
> > > > > >          |
> > > > > >     scan *t*
> > > > > > -----------------
> > > > > >
> > > > > > As you can see in this example, the query plan consists of 2
> > stages.
> > > > Each
> > > > > > stage is executed subsequently because the result of Stage 1
is
> > used
> > > as
> > > > > the
> > > > > > input of Stage 2. Each stage is divided into multiple tasks
for
> > each
> > > > > input
> > > > > > split as follows.
> > > > > >
> > > > > > Stage1
> > > > > >
> > > > > > Task 1
> > > > > > group by *k*
> > > > > >        |
> > > > > >   scan *t* (0 - 99)
> > > > > >
> > > > > > Task 2
> > > > > > group by *k*
> > > > > >        |
> > > > > >   scan *t* (100 - 199)
> > > > > > ...
> > > > > >
> > > > > > Each task is executed by a TajoWorker. As you can see, tasks
of
> the
> > > > first
> > > > > > stage execute a local aggregation after scanning input split.
> This
> > > > local
> > > > > > aggregation result is shuffled among TajoWorkers with the
> > aggregation
> > > > key
> > > > > > *k*. Then, the final aggregation is computed at the second stage.
> > > > > >
> > > > > > Stage1 and Stage2 are similar to Map and Reduce of MapReduce.
The
> > > local
> > > > > > aggregation of Stage1 is similar to the Combiner of Hadoop
> > MapReduce.
> > > > > >
> > > > > > I hope that this will be helpful to you.
> > > > > > If you have any further questions, please feel free to ask.
> > > > > > Jihoon
> > > > > >
> > > > > > 2015년 6월 16일 (화) 오전 7:28, Atri Sharma <atri.jiit@gmail.com>님이
> 작성:
> > > > > >
> > > > > > Thanks.
> > > > > > >
> > > > > > > What are your thoughts on parallel aggregation? Generating
> query
> > > > plans
> > > > > > that
> > > > > > > allow states to be generated which can be executed
> independently
> > > and
> > > > > then
> > > > > > > states recombined?
> > > > > > > On 16 Jun 2015 05:25, "Jihoon Son" <jihoonson@apache.org>
> wrote:
> > > > > > >
> > > > > > > > Hi Atri, thanks for your question.
> > > > > > > >
> > > > > > > > First of all, maybe you already did, I recommend that
you
> read
> > > this
> > > > > > > article
> > > > > > > > <
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://www.hadoopsphere.com/2015/02/technical-deep-dive-into-apache-tajo.html
> > > > > > > > >
> > > > > > > > before you start implementation. This is written by
Hyunsik,
> > and
> > > > > > contains
> > > > > > > > the description of Tajo's overall infrastructure.
> Afterwards, I
> > > > think
> > > > > > > that
> > > > > > > > you may ask more detailed question.
> > > > > > > >
> > > > > > > > Here, I'll roughly list some important classes for
aggregate
> > > > > > > > implementation.
> > > > > > > >
> > > > > > > >    - SQLParser.g4 contains our SQL parsing rules.
It is
> written
> > > in
> > > > > > antlr.
> > > > > > > >    - SQLAnalyzer is our parser based on rules defined
at
> > > > > SQLParser.g4.
> > > > > > > >    - SQLAnalyzer translates a SQL query into a tree
of Expr
> > which
> > > > > > > >    represents an algebraic expression.
> > > > > > > >    - LogicalPlanner translates the Expr tree into
a
> LogicalPlan
> > > > that
> > > > > > > >    logically describes how the given query will be
executed.
> > > > > > > >    - GlobalPlanner translates the LogicalPlan into
a
> MasterPlan
> > > > > > > >    (distributed query execution plan) that describes
how the
> > > given
> > > > > > query
> > > > > > > > will
> > > > > > > >    be executed in distributed cluster.
> > > > > > > >    - Once a MasterPlan is created, QueryMaster starts
to
> > execute
> > > > > query
> > > > > > > >    processing. A query consists of multiple stages,
which are
> > > > > > > individually
> > > > > > > >    processed in some order.
> > > > > > > >       - For example, a simple aggregation query is
executed
> in
> > > two
> > > > > > > stages,
> > > > > > > >       each of which is for parallel aggregation and
combining
> > > > > > aggregates.
> > > > > > > > These
> > > > > > > >       stages are executed sequentially.
> > > > > > > >    - A stage is concurrently processed by multiple
tasks, and
> > is
> > > > > > executed
> > > > > > > >    by TajoWorker.
> > > > > > > >    - Each task contains meta information for input
data and a
> > > > > > LogicalPlan
> > > > > > > >    of the stage. This LogicalPlan is translated into
> > PhysicalExec
> > > > by
> > > > > > > >    PhysicalPlanner.
> > > > > > > >    - PhysicalExec describes how the query is actually
> executed.
> > > > > > > >       - For example, there are two types of AggregationExec,
> > > > > > > >       i.e., HashAggregateExec and SortAggregateExec,
for
> > > hash-based
> > > > > > > > aggregation
> > > > > > > >       and sort-based aggregation, respectively.
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Jihoon
> > > > > > > >
> > > > > > > > 2015년 6월 15일 (월) 오후 11:32, Atri Sharma
<atri.jiit@gmail.com
> >님이
> > > 작성:
> > > > > > > >
> > > > > > > > > Folks,
> > > > > > > > >
> > > > > > > > > I am looking into parallel aggregates/combining
> aggregates. I
> > > > have
> > > > > a
> > > > > > > plan
> > > > > > > > > around it which I think can work.
> > > > > > > > >
> > > > > > > > > Please update me on current infrastructure and
point me
> > around
> > > > the
> > > > > > > > existing
> > > > > > > > > code base. Also, ideas would be most welcome
around it.
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Regards,
> > > > > > > > >
> > > > > > > > > Atri
> > > > > > > > > *l'apprenant*
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message