ignite-dev mailing list archives

From Dmitriy Pavlov <dpav...@apache.org>
Subject Re: Calcite based SQL query engine. Local queries
Date Fri, 08 Nov 2019 13:12:40 GMT
Yes, I understand that it is a straightforward and maybe naive approach.
That is why I'm asking how to do map-reduce on cache C data in Ignite with
proper partition pinning.

About predefined/implemented aggregates: I'm not sure I agree that we can
predict everything. It is a real perk of Ignite that you can send any of your
code (which, BTW, can be developed during the lifetime of the system) to your
data.

So I propose that the map and reduce phases should allow user code to be
executed. If I knew of a better approach, I would document it somehow (e.g.
add it to an upcoming training/workshop).
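
A minimal sketch of the shape proposed above, in plain Java with no Ignite
API involved: a user-supplied map function runs once per partition and a
user-supplied reduce function combines the partial results. The names
PartitionMapReduce and mapReduce are hypothetical, and the in-memory map of
partition id to values is only a stand-in for properly pinned partitions;
how to get that pinning in Ignite is the open question here.

```java
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.function.Function;

public class PartitionMapReduce {
    /**
     * Applies the user's mapper to each partition exactly once, then folds
     * the partial results with the user's reducer.
     * Assumption: each partition id appears once in the map, which models
     * pinned partitions (no partition is moved or scanned twice).
     */
    static <R> R mapReduce(Map<Integer, List<Long>> partitions,
                           Function<List<Long>, R> mapper,
                           BinaryOperator<R> reducer,
                           R identity) {
        R acc = identity;
        for (List<Long> partition : partitions.values()) {
            // Map phase: arbitrary user code over one partition's data.
            R partial = mapper.apply(partition);
            // Reduce phase: arbitrary user code combining partial results.
            acc = reducer.apply(acc, partial);
        }
        return acc;
    }

    public static void main(String[] args) {
        // Hypothetical data: partition id -> values of the one field we need.
        Map<Integer, List<Long>> partitions = Map.of(
            0, List.of(10L, 20L),
            1, List.of(5L));

        Long total = mapReduce(
            partitions,
            part -> part.stream().mapToLong(Long::longValue).sum(),
            Long::sum,
            0L);

        System.out.println(total); // prints 35
    }
}
```

Because the mapper and reducer are ordinary functions, a metric that no
predefined aggregate covers can be plugged in without changing the driver.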

Sincerely,
Dmitriy Pavlov

On Fri, Nov 8, 2019 at 15:45, Ivan Pavlukhin <vololo100@gmail.com> wrote:

> Dmitriy,
>
> First, what kind of cumulative metric can it be? A lot of cumulative
> metrics can be compared using SQL. MIN, MAX, AVG are simple ones. For
> more complex ones I can think about user-defined aggregate functions
> (UDAF). We do not have them in Ignite so far, but can introduce them.
>
> Second, naive approaches of such ComputeScan can lead to incorrect
> results as partitions might not be properly pinned and duplicate
> entries might appear.
>
> On Fri, Nov 8, 2019 at 15:27, Dmitriy Pavlov <dpavlov@apache.org> wrote:
> >
> > Hi Ivan, Igniters, imagine you need to scan all entities in the cluster.
> >
> > Ideally, you don't want to deserialize all of the entries, so you can use
> > withKeepBinary(). E.g. you need a couple of fields and want to get some
> > cumulative metric on this data. You can send a compute task to all cluster
> > nodes and run SQL scan queries there with local mode on. In that manner you
> > can implement Map-Reduce.
> >
> > There may be another way of doing that, so I encourage you to share it. I
> > could update the workshops/training I am preparing in the background.
> >
> > Sincerely,
> > Dmitriy Pavlov
> >
> > On Fri, Nov 8, 2019 at 08:57, Ivan Pavlukhin <vololo100@gmail.com> wrote:
> >
> > > Denis,
> > >
> > > To make things really clear we need to provide a concrete example
> > > of Compute + LocalSQL and reason about it to figure out whether a
> > > "smart" SQL engine can deliver the same (or better) results or not.
> > >
> > > On Fri, Nov 8, 2019 at 01:48, Denis Magda <dmagda@apache.org> wrote:
> > > >
> > > > Folks,
> > > >
> > > > See our compute tasks as an advanced version of stored procedures that
> > > > let the users code logic of various complexity with Java, .NET or C++
> > > > (and not with PL/SQL). The logic can use a combination of APIs
> > > > (key-value, SQL, etc.) to access data both locally and remotely while
> > > > being executed on server nodes. The logic can make N key-value requests
> > > > or run M SQL queries.
> > > > We kept supporting local SQL queries exactly for such scenarios (for
> > > > our version of stored procedures) to ensure the distributed map-reduce
> > > > phase is canceled if all the data is local. And affinityCalls were
> > > > improved one day to pin the partitions.
> > > >
> > > > If the new engine is smart enough to understand that all the partitions
> > > > are available locally during the affinityRun execution then it's totally
> > > > fine to remove the 'local' flag. Otherwise, we need to instruct the
> > > > engine manually that a distributed phase is redundant via the 'local'
> > > > flag or by other means.
> > > >
> > > > Does it make things clearer?
> > > >
> > > >
> > > > -
> > > > Denis
> > > >
> > > >
> > > > On Thu, Nov 7, 2019 at 3:53 AM Ivan Pavlukhin <vololo100@gmail.com> wrote:
> > > >
> > > > > Stephen,
> > > > >
> > > > > In my understanding we need to do a better job of understanding the
> > > > > use cases of Compute + LocalSQL ourselves.
> > > > >
> > > > > Ideally, a smart optimizer should do the best job of query deployment.
> > > > >
> > > > > On Thu, Nov 7, 2019 at 13:04, Stephen Darlington <stephen.darlington@gridgain.com> wrote:
> > > > > >
> > > > > > I made a (bad) assumption that this would also affect queries against
> > > > > > partitions. If “setLocal()” goes away but “setPartitions()” remains
> > > > > > I’m happy.
> > > > > >
> > > > > > What I would say is that the “broadcast / local” method is one I see
> > > > > > fairly often. Do we need to do a better job educating people about the
> > > > > > “correct” way?
> > > > > >
> > > > > > Regards,
> > > > > > Stephen
> > > > > >
> > > > > > > On 7 Nov 2019, at 08:30, Alexey Goncharuk <alexey.goncharuk@gmail.com> wrote:
> > > > > > >
> > > > > > > Denis, Stephen,
> > > > > > >
> > > > > > > Running a local query in a broadcast closure won't work on a changing
> > > > > > > topology. We specifically added an affinityCall method to the compute
> > > > > > > API in order to pin a partition to prevent its moving and eviction
> > > > > > > throughout the task execution. Therefore, the query inside an
> > > > > > > affinityCall is always executed against some partitions (otherwise
> > > > > > > the query may give incorrect results when the topology changes).
> > > > > > >
> > > > > > > I support Igor's question and think that the 'local' flag for the
> > > > > > > query should be deprecated and eventually removed. A 'local' query can
> > > > > > > always be expressed as a query against a set of partitions. If those
> > > > > > > partitions are located on the same node - good, we get fast and
> > > > > > > correct results. If not - we may either raise an exception and ask the
> > > > > > > user to remap the query, or fall back to a distributed query execution.
> > > > > > >
> > > > > > > Given that the Calcite prototype is in its early stages, it's likely
> > > > > > > its first version will be available in 3.x, and it's a good chance to
> > > > > > > get rid of wrong API pieces.
> > > > > > >
> > > > > > > --AG
> > > > > > >
> > > > > > > On Mon, Nov 4, 2019 at 14:02, Stephen Darlington <stephen.darlington@gridgain.com> wrote:
> > > > > > >
> > > > > > >> A common use case is where you want to work on many rows of data
> > > > > > >> across the grid. You’d broadcast a closure, running the same code on
> > > > > > >> every node with just the local data. SQL doesn’t work in isolation;
> > > > > > >> it’s often used as a filter for future computations.
> > > > > > >>
> > > > > > >> Regards,
> > > > > > >> Stephen
> > > > > > >>
> > > > > > >>> On 1 Nov 2019, at 17:53, Ivan Pavlukhin <vololo100@gmail.com> wrote:
> > > > > > >>>
> > > > > > >>> Denis,
> > > > > > >>>
> > > > > > >>> I am mostly concerned about gathering use cases. It would be great to
> > > > > > >>> critically assess such cases to identify why they cannot be solved by
> > > > > > >>> using distributed SQL. Also it sounds similar to some kind of "hints",
> > > > > > >>> but very limited and with all the drawbacks of hints (impossibility of
> > > > > > >>> using the full strength of the CBO). We can provide better "hints"
> > > > > > >>> support with the new engine as well.
> > > > > > >>>
> > > > > > >>> On Fri, Nov 1, 2019 at 20:14, Denis Magda <dmagda@apache.org> wrote:
> > > > > > >>>>
> > > > > > >>>> Ivan,
> > > > > > >>>>
> > > > > > >>>> I was involved in a couple of such use cases personally, so that's
> > > > > > >>>> not my imagination ;) Even more, as far as I remember, the primary
> > > > > > >>>> reason why we improved our affinityRuns, ensuring no partition is
> > > > > > >>>> purged from a node until a task is completed, is that many users were
> > > > > > >>>> running local SQL from compute tasks and needed a guarantee that SQL
> > > > > > >>>> will always return a correct result set.
> > > > > > >>>>
> > > > > > >>>> -
> > > > > > >>>> Denis
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>> On Fri, Nov 1, 2019 at 10:01 AM Ivan Pavlukhin <vololo100@gmail.com> wrote:
> > > > > > >>>>
> > > > > > >>>>> Denis,
> > > > > > >>>>>
> > > > > > >>>>> It would be nice to see real use cases of the affinity call + local
> > > > > > >>>>> SQL combination. Generally, the new engine will be able to infer
> > > > > > >>>>> collocation, resulting in the same collocated execution automatically.
> > > > > > >>>>>
> > > > > > >>>>> On Fri, Nov 1, 2019 at 19:11, Denis Magda <dmagda@apache.org> wrote:
> > > > > > >>>>>>
> > > > > > >>>>>> Hi Igor,
> > > > > > >>>>>>
> > > > > > >>>>>> The local queries feature is broadly used together with
> > > > > > >>>>>> affinity-based compute tasks:
> > > > > > >>>>>>
> > > > > > >>>>>> https://apacheignite.readme.io/docs/collocate-compute-and-data#section-affinity-call-and-run-methods
> > > > > > >>>>>>
> > > > > > >>>>>> The use case is as follows. The user knows that all the data needed
> > > > > > >>>>>> for the computation is collocated, and SQL is used as an advanced API
> > > > > > >>>>>> for data retrieval from the computation code. The affinity task
> > > > > > >>>>>> ensures that partitions won't be discarded from the node(s) if the
> > > > > > >>>>>> topology changes during the task execution and, thus, it's safe to
> > > > > > >>>>>> run SQL locally, skipping distributed phases.
> > > > > > >>>>>>
> > > > > > >>>>>> The combination of affinity compute tasks with local SQL is a real
> > > > > > >>>>>> and valuable use case, and this is what we need to support with
> > > > > > >>>>>> Calcite. Do you see any challenges?
> > > > > > >>>>>>
> > > > > > >>>>>> -
> > > > > > >>>>>> Denis
> > > > > > >>>>>>
> > > > > > >>>>>>
> > > > > > >>>>>> On Fri, Nov 1, 2019 at 8:46 AM Roman Kondakov <kondakov87@mail.ru.invalid> wrote:
> > > > > > >>>>>>
> > > > > > >>>>>>> Hi Igor!
> > > > > > >>>>>>>
> > > > > > >>>>>>> IMO we need to maintain backward compatibility between the old and
> > > > > > >>>>>>> new query engines as much as possible. And therefore we shouldn't
> > > > > > >>>>>>> change the behavior of local queries.
> > > > > > >>>>>>>
> > > > > > >>>>>>> So, for local queries Calcite's planner shouldn't consider the
> > > > > > >>>>>>> distribution trait at all.
> > > > > > >>>>>>>
> > > > > > >>>>>>>
> > > > > > >>>>>>> --
> > > > > > >>>>>>> Kind Regards
> > > > > > >>>>>>> Roman Kondakov
> > > > > > >>>>>>>
> > > > > > >>>>>>> On 01.11.2019 17:07, Seliverstov Igor wrote:
> > > > > > >>>>>>>> Hi Igniters,
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Working on the new generation of Ignite SQL I faced a question: «Do
> > > > > > >>>>>>>> we need local queries at all and, if so, what semantics should they
> > > > > > >>>>>>>> have?».
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> The current planning flow consists of the following steps:
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> 1) Parsing SQL to AST
> > > > > > >>>>>>>> 2) Validating AST (against Schema)
> > > > > > >>>>>>>> 3) Optimizing (Building execution graph)
> > > > > > >>>>>>>> 4) Splitting (into query fragments which execute on target nodes)
> > > > > > >>>>>>>> 5) Mapping (query fragments to nodes/partitions)
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> At the last step we check that all Fragment sources (a table or
> > > > > > >>>>>>>> result) have the same distribution (in other words all sources have
> > > > > > >>>>>>>> to be co-located)
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Planner and Splitter guarantee that all caches in a Fragment are
> > > > > > >>>>>>>> co-located; an Exchange is produced otherwise. But if we force local
> > > > > > >>>>>>>> execution we cannot produce Exchanges, which means we may face two
> > > > > > >>>>>>>> non-co-located caches inside a single query fragment (the result of
> > > > > > >>>>>>>> local query planning is a single query fragment). So, we cannot pass
> > > > > > >>>>>>>> the check.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Should we throw an exception or omit the check for local query
> > > > > > >>>>>>>> planning or prohibit local queries at all?
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Your thoughts?
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Regards,
> > > > > > >>>>>>>> Igor
> > > > > > >>>>>>>
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>>> --
> > > > > > >>>>> Best regards,
> > > > > > >>>>> Ivan Pavlukhin
> > > > > > >>>>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> --
> > > > > > >>> Best regards,
> > > > > > >>> Ivan Pavlukhin
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > > Ivan Pavlukhin
> > > > >
> > >
> > >
> > >
> > > --
> > > Best regards,
> > > Ivan Pavlukhin
> > >
>
>
>
> --
> Best regards,
> Ivan Pavlukhin
>
