ignite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ivan Pavlukhin <vololo...@gmail.com>
Subject Re: Calcite based SQL query engine. Local queries
Date Fri, 08 Nov 2019 05:57:10 GMT
Denis,

To make things really clearer we need to provide some concrete example
of Compute + LocalSQL and reason about it to figure out whether
"smart" SQL engine can deliver the same (or better) results or not.

пт, 8 нояб. 2019 г. в 01:48, Denis Magda <dmagda@apache.org>:
>
> Folks,
>
> See our compute tasks as an advanced version of stored procedures that let
> the users code the logic of various complexity with Java, .NET or C++ (and
> not with PL/SQL). The logic can use a combination of APIs (key-value, SQL,
> etc.) to access data both locally and remotely while being executed on
> server nodes. The logic can make N key-value requests or run M SQL queries.
>
> We kept supporting local SQL queries exactly for such scenarios (for our
> version of stored procedures) to ensure the distributed map-reduce phase is
> canceled if all the data is local. And affinityCalls were improved one day
> to pin the partitions.
>
> If the new engine is smart enough to understand that all the partitions are
> available locally during the affinityRun execution then it's totally fine
> to remove the 'local' flag. Otherwise, we need to instruct the engine
> manually that a distributed phase is redundant via 'local' flag or by other
> means.
>
> Does it make things clearer?
>
>
> -
> Denis
>
>
> On Thu, Nov 7, 2019 at 3:53 AM Ivan Pavlukhin <vololo100@gmail.com> wrote:
>
> > Stephen,
> >
> > In my understanding we need to do a better job to realize use-cases of
> > Compute + LocalSQL ourselves.
> >
> > Ideally smart optimizer should do the best job of query deployment.
> >
> > чт, 7 нояб. 2019 г. в 13:04, Stephen Darlington
> > <stephen.darlington@gridgain.com>:
> > >
> > > I made a (bad) assumption that this would also affect queries against
> > partitions. If “setLocal()” goes away but “setPartitions()” remains I’m
> > happy.
> > >
> > > What I would say is that the “broadcast / local” method is one I see
> > fairly often. Do we need to do a better job educating people of the
> > “correct” way?
> > >
> > > Regards,
> > > Stephen
> > >
> > > > On 7 Nov 2019, at 08:30, Alexey Goncharuk <alexey.goncharuk@gmail.com>
> > wrote:
> > > >
> > > > Denis, Stephen,
> > > >
> > > > Running a local query in a broadcast closure won't work on changing
> > > > topology. We specifically added an affinityCall method to the compute
> > API
> > > > in order to pin a partition to prevent its moving and eviction
> > throughout
> > > > the task execution. Therefore, the query inside an affinityCall is
> > always
> > > > executed against some partitions (otherwise the query may give
> > incorrect
> > > > results when topology is changed).
> > > >
> > > > I support Igor's question and think that the 'local' flag for the query
> > > > should be deprecated and eventually removed. A 'local' query can
> > always be
> > > > expressed as a query agains a set of partitions. If those partitions
> > are
> > > > located on the same node - good, we get fast and correct results. If
> > not -
> > > > we may either raise an exception and ask user to remap the query, or
> > > > fallback to a distributed query execution.
> > > >
> > > > Given that the Calcite prototype is in its early stages, it's likely
> > its
> > > > first version will be available in 3.x, and it's a good chance to get
> > rid
> > > > of wrong API pieces.
> > > >
> > > > --AG
> > > >
> > > > пн, 4 нояб. 2019 г. в 14:02, Stephen Darlington <
> > > > stephen.darlington@gridgain.com>:
> > > >
> > > >> A common use case is where you want to work on many rows of data
> > across
> > > >> the grid. You’d broadcast a closure, running the same code on every
> > node
> > > >> with just the local data. SQL doesn’t work in isolation — it’s
often
> > used
> > > >> as a filter for future computations.
> > > >>
> > > >> Regards,
> > > >> Stephen
> > > >>
> > > >>> On 1 Nov 2019, at 17:53, Ivan Pavlukhin <vololo100@gmail.com>
wrote:
> > > >>>
> > > >>> Denis,
> > > >>>
> > > >>> I am mostly concerned about gathering use cases. It would be great
to
> > > >>> critically assess such cases to identify why it cannot be solved
by
> > > >>> using distributed SQL. Also it sounds similar to some kind of
> > "hints",
> > > >>> but very limited and with all hints drawbacks (impossibility to
use
> > > >>> full strength of CBO). We can provide better "hints" support with
new
> > > >>> engine as well.
> > > >>>
> > > >>> пт, 1 нояб. 2019 г. в 20:14, Denis Magda <dmagda@apache.org>:
> > > >>>>
> > > >>>> Ivan,
> > > >>>>
> > > >>>> I was involved in a couple of such use cases personally, so,
that's
> > not
> > > >> my
> > > >>>> imagination ;) Even more, as far as I remember, the primary
reason
> > why
> > > >> we
> > > >>>> improved our affinityRuns ensuring no partition is purged
from a
> > node
> > > >> until
> > > >>>> a task is completed is because many users were running local
SQL
> > from
> > > >>>> compute tasks and needed a guarantee that SQL will always
return a
> > > >> correct
> > > >>>> result set.
> > > >>>>
> > > >>>> -
> > > >>>> Denis
> > > >>>>
> > > >>>>
> > > >>>> On Fri, Nov 1, 2019 at 10:01 AM Ivan Pavlukhin <vololo100@gmail.com
> > >
> > > >> wrote:
> > > >>>>
> > > >>>>> Denis,
> > > >>>>>
> > > >>>>> Would be nice to see real use-cases of affinity call +
local SQL
> > > >>>>> combination. Generally, new engine will be able to infer
> > collocation
> > > >>>>> resulting in the same collocated execution automatically.
> > > >>>>>
> > > >>>>> пт, 1 нояб. 2019 г. в 19:11, Denis Magda <dmagda@apache.org>:
> > > >>>>>>
> > > >>>>>> Hi Igor,
> > > >>>>>>
> > > >>>>>> Local queries feature is broadly used together with
affinity-based
> > > >>>>> compute
> > > >>>>>> tasks:
> > > >>>>>>
> > > >>>>>
> > > >>
> > https://apacheignite.readme.io/docs/collocate-compute-and-data#section-affinity-call-and-run-methods
> > > >>>>>>
> > > >>>>>> The use case is as follows. The user knows that all
required data
> > > >> needed
> > > >>>>>> for computation is collocated, and SQL is used as
an advanced API
> > for
> > > >>>>> data
> > > >>>>>> retrieval from the computation code. The affinity
task ensures
> > that
> > > >>>>>> partitions won't be discarded from the node(s) if
the topology
> > changes
> > > >>>>>> during the task execution and, thus, it's safe to
run SQL locally
> > > >>>>> skipping
> > > >>>>>> distributed phases.
> > > >>>>>>
> > > >>>>>> The combination of affinity compute tasks with local
SQL is a
> > real and
> > > >>>>>> valuable use case, and this is what we need to support
with
> > Calcite.
> > > >> Do
> > > >>>>> you
> > > >>>>>> see any challenges?
> > > >>>>>>
> > > >>>>>> -
> > > >>>>>> Denis
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> On Fri, Nov 1, 2019 at 8:46 AM Roman Kondakov
> > > >> <kondakov87@mail.ru.invalid
> > > >>>>>>
> > > >>>>>> wrote:
> > > >>>>>>
> > > >>>>>>> Hi Igor!
> > > >>>>>>>
> > > >>>>>>> IMO we need to maintain the backward compatibility
between old
> > and
> > > >> new
> > > >>>>>>> query engines as much as possible. And therefore
we shouldn't
> > change
> > > >>>>> the
> > > >>>>>>> behavior of local queries.
> > > >>>>>>>
> > > >>>>>>> So, for local queries Calcite's planner shouldn't
consider the
> > > >>>>>>> distribution trait at all.
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> --
> > > >>>>>>> Kind Regards
> > > >>>>>>> Roman Kondakov
> > > >>>>>>>
> > > >>>>>>> On 01.11.2019 17:07, Seliverstov Igor wrote:
> > > >>>>>>>> Hi Igniters,
> > > >>>>>>>>
> > > >>>>>>>> Working on new generation of Ignite SQL I
faced a question: «Do
> > we
> > > >>>>> need
> > > >>>>>>> local queries at all and, if so, what semantic
they should
> > have?».
> > > >>>>>>>>
> > > >>>>>>>> Current planing flow consists of next steps:
> > > >>>>>>>>
> > > >>>>>>>> 1) Parsing SQL to AST
> > > >>>>>>>> 2) Validating AST (against Schema)
> > > >>>>>>>> 3) Optimizing (Building execution graph)
> > > >>>>>>>> 4) Splitting (into query fragments which executes
on target
> > nodes)
> > > >>>>>>>> 5) Mapping (query fragments to nodes/partitions)
> > > >>>>>>>>
> > > >>>>>>>> At last step we check that all Fragment sources
(a table or
> > result)
> > > >>>>> have
> > > >>>>>>> the same distribution (in other words all sources
have to be
> > > >>>>> co-located)
> > > >>>>>>>>
> > > >>>>>>>> Planner and Splitter guarantee that all caches
in a Fragment are
> > > >>>>>>> co-located, an Exchange is produced otherwise.
But if we force
> > local
> > > >>>>>>> execution we cannot produce Exchanges, that means
we may face two
> > > >>>>>>> non-co-located caches inside a single query fragment
(result of
> > local
> > > >>>>> query
> > > >>>>>>> planning is a single query fragment). So, we cannot
pass the
> > check.
> > > >>>>>>>>
> > > >>>>>>>> Should we throw an exception or omit the check
for local query
> > > >>>>> planning
> > > >>>>>>> or prohibit local queries at all?
> > > >>>>>>>>
> > > >>>>>>>> Your thoughts?
> > > >>>>>>>>
> > > >>>>>>>> Regards,
> > > >>>>>>>> Igor
> > > >>>>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> --
> > > >>>>> Best regards,
> > > >>>>> Ivan Pavlukhin
> > > >>>>>
> > > >>>
> > > >>>
> > > >>>
> > > >>> --
> > > >>> Best regards,
> > > >>> Ivan Pavlukhin
> > > >>
> > > >>
> > > >>
> > >
> > >
> >
> >
> > --
> > Best regards,
> > Ivan Pavlukhin
> >



-- 
Best regards,
Ivan Pavlukhin

Mime
View raw message