ignite-dev mailing list archives

From Vladimir Ozerov <voze...@gridgain.com>
Subject Re: Batch DML queries design discussion
Date Fri, 09 Dec 2016 08:45:15 GMT
I already expressed my concern - this is a counterintuitive approach. Without
happens-before guarantees, a pure streaming model can be applied only to
independent chunks of data. This means the mentioned ETL use case is not
feasible - ETL always depends on implicit or explicit links between tables,
and hence streaming is not applicable here. My question still stands - what
products, except possibly Ignite, do this kind of JDBC streaming?

Another problem is that a connection-wide property doesn't fit well into the
JDBC pooling model. Users will have to use different connections for streaming
and non-streaming approaches.

Please see how Oracle did that, this is precisely what I am talking about:
https://docs.oracle.com/cd/B28359_01/java.111/b31224/oraperf.htm#i1056232
Two batching modes - one with explicit flush, another one with implicit flush,
where Oracle decides on its own when it is better to communicate with the
server. The batching mode can be declared globally or at the per-statement
level. Simple and flexible.
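
Roughly, the two modes look like this (a sketch from memory of the Oracle JDBC
extensions described in that document - exact class and method names may
differ, and the person table is made up for illustration):

    // Standard (explicit) batching: the application decides when to flush.
    PreparedStatement ps = conn.prepareStatement("INSERT INTO person VALUES (?, ?)");

    for (Person p : people) {
        ps.setLong(1, p.id);
        ps.setString(2, p.name);
        ps.addBatch();
    }

    ps.executeBatch();  // explicit flush to the server

    // Oracle (implicit) batching: the driver flushes on its own once the batch
    // value is reached; the value can be set connection-wide or per statement.
    ((oracle.jdbc.OracleConnection)conn).setDefaultExecuteBatch(100);

    PreparedStatement ps2 = conn.prepareStatement("INSERT INTO person VALUES (?, ?)");

    for (Person p : people) {
        ps2.setLong(1, p.id);
        ps2.setString(2, p.name);
        ps2.executeUpdate();  // queued locally; sent once 100 rows are buffered
    }

    ((oracle.jdbc.OraclePreparedStatement)ps2).sendBatch();  // flush the remainder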


On Fri, Dec 9, 2016 at 4:40 AM, Dmitriy Setrakyan <dsetrakyan@apache.org>
wrote:

> Gents,
>
> As Sergi suggested, batching and streaming are very different semantically.
>
> To use standard JDBC batching, all we need to do is convert it to a
> cache.putAll() method, as semantically a putAll(...) call is identical to a
> JDBC batch. Of course, if we see an UPDATE with a WHERE clause in between,
> then we may have to break the batch into several chunks and execute the
> update in between. The DataStreamer should not be used here.
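>
> A rough sketch of that mapping (illustrative only - "cache", "batch" and the
> BatchedStatement helper are made up here; SqlFieldsQuery is the existing
> Ignite API):
>
>     Map<Object, Object> buf = new LinkedHashMap<>();
>
>     for (BatchedStatement st : batch) {
>         if (st.isInsert())
>             buf.put(st.key(), st.value());   // plain INSERTs accumulate into one putAll chunk
>         else {
>             cache.putAll(buf);               // flush the chunk collected so far
>             buf.clear();
>
>             // run the UPDATE ... WHERE in its original position
>             cache.query(new SqlFieldsQuery(st.sql()).setArgs(st.args()));
>         }
>     }
>
>     if (!buf.isEmpty())
>         cache.putAll(buf);                   // flush the tail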
>
> I believe that for streaming we need to add a special JDBC/ODBC connection
> flag. Whenever this flag is set to true, we should only allow INSERT or
> single-UPDATE operations and use the DataStreamer API internally. All
> operations other than INSERT or single-UPDATE should be prohibited.
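>
> From the user's side that could look roughly like this (the flag name and the
> URL details below are purely illustrative, not an agreed-upon property):
>
>     Connection conn = DriverManager.getConnection(
>         "jdbc:ignite://127.0.0.1/?streaming=true");
>
>     PreparedStatement ps = conn.prepareStatement("INSERT INTO person VALUES (?, ?)");
>
>     for (long i = 0; i < 1_000_000; i++) {
>         ps.setLong(1, i);
>         ps.setString(2, "name-" + i);
>         ps.executeUpdate();  // routed through IgniteDataStreamer internally
>     }
>
>     conn.close();            // the streamer is flushed and closed here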
>
> I think this design is semantically clear. Any objections?
>
> D.
>
> On Thu, Dec 8, 2016 at 5:02 AM, Sergi Vladykin <sergi.vladykin@gmail.com>
> wrote:
>
> > If we use the Streamer, then happens-before is always broken. This is ok,
> > because the Streamer is for data loading, not for usual operation.
> >
> > We are not reinventing the wheel here, just separating concerns: Batching
> > and Streaming.
> >
> > My point here is that they should not depend on each other at all: Batching
> > can work with or without Streaming, just as Streaming can work with or
> > without Batching.
> >
> > Your proposal is a set of non-obvious rules for how they should work
> > together. I see no reason for these complications.
> >
> > Sergi
> >
> >
> > 2016-12-08 15:49 GMT+03:00 Vladimir Ozerov <vozerov@gridgain.com>:
> >
> > > Sergi,
> > >
> > > If a user calls a single *execute()* operation, then most likely it is not
> > > batching. We should not rely on the strange case where a user performs
> > > batching without using the standard and well-adopted JDBC batching API. The
> > > main problem with the streamer is that it is async and hence breaks
> > > happens-before guarantees in a single thread: a SELECT right after an
> > > INSERT might not return the inserted value.
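> > >
> > > For illustration, this is the kind of surprise a user could hit (a sketch,
> > > assuming INSERTs are routed through the streamer asynchronously):
> > >
> > >     Statement stmt = conn.createStatement();
> > >
> > >     stmt.executeUpdate("INSERT INTO person (id, name) VALUES (1, 'John')"); // only buffered in the streamer
> > >
> > >     ResultSet rs = stmt.executeQuery("SELECT name FROM person WHERE id = 1");
> > >
> > >     rs.next(); // may return false - the row may not have reached the grid yet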
> > >
> > > Honestly, I do not really understand why we are trying to reinvent the
> > > wheel here. There is a standard API - let's just use it and make it
> > > flexible enough to take advantage of IgniteDataStreamer if needed.
> > >
> > > Is there any use case which is not covered by this solution? Or let me
> > > ask from the opposite side - are there any well-known JDBC drivers which
> > > perform batching/streaming from non-batched update statements?
> > >
> > > Vladimir.
> > >
> > > On Thu, Dec 8, 2016 at 3:38 PM, Sergi Vladykin <sergi.vladykin@gmail.com>
> > > wrote:
> > >
> > > > Vladimir,
> > > >
> > > > I see no reason to forbid Streamer usage from non-batched statement
> > > > execution.
> > > > It is common that users already have their ETL tools and you can't be
> > > > sure if they use batching or not.
> > > >
> > > > Alex,
> > > >
> > > > I guess we have to decide on Streaming first and then we will discuss
> > > > Batching separately, ok? Because this decision may become important for
> > > > the batching implementation.
> > > >
> > > > Sergi
> > > >
> > > > 2016-12-08 15:31 GMT+03:00 Andrey Gura <agura@apache.org>:
> > > >
> > > > > Alex,
> > > > >
> > > > > In most cases JdbcQueryTask should be executed locally on the client
> > > > > node started by the JDBC driver.
> > > > >
> > > > > JdbcQueryTask.QueryResult res =
> > > > >     loc ? qryTask.call() :
> > > > > ignite.compute(ignite.cluster().forNodeId(nodeId)).call(qryTask);
> > > > >
> > > > > Is it valid behavior after introducing DML functionality?
> > > > >
> > > > > In cases when a user wants to execute a query on a specific node, he
> > > > > should fully understand what he wants and what can go wrong.
> > > > >
> > > > >
> > > > > On Thu, Dec 8, 2016 at 3:20 PM, Alexander Paschenko
> > > > > <alexander.a.paschenko@gmail.com> wrote:
> > > > > > Sergi,
> > > > > >
> > > > > > JDBC batching might work quite differently from driver to driver.
> > > > > > Say, MySQL happily rewrites queries as I had suggested at the
> > > > > > beginning of this thread (it's not the only strategy, but one of the
> > > > > > possible options) - and, BTW, I would like to hear at least an
> > > > > > opinion about it.
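> > > > > >
> > > > > > (For reference, if memory serves, MySQL Connector/J does this when
> > > > > > the rewriteBatchedStatements connection property is enabled, e.g.:
> > > > > >
> > > > > >     jdbc:mysql://127.0.0.1/test?rewriteBatchedStatements=true
> > > > > >
> > > > > > The driver then collapses a batch of single-row INSERTs into one
> > > > > > multi-row INSERT before sending it to the server.)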
> > > > > >
> > > > > > On your first approach, section before streamer: you suggest that
> > > > > > we send a single statement and multiple param sets as a single query
> > > > > > task, am I right? (Just to make sure that I got you properly.) If so,
> > > > > > do you also mean that the API (namely JdbcQueryTask) between server
> > > > > > and client should also change? Or should new API means be added to
> > > > > > facilitate batching tasks?
> > > > > > - Alex
> > > > > >
> > > > > > 2016-12-08 15:05 GMT+03:00 Sergi Vladykin <sergi.vladykin@gmail.com>:
> > > > > >> Guys,
> > > > > >>
> > > > > >> I discussed this feature with Dmitriy and we came to the conclusion
> > > > > >> that batching in JDBC and Data Streaming in Ignite have different
> > > > > >> semantics and performance characteristics. Thus they are independent
> > > > > >> features (they may work together or separately, but that is another
> > > > > >> story).
> > > > > >>
> > > > > >> Let me explain.
> > > > > >>
> > > > > >> This is how JDBC batching works:
> > > > > >> - Add N sets of parameters to a prepared statement.
> > > > > >> - Manually execute the prepared statement.
> > > > > >> - Repeat until all the data is loaded.
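> > > > > >>
> > > > > >> In code this is roughly (plain JDBC; the person table is made up):
> > > > > >>
> > > > > >>     PreparedStatement ps = conn.prepareStatement("INSERT INTO person VALUES (?, ?)");
> > > > > >>
> > > > > >>     for (int i = 0; i < 100_000; i++) {
> > > > > >>         ps.setLong(1, i);
> > > > > >>         ps.setString(2, "name-" + i);
> > > > > >>         ps.addBatch();           // add N sets of parameters
> > > > > >>
> > > > > >>         if (i % 1000 == 999)
> > > > > >>             ps.executeBatch();   // manual execute
> > > > > >>     }
> > > > > >>
> > > > > >>     ps.executeBatch();           // execute whatever is left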
> > > > > >>
> > > > > >>
> > > > > >> This is how data streamer works:
> > > > > >> - Keep adding data.
> > > > > >> - Streamer will buffer and load buffered per-node batches when they
> > > > > >> are big enough.
> > > > > >> - Close streamer to make sure that everything is loaded.
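> > > > > >>
> > > > > >> And in code, roughly ("ignite", Person and people are made up here;
> > > > > >> "person" is the target cache name):
> > > > > >>
> > > > > >>     try (IgniteDataStreamer<Long, Person> streamer = ignite.dataStreamer("person")) {
> > > > > >>         for (Person p : people)
> > > > > >>             streamer.addData(p.id, p);  // keep adding; per-node batches are flushed when big enough
> > > > > >>     }                                   // close() flushes whatever is still buffered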
> > > > > >>
> > > > > >> As you can see, we have a difference in the semantics of when we
> > > > > >> send data: if in our JDBC driver we allow sending batches to nodes
> > > > > >> without calling `execute` (and probably we would need to make
> > > > > >> `execute` a no-op here), then we are violating JDBC semantics; if we
> > > > > >> disallow this behavior, then this batching will underperform.
> > > > > >>
> > > > > >> Thus I suggest keeping these features (JDBC Batching and JDBC
> > > > > >> Streaming) as separate features.
> > > > > >>
> > > > > >> As I already said, they can work together: Batching will batch
> > > > > >> parameters, and on `execute` they will go to the Streamer in one
> > > > > >> shot, and the Streamer will deal with the rest.
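> > > > > >>
> > > > > >> I.e. something like this inside the driver (a sketch; the field and
> > > > > >> method names are illustrative):
> > > > > >>
> > > > > >>     // driver-side executeBatch() implementation
> > > > > >>     public int[] executeBatch() throws SQLException {
> > > > > >>         // hand a copy of the accumulated batch to the streamer in one shot
> > > > > >>         streamer.addData(new HashMap<>(batchedEntries));
> > > > > >>
> > > > > >>         int[] cnts = new int[batchedEntries.size()];
> > > > > >>         Arrays.fill(cnts, 1);  // the streamer deals with buffering/flushing from here
> > > > > >>
> > > > > >>         batchedEntries.clear();
> > > > > >>         return cnts;
> > > > > >>     }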
> > > > > >>
> > > > > >> Sergi
> > > > > >>
> > > > > >>
> > > > > >> 2016-12-08 14:16 GMT+03:00 Vladimir Ozerov <vozerov@gridgain.com>:
> > > > > >>
> > > > > >>> Hi Alex,
> > > > > >>>
> > > > > >>> To my understanding there are two possible approaches to batching
> > > > > >>> in the JDBC layer:
> > > > > >>>
> > > > > >>> 1) Rely on the default batching API, specifically
> > > > > >>> *PreparedStatement.addBatch()* [1] and others. This is a nice and
> > > > > >>> clear API, users are used to it, and its adoption will minimize user
> > > > > >>> code changes when migrating from other JDBC sources. We simply copy
> > > > > >>> updates locally and then execute them all at once with only a single
> > > > > >>> network hop to the servers. *IgniteDataStreamer* can be used
> > > > > >>> underneath.
> > > > > >>>
> > > > > >>> 2) Or we can have a separate connection flag which will move all
> > > > > >>> INSERT/UPDATE/DELETE statements through the streamer.
> > > > > >>>
> > > > > >>> I prefer the first approach.
> > > > > >>>
> > > > > >>> Also we need to keep in mind that the data streamer has poor
> > > > > >>> performance when adding single key-value pairs due to high overhead
> > > > > >>> on concurrency and other bookkeeping. Instead, it is better to
> > > > > >>> pre-batch key-value pairs before giving them to the streamer.
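> > > > > >>>
> > > > > >>> Something along these lines (a sketch; the buffer size is arbitrary
> > > > > >>> and Person/people/streamer are made up names):
> > > > > >>>
> > > > > >>>     Map<Long, Person> buf = new HashMap<>();
> > > > > >>>
> > > > > >>>     for (Person p : people) {
> > > > > >>>         buf.put(p.id, p);
> > > > > >>>
> > > > > >>>         if (buf.size() >= 1024) {
> > > > > >>>             streamer.addData(buf);  // one streamer call per pre-batched chunk instead of per pair
> > > > > >>>             buf = new HashMap<>();
> > > > > >>>         }
> > > > > >>>     }
> > > > > >>>
> > > > > >>>     if (!buf.isEmpty())
> > > > > >>>         streamer.addData(buf);      // flush the tail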
> > > > > >>>
> > > > > >>> Vladimir.
> > > > > >>>
> > > > > >>> [1]
> > > > > >>> https://docs.oracle.com/javase/8/docs/api/java/sql/PreparedStatement.html#addBatch--
> > > > > >>>
> > > > > >>> On Thu, Dec 8, 2016 at 1:21 PM, Alexander Paschenko <
> > > > > >>> alexander.a.paschenko@gmail.com> wrote:
> > > > > >>>
> > > > > >>> > Hello Igniters,
> > > > > >>> >
> > > > > >>> > One of the major improvements to DML has to be support for batch
> > > > > >>> > statements. I'd like to discuss its implementation. The suggested
> > > > > >>> > approach is to rewrite the given query, turning it from a few
> > > > > >>> > INSERTs into a single statement and processing the arguments
> > > > > >>> > accordingly. I suggest this because the whole point of batching is
> > > > > >>> > to make as few interactions with the cluster as possible and to
> > > > > >>> > make operations as condensed as possible, and in the case of
> > > > > >>> > Ignite it means that we should send as few JdbcQueryTasks as
> > > > > >>> > possible. And, as long as a query task holds a single query and
> > > > > >>> > its arguments, this approach will not require any changes to the
> > > > > >>> > current design and won't break any backward compatibility - all
> > > > > >>> > the dirty work of rewriting will be done by the JDBC driver.
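> > > > > >>> >
> > > > > >>> > For example (purely illustrative), a batch of
> > > > > >>> >
> > > > > >>> >     INSERT INTO person (id, name) VALUES (?, ?)  -- args: [1, 'John']
> > > > > >>> >     INSERT INTO person (id, name) VALUES (?, ?)  -- args: [2, 'Mary']
> > > > > >>> >
> > > > > >>> > would be rewritten by the driver into a single statement with the
> > > > > >>> > argument sets flattened:
> > > > > >>> >
> > > > > >>> >     INSERT INTO person (id, name) VALUES (?, ?), (?, ?)  -- args: [1, 'John', 2, 'Mary']
> > > > > >>> >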
> > > > > >>> > Without rewriting, we could introduce some new query task for
> > > > > >>> > batch operations, but that would make it impossible to send such
> > > > > >>> > requests from newer clients to older servers (say, servers of
> > > > > >>> > version 1.8.0, which do not know about batching, let alone older
> > > > > >>> > versions).
> > > > > >>> > I'd like to hear comments and suggestions from the community.
> > > > > >>> > Thanks!
> > > > > >>> >
> > > > > >>> > - Alex
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
>



-- 
Vladimir Ozerov
Senior Software Architect
GridGain Systems
www.gridgain.com
+7 (960) 283 98 40
