ignite-dev mailing list archives

From Alexander Paschenko <alexander.a.pasche...@gmail.com>
Subject Re: Batch DML queries design discussion
Date Sat, 10 Dec 2016 10:52:55 GMT
Dima,

I would like to point out that data streamer support was already
implemented during the DML work for 1.8, exactly as you are suggesting
now: it was turned on via a connection flag, it allowed only MERGE (the
data streamer can't do putIfAbsent stuff, right?), and it had absolutely
no relation to JDBC. *But* that patch was reverted on Vlad's advice, which
I believe had been agreed with you, so it didn't make it into 1.8 after
all.

Also, while it's possible to maintain INSERT vs. MERGE semantics using the
streamer's allowOverwrite flag, I can't see how we could mimic UPDATE
here. The streamer skips a put only when the key is present AND
allowOverwrite is false, while UPDATE should not put anything when the key
is *missing*. In other words, there's no way to emulate the semantics of
the cache's *replace* operation with the streamer (update the value only
if the key is present, otherwise do nothing).
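
To make the mismatch concrete, here is a minimal sketch in plain Java. It
is only a model under stated assumptions: a HashMap stands in for the
cache, and stream() is a hypothetical helper that mimics the streamer
behavior described above (skip the entry only when the key is present and
allowOverwrite is false). None of these names are Ignite API.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative model only: a plain HashMap stands in for the cache, and
// stream() is a hypothetical stand-in for the streamer behavior described
// in the message. None of this is Ignite API.
final class DmlSemanticsSketch {
    // INSERT: store the value only if the key is absent (putIfAbsent).
    static boolean insert(Map<Integer, String> cache, int key, String val) {
        return cache.putIfAbsent(key, val) == null;
    }

    // MERGE: store the value unconditionally (put).
    static void merge(Map<Integer, String> cache, int key, String val) {
        cache.put(key, val);
    }

    // UPDATE: store the value only if the key is present (replace).
    static boolean update(Map<Integer, String> cache, int key, String val) {
        return cache.replace(key, val) != null;
    }

    // Streamer model: the entry is skipped only when the key is present
    // AND allowOverwrite is false; in every other case it is stored.
    static boolean stream(Map<Integer, String> cache, boolean allowOverwrite,
                          int key, String val) {
        if (cache.containsKey(key) && !allowOverwrite)
            return false;
        cache.put(key, val);
        return true;
    }

    public static void main(String[] args) {
        for (boolean allowOverwrite : new boolean[] {false, true}) {
            Map<Integer, String> cache = new HashMap<>();
            // Key 1 is missing, so UPDATE leaves the cache untouched.
            boolean updated = update(cache, 1, "v");
            // The streamer model is then asked to store the same missing key.
            boolean streamed = stream(cache, allowOverwrite, 1, "v");
            System.out.println("allowOverwrite=" + allowOverwrite
                + " update=" + updated + " stream=" + streamed);
        }
    }
}
```

In this model, allowOverwrite=false behaves like insert() and
allowOverwrite=true behaves like merge(), but no flag value reproduces
update(): for a missing key the streamer model stores the entry either way.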

— Alex
On Dec 9, 2016, at 10:00 PM, "Dmitriy Setrakyan" <dsetrakyan@apache.org>
wrote:

> On Fri, Dec 9, 2016 at 12:45 AM, Vladimir Ozerov <vozerov@gridgain.com>
> wrote:
>
> > I already expressed my concern: this is a counterintuitive approach,
> > because without happens-before a pure streaming model can be applied only
> > to independent chunks of data. It means that the mentioned ETL use case
> > is not feasible: ETL always depends on implicit or explicit links between
> > tables, and hence streaming is not applicable here. My question still
> > stands: what products, except possibly Ignite, do this kind of JDBC
> > streaming?
> >
>
> Vova, we have 2 mechanisms in the product: IgniteCache.putAll() or
> DataStreamer.addData().
>
> JDBC batching and putAll() are absolutely identical. If you see it as
> counter-intuitive, I would ask for a concrete example.
>
> As for links between data, Ignite does not have foreign-key constraints,
> so DataStreamer can insert data in any order (but again, not as part of a
> JDBC batch).
>
>
> >
> > Another problem is that a connection-wide property doesn't fit well into
> > the JDBC pooling model. Users will have to use different connections for
> > the streaming and non-streaming approaches.
> >
>
> Using DataStreamer is not possible within JDBC batching paradigm, period. I
> wish we could drop the high-level-feels-good discussions altogether,
> because it seems like we are spinning wheels here.
>
> There is no way to use the streamer in JDBC context, unless we add a
> connection flag. Again, if you disagree, I would prefer to see a concrete
> example explaining why.
>
>
> > Please see how Oracle did that; this is precisely what I am talking about:
> > https://docs.oracle.com/cd/B28359_01/java.111/b31224/oraperf.htm#i1056232
> > Two batching modes: one with explicit flush, another with implicit
> > flush, where Oracle decides on its own when it is better to communicate
> > with the server. Batching mode can be declared globally or at the
> > per-statement level. Simple and flexible.
> >
> >
> > On Fri, Dec 9, 2016 at 4:40 AM, Dmitriy Setrakyan <dsetrakyan@apache.org>
> > wrote:
> >
> > > Gents,
> > >
> > > As Sergi suggested, batching and streaming are very different
> > > semantically.
> > >
> > > To use standard JDBC batching, all we need to do is convert it to a
> > > cache.putAll() call, as semantically a putAll(...) call is identical to
> > > a JDBC batch. Of course, if we see an UPDATE with a WHERE clause in
> > > between, then we may have to break the batch into several chunks and
> > > execute the update in between. The DataStreamer should not be used here.
> > >
> > > I believe that for streaming we need to add a special JDBC/ODBC
> > > connection flag. Whenever this flag is set to true, we should allow only
> > > INSERT or single-UPDATE operations and use the DataStreamer API
> > > internally. All operations other than INSERT or single-UPDATE should be
> > > prohibited.
> > >
> > > I think this design is semantically clear. Any objections?
> > >
> > > D.
> > >
> > > On Thu, Dec 8, 2016 at 5:02 AM, Sergi Vladykin <sergi.vladykin@gmail.com>
> > > wrote:
> > >
> > > > If we use the Streamer, then `happens-before` is always broken. This
> > > > is ok, because the Streamer is for data loading, not for regular
> > > > operation.
> > > >
> > > > We are not inventing any bicycles, just separating concerns: Batching
> > > > and Streaming.
> > > >
> > > > My point here is that they should not depend on each other at all:
> > > > Batching can work with or without Streaming, and Streaming can work
> > > > with or without Batching.
> > > >
> > > > Your proposal is a set of non-obvious rules for them to work. I see no
> > > > reason for these complications.
> > > >
> > > > Sergi
> > > >
> > > >
> > > > 2016-12-08 15:49 GMT+03:00 Vladimir Ozerov <vozerov@gridgain.com>:
> > > >
> > > > > Sergi,
> > > > >
> > > > > If the user calls a single *execute()* operation, then most likely it
> > > > > is not batching. We should not rely on the strange case where the user
> > > > > performs batching without using the standard and well-adopted JDBC
> > > > > batching API. The main problem with the streamer is that it is async
> > > > > and hence breaks happens-before guarantees in a single thread: a
> > > > > SELECT after an INSERT might not return the inserted value.
> > > > >
> > > > > Honestly, I do not really understand why we are trying to re-invent a
> > > > > bicycle here. There is a standard API: let's just use it and make it
> > > > > flexible enough to take advantage of IgniteDataStreamer if needed.
> > > > >
> > > > > Is there any use case which is not covered by this solution? Or let
> > > > > me ask from the opposite side: are there any well-known JDBC drivers
> > > > > which perform batching/streaming from non-batched update statements?
> > > > >
> > > > > Vladimir.
> > > > >
> > > > > On Thu, Dec 8, 2016 at 3:38 PM, Sergi Vladykin <sergi.vladykin@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Vladimir,
> > > > > >
> > > > > > I see no reason to forbid Streamer usage from non-batched statement
> > > > > > execution. It is common that users already have their ETL tools, and
> > > > > > you can't be sure whether they use batching or not.
> > > > > >
> > > > > > Alex,
> > > > > >
> > > > > > I guess we have to decide on Streaming first and then discuss
> > > > > > Batching separately, ok? Because this decision may become important
> > > > > > for the batching implementation.
> > > > > >
> > > > > > Sergi
> > > > > >
> > > > > > 2016-12-08 15:31 GMT+03:00 Andrey Gura <agura@apache.org>:
> > > > > >
> > > > > > > Alex,
> > > > > > >
> > > > > > > In most cases JdbcQueryTask should be executed locally on the
> > > > > > > client node started by the JDBC driver.
> > > > > > >
> > > > > > > JdbcQueryTask.QueryResult res =
> > > > > > >     loc ? qryTask.call() :
> > > > > > >     ignite.compute(ignite.cluster().forNodeId(nodeId)).call(qryTask);
> > > > > > >
> > > > > > > Is this behavior still valid after introducing DML functionality?
> > > > > > >
> > > > > > > In cases when the user wants to execute a query on a specific node,
> > > > > > > he should fully understand what he wants and what can go wrong.
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Dec 8, 2016 at 3:20 PM, Alexander Paschenko
> > > > > > > <alexander.a.paschenko@gmail.com> wrote:
> > > > > > > > Sergi,
> > > > > > > >
> > > > > > > > JDBC batching might work quite differently from driver to driver.
> > > > > > > > Say, MySQL happily rewrites queries as I suggested at the
> > > > > > > > beginning of this thread (it's not the only strategy, but one of
> > > > > > > > the possible options), and, BTW, I would like to hear at least an
> > > > > > > > opinion about it.
> > > > > > > >
> > > > > > > > On your first approach, the section before the streamer: you
> > > > > > > > suggest that we send a single statement and multiple param sets
> > > > > > > > as a single query task, am I right? (Just to make sure that I got
> > > > > > > > you properly.) If so, do you also mean that the API (namely
> > > > > > > > JdbcQueryTask) between server and client should also change? Or
> > > > > > > > should new API means be added to facilitate batching tasks?
> > > > > > > >
> > > > > > > > - Alex
> > > > > > > >
> > > > > > > > 2016-12-08 15:05 GMT+03:00 Sergi Vladykin <sergi.vladykin@gmail.com>:
> > > > > > > >> Guys,
> > > > > > > >>
> > > > > > > > >> I discussed this feature with Dmitriy and we came to the
> > > > > > > > >> conclusion that batching in JDBC and Data Streaming in Ignite
> > > > > > > > >> have different semantics and performance characteristics. Thus
> > > > > > > > >> they are independent features (they may work together or
> > > > > > > > >> separately, but that is another story).
> > > > > > > >>
> > > > > > > >> Let me explain.
> > > > > > > >>
> > > > > > > > >> This is how JDBC batching works:
> > > > > > > > >> - Add N sets of parameters to a prepared statement.
> > > > > > > > >> - Manually execute the prepared statement.
> > > > > > > > >> - Repeat until all the data is loaded.
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >> This is how the data streamer works:
> > > > > > > > >> - Keep adding data.
> > > > > > > > >> - The streamer will buffer data and load the buffered per-node
> > > > > > > > >>   batches when they are big enough.
> > > > > > > > >> - Close the streamer to make sure that everything is loaded.
> > > > > > > > >> As you can see, we have a difference in the semantics of when we
> > > > > > > > >> send data: if in our JDBC driver we allow sending batches to
> > > > > > > > >> nodes without calling `execute` (and we will probably need to
> > > > > > > > >> make `execute` a no-op here), then we are violating the
> > > > > > > > >> semantics of JDBC; if we disallow this behavior, then this
> > > > > > > > >> batching will underperform.
> > > > > > > >>
> > > > > > > > >> Thus I suggest keeping these features (JDBC Batching and JDBC
> > > > > > > > >> Streaming) as separate features.
> > > > > > > > >>
> > > > > > > > >> As I already said, they can work together: Batching will batch
> > > > > > > > >> parameters, and on `execute` they will go to the Streamer in one
> > > > > > > > >> shot and the Streamer will deal with the rest.
> > > > > > > >>
> > > > > > > >> Sergi
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> 2016-12-08 14:16 GMT+03:00 Vladimir Ozerov <
> > > vozerov@gridgain.com
> > > > >:
> > > > > > > >>
> > > > > > > >>> Hi Alex,
> > > > > > > >>>
> > > > > > > > >>> To my understanding there are two possible approaches to
> > > > > > > > >>> batching in the JDBC layer:
> > > > > > > > >>>
> > > > > > > > >>> 1) Rely on the default batching API, specifically
> > > > > > > > >>> *PreparedStatement.addBatch()* [1] and others. This is a nice
> > > > > > > > >>> and clear API, users are used to it, and its adoption will
> > > > > > > > >>> minimize user code changes when migrating from other JDBC
> > > > > > > > >>> sources. We simply copy updates locally and then execute them
> > > > > > > > >>> all at once with only a single network hop to servers.
> > > > > > > > >>> *IgniteDataStreamer* can be used underneath.
> > > > > > > >>>
> > > > > > > > >>> 2) Or we can have a separate connection flag which will move all
> > > > > > > > >>> INSERT/UPDATE/DELETE statements through the streamer.
> > > > > > > > >>>
> > > > > > > > >>> I prefer the first approach.
> > > > > > > >>>
> > > > > > > > >>> Also we need to keep in mind that the data streamer has poor
> > > > > > > > >>> performance when adding single key-value pairs due to high
> > > > > > > > >>> overhead on concurrency and other bookkeeping. Instead, it is
> > > > > > > > >>> better to pre-batch key-value pairs before giving them to the
> > > > > > > > >>> streamer.
> > > > > > > >>>
> > > > > > > >>> Vladimir.
> > > > > > > >>>
> > > > > > > >>> [1]
> > > > > > > >>> https://docs.oracle.com/javase/8/docs/api/java/sql/
> > > > > > > PreparedStatement.html#
> > > > > > > >>> addBatch--
> > > > > > > >>>
> > > > > > > > >>> On Thu, Dec 8, 2016 at 1:21 PM, Alexander Paschenko
> > > > > > > > >>> <alexander.a.paschenko@gmail.com> wrote:
> > > > > > > >>>
> > > > > > > >>> > Hello Igniters,
> > > > > > > >>> >
> > > > > > > > >>> > One of the major improvements to DML has to be support for
> > > > > > > > >>> > batch statements. I'd like to discuss its implementation. The
> > > > > > > > >>> > suggested approach is to rewrite the given query, turning it
> > > > > > > > >>> > from a few INSERTs into a single statement and processing
> > > > > > > > >>> > arguments accordingly. I suggest this because the whole point
> > > > > > > > >>> > of batching is to make as few interactions with the cluster as
> > > > > > > > >>> > possible and to make operations as condensed as possible, and
> > > > > > > > >>> > in the case of Ignite it means that we should send as few
> > > > > > > > >>> > JdbcQueryTasks as possible. And, as long as a query task holds
> > > > > > > > >>> > a single query and its arguments, this approach will not
> > > > > > > > >>> > require any changes to the current design and won't break any
> > > > > > > > >>> > backward compatibility: all the dirty work of rewriting will
> > > > > > > > >>> > be done by the JDBC driver.
> > > > > > > > >>> > Without rewriting, we could introduce some new query task for
> > > > > > > > >>> > batch operations, but that would make it impossible to send
> > > > > > > > >>> > such requests from newer clients to older servers (say,
> > > > > > > > >>> > servers of version 1.8.0, which does not know about batching,
> > > > > > > > >>> > let alone older versions).
> > > > > > > > >>> > I'd like to hear comments and suggestions from the community.
> > > > > > > > >>> > Thanks!
> > > > > > > >>> >
> > > > > > > >>> > - Alex
> > > > > > > >>> >
> > > > > > > >>>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > Vladimir Ozerov
> > Senior Software Architect
> > GridGain Systems
> > www.gridgain.com
> > *+7 (960) 283 98 40*
> >
>
