ignite-dev mailing list archives

From Denis Magda <dma...@apache.org>
Subject Re: DML data streaming
Date Sat, 11 Feb 2017 00:13:30 GMT
In general, the data streamer approach should be mostly used for data loading scenarios. The
data is usually loaded with INSERTS which means that the scenario is already supported and
we’re free to merge the changes to 1.9.

If you UPDATE or DELETE data in streaming mode, you are required to set
dataStreamer.allowOverwrite = true, making sure that the updates coming from the streamer side
are consistent with transactions that might be executed in parallel. In this mode the streamer
switches to a slower path, pushing the data with the cache.writeAll() and cache.removeAll() methods.
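
A minimal sketch of the flag's effect, using a toy stand-in class rather than Ignite's own streamer. The names ToyStreamer, addData and removeData here are hypothetical and exist only for illustration. Rejecting removals while the flag is off is an assumption of the toy that mirrors the requirement above; the real streamer may behave differently.

```java
import java.util.Map;

/** Toy stand-in for a data streamer (hypothetical, NOT Ignite's IgniteDataStreamer). */
class ToyStreamer {
    private final Map<Integer, String> cache;

    /** Off by default: the plain data-loading (INSERT-only) case. */
    private boolean allowOverwrite;

    ToyStreamer(Map<Integer, String> cache) {
        this.cache = cache;
    }

    void allowOverwrite(boolean flag) {
        this.allowOverwrite = flag;
    }

    /** INSERT-like load: with the flag off, an existing key is silently kept. */
    void addData(int key, String val) {
        if (allowOverwrite)
            cache.put(key, val);
        else
            cache.putIfAbsent(key, val);
    }

    /** DELETE-like removal: the toy rejects it unless overwriting is enabled (assumption). */
    void removeData(int key) {
        if (!allowOverwrite)
            throw new IllegalStateException("allowOverwrite must be true for removals");

        cache.remove(key);
    }
}
```

With the flag off, a second addData for the same key leaves the first value in place, which matches the "silently ignores duplicate keys" behavior discussed further down the thread.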

All in all, considering real-life use cases, it’s more than enough to support the streaming mode
for INSERTs only and describe it properly in the documentation.


> On Feb 10, 2017, at 3:36 AM, Vladimir Ozerov <vozerov@gridgain.com> wrote:
> I propose to ship streaming with INSERT support only for now. This is
> enough for a multitude of cases and will add value to Ignite 1.9 immediately. We
> can think about a correct streaming UPDATE/DELETE architecture separately. It
> is a much more difficult thing; we cannot support it in a clean way right now
> due to multiple "_key" and "_val" usages across the code base.
> On Fri, Feb 10, 2017 at 11:55 AM, Alexander Paschenko <
> alexander.a.paschenko@gmail.com> wrote:
>> And to avoid further confusion: UPDATE and DELETE are simply
>> impossible in streaming mode when the key is not completely defined as
>> long as data streamer operates with key-value pairs and not just
>> tuples of named values. That's why we can't do DELETE from Person
>> WHERE id1 = 5 from prev example with streamer - the Key { id1 = 5, id2
>> = 0 } that would be constructed from such query is just one key and is
>> handled by streamer as such while semantically that query is not about
>> ONE key but about ALL keys where id1 = 5.
>> - Alex
>> 2017-02-10 11:49 GMT+03:00 Alexander Paschenko
>> <alexander.a.paschenko@gmail.com>:
>>> Dima,
>>>> There are several ways to handle it. I would check how other databases
>>>> handle it, maybe we can borrow something. At the least, we should log such
>>>> errors in the log for now.
>>> Logging errors would mean introducing some kind of stream receiver to
>>> do that and thus that would be really the same performance penalty for
>>> the successful operations. I think we should go with that optional
>>> flag for semantics after all.
>>>> You don't have to use _key. Primary key is usually a field in the class, so
>>>> you can use a normal column name. In any case, we should remove any usage
>>>> of _key before 2.0 is released.
>>>> Again, if user does not have to specify _key on INSERT, then it is very
>>>> unclear to me, why user would need to specify _key for UPDATE or DELETE.
>>>> Something smells here. Can you please provide an example?
>>> UPDATE and DELETE _in streaming mode_ are carried out _only_ for "fast"
>>> optimized cases - i.e. those where _key (and possibly _val) are
>>> explicitly specified by the user thus allowing us to map UPDATE and
>>> DELETE directly to cache's replace and remove operations without
>>> messing with entry processors and doing map-reduce SELECT by given
>>> criteria.
>>> Say, we have Person { firstName, secondName } with key class Key { id1, id2 }.
>>> If I say DELETE from Person WHERE _key = ? and specify arg via JDBC,
>>> there's no need to do any SELECT - we can just call IgniteCache.remove
>>> on that key.
>>> But if I say DELETE from Person WHERE id1 = 5 then there's no way to
>>> avoid MR - we have to find all keys that interest us first by doing
>>> SELECT as long as we know only partly about what keys the user wants
>>> to be affected.
>>> It works in the same way for UPDATE. And I hope that it's clear how
>>> it's different from INSERT - there's no MR by definition (we don't
>>> allow INSERT FROM SELECT in streaming mode).
>>> AGAIN: this all is said only about streaming mode; non streaming mode
>>> does those optimizations too, but it also allows complex conditions,
>>> while streaming mode does not allow them to keep things fast and avoid
>>> MR.
>>> That's the reason why I suggest that we drop UPDATE and DELETE from
>>> DML streaming as they mean messing with those soon-hidden columns.
>>> Still we could optimize stuff like DELETE from Person WHERE id1 = 5
>>> AND id2 = 6 - query involves ALL fields of key AND compares only for
>>> equality AND has no complex expressions - we can construct key
>>> unambiguously and still call remove directly.
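
A minimal sketch of the fast-path check described above, assuming the WHERE clause has already been reduced to a map of column-to-value equality conditions. KeyFastPath and resolveKey are hypothetical names used only for illustration, not Ignite code. The key can be built only when the conditions name exactly the key fields; anything else would need a SELECT first.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper: decides whether a DELETE/UPDATE can skip the SELECT step. */
class KeyFastPath {
    /**
     * Returns the fully defined key if the equality conditions cover exactly
     * the key fields; otherwise returns empty, meaning the affected keys have
     * to be found with a SELECT before anything can be removed or replaced.
     */
    static Optional<Map<String, Object>> resolveKey(List<String> keyFields, Map<String, ?> equalities) {
        // A missing key field (WHERE id1 = 5) or an extra non-key condition rules out the fast path.
        if (!equalities.keySet().equals(new HashSet<>(keyFields)))
            return Optional.empty();

        // Build the key in declared field order.
        Map<String, Object> key = new LinkedHashMap<>();

        for (String field : keyFields)
            key.put(field, equalities.get(field));

        return Optional.of(key);
    }
}
```

For Key { id1, id2 }, the conditions id1 = 5 AND id2 = 6 yield a key that can be passed straight to a remove call, while id1 = 5 alone yields an empty result.
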
>>> But to me it does not sound like a really great reason to leave UPDATE
>>> and DELETE in DML - the users will have to write some specific queries
>>> to use that while all other stuff will just be declined in that mode.
>>> And, as I said before, UPDATE and DELETE probably don't fit perfectly
>>> with primary data streamer use cases - after all, modifying existing
>>> stuff is not what data streamer is about.
>>> And regarding hiding columns: it's unclear how things will look
>>> for caches like <int, int> when we remove _key and _val as long as
>>> tables for such cases currently have nothing but those two columns.
>>> - Alex
>>>>> On Feb 8, 2017, at 11:33 PM, "Dmitriy Setrakyan"
>>>>> <dsetrakyan@apache.org> wrote:
>>>>>> Alexander,
>>>>>> Are you suggesting that currently to execute a simple INSERT for a row we
>>>>>> invoke a data streamer on Ignite API? How about an update by a primary key?
>>>>>> Why not execute a simple cache put in either case?
>>>>>> I think we had a separate thread where we agreed that the streamer should
>>>>>> only be turned on if a certain flag on a JDBC connection is set.
>>>>>> D.
>>>>>> On Wed, Feb 8, 2017 at 7:00 AM, Alexander Paschenko <
>>>>>> alexander.a.paschenko@gmail.com> wrote:
>>>>>>> Hello Igniters,
>>>>>>> I'd like to raise a few questions regarding data streaming via DML
>>>>>>> statements.
>>>>>>> Currently, all types of DML statements are supported (INSERT, UPDATE,
>>>>>>> DELETE, MERGE).
>>>>>>> UPDATE and DELETE are supported in streaming mode only when their
>>>>>>> WHERE condition is bound to the _key and/or _val columns, and
>>>>>>> works only for _val column directly.
>>>>>>> Seeing some activity in the direction of hiding _key and _val from the
>>>>>>> user as far as possible, these features seem pointless and should not
>>>>>>> be released. What do you think?
>>>>>>> Also, INSERT in streaming mode currently does not throw errors on
>>>>>>> duplicate keys and silently ignores such new records (since it's
>>>>>>> faster than it would be if we introduced a receiver that would throw
>>>>>>> exceptions) - this can be fixed with an additional flag that could
>>>>>>> _optionally_ make INSERT slower but more accurate in semantics.
>>>>>>> And MERGE in streaming mode is currently not totally accurate in
>>>>>>> semantics either - on key presence, it will just replace the whole value
>>>>>>> with the new one, thus potentially losing the values of some concrete
>>>>>>> columns/fields - this is analogous to
>>>>>>> https://issues.apache.org/jira/browse/IGNITE-4489, but can hardly be
>>>>>>> fixed, as it would probably hit performance and would be
>>>>>>> unreasonably complex to implement.
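
A minimal sketch of the MERGE behavior described above, assuming rows are modeled as plain column-to-value maps. MergeSemantics and both method names are hypothetical and not part of Ignite. The first method replaces the stored value as a whole, so columns missing from the statement are lost; the second shows the column-wise behavior that would be semantically accurate.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of whole-value vs. column-wise MERGE. */
class MergeSemantics {
    /** Streaming-style MERGE as described: the stored value is replaced as a whole. */
    static void mergeReplace(Map<Integer, Map<String, Object>> cache, int key, Map<String, Object> cols) {
        cache.put(key, new HashMap<>(cols));
    }

    /** Column-wise MERGE: only the listed columns change, the others survive. */
    static void mergeColumns(Map<Integer, Map<String, Object>> cache, int key, Map<String, Object> cols) {
        cache.computeIfAbsent(key, k -> new HashMap<>()).putAll(cols);
    }
}
```

For a stored Person { firstName, secondName }, a MERGE that supplies only firstName drops secondName under mergeReplace and keeps it under mergeColumns.
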
>>>>>>> I suggest that we drop all except INSERT and introduce an optional flag
>>>>>>> for its totally correct semantic behavior as described above.
>>>>>>> - Alex
