ignite-dev mailing list archives

From Valentin Kulichenko <valentin.kuliche...@gmail.com>
Subject Re: IgniteCache.loadCache improvement proposal
Date Tue, 15 Nov 2016 22:06:31 GMT
You can use the localLoadCache method for this (it should be overloaded as
well, of course). Basically, if you provide a closure based on
IgniteDataStreamer and call localLoadCache on one of the nodes (client or
server), it's the same approach as described in [1], but with the
possibility to reuse existing persistence code. Makes sense?
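
Here is a minimal sketch of what I mean (it assumes the closure-accepting
overload discussed in this thread; cache name and key/value types are
illustrative, imports omitted):

    Ignite ignite = Ignition.ignite();
    IgniteCache<Integer, Person> cache = ignite.cache("persons");

    try (IgniteDataStreamer<Integer, Person> streamer =
             ignite.dataStreamer("persons")) {
        // Entries produced by the existing CacheStore.loadCache go through
        // the streamer, which routes them to the right nodes itself.
        // Hypothetical overload: (filter, closure).
        cache.localLoadCache(null, (k, v) -> streamer.addData(k, v));
    }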

[1] https://apacheignite.readme.io/docs/data-loading#ignitedatastreamer

-Val

On Tue, Nov 15, 2016 at 1:15 PM, Denis Magda <dmagda@apache.org> wrote:

> How would your proposal address the main point Aleksandr is trying to
> convey, which is excessive network utilization?
>
> As I see it, the loadCache method will still be triggered on every node
> and, as before, all the nodes will pre-load the whole data set from the
> database. That was Aleksandr’s reasonable concern.
>
> If we find a way to call loadCache on a specific node only and implement
> some fault-tolerant mechanism, then your suggestion should work perfectly
> fine.
>
> —
> Denis
>
> > On Nov 15, 2016, at 12:05 PM, Valentin Kulichenko <
> valentin.kulichenko@gmail.com> wrote:
> >
> > It sounds like Aleksandr is basically proposing to support automatic
> > persistence [1] for loading through a data streamer, and we really don't
> > have this. However, I think I have a more generic solution in mind.
> >
> > What if we add one more IgniteCache.loadCache overload like this:
> >
> > loadCache(@Nullable IgniteBiPredicate<K, V> p, IgniteBiInClosure<K, V> clo,
> >     @Nullable Object... args)
> >
> > It's the same as the existing one, but with the key-value closure provided
> > as a parameter. This closure will be passed to CacheStore.loadCache along
> > with the arguments and will make it possible to override the logic that
> > actually saves the loaded entry in the cache (currently this logic is
> > always provided by the cache itself and the user can't control it).
> >
> > We can then provide an implementation of this closure that will create a
> > data streamer and call addData() within its apply() method.
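> >
> > A rough sketch of such a closure (the class name is illustrative, not a
> > final API; imports omitted):
> >
> >   // Streamer-backed load closure: every entry produced by the store is
> >   // handed to an IgniteDataStreamer instead of the local cache.
> >   class StreamerLoadClosure<K, V> implements IgniteBiInClosure<K, V> {
> >       private final IgniteDataStreamer<K, V> streamer;
> >
> >       StreamerLoadClosure(IgniteDataStreamer<K, V> streamer) {
> >           this.streamer = streamer;
> >       }
> >
> >       @Override public void apply(K key, V val) {
> >           streamer.addData(key, val);
> >       }
> >   }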
> >
> > I see the following advantages:
> >
> >   - Any existing CacheStore implementation can be reused to load through
> >   the streamer (our JDBC and Cassandra stores, or anything else the user
> >   has).
> >   - Loading code is always part of the CacheStore implementation, so it's
> >   very easy to switch between different ways of loading.
> >   - The user is not limited to the two approaches we provide out of the
> >   box; they can always implement a new one.
> >
> > Thoughts?
> >
> > [1] https://apacheignite.readme.io/docs/automatic-persistence
> >
> > -Val
> >
> > On Tue, Nov 15, 2016 at 2:27 AM, Alexey Kuznetsov <akuznetsov@apache.org>
> > wrote:
> >
> >> Hi, All!
> >>
> >> I think we do not need to change the API at all.
> >>
> >> public void loadCache(@Nullable IgniteBiPredicate<K, V> p, @Nullable
> >> Object... args) throws CacheException;
> >>
> >> We could pass any args to loadCache();
> >>
> >> So we could create a class:
> >>
> >> class IgniteCacheLoadDescriptor {
> >>     // some fields that describe how to load
> >> }
> >>
> >> and modify the POJO store to detect and use such arguments.
> >>
> >>
> >> All we need is to implement this and write good documentation and
> >> examples.
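> >>
> >> For illustration only (the descriptor's fields are hypothetical), the
> >> store's loadCache could detect the descriptor among the args like this:
> >>
> >>   @Override public void loadCache(IgniteBiInClosure<K, V> clo, Object... args) {
> >>       for (Object arg : args) {
> >>           if (arg instanceof IgniteCacheLoadDescriptor) {
> >>               IgniteCacheLoadDescriptor desc = (IgniteCacheLoadDescriptor)arg;
> >>
> >>               // Run the queries described by desc and pass every loaded
> >>               // key-value pair to clo.apply(key, val).
> >>           }
> >>       }
> >>   }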
> >>
> >> Thoughts?
> >>
> >> On Tue, Nov 15, 2016 at 5:22 PM, Alexandr Kuramshin <ein.nsk.ru@gmail.com>
> >> wrote:
> >>
> >>> Hi Vladimir,
> >>>
> >>> I don't offer any changes in the API. The usage scenario is the same
> >>> as described in
> >>> https://apacheignite.readme.io/docs/persistent-store#section-loadcache-
> >>>
> >>> The cache preload logic invokes IgniteCache.loadCache() with some
> >>> additional arguments, depending on the CacheStore implementation, and
> >>> then the loading occurs in the way I've already described.
> >>>
> >>>
> >>> 2016-11-15 11:26 GMT+03:00 Vladimir Ozerov <vozerov@gridgain.com>:
> >>>
> >>>> Hi Alex,
> >>>>
> >>>>>>> Let's give the user the reusable code which is convenient,
> >>>>>>> reliable and fast.
> >>>>
> >>>> Convenience is why I asked for an example of how the API could look
> >>>> and how users are going to use it.
> >>>>
> >>>> Vladimir.
> >>>>
> >>>> On Tue, Nov 15, 2016 at 11:18 AM, Alexandr Kuramshin <ein.nsk.ru@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hi all,
> >>>>>
> >>>>> I think the discussion is going in the wrong direction. Certainly it's
> >>>>> not a big deal to implement some custom user logic to load the data
> >>>>> into caches. But the Ignite framework gives the user reusable code
> >>>>> built on top of the basic system.
> >>>>>
> >>>>> So the main question is: why do we let the user use a convenient way
> >>>>> to load caches that is a totally non-optimal solution?
> >>>>>
> >>>>> We could talk at length about different persistence storage types, but
> >>>>> whenever we initiate loading with IgniteCache.loadCache, the current
> >>>>> implementation imposes a lot of overhead on the network.
> >>>>>
> >>>>> Partition-aware data loading may be used in some scenarios to avoid
> >>>>> this network overhead, but users are compelled to take additional
> >>>>> steps to achieve this optimization: adding a column to tables, adding
> >>>>> compound indices that include the added column, writing a piece of
> >>>>> repetitive code to load the data into different caches in a
> >>>>> fault-tolerant fashion, etc.
> >>>>>
> >>>>> Let's give the user the reusable code which is convenient, reliable
> >>>>> and fast.
> >>>>>
> >>>>> 2016-11-14 20:56 GMT+03:00 Valentin Kulichenko <
> >>>>> valentin.kulichenko@gmail.com>:
> >>>>>
> >>>>>> Hi Aleksandr,
> >>>>>>
> >>>>>> Data streamer is already outlined as one of the possible approaches
> >>>>>> for loading the data [1]. Basically, you start a designated client
> >>>>>> node or choose a leader among the server nodes [2] and then use the
> >>>>>> IgniteDataStreamer API to load the data. With this approach there is
> >>>>>> no need to have a CacheStore implementation at all. Can you please
> >>>>>> elaborate on what additional value you are trying to add here?
> >>>>>>
> >>>>>> [1] https://apacheignite.readme.io/docs/data-loading#ignitedatastreamer
> >>>>>> [2] https://apacheignite.readme.io/docs/leader-election
> >>>>>>
> >>>>>> -Val
> >>>>>>
> >>>>>> On Mon, Nov 14, 2016 at 8:23 AM, Dmitriy Setrakyan <dsetrakyan@apache.org>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> I just want to clarify a couple of API details from the original
> >>>>>>> email to make sure that we are making the right assumptions here.
> >>>>>>>
> >>>>>>> *"Because of none keys are passed to the CacheStore.loadCache
> >>>> methods,
> >>>>>> the
> >>>>>>>> underlying implementation is forced to read all the
data from
> >> the
> >>>>>>>> persistence storage"*
> >>>>>>>
> >>>>>>>
> >>>>>>> According to the javadoc, the loadCache(...) method receives
> >>>>>>> optional arguments from the user. You can pass anything you like,
> >>>>>>> including a list of keys, an SQL where clause, etc.
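> >>>>>>>
> >>>>>>> For example (illustrative only; how the store interprets the
> >>>>>>> arguments is entirely up to its implementation):
> >>>>>>>
> >>>>>>>   // The where clause travels down to CacheStore.loadCache as an arg.
> >>>>>>>   cache.loadCache(null, "select * from Person where id < 1000");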
> >>>>>>>
> >>>>>>> *"The partition-aware data loading approach is not a choice.
It
> >>>>> requires
> >>>>>>>> persistence of the volatile data depended on affinity
function
> >>>>>>>> implementation and settings."*
> >>>>>>>
> >>>>>>>
> >>>>>>> This is only partially true. While Ignite allows plugging in custom
> >>>>>>> affinity functions, the affinity function is not something that
> >>>>>>> changes dynamically, and it should always return the same partition
> >>>>>>> for the same key. So, the partition assignments are not volatile at
> >>>>>>> all. If, in some very rare case, the partition assignment logic
> >>>>>>> needs to change, then you could also update the partition
> >>>>>>> assignments that you may have persisted elsewhere, e.g. in the
> >>>>>>> database.
> >>>>>>>
> >>>>>>> D.
> >>>>>>>
> >>>>>>> On Mon, Nov 14, 2016 at 10:23 AM, Vladimir Ozerov <vozerov@gridgain.com>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Alexandr, Alexey,
> >>>>>>>>
> >>>>>>>> While I agree with you that the current cache loading logic is far
> >>>>>>>> from ideal, it would be cool to see API drafts based on your
> >>>>>>>> suggestions to get a better understanding of your ideas. How
> >>>>>>>> exactly are users going to use your suggestions?
> >>>>>>>>
> >>>>>>>> My main concern is that initial load is not a very trivial task in
> >>>>>>>> the general case. Some users have centralized RDBMS systems, some
> >>>>>>>> have NoSQL, others work with distributed persistent stores (e.g.
> >>>>>>>> HDFS). Sometimes we have Ignite nodes "near" the persistent data,
> >>>>>>>> sometimes we don't. Sharding, affinity, co-location, etc. If we
> >>>>>>>> try to support all (or many) cases out of the box, we may end up
> >>>>>>>> with a very messy and difficult API. So we should carefully
> >>>>>>>> balance simplicity, usability and feature richness here.
> >>>>>>>>
> >>>>>>>> Personally, I think that if a user is not satisfied with the
> >>>>>>>> "loadCache()" API, he just writes a simple closure with a streamer
> >>>>>>>> and queries and sends it to whatever node he finds convenient. Not
> >>>>>>>> a big deal. Only very common cases should be added to the Ignite
> >>>>>>>> API.
> >>>>>>>>
> >>>>>>>> Vladimir.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Mon, Nov 14, 2016 at 12:43 PM, Alexey Kuznetsov <
> >>>>>>>> akuznetsov@gridgain.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Looks good to me.
> >>>>>>>>>
> >>>>>>>>> But I would suggest considering one more use case:
> >>>>>>>>>
> >>>>>>>>> If the user knows his data he could manually split the loading.
> >>>>>>>>> For example: table Persons contains 10M rows.
> >>>>>>>>> The user could provide something like:
> >>>>>>>>>
> >>>>>>>>> cache.loadCache(null,
> >>>>>>>>>     "Person", "select * from Person where id < 1_000_000",
> >>>>>>>>>     "Person", "select * from Person where id >= 1_000_000 and id < 2_000_000",
> >>>>>>>>>     ....
> >>>>>>>>>     "Person", "select * from Person where id >= 9_000_000 and id < 10_000_000"
> >>>>>>>>> );
> >>>>>>>>>
> >>>>>>>>> or maybe it could be some descriptor object like:
> >>>>>>>>>
> >>>>>>>>> {
> >>>>>>>>>     sql: "select * from Person where id >= ? and id < ?",
> >>>>>>>>>     range: 0...10_000_000
> >>>>>>>>> }
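> >>>>>>>>>
> >>>>>>>>> With such a descriptor the store could generate the ranged queries
> >>>>>>>>> itself, e.g. (a sketch with a hypothetical runQuery helper that
> >>>>>>>>> binds the two ? parameters):
> >>>>>>>>>
> >>>>>>>>>   long step = 1_000_000;
> >>>>>>>>>
> >>>>>>>>>   for (long lo = 0; lo < 10_000_000; lo += step)
> >>>>>>>>>       runQuery(desc.sql, lo, lo + step);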
> >>>>>>>>>
> >>>>>>>>> In this case the provided queries will be sent to as many nodes
> >>>>>>>>> as there are queries, and the data will be loaded in parallel. For
> >>>>>>>>> keys that are not local a data streamer should be used (as in
> >>>>>>>>> Alexandr's description).
> >>>>>>>>>
> >>>>>>>>> I think it is a good issue for Ignite 2.0
> >>>>>>>>>
> >>>>>>>>> Vova, Val - what do you think?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Mon, Nov 14, 2016 at 4:01 PM, Alexandr Kuramshin <ein.nsk.ru@gmail.com>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> All right,
> >>>>>>>>>>
> >>>>>>>>>> Let's assume a simple scenario. When IgniteCache.loadCache is
> >>>>>>>>>> invoked, we check whether the cache is not local, and if so, we
> >>>>>>>>>> initiate the new loading logic.
> >>>>>>>>>>
> >>>>>>>>>> First, we pick a "streamer" node. This could be done by utilizing
> >>>>>>>>>> LoadBalancingSpi, or the node may be configured statically, e.g.
> >>>>>>>>>> because the streamer node runs on the same host as the
> >>>>>>>>>> persistence storage provider.
> >>>>>>>>>>
> >>>>>>>>>> After that we start the loading task on the streamer node, which
> >>>>>>>>>> creates an IgniteDataStreamer and loads the cache with
> >>>>>>>>>> CacheStore.loadCache. Every call to IgniteBiInClosure.apply
> >>>>>>>>>> simply invokes IgniteDataStreamer.addData.
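> >>>>>>>>>>
> >>>>>>>>>> A rough sketch of that task (names are illustrative; imports
> >>>>>>>>>> omitted):
> >>>>>>>>>>
> >>>>>>>>>>   try (IgniteDataStreamer<K, V> streamer =
> >>>>>>>>>>            ignite.dataStreamer(cacheName)) {
> >>>>>>>>>>       // Every entry produced by the store goes to the streamer,
> >>>>>>>>>>       // which routes it to the primary/backup nodes itself.
> >>>>>>>>>>       store.loadCache(streamer::addData, args);
> >>>>>>>>>>   }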
> >>>>>>>>>>
> >>>>>>>>>> This implementation will completely relieve the overhead on the
> >>>>>>>>>> persistence storage provider. Network overhead is also decreased
> >>>>>>>>>> in the case of partitioned caches. For two nodes we get 1 + 1/2
> >>>>>>>>>> of the data volume transferred over the network (1 part is
> >>>>>>>>>> transferred from the persistence storage to the streamer, then
> >>>>>>>>>> 1/2 from the streamer node to the other node). For three nodes
> >>>>>>>>>> it will be 1 + 2/3, and so on, approaching twice the data volume
> >>>>>>>>>> on big clusters.
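> >>>>>>>>>>
> >>>>>>>>>> To make the arithmetic explicit (primary copies only, N nodes,
> >>>>>>>>>> data set of size D):
> >>>>>>>>>>
> >>>>>>>>>>   traffic(N) = D + D * (N - 1) / N
> >>>>>>>>>>   traffic(2) = 1.5 * D, traffic(3) ~ 1.67 * D, limit = 2 * D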
> >>>>>>>>>>
> >>>>>>>>>> I'd like to propose an additional optimization at this point. If
> >>>>>>>>>> we have the streamer node on the same machine as the persistence
> >>>>>>>>>> storage provider, then we completely relieve the network
> >>>>>>>>>> overhead as well. It could be some special daemon node assigned
> >>>>>>>>>> for cache loading in the cache configuration, or an ordinary
> >>>>>>>>>> server node as well.
> >>>>>>>>>>
> >>>>>>>>>> Certainly these calculations assume an evenly partitioned cache
> >>>>>>>>>> with only primary copies (without backups). In the case of one
> >>>>>>>>>> backup (the most frequent case, I think), we get 2 times the
> >>>>>>>>>> data volume transferred over the network on two nodes, 2 + 1/3
> >>>>>>>>>> on three, 2 + 1/2 on four, and so on, up to three times the data
> >>>>>>>>>> volume on big clusters. Hence it's still better than the current
> >>>>>>>>>> implementation. In the worst case, a fully replicated cache, we
> >>>>>>>>>> transfer N+1 times the data volume over the network (where N is
> >>>>>>>>>> the number of nodes in the cluster). But that's not a problem in
> >>>>>>>>>> small clusters and only a little overhead in big clusters. And
> >>>>>>>>>> we still gain the persistence storage provider optimization.
> >>>>>>>>>>
> >>>>>>>>>> Now let's take a more complex scenario. To achieve some level of
> >>>>>>>>>> parallelism, we could split our cluster into several groups. It
> >>>>>>>>>> could be a parameter of the IgniteCache.loadCache method or a
> >>>>>>>>>> cache configuration option. The number of groups could be a
> >>>>>>>>>> fixed value, or it could be calculated dynamically from the
> >>>>>>>>>> maximum number of nodes per group.
> >>>>>>>>>>
> >>>>>>>>>> After splitting the whole cluster into groups, we take a
> >>>>>>>>>> streamer node in each group and submit the cache loading task as
> >>>>>>>>>> in the single-streamer scenario, except that only the keys that
> >>>>>>>>>> belong to the cluster group where the streamer node is running
> >>>>>>>>>> are passed to the IgniteDataStreamer.addData method.
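> >>>>>>>>>>
> >>>>>>>>>> A sketch of that per-group filter (illustrative; 'grp' is the
> >>>>>>>>>> streamer's ClusterGroup, 'streamer' its data streamer):
> >>>>>>>>>>
> >>>>>>>>>>   Affinity<K> aff = ignite.affinity(cacheName);
> >>>>>>>>>>
> >>>>>>>>>>   store.loadCache((k, v) -> {
> >>>>>>>>>>       // Stream only keys whose primary node is in this group.
> >>>>>>>>>>       if (grp.nodes().contains(aff.mapKeyToNode(k)))
> >>>>>>>>>>           streamer.addData(k, v);
> >>>>>>>>>>   }, args);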
> >>>>>>>>>>
> >>>>>>>>>> In this case the overhead grows with the number of groups (the
> >>>>>>>>>> level of parallelism), not with the number of nodes in the whole
> >>>>>>>>>> cluster.
> >>>>>>>>>>
> >>>>>>>>>> 2016-11-11 15:37 GMT+03:00 Alexey Kuznetsov <akuznetsov@apache.org>:
> >>>>>>>>>>
> >>>>>>>>>>> Alexandr,
> >>>>>>>>>>>
> >>>>>>>>>>> Could you describe your proposal in more detail?
> >>>>>>>>>>> Especially the case with several nodes.
> >>>>>>>>>>> Especially in case with several nodes.
> >>>>>>>>>>>
> >>>>>>>>>>> On Fri, Nov 11, 2016 at 6:34 PM, Alexandr Kuramshin <ein.nsk.ru@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>>
> >>>>>>>>>>>> You know the CacheStore API that is commonly used for a
> >>>>>>>>>>>> read/write-through relationship between the in-memory data and
> >>>>>>>>>>>> the persistence storage.
> >>>>>>>>>>>>
> >>>>>>>>>>>> There is also the IgniteCache.loadCache method for hot-loading
> >>>>>>>>>>>> the cache on startup. Invoking this method causes execution of
> >>>>>>>>>>>> CacheStore.loadCache on all the nodes storing the cache's
> >>>>>>>>>>>> partitions. Because no keys are passed to the
> >>>>>>>>>>>> CacheStore.loadCache methods, the underlying implementation is
> >>>>>>>>>>>> forced to read all the data from the persistence storage, but
> >>>>>>>>>>>> only part of the data will be stored on each node.
> >>>>>>>>>>>>
> >>>>>>>>>>>> So, the current implementation has two general drawbacks:
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1. The persistence storage is forced to perform as many
> >>>>>>>>>>>> identical queries as there are nodes in the cluster. Each
> >>>>>>>>>>>> query may involve a lot of additional computation on the
> >>>>>>>>>>>> persistence storage server.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 2. The network is forced to transfer much more data, which is
> >>>>>>>>>>>> obviously a big disadvantage on large systems.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The partition-aware data loading approach, described in
> >>>>>>>>>>>> https://apacheignite.readme.io/docs/data-loading#section-partition-aware-data-loading
> >>>>>>>>>>>> , is not an option. It requires persisting volatile data that
> >>>>>>>>>>>> depends on the affinity function implementation and settings.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I propose using something like IgniteDataStreamer inside the
> >>>>>>>>>>>> IgniteCache.loadCache implementation.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> --
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>> Alexandr Kuramshin
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> Alexey Kuznetsov
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Alexandr Kuramshin
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Alexey Kuznetsov
> >>>>>>>>> GridGain Systems
> >>>>>>>>> www.gridgain.com
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Thanks,
> >>>>> Alexandr Kuramshin
> >>>>>
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Thanks,
> >>> Alexandr Kuramshin
> >>>
> >>
> >>
> >>
> >> --
> >> Alexey Kuznetsov
> >>
>
>
