ignite-dev mailing list archives

From Alexandr Kuramshin <ein.nsk...@gmail.com>
Subject Re: IgniteCache.loadCache improvement proposal
Date Tue, 15 Nov 2016 08:18:08 GMT
Hi all,

I think the discussion is going in the wrong direction. Certainly it's
not a big deal to implement some custom user logic to load the data into
caches. But the Ignite framework gives the user reusable code built on
top of the basic system.

So the main question is: why should developers leave the user with a
convenient way to load caches that is a totally non-optimal solution?

We could talk at length about different persistence storage types, but
whenever loading is initiated with IgniteCache.loadCache, the current
implementation imposes a lot of overhead on the network.

Partition-aware data loading may be used in some scenarios to avoid this
network overhead, but users are compelled to take additional steps to
achieve this optimization: adding a partition column to tables, adding
compound indices that include the added column, writing a piece of
repetitive code to load the data into different caches in a
fault-tolerant fashion, etc.
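
For illustration, a minimal sketch of that per-node boilerplate
(assuming a PERSONS table with an extra PART_ID column, a JDBC
connection "conn", an injected "ignite" instance and a Person value
class; all names are illustrative):

import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.cache.integration.CacheLoaderException;
import org.apache.ignite.cache.affinity.Affinity;
import org.apache.ignite.lang.IgniteBiInClosure;

// Each node queries only its own primary partitions; this works only
// because PART_ID was persisted alongside the data beforehand.
@Override public void loadCache(IgniteBiInClosure<Long, Person> clo, Object... args) {
    Affinity<Long> aff = ignite.affinity("personCache");

    for (int part : aff.primaryPartitions(ignite.cluster().localNode())) {
        try (PreparedStatement st = conn.prepareStatement(
            "SELECT id, name FROM PERSONS WHERE PART_ID = ?")) {
            st.setInt(1, part);

            try (ResultSet rs = st.executeQuery()) {
                while (rs.next())
                    clo.apply(rs.getLong(1),
                        new Person(rs.getLong(1), rs.getString(2)));
            }
        }
        catch (SQLException e) {
            throw new CacheLoaderException(e);
        }
    }
}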

Let's give the user reusable code which is convenient, reliable and
fast.

2016-11-14 20:56 GMT+03:00 Valentin Kulichenko <
valentin.kulichenko@gmail.com>:

> Hi Alexandr,
>
> Data streamer is already outlined as one of the possible approaches for
> loading the data [1]. Basically, you start a designated client node or
> choose a leader among the server nodes [2] and then use the
> IgniteDataStreamer API to load the data. With this approach there is no
> need to have a CacheStore implementation at all. Can you please
> elaborate on what additional value you are trying to add here?
>
> [1] https://apacheignite.readme.io/docs/data-loading#ignitedatastreamer
> [2] https://apacheignite.readme.io/docs/leader-election
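>
> For reference, a minimal sketch of that approach (assuming a cache
> named "personCache", a Person value class and a JDBC statement as the
> source; all names are illustrative):
>
> import java.sql.ResultSet;
> import java.sql.Statement;
> import org.apache.ignite.Ignite;
> import org.apache.ignite.IgniteDataStreamer;
>
> // Runs on the designated loader node; no CacheStore is involved.
> void load(Ignite ignite, Statement stmt) throws Exception {
>     try (IgniteDataStreamer<Long, Person> streamer =
>         ignite.dataStreamer("personCache");
>         ResultSet rs = stmt.executeQuery("SELECT id, name FROM PERSONS")) {
>         while (rs.next())
>             streamer.addData(rs.getLong(1),
>                 new Person(rs.getLong(1), rs.getString(2)));
>     }
> }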
>
> -Val
>
> On Mon, Nov 14, 2016 at 8:23 AM, Dmitriy Setrakyan <dsetrakyan@apache.org>
> wrote:
>
> > Hi,
> >
> > I just want to clarify a couple of API details from the original email to
> > make sure that we are making the right assumptions here.
> >
> > *"Because of none keys are passed to the CacheStore.loadCache methods,
> the
> > > underlying implementation is forced to read all the data from the
> > > persistence storage"*
> >
> >
> > According to the javadoc, the loadCache(...) method receives optional
> > arguments from the user. You can pass anything you like, including a
> > list of keys, an SQL where clause, etc.
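> >
> > For example (a sketch; the CacheStore implementation decides how to
> > interpret the argument):
> >
> > // Pass an SQL WHERE clause down to the store:
> > cache.loadCache(null, "id < 1000000");
> >
> > // Inside CacheStore.loadCache(clo, args) the store would read
> > // args[0] and append it to its SELECT statement.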
> >
> > *"The partition-aware data loading approach is not a choice. It requires
> > > persistence of the volatile data depended on affinity function
> > > implementation and settings."*
> >
> >
> > This is only partially true. While Ignite allows plugging in custom
> > affinity functions, the affinity function is not something that
> > changes dynamically, and it should always return the same partition
> > for the same key. So the partition assignments are not volatile at
> > all. If, in some very rare case, the partition assignment logic needs
> > to change, then you could also update the partition assignments that
> > you may have persisted elsewhere, e.g. in a database.
> >
> > D.
> >
> > On Mon, Nov 14, 2016 at 10:23 AM, Vladimir Ozerov <vozerov@gridgain.com>
> > wrote:
> >
> > > Alexandr, Alexey,
> > >
> > > While I agree with you that the current cache loading logic is far
> > > from ideal, it would be cool to see API drafts based on your
> > > suggestions to get a better understanding of your ideas. How exactly
> > > are users going to use your suggestions?
> > >
> > > My main concern is that the initial load is not a trivial task in
> > > the general case. Some users have centralized RDBMS systems, some
> > > have NoSQL, others work with distributed persistent stores (e.g.
> > > HDFS). Sometimes we have Ignite nodes "near" the persistent data,
> > > sometimes we don't. Sharding, affinity, co-location, etc. If we try
> > > to support all (or many) cases out of the box, we may end up with a
> > > very messy and difficult API. So we should carefully balance
> > > simplicity, usability and rich features here.
> > >
> > > Personally, I think that if a user is not satisfied with the
> > > "loadCache()" API, he just writes a simple closure, with blackjack,
> > > a streamer and queries, and sends it to whatever node he finds
> > > convenient. Not a big deal. Only very common cases should be added
> > > to the Ignite API.
> > >
> > > Vladimir.
> > >
> > >
> > > On Mon, Nov 14, 2016 at 12:43 PM, Alexey Kuznetsov <
> > > akuznetsov@gridgain.com>
> > > wrote:
> > >
> > > > Looks good to me.
> > > >
> > > > But I would suggest considering one more use case:
> > > >
> > > > If the user knows their data, they could split the loading
> > > > manually. For example: table Person contains 10M rows. The user
> > > > could provide something like:
> > > >
> > > > cache.loadCache(null,
> > > >     "Person", "select * from Person where id < 1_000_000",
> > > >     "Person", "select * from Person where id >= 1_000_000 and id < 2_000_000",
> > > >     ....
> > > >     "Person", "select * from Person where id >= 9_000_000 and id < 10_000_000"
> > > > );
> > > >
> > > > or maybe it could be some descriptor object like:
> > > >
> > > > {
> > > >     sql: "select * from Person where id >= ? and id < ?",
> > > >     range: 0...10_000_000
> > > > }
> > > >
> > > > In this case the provided queries would be sent to as many nodes
> > > > as there are queries, and the data would be loaded in parallel;
> > > > for keys that are not local, a data streamer should be used (as
> > > > described in Alexandr's proposal). See the sketch below.
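> > > >
> > > > For illustration, a rough sketch of such a split with the
> > > > existing API (the "personCache" name and the loadRange helper,
> > > > which would run the range query and feed each row to the
> > > > streamer, are assumptions):
> > > >
> > > > Collection<IgniteRunnable> jobs = new ArrayList<>();
> > > >
> > > > for (long lo = 0; lo < 10_000_000L; lo += 1_000_000L) {
> > > >     final long from = lo;
> > > >     final long to = lo + 1_000_000L;
> > > >
> > > >     jobs.add(() -> {
> > > >         Ignite loc = Ignition.localIgnite();
> > > >
> > > >         // The streamer routes every entry to its owner node, so
> > > >         // the job may run anywhere in the cluster.
> > > >         try (IgniteDataStreamer<Long, Person> streamer =
> > > >             loc.dataStreamer("personCache")) {
> > > >             loadRange(streamer, from, to);  // hypothetical helper
> > > >         }
> > > >     });
> > > > }
> > > >
> > > > // Executes the jobs in parallel across the cluster.
> > > > ignite.compute().run(jobs);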
> > > >
> > > > I think it is a good issue for Ignite 2.0
> > > >
> > > > Vova, Val - what do you think?
> > > >
> > > >
> > > > On Mon, Nov 14, 2016 at 4:01 PM, Alexandr Kuramshin <
> > > ein.nsk.ru@gmail.com>
> > > > wrote:
> > > >
> > > >> All right,
> > > >>
> > > >> Let's assume a simple scenario. When IgniteCache.loadCache is
> > > >> invoked, we check whether the cache is non-local, and if so, we
> > > >> initiate the new loading logic.
> > > >>
> > > >> First, we pick a "streamer" node. This could be done by
> > > >> utilizing LoadBalancingSpi, or the node may be configured
> > > >> statically, e.g. because the streamer node runs on the same host
> > > >> as the persistence storage provider.
> > > >>
> > > >> After that we start the loading task on the streamer node, which
> > > >> creates an IgniteDataStreamer and loads the cache with
> > > >> CacheStore.loadCache. Every call to IgniteBiInClosure.apply
> > > >> simply invokes IgniteDataStreamer.addData, as in the sketch
> > > >> below.
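> > > >>
> > > >> A minimal sketch of that task (assuming the store instance and
> > > >> the "personCache" name are available on the streamer node):
> > > >>
> > > >> // Runs on the chosen streamer node: a single CacheStore.loadCache
> > > >> // pass feeds a data streamer, which routes entries to their owners.
> > > >> void loadOnStreamerNode(Ignite ignite, CacheStore<Long, Person> store) {
> > > >>     try (IgniteDataStreamer<Long, Person> streamer =
> > > >>         ignite.dataStreamer("personCache")) {
> > > >>         // The bi-closure simply forwards each loaded pair.
> > > >>         store.loadCache((key, val) -> streamer.addData(key, val));
> > > >>     }
> > > >> }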
> > > >>
> > > >> This implementation completely relieves the overhead on the
> > > >> persistence storage provider. Network overhead is also decreased
> > > >> in the case of partitioned caches. For two nodes we transfer
> > > >> 1 + 1/2 times the data (1 part is transferred from the
> > > >> persistence storage to the streamer, and then 1/2 from the
> > > >> streamer node to the other node). For three nodes it is 1 + 2/3,
> > > >> and so on, up to twice the amount of data on big clusters.
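> > > >>
> > > >> In general (a sketch of the arithmetic, assuming the streamer is
> > > >> one of the N server nodes and a primary-only partitioned cache):
> > > >>
> > > >> T(N) = 1 + \frac{N - 1}{N} \longrightarrow 2 \quad (N \to \infty)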
> > > >>
> > > >> I'd like to propose an additional optimization here. If we have
> > > >> the streamer node on the same machine as the persistence storage
> > > >> provider, then we completely relieve the network overhead of the
> > > >> initial read as well. It could be a special daemon node dedicated
> > > >> to cache loading and assigned in the cache configuration, or an
> > > >> ordinary server node.
> > > >>
> > > >> Certainly these calculations assume an evenly partitioned cache
> > > >> with only primary copies (without backups). In the case of one
> > > >> backup (the most frequent case, I think), we transfer 2 times the
> > > >> data on two nodes, 2 + 1/3 on three, 2 + 1/2 on four, and so on,
> > > >> up to three times the amount of data on big clusters. Hence it's
> > > >> still better than the current implementation. In the worst case,
> > > >> a fully replicated cache, we transfer N+1 times the data (where N
> > > >> is the number of nodes in the cluster). But that's not a problem
> > > >> in small clusters, and only a little overhead in big ones. And we
> > > >> still gain the persistence storage provider optimization.
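> > > >>
> > > >> The same arithmetic generalizes to b backups (again assuming the
> > > >> streamer is one of the N server nodes; the N+1 figure for the
> > > >> replicated case corresponds to a streamer that is not itself a
> > > >> data node):
> > > >>
> > > >> T_b(N) = 1 + (b + 1)\,\frac{N - 1}{N} \longrightarrow b + 2 \quad (N \to \infty)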
> > > >>
> > > >> Now let's take a more complex scenario. To achieve some level of
> > > >> parallelism, we could split our cluster into several groups. This
> > > >> could be a parameter of the IgniteCache.loadCache method or a
> > > >> cache configuration option. The number of groups could be a fixed
> > > >> value, or it could be calculated dynamically from the maximum
> > > >> number of nodes per group.
> > > >>
> > > >> After splitting the whole cluster into groups, we take a
> > > >> streamer node in each group and submit a cache-loading task
> > > >> similar to the single-streamer scenario, except that only the
> > > >> keys that map to the streamer node's cluster group are passed to
> > > >> IgniteDataStreamer.addData, as sketched below.
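> > > >>
> > > >> A sketch of the per-group filter (assuming each task knows its
> > > >> group's nodes as a Collection<ClusterNode> named groupNodes):
> > > >>
> > > >> // Inside the loading task on a group's streamer node: skip keys
> > > >> // whose primary copy belongs to another group.
> > > >> Affinity<Long> aff = ignite.affinity("personCache");
> > > >>
> > > >> store.loadCache((key, val) -> {
> > > >>     if (groupNodes.contains(aff.mapKeyToNode(key)))
> > > >>         streamer.addData(key, val);
> > > >> });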
> > > >>
> > > >> In this case the network overhead grows in proportion to the
> > > >> level of parallelism, rather than to the total number of nodes in
> > > >> the cluster.
> > > >>
> > > >> 2016-11-11 15:37 GMT+03:00 Alexey Kuznetsov <akuznetsov@apache.org
> >:
> > > >>
> > > >> > Alexandr,
> > > >> >
> > > >> > Could you describe your proposal in more detail?
> > > >> > Especially the case with several nodes.
> > > >> >
> > > >> > On Fri, Nov 11, 2016 at 6:34 PM, Alexandr Kuramshin <
> > > >> ein.nsk.ru@gmail.com>
> > > >> > wrote:
> > > >> >
> > > >> > > Hi,
> > > >> > >
> > > >> > > You know the CacheStore API that is commonly used for a
> > > >> > > read/write-through relationship between the in-memory data
> > > >> > > and the persistence storage.
> > > >> > >
> > > >> > > There is also the IgniteCache.loadCache method for
> > > >> > > hot-loading the cache on startup. Invocation of this method
> > > >> > > causes execution of CacheStore.loadCache on all the nodes
> > > >> > > storing the cache's partitions. Because no keys are passed to
> > > >> > > the CacheStore.loadCache methods, the underlying
> > > >> > > implementation is forced to read all the data from the
> > > >> > > persistence storage, although only part of the data will be
> > > >> > > stored on each node.
> > > >> > >
> > > >> > > So, the current implementation has two general drawbacks:
> > > >> > >
> > > >> > > 1. The persistence storage is forced to perform as many
> > > >> > > identical queries as there are nodes in the cluster. Each
> > > >> > > query may involve a lot of additional computation on the
> > > >> > > persistence storage server.
> > > >> > >
> > > >> > > 2. The network is forced to transfer much more data, which is
> > > >> > > obviously a big disadvantage on large systems.
> > > >> > >
> > > >> > > The partition-aware data loading approach, described in
> > > >> > > https://apacheignite.readme.io/docs/data-loading#section-
> > > >> > > partition-aware-data-loading
> > > >> > > , is not an option. It requires persisting volatile data that
> > > >> > > depends on the affinity function implementation and settings.
> > > >> > >
> > > >> > > I propose using something like IgniteDataStreamer inside the
> > > >> > > IgniteCache.loadCache implementation.
> > > >> > >
> > > >> > >
> > > >> > > --
> > > >> > > Thanks,
> > > >> > > Alexandr Kuramshin
> > > >> > >
> > > >> >
> > > >> >
> > > >> >
> > > >> > --
> > > >> > Alexey Kuznetsov
> > > >> >
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> Thanks,
> > > >> Alexandr Kuramshin
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > Alexey Kuznetsov
> > > > GridGain Systems
> > > > www.gridgain.com
> > > >
> > >
> >
>



-- 
Thanks,
Alexandr Kuramshin
