ignite-user mailing list archives

From "Roger Fischer (CW)" <rfis...@Brocade.com>
Subject Question on efficient loading from Cassandra
Date Thu, 27 Jul 2017 00:03:09 GMT

What is the best way to efficiently load data from a backing store such as Cassandra? I am
looking for a solution that minimizes work in both Ignite and Cassandra.

As I understand:

The simplest way is to call loadCache() with a single select statement.
cache.loadCache(null, "select * from a_table where a_date_time >= '2017-07-25 10:00:00';");

Is it correct that:
1) Each Ignite node gets the same loadCache() request.
2) Each Ignite node sends the same query to Cassandra.
3) Each Ignite node gets all matched objects (rows) back from Cassandra.
4) Each Ignite node stores only the objects for which it has the primary partition or a backup
partition.

Unless I misunderstand, this simple approach has the following inefficiencies:
a) Cassandra executes the same query multiple times, once for each Ignite node.
b) The query results are transferred multiple times, once for each Ignite node.
c) Each Ignite node receives a lot of data that it does not need (it has neither the primary
nor a backup partition).
d) Each Cassandra node has to query all partitions.
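
To make inefficiencies b) and c) concrete, here is a rough, Ignite-free simulation of what I
believe happens. The owner function (key modulo node count) is purely an assumption for
illustration and is not Ignite's real affinity function; backups are ignored.

```java
class NaiveLoadCost {
    /** Hypothetical owner function: which node holds the primary copy of a key. */
    static int ownerOf(int key, int nodeCount) {
        return Math.floorMod(Integer.hashCode(key), nodeCount);
    }

    /** Returns {rowsTransferred, rowsKept} when every node runs the same query. */
    static long[] simulate(int nodeCount, int matchedRows) {
        long transferred = 0;
        long kept = 0;
        for (int node = 0; node < nodeCount; node++) {
            // Every node receives every matched row from the backing store...
            for (int key = 0; key < matchedRows; key++) {
                transferred++;
                // ...but keeps only the rows it owns.
                if (ownerOf(key, nodeCount) == node) {
                    kept++;
                }
            }
        }
        return new long[] {transferred, kept};
    }

    public static void main(String[] args) {
        long[] r = simulate(4, 1000);
        System.out.println("transferred=" + r[0] + " kept=" + r[1]);
    }
}
```

With 4 nodes and 1000 matched rows, 4000 rows cross the wire but only 1000 are stored, so the
overhead grows linearly with the number of Ignite nodes.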

loadCache() supports multiple queries. This allows the query to be broken down, ideally (for
this case) into one query per Cassandra partition.

cache.loadCache(null,
    "select * from a_table where partition_key = 0 and a_date_time >= '2017-07-25 10:00:00';",
    "select * from a_table where partition_key = 1 and a_date_time >= '2017-07-25 10:00:00';",
    ...);

This optimizes the Cassandra query, as each query is constrained to one Cassandra partition.
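
Writing one query per partition by hand does not scale, so the argument list would probably be
generated. A minimal sketch, assuming the partition key is a small integer in the range
0..partitionCount-1 (the helper name and that range are my assumptions):

```java
class PartitionQueries {
    /** Builds one CQL query per partition key value 0..partitionCount-1. */
    static String[] build(String table, int partitionCount, String since) {
        String[] queries = new String[partitionCount];
        for (int p = 0; p < partitionCount; p++) {
            queries[p] = "select * from " + table
                + " where partition_key = " + p
                + " and a_date_time >= '" + since + "';";
        }
        return queries;
    }

    public static void main(String[] args) {
        for (String q : build("a_table", 3, "2017-07-25 10:00:00")) {
            System.out.println(q);
        }
    }
}
```

The resulting array could then be handed to loadCache() as its variable arguments, for example
cache.loadCache(null, (Object[]) queries).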

However, I think each node still has to execute every query, so inefficiencies a) to c)
remain.

I believe that, when multiple cores (worker threads) are available, the Ignite nodes will
execute multiple queries in parallel. So, there is a reduction in elapsed time. Correct?
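
What I have in mind by "in parallel" is roughly the following. This is only a sketch of the
idea using a plain thread pool; I do not know how Ignite schedules the queries internally,
and runQuery is a hypothetical stand-in for whatever actually talks to Cassandra.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ParallelQueries {
    /** Runs each query on a fixed-size pool and returns results in query order. */
    static <R> List<R> runAll(List<String> queries, Function<String, R> runQuery, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (String q : queries) {
                Callable<R> task = () -> runQuery.apply(q);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> f : futures) {
                results.add(f.get()); // blocks until that query has finished
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

If this is close to what happens, elapsed time drops with the number of worker threads, but
the total work done by Cassandra and the total data transferred stay the same.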

Now, is there any way to avoid having Cassandra execute the same query multiple times and
transfer the same data multiple times?

One approach would be that an Ignite node modifies the query so that it only includes the
partitions for which it has the primary or a backup partition. That eliminates some duplication,
but may not result in efficient queries in Cassandra.
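
As a sketch of this first approach: each node would build its own query from the partition key
values it is responsible for. How a node would learn that set is left open here; the mapping
from Cassandra partition keys to Ignite nodes is an assumption, and the helper name is made up.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.StringJoiner;

class NodeLocalQuery {
    /** Builds a single query restricted to the partition keys this node owns. */
    static String build(String table, Collection<Integer> ownedKeys, String since) {
        if (ownedKeys.isEmpty()) {
            throw new IllegalArgumentException("node owns no partition keys");
        }
        StringJoiner in = new StringJoiner(", ", "(", ")");
        for (int key : ownedKeys) {
            in.add(Integer.toString(key));
        }
        return "select * from " + table
            + " where partition_key in " + in
            + " and a_date_time >= '" + since + "';";
    }

    public static void main(String[] args) {
        System.out.println(build("a_table", Arrays.asList(0, 3, 7), "2017-07-25 10:00:00"));
    }
}
```

A query with a long IN list still touches many Cassandra partitions at once, which is why I
suspect it may not be efficient on the Cassandra side.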

Another approach is that Ignite forwards objects for which it is not the primary or does not
have a backup (similar to when an application does a put()). That would optimize the Cassandra
query, but require additional communications between Ignite nodes.
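
The forwarding idea can be sketched as a single loader that reads each row once and routes it
to its owner. Again, the owner function (key modulo node count) is only an assumption for
illustration, and the per-node lists stand in for network sends between Ignite nodes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ForwardingLoad {
    /** Groups keys by owning node; each key is read from the store only once. */
    static Map<Integer, List<Integer>> route(List<Integer> keys, int nodeCount) {
        Map<Integer, List<Integer>> perNode = new HashMap<>();
        for (int key : keys) {
            // Hypothetical owner function, not Ignite's real affinity function.
            int owner = Math.floorMod(Integer.hashCode(key), nodeCount);
            perNode.computeIfAbsent(owner, n -> new ArrayList<>()).add(key);
        }
        return perNode;
    }

    public static void main(String[] args) {
        System.out.println(route(Arrays.asList(0, 1, 2, 3, 4, 5), 3));
    }
}
```

Here Cassandra returns each row exactly once, and the cost moves to the Ignite side: every row
that is not owned by the loading node needs one extra hop.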

What if Ignite and Cassandra partitions were aligned? Then queries could be created that
return only data relevant to the node and query only a subset of the Cassandra partitions.
But this does not seem practical for a general-purpose system (I think).

Any other suggestions?



PS: The use case for this is to use Ignite as an SQL cache for a large data set in the Cassandra
DB. The most recent data is pre-loaded (and updated) in Ignite. When older data is required,
it is loaded first into Ignite, and then processed. It is this dynamic loading that should
be quick (and efficient).
