ignite-user mailing list archives

From Denis Magda <dma...@gridgain.com>
Subject Re: spark SQL thriftserver over ignite and cassandra
Date Wed, 05 Oct 2016 01:33:54 GMT
Hi Vincent,

See my answers inline

> On Oct 4, 2016, at 12:54 AM, vincent gromakowski <vincent.gromakowski@gmail.com>
> Hi,
> I know that Ignite has SQL support but:
> - ODBC driver doesn't seem to provide HTTP(S) support, which is easier to integrate on
corporate networks with rules, firewalls, proxies

Igor Sapego, what URIs are supported presently? 

> - The SQL engine doesn't seem to scale like Spark SQL would. For instance, Spark won't
generate OOM if the dataset (source or result) doesn't fit in memory. From Ignite side, it's not

OOM is not related to scalability at all. It is a matter of the application’s logic.

The Ignite SQL engine scales out along with your cluster. Moreover, Ignite supports
indexes, which give you O(log N) running time for SQL queries that can use them, while
with Spark you will face full scans (O(N)) all the time.
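
To make the complexity difference concrete, here is a toy sketch in plain Python. It is not Ignite or Spark code: the function names and the data set are made up for illustration, and a sorted key list stands in for an index. It only counts how many steps each lookup strategy needs.

```python
def full_scan(rows, key):
    """O(N): inspect rows one by one, as a lookup without an index must."""
    steps = 0
    for row_key, value in rows:
        steps += 1
        if row_key == key:
            return value, steps
    return None, steps


def index_lookup(sorted_keys, values, key):
    """O(log N): binary search over a sorted index on the key column."""
    lo, hi = 0, len(sorted_keys)
    steps = 0
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_keys[mid] < key:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(sorted_keys) and sorted_keys[lo] == key:
        return values[lo], steps
    return None, steps


# Hypothetical table of 100,000 rows keyed by an integer id.
rows = [(k, f"row-{k}") for k in range(100_000)]
sorted_keys = [k for k, _ in rows]
values = [v for _, v in rows]

print(full_scan(rows, 99_999)[1], "steps with a full scan")
print(index_lookup(sorted_keys, values, 99_999)[1], "steps with the index")
```

The scan touches every row in the worst case, while the indexed lookup halves the search range on each step.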

However, to benefit from Ignite SQL queries you have to keep all the data in memory. Ignite
doesn’t go to a CacheStore (Cassandra, a relational database, MongoDB, etc.) while a SQL query
is executed and won’t preload anything from an underlying CacheStore. Automatic loading from
the store (read-through) works only for key-value operations like cache.get(key).
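
The behavior described above can be pictured with a toy model in plain Python. This is not the Ignite API: the class and method names (ToyCache, query, load_all) are hypothetical, and a dict stands in for the CacheStore. In a real deployment the explicit preload step would be done through Ignite itself, for example with IgniteCache.loadCache.

```python
class ToyCache:
    """Toy model of a cache in front of a backing store (not Ignite code)."""

    def __init__(self, store):
        self.store = store   # stands in for a CacheStore such as Cassandra
        self.memory = {}     # entries currently held in memory

    def get(self, key):
        # Key-value read: on a miss, read through to the backing store.
        if key not in self.memory and key in self.store:
            self.memory[key] = self.store[key]
        return self.memory.get(key)

    def query(self, predicate):
        # SQL-style query: sees only in-memory entries, never the store.
        return sorted(k for k, v in self.memory.items() if predicate(v))

    def load_all(self):
        # Explicit preload, required before queries can see the data.
        self.memory.update(self.store)


store = {1: 10, 2: 20, 3: 30}
cache = ToyCache(store)

before = cache.query(lambda v: v >= 20)        # nothing is in memory yet
cache.get(2)                                   # read-through loads key 2 only
after_get = cache.query(lambda v: v >= 20)     # the query now sees key 2
cache.load_all()                               # preload everything
after_load = cache.query(lambda v: v >= 20)    # the query sees keys 2 and 3
```

The point of the sketch is that a query over a partially loaded cache returns an incomplete result without any error, so the preload has to happen first.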

> - Spark thrift can manage multi tenancy: different users can connect to the same SQL
engine and share cache. In Ignite it's one cache per user, so a big waste of RAM.

Everyone can connect to an Ignite cluster and work with the same set of distributed caches.
I’m not sure why you need to create caches with the same content for every user.

If you need real multi-tenancy support, where cacheA can be accessed only by users from
group A and cacheB only by users from group B, then you can take a look at GridGain, which
is built on top of Ignite.

> What I want to achieve is :
> - use Cassandra for data store as it provides idempotence (HDFS/hive doesn't), resulting
in exactly-once semantics without any duplicates. 
> - use Spark SQL thriftserver in multi tenancy for large scale adhoc analytics queries
(> TB) from an ODBC driver through HTTP(S) 
> - accelerate Cassandra reads when the data modeling of the Cassandra table doesn't fit
the queries. Queries would be OLAP style: target multiple C* partitions, groupby or filters
on lots of dimensions that aren't necessarily in the C* table key.

As mentioned, Ignite uses Cassandra as a CacheStore; keep this in mind. Before trying to
assemble the whole chain, I would recommend connecting the Spark SQL thrift server directly
to Ignite and working with its shared RDDs [1]. A shared RDD (basically an Ignite cache)
can be backed by Cassandra. This chain will probably work for you, but I can’t give
more precise guidance on it.

[1] https://apacheignite-fs.readme.io/docs/ignite-for-spark

> Thanks for your advice
> 2016-10-04 6:51 GMT+02:00 Jörn Franke <jornfranke@gmail.com>:
> I am not sure that this will be performant. What do you want to achieve here? Fast lookups?
Then the Cassandra Ignite store might be the right solution. If you want to do more analytic
style of queries then you can put the data on HDFS/Hive and use the Ignite HDFS cache to cache
certain partitions/tables in Hive in-memory. If you want to go to iterative machine learning
algorithms you can go for Spark on top of this. You can use then also Ignite cache for Spark
> On 4 Oct 2016, at 02:24, Alexey Kuznetsov <akuznetsov@gridgain.com>
>> Hi, Vincent!
>> Ignite also has SQL support (also scalable), I think it will be much faster to query
directly from Ignite than query from Spark.
>> Also please mind, that before executing queries you should load all needed data to
>> To load data from Cassandra to Ignite you may use Cassandra store [1].
>> [1] https://apacheignite.readme.io/docs/ignite-with-apache-cassandra
>> On Tue, Oct 4, 2016 at 4:19 AM, vincent gromakowski <vincent.gromakowski@gmail.com> wrote:
>> Hi,
>> I am evaluating the possibility to use Spark SQL (and its scalability) over an Ignite
cache with Cassandra persistent store to increase read workloads like OLAP style analytics.
>> Is there any way to configure Spark thriftserver to load an external table in Ignite
like we can do in Cassandra ?
>> Here is an example of config for spark backed by cassandra
>>         ( id int, data string ) 
>>         STORED BY 'org.apache.hadoop.hive.cassandra.cql.CqlStorageHandler' 
>>         TBLPROPERTIES ("cassandra.host" = "x.x.x.x", "cassandra.ks.name" = "test" , 
>>           "cassandra.cf.name" = "mytable" , 
>>           "cassandra.ks.repfactor" = "1" , 
>>           "cassandra.ks.strategy" = 
>>             "org.apache.cassandra.locator.SimpleStrategy" ); 
>> -- 
>> Alexey Kuznetsov
