ignite-user mailing list archives

From Denis Magda <dma...@gridgain.com>
Subject Re: spark SQL thriftserver over ignite and cassandra
Date Wed, 05 Oct 2016 22:12:14 GMT

Please see below

> On Oct 5, 2016, at 4:31 AM, vincent gromakowski <vincent.gromakowski@gmail.com>
> Hi
> thanks for your explanations. Please find inline more questions 
> Vincent
> 2016-10-05 3:33 GMT+02:00 Denis Magda <dmagda@gridgain.com>:
> Hi Vincent,
> See my answers inline
>> On Oct 4, 2016, at 12:54 AM, vincent gromakowski <vincent.gromakowski@gmail.com> wrote:
>> Hi,
>> I know that Ignite has SQL support but:
>> - The ODBC driver doesn't seem to provide HTTP(S) support, which is easier to integrate
>> on corporate networks with rules, firewalls, proxies
> Igor Sapego, what URIs are supported presently? 
>> - The SQL engine doesn't seem to scale like Spark SQL would. For instance, Spark
>> won't generate an OOM if the dataset (source or result) doesn't fit in memory. From the
>> Ignite side, it's not clear…
> OOM is not related to scalability at all; it is a matter of application logic.

> The Ignite SQL engine scales out along with your cluster. Moreover, Ignite supports
> indexes, which give you O(log N) running time for your SQL queries, whereas
> with Spark you will face full scans (O(N)) all the time.
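
As a rough illustration of the O(log N) versus O(N) point above, here is a minimal Python sketch that counts loop iterations for a binary search over sorted keys and for a linear scan. It is a toy model only: the function names and the data are made up for illustration and say nothing about how Ignite or Spark actually implement their query engines.

```python
def full_scan(keys, target):
    """Linear scan: examine keys one by one, O(N) iterations."""
    steps = 0
    for key in keys:
        steps += 1
        if key == target:
            return True, steps
    return False, steps


def indexed_lookup(sorted_keys, target):
    """Binary search over sorted keys, O(log N) iterations."""
    lo, hi, steps = 0, len(sorted_keys), 0
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    found = lo < len(sorted_keys) and sorted_keys[lo] == target
    return found, steps


# Hypothetical data set: one million sorted integer keys.
keys = list(range(1_000_000))
scan_found, scan_steps = full_scan(keys, 999_999)
index_found, index_steps = indexed_lookup(keys, 999_999)
print(scan_steps, index_steps)
```

For the last key, the scan touches every one of the million entries, while the binary search needs at most about twenty iterations.
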
> However, to benefit from Ignite SQL queries you have to put all the data in memory. Ignite
> doesn’t go to a CacheStore (Cassandra, relational database, MongoDB, etc.) while a SQL query
> is executed, and it won’t preload anything from an underlying CacheStore. Automatic preloading
> works only for key-value operations like cache.get(key).
> This is an issue because I will potentially have to query TBs of data. If I use the Spark
> thriftserver backed by IgniteRDD, does that solve this point, and can I get automatic preloading
> from C*?

IgniteRDD will load missing key-value pairs (tuples) from Cassandra because IgniteRDD is
essentially an IgniteCache and Cassandra is a CacheStore. The only thing left to check is whether
the Spark thriftserver can work with IgniteRDDs. I hope you will be able to figure this out and
share your feedback with us.
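
The difference between the two access paths discussed here can be sketched with a toy model in Python. This is only an illustration under stated assumptions, not the Ignite API: `ReadThroughCache`, `get`, `query`, and the dict standing in for Cassandra are hypothetical names. A key-value read falls through to the backing store on a miss, while a query only sees entries that are already in memory.

```python
class ReadThroughCache:
    """Toy stand-in for a cache backed by a persistent store."""

    def __init__(self, store):
        self._store = store   # hypothetical backing store (plays the role of Cassandra)
        self._memory = {}     # entries currently held in memory

    def get(self, key):
        # Key-value read: on a miss, load the entry from the backing store.
        if key not in self._memory and key in self._store:
            self._memory[key] = self._store[key]
        return self._memory.get(key)

    def query(self, predicate):
        # Query-style read: evaluated over in-memory entries only, never the store.
        return sorted(k for k, v in self._memory.items() if predicate(v))


store = {1: "a", 2: "b", 3: "c"}
cache = ReadThroughCache(store)

before = cache.query(lambda v: True)  # nothing has been loaded yet
value = cache.get(2)                  # miss, so the entry is loaded from the store
after = cache.query(lambda v: True)   # the query now sees only the loaded entry
print(before, value, after)
```

In this model a query issued before any key-value reads returns nothing, which is why the data has to be loaded into the cache before queries are run against it.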

>> - Spark thrift can manage multi-tenancy: different users can connect to the same
>> SQL engine and share a cache. In Ignite it's one cache per user, so a big waste of RAM.
> Everyone can connect to an Ignite cluster and work with the same set of distributed caches.
> I’m not sure why you need to create caches with the same content for every user.
> It's a security issue: an Ignite cache doesn't provide multiple user accounts per cache.
> I am thinking of using Spark to authenticate multiple users and then having Spark use a shared
> account on the Ignite cache.
Ignite provides basic security interfaces and some implementations that you can build your
secure solution on. This article may be useful for your case:
http://smartkey.co.uk/development/securing-an-apache-ignite-cluster/


> If you need real multi-tenancy support, where cacheA may be accessed only by users in
> group A and cacheB only by users in group B, then you can take a look at GridGain,
> which is built on top of Ignite:
> https://gridgain.readme.io/docs/multi-tenancy
> OK, but I am evaluating open-source-only solutions (kylin, druid, alluxio...); it's a
> constraint from my management
>> What I want to achieve is :
>> - use Cassandra as the data store, as it provides idempotence (HDFS/hive doesn't), resulting
>> in exactly-once semantics without any duplicates.
>> - use the Spark SQL thriftserver in multi-tenancy mode for large-scale ad hoc analytics queries
>> (> TB) from an ODBC driver through HTTP(S)
>> - accelerate Cassandra reads when the data modeling of the Cassandra table doesn't
>> fit the queries. Queries would be OLAP style: target multiple C* partitions, with group-bys or
>> filters on lots of dimensions that aren't necessarily in the C* table key.
> As mentioned, Ignite uses Cassandra as a CacheStore; you should keep this in mind.
> Before trying to assemble the whole chain, I would recommend that you try connecting the Spark
> SQL thrift server directly to Ignite and working with its shared RDDs [1]. A shared RDD
> (basically an Ignite cache) can be backed by Cassandra. This chain will probably work for you,
> but I can’t give more precise guidance on this.
> I will try to make it work and give you feedback
> [1] https://apacheignite-fs.readme.io/docs/ignite-for-spark
> —
> Denis
>> Thanks for your advice
>> 2016-10-04 6:51 GMT+02:00 Jörn Franke <jornfranke@gmail.com>:
>> I am not sure that this will be performant. What do you want to achieve here? Fast
>> lookups? Then the Cassandra Ignite store might be the right solution. If you want to run more
>> analytics-style queries, then you can put the data on HDFS/Hive and use the Ignite HDFS cache
>> to cache certain Hive partitions/tables in memory. If you want to run iterative machine
>> learning algorithms, you can use Spark on top of this. You can then also use the Ignite cache
>> for Spark RDDs.
>> On 4 Oct 2016, at 02:24, Alexey Kuznetsov <akuznetsov@gridgain.com>
>>> Hi, Vincent!
>>> Ignite also has SQL support (also scalable); I think it will be much faster to
>>> query Ignite directly than to query from Spark.
>>> Also, please keep in mind that before executing queries you should load all the needed
>>> data into the cache.
>>> To load data from Cassandra into Ignite you can use the Cassandra store [1].
>>> [1] https://apacheignite.readme.io/docs/ignite-with-apache-cassandra
>>> On Tue, Oct 4, 2016 at 4:19 AM, vincent gromakowski <vincent.gromakowski@gmail.com> wrote:
>>> Hi,
>>> I am evaluating the possibility of using Spark SQL (and its scalability) over an
>>> Ignite cache with a Cassandra persistent store to accelerate read workloads like OLAP-style
>>> analytics.
>>> Is there any way to configure the Spark thriftserver to load an external table in
>>> Ignite like we can do with Cassandra?
>>> Here is an example of a config for Spark backed by Cassandra:
>>>         ( id int, data string )
>>>         STORED BY 'org.apache.hadoop.hive.cassandra.cql.CqlStorageHandler'
>>>         TBLPROPERTIES ("cassandra.host" = "x.x.x.x", "cassandra.ks.name" = "test" ,
>>>           "cassandra.cf.name" = "mytable" ,
>>>           "cassandra.ks.repfactor" = "1" ,
>>>           "cassandra.ks.strategy" =
>>>             "org.apache.cassandra.locator.SimpleStrategy" );
>>> -- 
>>> Alexey Kuznetsov
