Subject: Re: spark SQL thriftserver over ignite and cassandra
From: Denis Magda <dmagda@gridgain.com>
Date: Wed, 5 Oct 2016 15:12:14 -0700
Cc: Igor Sapego
To: user@ignite.apache.org

Vincent,

Please see below.

> On Oct 5, 2016, at 4:31 AM, vincent gromakowski <vincent.gromakowski@gmail.com> wrote:
>
> Hi,
> Thanks for your explanations. Please find more questions inline.
>
> Vincent
>
> 2016-10-05 3:33 GMT+02:00 Denis Magda <dmagda@gridgain.com>:
> Hi Vincent,
>
> See my answers inline.
>
>> On Oct 4, 2016, at 12:54 AM, vincent gromakowski <vincent.gromakowski@gmail.com> wrote:
>>
>> Hi,
>> I know that Ignite has SQL support, but:
>> - The ODBC driver doesn't seem to provide HTTP(S) support, which is easier to integrate on corporate networks with rules, firewalls, and proxies.
>
> Igor Sapego, what URIs are supported presently?
>
>> - The SQL engine doesn't seem to scale the way Spark SQL would. For instance, Spark won't generate an OOM if the dataset (source or result) doesn't fit in memory. From the Ignite side, it's not clear…
>
> OOM is not related to scalability at all; it is a matter of the application's logic.
>
> The Ignite SQL engine scales out along with your cluster.
> Moreover, Ignite supports indexes, which give you O(log N) running time for your SQL queries, whereas with Spark you will face full scans (O(N)) all the time.
>
> However, to benefit from Ignite SQL queries you have to put all the data in memory. Ignite doesn't go to a CacheStore (Cassandra, a relational database, MongoDB, etc.) while a SQL query is executed, and it won't preload anything from an underlying CacheStore. Automatic preloading works for key-value operations like cache.get(key).
>
> This is an issue because I will potentially have to query TBs of data. If I use the Spark thriftserver backed by an IgniteRDD, does it solve this point, and can I get automatic preloading from C*?

An IgniteRDD will load missing key-value tuples from Cassandra, because an IgniteRDD is essentially an IgniteCache and Cassandra is its CacheStore. The only thing left to check is whether the Spark thriftserver can work with IgniteRDDs. I hope you will be able to figure this out and share your feedback with us.

>> - Spark thrift can manage multi-tenancy: different users can connect to the same SQL engine and share the cache. In Ignite it's one cache per user, so a big waste of RAM.
>
> Everyone can connect to an Ignite cluster and work with the same set of distributed caches. I'm not sure why you would need to create caches with the same content for every user.
>
> It's a security issue: an Ignite cache doesn't provide multiple user accounts per cache. I am thinking of using Spark to authenticate multiple users, and then having Spark use a shared account on the Ignite cache.

Basically, Ignite provides basic security interfaces and some implementations that you can rely on when building your secure solution.
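Denis's complexity point above (indexed O(log N) lookups versus O(N) full scans) can be illustrated outside Ignite with a toy sketch in plain Python; nothing here is Ignite-specific, and the "table" and "index" are illustrative stand-ins:

```python
import bisect

# Toy "table" of (key, value) rows in arbitrary order; an unindexed
# engine must touch every row to answer a key lookup.
rows = [(i * 7919 % 10007, f"val-{i}") for i in range(10007)]

def full_scan(rows, key):
    """O(N): examine every row, as an unindexed query would."""
    steps = 0
    for k, v in rows:
        steps += 1
        if k == key:
            return v, steps
    return None, steps

# Build a sorted index over the key column, as CREATE INDEX would.
index = sorted(rows)
keys = [k for k, _ in index]

def indexed_lookup(keys, index, key):
    """O(log N): binary search over the sorted index."""
    pos = bisect.bisect_left(keys, key)
    if pos < len(keys) and keys[pos] == key:
        return index[pos][1]
    return None

# Looking up the key of the last row: the scan walks all 10007 rows,
# while the indexed lookup needs ~14 comparisons.
target = rows[-1][0]
value_scan, scan_steps = full_scan(rows, target)
value_idx = indexed_lookup(keys, index, target)
```

The same trade-off is why the thread contrasts Ignite's indexed SQL with Spark's scan-based execution over raw RDDs.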
> This article may be useful for your case: http://smartkey.co.uk/development/securing-an-apache-ignite-cluster/
>
> —
> Denis
>
> If you need real multi-tenancy support, where cacheA may be accessed only by users from group A and cacheB only by users from group B, then you can take a look at GridGain, which is built on top of Ignite:
> https://gridgain.readme.io/docs/multi-tenancy
>
> OK, but I am evaluating open-source-only solutions (Kylin, Druid, Alluxio...); it's a constraint from my hierarchy.
>>
>> What I want to achieve is:
>> - use Cassandra as the data store, as it provides idempotence (HDFS/Hive doesn't), resulting in exactly-once semantics without any duplicates;
>> - use the Spark SQL thriftserver in multi-tenancy for large-scale ad-hoc analytics queries (> TB) from an ODBC driver through HTTP(S);
>> - accelerate Cassandra reads when the data modeling of the Cassandra table doesn't fit the queries. Queries would be OLAP-style: targeting multiple C* partitions, with group-bys or filters on lots of dimensions that aren't necessarily in the C* table key.
>
> As was mentioned, Ignite uses Cassandra as a CacheStore; you should keep this in mind. Before trying to assemble the whole chain, I would recommend trying to connect the Spark SQL thrift server directly to Ignite and working with its shared RDDs [1]. A shared RDD (basically an Ignite cache) can be backed by Cassandra. This chain will probably work for you, but I can't give more precise guidance on it.
>
> I will try to make it work and give you feedback.
>
> [1] https://apacheignite-fs.readme.io/docs/ignite-for-spark
>
> —
> Denis
>
>> Thanks for your advice
>>
>> 2016-10-04 6:51 GMT+02:00 Jörn Franke <jornfranke@gmail.com>:
>> I am not sure that this will be performant. What do you want to achieve here? Fast lookups? Then the Cassandra Ignite store might be the right solution.
>> If you want to do more analytic-style queries, you can put the data on HDFS/Hive and use the Ignite HDFS cache to cache certain partitions/tables of Hive in memory. If you want to move on to iterative machine learning algorithms, you can go for Spark on top of this; you can then also use the Ignite cache for Spark RDDs.
>>
>> On 4 Oct 2016, at 02:24, Alexey Kuznetsov <akuznetsov@gridgain.com> wrote:
>>
>>> Hi, Vincent!
>>>
>>> Ignite also has SQL support (also scalable); I think it will be much faster to query directly from Ignite than to query from Spark.
>>> Also please mind that before executing queries you should load all the needed data into the cache.
>>> To load data from Cassandra into Ignite you may use the Cassandra store [1].
>>>
>>> [1] https://apacheignite.readme.io/docs/ignite-with-apache-cassandra
>>>
>>> On Tue, Oct 4, 2016 at 4:19 AM, vincent gromakowski <vincent.gromakowski@gmail.com> wrote:
>>> Hi,
>>> I am evaluating the possibility of using Spark SQL (and its scalability) over an Ignite cache with a Cassandra persistent store, to speed up read workloads like OLAP-style analytics.
>>> Is there any way to configure the Spark thriftserver to load an external table in Ignite, like we can do with Cassandra?
>>> Here is an example of the config for Spark backed by Cassandra:
>>>
>>> CREATE EXTERNAL TABLE MyHiveTable
>>>   ( id int, data string )
>>>   STORED BY 'org.apache.hadoop.hive.cassandra.cql.CqlStorageHandler'
>>>   TBLPROPERTIES ("cassandra.host" = "x.x.x.x", "cassandra.ks.name" = "test",
>>>     "cassandra.cf.name" = "mytable",
>>>     "cassandra.ks.repfactor" = "1",
>>>     "cassandra.ks.strategy" =
>>>       "org.apache.cassandra.locator.SimpleStrategy");
>>>
>>> --
>>> Alexey Kuznetsov
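The read-through behavior discussed in this thread can be modeled with a toy sketch as well: a key-value get() falls through to the backing store on a miss, while a scan/SQL-style query sees only what is already in memory. This is plain Python; ToyCache and ToyCacheStore are illustrative names, not Ignite APIs.

```python
class ToyCacheStore:
    """Stands in for the Cassandra-backed CacheStore."""
    def __init__(self, data):
        self.data = data

    def load(self, key):
        return self.data.get(key)


class ToyCache:
    """Read-through cache: get() loads misses from the store,
    but scan_query() only sees entries already in memory."""
    def __init__(self, store):
        self.store = store
        self.mem = {}

    def get(self, key):
        if key not in self.mem:           # miss: read through to the store
            val = self.store.load(key)
            if val is not None:
                self.mem[key] = val
        return self.mem.get(key)

    def scan_query(self, predicate):
        # Like an Ignite SQL query, this runs over in-memory data only;
        # nothing is preloaded from the store.
        return {k: v for k, v in self.mem.items() if predicate(k, v)}


store = ToyCacheStore({1: "a", 2: "b", 3: "c"})
cache = ToyCache(store)
cache.get(1)                              # read-through pulls key 1 into memory
hits = cache.scan_query(lambda k, v: True)
# hits contains only key 1; keys 2 and 3 stay in the store until get() touches them
```

This mirrors Denis's warning: a query layer on top of the cache answers from what has been loaded, so TBs of data sitting only in Cassandra are invisible to it until they are brought in via key-value access or explicit preloading.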