From: Jacques Nadeau
To: "dev@drill.apache.org"
Date: Thu, 8 Jan 2015 15:36:57 -0800
Subject: Re: [DISCUSS] Cassandra storage for Drill
In-Reply-To: <784E2D75-EC14-4779-9586-9E533F00B599@gmail.com>

Drill's framework does the same.
Drill leverages some of Calcite's extension capabilities to allow very easy
pushdowns by letting storage subsystems expose optimizer rules (subclassed
on top of Calcite's optimizer rule construct). On top of what Calcite can
do, Drill also understands concepts like parallelization and data locality,
and lets systems like Cassandra expose this information to vastly improve
performance, especially when working across multiple systems.

On Thu, Jan 8, 2015 at 12:41 PM, Julian Hyde wrote:

> Calcite’s adapter framework makes it easy to push down filters,
> aggregations to third-party sources, and to express more powerful and
> data-source-specific optimizations.
>
> Is Drill building on Calcite’s support or doing it its own way?
>
> Calcite doesn’t have a Cassandra adapter but the same approach taken in
> the MongoDb, Splunk, Phoenix adapters could be used.
>
> On Jan 8, 2015, at 9:11 AM, Tomer Shiran wrote:
>
> > I think that any valid SQL statement should work with any data source.
> > Drill should:
> >
> > - Push down as much processing as possible into the data source
> >   (Cassandra in this case)
> > - Maintain as much data locality as possible (ie, spread the work so
> >   that each drillbit is handling local data)
> > - In the worst case, Drill should pull the entire table from the data
> >   source if that's what's needed to satisfy the query.
> >
> >
> > On Thu, Jan 8, 2015 at 8:29 AM, Yash Sharma wrote:
> >
> >> Hi Folks,
> >> This thread is to discuss few scenarios how Cassandra works - and how
> >> do we think it should be supported in Drill.
> >>
> >> While they are not supported in Cassandra inherently but its doable on
> >> Drill's end once we fetch a superset of data without these cases.
> >>
> >> 1. Filtering non indexed column in Cassandra
> >> 2. Filtering by subset of primary key
> >> 3. OR condition in where clause
> >>
> >> Should we apply filters at Drill's end and support these features or we
> >> propagate an error back to user for asking for a valid Cassandra based
> >> query?
> >>
> >> -----
> >> Examples:
> >> Here 'trending_now' is a dummy table with (id, rank, pog_id) where
> >> (id,rank) is primary key pair.
> >> 1.
> >> cqlsh:recsys> select * from trending_now where pog_id=10004 ;
> >> Bad Request: No indexed columns present in by-columns clause with Equal
> >> operator
> >>
> >> 2.
> >> cqlsh:recsys> select * from trending_now where rank=4;
> >> Bad Request: Cannot execute this query as it might involve data
> >> filtering and thus may have unpredictable performance. If you want to
> >> execute this query despite the performance unpredictability, use ALLOW
> >> FILTERING
> >> P.S. ALLOW FILTERING is not permitted in Cassandra java driver as of
> >> now.
> >>
> >> 3.
> >> cqlsh:recsys> select * from trending_now where rank=4 or id='id0004';
> >> Bad Request: line 1:40 missing EOF at 'or'
> >>
> >> 4. Valid Query:
> >> cqlsh:recsys> select * from trending_now where id='id0004' and rank=4;
> >>
> >>  id     | rank | pog_id
> >> --------+------+--------
> >>  id0004 |    4 |  10002
> >>
> >> (1 rows)
> >>
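
For illustration, here is a minimal sketch of the "push down what the source
can handle, filter the rest on Drill's side" idea discussed in this thread.
It is not Drill's or Calcite's API: the function names are hypothetical, the
predicate is limited to AND-ed equality conditions, and it assumes that in
'trending_now' the column id is the partition key and rank is a clustering
column. Only the first sample row comes from the thread; the other two are
made up so the residual filter has something to reject.

```python
# Hypothetical illustration only; none of these names exist in Drill or Calcite.
# Assumption: id is the partition key and rank is a clustering column.
PRIMARY_KEY = ("id", "rank")

# First row is from the thread; the other two are invented sample data.
ROWS = [
    {"id": "id0004", "rank": 4, "pog_id": 10002},
    {"id": "id0004", "rank": 5, "pog_id": 10004},
    {"id": "id0005", "rank": 4, "pog_id": 10004},
]


def split_conjuncts(conjuncts, primary_key=PRIMARY_KEY):
    """Split AND-ed equality conditions {column: value} into (pushed, residual).

    Only a leading prefix of the primary key is pushed to the source;
    every other condition is kept as a residual filter for the engine.
    """
    pushed = {}
    for col in primary_key:
        if col not in conjuncts:
            break  # stop at the first key column that is not restricted
        pushed[col] = conjuncts[col]
    residual = {c: v for c, v in conjuncts.items() if c not in pushed}
    return pushed, residual


def scan(rows, pushed):
    """Stand-in for the storage-side read: returns the superset of rows
    that match the pushed-down conditions (all rows if nothing was pushed)."""
    return [r for r in rows if all(r[c] == v for c, v in pushed.items())]


def query(rows, conjuncts):
    """Read the superset from the source, then apply the residual filter."""
    pushed, residual = split_conjuncts(conjuncts)
    superset = scan(rows, pushed)
    return [r for r in superset if all(r[c] == v for c, v in residual.items())]
```

Under these assumptions, case 4 (id and rank) is pushed down entirely, while
cases 1 (pog_id only) and 2 (rank only) push nothing and fall back to a full
read plus an engine-side filter, which is the "worst case" Tomer describes.
An OR condition (case 3) is outside this sketch, since it cannot be split
conjunct by conjunct.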