Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
MIME-Version: 1.0
In-Reply-To: 
 <156AA069D74CB042AA529A69006652531D560B4E@CO1PRD6102MB001.025d.mgd.msft.net>
References: 
 <CAFruNdffc57Hf29ze8cxDELRxp9w2_KJuCjYV3070So-Ew2oKQ@mail.gmail.com>
	<CAKv2g8fcjKLnXkbOf1VoyiqxP1WmPZ3mohsGU5f8cx6d2Fy7TA@mail.gmail.com>
	<CAFruNddO6r10MW-Z4_WKYbMBew8u3=DLv0a6XeKA0EntPydCew@mail.gmail.com>
	<CAOxAL62zmjg5XtLZ6_tPLpoPPzL1L=RLy1wNzsVpOpYvO0+D2Q@mail.gmail.com>
	<CAKv2g8emmh2ngP00GOhpEtJY7+M6NdoSUa_Diw8WbS4++SVQSw@mail.gmail.com>
	<156AA069D74CB042AA529A69006652531D560B4E@CO1PRD6102MB001.025d.mgd.msft.net>
Date: Fri, 12 Jun 2015 10:29:53 -0400
Message-ID: 
 <CAOxAL62z=yHVY08Dcw21c9N_Kh36smTm1UgNym8ukEgC22HBgw@mail.gmail.com>
Subject: Re: Support for ad-hoc query
From: Jack Krupansky <jack.krupansky@gmail.com>
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=089e013d15d8ff8202051852f08b

--089e013d15d8ff8202051852f08b
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

No dispute about that. But the main design requirement Cassandra strives to
meet is to be a blazing fast transactional database - here's the key, give
me the data, and here's the key, write this data. Any additional query
requirements are a distant second at best. A big part of that transactional
speed requirement is achieved by jettisoning the overhead required for ad
hoc queries.
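
To make the "here's the key" point concrete, here is a toy Python sketch. It is
only an in-memory dictionary, not Cassandra's storage engine; the class
ToyKeyedTable and its methods are names I made up for illustration. It assumes
no secondary index exists, so an arbitrary predicate has to examine every row,
while a keyed read touches exactly one.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


class ToyKeyedTable:
    """Hypothetical stand-in for a key-addressed table (not Cassandra itself)."""

    def __init__(self) -> None:
        self._rows: Dict[str, Row] = {}
        self.rows_examined = 0  # number of rows touched by reads

    def write(self, key: str, row: Row) -> None:
        # "Here's the key, write this data."
        self._rows[key] = row

    def read(self, key: str) -> Row:
        # "Here's the key, give me the data": exactly one row is touched.
        self.rows_examined += 1
        return self._rows[key]

    def ad_hoc(self, predicate: Callable[[Row], bool]) -> List[Row]:
        # Nothing is indexed for an arbitrary predicate, so scan every row.
        matches: List[Row] = []
        for row in self._rows.values():
            self.rows_examined += 1
            if predicate(row):
                matches.append(row)
        return matches


table = ToyKeyedTable()
for i in range(1000):
    status = "bad" if i % 250 == 0 else "ok"
    table.write(f"user{i}", {"name": f"user{i}", "status": status})

table.read("user42")
keyed_cost = table.rows_examined  # cost of one keyed read

table.rows_examined = 0
bad_rows = table.ad_hoc(lambda row: row["status"] == "bad")
scan_cost = table.rows_examined  # cost of one ad hoc query

print(keyed_cost, scan_cost, len(bad_rows))
```

The cost of the keyed read stays flat as the table grows, while the cost of the
ad hoc query grows with the row count. That is the overhead being jettisoned.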

I think it is inevitable that Cassandra will eventually address the
requirement for ad hoc queries when it finally decides what it wants to be
when it grows up (i.e., whether to just be a niche or to subsume all of
SQL), but in the meantime DSE Search/Solr, Stratio, and TupleJump Stargate,
as well as extraction and indexing in Elasticsearch, are moderately
reasonable near-term solutions.

And I agree that having to fully model eventual (and evolving!) data
requirements and emergent anomalous conditions upfront is too big a burden
for many enterprises.


-- Jack Krupansky

On Fri, Jun 12, 2015 at 10:07 AM, <SEAN_R_DURITY@homedepot.com> wrote:

>  I will note here that the limitations on ad-hoc querying (and
> aggregates) make it much more difficult to deal with data quality problems,
> QA testing, and similar efforts, especially where people are used to a more
> relational, ad-hoc model. We have often had to extract data from Cassandra
> to Hadoop for querying by Hive.
>
>
>
> Example: “We found a few records with incorrect data. How many more
> records like that are out there?”
>
>
>
>
>
> Sean Durity
>
>
>
> *From:* Peter Lin [mailto:woolfel@gmail.com]
> *Sent:* Wednesday, June 10, 2015 8:17 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Support for ad-hoc query
>
>
>
>
>
> I'll second Jack's detailed response and add that you really should do
> some discovery to figure out what kinds of queries you may need to support.
>
> It might not be possible, and often that is the case, but it's worthwhile
> to ask the end users what kind of reports they need to run. Allowing
> arbitrary ad-hoc queries is a known anti-pattern for Cassandra. If the
> system needs to query multiple cfs to derive or calculate some result,
> using Cassandra alone isn't going to do it. You'll need some other system,
> like Hive, to give you better query capabilities.
>
> If you need data-warehouse-like features, look at http://www.kylin.io/ .
> They are doing some interesting things.
>
> peter
>
>
>
> On Wed, Jun 10, 2015 at 7:58 AM, Jack Krupansky <jack.krupansky@gmail.com>
> wrote:
>
> Knowing your queries in advance is a hard-core requirement for effective
> deployment of Cassandra. Ad hoc queries are a very clear anti-pattern for
> Cassandra. DSE Search does provide support for advanced, complex, and ad
> hoc queries. Stratio and TupleJump Stargate can also be used.
>
>
>
> Back to the question of what you mean by ad hoc queries:
>
>
>
> 1. Do you expect real-time results, like sub-second, or are these
> long-running queries that might take seconds, 10 seconds or more, or even
> minutes to run?
>
> 2. Will they be very rare or quite frequent - how much load do you expect
> them to place on the cluster?
>
> 3. How complex do you expect them to be - how many clauses and operators?
>
> 4. What is their net cardinality - are they selecting just a few rows or
> many rows?
>
> 5. Do they have individual query clauses that select many rows even if the
> net combination of all select clauses is not so many rows?
>
>
>
> The requirement to perform advanced, complex, and ad hoc queries using DSE
> Search or the other techniques will almost certainly require that you use
> moderately more capable hardware, especially more RAM, for each node, and
> probably more nodes as well to reduce the row count per node since ad hoc
> queries will tend to be compute-intensive based on number of rows on the
> node.
>
>
>
> Yes, it can be done. No, it is not free or cheap. And, no, it does not
> come out of the box for a non-DSE Cassandra release. And, yes, you must
> address this requirement before deployment, not after deployment.
>
>
>
>
>   -- Jack Krupansky
>
>
>
> On Wed, Jun 10, 2015 at 1:18 AM, Srinivasa T N <seenutn@gmail.com> wrote:
>
> Thanks guys for the inputs.
>
> By ad-hoc queries I mean that I don't know the queries at cf design
> time.  The data may come from a single cf or from multiple cfs.  (This
> feature may be required if I want to do analysis on the data stored in
> Cassandra. Do you have any better ideas?)
>
> Regards,
>
> Seenu.
>
>
>
> On Tue, Jun 9, 2015 at 5:57 PM, Peter Lin <woolfel@gmail.com> wrote:
>
>
>
> what do you mean by ad-hoc queries?
>
> Do you mean simple queries against a single column family aka table?
>
> Or do you mean MDX style queries that look at multiple tables?
>
> If it's MDX style queries, many people extract data from Cassandra into a
> data warehouse that supports multi-dimensional cubes. This works well when
> the extracted data is a small subset and fits neatly in a data warehouse.
>
> As others have stated, Cassandra isn't great at ad-hoc. For MDX style
> queries, Cassandra wasn't designed for it. One thing we've done for our own
> project is to combine Solr with our own fuzzy index to make ad-hoc queries
> against a single table more friendly.
>
>
>
> On Tue, Jun 9, 2015 at 2:38 AM, Srinivasa T N <seenutn@gmail.com> wrote:
>
> Hi All,
>
>    I have a web application running with my backend data stored in
> Cassandra.  Now I want to do some analysis on the stored data, which
> requires firing some ad-hoc queries at Cassandra.  How can I do that?
>
> Regards,
>
> Seenu.

--089e013d15d8ff8202051852f08b--