cassandra-commits mailing list archives

From "Sylvain Lebresne (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-4915) CQL should prevent or warn about inefficient queries
Date Fri, 16 Nov 2012 19:55:12 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13499052#comment-13499052 ]

Sylvain Lebresne commented on CASSANDRA-4915:
---------------------------------------------

bq. Is it a clause that must be applied to let this query run?

Yes, that's the idea.

bq. Whenever we iterate over more than MAX_EXAMINED we short circuit and return what we have

That's a good idea. However, I'd rather see it as a way to fine-tune the behavior of the {{FILTERING
ALLOWED}} idea above (even if, in the end, we end up with pretty much the same thing as what you
suggest). Let me explain.

What I'd like to do is:
# refuse queries, as written today, when they might involve "filtering" data. By filtering
here I mean that some records are read but then discarded from the ResultSet.
# add an {{ALLOW FILTERING}} syntax that "unlocks" those queries (as in, allows the query to
run).
# when {{ALLOW FILTERING}} is used, allow specifying the maximum number of filtered records
with, say, {{ALLOW FILTERING MAX 500}}.

I believe we've reached consensus on 1., and in any case the arguments are above.
Now 2. + 3. is pretty much the equivalent of Ed's idea (more precisely, using {{LIMIT X ALLOW
FILTERING MAX Y}} would be the equivalent of {{LIMIT X MAX_EXAMINED X+Y}} if I understand
Ed's proposal right). However, the reasons why I think we should allow 2. alone are that:
* I do think 2. is useful in its own right. Or rather, you may have cases where you want
all results, period. Of course you could provide a very big value for the max filtered, but
that's lame. Or, to put it another way, it's one thing to say "I understand this query
may do some unknown amount of useless work underneath, but go ahead" and a slightly different
one to control exactly how much of that useless work you allow.
* Part 3. is a bit of a break of the API abstraction. What I mean here is that the actual
behavior/result of a MAX_EXAMINED will depend on implementation details. Say tomorrow we
somehow optimize how many records are actually examined to answer a query; then a query with
MAX_EXAMINED may return a different result tomorrow even with the exact same setting. Part 2.
doesn't have this problem, and so while I'm fine with having 3. because I see how it can be
useful, I'd rather not have it alone.
* On the very practical side of things, part 3. is more complex to implement. I'm pretty
sure it'll require some storage engine changes, for instance. Also, I think there are points
to clarify: if you shortcut the query, how does the user know whether the query was shortcut or
not? We can probably add some flag to the ResultSet I suppose, or something else, but the
point is that I'd rather take the time to do that part right. Meaning that I think shoving
it into 1.2.0 at this point is imho a bad idea. So I'd rather do part 2. now, which I'm confident
is well defined, and improve on it with part 3. later.
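
To make the three parts concrete, here is a minimal Python sketch of the semantics described above. It is a toy simulation over an in-memory list, not Cassandra code: the names {{run_query}}, {{Result}}, {{FilteringNotAllowed}} and the {{truncated}} flag are all hypothetical, and it assumes every query passed in is one that needs filtering. The flag stands in for the "was the query shortcut?" signal that part 3. would still have to define.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


class FilteringNotAllowed(Exception):
    """Part 1: a filtering query is refused unless explicitly unlocked."""


@dataclass
class Result:
    rows: List[dict]   # rows that matched the predicate
    filtered: int      # rows that were read and then discarded
    truncated: bool    # True if the filtering budget cut the scan short


def run_query(
    rows: Iterable[dict],
    predicate: Callable[[dict], bool],
    limit: int,
    allow_filtering: bool = False,
    max_filtered: Optional[int] = None,
) -> Result:
    # Part 1: refuse by default, since this scan may discard rows.
    if not allow_filtering:
        raise FilteringNotAllowed("query may filter rows; ALLOW FILTERING required")

    matched: List[dict] = []
    filtered = 0
    for row in rows:
        if len(matched) >= limit:
            break  # LIMIT satisfied, nothing was cut short
        if predicate(row):
            matched.append(row)
            continue
        # Part 3 (hypothetical MAX): stop once the discard budget is spent.
        if max_filtered is not None and filtered >= max_filtered:
            return Result(matched, filtered, truncated=True)
        filtered += 1

    # Part 2 alone (max_filtered=None) always ends up here.
    return Result(matched, filtered, truncated=False)


if __name__ == "__main__":
    videos = [{"videoname": n} for n in ["a", "My funny cat", "b", "c", "My funny cat"]]
    is_cat = lambda row: row["videoname"] == "My funny cat"
    print(run_query(videos, is_cat, limit=10, allow_filtering=True))
    print(run_query(videos, is_cat, limit=10, allow_filtering=True, max_filtered=2))
```

In this sketch the number of rows counted as matched or discarded never exceeds {{limit + max_filtered}}, which is the sense in which {{LIMIT X ALLOW FILTERING MAX Y}} resembles a cap of X+Y examined rows. Note that the capped call can return fewer rows than the uncapped one on the same data, which is why some way of reporting truncation matters.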

                
> CQL should prevent or warn about inefficient queries
> ----------------------------------------------------
>
>                 Key: CASSANDRA-4915
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4915
>             Project: Cassandra
>          Issue Type: Improvement
>    Affects Versions: 1.2.0 beta 1
>            Reporter: Edward Capriolo
>            Priority: Minor
>
> When issuing a query like:
> {noformat}
> CREATE TABLE videos (
>   videoid uuid,
>   videoname varchar,
>   username varchar,
>   description varchar,
>   tags varchar,
>   upload_date timestamp,
>   PRIMARY KEY (videoid,videoname)
> );
> SELECT * FROM videos WHERE videoname = 'My funny cat';
> {noformat}
> Cassandra samples some data using get_range_slice and then applies the query.
> This is very confusing to me, because as an end user I am not sure if the query is fast
because Cassandra is performing an optimized query (over an index, or using a slicePredicate)
or if Cassandra is simply sampling some random rows and returning me some results. 
> My suggestions:
> 1) force people to supply a LIMIT clause on any query that is going to
> page over get_range_slice
> 2) having some type of explain support so I can establish if this
> query will work in the
> I will champion suggestion 1) because CQL has put itself in a rather unique un-SQL-like
position by applying an automatic limit clause without the user asking for it. I also do
not believe the CQL language should let the user issue queries that will not work as intended
with "larger-than-auto-limit" size data sets.

