cassandra-commits mailing list archives

From "Sylvain Lebresne (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-6348) TimeoutException throws if Cql query allows data filtering and index is too big and it can't find the data in base CF after filtering
Date Tue, 19 Nov 2013 09:19:20 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13826335#comment-13826335
] 

Sylvain Lebresne commented on CASSANDRA-6348:
---------------------------------------------

bq. Other than hadoop queries, It's common for user to query on multiple indexes

I sure hope you're wrong, and it certainly shouldn't be, because Cassandra sucks at it. I
personally have almost never seen anyone do it (on the mailing list, for instance).

ALLOW FILTERING is really meant as a "don't do this unless you're just having fun with cqlsh
on a toy database" option. Using ALLOW FILTERING on real production queries is wrong (at least
for CQL queries; I'm not talking about Hadoop, which is a different problem). I'm more than
happy to make the documentation/message clearer about that if it isn't.

bq. Hadoop Cql query uses "ALLOW FILTERING"

Which is kind of a problem, in the sense that it's not what ALLOW FILTERING was intended
for and, more generally, CQL was never designed with Hadoop in mind: it's a strictly
real-time oriented language. So maybe we should re-purpose ALLOW FILTERING as "the hadoop
mode" somehow, but if we do, we should be explicit about it and think about how best to do
that. But trying to shove Hadoop into something that wasn't made for it feels wrong to me.

That being said, I wonder if the "Hadoop wants to use the 2ndary indexes" problem couldn't be
better solved, and more simply overall, by letting Hadoop query the 2ndary index CFS directly.
That is, allow selects on the index itself (which would obviously require a special flag to
unlock). That way, Hadoop would get paging over the index "for free" (which at the end of
the day is the problem that needs solving, if I understand it correctly) and would get control
over that paging. And it would allow Hadoop to do things like merging indexes, which probably
make more sense on the Hadoop side than on the realtime side (i.e. we keep Cassandra
focused on realtime queries with as little processing as possible, which is what it is
good at).
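
To make the paging idea concrete, here is a toy in-memory sketch in Python. It is not Cassandra code and does not reflect any existing API: `page_index`, `index_row` and `page_size` are made-up names, and the index row is modelled as a plain list of primary keys. It only shows the client pulling the index in bounded pages of its own choosing.

```python
from typing import Iterator, List, Sequence


def page_index(index_row: Sequence[str], page_size: int) -> Iterator[List[str]]:
    """Yield the primary keys stored in one index row, one bounded page at a time."""
    for start in range(0, len(index_row), page_size):
        yield list(index_row[start:start + page_size])


# Hypothetical index row: the primary keys of every base row holding the indexed value.
index_row = [f"key{i}" for i in range(10)]

# The client (Hadoop, in this discussion) decides how much to pull per request,
# so no single request has to walk the whole index row.
pages = list(page_index(index_row, page_size=4))
```

Any merging or filtering of index entries would then happen on the client side, between pages.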


> TimeoutException throws if Cql query allows data filtering and index is too big and it
can't find the data in base CF after filtering 
> --------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-6348
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6348
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Alex Liu
>            Assignee: Alex Liu
>
> If the index row is too big and filtering can't find a matching CQL row in the base CF, it
> keeps scanning the index row and retrieving from the base CF until the index row is scanned
> completely, which may take too long, so the thrift server returns a TimeoutException. This is
> one of the reasons why we shouldn't index a column if the index is too big.
> Multiple-index merging can resolve the case where there are only EQUAL clauses (CASSANDRA-6048
> addresses it).
> If the query has non-EQUAL clauses, we still need to do data filtering, which might lead
> to a timeout exception.
> We can either disable those kinds of queries or WARN the user that data filtering might
> lead to a timeout exception or OOM.
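
The behaviour described in the issue can be pictured with a toy Python model. This is an illustration only, not the actual Cassandra read path: `filtered_index_scan`, `base_cf` and `max_reads` are hypothetical names, and the read budget is just a stand-in for the server-side timeout.

```python
def filtered_index_scan(index_row, base_cf, predicate, max_reads=None):
    """Walk an index row, fetch each base row, and keep the keys that pass the filter.

    Returns (matches, base_reads, completed). `max_reads` is a stand-in for the
    request timeout: the scan gives up once that many base rows have been read.
    """
    matches = []
    reads = 0
    for key in index_row:
        if max_reads is not None and reads >= max_reads:
            return matches, reads, False  # budget exhausted, like a timeout
        reads += 1
        if predicate(base_cf[key]):
            matches.append(key)
    return matches, reads, True


# Every base row has the indexed value a == 1, so the index row lists all keys.
base_cf = {f"key{i}": {"a": 1, "b": i} for i in range(1000)}
index_row = list(base_cf)

# A non-EQUAL filter that matches nothing forces a read of every indexed row.
unbounded = filtered_index_scan(index_row, base_cf, lambda row: row["b"] > 5000)
bounded = filtered_index_scan(index_row, base_cf, lambda row: row["b"] > 5000, max_reads=100)
```

With no match to stop on, the work grows with the size of the index row rather than with the size of the result, which is why a large enough index row runs past the timeout.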



--
This message was sent by Atlassian JIRA
(v6.1#6144)
