cassandra-commits mailing list archives

From "Alex Liu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-6348) TimeoutException throws if Cql query allows data filtering and index is too big and it can't find the data in base CF after filtering
Date Fri, 22 Nov 2013 23:11:36 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13830401#comment-13830401 ]

Alex Liu commented on CASSANDRA-6348:
-------------------------------------

rowsPerQuery is used only as the page size for the index CF during a 2i (secondary index) search.

maxColumns is the value of the LIMIT clause. If meanColumns is a big number, then
filter.maxColumns()/meanColumns is less than 1 and rowsPerQuery is 2. A page size of 2 for the
index CF is too small: we end up with too many random seeks between the index CF and the base CF,
which is why a 2i search is sometimes so slow. We need to keep the index CF page size from
becoming too small. The goal is to make the page size large enough that we have fewer random
seeks between the index CF and the base CF, but not so large that it causes OOM.

If data filtering is involved and many base CF columns don't match the filter, the small page
size makes the issue even worse, because we need to page through more pages of the index CF.
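
As a rough illustration of that cost (a sketch, not Cassandra code): assuming the worst case
described in this ticket, where nothing in the base CF matches the filter and the whole index
row is scanned, the number of index pages is the index row size divided by the page size,
rounded up. The class name, the method name, and the sizes below are hypothetical.

```java
class IndexPagingCost
{
    // Pages needed to scan an index row of indexEntries entries (ceiling division).
    // Assumes indexEntries >= 0 and pageSize >= 1.
    static long pagesNeeded(long indexEntries, int pageSize)
    {
        return (indexEntries + pageSize - 1) / pageSize;
    }

    public static void main(String[] args)
    {
        long indexEntries = 1_000_000L; // hypothetical index row size
        System.out.println(pagesNeeded(indexEntries, 2));    // prints 500000
        System.out.println(pagesNeeded(indexEntries, 1000)); // prints 1000
    }
}
```

With a page size of 2, the same scan takes 500 times as many round trips through the index CF
as it would with a page size of 1000.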

{code}
    public int maxRows()
    {
        return countCQL3Rows ? Integer.MAX_VALUE : maxResults;
    }

    public int maxColumns()
    {
        return countCQL3Rows ? maxResults : Integer.MAX_VALUE;
    }
{code}

For a non-CQL query:
{code}
            rowsPerQuery = Math.max(Math.min(filter.maxResults, Integer.MAX_VALUE / meanColumns), 2);
            // most likely becomes: rowsPerQuery = Math.max(filter.maxResults, 2);
            // most likely becomes: rowsPerQuery = filter.maxResults
            // which is the same as the number of rows to fetch
{code}

For a CQL query:
{code}
            rowsPerQuery = Math.max(Math.min(Integer.MAX_VALUE, filter.maxResults / meanColumns), 2);
            // most likely becomes: rowsPerQuery = Math.max(filter.maxResults / meanColumns, 2);
            // most likely becomes: rowsPerQuery = filter.maxResults / meanColumns
            // if meanColumns is too big, the quotient can be less than 1, in which case rowsPerQuery is 2
            // if there is no LIMIT clause in the CQL query, it becomes Integer.MAX_VALUE / meanColumns,
            // which is a big number
{code}
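
The two cases can be checked with a small standalone sketch. It assumes the expression has the
form Math.max(Math.min(maxRows, maxColumns / meanColumns), 2), which is what the two expansions
above imply, and it uses plain ints in place of the real filter object. The class name and the
sample values are hypothetical; meanColumns is assumed to be at least 1.

```java
class RowsPerQueryDemo
{
    // Mirrors maxRows()/maxColumns() quoted earlier, with plain parameters for illustration.
    static int rowsPerQuery(boolean countCQL3Rows, int maxResults, int meanColumns)
    {
        int maxRows = countCQL3Rows ? Integer.MAX_VALUE : maxResults;
        int maxColumns = countCQL3Rows ? maxResults : Integer.MAX_VALUE;
        // Java int division truncates, so a quotient below 1 becomes 0 and the floor of 2 applies.
        return Math.max(Math.min(maxRows, maxColumns / meanColumns), 2);
    }

    public static void main(String[] args)
    {
        // Non-CQL query, limit 100, 10 columns per row on average.
        System.out.println(rowsPerQuery(false, 100, 10));               // prints 100
        // CQL query, LIMIT 100, 1000 columns per row on average.
        System.out.println(rowsPerQuery(true, 100, 1000));              // prints 2
        // CQL query with no LIMIT (maxResults taken as Integer.MAX_VALUE).
        System.out.println(rowsPerQuery(true, Integer.MAX_VALUE, 1000)); // prints 2147483
    }
}
```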

So the question is how to calculate the page size for the index CF so that we don't have too
many random seeks between the index CF and the base CF, while also avoiding fetching so many
index columns that we run into OOM.
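
One possible direction, shown only as a sketch and not as the fix adopted for this ticket, is to
clamp the computed value between a lower and an upper bound. The class name and both bounds
below are hypothetical and chosen purely for illustration; suitable values would depend on
index entry size and available heap.

```java
class IndexPageSize
{
    // Hypothetical bounds, not taken from Cassandra.
    static final int MIN_PAGE_SIZE = 100;    // lower bound to limit random seeks
    static final int MAX_PAGE_SIZE = 10_000; // upper bound to limit memory use

    // Clamp the computed rowsPerQuery into [MIN_PAGE_SIZE, MAX_PAGE_SIZE].
    static int pageSize(int rowsPerQuery)
    {
        return Math.max(MIN_PAGE_SIZE, Math.min(rowsPerQuery, MAX_PAGE_SIZE));
    }

    public static void main(String[] args)
    {
        System.out.println(pageSize(2));       // prints 100
        System.out.println(pageSize(500));     // prints 500
        System.out.println(pageSize(2147483)); // prints 10000
    }
}
```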



> TimeoutException throws if Cql query allows data filtering and index is too big and it
can't find the data in base CF after filtering 
> --------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-6348
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6348
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Alex Liu
>            Assignee: Alex Liu
>         Attachments: 6348.txt
>
>
> If the index row is too big and filtering can't find a matching CQL row in the base CF, it keeps
scanning the index row and retrieving from the base CF until the index row is scanned completely,
which may take too long, and the thrift server returns a TimeoutException. This is one of the
reasons why we shouldn't index a column if the index is too big.
> Merging multiple indexes can resolve the case where there are only EQUAL clauses (CASSANDRA-6048
addresses it).
> If the query has non-EQUAL clauses, we still need to do data filtering, which might lead
to a timeout exception.
> We can either disable those kinds of queries or WARN the user that data filtering might
lead to a timeout exception or OOM.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
