cassandra-commits mailing list archives

From "David Alves (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes
Date Sat, 01 Sep 2012 06:10:09 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446542#comment-13446542 ]

David Alves edited comment on CASSANDRA-1337 at 9/1/12 5:09 PM:
----------------------------------------------------------------

Clean rehash that addresses Sylvain's (very helpful) comments, including an implementation
for the CQL3 case. It estimates the concurrency factor in the following ways:

Estimate Rows:
- Primary Indexes - uses cfs's estimated keys divided by RF
- 2ndary indexes - uses the mean col count of the most selective index to estimate the total
num keys

Estimate Cols (CQL3):
- IdentityFilter - uses the estimated keys + mean col count to estimate total cols
- NamesFilter - assumes the named cols are present and uses estimated keys to estimate
total cols
- Other filters - as Sylvain mentioned, because we have no idea of the selectivity of the col
filter, we cannot estimate how many cols will be returned per node, so we revert to
concurrency factor = 1.
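
To make the estimates above concrete, here is a minimal sketch of how they could feed a
concurrency factor. It is not the code from the attached patch: the class and method names
are hypothetical, reading "estimated keys + mean col count" as a product is an assumption,
and so is the final formula (requested count divided by the per-range estimate, rounded up
and clamped between 1 and the number of ranges).

```java
// Hypothetical helper, not taken from the CASSANDRA-1337 patch.
public final class ConcurrencyEstimate {
    private ConcurrencyEstimate() {}

    /** Primary index: estimated keys divided by the replication factor. */
    static long estimateRowsPrimary(long estimatedKeys, int replicationFactor) {
        return Math.max(1L, estimatedKeys / Math.max(1, replicationFactor));
    }

    /** Identity filter (assumed reading): estimated rows times mean columns per row. */
    static long estimateColsIdentity(long estimatedRows, long meanColumnCount) {
        return estimatedRows * meanColumnCount;
    }

    /** Names filter: assume every requested named column is present in every row. */
    static long estimateColsNames(long estimatedRows, int requestedNames) {
        return estimatedRows * requestedNames;
    }

    /**
     * Assumed formula: number of ranges needed to reach the requested count,
     * rounded up and clamped to [1, totalRanges]. With no usable estimate
     * (the "other filters" case) it falls back to 1, i.e. sequential execution.
     */
    static int concurrencyFactor(long requested, long estimatedPerRange, int totalRanges) {
        if (estimatedPerRange <= 0) {
            return 1;
        }
        long needed = (requested + estimatedPerRange - 1) / estimatedPerRange;
        return (int) Math.max(1L, Math.min((long) totalRanges, needed));
    }
}
```

For example, with 100 rows requested, an estimate of 30 rows per range and 8 ranges, this
sketch would query 4 ranges in the first round.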

Reimplemented the parallel execution part to make it a lot cleaner IMO (the previous
implementation was forcibly adapted from the initial sequential execution, which made it
difficult to read).

Notes:
- cql_test.py dtest is failing in the same place as on trunk; need to look into it to make
sure Sylvain's dtest passes
- not sure whether to wait on read repair results for all handlers or just for the ones we
actually use
                
> parallelize fetching rows for low-cardinality indexes
> -----------------------------------------------------
>
>                 Key: CASSANDRA-1337
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1337
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jonathan Ellis
>            Assignee: David Alves
>            Priority: Minor
>             Fix For: 1.2.1
>
>         Attachments: 1137-bugfix.patch, 1337.patch, ASF.LICENSE.NOT.GRANTED--0001-CASSANDRA-1337-scan-concurrently-depending-on-num-rows.txt,
CASSANDRA-1337.patch
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> currently, we read the indexed rows from the first node (in partitioner order); if that
does not have enough matching rows, we read the rows from the next, and so forth.
> we should use the statistics from CASSANDRA-1155 to query multiple nodes in parallel,
such that we have a high chance of getting enough rows w/o having to do another round of queries
(but, if our estimate is incorrect, we do need to loop and do more rounds until we have enough
data or we have fetched from each node).
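
The round-based behaviour described in the issue can be sketched as follows. This is only
an illustration under stated assumptions, not Cassandra's implementation: RoundedScan, scan
and the fetch function are hypothetical names, and each range is treated as an opaque value
that a caller-supplied function turns into a list of rows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Hypothetical sketch: query `concurrencyFactor` ranges per round, and keep
// issuing rounds until `limit` rows are collected or every range was queried.
public final class RoundedScan {
    private RoundedScan() {}

    static <R, T> List<T> scan(List<R> ranges, int concurrencyFactor, int limit,
                               Function<R, List<T>> fetch) {
        List<T> rows = new ArrayList<>();
        int step = Math.max(1, concurrencyFactor);
        for (int start = 0; start < ranges.size() && rows.size() < limit; start += step) {
            int end = Math.min(ranges.size(), start + step);

            // Start all requests of this round before waiting on any of them.
            List<CompletableFuture<List<T>>> inFlight = new ArrayList<>();
            for (R range : ranges.subList(start, end)) {
                inFlight.add(CompletableFuture.supplyAsync(() -> fetch.apply(range)));
            }

            // Merge results in range order, stopping once the limit is reached.
            for (CompletableFuture<List<T>> future : inFlight) {
                for (T row : future.join()) {
                    if (rows.size() < limit) {
                        rows.add(row);
                    }
                }
            }
        }
        return rows;
    }
}
```

If the estimate was too optimistic, the outer loop simply runs another round, which matches
the fallback the description asks for.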

