cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jonathan Ellis (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-3545) Fix very low Secondary Index performance
Date Thu, 01 Dec 2011 18:46:39 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-3545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161023#comment-13161023
] 

Jonathan Ellis commented on CASSANDRA-3545:
-------------------------------------------

Wow.  That's a really clever workaround to working on a generic Collection iterator, that
we happen to know is sorted.  But it's also kind of complicated, and does a bunch of copying
to a temporary collection that I'd prefer to avoid.

What if we added a getSortedColumns(byte[] startWith) overload to AbstractColumnContainer
+ ISortedColumns?  Then each ISortedColumns implementation could implement that the straightforward
way (for ArrayBacked, with Collections.binarySearch + subList; for the NavigableMap based
ones, with tailMap).
                
> Fix very low Secondary Index performance
> ----------------------------------------
>
>                 Key: CASSANDRA-3545
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3545
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.7.0
>            Reporter: Evgeny Ryabitskiy
>             Fix For: 1.0.6
>
>         Attachments: CASSANDRA-3545.patch, CASSANDRA-3545_v2.patch, IndexSearchPerformance.png
>
>
> While performing index search + value filtering over large Index Row ( ~100k keys per
index value) with chunks (size of 512-1024 keys) search time is about 8-12 seconds, which
is very very low.
> After profiling I got this picture:
> 60% of search time is calculating MD5 hash with MessageDigester (Of cause it is because
of RundomPartitioner).
> 33% of search time (half of all MD5 hash calculating time) is double calculating of MD5
for comparing two row keys while rotating Index row to startKey (when performing search query
for next chunk).
> I see several performance improvements:
> 1) Use good algorithm to search startKey in sorted collection, that is faster then iteration
over all keys. This solution is on first place because it simple, need only local code changes
and should solve problem (increase search in multiple times).
> 2) Don't calculate MD5 hash for startKey every time. It's optimal to compute it once
(so search will be twice faster).
> Also need local code changes.
> 3) Think about something faster that MD5 for hashing (like TigerRandomPartitioner with
Tiger/128 hash).
> Need research and maybe this research was done.
> 4) Don't use Tokens (with MD5 hash for RandomPartitioner) for comparing and sorting keys
in index rows. In index rows, keys can be stored and compared with simple Byte Comparator.

> This solution requires huge code changes.
> I'm going to start from first solution. Next improvements can be done with next tickets.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message