cassandra-commits mailing list archives

From "Jonathan Ellis (Commented) (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-3545) Fix very low Secondary Index performance
Date Thu, 01 Dec 2011 22:46:40 GMT


Jonathan Ellis commented on CASSANDRA-3545:

bq. As a compromise I can suggest applying this patch and adding a ticket for a follow-up to clean up the code, move to binary search, and add a new API in ISortedColumns.

I'm afraid that's typically not how we work.  If we see a clearly better approach during review,
as is the case here, then we'll wait for it before committing, especially when the
goal is to get the change into a stable branch.
> Fix very low Secondary Index performance
> ----------------------------------------
>                 Key: CASSANDRA-3545
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.7.0
>            Reporter: Evgeny Ryabitskiy
>             Fix For: 1.0.6
>         Attachments: CASSANDRA-3545.patch, CASSANDRA-3545_v2.patch, IndexSearchPerformance.png
> While performing an index search with value filtering over a large index row (~100k keys per
index value) in chunks of 512-1024 keys, search time is about 8-12 seconds, which is very slow.
> After profiling I got this picture:
> 60% of the search time is spent calculating MD5 hashes with MessageDigest (of course, this is
because of RandomPartitioner).
> 33% of the search time (half of all the MD5 hashing time) is spent recomputing the MD5 of
startKey when comparing two row keys while rotating the index row to startKey (when performing
the search query for the next chunk).
> I see several possible performance improvements:
> 1) Use a proper algorithm to search for startKey in the sorted collection, one that is faster
than iterating over all keys. This solution comes first because it is simple, needs only local
code changes, and should largely solve the problem (speeding up the search several times over).
> 2) Don't recalculate the MD5 hash of startKey every time; it is optimal to compute it once
(making the search twice as fast). This also needs only local code changes.
> 3) Consider a hash faster than MD5 (e.g. a TigerRandomPartitioner using a Tiger/128 hash).
> This needs research, and perhaps that research has already been done.
> 4) Don't use tokens (the MD5 hash, for RandomPartitioner) for comparing and sorting keys
in index rows; there, keys can be stored and compared with a simple byte comparator.
> This solution would require large code changes.
> I'm going to start with the first solution. The remaining improvements can be done in
follow-up tickets.
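The first two improvements quoted above can be sketched roughly as follows. This is a minimal illustration, not Cassandra's actual code: the class `IndexRowSearch`, its method names, and the use of plain `byte[]` keys are all assumptions made for the example. It shows a binary search for the first key >= startKey over byte-ordered keys (improvements 1 and 4) and hashing startKey a single time instead of once per comparison (improvement 2).

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

public class IndexRowSearch {
    // Unsigned lexicographic comparison of raw keys (improvement 4):
    // no hashing at all when index-row keys are stored in byte order.
    static int compareUnsigned(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    // Binary search for the index of the first key >= startKey (improvement 1),
    // replacing a linear scan over all keys in the index row.
    static int firstAtLeast(List<byte[]> sortedKeys, byte[] startKey) {
        int lo = 0, hi = sortedKeys.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (compareUnsigned(sortedKeys.get(mid), startKey) < 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }

    // Improvement 2: compute the MD5 digest of startKey once, up front,
    // instead of re-hashing it inside every key comparison.
    static byte[] md5Once(byte[] key) {
        try {
            return MessageDigest.getInstance("MD5").digest(key);
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e); // MD5 is always available in the JDK
        }
    }

    public static void main(String[] args) {
        List<byte[]> keys = Arrays.asList(
            new byte[]{1}, new byte[]{3}, new byte[]{5}, new byte[]{7});
        System.out.println(firstAtLeast(keys, new byte[]{4})); // prints 2
        System.out.println(md5Once(new byte[]{1}).length);     // prints 16
    }
}
```

In a real partitioner-aware implementation the comparison would be against decorated keys (tokens), which is exactly why caching the startKey digest matters: the binary search cuts the number of comparisons from O(n) to O(log n), and the cached digest halves the hashing cost of each remaining comparison.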

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.

