cassandra-commits mailing list archives

From "Sylvain Lebresne (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-3545) Fix very low Secondary Index performance
Date Thu, 08 Dec 2011 15:59:40 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-3545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-3545:
----------------------------------------

    Attachment: 0002-cleanup.patch
                0001-3545.patch

I agree with Jonathan that interning this inside the column family feels cleaner (and is more
efficient). Attaching patches to do that (two patches, actually: the second one cleans up the
comparator being passed to lots of methods that don't care about it or can get it by
other means). The patches are against trunk since I don't think we should push this into a
stable release (independently of the actual implementation).

Note that this only applies to the memtable, so it probably has much more impact on small benchmarks
(where you insert and get immediately) than it will have in real life (it's still an improvement,
don't get me wrong).

For the rest:
bq. 2) Don't calculate MD5 hash for startKey every time. It's optimal to compute it once (so
search will be twice faster).

Unfortunately I don't see a clean way to do this without badly breaking the comparator
abstraction.
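
To make the trade-off concrete, here is a minimal sketch of what "compute it once" would look like. It is not Cassandra code: the class and method names are hypothetical, and the token is only an approximation of what RandomPartitioner derives from MD5. The point is that the cached token has to be passed in next to the raw key, which a generic two-argument comparator has no slot for.

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration only; these are not Cassandra classes.
final class StartKeyTokenSketch {
    // MD5 of the key as a non-negative integer (an approximation of a
    // RandomPartitioner-style token, assumed here for illustration).
    static BigInteger md5Token(byte[] key) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            return new BigInteger(1, md.digest(key));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Naive comparison: hashes both keys on every call.
    static int compareHashingBoth(byte[] a, byte[] b) {
        return md5Token(a).compareTo(md5Token(b));
    }

    // Start key hashed once by the caller; only the candidate is hashed here.
    static int compareToCachedStart(BigInteger startToken, byte[] candidate) {
        return startToken.compareTo(md5Token(candidate));
    }
}
```

A caller would compute `md5Token(startKey)` once before the scan and reuse it for every comparison, halving the number of digests, but every code path that compares keys would need to know about the cached token.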

bq. 3) Think about something faster than MD5 for hashing (like TigerRandomPartitioner with
Tiger/128 hash).

It could be worth checking, though a quick search doesn't turn up anything very interesting.
Finding a faster MD5 implementation would be convenient too, but the only thing I've
found so far is http://twmacinta.com/myjava/fast_md5.php, which is unfortunately incompatible
with our license.

bq. 4) Don't use Tokens (with MD5 hash for RandomPartitioner) for comparing and sorting keys
in index rows. In index rows, keys can be stored and compared with simple Byte Comparator

Imo, that's the most promising option. I don't think that would be very complicated to do
(I actually think it would be pretty easy but I may be forgetting a difficulty), but the annoying
part will likely be how to deal with the upgrade/backward compatibility. I may give it a shot
at some point though.
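
For reference, the kind of "simple Byte Comparator" meant here could look roughly like the sketch below: an unsigned lexicographic comparison of the raw key bytes, with no hashing at all. The class name is hypothetical and this is not taken from the attached patches.

```java
import java.util.Comparator;

// Hypothetical illustration only; not taken from the attached patches.
final class RawKeyOrderSketch {
    // Orders keys by their raw bytes, treating each byte as unsigned (0..255),
    // so no MD5 digest is needed to compare two keys.
    static final Comparator<byte[]> UNSIGNED_BYTES = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        // All shared bytes are equal: the shorter key sorts first.
        return Integer.compare(a.length, b.length);
    };
}
```

Since this order differs from token order, index rows already written in token order would need some migration path, which is the upgrade/backward compatibility concern above.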

                
> Fix very low Secondary Index performance
> ----------------------------------------
>
>                 Key: CASSANDRA-3545
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3545
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.7.0
>            Reporter: Evgeny Ryabitskiy
>             Fix For: 1.0.6
>
>         Attachments: 0001-3545.patch, 0002-cleanup.patch, CASSANDRA-3545.patch, CASSANDRA-3545_v2.patch,
IndexSearchPerformance.png
>
>
> While performing an index search + value filtering over a large index row (~100k keys per
index value) in chunks (512-1024 keys each), search time is about 8-12 seconds, which
is very slow.
> After profiling I got this picture:
> 60% of search time is spent calculating MD5 hashes with MessageDigest (of course, this is because
of RandomPartitioner).
> 33% of search time (half of all MD5 hash calculation time) is the double MD5 calculation
done to compare two row keys while rotating the index row to startKey (when performing the search query
for the next chunk).
> I see several possible performance improvements:
> 1) Use a good algorithm to search for startKey in the sorted collection, one that is faster than iterating
over all keys. This solution comes first because it is simple, needs only local code changes,
and should solve the problem (speeding up search several times over).
> 2) Don't calculate the MD5 hash for startKey every time. It's optimal to compute it once
(so search will be twice as fast).
> This also needs only local code changes.
> 3) Think about something faster than MD5 for hashing (like TigerRandomPartitioner with
Tiger/128 hash).
> This needs research, and maybe this research has already been done.
> 4) Don't use Tokens (with MD5 hash for RandomPartitioner) for comparing and sorting keys
in index rows. In index rows, keys can be stored and compared with a simple Byte Comparator.
> This solution requires huge code changes.
> I'm going to start with the first solution. Further improvements can be done in later tickets.
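
As an illustration of improvement 1 in the description above, the sketch below finds the position of startKey in a sorted collection with a binary search instead of iterating over all keys. It assumes the keys are available as a sorted `List` of distinct, comparable values, which is a simplification of how index rows are actually stored; the class and method names are hypothetical.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; real index rows are not a java.util.List.
final class StartKeySeekSketch {
    // Returns the index of the first element >= start in a sorted list of
    // distinct elements, or sorted.size() if every element is smaller.
    // With a random-access list this takes O(log n) comparisons instead of O(n).
    static <T extends Comparable<? super T>> int firstAtOrAfter(List<T> sorted, T start) {
        int pos = Collections.binarySearch(sorted, start);
        // A negative result encodes the insertion point as -(insertionPoint) - 1.
        return pos >= 0 ? pos : -(pos + 1);
    }
}
```

With ~100k keys per index value, this would cut the per-chunk seek from up to ~100k comparisons to roughly 17, each of which still pays for its MD5 hashes.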

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
