cassandra-commits mailing list archives

From "Stu Hood (JIRA)" <>
Subject [jira] Updated: (CASSANDRA-1526) Make cassandra sampling and startup faster
Date Tue, 21 Sep 2010 22:29:34 GMT


Stu Hood updated CASSANDRA-1526:

    Attachment: skip-short-byte-array.diff

A lot of CPU time could also be eliminated by not decoding keys we don't need. I'm attaching
a patch from 1472, but to use it you'd need to split IndexSummary.maybeAddEntry into
{{boolean increment}} (increment the row id and return true if the next entry is needed)
and {{void addEntry}}.
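
The split described above can be sketched roughly as follows. This is not the attached patch
or the actual Cassandra code: the class name IndexSummarySketch, the sampling interval, and
the on-disk layout (2-byte key length, key bytes, 8-byte position) are assumptions made only
for illustration. The point is that the caller asks {{increment}} whether the next entry is
needed and otherwise skips the key bytes without decoding them.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for IndexSummary; all names and the layout are assumptions.
class IndexSummarySketch {
    private final int interval;   // keep one summary entry per `interval` index rows
    private long rowId = 0;
    final List<String> keys = new ArrayList<>();
    final List<Long> positions = new ArrayList<>();

    IndexSummarySketch(int interval) {
        this.interval = interval;
    }

    // Increment the row id; return true if the NEXT entry must be decoded and added.
    boolean increment() {
        rowId++;
        return rowId % interval == 0;
    }

    // Record an entry that the caller has already decoded.
    void addEntry(String key, long position) {
        keys.add(key);
        positions.add(position);
    }

    // Walk an in-memory index, decoding only the keys that will be sampled.
    static IndexSummarySketch sample(byte[] index, int interval) {
        IndexSummarySketch summary = new IndexSummarySketch(interval);
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(index))) {
            boolean needKey = true;   // the first entry is always sampled
            while (in.available() > 0) {
                int keyLength = in.readUnsignedShort();
                if (needKey) {
                    byte[] raw = new byte[keyLength];
                    in.readFully(raw);
                    long position = in.readLong();
                    summary.addEntry(new String(raw, StandardCharsets.UTF_8), position);
                } else {
                    // Skip the key bytes and the position without decoding either.
                    in.skipBytes(keyLength + Long.BYTES);
                }
                needKey = summary.increment();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return summary;
    }

    // Build a small fake index with keys "key0", "key1", ... for the demo.
    static byte[] buildIndex(int rows) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int i = 0; i < rows; i++) {
                byte[] key = ("key" + i).getBytes(StandardCharsets.UTF_8);
                out.writeShort(key.length);
                out.write(key);
                out.writeLong(i * 100L);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        IndexSummarySketch summary = sample(buildIndex(10), 4);
        System.out.println(summary.keys + " " + summary.positions);
    }
}
```

With 10 rows and an interval of 4, only rows 0, 4 and 8 are decoded; the other seven keys
are skipped by length.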

> Make cassandra sampling and startup faster
> ------------------------------------------
>                 Key: CASSANDRA-1526
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Edward Capriolo
>            Assignee: Jonathan Ellis
>            Priority: Minor
>             Fix For: 0.6.6, 0.7.0
>         Attachments: 1526.txt, cpu.txt, io.txt, skip-short-byte-array.diff
> makes mention of very large disks
> I do not see how that would be possible.
> We have a server-class system with 4x processors, 16GB RAM, and a 6-disk RAID5 (yes, RAID0
> would be faster, but still).
> {noformat}
> INFO [main] 2010-09-21 12:58:26,348 (line 120) Sampling index for
> ...
> INFO [main] 2010-09-21 13:05:51,333 (line 124) Binding thrift service to cdbsd07/
> {noformat}
> This node has 200GB of data in two column families, and the time to sample all tables
> and start up is 7+ minutes. The logging suggests this process handles a single SSTable
> at a time. Additionally, the normal system vitals, mainly disk and CPU, do not look overtaxed.
> * Since SSTables are immutable, is there a way the sampling of the tables could be saved?
> * Could this process be done in parallel for speedup?
> * Can multiple column families be processed at once?
> Unless someone has an insanely powerful disk pack, making mention of 2TB limitations seems
> out of place. Unless my calculations are wrong (which they usually are), I have pretty decent
> hardware, and if I had 2 TB of data I would have a 95 minute node startup?
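
As a rough check on that estimate, the elapsed time between the two log lines quoted above
can be scaled up linearly. This is only a sketch under stated assumptions: that those two
timestamps bracket the whole sampling phase, that startup time grows linearly with data
size, and that 2 TB means 2000 GB. The class and method names are hypothetical.

```java
import java.time.Duration;
import java.time.LocalTime;

// Hypothetical helper for a back-of-the-envelope startup estimate.
class StartupEstimate {
    // Parse log timestamps such as "12:58:26,348" (comma before the milliseconds).
    static Duration elapsed(String start, String end) {
        LocalTime from = LocalTime.parse(start.replace(',', '.'));
        LocalTime to = LocalTime.parse(end.replace(',', '.'));
        return Duration.between(from, to);
    }

    // Scale a measured duration linearly from measuredGb of data to targetGb.
    static double scaledMinutes(Duration measured, double measuredGb, double targetGb) {
        double measuredMinutes = measured.toMillis() / 60_000.0;
        return measuredMinutes * (targetGb / measuredGb);
    }

    public static void main(String[] args) {
        Duration measured = elapsed("12:58:26,348", "13:05:51,333");
        System.out.printf("measured: %d ms%n", measured.toMillis());
        System.out.printf("estimate for 2000 GB: %.2f minutes%n",
                scaledMinutes(measured, 200, 2000));
    }
}
```

Under those assumptions the measured window is about 445 seconds, and ten times the data
comes to roughly 74 minutes, the same order of magnitude as the 95 minutes quoted.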
> I hope that sampling multiple ColumnFamilies at once would make nodes of at least
> a few hundred GB start up reasonably fast.
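
The parallel sampling asked about in the description could look roughly like the sketch
below. It is not how Cassandra loads SSTables: ParallelSamplingSketch, sampleOne, and the
use of in-memory key lists in place of index files are all assumptions for illustration.
Whether this helps in practice depends on whether the disk can serve several index scans
at once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: sample several SSTable indexes concurrently.
class ParallelSamplingSketch {
    // Stand-in for sampling one index: keep every interval-th key.
    static List<String> sampleOne(List<String> indexKeys, int interval) {
        List<String> sampled = new ArrayList<>();
        for (int i = 0; i < indexKeys.size(); i += interval) {
            sampled.add(indexKeys.get(i));
        }
        return sampled;
    }

    // Submit one sampling task per table and collect results in submission order.
    static List<List<String>> sampleAll(List<List<String>> tables, int interval, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (List<String> table : tables) {
                futures.add(pool.submit(() -> sampleOne(table, interval)));
            }
            List<List<String>> results = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                results.add(future.get());   // blocks until that table is done
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<String>> tables = List.of(
                List.of("a", "b", "c", "d", "e"),
                List.of("x", "y", "z"));
        System.out.println(sampleAll(tables, 2, 2));
    }
}
```

Because SSTables are immutable, each task only reads its own input and no locking is
needed between tasks; results come back in the order the tables were submitted.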

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
