From: "Dan Hendry" <dan.hendry.junk@gmail.com>
To: user@cassandra.apache.org
Subject: RE: split large sstable
Date: Thu, 17 Nov 2011 11:42:05 -0500

What do you mean by 'better file offset caching'? Presumably you mean 'better page cache hit rate'? Out of curiosity, why do you think this? What data are you seeing which makes you think it's better?

I am certainly not even close to a virtual memory or page caching expert, but I am pretty sure file size does not matter (assuming file sizes are significantly greater than the page size, which I believe is 4k). Perhaps what you are actually seeing is row fragmentation across your SSTables? That is easy to check with nodetool cfhistograms (the SSTables column).

To answer your question, I know of no tools to split SSTables. If you want to switch compaction strategies, levelled compaction (1.0.x) creates many smaller SSTables instead of fewer, bigger ones.
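Roughly, the check and the strategy switch look like the following. This is a sketch from memory of the 1.0-era tooling, so treat the exact syntax as an assumption and verify with nodetool's usage output and 'help update column family;' in cassandra-cli; the keyspace and column family names are placeholders:

    (check fragmentation: the SSTables histogram shows how many SSTables each read touched)
    nodetool -h localhost cfhistograms MyKeyspace MyColumnFamily

    (switch a column family to levelled compaction, in cassandra-cli;
     sstable_size_in_mb is the per-SSTable target size in megabytes)
    update column family MyColumnFamily
      with compaction_strategy = 'LeveledCompactionStrategy'
      and compaction_strategy_options = {sstable_size_in_mb: 10};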
Although it is workload dependent, increasing min_compaction_threshold for size-tiered compaction is probably a bad idea since it will increase row fragmentation across SSTables and therefore increase the IO/seek requirements for reads (particularly for column ranges or non-named-column queries). The only reason to do so is to reduce the frequency of compaction (disk IO considerations). A sketch of the relevant commands is at the end of this message, below the quoted original.

Dan

-----Original Message-----
From: Radim Kolar [mailto:hsn@sendmail.cz]
Sent: November-17-11 5:02
To: user@cassandra.apache.org
Subject: split large sstable

Is there a simple way to split a large SSTable into several smaller ones? I increased min_compaction_threshold (smaller SSTables seem to get better file offset caching from the OS) and now I need to reshuffle the data into smaller SSTables. Running several cluster-wide repairs worked well; only the largest table was left. I have an 80 GB SSTable and need to split it into roughly 10 GB ones.
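PS: if you do want to change the size-tiered thresholds themselves, there are two usual routes. Again a sketch from memory of the 1.0-era tooling, so treat the exact syntax and example values as assumptions and check the nodetool and cassandra-cli help; names are placeholders:

    (per node, at runtime; I do not believe this persists across restarts)
    nodetool -h localhost setcompactionthreshold MyKeyspace MyColumnFamily 4 32

    (cluster-wide, via the schema in cassandra-cli)
    update column family MyColumnFamily
      with min_compaction_threshold = 4
      and max_compaction_threshold = 32;

4 and 32 are just the usual defaults; as noted above, raising the minimum mostly trades read fragmentation for less frequent compaction.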