cassandra-user mailing list archives

From Peter Schuller <>
Subject Re: Read Latency Degradation
Date Sat, 18 Dec 2010 15:59:31 GMT
> +1 on each of Peter's points except one.
> For example, if the hot set is very small and slowly changing, you may
> be able to have 100 TB per node and take the traffic without any
> difficulties.

So that statement was probably not the best. I should have been more
careful. I meant it purely in terms of dealing with the traffic
patterns (hot set and the implied IOPS relative to request rate etc),
rather than as a claim that there are no issues with having 100 TB of
data on a single node.

Sorry if that was unclear.

> Also this page
> On ext2/ext3 the maximum file size is 2TB, even on a 64 bit kernel. On
> ext4 that goes up to 16TB. Since Cassandra can use almost half your
> disk space on a single file, if you are raiding large disks together
> you may want to use XFS instead, particularly if you are using a
> 32-bit kernel. XFS file size limits are 16TB max on a 32 bit kernel,
> and basically unlimited on 64 bit.

Another problem is that file removals (unlink()) are *DOG* slow on
ext2/ext3: they are extremely seek-bound and put the process into
uninterruptible sleep while the kernel walks the block maps.
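
A quick way to see this for yourself (a rough sketch; the file path is
just an example, and a sparse file understates the cost compared to a
real multi-GB SSTable with all its blocks allocated):

```shell
# Create a large test file, then time its removal on the filesystem
# under test. On ext3 the rm can sit in uninterruptible sleep
# (state "D" in ps) while indirect blocks are freed; on XFS the
# unlink returns almost immediately.
truncate -s 10G /tmp/bigfile-test
time rm /tmp/bigfile-test
```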

I would strongly urge the use of XFS for large data sets. The main
argument against it is probably that more people test with ext3
because it tends to be the default.

> 2) If you have small columns the start up time for a node (0.6.0)
> would be mind boggling to sample those indexes

This is true with 0.7 as well. It's improved, but sampling still takes
place and takes time. At a minimum you have to read through the
indexes on disk, so that I/O time is a hard lower bound on startup;
if the sampling is CPU bound, it will take even longer.

(I think the parallel index sampling has gone in so that multiple
cores are used, but the issue remains.)
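
To make the point concrete, here is a toy illustration (not Cassandra's
actual code; the 128 interval matches the default index_interval, but
the function and data are made up). The node must stream through the
entire on-disk index even though it keeps only every Nth entry in
memory, so startup cost scales with total index size regardless of the
sample interval:

```python
SAMPLE_INTERVAL = 128  # matches Cassandra's default index_interval

def sample_index(index_entries, interval=SAMPLE_INTERVAL):
    """Read every (key, position) entry, but retain only every
    `interval`-th one -- mimicking startup index sampling."""
    sampled = []
    for i, (key, position) in enumerate(index_entries):
        # Every entry is still read (the I/O cost); few are retained.
        if i % interval == 0:
            sampled.append((key, position))
    return sampled

# With a million small rows, we still iterate a million index entries,
# even though only ~7800 samples end up in memory.
entries = ((f"key{i}", i * 64) for i in range(1_000_000))
print(len(sample_index(entries)))
```

This is why lots of small rows make startup painful: the in-memory
sample stays small, but the full scan of the index does not.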

> We should be careful not to mislead people. Talking about 16TB XFS
> setup, or 100TB/node without any difficulties , seems very very far
> from the common use case.

I completely agree. I didn't mean to imply that, and I hope no one was misled.

/ Peter Schuller
