incubator-cassandra-user mailing list archives

From Edward Capriolo <>
Subject Re: Will the large datafile size affect the performance?
Date Wed, 23 Feb 2011 22:19:10 GMT
On Wed, Feb 23, 2011 at 4:51 PM, buddhasystem <> wrote:
> I know that theoretically it should not (apart from compaction issues), but
> maybe somebody has experience showing otherwise:
> My test cluster now has 250GB of data and will have 1.5TB in its
> reincarnation. If all this data is in a single CF -- will it cause read or
> write performance problems? Should I "shard" it? One advantage of splitting
> the data would be reducing the impact of compaction and repairs (or so I
> naively assume).
> Maxim

By dividing your data you can apply different settings at the
Column Family or keyspace level. For example, you might have some
batch data that you only want to replicate twice, or some small
subset of data that is read frequently and should be heavily cached.
Also, as you said, having a few smaller CFs helps you avoid a
single very long-running, intensive operation such as a repair or
major compaction.
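
As a rough illustration of the trade-off, here is a minimal Python sketch
that compares one 1.5TB CF against a split layout. The CF names, sizes,
replica counts and the caching flag are all made up for the example, and
the model treats each CF as one unit, ignoring how data is spread across
nodes. It is not Cassandra configuration, only arithmetic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ColumnFamilyPlan:
    """Hypothetical planning record; none of these names are Cassandra settings."""
    name: str
    size_gb: int          # unreplicated data size
    replicas: int         # replication is set at the keyspace level
    heavily_cached: bool  # whether this CF would get aggressive caching


def largest_maintenance_unit_gb(plans: List[ColumnFamilyPlan]) -> int:
    """Size of the biggest CF: the most data a single repair or
    major compaction of one CF has to work through in this model."""
    return max(p.size_gb for p in plans)


def total_stored_gb(plans: List[ColumnFamilyPlan]) -> int:
    """Total data on disk across the cluster once replication is counted."""
    return sum(p.size_gb * p.replicas for p in plans)


# Everything in one CF, replicated three times (assumed numbers).
single = [ColumnFamilyPlan("everything", 1500, 3, False)]

# Same 1.5TB split three ways; batch data is only replicated twice.
split = [
    ColumnFamilyPlan("batch", 1000, 2, False),
    ColumnFamilyPlan("hot", 100, 3, True),
    ColumnFamilyPlan("rest", 400, 3, False),
]

for label, layout in (("single", single), ("split", split)):
    print(label,
          "largest unit:", largest_maintenance_unit_gb(layout), "GB,",
          "total stored:", total_stored_gb(layout), "GB")
```

With these assumed numbers the largest single operation shrinks and the
lower replication on the batch data reduces the total stored, but the
benefit depends entirely on how your data actually divides.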

If you always need to read both CFs to satisfy your application,
splitting is not a good idea, because every request then has to read
from each of them.
