cassandra-user mailing list archives

From Richard Low <r...@acunu.com>
Subject Re: Retrieving a column from a fat row vs retrieving a single row
Date Thu, 09 Jun 2011 11:28:56 GMT
Remember also that partitioning is done by rows, not columns.  So
large rows are stored on a single host.  This means they can't be load
balanced and also all requests to that row will hit one host.  Having
separate rows will allow load balancing of I/Os.
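
A minimal sketch of why this matters, assuming a hypothetical four-node cluster and a simplified placement rule (MD5 of the row key modulo the node count). Real clusters use a token ring and replicas, so the node names and `node_for` are illustrative only:

```python
import hashlib
from collections import Counter

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster

def node_for(row_key: str) -> str:
    """Map a row key to a node (simplified: no token ring, no replicas)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest, "big") % len(NODES)]

# One fat row: every column read carries the same row key,
# so the same node serves all 1000 requests.
fat_row_hits = Counter(node_for("translations") for _ in range(1000))

# One row per id: each read carries a different row key,
# so the requests spread across the nodes.
thin_row_hits = Counter(node_for(f"translation:{i}") for i in range(1000))

print(fat_row_hits)
print(thin_row_hits)
```

Column names never enter the hash, which is why adding columns to a row cannot move any of its load to another host.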

-- 
Richard Low
Acunu | http://www.acunu.com | @acunu

On Thu, Jun 9, 2011 at 12:50 AM, aaron morton <aaron@thelastpickle.com> wrote:
> Just to make things less clear, if you have one row that you are continually
> writing to, it may end up spread out over several SSTables. Compaction helps
> here by reducing the number of files that must be accessed, as long as it can
> keep up. But if you want to read column X and the row is fragmented over 5
> SSTables, then each one must be accessed.
> https://issues.apache.org/jira/browse/CASSANDRA-2319 is open to try and
> reduce the number of seeks.
> For now take a look at nodetool cfhistograms to see how many sstables are
> read for your queries.
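
The fragmentation effect described above can be sketched with a toy model. This is not Cassandra code: each SSTable is assumed to be a plain dict of row key to columns, and bloom filters, timestamps and tombstones are ignored. The helper names are hypothetical.

```python
def sstables_touched(sstables, row_key):
    """Count the SSTables a read of this row has to consult."""
    return sum(1 for table in sstables if row_key in table)

def compact(sstables):
    """Merge all SSTables into one; later tables overwrite earlier ones."""
    merged = {}
    for table in sstables:  # assumed ordered oldest first
        for row_key, columns in table.items():
            merged.setdefault(row_key, {}).update(columns)
    return [merged]

# Five flushes, each holding one new column of the same row.
sstables = [{"row1": {f"col{i}": i}} for i in range(5)]

before = sstables_touched(sstables, "row1")
after = sstables_touched(compact(sstables), "row1")
print(before, after)
```

Until compaction catches up, a read of any single column has to consult every file that holds part of the row.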
> Cheers
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
> On 9 Jun 2011, at 04:50, Peter Schuller wrote:
>
> > As far as I know, to read a single column cassandra will deserialize a
> > bunch of them and then pick the correct one (64KB of data right?)
>
> Assuming the default setting of 64 KB, the average amount deserialized
> given random column access should be 32 KB, half a block (not true with row
> cache, but with large rows presumably you don't have row cache).
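
A rough model of the average scan cost inside one column index block, assuming the reader starts at the beginning of the block and deserializes until it reaches a column whose position is uniformly random. The function name and the uniform assumption are illustrative, not taken from Cassandra itself:

```python
import random

def mean_kb_deserialized(block_kb=64.0, trials=100_000, seed=0):
    """Average KB scanned before reaching a uniformly placed column."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = sum(rng.uniform(0.0, block_kb) for _ in range(trials))
    return total / trials

estimate = mean_kb_deserialized()
print(f"{estimate:.1f} KB")
```

Under this model the mean comes out close to half the block size, so a smaller block size lowers the scan cost at the price of a larger column index.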
>
> > Would it be faster to have a row for each id I want to translate? This
> > would make keycache less effective, but the amount of data read should
> > be smaller.
>
> It depends on what bottlenecks you're optimizing for. A key is
> "expensive" in the sense that it (1) increases the size of bloom
> filters for the column family, (2) increases the memory cost of
> index sampling, and (3) increases the total data size (typically)
> because the row key is duplicated in both the index and data files.
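
To put a rough number on point (1), here is the textbook sizing formula for an optimally configured bloom filter. It is a generic estimate, not Cassandra's actual sizing logic, and the 1% false positive rate is an assumption chosen for illustration:

```python
import math

def bloom_filter_bytes(num_keys: int, false_positive_rate: float = 0.01) -> int:
    """Textbook optimal size: bits = -n * ln(p) / (ln 2)^2."""
    bits = -num_keys * math.log(false_positive_rate) / (math.log(2) ** 2)
    return math.ceil(bits / 8)

one_million = bloom_filter_bytes(1_000_000)
two_million = bloom_filter_bytes(2_000_000)
print(one_million, two_million)
```

The size grows linearly with the number of row keys, so splitting one fat row into a million thin rows adds on the order of a megabyte of filter per replica under these assumptions.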
>
> The cost of deserializing the same data repeatedly is CPU. So if
> you're nowhere near bottlenecking on disk and the memory trade-off is
> reasonable, it may be a suitable optimization. However, consider that
> unless you're doing order preserving partitioning, accessing those
> rows will be effectively random w.r.t. the locations on disk you're
> reading from so you're adding a lot of overhead in terms of disk I/O
> unless your data set fits comfortably in memory.
>
> --
> / Peter Schuller
>
>
