cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: Retrieving a column from a fat row vs retrieving a single row
Date Wed, 08 Jun 2011 22:50:15 GMT
Just to make things less clear: if you have one row that you are continually writing to, it
may end up spread out over several SSTables. Compaction helps here by reducing the number of
files that must be accessed, so long as it can keep up. But if you want to read column X and
the row is fragmented over 5 SSTables, then each one must be accessed.
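
As a rough illustration of the point above, here is a toy Python model, not Cassandra's actual read path. The dict-based "SSTables" and the names read_column, compact and sstables are all made up for this sketch. It only shows that a single-column read has to touch every file holding a fragment of the row, and that compaction brings that count back down.

```python
# Toy model only: each "SSTable" is a dict of row_key -> {column: (timestamp, value)}.
# Real SSTables are on-disk files with indexes and bloom filters.

def read_column(sstables, row_key, column):
    """Return (value, sstables_touched) for one column of one row."""
    newest = None
    touched = 0
    for sstable in sstables:
        row = sstable.get(row_key)
        if row is None:
            continue  # this file holds no fragment of the row
        touched += 1  # the file holds part of the row, so it must be read
        cell = row.get(column)
        if cell is not None and (newest is None or cell[0] > newest[0]):
            newest = cell  # keep the version with the highest timestamp
    value = newest[1] if newest is not None else None
    return value, touched


def compact(sstables):
    """Merge all SSTables into one, keeping the newest version of each column."""
    merged = {}
    for sstable in sstables:
        for row_key, columns in sstable.items():
            target = merged.setdefault(row_key, {})
            for name, cell in columns.items():
                if name not in target or cell[0] > target[name][0]:
                    target[name] = cell
    return [merged]


# Five flushes, each leaving a fragment of the same row "fat" in its own SSTable.
sstables = [{"fat": {f"col{i}": (i, f"v{i}")}} for i in range(5)]
# A later overwrite of col0 lands in the newest SSTable.
sstables[4]["fat"]["col0"] = (10, "newer")

before = read_column(sstables, "fat", "col0")
after = read_column(compact(sstables), "fat", "col0")
```

In this model the read before compaction touches all five fragments to find the newest col0, while the read after compaction touches one.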

https://issues.apache.org/jira/browse/CASSANDRA-2319 is open to try to reduce the number
of seeks.

For now, take a look at nodetool cfhistograms to see how many SSTables are read for your queries.


Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 9 Jun 2011, at 04:50, Peter Schuller wrote:

>> As far as I know, to read a single column cassandra will deserialize a
>> bunch of them and then pick the correct one (64KB of data right?)
> 
> Assuming the default setting of 64kb, the average amount deserialized
> given random column access should be 8 kb (not true with row cache,
> but with large rows presumably you don't have row cache).
> 
>> Would it be faster to have a row for each id I want to translate? This
>> would make keycache less effective, but the amount of data read should
>> be smaller.
> 
> It depends on what bottlenecks you're optimizing for. A key is
> "expensive" in the sense that it (1) increases the size of bloom
> filters for the column family, (2) increases the memory cost of
> index sampling, and (3) increases the total data size (typically)
> because the row key is duplicated in both the index and data files.
> 
> The cost of deserializing the same data repeatedly is CPU. So if
> you're nowhere near bottlenecking on disk and the memory trade-off is
> reasonable, it may be a suitable optimization. However, consider that
> unless you're doing order preserving partitioning, accessing those
> rows will be effectively random w.r.t. the locations on disk you're
> reading from so you're adding a lot of overhead in terms of disk I/O
> unless your data set fits comfortably in memory.
> 
> -- 
> / Peter Schuller

