cassandra-user mailing list archives

From Sylvain Lebresne <sylv...@datastax.com>
Subject Re: Confused about get_slice SliceRange behavior with bloom filter
Date Mon, 14 Feb 2011 12:43:37 GMT
On Mon, Feb 14, 2011 at 11:27 AM, Aditya Narayan <adynnn@gmail.com> wrote:

> Thanks Sylvain,
>
> I guess I might have misunderstood the meaning of column_index_size_in_kb.
> My previous understanding was that it is the size threshold a row must
> pass before its columns are indexed.
>

It is the size of the index 'bucket'. But given that there is no point in
having an index with only one entry, it is true that it is also the
threshold after which a row starts to be indexed.
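
To illustrate the point above, here is a minimal sketch of how a row's
columns could be grouped into index buckets. It is a simplified model and
not Cassandra's actual code: the function names and the list of
per-column serialized sizes are assumptions made for the example.

```python
def build_column_index(column_sizes, column_index_size_in_kb=64):
    """Group serialized columns into buckets of about column_index_size_in_kb.

    column_sizes is a hypothetical list of serialized column sizes in bytes.
    Returns a list of (first_column, last_column, bucket_bytes) tuples.
    """
    threshold = column_index_size_in_kb * 1024
    blocks = []
    start = 0
    block_bytes = 0
    for i, size in enumerate(column_sizes):
        block_bytes += size
        # Close the bucket once it has reached the configured size.
        if block_bytes >= threshold:
            blocks.append((start, i, block_bytes))
            start = i + 1
            block_bytes = 0
    # Whatever is left over forms a final, smaller bucket.
    if block_bytes > 0:
        blocks.append((start, len(column_sizes) - 1, block_bytes))
    return blocks


def row_is_indexed(column_sizes, column_index_size_in_kb=64):
    """An index with a single entry is pointless, so require at least two."""
    return len(build_column_index(column_sizes, column_index_size_in_kb)) > 1
```

In this model a row of 100 columns of 100 bytes each stays in a single
bucket and is not indexed, while 2000 such columns span four buckets.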


>
> If I have understood it correctly, it implies the size of the "blocks
> (containing columns) that are kept together on the same index". So if you
> make that high, a large number of columns will need to be deserialized for
> a single column access in that block. And if you make it lower than
> optimal, then the index size will grow, right?
>

yes


> So I guess we should vary that depending on the size of our columns and not
> the size of rows !? I have valueless columns for my usecase.


Yes, it depends mainly on the size of your columns. But if you have big rows,
even with very tiny columns, you may still not want to set too small a value
there. In general I would really run careful tests with your workload
before changing the value of column_index_size_in_kb to see if it makes
a difference. I'm not sure there is much to gain here.
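
As a rough way to reason about that trade-off before testing, the sketch
below estimates how many index entries a row needs and how many columns
sit in one bucket for a given setting. It assumes uniformly sized columns
and ignores per-entry overhead; the function name and parameters are
hypothetical, and it is no substitute for measuring a real workload.

```python
def index_tradeoff(num_columns, column_bytes, column_index_size_in_kb):
    """Return (index_entries, columns_per_bucket) for a uniform row.

    More index entries means a bigger index; more columns per bucket means
    more columns to deserialize when looking up a single column.
    """
    bucket_bytes = column_index_size_in_kb * 1024
    row_bytes = num_columns * column_bytes
    # Integer ceiling division: number of buckets needed to cover the row.
    index_entries = -(-row_bytes // bucket_bytes)
    columns_per_bucket = min(num_columns, max(1, bucket_bytes // column_bytes))
    return index_entries, columns_per_bucket


# One million tiny (32-byte) columns, default bucket versus a small one.
print(index_tradeoff(1_000_000, 32, 64))  # (489, 2048)
print(index_tradeoff(1_000_000, 32, 4))   # (7813, 128)
```

Going from 64 to 4 here cuts the columns per bucket by 16x but grows the
index by about the same factor, which is why a big row with tiny columns
may not benefit from a very small value.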

--
Sylvain


>
>
>
>
> On Mon, Feb 14, 2011 at 2:06 PM, Sylvain Lebresne <sylvain@datastax.com> wrote:
>
>> As Aaron said, if the whole row is under 64k, it won't matter. But
>> since you spoke of a very wide row, I'll assume the whole row will be much
>> more than 64k.
>>
>> If so, the row is indexed by block (of 64k, configurable). Then the read
>> performance depends on how many of those blocks are needed for the query,
>> since each block potentially means a seek (potentially because some blocks
>> could happen to be sequential on disk). So if the columns you ask for are
>> really randomly distributed, then yes, the bigger the row is, the bigger
>> the chance of having to hit many blocks and the bigger the chance that
>> these blocks are far apart on disk.
>>
>> --
>> Sylvain
>>
>> On Sun, Feb 13, 2011 at 10:19 PM, Aditya Narayan <adynnn@gmail.com> wrote:
>>
>>> Jonathan,
>>> If I ask for around 150-200 columns (totally random not sequential) from
>>> a very wide row that contains more than a million or even more columns then,
>>> is the read performance of the SliceQuery operation affected by or "depends
>>> on the length of the row" ?? (For my use case, I would use the column names
>>> list for this SliceQuery operation).
>>>
>>>
>>> Thanks
>>> Aditya
>>>
>>>
>>> On Sun, Feb 13, 2011 at 8:41 PM, Jonathan Ellis <jbellis@gmail.com> wrote:
>>>
>>>> On Sun, Feb 13, 2011 at 12:37 AM, E S <tr1sklion@yahoo.com> wrote:
>>>> > I've gotten myself really confused by
>>>> > http://wiki.apache.org/cassandra/ArchitectureInternals and am hoping
>>>> > someone can help me understand what the io behavior of this operation
>>>> > would be.
>>>> >
>>>> > When I do a get_slice for a column range, will it seek to every
>>>> > SSTable?  I had thought that it would use the bloom filter on the row
>>>> > key so that it would only do a seek to SSTables that have a very high
>>>> > probability of containing columns for that row.
>>>>
>>>> Yes.
>>>>
>>>> > In the linked doc above, it seems to say that it is only used for
>>>> > exact column names.  Am I misunderstanding this?
>>>>
>>>> Yes.  You may be confusing multi-row behavior with multi-column.
>>>>
>>>> > On a related note, if instead of using a SliceRange I provide an
>>>> > explicit list of columns, will I have to read all SSTables that have
>>>> > values for the columns
>>>>
>>>> Yes.
>>>>
>>>> > or is it smart enough to stop after finding a value from the most
>>>> > recent SSTable?
>>>>
>>>> There is no way to know which value is most recent without having to
>>>> read it first.
>>>>
>>>> --
>>>> Jonathan Ellis
>>>> Project Chair, Apache Cassandra
>>>> co-founder of DataStax, the source for professional Cassandra support
>>>> http://www.datastax.com
>>>>
>>>
>>>
>>
>
