cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dave Brosius <>
Subject Re: Cassandra search performance
Date Sat, 12 May 2012 14:09:01 GMT
When you build a secondary index on a field, what happens is a second
(hidden) column family is created for you, with the key for that column
family being the value of the index field, and the columns being the
keys of the entries in your column family that have that value.

So in your case, you made a bogus column that had all the same values,
and index that. So what that did was create a hidden column family with
one row. the key had the bogus value, and you had n columns that held
the keys for all your records.

When you did the query, since you did bogus_column EQ bogusvalue, you
basically got all the keys from that one index row.

You then loaded all rows in your column family, and in memory did the
real query you wanted with the inequality based on time (00:00 to 23:59)

As you can see this index was of no value, as it didn't pare down your
set of rows at all.

What you probably want to do is change your data model. Your original
column family should probably have a key that represents some time based
element 'bucket'. For instance, you might want to have rows that
represents 1 hour, and all the columns in that row are values for
various points of time in that one hour.

Then when you want to do searches for data for time ranges, you know
what keys you should query to get the values for those times. And then
what columns in those rows to get.

Of course you haven't really laid out what you are trying to do, so this
explanation is kind of fuzzy, so...

On 05/12/2012 06:18 AM, Jason Tang wrote:
> I try to search one column, this column store the time as the type
> Long, 1,000,000 data equally distributed in 24 hours, I only want to
> search certain time rang, eg from 01:30 to 01:50 or 08:00 to 12:00,
> but something stranger happened.
> Search 00:00 to 23:59 limit 100
> It took less then 1 second scan 100 record
> Search 00:00 to 00:20 limit 100
> It took more then one minute scan around 2,400 recods
> So the result shows it seems cassandra scan one by one to match the
> condition, and the data is not ordered in sequence.
> One more thing, to have equal condition, I make a redundant column to
> have equal condition, the value is same for all records.
> The search condition like get record where equal='equal' and time >
> 00:00 and time < 00:20
> Is it the expected behavior of secondary index or I didn't use it correct.
> Because I used to have another test, I have one string column most of
> it is string 'true' and I add 100 'false' among 1,000,000 'true' , it
> shows it only scan 100 records.
> So how can I exam what happened inside cassadra, and where I can find
> out the detail of how secondary works?
> 在 2012年5月8日星期二,Maxim Potekhin 写道:
>     Thanks for the comments, much appreciated.
>     Maxim
>     On 5/7/2012 3:22 AM, David Jeske wrote:
>>     On Sun, Apr 29, 2012 at 4:32 PM, Maxim Potekhin <
>>     <javascript:_e({}, 'cvml', '');>> wrote:
>>         Looking at your example,as I think you understand, you forgo
>>         indexes by
>>         combining two conditions in one query, thinking along the
>>         lines of what is
>>         often done in RDBMS. A scan is expected in this case, and
>>         there is no
>>         magic to avoid it.
>>     This sounds like a mis-understanding of how RDBMSs work. If you
>>     combine two conditions in a single SQL query, the SQL execution
>>     optimizer looks at the cardinality of any indicies. If it can
>>     successfully predict that one of the conditions significantly
>>     reduces the set of rows that would be considered (such as a
>>     status match having 200 hits vs 1M rows in the table), then it
>>     selects this index for the first-iteration, and each index hit
>>     causes a record lookup which is then tested for the other
>>     conditions. (This is one of several query-execution types RDBMS
>>     systems use)
>>     I'm no Cassandra expert, so I don't know what it does WRT
>>     index-selection, but from the page written on secondary indicies,
>>     it seems like if you just query on status, and do the other
>>     filtering yourself it'll probably do what you want...
>>         However, if this query is important, you can easily index on
>>         two conditions,
>>         using a composite type (look it up), or string concatenation
>>         for quick and
>>         easy solution.
>>     This is not necessarily a good idea. Creating a composite index
>>     explodes the index size unnecessarily. If a condition can reduce
>>     a query to 200 records, there is no need to have a composite
>>     index including another condition.

View raw message