cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sylvain Lebresne <sylv...@datastax.com>
Subject Re: Why is our range query failing in Cassandra 0.8.10 Client
Date Thu, 12 Jul 2012 08:14:29 GMT
When executing a query like:
  get events WHERE Firm=434550 AND ds_timestamp>=1341955958200 AND
ds_timestamp<=1341955958200;
what the 2ndary index implementation will do is:
1) it queries the index for Firm for the row with key 434550 (because
that's the only one restricted by an equal clause, and that is why you
need at least one equal clause).
2) the query from 1 will return a bunch of events row keys for those
events whose Firm=434550. So for each of those row key it queries the
corresponding event
3) if a given queried event matches the remaining clauses (here
ds_timestamp>=1341955958200 AND ds_timestamp<=1341955958200), it adds
it to the result, otherwise it skips.

So what I suspect is happening is that you have *lots* of events matching
'Firm=434550' but only one matches 'ds_timestamp>=1341955958200 AND
ds_timestamp<=1341955958200'. And given that by default, it tries to
find 100 results, it will scan all the events having 'Firm=434550'
before returning.  Which it probably cannot do within the timeout.

But when you do
  get events WHERE Firm=434550 AND ds_timestamp>=1341955958200
given that lots of event having 'Firm=434550' probably match
ds_timestamp>=1341955958200, it is able to find 100 of them quickly.

Lastly, when you do
  get events WHERE Firm=434550 AND ds_timestamp=1341955958200;
the implementation has now both clause that are equal, and based on
internal stats it is able to determine that querying the ds_timestamp
index will discriminate potential results more efficiently. So it will
query the ds_timestamp index instead of the Firm one, which will yield
all events whose timestamp is 1341955958200, but since there isn't
many such events, it quickly finds the one matching Firm=434550

In other words, you are not doing something wrong, you are just
hitting a limitation/weakness of the 2ndary index implementation.

So if the query you really want to do is for a specific timestamp, you
definitively want to use an equal rather than two non-strict
inequalities. But if what you want is query events in a very small but
non-discrete window of time, then using 2ndary indexes might just not
fit the bill.

In that case, one option would be to do a custom/specialized index
yourself. If you construct an index where the row key is the Firm and
the column name is the ds_timestamp (and the colum value is the event
identifier), then finding events having a specific firm for a time
window will be (always) efficient. However that's more work on your
side since unfortunately Cassandra is not able to automatically create
such specialized index itself (at least not yet).

--
Sylvain

On Wed, Jul 11, 2012 at 10:01 PM, JohnB <jxb@tiac.net> wrote:
> Hi:
>
> We are currently using Cassandra 0.8.10 and have run into some strange issues surrounding
> querying for a range of data
>
>
> I ran a couple of get statements via the Cassandra client and found some interesting
results:
>
>
> Consider the following Column Family Definition:
>
>     ColumnFamily: events
>       Key Validation Class: org.apache.cassandra.db.marshal.BytesType
>       Default column value validator: org.apache.cassandra.db.marshal.BytesType
>       Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
>       Row cache size / save period in seconds: 0.0/0
>       Row Cache Provider: org.apache.cassandra.cache.ConcurrentLinkedHashCacheProvider
>       Key cache size / save period in seconds: 200000.0/14400
>       Memtable thresholds: 0.2953125/1440/63 (millions of ops/minutes/MB)
>       GC grace seconds: 864000
>       Compaction min/max thresholds: 4/32
>       Read repair chance: 1.0
>       Replicate on write: true
>       Built indexes: [events.events_Firm_idx, events.events_OrdType_idx, events.events_OrderID_idx
>  , events.events_OrderQty_idx, events.events_Price_idx, events.events_Symbol_idx, events.events_ds_timestamp_idx]
>       Column Metadata:
>         Column Name: Firm
>           Validation Class: org.apache.cassandra.db.marshal.BytesType
>           Index Name: events_Firm_idx
>           Index Type: KEYS
>         Column Name: OrdType
>           Validation Class: org.apache.cassandra.db.marshal.BytesType
>           Index Name: events_OrdType_idx
>           Index Type: KEYS
>         Column Name: OrderID
>           Validation Class: org.apache.cassandra.db.marshal.BytesType
>           Index Name: events_OrderID_idx
>           Index Type: KEYS
>         Column Name: OrderQty
>           Validation Class: org.apache.cassandra.db.marshal.LongType
>           Index Name: events_OrderQty_idx
>           Index Type: KEYS
>         Column Name: Price
>           Validation Class: org.apache.cassandra.db.marshal.LongType
>           Index Name: events_Price_idx
>           Index Type: KEYS
>         Column Name: Symbol
>           Validation Class: org.apache.cassandra.db.marshal.BytesType
>           Index Name: events_Symbol_idx
>         Column Name: ds_timestamp
>           Validation Class: org.apache.cassandra.db.marshal.LongType
>           Index Name: events_ds_timestamp_idx
>           Index Type: KEYS
>
>
> get events WHERE Firm=434550 AND ds_timestamp=1341955958200;
>
>
> …and the results are pretty much instantaneous.
>
> 1 Row Returned.
>
> [default@FIX] get events WHERE Firm=434550 AND ds_timestamp=1341955958200;
>
> -------------------
>
> RowKey: 64326430363362302d636164362d313165312d626637622d333836303737306639303133
> => (column=ClOrdID, value=32323833, timestamp=1341955980651010)
> => (column=Firm, value=434550, timestamp=1341955980651026)
> => (column=OrdType, value=31, timestamp=1341955980651008)
> => (column=OrderQty, value=8200, timestamp=1341955980651013)
> => (column=Price, value=433561, timestamp=1341955980651019)
> => (column=Symbol, value=544e54, timestamp=1341955980651018)
> => (column=ds_timestamp, value=1341955958200, timestamp=1341955980651020)
>
>
> If I run the following query:
>
> get events WHERE Firm=434550 AND ds_timestamp>=1341955958200 AND ds_timestamp<=1341955958200;
>
> (which in theory would should return the same 1 row result)
>
>
>
> It runs for around 12 seconds,
>
>
>
> And I get:
>
> TimedOutException()
>
>
>
>
> If I run:
>
> get events WHERE Firm=434550 AND ds_timestamp>=1341955958200;
>
> or
>
> get events WHERE Firm=434550 AND ds_timestamp<=1341955958200;
>
> The results return quickly.
>
>
>
> Curious, I also ran  a similar set of queries against the price field:
>
> get events WHERE Firm=434550 AND Price=433561;
> get events WHERE Firm=434550 AND Price>=433561;
> get events WHERE Firm=434550 AND Price<=433561;
>
> These all work  fine.
>
>
> While,
>
> get events WHERE Firm=434550 AND Price=433561 AND Price <= 433561;
>
> returns an IO Exception.
>
>
>
> This feels like it’s attempting to do a full table scan here….
>
> What is going on here?
>
> Am I doing something incorrectly?
>
> We also see similar behavior when submit the query through our app via the Thrift API.
>
>
> Thanks,
> JohnB
>

Mime
View raw message