cassandra-user mailing list archives

From Edward Kibardin <infa...@gmail.com>
Subject Why Cassandra secondary indexes are so slow on just 350k rows?
Date Tue, 28 Aug 2012 21:23:51 GMT
I have a column family with a secondary index. The indexed field is
essentially boolean, but I'm storing it as a string. The field is called
*is_exported* and can be *'true'* or *'false'*. After each request, all loaded
rows are updated with *is_exported = 'false'*.

I'm polling this column family every ten minutes and exporting new rows as
they appear.

But here is the problem: the time for this query grows roughly linearly
with the amount of data in the column family, and it currently takes *from 12
to 20 seconds (!!!) to find 5000 rows*. As I understand it, an indexed
request should depend not on the number of rows in the CF but on the number
of rows per index value (cardinality), since the index is just another hidden
CF like:

        "true" : rowKey1 rowKey2 rowKey3 ...
        "false": rowKey1 rowKey2 rowKey3 ...
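
To make that expectation concrete, here is a minimal in-memory sketch of the
"hidden CF" mental model described above. It is only a toy illustration of
the poster's assumption, not how Cassandra actually stores or reads its
secondary indexes; the class and method names (ToyIndexedCF, insert,
get_indexed) are hypothetical.

```python
from collections import defaultdict


class ToyIndexedCF:
    """Toy model: a column family plus one 'hidden' index keyed by value."""

    def __init__(self):
        self.rows = {}                  # row key -> is_exported value
        self.index = defaultdict(set)   # is_exported value -> set of row keys

    def insert(self, key, is_exported):
        # Drop the stale index entry if the row already exists.
        old = self.rows.get(key)
        if old is not None:
            self.index[old].discard(key)
        self.rows[key] = is_exported
        self.index[is_exported].add(key)

    def get_indexed(self, value, count):
        # Only the keys stored under `value` are touched, so in this model
        # the cost depends on that value's cardinality, not on total rows.
        return sorted(self.index.get(value, ()))[:count]
```

Under this model, a lookup for 'false' would never look at rows indexed under
'true', which is why the observed linear growth is surprising.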

I'm using Pycassa to query the data; here is the code:

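        import pycassa
        from pycassa.index import create_index_clause, create_index_expression

        # cassandra_pool and column_family_name are defined elsewhere
        column_family = pycassa.ColumnFamily(cassandra_pool, column_family_name,
                                             read_consistency_level=2)
        is_exported_expr = create_index_expression('is_exported', 'false')
        clause = create_index_clause([is_exported_expr], count=5000)
        column_family.get_indexed_slices(clause)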
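
When measuring how long the query takes, it may help to time the full
consumption of the result rather than only the call itself. The sketch below
assumes that get_indexed_slices returns a lazy iterable of results (my
understanding of pycassa, worth verifying against the installed version);
timed_fetch is a hypothetical helper, not part of pycassa.

```python
import time


def timed_fetch(rows):
    """Consume an iterable of results; return (list_of_rows, elapsed_seconds)."""
    start = time.time()
    fetched = list(rows)  # forces a lazy iterable to actually fetch everything
    return fetched, time.time() - start
```

With the objects from the snippet above, it could be used as
`rows, seconds = timed_fetch(column_family.get_indexed_slices(clause))`.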

Am I doing something wrong? I expected this operation to work MUCH faster.

Any ideas or suggestions?

Some config info:
 - Cassandra 1.1.0
 - RandomPartitioner
 - I have 2 nodes and replication_factor = 2 (each server has a full data
copy)
 - Using AWS EC2, large instances
 - Software raid0 on ephemeral drives

Thanks in advance!
