cassandra-user mailing list archives

From Robert Coli <>
Subject Re: How does cassandra page through low cardinality indexes?
Date Thu, 29 May 2014 19:43:33 GMT
On Fri, May 16, 2014 at 10:53 AM, Kevin Burton <> wrote:

> I'm struggling with cassandra secondary indexes since the documentation
> seems all over the place and I'm having to put together everything from
> blog posts.

This mostly complete summary will eventually make it into a blog post:

Secondary Indexes in Cassandra

Users frequently come into #cassandra or the cassandra-user@ mailing list
and ask questions about Secondary Indexes. Here is my stock answer.

“Unless you REALLY NEED the feature of atomic update of the secondary index
with the underlying row, you are almost always better off just making your
own manual secondary index column family.”
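
To make the stock answer concrete, here is a minimal in-memory sketch of a manual secondary index, in Python purely for illustration. The two dictionaries stand in for two column families, and every name in it (users_by_id, user_ids_by_city, upsert_user, find_users_in_city) is hypothetical rather than anything from Cassandra or a driver. The two writes are separate steps, so the atomic update mentioned above is exactly what you give up.

```python
from collections import defaultdict

# Stand-ins for two column families (hypothetical names, not a real schema).
users_by_id = {}                      # base data: user_id -> row
user_ids_by_city = defaultdict(set)   # manual index: city -> set of user_ids


def upsert_user(user_id, name, city):
    """Write the base row, then maintain the index row by hand."""
    old = users_by_id.get(user_id)
    users_by_id[user_id] = {"name": name, "city": city}
    # The index is a second, separate write that the application keeps in sync.
    # This sketch reads the old row to remove the stale index entry; a real
    # design has to decide for itself how stale entries are handled.
    if old is not None and old["city"] != city:
        user_ids_by_city[old["city"]].discard(user_id)
    user_ids_by_city[city].add(user_id)


def find_users_in_city(city):
    """One lookup in the index partition, then one lookup per matching row."""
    return [users_by_id[uid] for uid in sorted(user_ids_by_city.get(city, set()))]


upsert_user(1, "alice", "Oslo")
upsert_user(2, "bob", "Oslo")
upsert_user(1, "alice", "Bergen")   # moving a user touches both structures
print(find_users_in_city("Oslo"))
```

The point of the shape is that a lookup by the indexed value reads a single known partition of the index, rather than asking every node to check its local index.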

In Cassandra, the unit of distribution is the partition (f/k/a “Row”). If
your query needs to scan multiple partitions and inspect each of their
contents, you have probably made a mistake in your data model. For queries
which interact with sets of partitions, one should use executeAsync() with
the new CQL drivers, not multigets.
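
The fan-out pattern this describes can be sketched without a cluster. The Python below is an illustration under stated assumptions: FAKE_PARTITIONS and fetch_partition are hypothetical stand-ins for a table and a single-partition query, and a thread pool plays the role that an asynchronous driver call such as executeAsync() would play in real code.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a cluster: partition key -> rows in that partition.
FAKE_PARTITIONS = {
    "user:1": ["a", "b"],
    "user:2": ["c"],
    "user:3": [],
}


def fetch_partition(key):
    """Stand-in for one single-partition query (e.g. WHERE key = ?)."""
    return FAKE_PARTITIONS.get(key, [])


def fetch_many(keys, max_in_flight=8):
    """Issue one request per partition concurrently, then gather the results."""
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        # Submit everything first so the requests overlap in time.
        futures = {key: pool.submit(fetch_partition, key) for key in keys}
        # Only then block on each future to collect its rows.
        return {key: future.result() for key, future in futures.items()}


print(fetch_many(["user:1", "user:2", "user:3"]))
```

Each request targets exactly one partition, so no single request has to wait on the slowest of several partitions the way one large multiget would.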

Advantages of Secondary Indexes :

- Atomic update of secondary index with underlying partition/storage row.
- Don’t have to be maintained manually, including automated rebuild.
- Provides the illusion that you are using an RDBMS.

Disadvantages of Secondary Indexes :

- Before 1.2, they do a read-before-write.
- A steady trickle of occasionally-serious bugs which do not affect the
normal read/write path. [1]
- Bad for low cardinality cases. FIXME : detail (relates to checking each
- Bad for high cardinality cases. FIXME : detail (certain cases? what about
- CFstats not exposed via nodetool cfstats before 1.2 : ?
- Lower availability than normal Cassandra read path. FIXME : citation
- Unsorted results: returned in token order, not in query value order.
- Can only search on datatypes Cassandra understands.
- Secondary index is located in the same directory as the primary SSTables.
- Provides the illusion that you are using an RDBMS.
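
The "token order" point in the list above can be illustrated with a small Python sketch. It makes two assumptions: MD5 is used only as a stand-in for whatever hash the partitioner applies to the partition key, and the matches dictionary is made-up data. It shows only that the order rows come back in is unrelated to their keys or values, so any other ordering is the client's job.

```python
import hashlib


def token(partition_key):
    """Stand-in token: MD5 of the key as an integer (assumption for illustration)."""
    return int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)


# Hypothetical rows matching some indexed value: partition key -> age.
matches = {"carol": 31, "alice": 27, "bob": 45, "dave": 19}

# Order in which a token-ordered scan would hand back the partitions.
in_token_order = sorted(matches, key=token)

# Any other ordering has to be applied by the client after the fact.
by_key = sorted(in_token_order)
by_age = sorted(in_token_order, key=matches.get)

print(in_token_order)
print(by_key)
print(by_age)
```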

Readers will note that I am not very clear above on which cardinality cases
they *are* good for, because I consider all the other problems sufficient
to never use them.

[1] Citations :

- 2i without read-before-write
- (0.7) Secondary Indexes aren't updated when removing whole row
- (0.7) Truncate is not secondary index aware
- (0.7) return invalidrequest when client attempts to create secondary
index on supercolumns
- (0.8) secondary index not dropped until restart
- (0.8) Empty Result with Secondary Index Queries with "limit 1"
- (0.8) secondary index on a column that has a value of size > 64k will
fail on flush
- (1.0) Wrong check of partitioner for secondary indexes
- (1.1) Fix very low Secondary Index performance
- (1.1) CQL3 range query with secondary index fails
- (1.2) Secondary indexes without read-before-write
- (1.2) Secondary Indexes fail following a system restart
- (1.2) Secondary Index Sporadically Doesn't Return Rows
- (1.1) Secondary Index stops returning rows when caching=ALL
- (1.1, but since 0.8) Compaction deletes ExpiringColumns in Secondary
Indexes
- (1.2/2.0) Can not query secondary index
- (1.2) Concurrent secondary index updates remove rows from the index
- (1.2) Intermittently, CQL SELECT with WHERE on secondary indexed field
value returns null when there are rows
- (1.2) Updates to PerRowSecondaryIndex don't use most current values
- (1.2) Slow secondary index performance when using VNodes
- (2.0) Fix 2i on composite components omissions
- (2.0) W/O specified columns ASPCSI does not get notified of deletes
- (2.0) Allow secondary indexed columns to be used with IN operator
- (1.2/2.0) Filtering on Secondary Index Takes a Long Time Even with
Limit 1, Trace Log Filled with Looping Messages
