cassandra-commits mailing list archives

From "Rustam Aliyev (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-6477) Global indexes
Date Mon, 16 Jun 2014 03:54:02 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14032102#comment-14032102 ]

Rustam Aliyev commented on CASSANDRA-6477:
------------------------------------------

In addition to performance, one of the key advantages of application-maintained global indexes
is flexibility. I think it's important to preserve it in built-in global indexes. A few cases
that I think are important to consider:

# Composite index. A global index can be based on more than one column.
# Range query on indexed elements. With a high-cardinality global index, allowing range queries
on the indexed elements would make the subsequent multiget efficient. For example, indexing
time-series data by type and then looking up with {{... TYPE='type1' and ID > minTimeuuid('2013-02-02 10:00+0000')}}
# Reverse key index. It should be possible to define the order (ASC, DESC) of the index clustering
key (i.e. the indexed elements). This is helpful when combined with the range queries above.
# Function-based index. In this case, the index is defined by a transformation function, for
example lowercase(value) or an arithmetic expression like (field1 * field2).
# Storing data in the index. Typically, global indexes have the following structure, where the
values are null:
{code}
"idx_table" {
   "index_value1" : {
       "el_id1" : null,
       "el_id5" : null,
       ...
   }
}
{code}
However, sometimes it's efficient and convenient to keep some information in the values. For example,
let's assume that the elements above contain tens of fields, but in 90% of cases the application
uses only one of them, e.g. a hash. In that case, it's efficient to scan the index and retrieve
the hash values directly from it instead of doing an additional lookup in the original table. The
table above would then look like:
{code}
"idx_table" {
   "index_value1" : {
       "el_id1" : "74335a7c9229...",
       "el_id5" : "28b986fa29eb...",
       ...
   }
}
{code}
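
To make items 2, 3 and 5 concrete, here is a minimal in-memory sketch in Java of such an index layout. It is only an illustration of the data shape, not how a built-in index would be implemented; the class {{GlobalIndexSketch}} and its methods are hypothetical names. Each index value maps to a sorted map of element id -> stored value (null when nothing is stored), which allows range scans in either order.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical in-memory model of the "idx_table" layout shown above. */
class GlobalIndexSketch {
    // index value -> (element id -> stored value, or null when nothing is stored)
    private final Map<String, NavigableMap<String, String>> partitions = new TreeMap<>();

    /** Adds an element id under an index value, with an optional stored value. */
    void put(String indexValue, String elementId, String storedValue) {
        partitions.computeIfAbsent(indexValue, k -> new TreeMap<>())
                  .put(elementId, storedValue);
    }

    /** Range query: ids strictly greater than minId for one index value, ascending. */
    NavigableMap<String, String> after(String indexValue, String minId) {
        NavigableMap<String, String> partition = partitions.get(indexValue);
        if (partition == null) {
            return new TreeMap<>();
        }
        return partition.tailMap(minId, false);
    }

    /** The same range in descending element order (the "reverse key" case). */
    NavigableMap<String, String> afterDescending(String indexValue, String minId) {
        return after(indexValue, minId).descendingMap();
    }
}
```

Reading the stored value straight from the returned map is what saves the additional lookup in the original table.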

Traditional RDBMSs support most of these index types. For function-based indexes we could create
a set of functions in CQL3 (e.g. Math.*, LOWERCASE(), etc.), similar to other RDBMSs.

Alternatively, we can achieve greater flexibility by storing optional Java 8 lambda functions.
A lambda function would take the mutated row as input and return two values:
# a non-empty set of index values (required)
# a map of id -> value, used to look up the values stored in the index (optional). If an element
is not found, null is stored.

The {{CREATE INDEX}} statement has to define the CQL type of the produced index and, optionally,
of the stored index values:
{code}
CREATE GLOBAL INDEX account_by_email_idx ON accounts (
    LAMBDA("row -> { return row.email.toLowerCase(); }")
) WITH INDEX_TYPE = {'text'};
{code}

More examples:
# Lowercase email: {code} row -> { return row.email.toLowerCase(); } {code}
# Distance between coordinates: {code} row -> { return Math.sqrt((row.x1-row.x2)*(row.x1-row.x2) + (row.y1-row.y2)*(row.y1-row.y2)); } {code}
# Conditional index: {code} row -> { return row.price > 0 ? "paid" : "free"; } {code}
# Indexes with values (item 5 above) may require a special return type (e.g. {{IndexWithValues}}).
In the following example, the message length is stored in the index: {code} row -> { return new IndexWithValues(row.type, row.message.length()); } {code}
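
For illustration, the first three lambdas above can be written as ordinary Java 8 functions. This is only a sketch under assumptions: the row is modelled as a plain column-name -> value map rather than any real Cassandra row type, and {{IndexFunctionSketch}} and its members are hypothetical names.

```java
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical holder for index functions; a row is a column-name -> value map. */
class IndexFunctionSketch {
    // Lowercase email.
    static final Function<Map<String, Object>, Object> LOWERCASE_EMAIL =
        row -> ((String) row.get("email")).toLowerCase(Locale.ROOT);

    // Euclidean distance between (x1, y1) and (x2, y2).
    static final Function<Map<String, Object>, Object> DISTANCE = row -> {
        double dx = num(row, "x1") - num(row, "x2");
        double dy = num(row, "y1") - num(row, "y2");
        return Math.sqrt(dx * dx + dy * dy);
    };

    // Conditional index: "paid" when price > 0, otherwise "free".
    static final Function<Map<String, Object>, Object> PAID_OR_FREE =
        row -> num(row, "price") > 0 ? "paid" : "free";

    // Reads a numeric column as a double, whatever its boxed type is.
    private static double num(Map<String, Object> row, String column) {
        return ((Number) row.get(column)).doubleValue();
    }
}
```

Each function returns the value to be indexed for a mutated row, so the declared index type ({{text}}, a numeric type, etc.) would have to match what the function produces.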

Querying these indexes is another caveat. Consider the distance-between-coordinates example above:
what would the SELECT statement for this index be? With application-maintained global indexes,
the application can simply look up a given value in the index. The same applies to indexes with
stored values.

Without these, built-in global indexes will be very limited and, once again, application-maintained
global indexes will remain the go-to solution.

> Global indexes
> --------------
>
>                 Key: CASSANDRA-6477
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6477
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: API, Core
>            Reporter: Jonathan Ellis
>             Fix For: 3.0
>
>
> Local indexes are suitable for low-cardinality data, where spreading the index across
> the cluster is a Good Thing.  However, for high-cardinality data, local indexes require querying
> most nodes in the cluster even if only a handful of rows is returned.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
