Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@cassandra.apache.org
Date: Mon, 16 Jun 2014 03:54:02 +0000 (UTC)
From: "Rustam Aliyev (JIRA)" <jira@apache.org>
To: commits@cassandra.apache.org
Message-ID: <JIRA.12684123.1386803761521.135411.1402890842189@arcas>
In-Reply-To: <JIRA.12684123.1386803761521@arcas>
References: <JIRA.12684123.1386803761521@arcas>
Subject: [jira] [Commented] (CASSANDRA-6477) Global indexes
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/CASSANDRA-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14032102#comment-14032102 ] 

Rustam Aliyev commented on CASSANDRA-6477:
------------------------------------------

In addition to performance, one of the key advantages of application-maintained global indexes is flexibility. I think it's important to preserve it in built-in global indexes. A few cases that I think are important to consider:

# Composite index. A global index can be based on more than one column.
# Range query on indexed elements. With a high-cardinality global index, allowing range queries on the indexed elements would make the subsequent multiget efficient. For example, indexing time-series data by type and then looking up with {{... TYPE='type1' AND ID > minTimeuuid('2013-02-02 10:00+0000')}}
# Reverse key index. It should be possible to define the order (ASC, DESC) of the index clustering key, i.e. of the indexed elements. This is helpful when combined with the range queries above.
# Function-based index. In this case, the index is defined by a transformation function, for example lowercase(value) or an arithmetic expression like (field1 * field2).
# Storing data in the index. Typically, global indexes have the following structure, where the values are nulls:
{code}
"idx_table" {
   "index_value1" : {
       "el_id1" : null,
       "el_id5" : null,
       ...
   }
}
{code}
However, sometimes it's efficient and convenient to keep some information in the values. For example, assume that the elements above contain tens of fields, but in 90% of cases the application uses only one of them, e.g. a hash. In that case, it's efficient to scan the index and retrieve the hash values directly from it instead of doing an additional lookup in the original table. The table above would then look like:
{code}
"idx_table" {
   "index_value1" : {
       "el_id1" : "74335a7c9229...",
       "el_id5" : "28b986fa29eb...",
       ...
   }
}
{code}
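
To make items 2, 3 and 5 concrete, here is a minimal in-memory sketch in Java. It is only an illustration of the layout, not a proposal for the actual storage: the class {{GlobalIndexSketch}} and its methods are hypothetical names. Each index value maps to a sorted map of element id -> stored value (null when nothing is stored), the element order can be ASC or DESC, and a range query returns ids strictly greater than a given id.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for an index table:
// index value -> (element id -> stored value, or null when nothing is stored).
class GlobalIndexSketch {
    private final Map<String, NavigableMap<String, String>> partitions = new TreeMap<>();
    private final boolean descending;

    // descending = true models an index whose clustering order is DESC.
    GlobalIndexSketch(boolean descending) {
        this.descending = descending;
    }

    // Adds or overwrites one index entry; storedValue may be null.
    void put(String indexValue, String elementId, String storedValue) {
        Comparator<String> order = descending
                ? Comparator.<String>reverseOrder()
                : Comparator.<String>naturalOrder();
        partitions.computeIfAbsent(indexValue, k -> new TreeMap<String, String>(order))
                  .put(elementId, storedValue);
    }

    // Range query: entries under indexValue whose id is strictly greater than minId
    // (in natural string order), returned in the index's own clustering order.
    NavigableMap<String, String> idsGreaterThan(String indexValue, String minId) {
        NavigableMap<String, String> partition = partitions.get(indexValue);
        if (partition == null) {
            return new TreeMap<String, String>();
        }
        // With a reversed comparator, "greater" ids sort before minId.
        return descending ? partition.headMap(minId, false) : partition.tailMap(minId, false);
    }
}
```

When the stored value is all the application needs (the hash in the example above), it can be read straight from the returned map without a second lookup in the original table.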

Traditional RDBMSs support most of these index types. For function-based indexes we could create a set of functions in CQL3 (e.g. Math.*, LOWERCASE(), etc.), similar to those found in RDBMSs.

Alternatively, we can achieve greater flexibility by storing optional Java 8 lambda functions. The lambda function would take the mutated row as input and return two values:
# a non-empty set of indexes (required)
# a map of id -> value, used to look up the values to store in the index (optional). If an element is not found, null is stored.
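
One possible reading of this contract, sketched in Java. Everything here is an assumption made for illustration: the names {{IndexMaintenanceSketch}}, {{IndexEntries}} and {{onMutation}} are hypothetical, the row is modelled as a plain map of column name to value, and the "id" in the optional map is taken to be the index key. The sketch only adds entries; removing stale entries when a row changes is left out.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch of how a stored lambda could be applied on each mutation.
class IndexMaintenanceSketch {

    // Hypothetical return type of the index function.
    static final class IndexEntries {
        final Set<String> indexKeys;            // required, non-empty
        final Map<String, String> storedValues; // optional: index key -> value to store

        IndexEntries(Set<String> indexKeys, Map<String, String> storedValues) {
            if (indexKeys == null || indexKeys.isEmpty()) {
                throw new IllegalArgumentException("at least one index key is required");
            }
            this.indexKeys = indexKeys;
            this.storedValues = storedValues == null
                    ? Collections.<String, String>emptyMap()
                    : storedValues;
        }
    }

    // index key -> (row id -> stored value, or null when none was supplied)
    final Map<String, Map<String, String>> indexTable = new HashMap<>();

    // Applies the index function to the mutated row and writes one entry per index key.
    void onMutation(String rowId,
                    Map<String, Object> row,
                    Function<Map<String, Object>, IndexEntries> indexFunction) {
        IndexEntries entries = indexFunction.apply(row);
        for (String key : entries.indexKeys) {
            // Map.get returns null for a missing key, so "not found" stores null.
            indexTable.computeIfAbsent(key, k -> new HashMap<String, String>())
                      .put(rowId, entries.storedValues.get(key));
        }
    }
}
```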

The {{CREATE INDEX}} statement has to define the CQL type of the produced index and, optionally, of the stored index values:
{code}
CREATE GLOBAL INDEX account_by_email_idx ON accounts ( LAMBDA("row -> { return row.email.toLowerCase(); }") ) WITH INDEX_TYPE = {'text'};
{code}

More examples:
# Lowercase email: {code} row -> { return row.email.toLowerCase(); } {code}
# Distance between coordinates: {code} row -> { return Math.sqrt((row.x1-row.x2)*(row.x1-row.x2) + (row.y1-row.y2)*(row.y1-row.y2)); } {code}
# Conditional index: {code} row -> { return row.price > 0 ? "paid" : "free"; } {code}
# Indexes with values (item 5 above) may require a special return type (e.g. {{IndexWithValues}}). In this example, the message length is stored in the index: {code} row -> { return new IndexWithValues(row.type, row.message.length()); } {code}
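
For reference, the four lambdas above can be written as compilable Java 8 along these lines. This is a sketch under assumptions: the row is modelled as a {{Map<String, Object>}} rather than an object with fields, and {{LambdaIndexSketch}}, the {{IndexWithValues}} class shown here and the {{num}} helper are hypothetical, not an existing Cassandra API.

```java
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical holder for the example index functions.
class LambdaIndexSketch {

    // Hypothetical return type pairing an index key with a value stored next to it.
    static final class IndexWithValues {
        final Object indexKey;
        final Object storedValue;

        IndexWithValues(Object indexKey, Object storedValue) {
            this.indexKey = indexKey;
            this.storedValue = storedValue;
        }
    }

    // Lowercase email; Locale.ROOT keeps the result independent of the node's locale.
    static final Function<Map<String, Object>, String> LOWERCASE_EMAIL =
            row -> ((String) row.get("email")).toLowerCase(Locale.ROOT);

    // Euclidean distance between (x1, y1) and (x2, y2).
    static final Function<Map<String, Object>, Double> DISTANCE = row -> {
        double dx = num(row, "x1") - num(row, "x2");
        double dy = num(row, "y1") - num(row, "y2");
        return Math.sqrt(dx * dx + dy * dy);
    };

    // Conditional index: bucket rows by whether they have a positive price.
    static final Function<Map<String, Object>, String> PAID_OR_FREE =
            row -> num(row, "price") > 0 ? "paid" : "free";

    // Index on type, storing the message length alongside each entry.
    static final Function<Map<String, Object>, IndexWithValues> TYPE_WITH_LENGTH =
            row -> new IndexWithValues(row.get("type"), ((String) row.get("message")).length());

    // Reads a numeric column as a double.
    private static double num(Map<String, Object> row, String column) {
        return ((Number) row.get(column)).doubleValue();
    }
}
```

None of these functions handles a missing column; a real implementation would have to decide whether such rows are skipped or indexed under a null key.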

Querying these indexes is another open question. Consider the distance-between-coordinates example above: what would the SELECT statement for this index look like? With application-maintained global indexes, the application can simply look up the index using a given value. The same applies to indexes with stored values.

Without these, built-in global indexes will be very limited and, once again, application-maintained global indexes will remain the go-to solution.

> Global indexes
> --------------
>
>                 Key: CASSANDRA-6477
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6477
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: API, Core
>            Reporter: Jonathan Ellis
>             Fix For: 3.0
>
>
> Local indexes are suitable for low-cardinality data, where spreading the index across the cluster is a Good Thing.  However, for high-cardinality data, local indexes require querying most nodes in the cluster even if only a handful of rows is returned.

