cassandra-commits mailing list archives

From "Jason Rutherglen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes
Date Fri, 05 Aug 2011 14:01:28 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079982#comment-13079982 ]

Jason Rutherglen commented on CASSANDRA-2915:
---------------------------------------------

bq. LUCENE-2454 adds support for nested documents. we can perhaps use this to avoid the read
before write

I think LUCENE-2454 needs the nested documents to be added at the same time.  In our case
that wouldn't be happening.  Google's GData, for example, doesn't offer automatic retrieval
of values from the previous document; it assumes you are replacing the entire document
with new contents, and relies on the user to have read the document [somewhere] beforehand.

I think there's another Lucene issue that performs an initial query to obtain the parent
document.  However, that is the same as a read before write.

I'm guessing Cassandra allows updating an individual column?  If so, I don't think there's
any way around this.
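
To make the read before write concrete, here is a minimal sketch.  It is not Lucene or
Cassandra code: `ToyDocumentIndex`, `update_column` and the sample row are hypothetical
names, and the only assumption is an index that supports whole-document replacement and
nothing finer-grained.  Updating one column then has to fetch the stored document first.

```python
class ToyDocumentIndex:
    """Hypothetical stand-in for an index that stores complete documents only."""

    def __init__(self):
        self._docs = {}  # row key -> {column name: value}
        self.reads = 0   # counts lookups so the extra read is visible

    def get(self, key):
        self.reads += 1
        return dict(self._docs.get(key, {}))

    def replace(self, key, document):
        # Whole-document replacement: any column missing from `document` is lost.
        self._docs[key] = dict(document)


def update_column(index, key, column, value):
    """Update one column by reading the old document, merging, and rewriting it."""
    document = index.get(key)  # the read before the write
    document[column] = value
    index.replace(key, document)


index = ToyDocumentIndex()
index.replace("row1", {"name": "alice", "city": "paris"})
update_column(index, "row1", "city", "berlin")
```

Calling `index.replace("row1", {"city": "berlin"})` directly would skip the read, but it
would also drop the `name` column, which is the problem described above.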

bq. We could store the expiration time in the document and make it a constraint on the lucene
query so we don't pull expired data

That would work.  We'd need to use a trie range filter query, which will make all queries
a little bit slower.
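
A rough sketch of the idea, assuming each indexed document carries a hypothetical
`expires_at` field (`None` meaning no expiration).  The linear scan below only stands in
for the range constraint; it is not how Lucene evaluates a range filter, and the names are
made up for illustration.

```python
def search(documents, predicate, now):
    """Return keys of documents that match `predicate` and have not expired."""
    return [
        key
        for key, doc in documents.items()
        # The expiration check is applied to every query, on top of the user's clause.
        if predicate(doc) and (doc["expires_at"] is None or doc["expires_at"] > now)
    ]


documents = {
    "row1": {"city": "paris", "expires_at": 100},   # already expired at now=200
    "row2": {"city": "paris", "expires_at": 300},
    "row3": {"city": "paris", "expires_at": None},  # never expires
    "row4": {"city": "rome", "expires_at": 300},    # live, but fails the predicate
}

live = search(documents, lambda doc: doc["city"] == "paris", now=200)
```

Because the extra clause rides along with every query, its cost is paid even when no
document in the index has expired.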

> Lucene based Secondary Indexes
> ------------------------------
>
>                 Key: CASSANDRA-2915
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2915
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: T Jake Luciani
>              Labels: secondary_index
>             Fix For: 1.0
>
>
> Secondary indexes (of type KEYS) suffer from a number of limitations in their current
> form:
>    - Multiple IndexClauses only work when there is a subset of rows under the highest
>      clause
>    - One new column family is created per index; this means 10 new CFs for 10 secondary
>      indexes
> This ticket will use the Lucene library to implement secondary indexes as one index per
> CF, and utilize the Lucene query engine to handle multiple index clauses. Also, by using
> Lucene we get a highly optimized file format.
> There are a few parallels we can draw between Cassandra and Lucene.
> Lucene indexes segments in memory then flushes them to disk, so we can sync our memtable
> flushes to Lucene flushes. Lucene also has optimize(), which correlates to our compaction
> process, so these can be sync'd as well.
> We will also need to correlate column validators to Lucene tokenizers, so the data can
> be stored properly. The big win is that once this is done we can perform complex queries
> within a column, like wildcard searches.
> The downside of this approach is we will need to read before write, since documents in
> Lucene are written as complete documents. For random workloads with lots of indexed
> columns this means we need to read the document from the index, update it and write it
> back.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
