lucene-dev mailing list archives

From "Otis Gospodnetic (JIRA)" <>
Subject [jira] Commented: (LUCENE-2456) A Column-Oriented Cassandra-Based Lucene Directory
Date Mon, 09 Aug 2010 17:22:18 GMT


Otis Gospodnetic commented on LUCENE-2456:

Karthick, I'm interested in the CassandraDirectory, so once you put it somewhere, please do
let us know.  Thanks.

> A Column-Oriented Cassandra-Based Lucene Directory
> --------------------------------------------------
>                 Key: LUCENE-2456
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*, Store
>    Affects Versions: 3.0.1
>            Reporter: Karthick Sankarachary
>         Attachments: LUCENE-2456.patch,
> Herein, we describe a type of Lucene directory that stores its files in a Cassandra server,
which makes for a scalable and robust store for Lucene indices.
> In brief, the CassandraDirectory maps the concept of a Lucene directory to a column family
that belongs to a certain keyspace located in a given Cassandra server. Further, it stores
each file under this directory as a row in that column family.
> Specifically, its files are broken down into blocks (whose sizes are capped), where each
block (see FileBlock) is stored as the value of a column in the corresponding row. As per, this is the recommended approach for
dealing with large objects, which Lucene files tend to be. In addition, a descriptor of the
file (see FileDescriptor) that outlines a map of blocks therein is stored as one of the columns
in that row as well. Think of this descriptor as an inode for Cassandra-based files.
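
The block-and-descriptor layout described above can be sketched roughly as follows. This is a minimal illustration only, not the code in the attached patch: the class name `BlockedRowSketch`, the column names (`block-0`, `descriptor`), and the descriptor encoding are all assumptions made for the example, and a plain in-memory map stands in for a Cassandra row.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: one "row" per file, one column per capped-size block. */
class BlockedRowSketch {
    static final String DESCRIPTOR_COLUMN = "descriptor"; // assumed column name

    /** Splits file bytes into blocks of at most blockSize and builds the row's columns. */
    static Map<String, byte[]> toRow(byte[] file, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        Map<String, byte[]> row = new LinkedHashMap<>();
        List<String> blockNames = new ArrayList<>();
        for (int offset = 0, i = 0; offset < file.length; offset += blockSize, i++) {
            int end = Math.min(offset + blockSize, file.length);
            String name = "block-" + i; // assumed naming scheme
            row.put(name, Arrays.copyOfRange(file, offset, end));
            blockNames.add(name);
        }
        // The descriptor acts like an inode: total length plus the ordered block list.
        String descriptor = file.length + ":" + String.join(",", blockNames);
        row.put(DESCRIPTOR_COLUMN, descriptor.getBytes(StandardCharsets.UTF_8));
        return row;
    }

    /** Reassembles the file by following the block list in the descriptor. */
    static byte[] fromRow(Map<String, byte[]> row) {
        String descriptor = new String(row.get(DESCRIPTOR_COLUMN), StandardCharsets.UTF_8);
        int sep = descriptor.indexOf(':');
        int length = Integer.parseInt(descriptor.substring(0, sep));
        String names = descriptor.substring(sep + 1);
        byte[] file = new byte[length];
        int offset = 0;
        if (!names.isEmpty()) {
            for (String name : names.split(",")) {
                byte[] block = row.get(name);
                System.arraycopy(block, 0, file, offset, block.length);
                offset += block.length;
            }
        }
        return file;
    }
}
```

Keeping the block list in a separate descriptor column means a reader can locate any block without scanning the whole row, which is the inode analogy drawn above.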
> The exhaustive mapping of a Lucene directory (file) to a Cassandra column family (row)
is captured in the ColumnOrientedDirectory (ColumnOrientedFile) inner-class. Specifically,
it interprets Cassandra's data model in terms of Lucene's, and vice versa. More importantly,
these are the only two inner-classes that have a foot in both the Lucene and Cassandra camps.
> All writes to a file in this directory occur through a CassandraIndexOutput, which puts
the data flushed from a write-behind buffer into the fitting set of blocks. By the same token,
all reads from a file in this directory occur through a CassandraIndexInput, which gets the
data needed by a read-ahead buffer from the right set of blocks.
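
A rough sketch of the write-behind and read-ahead idea follows, under the simplifying assumption that the buffer size equals the block size. The classes `WriteBehindSketch` and `ReadAheadSketch` are hypothetical and are not the CassandraIndexOutput or CassandraIndexInput from the patch; a `TreeMap` stands in for the block columns of one file.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical writer: buffers bytes and stores one block per flush. */
class WriteBehindSketch {
    private final byte[] buffer;
    private int buffered = 0;
    private int nextBlock = 0;
    final Map<Integer, byte[]> blocks = new TreeMap<>(); // stand-in for the block columns
    int flushes = 0;

    WriteBehindSketch(int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        this.buffer = new byte[blockSize];
    }

    void writeBytes(byte[] src) {
        for (byte b : src) {
            buffer[buffered++] = b;
            if (buffered == buffer.length) {
                flush(); // buffer is full: push it out as one block
            }
        }
    }

    /** Stores whatever is buffered as the next block; call once more when done writing. */
    void flush() {
        if (buffered == 0) {
            return;
        }
        blocks.put(nextBlock++, Arrays.copyOf(buffer, buffered));
        buffered = 0;
        flushes++;
    }
}

/** Hypothetical reader: fetches a whole block at a time and serves bytes from it. */
class ReadAheadSketch {
    private final Map<Integer, byte[]> blocks;
    private byte[] current = new byte[0];
    private int position = 0;
    private int nextBlock = 0;
    int fetches = 0;

    ReadAheadSketch(Map<Integer, byte[]> blocks) {
        this.blocks = blocks;
    }

    /** Returns the next byte as a value in 0..255, or -1 at end of file. */
    int readByte() {
        while (position == current.length) {
            byte[] block = blocks.get(nextBlock);
            if (block == null) {
                return -1;
            }
            nextBlock++;
            current = block;
            position = 0;
            fetches++;
        }
        return current[position++] & 0xFF;
    }
}
```

In this sketch the number of store operations depends on the number of blocks touched, not on the number of individual byte reads or writes.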
> The last (but not least) inner-class, CassandraClient, acts as a facade over a Thrift-based
Cassandra client. In short, it provides operations to get/put rows/columns in the column family
and keyspace associated with this directory.
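
To illustrate the facade idea, here is a small sketch of the kind of get/put surface such a client might expose. The interface `ColumnStoreSketch` and its method names are assumptions for illustration and do not reflect the actual CassandraClient or the Thrift API; the in-memory implementation merely stands in for one column family.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical facade: the few row/column operations a directory layer would need. */
interface ColumnStoreSketch {
    void put(String rowKey, String column, byte[] value);

    /** Returns the column value, or null when the row or column is absent. */
    byte[] get(String rowKey, String column);

    /** Returns the column names of a row, or an empty set when the row is absent. */
    Set<String> columns(String rowKey);

    void deleteRow(String rowKey);
}

/** In-memory stand-in for one column family; a real client would talk to Cassandra here. */
class InMemoryColumnStore implements ColumnStoreSketch {
    private final Map<String, Map<String, byte[]>> rows = new LinkedHashMap<>();

    @Override
    public void put(String rowKey, String column, byte[] value) {
        rows.computeIfAbsent(rowKey, k -> new LinkedHashMap<>()).put(column, value.clone());
    }

    @Override
    public byte[] get(String rowKey, String column) {
        Map<String, byte[]> row = rows.get(rowKey);
        byte[] value = (row == null) ? null : row.get(column);
        return (value == null) ? null : value.clone();
    }

    @Override
    public Set<String> columns(String rowKey) {
        Map<String, byte[]> row = rows.get(rowKey);
        if (row == null) {
            return Collections.<String>emptySet();
        }
        return Collections.unmodifiableSet(row.keySet());
    }

    @Override
    public void deleteRow(String rowKey) {
        rows.remove(rowKey);
    }
}
```

Hiding the store behind a narrow interface like this is what lets the directory logic be exercised without a running server.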
> Unlike Lucandra, which attempts to bridge the gap between Lucene and Cassandra at the
document-level, the CassandraDirectory is self-sufficient in the sense that it does not require
a re-write of any other component in the Lucene stack. In other words, one may use the CassandraDirectory
in conjunction with the Lucene IndexWriter and IndexReader, as one would any other kind of
Lucene Directory. Moreover, given that the data unit that is transferred to and from Cassandra
is a large-sized block, one may expect fewer round trips, and hence better throughput, from
the CassandraDirectory.
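
The round-trip argument is simple arithmetic: transferring a file in units of a given size takes the file length divided by the unit size, rounded up. The sketch below uses illustrative sizes only (a 1 MiB file, 1 KiB versus 64 KiB units); these numbers are assumptions for the example, not measurements or defaults from the patch.

```java
/** Hypothetical helper: how many transfers a file needs at a given unit size. */
class RoundTripSketch {
    static long roundTrips(long fileLength, long unitSize) {
        if (fileLength < 0 || unitSize <= 0) {
            throw new IllegalArgumentException("fileLength must be >= 0 and unitSize > 0");
        }
        // Ceiling division: a partial final unit still costs one transfer.
        return (fileLength + unitSize - 1) / unitSize;
    }

    public static void main(String[] args) {
        long oneMiB = 1L << 20;
        System.out.println(roundTrips(oneMiB, 1L << 10)); // 1024 transfers with 1 KiB units
        System.out.println(roundTrips(oneMiB, 1L << 16)); // 16 transfers with 64 KiB units
    }
}
```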
> In conclusion, this directory attempts to marry the rich search-based query language
of Lucene with the distributed fault-tolerant database that is Cassandra. By delegating the
responsibilities of replication, durability and elasticity to the directory, we free the layers
above from such non-functional concerns. Our hope is that users will choose to make their
large-scale indices instantly scalable by seamlessly migrating them to this type of directory
(using Directory#copyTo(Directory)).

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
