Wow, Steven, you really did your homework well! Major A+ in my book. :)
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
----- Original Message ----
> From: Steven Noels <stevenn@outerthought.org>
> To: hbase-user <hbase-user@hadoop.apache.org>
> Sent: Thu, June 3, 2010 3:33:04 AM
> Subject: Re: elastic search or other Lucene for HBase?
>
> On Sat, Mar 27, 2010 at 8:46 PM, Tim Robertson <
> ymailto="mailto:timrobertson100@gmail.com"
> href="mailto:timrobertson100@gmail.com">timrobertson100@gmail.com>wrote:
>
> Hi all,
>
> Is anyone using elastic search as an indexing layer to
> HBase content?
> It looks to have a really nice API, and was thinking of
> setting up an
> EC2 test where I maintain an ES index storing only the Key
> to HBase
> rows. So ES provides all search returning Key lists and
> all single
> record Get being served from HBase.
>
> Or is
> there a preferred distributed Lucene approach for HBase from the
> few
> that have been popping up? I have not had a chance to really dig
>
> into the options but I know there has been a lot of chatter on
> this.
>
For Lily - www.lilycms.org, we opted for SOLR. Here's
> some rationale behind
that (copy-pasted from our draft Lily
> website):
Selecting a search solution: SOLR
For search, the choice
> for Lucene as core technology was pretty much a
given. In Daisy, our previous
> CMS, we used Lucene only for full-text search
and performed structural
> searches on the SQL database. We merged the results
from those two different
> search technologies on the fly, supporting mixed
structural and full-text
> queries. However, this merging, combined with other
high-level features of
> Daisy, was not designed to handle very large data
sets. For Lily, we decided
> that a better approach would be to perform all
searching using one
> technology, Lucene.
A downside to Lucene is that index updates are only
> visible with some delay
to searchers, though work is ongoing to improve this.
> At its heart it is a
text-search library, though with its fielded documents
> and the trie-range
queries, it handles more data-oriented queries quite
> well.
Lucene in itself is a library, not a standalone application, nor a
> scalable
search solution. But all this can be built on top. The best known
> standalone
search server on top of Lucene is SOLR, which we decided to use in
> Lily.
But before we made that choice, we considered a lot of the
> available
options:
-
Katta <
> href="http://katta.sourceforge.net/" target=_blank
> >http://katta.sourceforge.net/>. Katta provides a powerful
> scalable
search model whereby each node is responsible for searching
> on a number of
shards, replicas of the shards are present on multiple
> of the nodes. This
provides scaling for both index size and number of
> users, and gracefully
handles node failures since the shards that
> were on a failed node will be
available online on some other nodes.
> However, Katta is only a search
solution, not an indexing solution,
> and does not offer extra search features
such as faceting.
> -
Hadoop contrib/index<
> href="http://svn.apache.org/repos/asf/hadoop/mapreduce/trunk/src/contrib/index/README"
> target=_blank
> >http://svn.apache.org/repos/asf/hadoop/mapreduce/trunk/src/contrib/index/README>.
> This is a MapReduce solution for building Lucene indexes. The nice
> thing
about it is that the MR framework manages spreading the index
> building work
over multiple nodes, reschedules failed jobs, and so
> on. It can also be used
to update existing indexes. The number of
> index shards is determined by the
number of reduce tasks. Hadoop
> contrib/index is an ideal complement to
Katta. The downside is that
> it is inherently batch-oriented, which excludes
profiting from the
> ongoing Lucene near-real time (NRT) work.
-
The tools
> from LinkedIn <
> >http://sna-projects.com/sna/>
(blog<
> href="http://invertedindex.blogspot.com/" target=_blank
> >http://invertedindex.blogspot.com/>).
LinkedIn has made
> available some cool Lucene-related projects like
Bobo<
> href="http://sna-projects.com/bobo/" target=_blank
> >http://sna-projects.com/bobo/>,
an optimized facet browser
> that does not rely on cached bitsets,
and Zoie<
> href="http://sna-projects.com/zoie/" target=_blank
> >http://sna-projects.com/zoie/>,
a real-time index-search
> solution (built in a different way than what is
available in Lucene
> 3). They are apparently integrating it all in
Sensei<
> href="http://sna-projects.com/sensei/" target=_blank
> >http://sna-projects.com/sensei/>.
It is interesting to study
> the design of these projects.
-
ElasticSearch <
> href="http://www.elasticsearch.com/" target=_blank
> >http://www.elasticsearch.com/>. ElasticSearch (ES) is a
very
> new project, that appeared as a one-man project at about the same time
> we made our choice for SOLR. One can easily launch a number of ES
> nodes,
they find themselves without configuration. Multiple indexes
> can be created
using a simple REST API. When creating an index, you
> specify the number of
shards and replicas you desire. It is designed
> to work on cloud computing
solutions like EC2, where the local disk
> is only a temporary storage. There
is a lot more to tell, but you can
> read that on their website. Despite the
name 'elastic', it does not
> support indexes growing dynamically in the same
way as tables can
> grow in HBase: the number of shards is fixed when creating
the index.
> However, if you find yourself in need of more shards, you can
create
> a new index with more shards and re-index your content into that. The
> number of shards is not related to the number of nodes, so you can plan
> for
growth by choosing e.g. 10 shards even if you have just one or
> two nodes to
start with.
-
Lucandra<
> href="http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/"
> target=_blank
> >http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/>,
> Lucehbase <
> >http://github.com/thkoch2001/lucehbase> and
Hbasene<
> href="http://github.com/akkumar/hbasene" target=_blank
> >http://github.com/akkumar/hbasene>.
These projects work by
> storing the inverted index on top of Cassandra
respectively HBase.
> The use of a database is quite different from Lucene's
segment-based
> approach. While it makes the storage of the inverted index
scalable,
> it does not necessarily make all of Lucene's functionality
scalable,
> such as sorting and faceting which depend on the field caches and
> bitset-based filters. Moreover, for HBase, which we know best, the
> storage
is not as scalable as it may seem, since terms are stored as
> rows and the
postings lists (= the documents containing the term) as
> columns. Usually the
number of terms in a corpus is relatively
> limited, while the number of
documents can be huge, but columns in
> HBase do not scale in the same way as
rows. We think the scaling
> (sharding, replication) needs to happen on the
level of Lucene
> instances itself, rather than just the storage below it.
Still, it is
> interesting to watch how these projects will evolve.
-
> Building our own. Another option was to just take Lucene itself and
> build
our own scalable search solution using it. In this case we
> would have gone
for a Katta/ElasticSearch-like approach to sharding
> and replication, with a
focus on the search features we are most
> interested in (such as faceting).
However, we decided that this would
> take too much of our time.
- Our choice, SOLR <
> href="http://lucene.apache.org/solr/" target=_blank
> >http://lucene.apache.org/solr/>. SOLR is a standalone
> Lucene-based search server made publicly available in 2006, making it
> the
oldest of the solutions listed here. It makes a lot of Lucene
> functionality
easily available, adds a schema with field types,
> faceting, different kinds
of caches and cache warming, a
> user-friendly safe query syntax, and more.
SOLR supports replicas and
> distributed searching over multiple SOLR
instances, though you are
> responsible for setting it up all yourself, it is
very much a static
> solution. Work on cloudifying
SOLR<
> href="http://svn.apache.org/repos/asf/lucene/solr/branches/cloud" target=_blank
> >http://svn.apache.org/repos/asf/lucene/solr/branches/cloud>is
ongoing.
> SOLR has lots of users, there is a book, there are companies
> supporting it, it has a large team <
> target=_blank >http://www.ohloh.net/p/solr>, and the
Lucene
> and SOLR projects recently merged.
We don't expect Lily to be
> deployed on EC2 infrastructure, but more in
private cloud/datacenter settings
> with customers. However, should ES have
appeared sooner on our radar, there's
> a good chance we would have looked at
it more in-depth. SOLR cloud uses
> ZooKeeper which we'll need between Lily
clients and servers anyway, so that
> was a nice fit with our architecture.
I haven't compared the search
> language/features between SOLR and ES myself
though, it could be that you
> need specific stuff from ES (like: the EC2
focus).
Hope this
> helps,
Steven.
--
Steven Noels
>
> href="http://outerthought.org/" target=_blank
> >http://outerthought.org/
Outerthought
> Open Source Java
> & XML
stevenn at outerthought.org
> Makers of the Daisy CMS
|