Mailing-List: contact user-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hbase.apache.org
Received-SPF: neutral (athena.apache.org: local policy)
MIME-Version: 1.0
In-Reply-To: <32120a6a1003271146m69e21f7br946c56dc0c1fd97@mail.gmail.com>
References: <32120a6a1003271146m69e21f7br946c56dc0c1fd97@mail.gmail.com>
Date: Thu, 3 Jun 2010 09:33:04 +0200
Message-ID: <AANLkTinmzdAjtvWi4i645d9pS9iBORT3jJo1NqojmQxB@mail.gmail.com>
Subject: Re: elastic search or other Lucene for HBase?
From: Steven Noels <stevenn@outerthought.org>
To: hbase-user <hbase-user@hadoop.apache.org>
Content-Type: multipart/alternative; boundary=00163628443c908ce904881b37a2

--00163628443c908ce904881b37a2
Content-Type: text/plain; charset=UTF-8

On Sat, Mar 27, 2010 at 8:46 PM, Tim Robertson <timrobertson100@gmail.com> wrote:

> Hi all,
>
> Is anyone using elastic search as an indexing layer to HBase content?
> It looks to have a really nice API, and was thinking of setting up an
> EC2 test where I maintain an ES index storing only the Key to HBase
> rows.  So ES provides all search returning Key lists and all single
> record Get being served from HBase.
>
> Or is there a preferred distributed Lucene approach for HBase from the
> few that have been popping up?  I have not had a chance to really dig
> into the options but I know there has been a lot of chatter on this.
>
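
To make the pattern in the question concrete, here is a minimal Python sketch
of it. Plain in-memory structures stand in for both the search index and
HBase, and the names KeyOnlyIndex, index_row and search_then_get are made up
for illustration; neither the ES nor the HBase client API is shown.

```python
from collections import defaultdict


class KeyOnlyIndex:
    """Toy stand-in for the search index: maps terms to row keys only."""

    def __init__(self):
        self._postings = defaultdict(set)

    def index_row(self, row_key, text):
        # Only the row key is stored; the content stays in the row store.
        for term in text.lower().split():
            self._postings[term].add(row_key)

    def search(self, query):
        # AND semantics: return keys of rows containing every query term.
        terms = query.lower().split()
        if not terms:
            return []
        keys = set.intersection(*(self._postings.get(t, set()) for t in terms))
        return sorted(keys)


def search_then_get(index, row_store, query):
    """Search yields a key list; each record is then fetched by key."""
    return [row_store[key] for key in index.search(query)]


# A dict stands in for the HBase table (row key -> record).
row_store = {
    "row-1": {"title": "Distributed Lucene on HBase"},
    "row-2": {"title": "Lucene faceting basics"},
    "row-3": {"title": "HBase schema design"},
}
index = KeyOnlyIndex()
for key, record in row_store.items():
    index.index_row(key, record["title"])

hits = search_then_get(index, row_store, "lucene hbase")
```

The index never holds record content, so every hit costs one extra lookup by
key in the row store.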


For Lily (www.lilycms.org) we opted for SOLR. Here's some of the rationale
behind that (copy-pasted from our draft Lily website):

Selecting a search solution: SOLR

For search, the choice of Lucene as core technology was pretty much a
given. In Daisy, our previous CMS, we used Lucene only for full-text search
and performed structural searches on the SQL database. We merged the results
from those two different search technologies on the fly, supporting mixed
structural and full-text queries. However, this merging, combined with other
high-level features of Daisy, was not designed to handle very large data
sets. For Lily, we decided that a better approach would be to perform all
searching using one technology, Lucene.

A downside to Lucene is that index updates are only visible with some delay
to searchers, though work is ongoing to improve this. At its heart it is a
text-search library, though with its fielded documents and the trie-range
queries, it handles more data-oriented queries quite well.

Lucene in itself is a library, not a standalone application, nor a scalable
search solution. But all this can be built on top. The best known standalone
search server on top of Lucene is SOLR, which we decided to use in Lily.

But before we made that choice, we considered a lot of the available
options:

   - Katta <http://katta.sourceforge.net/>. Katta provides a powerful, scalable
   search model in which each node is responsible for searching a number of
   shards, and replicas of each shard are present on several nodes. This
   provides scaling for both index size and number of users, and gracefully
   handles node failures, since the shards that were on a failed node remain
   available on other nodes. However, Katta is only a search solution, not an
   indexing solution, and does not offer extra search features such as
   faceting.
   - Hadoop contrib/index
   <http://svn.apache.org/repos/asf/hadoop/mapreduce/trunk/src/contrib/index/README>.
   This is a MapReduce solution for building Lucene indexes. The nice thing
   about it is that the MR framework manages spreading the index building work
   over multiple nodes, reschedules failed jobs, and so on. It can also be used
   to update existing indexes. The number of index shards is determined by the
   number of reduce tasks. Hadoop contrib/index is an ideal complement to
   Katta. The downside is that it is inherently batch-oriented, which excludes
   profiting from the ongoing Lucene near-real-time (NRT) work.
   - The tools from LinkedIn <http://sna-projects.com/sna/> (blog
   <http://invertedindex.blogspot.com/>). LinkedIn has made available some
   cool Lucene-related projects like Bobo <http://sna-projects.com/bobo/>, an
   optimized facet browser that does not rely on cached bitsets, and Zoie
   <http://sna-projects.com/zoie/>, a real-time index-search solution (built
   in a different way than what is available in Lucene 3). They are apparently
   integrating it all in Sensei <http://sna-projects.com/sensei/>. It is
   interesting to study the design of these projects.
   - ElasticSearch <http://www.elasticsearch.com/>. ElasticSearch (ES) is a
   very new project that appeared as a one-man project at about the same time
   we made our choice for SOLR. One can easily launch a number of ES nodes,
   and they find each other without configuration. Multiple indexes can be
   created using a simple REST API. When creating an index, you specify the
   number of shards and replicas you desire. It is designed to work on cloud
   computing solutions like EC2, where the local disk is only temporary
   storage. There is a lot more to tell, but you can read that on their
   website. Despite the name 'elastic', it does not support indexes growing
   dynamically in the same way as tables can grow in HBase: the number of
   shards is fixed when creating the index. However, if you find yourself in
   need of more shards, you can create a new index with more shards and
   re-index your content into that. The number of shards is not related to the
   number of nodes, so you can plan for growth by choosing e.g. 10 shards even
   if you have just one or two nodes to start with.
   - Lucandra
   <http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/>,
   Lucehbase <http://github.com/thkoch2001/lucehbase> and Hbasene
   <http://github.com/akkumar/hbasene>. These projects work by storing the
   inverted index on top of Cassandra (Lucandra) or HBase (Lucehbase,
   Hbasene). The use of a database is quite different from Lucene's
   segment-based approach. While it makes the storage of the inverted index
   scalable, it does not necessarily make all of Lucene's functionality
   scalable, such as sorting and faceting, which depend on the field caches
   and bitset-based filters. Moreover, for HBase, which we know best, the
   storage is not as scalable as it may seem, since terms are stored as rows
   and the postings lists (= the documents containing the term) as columns.
   Usually the number of terms in a corpus is relatively limited, while the
   number of documents can be huge, but columns in HBase do not scale in the
   same way as rows. We think the scaling (sharding, replication) needs to
   happen on the level of the Lucene instances themselves, rather than just
   the storage below them. Still, it is interesting to watch how these
   projects will evolve.
   - Building our own. Another option was to just take Lucene itself and build
   our own scalable search solution using it. In this case we would have gone
   for a Katta/ElasticSearch-like approach to sharding and replication, with a
   focus on the search features we are most interested in (such as faceting).
   However, we decided that this would take too much of our time.
   - Our choice, SOLR <http://lucene.apache.org/solr/>. SOLR is a standalone
   Lucene-based search server made publicly available in 2006, making it the
   oldest of the solutions listed here. It makes a lot of Lucene functionality
   easily available, adds a schema with field types, faceting, different kinds
   of caches and cache warming, a user-friendly safe query syntax, and more.
   SOLR supports replicas and distributed searching over multiple SOLR
   instances, though you are responsible for setting it all up yourself; it is
   very much a static solution. Work on cloudifying SOLR
   <http://svn.apache.org/repos/asf/lucene/solr/branches/cloud> is ongoing.
   SOLR has lots of users, there is a book, there are companies supporting it,
   it has a large team <http://www.ohloh.net/p/solr>, and the Lucene and SOLR
   projects recently merged.
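
The concern about storing an inverted index in HBase (terms as rows, postings
as columns) can be illustrated with a toy Python sketch. It assumes exactly
the layout described in the list above and nothing more: a nested dict stands
in for the table, and build_term_rows is a hypothetical helper, not an API of
Lucehbase or Hbasene. The corpus is synthetic.

```python
from collections import defaultdict


def build_term_rows(docs):
    """Return {term: {doc_id: term_frequency}}.

    One row per term, one column per document containing that term.
    """
    rows = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            rows[term][doc_id] = rows[term].get(doc_id, 0) + 1
    return dict(rows)


# Synthetic corpus: many documents drawn from a small vocabulary.
docs = {f"doc-{i}": f"the report number{i % 5} mentions hbase" for i in range(1000)}
rows = build_term_rows(docs)

row_count = len(rows)                               # follows the vocabulary size
widest = max(len(cols) for cols in rows.values())   # follows the corpus size
```

Here 1000 documents produce only 9 rows, while the row for a common term such
as "the" carries 1000 columns. The table grows wide rather than tall, which is
the direction in which HBase scales less well.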


We don't expect Lily to be deployed on EC2 infrastructure, but rather in
private cloud/datacenter settings with customers. However, had ES appeared
on our radar sooner, there's a good chance we would have looked at it in
more depth. SOLR cloud uses ZooKeeper, which we'll need between Lily
clients and servers anyway, so that was a nice fit with our architecture.
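
On the fixed number of shards in ES: a rough Python sketch of why the count is
hard to change after index creation, and why it can usefully exceed the node
count. It assumes documents are routed by hashing their id modulo the shard
count; that routing scheme is an assumption made for illustration, not
something taken from the ES documentation, and shard_for and place_shards are
made-up names.

```python
import hashlib


def shard_for(doc_id, num_shards):
    """Route a document to a shard by hashing its id (assumed scheme)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def place_shards(num_shards, nodes):
    """Spread shards round-robin over whatever nodes exist right now."""
    placement = {node: [] for node in nodes}
    for shard in range(num_shards):
        placement[nodes[shard % len(nodes)]].append(shard)
    return placement


doc_ids = [f"doc-{i}" for i in range(10_000)]

# Same documents, different shard counts: many documents change shard,
# so a larger shard count means re-indexing into a new index.
moved = sum(1 for d in doc_ids if shard_for(d, 10) != shard_for(d, 12))

# Ten shards fit on two nodes today and on five nodes later,
# without any document changing shard.
two_nodes = place_shards(10, ["node-a", "node-b"])
five_nodes = place_shards(10, ["node-a", "node-b", "node-c", "node-d", "node-e"])
```

Under this assumption, going from 10 to 12 shards sends most documents to a
different shard, whereas adding nodes only moves whole shards around. That is
the reasoning behind picking e.g. 10 shards up front.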

I haven't compared the search language/features between SOLR and ES myself,
though; it could be that you need specific things from ES (like the EC2
focus).

Hope this helps,

Steven.
-- 
Steven Noels                            http://outerthought.org/
Outerthought                            Open Source Java & XML
stevenn at outerthought.org             Makers of the Daisy CMS

--00163628443c908ce904881b37a2--