Return-Path: Delivered-To: apmail-hbase-user-archive@www.apache.org Received: (qmail 29661 invoked from network); 3 Jun 2010 07:33:32 -0000 Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 3 Jun 2010 07:33:32 -0000 Received: (qmail 94939 invoked by uid 500); 3 Jun 2010 07:33:31 -0000 Delivered-To: apmail-hbase-user-archive@hbase.apache.org Received: (qmail 94815 invoked by uid 500); 3 Jun 2010 07:33:31 -0000 Mailing-List: contact user-help@hbase.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hbase.apache.org Delivered-To: mailing list user@hbase.apache.org Received: (qmail 94807 invoked by uid 500); 3 Jun 2010 07:33:30 -0000 Delivered-To: apmail-hadoop-hbase-user@hadoop.apache.org Received: (qmail 94804 invoked by uid 99); 3 Jun 2010 07:33:30 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 03 Jun 2010 07:33:30 +0000 X-ASF-Spam-Status: No, hits=2.8 required=10.0 tests=AWL,HTML_MESSAGE,RCVD_IN_DNSWL_NONE,SPF_NEUTRAL X-Spam-Check-By: apache.org Received-SPF: neutral (athena.apache.org: local policy) Received: from [74.125.83.48] (HELO mail-gw0-f48.google.com) (74.125.83.48) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 03 Jun 2010 07:33:26 +0000 Received: by gwb17 with SMTP id 17so752159gwb.35 for ; Thu, 03 Jun 2010 00:33:04 -0700 (PDT) MIME-Version: 1.0 Received: by 10.90.150.10 with SMTP id x10mr4416008agd.141.1275550384556; Thu, 03 Jun 2010 00:33:04 -0700 (PDT) Received: by 10.90.74.11 with HTTP; Thu, 3 Jun 2010 00:33:04 -0700 (PDT) In-Reply-To: <32120a6a1003271146m69e21f7br946c56dc0c1fd97@mail.gmail.com> References: <32120a6a1003271146m69e21f7br946c56dc0c1fd97@mail.gmail.com> Date: Thu, 3 Jun 2010 09:33:04 +0200 Message-ID: Subject: Re: elastic search or other Lucene for HBase? From: Steven Noels To: hbase-user Content-Type: multipart/alternative; boundary=00163628443c908ce904881b37a2 --00163628443c908ce904881b37a2 Content-Type: text/plain; charset=UTF-8 On Sat, Mar 27, 2010 at 8:46 PM, Tim Robertson wrote: > Hi all, > > Is anyone using elastic search as an indexing layer to HBase content? > It looks to have a really nice API, and was thinking of setting up an > EC2 test where I maintain an ES index storing only the Key to HBase > rows. So ES provides all search returning Key lists and all single > record Get being served from HBase. > > Or is there a preferred distributed Lucene approach for HBase from the > few that have been popping up? I have not had a chance to really dig > into the options but I know there has been a lot of chatter on this. > For Lily - www.lilycms.org, we opted for SOLR. Here's some rationale behind that (copy-pasted from our draft Lily website): Selecting a search solution: SOLR For search, the choice for Lucene as core technology was pretty much a given. In Daisy, our previous CMS, we used Lucene only for full-text search and performed structural searches on the SQL database. We merged the results from those two different search technologies on the fly, supporting mixed structural and full-text queries. However, this merging, combined with other high-level features of Daisy, was not designed to handle very large data sets. For Lily, we decided that a better approach would be to perform all searching using one technology, Lucene. A downside to Lucene is that index updates are only visible with some delay to searchers, though work is ongoing to improve this. At its heart it is a text-search library, though with its fielded documents and the trie-range queries, it handles more data-oriented queries quite well. Lucene in itself is a library, not a standalone application, nor a scalable search solution. But all this can be built on top. The best known standalone search server on top of Lucene is SOLR, which we decided to use in Lily. But before we made that choice, we considered a lot of the available options: - Katta . Katta provides a powerful scalable search model whereby each node is responsible for searching on a number of shards, replicas of the shards are present on multiple of the nodes. This provides scaling for both index size and number of users, and gracefully handles node failures since the shards that were on a failed node will be available online on some other nodes. However, Katta is only a search solution, not an indexing solution, and does not offer extra search features such as faceting. - Hadoop contrib/index. This is a MapReduce solution for building Lucene indexes. The nice thing about it is that the MR framework manages spreading the index building work over multiple nodes, reschedules failed jobs, and so on. It can also be used to update existing indexes. The number of index shards is determined by the number of reduce tasks. Hadoop contrib/index is an ideal complement to Katta. The downside is that it is inherently batch-oriented, which excludes profiting from the ongoing Lucene near-real time (NRT) work. - The tools from LinkedIn (blog). LinkedIn has made available some cool Lucene-related projects like Bobo, an optimized facet browser that does not rely on cached bitsets, and Zoie, a real-time index-search solution (built in a different way than what is available in Lucene 3). They are apparently integrating it all in Sensei. It is interesting to study the design of these projects. - ElasticSearch . ElasticSearch (ES) is a very new project, that appeared as a one-man project at about the same time we made our choice for SOLR. One can easily launch a number of ES nodes, they find themselves without configuration. Multiple indexes can be created using a simple REST API. When creating an index, you specify the number of shards and replicas you desire. It is designed to work on cloud computing solutions like EC2, where the local disk is only a temporary storage. There is a lot more to tell, but you can read that on their website. Despite the name 'elastic', it does not support indexes growing dynamically in the same way as tables can grow in HBase: the number of shards is fixed when creating the index. However, if you find yourself in need of more shards, you can create a new index with more shards and re-index your content into that. The number of shards is not related to the number of nodes, so you can plan for growth by choosing e.g. 10 shards even if you have just one or two nodes to start with. - Lucandra, Lucehbase and Hbasene. These projects work by storing the inverted index on top of Cassandra respectively HBase. The use of a database is quite different from Lucene's segment-based approach. While it makes the storage of the inverted index scalable, it does not necessarily make all of Lucene's functionality scalable, such as sorting and faceting which depend on the field caches and bitset-based filters. Moreover, for HBase, which we know best, the storage is not as scalable as it may seem, since terms are stored as rows and the postings lists (= the documents containing the term) as columns. Usually the number of terms in a corpus is relatively limited, while the number of documents can be huge, but columns in HBase do not scale in the same way as rows. We think the scaling (sharding, replication) needs to happen on the level of Lucene instances itself, rather than just the storage below it. Still, it is interesting to watch how these projects will evolve. - Building our own. Another option was to just take Lucene itself and build our own scalable search solution using it. In this case we would have gone for a Katta/ElasticSearch-like approach to sharding and replication, with a focus on the search features we are most interested in (such as faceting). However, we decided that this would take too much of our time. - Our choice, SOLR . SOLR is a standalone Lucene-based search server made publicly available in 2006, making it the oldest of the solutions listed here. It makes a lot of Lucene functionality easily available, adds a schema with field types, faceting, different kinds of caches and cache warming, a user-friendly safe query syntax, and more. SOLR supports replicas and distributed searching over multiple SOLR instances, though you are responsible for setting it up all yourself, it is very much a static solution. Work on cloudifying SOLRis ongoing. SOLR has lots of users, there is a book, there are companies supporting it, it has a large team , and the Lucene and SOLR projects recently merged. We don't expect Lily to be deployed on EC2 infrastructure, but more in private cloud/datacenter settings with customers. However, should ES have appeared sooner on our radar, there's a good chance we would have looked at it more in-depth. SOLR cloud uses ZooKeeper which we'll need between Lily clients and servers anyway, so that was a nice fit with our architecture. I haven't compared the search language/features between SOLR and ES myself though, it could be that you need specific stuff from ES (like: the EC2 focus). Hope this helps, Steven. -- Steven Noels http://outerthought.org/ Outerthought Open Source Java & XML stevenn at outerthought.org Makers of the Daisy CMS --00163628443c908ce904881b37a2--