Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Date: Fri, 25 Feb 2011 21:03:22 +0000 (UTC)
From: "Jason Rutherglen (JIRA)" <jira@apache.org>
To: issues@hbase.apache.org
Message-ID: 
 <1302890213.737.1298667802373.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: 
 <531563484.15184.1297703757561.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] Commented: (HBASE-3529) Add search to HBase
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999565#comment-12999565 ] 

Jason Rutherglen commented on HBASE-3529:
-----------------------------------------

bq. Writing the indexes to HDFS is possible after LUCENE-2373?

Right, that's implemented in trunk as the appending codec (AppendingCodec): https://hudson.apache.org/hudson/job/Lucene-trunk/javadoc//contrib-misc/org/apache/lucene/index/codecs/appending/AppendingCodec.html

bq. We get direct reads from HDFS via HDFS-347 and the OS block cache can help there?

BlockReaderLocal is synchronized on each method, which is something we outgrew in Lucene a while back (in its place NIOFSDirectory is most used, with MMap second).  We'd likely have a couple of options here: write to HDFS and [probably] slow queries to some extent, or write directly to a local directory and accept the mechanical overhead of copying index files in/out of HDFS.
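
To make the contention point concrete, here is a minimal sketch (plain JDK code, not Lucene or HDFS code) contrasting a reader that locks around seek-plus-read on a shared file pointer with a reader that uses positional FileChannel.read(ByteBuffer, long) calls, which is the style of access NIOFSDirectory relies on. All class and method names are hypothetical, and a temp file stands in for an index file.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

// Hypothetical names throughout; a temp file stands in for an index file.
class ReadStyles {

    // Style 1: seek + read share one file pointer, so the pair must be a
    // single critical section and concurrent readers wait on the lock.
    static final class SynchronizedReader implements AutoCloseable {
        private final RandomAccessFile file;

        SynchronizedReader(Path path) throws IOException {
            this.file = new RandomAccessFile(path.toFile(), "r");
        }

        synchronized byte[] read(long position, int length) throws IOException {
            byte[] out = new byte[length];
            file.seek(position);
            file.readFully(out);
            return out;
        }

        @Override
        public void close() throws IOException {
            file.close();
        }
    }

    // Style 2: each call passes its own offset and leaves the channel's
    // position untouched, so this class needs no lock of its own.
    static final class PositionalReader implements AutoCloseable {
        private final FileChannel channel;

        PositionalReader(Path path) throws IOException {
            this.channel = FileChannel.open(path, StandardOpenOption.READ);
        }

        byte[] read(long position, int length) throws IOException {
            ByteBuffer buf = ByteBuffer.allocate(length);
            while (buf.hasRemaining()) {
                int n = channel.read(buf, position + buf.position());
                if (n < 0) {
                    throw new IOException("unexpected end of file");
                }
            }
            return buf.array();
        }

        @Override
        public void close() throws IOException {
            channel.close();
        }
    }

    // Writes data to a temp file and checks that both styles return the same slice.
    static boolean readersAgree(byte[] data, long position, int length) {
        try {
            Path path = Files.createTempFile("index-sample", ".bin");
            try {
                Files.write(path, data);
                try (SynchronizedReader locked = new SynchronizedReader(path);
                     PositionalReader positional = new PositionalReader(path)) {
                    byte[] expected =
                        Arrays.copyOfRange(data, (int) position, (int) position + length);
                    return Arrays.equals(expected, locked.read(position, length))
                        && Arrays.equals(expected, positional.read(position, length));
                }
            } finally {
                Files.deleteIfExists(path);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Both readers return the same bytes; the difference only shows up under concurrent load, where the first serializes every read. Whether positional reads actually proceed in parallel depends on the underlying platform.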

> Add search to HBase
> -------------------
>
>                 Key: HBASE-3529
>                 URL: https://issues.apache.org/jira/browse/HBASE-3529
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.0
>            Reporter: Jason Rutherglen
>
> Using the Apache Lucene library we can add freetext search to HBase.  The advantages of this are:
> * HBase is highly scalable and distributed
> * HBase is realtime
> * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
> * Lucene offers many types of queries not currently available in HBase (eg, AND, OR, NOT, phrase, etc)
> * It's easier to build scalable realtime systems on top of an already architecturally sound, scalable realtime data system, eg, HBase.
> * Scaling realtime search will be as simple as scaling HBase.
> Phase 1 - Indexing:
> * Integrate Lucene into HBase such that an index mirrors a given region.  This means cascading adds, updates, and deletes between a Lucene index and an HBase region (and vice versa).
> * Define meta-data to mark a region as indexed, and use a Solr schema to allow the user to define the fields and analyzers.
> * Integrate with the HLog to ensure that index recovery can occur properly (eg, on region server failure)
> * Mirror region splits with indexes (use Lucene's IndexSplitter?)
> * When a region is written to HDFS, also write the corresponding Lucene index to HDFS.
> * A row key will be the ID of a given Lucene document.  The Lucene docstore will explicitly not be used because the document/row data is stored in HBase.  We will need to determine the best data structure for efficiently mapping a docid -> row key.  It could be a docstore, field cache, column stride fields, or some other mechanism.
> * Write unit tests for the above
> Phase 2 - Queries:
> * Enable distributed Lucene queries
> * Regions that have Lucene indexes are inherently available and may be searched, meaning there's no need for a separate search-related system in ZooKeeper.
> * Integrate search with HBase's RPC mechanism
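
For the docid -> row key mapping raised in the description above, one of the simplest candidates is an in-memory array indexed by docid, similar in spirit to a field cache. The sketch below is only an illustration under that assumption: DocIdToRowKey is a hypothetical class, not part of HBase or Lucene, and it ignores the fact that docids are reassigned when segments merge.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: not an HBase or Lucene class.
class DocIdToRowKey {
    // Position in the list is the docid; the value is the HBase row key.
    private final List<byte[]> rowKeys = new ArrayList<>();

    // Records the row key for the next document and returns the docid
    // assigned to it (dense, starting at 0, as within one index segment).
    int add(byte[] rowKey) {
        rowKeys.add(Arrays.copyOf(rowKey, rowKey.length));
        return rowKeys.size() - 1;
    }

    // Constant-time lookup of the row key for a docid returned by a query.
    byte[] rowKey(int docId) {
        if (docId < 0 || docId >= rowKeys.size()) {
            throw new IllegalArgumentException("unknown docid: " + docId);
        }
        byte[] key = rowKeys.get(docId);
        return Arrays.copyOf(key, key.length);
    }

    int size() {
        return rowKeys.size();
    }
}
```

A query would collect matching docids from the index, translate each through rowKey(docId), and then fetch the row data from the region. The trade-off is heap usage proportional to the number of documents, which is the reason the description also lists on-disk alternatives.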

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira