Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Date: Tue, 1 Mar 2011 13:15:37 +0000 (UTC)
From: "Jason Rutherglen (JIRA)" <jira@apache.org>
To: issues@hbase.apache.org
Message-ID: 
 <513131236.4931.1298985337302.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: 
 <531563484.15184.1297703757561.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] Commented: (HBASE-3529) Add search to HBase
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000866#comment-13000866 ] 

Jason Rutherglen commented on HBASE-3529:
-----------------------------------------

@Stack Thanks for the analysis.  I forgot to mention that each subquery would also require its own FSInputStream, which would mean too many file descriptors.  The heap required for 25 bytes * 2 mil docs is 50MB; is that too much?

I think we can go ahead with the positional read, which would only require one FSInputStream per file, shared by all readers of that file (using FileChannel.read(ByteBuffer dst, long position) underneath).  Given that the number of blocks per Lucene file will be < 10 and the blocks are of a fixed size, we can compute (offset / blocksize) to efficiently obtain the block index?  I think it'll be efficient to translate a file offset into a local block file; I'm not sure why LocatedBlocks.findBlock uses a binary search, because I'm not familiar enough with HDFS.  Then we'd just need to cache the LocatedBlock(s) instead of looking them up from the DataNode on each small read (byte[1024]) call.
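
To illustrate the positional read idea, here is a minimal sketch in plain Java NIO.  It is not HDFS or HBase code: the class name SharedPositionalReader and its readAt method are hypothetical, and it reads a local file only.  The point is that FileChannel.read(ByteBuffer dst, long position) does not move the channel's own position, so many readers can share one open channel (one file descriptor) per file.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

// Hypothetical helper: one open channel per file, shared by all readers.
final class SharedPositionalReader implements AutoCloseable {
    private final FileChannel channel;

    SharedPositionalReader(Path file) throws IOException {
        this.channel = FileChannel.open(file, StandardOpenOption.READ);
    }

    // Reads up to len bytes starting at the absolute file position.
    // The channel's own position is never modified, so concurrent
    // callers do not interfere with each other.
    byte[] readAt(long position, int len) throws IOException {
        ByteBuffer dst = ByteBuffer.allocate(len);
        while (dst.hasRemaining()) {
            int n = channel.read(dst, position + dst.position());
            if (n < 0) {
                break; // end of file reached before len bytes were read
            }
        }
        return Arrays.copyOf(dst.array(), dst.position());
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("posread", ".bin");
        Files.write(tmp, new byte[] {0, 1, 2, 3, 4, 5, 6, 7, 8, 9});
        try (SharedPositionalReader reader = new SharedPositionalReader(tmp)) {
            System.out.println(Arrays.toString(reader.readAt(3, 4)));
        } finally {
            Files.delete(tmp);
        }
    }
}
```

A read that runs past the end of the file returns a shorter array rather than failing, which is one possible choice for the small byte[1024] reads mentioned above.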

In summary:

* DFSClient.DFSInputStream.getBlockRange looks fast enough for many calls per second
* LocatedBlocks.findBlock uses a binary search for some reason, which could be a bottleneck; why can't we divide the offset by the block size?  Oh ok, that's because block sizes are variable.  I guess if the number of blocks is small the binary search will always be fast?  Or we can detect whether the blocks are all the same size and divide to get the correct block?
* DFSClient.DFSInputStream.fetchBlockByteRange is a hotspot because it calls chooseDataNode, whose return value [DNAddrPair] can be cached inside of LocatedBlock?
* Later in fetchBlockByteRange we call DFSClient.createClientDatanodeProtocolProxy() and make a local RPC call, getBlockPathInfo.  I think the results of this [BlockPathInfo] can be cached into LocatedBlock as well?
* Then instead of instantiating a new BlockReader object, we can call FileChannel.read(ByteBuffer b, long pos) directly?
* With this solution in place we can safely store documents in the docstore without any worries and, in addition, use the system that is most efficient in Lucene today, all the while using the fewest file descriptors possible.
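
The block lookup and caching points in the summary could be sketched roughly as follows.  This is an illustration only and does not use the real HDFS classes: CachedBlockLocator, findBlock, locate, and the resolver function are all hypothetical names, and the resolver stands in for the expensive per-block work (the chooseDataNode and getBlockPathInfo calls).  It divides by the block size when every block but the last has the same size, and otherwise falls back to a binary search over block start offsets.

```java
import java.util.Arrays;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: maps a file offset to a block index and caches
// the per-block result of an expensive lookup.
final class CachedBlockLocator<T> {
    private final long[] starts;      // start offset of each block, ascending
    private final long fileLength;    // sum of all block sizes
    private final long uniformSize;   // shared block size, or -1 if sizes vary
    private final Function<Integer, T> resolver; // assumed expensive, called once per block
    private final ConcurrentHashMap<Integer, T> cache = new ConcurrentHashMap<>();

    CachedBlockLocator(long[] blockSizes, Function<Integer, T> resolver) {
        if (blockSizes.length == 0) {
            throw new IllegalArgumentException("at least one block is required");
        }
        this.resolver = resolver;
        this.starts = new long[blockSizes.length];
        long offset = 0;
        boolean uniform = true;
        int last = blockSizes.length - 1;
        for (int i = 0; i < blockSizes.length; i++) {
            if (blockSizes[i] <= 0) {
                throw new IllegalArgumentException("block sizes must be positive");
            }
            starts[i] = offset;
            offset += blockSizes[i];
            // Every block except the last must match the first; the last
            // block may be shorter but not longer.
            if (i < last && blockSizes[i] != blockSizes[0]) uniform = false;
            if (i == last && blockSizes[i] > blockSizes[0]) uniform = false;
        }
        this.fileLength = offset;
        this.uniformSize = uniform ? blockSizes[0] : -1;
    }

    // Returns the index of the block containing the given file offset.
    int findBlock(long offset) {
        if (offset < 0 || offset >= fileLength) {
            throw new IndexOutOfBoundsException("offset " + offset);
        }
        if (uniformSize > 0) {
            return (int) (offset / uniformSize); // fast path: one division
        }
        int i = Arrays.binarySearch(starts, offset);
        // A miss returns -(insertionPoint) - 1; the containing block is
        // the one just before the insertion point.
        return i >= 0 ? i : -i - 2;
    }

    // Returns the cached per-block value, resolving it on first use.
    T locate(long offset) {
        return cache.computeIfAbsent(findBlock(offset), resolver);
    }
}
```

With fewer than 10 blocks per file either path is cheap; the larger saving in this sketch comes from resolving each block once instead of on every small read.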

> Add search to HBase
> -------------------
>
>                 Key: HBASE-3529
>                 URL: https://issues.apache.org/jira/browse/HBASE-3529
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.0
>            Reporter: Jason Rutherglen
>
> Using the Apache Lucene library we can add freetext search to HBase.  The advantages of this are:
> * HBase is highly scalable and distributed
> * HBase is realtime
> * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
> * Lucene offers many types of queries not currently available in HBase (eg, AND, OR, NOT, phrase, etc)
> * It's easier to build scalable realtime systems on top of an already architecturally sound, scalable realtime data system, eg, HBase.
> * Scaling realtime search will be as simple as scaling HBase.
> Phase 1 - Indexing:
> * Integrate Lucene into HBase such that an index mirrors a given region.  This means cascading adds, updates, and deletes between a Lucene index and an HBase region (and vice versa).
> * Define meta-data to mark a region as indexed, and use a Solr schema to allow the user to define the fields and analyzers.
> * Integrate with the HLog to ensure that index recovery can occur properly (eg, on region server failure)
> * Mirror region splits with indexes (use Lucene's IndexSplitter?)
> * When a region is written to HDFS, also write the corresponding Lucene index to HDFS.
> * A row key will be the ID of a given Lucene document.  The Lucene docstore will explicitly not be used because the document/row data is stored in HBase.  We will need to determine the best data structure for efficiently mapping a docid -> row key.  It could be a docstore, field cache, column stride fields, or some other mechanism.
> * Write unit tests for the above
> Phase 2 - Queries:
> * Enable distributed Lucene queries
> * Regions that have Lucene indexes are inherently available and may be searched on, meaning there's no need for a separate search related system in Zookeeper.
> * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira