hbase-issues mailing list archives

From "Jason Rutherglen (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-3529) Add search to HBase
Date Wed, 02 Mar 2011 06:06:36 GMT

    [ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001299#comment-13001299 ]

Jason Rutherglen commented on HBASE-3529:

bq. Why niopositional read? How is that different than the pread that is already in the dfsclient

I think the goal of HDFS-347 is that it'll automatically switch between reading over the network
and reading locally?  So the pread will do one or the other?

bq. You going to get the FileChannel off the socket (thats the nio bit)?

That's just for the local file.  
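
For reference, a minimal sketch of what a positional read on a local file looks like with NIO. It uses the standard `FileChannel.read(ByteBuffer, long)` call, which reads at an absolute offset without moving the channel's own position. The class and method names (`PositionalReadSketch`, `readAt`) are hypothetical and are not taken from HDFS-347 or from the HDFSDirectory patch. The existing dfsclient pread is exposed through Hadoop's `PositionedReadable.read(long, byte[], int, int)`, which this sketch does not touch.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class PositionalReadSketch {

    // Hypothetical helper: reads up to `length` bytes starting at absolute
    // `offset`. The channel's own position is never changed, so several
    // threads can share one channel for random access.
    public static byte[] readAt(FileChannel channel, long offset, int length) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(length);
        long pos = offset;
        while (buf.hasRemaining()) {
            int n = channel.read(buf, pos); // positional read
            if (n < 0) {
                break; // reached end of file before filling the buffer
            }
            pos += n;
        }
        buf.flip();
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("pread-sketch", ".bin");
        try {
            Files.write(tmp, "hello world".getBytes(StandardCharsets.US_ASCII));
            try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
                byte[] bytes = readAt(ch, 6, 5);
                System.out.println(new String(bytes, StandardCharsets.US_ASCII)); // prints "world"
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Because the read carries its own offset, there is no seek-then-read pair to synchronize, which is the property a Lucene Directory wants for concurrent searches over one open file.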

bq. What if the data is not local, usually it is (> 99% of the time), but is not always;
e.g. in time of failure or perhaps after a rebalance. 

If we read off a socket I think there's going to be a serious degradation in performance.
I think that's an invariant of search?

{quote}A version that respects 'security', where the 'cleared' fd is passed via unix domain
sockets, for the dfsclient to use going direct is probably what'll go in sometime soon hopefully.{quote}

That'll be good!  I think this initial version (of HDFS modifications) is simply to get things
going.  As these other [HDFS] improvements are added we can use them, and the DFSInputStream
methods used by HDFSDirectory will be the same?

{quote}You are messing down deep below hbase in dfs. I'm a little worried that you'll do a
bunch of custom work that may work for your lucene directory implementation but that it will
be so particular, it won't be accepted back into hdfs.{quote}

If we need to pass the FD using Unix domain sockets then the HDFS work won't be useful.  If
the UDSs enable positional read, then the [Lucene] HDFSDirectory will work well.

> Add search to HBase
> -------------------
>                 Key: HBASE-3529
>                 URL: https://issues.apache.org/jira/browse/HBASE-3529
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.0
>            Reporter: Jason Rutherglen
> Using the Apache Lucene library we can add freetext search to HBase.  The advantages of this are:
> * HBase is highly scalable and distributed
> * HBase is realtime
> * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
> * Lucene offers many types of queries not currently available in HBase (eg, AND, OR, NOT, phrase, etc)
> * It's easier to build scalable realtime systems on top of an already architecturally sound, scalable realtime data system, eg, HBase.
> * Scaling realtime search will be as simple as scaling HBase.
> Phase 1 - Indexing:
> * Integrate Lucene into HBase such that an index mirrors a given region.  This means cascading adds, updates, and deletes between a Lucene index and an HBase region (and vice versa).
> * Define meta-data to mark a region as indexed, and use a Solr schema to allow the user to define the fields and analyzers.
> * Integrate with the HLog to ensure that index recovery can occur properly (eg, on region server failure)
> * Mirror region splits with indexes (use Lucene's IndexSplitter?)
> * When a region is written to HDFS, also write the corresponding Lucene index to HDFS.
> * A row key will be the ID of a given Lucene document.  The Lucene docstore will explicitly not be used because the document/row data is stored in HBase.  We will need to determine the best data structure for efficiently mapping a docid -> row key.  It could be a docstore, field cache, column stride fields, or some other mechanism.
> * Write unit tests for the above
> Phase 2 - Queries:
> * Enable distributed Lucene queries
> * Regions that have Lucene indexes are inherently available and may be searched on, meaning there's no need for a separate search related system in Zookeeper.
> * Integrate search with HBase's RPC mechanism
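
To illustrate the docid -> row key mapping listed under Phase 1, here is a minimal field-cache-style sketch: an in-memory array indexed by doc id, relying on the fact that Lucene assigns doc ids sequentially within a segment. This is only one of the candidate mechanisms the description names, and the class `DocIdRowKeyMapSketch` is hypothetical. It ignores deletes and segment merges (which renumber doc ids), so a real implementation would need to rebuild or remap per segment.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DocIdRowKeyMapSketch {

    // rowKeys[docId] holds the HBase row key for that doc id.
    private byte[][] rowKeys = new byte[16][];
    private int size = 0;

    /** Appends a row key and returns the doc id assigned to it. */
    public int add(byte[] rowKey) {
        if (size == rowKeys.length) {
            rowKeys = Arrays.copyOf(rowKeys, size * 2); // grow geometrically
        }
        rowKeys[size] = rowKey.clone();
        return size++;
    }

    /** Returns a copy of the row key for the given doc id. */
    public byte[] rowKey(int docId) {
        if (docId < 0 || docId >= size) {
            throw new IndexOutOfBoundsException("docId " + docId);
        }
        return rowKeys[docId].clone();
    }

    public int size() {
        return size;
    }

    public static void main(String[] args) {
        DocIdRowKeyMapSketch map = new DocIdRowKeyMapSketch();
        int first = map.add("row-a".getBytes(StandardCharsets.UTF_8));
        int second = map.add("row-b".getBytes(StandardCharsets.UTF_8));
        // A search hit carries a doc id; the row key is then used to fetch the row from HBase.
        System.out.println(first + " -> " + new String(map.rowKey(first), StandardCharsets.UTF_8));
        System.out.println(second + " -> " + new String(map.rowKey(second), StandardCharsets.UTF_8));
    }
}
```

The lookup is a single array access per hit, at the cost of holding every row key of the segment in memory; a docstore or column stride field would trade that memory for a disk read.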

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

