hbase-issues mailing list archives

From "Jason Rutherglen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-3529) Add search to HBase
Date Thu, 26 May 2011 05:35:47 GMT

    [ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13039516#comment-13039516 ]

Jason Rutherglen commented on HBASE-3529:
-----------------------------------------

With regard to checksums, I think we can verify the Lucene index files
just once, when the HDFSDirectory is created. We cannot checksum on every
file open, as the overhead would be too high. I think a hook will need to
be added so the checksum can be run via the HDFS client?
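
To make the verify-once idea concrete, here is a minimal sketch. It is
not the patch's HDFSDirectory: the class name VerifyOnceDirectory, the
map of expected CRC32 values, and the use of a local path in place of
HDFS are all assumptions made so the example runs on its own.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.zip.CRC32;

// Hypothetical stand-in for the HDFSDirectory discussed above. It reads from a
// local path so the sketch runs without a Hadoop cluster.
final class VerifyOnceDirectory {
    private final Path root;

    // Verifies every listed file exactly once, at construction time.
    // expectedCrcs maps file name -> CRC32 recorded when the file was written
    // (where those values would be stored is an open question, assumed here).
    VerifyOnceDirectory(Path root, Map<String, Long> expectedCrcs) throws IOException {
        this.root = root;
        for (Map.Entry<String, Long> e : expectedCrcs.entrySet()) {
            long actual = crc32(root.resolve(e.getKey()));
            if (actual != e.getValue()) {
                throw new IOException("Checksum mismatch for " + e.getKey());
            }
        }
    }

    // Opens a file with no further verification, so the per-open cost stays low.
    InputStream openInput(String name) throws IOException {
        return Files.newInputStream(root.resolve(name));
    }

    // Streams the file through CRC32 and returns the checksum value.
    static long crc32(Path file) throws IOException {
        CRC32 crc = new CRC32();
        byte[] buf = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                crc.update(buf, 0, n);
            }
        }
        return crc.getValue();
    }
}
```

The cost of reading each file once is paid when the directory is
constructed; every later openInput call skips verification.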

The other issue is ensuring data locality; otherwise Lucene queries will
be unusable due to their inherently random access pattern. I think for
this we'll need to add something to the NameNode? Perhaps it would be a
custom placement policy: if a given file is part of the Lucene index and
is not local, we ask the NameNode to make it local (thereby
over-replicating the file). I think this'll be a separate Jira issue?
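
As a rough illustration of the "is this index file local?" check, here
is a small sketch. The class IndexLocality and its input map are
hypothetical; in a real deployment the per-block host lists would come
from the HDFS client (for example FileSystem.getFileBlockLocations).
The follow-up step of asking the NameNode for a local replica is not
modelled, since it would need the new hook or placement policy
discussed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: decides which index files would need a local replica.
final class IndexLocality {
    // blockHosts maps file name -> one set of replica hosts per block.
    // A file counts as local only if every one of its blocks has a replica
    // on localHost; all other files are returned.
    static List<String> filesNeedingLocalReplica(
            Map<String, List<Set<String>>> blockHosts, String localHost) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<Set<String>>> e : blockHosts.entrySet()) {
            for (Set<String> hosts : e.getValue()) {
                if (!hosts.contains(localHost)) {
                    result.add(e.getKey());
                    break; // one non-local block is enough
                }
            }
        }
        return result;
    }
}
```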

bq. User inserts data using normal HBase APIs

Yes, even if we [possibly] support Solr, we'd only be implementing a
subset of the Solr functionality. One thing that would go unused is the
ability to update documents using the Solr APIs (which we'd turn off);
instead, the data will only be updated via HBase. The Solr query APIs and
schema would be the main parts of Solr we'd be using. This can be roughly
defined as making use of the request handlers and search components
(http://wiki.apache.org/solr/SearchComponent), which perhaps should be
modularized out of Solr anyway.

{quote} User can now perform an arbitrary Lucene search over the table,
resulting in completely up-to-date results (i.e., spanning both memstore
and flushed data)? {quote}

Yes, that is correct.

> Add search to HBase
> -------------------
>
>                 Key: HBASE-3529
>                 URL: https://issues.apache.org/jira/browse/HBASE-3529
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.0
>            Reporter: Jason Rutherglen
>         Attachments: HBASE-3529.patch
>
>
> Using the Apache Lucene library we can add freetext search to HBase.  The advantages
> of this are:
> * HBase is highly scalable and distributed
> * HBase is realtime
> * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
> * Lucene offers many types of queries not currently available in HBase (eg, AND, OR,
>   NOT, phrase, etc)
> * It's easier to build scalable realtime systems on top of an already architecturally
>   sound, scalable realtime data system, eg, HBase.
> * Scaling realtime search will be as simple as scaling HBase.
> Phase 1 - Indexing:
> * Integrate Lucene into HBase such that an index mirrors a given region.  This means
>   cascading adds, updates, and deletes between a Lucene index and an HBase region (and
>   vice versa).
> * Define meta-data to mark a region as indexed, and use a Solr schema to allow the user
>   to define the fields and analyzers.
> * Integrate with the HLog to ensure that index recovery can occur properly (eg, on
>   region server failure)
> * Mirror region splits with indexes (use Lucene's IndexSplitter?)
> * When a region is written to HDFS, also write the corresponding Lucene index to HDFS.
> * A row key will be the ID of a given Lucene document.  The Lucene docstore will
>   explicitly not be used because the document/row data is stored in HBase.  We will
>   need to determine the best data structure for efficiently mapping a docid -> row key.
>   It could be a docstore, field cache, column stride fields, or some other mechanism.
> * Write unit tests for the above
> Phase 2 - Queries:
> * Enable distributed Lucene queries
> * Regions that have Lucene indexes are inherently available and may be searched on,
>   meaning there's no need for a separate search-related system in ZooKeeper.
> * Integrate search with HBase's RPC mechanism
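
For the docid -> row key mapping mentioned in Phase 1, one of the
simplest candidates is an in-memory array indexed by docid, roughly in
the spirit of a field cache. The sketch below is only an illustration
under that assumption; DocIdToRowKey is a hypothetical name, docids are
assumed to be dense and assigned in insertion order, and nothing here
reflects what the attached patch actually does.

```java
import java.util.Arrays;

// Hypothetical array-backed docid -> row key map (field-cache style).
final class DocIdToRowKey {
    private byte[][] rowKeys = new byte[16][];
    private int size = 0;

    // Stores the row key for the next docid and returns that docid.
    int add(byte[] rowKey) {
        if (size == rowKeys.length) {
            rowKeys = Arrays.copyOf(rowKeys, size * 2); // grow geometrically
        }
        rowKeys[size] = rowKey.clone(); // defensive copy of the caller's array
        return size++;
    }

    // Looks up the row key for a docid in constant time.
    byte[] rowKey(int docId) {
        if (docId < 0 || docId >= size) {
            throw new IndexOutOfBoundsException("docId " + docId);
        }
        return rowKeys[docId].clone();
    }
}
```

A search hit's docid can then be turned into a row key with a single
array read, after which the row itself is fetched from HBase. The
trade-off is that every row key is held on the heap.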

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
