hbase-issues mailing list archives

From "Jason Rutherglen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-3529) Add search to HBase
Date Fri, 13 May 2011 21:26:48 GMT

    [ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033311#comment-13033311 ]

Jason Rutherglen commented on HBASE-3529:
-----------------------------------------

bq. Awesome stuff. These query times above are using the hacky (non-secure non-checksummed)
implementation of HDFS-347?

It's hackier than that.  It's basically obtaining the java.io.File directly from the FSInputStream.  However, it's a good baseline to benchmark against approaches like HADOOP-6311 + HDFS-347.  Those need to wait for an HBase that works with Hadoop 0.22/trunk anyway?

{quote}
User defines some special property on a column family that they want to be searchable, this
property would include a solr schema which specifies analyzers and fields
{quote}

Currently there's a DocumentTransformer class which needs to be implemented to transform column-family edits into a Lucene document.  It could use the Solr schema, for example, or any other separate system to tokenize the byte[]s into a Document.
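To make the transformer idea concrete, here is a minimal sketch of what such a class might look like.  The names and signatures are illustrative only, not the actual HBASE-3529 API, and a stand-in Map plays the role of a Lucene Document so the sketch stays self-contained:

```java
import java.nio.charset.StandardCharsets;
import java.util.*;

// Hypothetical sketch of a DocumentTransformer: turn a column-family edit
// (family/qualifier/value byte[]s) into named, tokenized fields -- the shape
// a Lucene Document would take.  Not the real HBASE-3529 class.
public class TransformerSketch {

    /** Stand-in for org.apache.lucene.document.Document: field name -> tokens. */
    static Map<String, List<String>> transform(byte[] family, byte[] qualifier, byte[] value) {
        String fieldName = new String(family, StandardCharsets.UTF_8)
                + ":" + new String(qualifier, StandardCharsets.UTF_8);
        // A real implementation would delegate tokenization to the analyzer
        // named in the Solr schema; a whitespace split keeps the sketch runnable.
        List<String> tokens = Arrays.asList(
                new String(value, StandardCharsets.UTF_8).toLowerCase().split("\\s+"));
        return Collections.singletonMap(fieldName, tokens);
    }

    public static void main(String[] args) {
        System.out.println(transform(
                "info".getBytes(StandardCharsets.UTF_8),
                "title".getBytes(StandardCharsets.UTF_8),
                "Add Search To HBase".getBytes(StandardCharsets.UTF_8)));
    }
}
```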

{quote}User can now perform an arbitrary lucene search over the table, resulting in completely
up-to-date results? (ie spans both memstore and flushed data)?{quote}

I think for now we need to offer an explicit commit on the index, as Lucene only has near-realtime search (eg, many small segments will be written out, which would overwhelm HDFS).  LUCENE-2312 will implement realtime search (eg, searching on the RAM buffer as it's being built).  The recent LUCENE-3092 could be used in the meantime to build segments in RAM and only flush to HDFS when they consume too much RAM; then we would not need to force the user to 'commit' the index.

To answer the question: yes, though today indexing performance will not be as good as it will be when LUCENE-2312 is implemented, or the user will need to 'commit' the index to search on the latest data.
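The flush policy described above can be sketched roughly as follows.  This is not Lucene's actual API -- the class and method names are invented for illustration -- it just models the idea of buffering documents in RAM and writing a segment to HDFS only when a budget is exceeded, so HDFS isn't overwhelmed by tiny per-commit segments:

```java
import java.util.*;

// Illustrative model (not Lucene's real API) of the LUCENE-3092-style policy:
// buffer indexed docs in RAM, flush a segment only when the buffer fills, and
// let searches span both flushed segments and the in-RAM buffer.
public class RamBufferSketch {
    private final List<String> ramBuffer = new ArrayList<>();
    private final List<List<String>> flushedSegments = new ArrayList<>();
    private final int maxBufferedDocs;  // stand-in for a RAM byte budget

    public RamBufferSketch(int maxBufferedDocs) {
        this.maxBufferedDocs = maxBufferedDocs;
    }

    /** Index a doc; "flush to HDFS" only when the RAM buffer fills up. */
    public void addDocument(String doc) {
        ramBuffer.add(doc);
        if (ramBuffer.size() >= maxBufferedDocs) {
            flushedSegments.add(new ArrayList<>(ramBuffer));  // one segment per flush
            ramBuffer.clear();
        }
    }

    /** The searchable view spans flushed segments plus the RAM buffer. */
    public int searchableDocCount() {
        int n = ramBuffer.size();
        for (List<String> seg : flushedSegments) n += seg.size();
        return n;
    }

    public int flushedSegmentCount() { return flushedSegments.size(); }
}
```

With a buffer of 3 docs, indexing 7 docs produces only 2 flushed segments while all 7 remain searchable, which is the point of deferring the flush.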

Getting all of Solr to work with this system is fairly doable.  Each Solr core would map to a region.  Things like replication would be disabled.  The config files would be stored in HDFS (instead of the local filesystem).  For distributed queries, we need SOLR-1431, and then to implement distributed networking using HBase RPC instead of Solr's HTTP RPC.  There are other smaller internal things that'd need to change in Solr.  I think HBase RPC is aware of where regions live, etc, so I don't think we need to worry about putting failover logic into the distributed search code?
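The coordinator side of that distributed query path reduces to a scatter-gather merge: each region (one Solr core per region) scores its local top-k hits independently, and the caller merges them by score into a global top-k.  A minimal sketch, with invented names, assuming per-region hit lists of (row key, score) pairs -- the transport (HBase RPC vs Solr's HTTP) doesn't change the merge:

```java
import java.util.*;

// Hypothetical scatter-gather merge for distributed search: combine the
// per-region top-k hit lists into a single global top-k, highest score first.
public class ScatterGatherSketch {
    static List<Map.Entry<String, Double>> mergeTopK(
            List<List<Map.Entry<String, Double>>> perRegionHits, int k) {
        // Max-heap on score; in practice k is small so this stays cheap.
        PriorityQueue<Map.Entry<String, Double>> pq =
                new PriorityQueue<>((a, b) -> Double.compare(b.getValue(), a.getValue()));
        for (List<Map.Entry<String, Double>> hits : perRegionHits) pq.addAll(hits);
        List<Map.Entry<String, Double>> merged = new ArrayList<>();
        for (int i = 0; i < k && !pq.isEmpty(); i++) merged.add(pq.poll());
        return merged;
    }
}
```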

I'm going to post additional benchmarks shortly, eg, for 100,000 and 1 million documents.

> Add search to HBase
> -------------------
>
>                 Key: HBASE-3529
>                 URL: https://issues.apache.org/jira/browse/HBASE-3529
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.0
>            Reporter: Jason Rutherglen
>         Attachments: HBASE-3529.patch
>
>
> Using the Apache Lucene library we can add freetext search to HBase.  The advantages
of this are:
> * HBase is highly scalable and distributed
> * HBase is realtime
> * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
> * Lucene offers many types of queries not currently available in HBase (eg, AND, OR,
NOT, phrase, etc)
> * It's easier to build scalable realtime systems on top of an already architecturally sound, scalable realtime data system, eg, HBase.
> * Scaling realtime search will be as simple as scaling HBase.
> Phase 1 - Indexing:
> * Integrate Lucene into HBase such that an index mirrors a given region.  This means
cascading add, update, and deletes between a Lucene index and an HBase region (and vice versa).
> * Define meta-data to mark a region as indexed, and use a Solr schema to allow the user
to define the fields and analyzers.
> * Integrate with the HLog to ensure that index recovery can occur properly (eg, on region
server failure)
> * Mirror region splits with indexes (use Lucene's IndexSplitter?)
> * When a region is written to HDFS, also write the corresponding Lucene index to HDFS.
> * A row key will be the ID of a given Lucene document.  The Lucene docstore will explicitly not be used because the document/row data is stored in HBase.  We will need to work out the best data structure for efficiently mapping a docid -> row key; it could be a docstore, field cache, column stride fields, or some other mechanism.
> * Write unit tests for the above
> Phase 2 - Queries:
> * Enable distributed Lucene queries
> * Regions that have Lucene indexes are inherently available and may be searched on, meaning there's no need for a separate search-related system in Zookeeper.
> * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
