hbase-issues mailing list archives

From "Jason Rutherglen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-3529) Add search to HBase
Date Mon, 18 Apr 2011 02:35:06 GMT

    [ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13020892#comment-13020892 ]

Jason Rutherglen commented on HBASE-3529:
-----------------------------------------

I updated the HBase search branch on GitHub and wrote complete instructions for running the
benchmark, which should also help with examining the code. The HBASE-SEARCH project contains
10,000 bz2-compressed wiki-en documents, which account for 100 MB of the download. The slightly
modified Lucene libraries are in the lib/ directory, so you do not need to download the entire
Lucene branch source.

https://github.com/jasonrutherglen/HBASE-SEARCH/blob/trunk/BENCHMARK.txt 

The Lucene vs. HBase Search indexing and search times are written to the following file
(relative to the HBASE-SEARCH directory):
target/surefire-reports/org.apache.hadoop.hbase.search.TestSearchBenchmark-output.txt

{noformat}
Benchmark Execution Instructions

Create a directory for the HBase Lucene installation.  Then run the following:

git clone git://github.com/jasonrutherglen/HDFS-347-HBASE.git HDFS-347-HBASE
cd HDFS-347-HBASE
ant mvn-install
cd ..

git clone git://github.com/jasonrutherglen/HBASE-SEARCH.git HBASE-SEARCH
cd HBASE-SEARCH
cd lib
./install-libs.sh
cd ..
cd wiki-en
tar -jxvf 10000.bz2
cd ..
mvn test -Dtest=TestSearchBenchmark
{noformat} 

Feel free to let me know if there are problems or if you have questions.

> Add search to HBase
> -------------------
>
>                 Key: HBASE-3529
>                 URL: https://issues.apache.org/jira/browse/HBASE-3529
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.0
>            Reporter: Jason Rutherglen
>         Attachments: HBASE-3529.patch
>
>
> Using the Apache Lucene library we can add freetext search to HBase.  The advantages of this are:
> * HBase is highly scalable and distributed
> * HBase is realtime
> * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
> * Lucene offers many types of queries not currently available in HBase (e.g., AND, OR, NOT, phrase, etc.)
> * It's easier to build scalable realtime systems on top of an already architecturally sound, scalable realtime data system, e.g., HBase.
> * Scaling realtime search will be as simple as scaling HBase.
> Phase 1 - Indexing:
> * Integrate Lucene into HBase such that an index mirrors a given region.  This means cascading adds, updates, and deletes between a Lucene index and an HBase region (and vice versa).
> * Define metadata to mark a region as indexed, and use a Solr schema to allow the user to define the fields and analyzers.
> * Integrate with the HLog to ensure that index recovery can occur properly (e.g., on region server failure)
> * Mirror region splits with indexes (use Lucene's IndexSplitter?)
> * When a region is written to HDFS, also write the corresponding Lucene index to HDFS.
> * A row key will be the ID of a given Lucene document.  The Lucene docstore will explicitly not be used because the document/row data is stored in HBase.  We will need to determine the best data structure for efficiently mapping a docid -> row key.  It could be a docstore, field cache, column stride fields, or some other mechanism.
> * Write unit tests for the above
> Phase 2 - Queries:
> * Enable distributed Lucene queries
> * Regions that have Lucene indexes are inherently available and may be searched, meaning there's no need for a separate search-related system in ZooKeeper.
> * Integrate search with HBase's RPC mechanism
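
The Phase 1 idea of an index that mirrors a region, with a docid -> row key mapping and
no docstore, can be illustrated with a small in-memory sketch. This is only a toy model
under stated assumptions: it uses neither the Lucene nor the HBase API, the class and
method names (IndexedRegion, put, delete, search_and) are hypothetical, and a plain list
stands in for whichever docid -> row key structure is eventually chosen.

```python
from collections import defaultdict


class IndexedRegion:
    """Toy stand-in for an HBase region with a mirrored inverted index.

    Hypothetical illustration only; not Lucene or HBase code.
    """

    def __init__(self):
        self.rows = {}                    # row key -> text (the "region" data)
        self.postings = defaultdict(set)  # term -> set of docids (the "index")
        self.docid_to_row = []            # docid -> row key, append-only
        self.row_to_docid = {}            # row key -> currently live docid
        self.deleted = set()              # tombstoned docids

    def put(self, row_key, text):
        # An update is modelled as delete-then-add: the old docid is
        # tombstoned and the row receives a fresh docid.
        if row_key in self.row_to_docid:
            self.delete(row_key)
        docid = len(self.docid_to_row)
        self.docid_to_row.append(row_key)
        self.row_to_docid[row_key] = docid
        self.rows[row_key] = text
        for term in set(text.lower().split()):
            self.postings[term].add(docid)

    def delete(self, row_key):
        # Cascade the region delete into the index by tombstoning the docid.
        docid = self.row_to_docid.pop(row_key, None)
        if docid is not None:
            self.deleted.add(docid)
            del self.rows[row_key]

    def search_and(self, *terms):
        # AND query: intersect posting sets, drop tombstones, and return
        # row keys only; the row data itself is read from the region.
        sets = [self.postings.get(t.lower(), set()) for t in terms]
        if not sets:
            return []
        hits = set.intersection(*sets) - self.deleted
        return [self.docid_to_row[d] for d in sorted(hits)]


region = IndexedRegion()
region.put("row1", "HBase is realtime")
region.put("row2", "Lucene is a fast inverted index")
region.put("row1", "HBase is scalable")  # update: old docid is tombstoned
print(region.search_and("hbase", "scalable"))
```

Because a search returns row keys rather than stored documents, the row contents are
fetched from the region, which is why the description says the docstore is not needed.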

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
