hbase-user mailing list archives

From Michael Stack <st...@duboce.net>
Subject Re: Newbie: best practice for building sharded SOLR indexes
Date Sun, 07 Dec 2008 09:45:28 GMT
tim robertson wrote:
> Can someone please help me with the best way to build up SOLR indexes
> from data held in HBase that will be too large to sit on a single
> machine (100s of millions of rows)?
> I am assuming that on a 20-node Hadoop cluster I should build a 20-shard
> index and use SOLR's distributed search?
You could do your sharding by server, but what happens if an hbase node 
crashes during your indexing job?  The regions that were on server 20 
will be redistributed among the remaining 19.  If server 20 comes back, 
rebalancing may assign it regions other than its original ones.

The natural 'unit' in hbase is the region, so you might shard by region. 
There are table input formats that split tables by region; they could 
serve as input to your mapreduce indexing job.  See our mapred 
package -- it includes an example mapreduce job that builds a full-text 
index of a table's contents.
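To make the region-per-shard idea concrete, here is a small illustrative sketch (plain Python, not the HBase API; the region start keys and row keys are made up) of how every row key maps to exactly one region, and hence one shard, independent of which server is hosting that region at index time:

```python
import bisect

# Hypothetical regions, defined by their sorted start keys.  Each
# region would become one input split and one index shard.
region_starts = [b"", b"row200", b"row400", b"row600"]

def shard_for_row(row_key: bytes) -> int:
    """Return the index of the region (and hence shard) holding row_key."""
    # The last region whose start key is <= row_key holds the row.
    return bisect.bisect_right(region_starts, row_key) - 1

# A row always lands in the same region-aligned shard, regardless of
# which server the region lives on when the job runs.
assert shard_for_row(b"row123") == 0
assert shard_for_row(b"row400") == 2
assert shard_for_row(b"row999") == 3
```

This is why region-based sharding survives a region server crash: the region boundaries, not the server assignments, define the shards.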

If you wanted to do it by server, study the TableInputFormat and 
organize the splits by region server address.
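If you did go the per-server route, the grouping step might look something like the following sketch (illustrative only; the (server, region) pairs and names are hypothetical, not what TableInputFormat actually emits):

```python
from collections import defaultdict

# Hypothetical (region server address, region name) pairs, as a
# table-splitting input format might report them at job setup time.
splits = [
    ("server1:60020", "region-a"),
    ("server2:60020", "region-b"),
    ("server1:60020", "region-c"),
]

def group_by_server(splits):
    """Collect region splits into one group (shard) per hosting server."""
    shards = defaultdict(list)
    for server, region in splits:
        shards[server].append(region)
    return dict(shards)

shards = group_by_server(splits)
assert shards["server1:60020"] == ["region-a", "region-c"]
assert shards["server2:60020"] == ["region-b"]
```

Note the fragility discussed above: this grouping is only valid as of the moment it is computed, since regions can move to other servers mid-job.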

Will your hbase instance be changing while the index job runs?

How do you make a SOLR shard?  Is it a special lucene index format with 
required fields or does SOLR not care and will serve any lucene index?

> What is the best way to build each shard please? - use HBase as input
> source to Map reduce and push into the local node index in a
> Map/Reduce operation?
Would katta help, http://katta.wiki.sourceforge.net/?  Invoke it after 
your MR indexing job finishes to push the shards out to the serving 
nodes' local disks?

