hbase-user mailing list archives

From "tim robertson" <timrobertson...@gmail.com>
Subject Re: Newbie: best practice for building sharded SOLR indexes
Date Sun, 07 Dec 2008 10:15:02 GMT
Thanks, Michael, for your reply and help.

> You could do your sharding by server but what happens if an hbase node
> crashes during your indexing job?  The regions that were on server 20 will
> be distributed among the remaining 19.  If server 20 comes back, balancing may
> put regions other than the originals on that server.

OK, understood.  I presume, then, that for deployment it would make sense to
keep the HDFS machines separate from the machines serving the Lucene
indexes?  (Right now I am not worried about capacity; I will worry about
replicated indexes for hits-per-second later, which I presume is fairly
easy to add with more hardware...)

> Natural 'unit' in hbase is the region.  You might shard by region.   If so,
> there are table input formats that split tables by region.  Could serve as
> input to your mapreduce indexing job.  See in our mapred package.  There is
> a mapreduce job that makes a full-text index of a table's contents as an
> example.
> If you wanted to do it by server, study the TableInputFormat and organize
> splits by region address.

I've just read up on what a region is, and this sounds like a good
starting point for a shard strategy.  I'll get some tests running with
TableInputFormat and look at the code behind it.
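If I understand regions correctly, each one is a contiguous row-key range, so routing a row to its shard is just a range lookup over the region start keys.  A rough sketch of the idea in Python (the region start keys and row keys below are made up for illustration, not taken from a real table):

```python
import bisect

# Hypothetical region start keys: each region covers [start, next_start).
# One index shard per region.
region_starts = ["", "row_d", "row_m", "row_t"]  # 4 regions -> 4 shards

def shard_for_row(row_key):
    """Return the index of the region (and thus shard) holding row_key."""
    return bisect.bisect_right(region_starts, row_key) - 1

# Group some sample rows by the shard they would index into.
shards = {}
for row in ["row_a", "row_f", "row_p", "row_z"]:
    shards.setdefault(shard_for_row(row), []).append(row)

print(shards)  # each row lands in the shard owning its key range
```

The appeal of region-based sharding (versus server-based) is exactly the point made above: the key ranges stay stable even when regions move between servers after a crash.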

> Will your hbase instance be changing while the index job runs?

Not intentionally...

> How do you make a SOLR shard?  Is it a special lucene index format with
> required fields or does SOLR not care and will serve any lucene index?

Good questions, and they highlight my newness to this, including Lucene!
So far I have generated my SOLR indexes from a big tab-delimited file
into a single index, which proved too big for one machine.  SOLR does
not manage shards during writing; you must do the sharding yourself, so
I just split my tab file in two and loaded one half into each shard.  I
was under the impression that Lucene could not do structured searches (a
column value between 10 and 20 and a date after 01/01/2008, that kind of
thing), hence going straight to SOLR.  I will dig into both more and
find the answers to these questions... too many technologies to learn.
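For the file split itself, hashing the record key keeps the assignment deterministic, so a re-run puts each record in the same shard.  A rough sketch of how I did the two-way split (the two-shard count and the first-column-as-key layout are specific to my own data):

```python
import hashlib

NUM_SHARDS = 2

def shard_of(key):
    # Stable hash: Python's builtin hash() is salted per process,
    # so use md5 to make shard assignment reproducible across runs.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % NUM_SHARDS

def split_lines(lines):
    """Partition tab-separated records into NUM_SHARDS buckets by first column."""
    buckets = [[] for _ in range(NUM_SHARDS)]
    for line in lines:
        key = line.split("\t", 1)[0]
        buckets[shard_of(key)].append(line)
    return buckets
```

Each bucket then gets loaded into its own SOLR instance; queries fan out across both.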

> Would katta help, http://katta.wiki.sourceforge.net/?  Invoke it after your
> MR indexing job finishes to push the shards out to serving local disks?

Thanks for the pointer - I will look into it.

Thanks again,

