hbase-dev mailing list archives

From stack <st...@duboce.net>
Subject Re: Solr on Hbase
Date Fri, 15 May 2009 20:18:53 GMT
Thanks for the offer Jay.

Here is a list of the current 0.20.0 outstanding bugs: http://tinyurl.com/cgmarz

Are you any good at webapps?  If so, check out HBASE-1395.

HBASE-1192 is about adapting a cache patch to leverage the experience of the
SOLR folks (some SOLR lads pointed us at their work in this area).

Otherwise, if you're good at Lucene, etc., check out the table indexing job
under the mapreduce package.  It works, but you might have suggestions on how
to improve it; e.g., have it produce indices that could be put under a SOLR
cluster?
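
Roughly what that job does, sketched here as a plain scan into a local Lucene
index rather than the actual MapReduce job -- this is from memory against the
0.20 client API and Lucene 2.x, and the table, family, and field names are
made up:

  // Rough sketch only: scan an HBase table and feed each row into a local
  // Lucene index.  The real job in the mapreduce package distributes this.
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.FSDirectory;

  public class IndexTableSketch {
    public static void main(String[] args) throws Exception {
      HBaseConfiguration conf = new HBaseConfiguration();
      HTable table = new HTable(conf, "mytable");           // made-up table
      IndexWriter writer = new IndexWriter(
          FSDirectory.getDirectory("/tmp/mytable-index"),
          new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
      Scan scan = new Scan();
      scan.addFamily(Bytes.toBytes("content"));             // made-up family
      ResultScanner scanner = table.getScanner(scan);
      for (Result r : scanner) {
        Document doc = new Document();
        // Store the row key so hits can be mapped back to the table.
        doc.add(new Field("rowkey", Bytes.toString(r.getRow()),
            Field.Store.YES, Field.Index.NOT_ANALYZED));
        byte[] body = r.getValue(Bytes.toBytes("content"), Bytes.toBytes("body"));
        if (body != null) {
          doc.add(new Field("body", Bytes.toString(body),
              Field.Store.NO, Field.Index.ANALYZED));
        }
        writer.addDocument(doc);
      }
      scanner.close();
      writer.optimize();
      writer.close();
    }
  }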


The below sounds sweet.  The SOLR index would be client-side?  How would it
scale?

St.Ack



On Fri, May 15, 2009 at 12:50 PM, Jay Booth <jaybooth@gmail.com> wrote:

> Hey guys, I have a lot of experience with Lucene and Solr (not much of an
> emailer though) and was planning on spending the weekend doing a code-binge
> and contributing something to HBase so I can put it on my resume.  Any
> suggestions for things you're really trying to get out for .20 and could
> use some help on would be appreciated.  I also had the following idea for
> running Solr on Hadoop:
>
> - Initially entirely client-side, with potentially big chunks moved over to
> the cluster side in an hbase-solr.jar later for efficiency
> - Client maintains a mapping of schema names to Solr schema.xml
> - On first load of a schema, creates a main table with rowkeys and a bunch
> of secondary tables for secondary indices, tokenizing as appropriate based
> on the config
> - Client accepts update, delete, query and "edit" requests
>   - The first three are handled just like they are now in Solr, although
> update (delete/re-insert all columns for a row) will likely be pretty
> inefficient on HBase's architecture, hence the introduction of "edit" to
> reduce row bloat in HBase
>   - Queries are automatically handled via multiple queries to HBase,
> pulling sets of rowids and reducing to only those that hit, then pulling
> the actual documents of all hits (a rough sketch follows below).  This will
> be pretty inefficient compared to running inside the cluster, but I'm
> trying to limit what I take on in the first pass.
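>
> Roughly the query path I have in mind, as a hypothetical sketch against the
> 0.20 client API.  The table layout, the "ids" family, and the "_idx_"
> naming convention are all made up for illustration:
>
> // Hypothetical sketch of the client-side query path: one Get per (field,
> // term) against its secondary index table, intersect the rowid sets, then
> // fetch the matching documents from the main table.
> import java.util.*;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.client.Get;
> import org.apache.hadoop.hbase.client.HTable;
> import org.apache.hadoop.hbase.client.Result;
> import org.apache.hadoop.hbase.util.Bytes;
>
> public class QuerySketch {
>   // Assumed layout: index table "<schema>_idx_<field>", row key = token,
>   // column qualifiers under family "ids" = matching main-table row keys.
>   static final byte[] IDS = Bytes.toBytes("ids");
>
>   public static List<Result> query(HBaseConfiguration conf, String schema,
>       Map<String, String> fieldToTerm) throws Exception {
>     Set<String> hits = null;
>     for (Map.Entry<String, String> e : fieldToTerm.entrySet()) {
>       HTable idx = new HTable(conf, schema + "_idx_" + e.getKey());
>       Result r = idx.get(new Get(Bytes.toBytes(e.getValue())));
>       Set<String> rowids = new HashSet<String>();
>       Map<byte[], byte[]> fam = r.getFamilyMap(IDS);
>       if (fam != null) {
>         for (byte[] qualifier : fam.keySet()) {
>           rowids.add(Bytes.toString(qualifier));
>         }
>       }
>       // Reduce to only those rowids that hit on every term.
>       if (hits == null) { hits = rowids; } else { hits.retainAll(rowids); }
>     }
>     // Pull the actual documents of all hits from the main table.
>     List<Result> docs = new ArrayList<Result>();
>     HTable main = new HTable(conf, schema);
>     if (hits != null) {
>       for (String rowid : hits) {
>         docs.add(main.get(new Get(Bytes.toBytes(rowid))));
>       }
>     }
>     return docs;
>   }
> }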
>
> I'll probably only get a small subset of the functionality defined in a
> Solr schema.xml done over the weekend, but I wanted to bounce the idea out
> there.  Is this something the community's interested in?  There are lots of
> Solr users out there; if they could seamlessly switch to this and I don't
> create massive performance problems, it seems like it would be useful.
>
> Otherwise, what are open bugs that could particularly use attention right
> now?
>
