hbase-user mailing list archives
From Drew Dahlke <drew.dah...@bronto.com>
Subject Re: Hbase - Solr Integration
Date Fri, 30 Sep 2011 14:17:50 GMT
Hi David,

I did a little proof of concept a few weeks ago indexing hundreds of
millions of rows from hbase in solr using the near real time stuff in
solr's trunk.

You *could* write map reduce jobs against hbase to generate lucene
indexes on a periodic basis if you want, but that's not going to be
real time in the least. If that interests you, take a peek at the
source code for Katta.

Like you, I wanted updates to be indexed in near real time. At the
time of writing, they haven't made a point release of Solr that
includes the near real time code that came out of twitter. It's been
merged into trunk and is actually quite stable. Check out trunk,
compile it, and then configure the near real time stuff. They've
introduced the concept of 'soft commits' which make new documents
available to the index in near real time without all the overhead of
flushing to disk (hard commit). In my case, I set it to automatically
soft commit once a second and hard commit once an hour.
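
As a rough sketch of that commit policy: the Python below renders the kind of fragment that would go in solrconfig.xml. The element names (autoSoftCommit, autoCommit, maxTime in milliseconds) are what I'd expect from Solr's update handler config and should be checked against the build you compile; the helper name commit_config is made up for illustration.

```python
import xml.etree.ElementTree as ET


def commit_config(soft_commit_seconds: int, hard_commit_seconds: int) -> str:
    """Render an updateHandler fragment with soft and hard auto-commit intervals.

    Assumption: Solr reads <autoSoftCommit> and <autoCommit> blocks, each with
    a <maxTime> given in milliseconds. Verify against your Solr version.
    """
    handler = ET.Element("updateHandler", {"class": "solr.DirectUpdateHandler2"})
    intervals = (
        ("autoSoftCommit", soft_commit_seconds),  # makes new docs searchable
        ("autoCommit", hard_commit_seconds),      # flushes the index to disk
    )
    for tag, seconds in intervals:
        max_time = ET.SubElement(ET.SubElement(handler, tag), "maxTime")
        max_time.text = str(seconds * 1000)
    return ET.tostring(handler, encoding="unicode")


# Soft commit once a second, hard commit once an hour.
fragment = commit_config(soft_commit_seconds=1, hard_commit_seconds=3600)
```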

There's nothing hbase specific about my test. I just added some code
to CC solr on writes I do to hbase using solr's rest api.
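
To make the "CC solr on writes" idea concrete, here is a minimal Python sketch, not my actual test code. The put_to_hbase and post_to_solr callables are hypothetical stand-ins for whatever HBase client and HTTP client you use; the JSON body follows the {"add": {"doc": ...}} shape of Solr's JSON update format, which is worth confirming for your version.

```python
import json
from typing import Callable, Dict


def to_solr_add_command(row_key: str, columns: Dict[str, str]) -> bytes:
    """Encode one HBase row as a Solr JSON 'add' command.

    Assumption: the Solr schema uses 'id' as its unique key field.
    """
    doc = dict(columns)
    doc["id"] = row_key  # the row key always wins over a column named 'id'
    return json.dumps({"add": {"doc": doc}}, sort_keys=True).encode("utf-8")


def write_row(
    row_key: str,
    columns: Dict[str, str],
    put_to_hbase: Callable[[str, Dict[str, str]], None],
    post_to_solr: Callable[[bytes], None],
) -> None:
    """Write to HBase first, then send the same row to Solr.

    Both callables are hypothetical hooks supplied by the caller. If the Solr
    post fails after the HBase put succeeds, the two stores disagree until
    the row is re-sent, so real code needs a retry or replay path.
    """
    put_to_hbase(row_key, columns)
    post_to_solr(to_solr_add_command(row_key, columns))
```

In practice post_to_solr would POST the bytes to the core's update URL with a JSON content type; it is left as a hook here so the sketch runs without a cluster.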

Each document in my test was quite small <1k. I had 1 ec2 large
instance running solr and a hbase row scanner iterating over a table
posting documents to solr as fast as it could. When the index was
small, the indexing speed was a jaw-dropping 3500 document
additions/sec. As the index grew to ~50 million it had tapered off to
800/sec. The key to keeping things fast is to keep individual indexes
small. Solr's answer to this is running multiple 'cores'. It's
basically a rest api for sharding your solr index. Maybe you shard it
1 core per customer? When querying you can specify multiple cores to
execute that query against, run multiple cores on a machine, etc.
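
A sketch of the one-core-per-customer idea, assuming Solr's distributed search "shards" request parameter (a comma-separated list of host:port/solr/core entries without the http:// scheme). The customer_<id> core naming scheme and both function names are invented for illustration.

```python
from typing import List
from urllib.parse import urlencode


def core_name(customer_id: str) -> str:
    """Map a customer to its own core (hypothetical naming scheme)."""
    return f"customer_{customer_id}"


def sharded_query_url(host: str, customer_ids: List[str], query: str) -> str:
    """Build a select URL that fans one query out over several cores.

    The request goes to the first core, which (under the assumption above)
    forwards it to every entry in 'shards' and merges the results.
    """
    if not customer_ids:
        raise ValueError("need at least one customer to query")
    cores = [core_name(c) for c in customer_ids]
    shards = ",".join(f"{host}/solr/{core}" for core in cores)
    params = urlencode({"q": query, "shards": shards})
    return f"http://{host}/solr/{cores[0]}/select?{params}"


url = sharded_query_url("localhost:8983", ["a", "b"], "title:hello")
```

Keeping the routing in one small function like core_name means the sharding scheme can change later without touching the query code.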

I realize sharding solr to match the scalability of a distributed
database probably doesn't sound very magical. It's a lot of legwork &
that's exactly what's motivating projects like Elastic Search &
Lucandra. I experimented with both and sadly those experiments went
poorly compared to traditional solr.

Hope that helps,

On Thu, Sep 29, 2011 at 6:37 PM, Andrew Hu <andrewhuzz@live.com> wrote:
> Hi David,
> I am currently working with an HBase table with 100 columns. My requirement is
> to perform real-time search on HBase using rowkeys and these columns
> (all within 1 family in the schema). A typical query would be SQL-like,
> with AND, OR, and NOT operators over these columns. I have ruled out batch processing, such
> as Hive. My question is which of these to choose:
> - HBase + Solr will probably give you
> better query speed, but you need to maintain both clusters, push
> data from HBase to Solr, and perhaps update the Solr index pretty frequently.
> - Using HBase only, where search needs to be
> against all of these columns, you need to build secondary indexes
> for each column (if the master table is 1 million rows, you will
> end up with 100 million index rows + the 1 million rows of the original master table,
> which will use quite a lot of space), but I suppose search can be done
> pretty fast as well?
> Not sure what the best approach is; any suggestions?
> Thanks
> -Andrew
>> From: buttler1@llnl.gov
>> To: user@hbase.apache.org
>> Date: Thu, 29 Sep 2011 08:38:12 -0700
>> Subject: RE: Hbase - Solr Integration
>> It sounds like you should investigate the Lily Project.  They have already done
>> a lot of work to integrate Solr and HBase into a single solution.  I did something similar
>> before they released their project -- I like my use of dynamic schemas, but their overall
>> approach is probably more solid.  In particular they have given careful consideration as
>> to what to do with large objects, and how to integrate them into the system.  And most importantly,
>> their project is open.
>> There was also some talk earlier of integrating HBase and Solr -- you might want
>> to search the list for some of Jason's posts.  I think that is a work in progress still.
>> Otherwise you will have to roll your own solution.  It is actually not too difficult
>> to set up a system to publish HBase contents to Solr.  The difficulty is in maintaining a
>> consistent view of the data between the two.  I believe Lily uses queues to keep updates
>> in sync.  If you can tolerate some delay, you could simply update your indexes on a regular
>> basis, or set up your application to populate HBase and Solr simultaneously.  The biggest
>> challenge is resharding.  HBase will automatically split regions when they become too large.
>> Solr doesn't have that capability yet, so you will have to manage the shards yourself.
>> Another approach is to look at Elastic Search. That is a Lucene-based system that
>> does do automatic sharding.
>> Direct search on HBase requires a clever key encoding (like OpenTSDB) and/or
>> multiple copies of the data to imitate secondary indexes.
>> Dave
>> -----Original Message-----
>> From: Stuti Awasthi [mailto:stutiawasthi@hcl.com]
>> Sent: Thursday, September 29, 2011 2:52 AM
>> To: user@hbase.apache.org
>> Subject: Hbase - Solr Integration
>> Hi Friends,
>> I am storing my data in Hbase. I want to do search using Solr. I can't find much
>> documentation about the integration. Is there any documentation on integrating these two?
>> Please Suggest
>> Regards,
>> Stuti Awasthi
