hbase-user mailing list archives

From TuX RaceR <tuxrace...@gmail.com>
Subject Re: IHBase indexes persistence
Date Sun, 21 Mar 2010 10:08:00 GMT
Thank you Ryan for your answer.
I really like Solr, but to me it does not scale the way HBase scales. 
Solr 1.4 ships with index replication, which is very nice and easy to 
use, but from a scaling point of view you are still limited by, for 
instance, the disk size of a single node. Then there are shards: I'll 
have another look at Katta, but the Katta-Solr integration Jira 
(http://issues.apache.org/jira/browse/SOLR-1395) mentions rather long 
search times: "The KattaClientTest test case shows a Katta cluster 
being created locally, a couple of cores/shards being placed into the 
cluster, then a query being executed that returns the correct number 
of results. It takes about 30s - 1.5min to run".
And yes, Google does seem to have a dedicated index structure 
(http://infolab.stanford.edu/~backrub/google.html). I looked at Nutch, 
which sounds like a direct open-source implementation of Google 
search, but I do not yet understand how to extract the distributed 
indexing part from the rest of the project (that is the part I am 
really interested in, since I do not have to crawl the web).


Ryan Rawson wrote:
> Hey guys,
> I hate to ruin it for you, but Google search does not use Bigtable at
> query time.  If you would like an example of a good, robust search
> and indexing system, you could have a look at the Lucene library, the
> Solr system built on Lucene, and Katta, which is another system built
> on Lucene.
> -ryan
> On Sat, Mar 20, 2010 at 3:13 PM, TuX RaceR <tuxracer69@gmail.com> wrote:
>> Hello Hbase user List!
>> The feature provided by IHBase is very appealing. It seems to correspond to
>> a use case that is very common in applications (at least in mine ;) )
>> Dan Washusen wrote:
>>> Not at the moment.  It currently keeps a copy of each unique indexed
>>> value and each row key in memory...
>> Is there a more robust indexing on the roadmap?
>> HBase, if I understand correctly, is an open-source version of Google's
>> Bigtable.
>> To me the most striking difference between HBase and Bigtable is in
>> narrowing searches; the example below shows what I mean by narrowing:
>> If in Google you search for the word
>> hbase:
>> (i.e using:
>> http://www.google.com/search?q=hbase
>> )
>> you get a fast answer
>> (typically: Results *1* - *10* of about *249,000* for *hbase*. (*0.17*
>> seconds))
>> Now if you search for all pages coming from the hadoop.apache.org host name
>> (or base URL), that is, with the query:
>> hbase +site:hadoop.apache.org
>> (i.e using the URL:
>> http://www.google.com/search?q=hbase+%2Bsite%3Ahadoop.apache.org
>> )
>> you get a pretty fast answer to:
>> (typically: Results *1* - *10* of about *2,510* from *hadoop.apache.org* for
>> *hbase*. (*0.12* seconds) )
>> It seems to me that the second search uses a secondary index on a column
>> named 'site' to narrow the scan of the 'hbase'-based keys. Obviously Google
>> found a good way to implement this (good = fast and scalable).
>> Is this second level of Google indexing documented somewhere? Is it
>> implemented using something like IHBase, something like THBase, or something
>> else?
>> Also, why does IHBase stay in the 'contrib' tree? Is that because the code
>> is not at the same level as the main HBase code (not as tested, not as
>> robust, etc.)?
>> Thanks
>> TuX
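
The narrowing discussed in the quoted message (a term query further restricted by a 'site' value) can be sketched with a composite, lexicographically sorted index, in the spirit of how a Bigtable-style store keeps row keys sorted. This is only an illustrative guess at the idea, not Google's or IHBase's actual implementation; the class and key layout below are invented for the example:

```python
# Hypothetical sketch of term + site narrowing via a composite sorted key.
# Both the broad query and the site-restricted one become prefix scans.
import bisect

class SecondaryIndex:
    def __init__(self):
        self._keys = []  # sorted list of (term, site, doc_id) tuples

    def add(self, term, site, doc_id):
        # Keep keys in sorted order, like row keys in a Bigtable-style store.
        bisect.insort(self._keys, (term, site, doc_id))

    def search(self, term, site=None):
        # Seek to the first key matching the prefix, then scan forward.
        # (term,) alone returns every match; (term, site) narrows to one host
        # without scanning the rest of the index.
        lo = (term, site or "", "")
        results = []
        i = bisect.bisect_left(self._keys, lo)
        while i < len(self._keys):
            t, s, d = self._keys[i]
            if t != term or (site is not None and s != site):
                break
            results.append(d)
            i += 1
        return results
```

Because the keys sort as (term, site, doc_id), the narrowed query seeks directly to the (term, site) prefix and touches only matching entries, which is what would keep both queries fast independently of total index size.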
