hbase-user mailing list archives

From TuX RaceR <tuxrace...@gmail.com>
Subject Re: IHBase indexes persistence
Date Sun, 21 Mar 2010 10:08:00 GMT
Thank you Ryan for your answer.
I really like Solr, but to me it does not scale the way HBase does.
Solr 1.4 ships with index replication: that is very nice and easy to
use, but since each replica holds a full copy of the index, you are
limited, for instance, by the disk size of a single machine. Then you
have shards: I'll have another look at Katta, but the Katta-Solr
integration Jira issue http://issues.apache.org/jira/browse/SOLR-1395
mentions rather long search times: "The
KattaClientTest test case shows a Katta cluster being created locally, a 
couple of cores/shards being placed into the cluster, then a query being 
executed that returns the correct number of results. It takes about 30s 
- 1.5min to run".
And yes, Google seems (http://infolab.stanford.edu/~backrub/google.html)
to have a dedicated index structure. I looked at Nutch, which sounds like
a direct open-source implementation of Google search, but I do not yet
understand how to extract the distributed indexing part from the whole
project (this is the part I am really interested in, as I do not
have to crawl the web).
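
As a rough illustration of the sharded setup mentioned above, here is a minimal sketch of a document-partitioned index with a scatter-gather query. It is not taken from Solr, Katta, or Nutch: the names `NUM_SHARDS`, `shard_for`, `index_document`, and `search` are hypothetical, the shards are plain in-process dictionaries standing in for separate machines, and the score is just a raw term count.

```python
import heapq
import zlib
from collections import Counter

NUM_SHARDS = 3  # hypothetical cluster size; each shard would be its own node

# One toy index per shard: term -> {doc_id: term count in that document}.
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(doc_id):
    """Pick a shard from a stable hash of the document id."""
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_SHARDS


def index_document(doc_id, text):
    """Index a document on exactly one shard (document partitioning)."""
    shard = shards[shard_for(doc_id)]
    for term, count in Counter(text.lower().split()).items():
        shard.setdefault(term, {})[doc_id] = count


def search(term, k=10):
    """Scatter the query to every shard, then merge the per-shard top-k."""
    partials = []
    for shard in shards:
        postings = shard.get(term.lower(), {})
        # Each shard only returns its own k best hits.
        partials.extend(heapq.nlargest(k, postings.items(), key=lambda kv: kv[1]))
    # Gather step: global top-k, ties broken by doc id for determinism.
    return heapq.nlargest(k, partials, key=lambda kv: (kv[1], kv[0]))


index_document("a", "hbase hbase index")
index_document("b", "hbase")
index_document("c", "solr")
print(search("hbase"))
```

In this layout no single shard has to hold the whole index, which is the property that replication alone does not give you. The price is that every query touches every shard.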

Thanks
TuX


Ryan Rawson wrote:
> Hey guys,
>
> I hate to ruin it for you, but Google search does not use Bigtable at
> query time.  If you would like an example of a good, robust search
> and indexing system, you could have a look at the Lucene library, the Solr
> system built on Lucene, and Katta, which is another system built on
> Lucene.
>
> -ryan
>
> On Sat, Mar 20, 2010 at 3:13 PM, TuX RaceR <tuxracer69@gmail.com> wrote:
>   
>> Hello Hbase user List!
>>
>> The feature provided by IHbase is very appealing. It seems to correspond to
>> a use case very common in applications (at least in mine ;) )
>>
>> Dan Washusen wrote:
>>     
>>> Not at the moment.  It currently keeps a copy of each unique indexed
>>> value and each row key in memory...
>>>
>>>       
>> Is more robust indexing on the roadmap?
>> HBase, if I understand correctly, is an open-source version of Google
>> Bigtable.
>> To me the most striking difference between HBase and Bigtable is in
>> narrowing searches; the example below shows what I mean by narrowing:
>>
>> If in Google you search for the word
>>
>> hbase:
>>
>> (i.e using:
>> http://www.google.com/search?q=hbase
>> )
>> you get a fast answer
>> (typically: Results *1* - *10* of about *249,000* for *hbase*. (*0.17*
>> seconds))
>>
>> Now if you search all pages coming from the hadoop.apache.org host name (or
>> base URL), that is with the query:
>>
>> hbase +site:hadoop.apache.org
>>
>> (i.e using the URL:
>> http://www.google.com/search?q=hbase+%2Bsite%3Ahadoop.apache.org
>> )
>> you get a pretty fast answer to:
>> (typically: Results *1* - *10* of about *2,510* from *hadoop.apache.org* for
>> *hbase*. (*0.12* seconds) )
>>
>> It seems to me that the second search uses a secondary index on a column
>> named 'site' to scan the 'hbase' based keys. Obviously Google found a good
>> way to implement this (good = fast and scalable).
>> Is this Google secondary indexing documented somewhere? Is it implemented
>> using something like IHBase, or more like THBase, or something
>> else?
>> Also, why does IHBase stay in the 'contrib' tree? Is that because the code is
>> not at the same level as the main HBase code (not as tested, not as robust,
>> etc.)?
>>
>> Thanks
>> TuX
>>
>>
>>     
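
The narrowing query in the quoted message (hbase +site:hadoop.apache.org) can be pictured with a toy inverted index. Lucene-style engines generally treat a field restriction such as site: as one more term whose posting list is intersected with the keyword's posting list; how Google actually implements it is not documented in this thread. The sketch below assumes that model. The class `TinyInvertedIndex`, its methods, and the sample documents are all hypothetical.

```python
from collections import defaultdict


class TinyInvertedIndex:
    """Toy in-memory inverted index: (field, value) -> set of doc ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text, site):
        # Every word of the body becomes a term in the 'text' field.
        for word in text.lower().split():
            self.postings[("text", word)].add(doc_id)
        # The host name is indexed as a single term in the 'site' field.
        self.postings[("site", site)].add(doc_id)

    def search(self, word, site=None):
        hits = self.postings.get(("text", word.lower()), set())
        if site is not None:
            # Narrowing is a set intersection of two posting lists.
            hits = hits & self.postings.get(("site", site), set())
        return sorted(hits)


index = TinyInvertedIndex()
index.add(1, "HBase architecture overview", "hadoop.apache.org")
index.add(2, "HBase versus Cassandra", "example.com")
index.add(3, "Hadoop MapReduce tutorial", "hadoop.apache.org")

print(index.search("hbase"))                            # [1, 2]
print(index.search("hbase", site="hadoop.apache.org"))  # [1]
```

Under this model the narrowed query never scans rows keyed by 'hbase'. It only combines two precomputed lists, which is why adding the restriction does not make the search slower.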

