Mailing-List: contact hbase-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hbase-user@hadoop.apache.org
Message-ID: <4BA5F000.9010902@gmail.com>
Date: Sun, 21 Mar 2010 10:08:00 +0000
From: TuX RaceR <tuxracer69@gmail.com>
User-Agent: Mozilla-Thunderbird 2.0.0.22 (X11/20090701)
MIME-Version: 1.0
To: hbase-user@hadoop.apache.org
Subject: Re: IHBase indexes persistence
References: <web-25152171@smtp.ua.md>
	 <7c457ebe1003201412j2f489bddrab74ed284c01b89b@mail.gmail.com>
	 <4BA54878.5030202@gmail.com>
 <78568af11003201523o79d8c172t7fc76ee0cd9e8838@mail.gmail.com>
In-Reply-To: <78568af11003201523o79d8c172t7fc76ee0cd9e8838@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Thank you, Ryan, for your answer.
I really like Solr, but to me it does not scale in the same way HBase
scales. Solr 1.4 ships with index replication: that is very nice and
easy to use, but from the scaling point of view every replica holds the
whole index, so you are, for instance, limited by the disk size of a
single machine. Then you have shards: I'll have another look at Katta,
but the Katta-Solr integration JIRA issue
http://issues.apache.org/jira/browse/SOLR-1395 mentions rather long
search times: "The
KattaClientTest test case shows a Katta cluster being created locally, a
couple of cores/shards being placed into the cluster, then a query being
executed that returns the correct number of results. It takes about 30s
- 1.5min to run".
And yes, Google seems (http://infolab.stanford.edu/~backrub/google.html)
to have a dedicated index structure. I looked at Nutch, which sounds like
a direct open-source implementation of Google search, but I do not yet
understand how to extract the distributed indexing part from the whole
project (this is the part that I am really interested in, as I do not
have to crawl the web).
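
To make the "narrowing" question from my earlier mail concrete, here is
a minimal sketch of what I have in mind. It assumes a Lucene-style
inverted index where the site is simply indexed as one more term, so
that the narrowed query becomes an intersection of two posting lists. It
is only an illustration: I am not claiming this is how Google, Solr or
IHBase implement it, and the names used (InvertedIndex, add, search, the
"site:" prefix, the sample documents) are made up for the example.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy in-memory inverted index: term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text, site):
        # Index every word of the text as a plain term.
        for word in text.lower().split():
            self.postings[word].add(doc_id)
        # Index the host name as a term in its own "site:" namespace
        # (hypothetical convention, chosen for this sketch only).
        self.postings["site:" + site.lower()].add(doc_id)

    def search(self, *terms):
        # AND query: intersect the posting lists, smallest list first.
        lists = sorted(
            (self.postings.get(t.lower(), set()) for t in terms), key=len
        )
        if not lists:
            return set()
        result = set(lists[0])
        for posting in lists[1:]:
            result &= posting
        return result


index = InvertedIndex()
index.add(1, "HBase is the Hadoop database", "hadoop.apache.org")
index.add(2, "Notes on running HBase in production", "example.org")
index.add(3, "Hadoop MapReduce tutorial", "hadoop.apache.org")

print(sorted(index.search("hbase")))                            # [1, 2]
print(sorted(index.search("hbase", "site:hadoop.apache.org")))  # [1]
```

In this toy version everything lives in one process; the part I do not
know how to do well is spreading those posting lists over many machines.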

Thanks
TuX


Ryan Rawson wrote:
> Hey guys,
>
> I hate to ruin it for you, but Google search does not use bigtable at
> the query time.  If you would like an example of a good robust search
> and indexing system, you could have a look at lucene library, the solr
> system build on lucene, and katta which is another system building on
> lucene.
>
> -ryan
>
> On Sat, Mar 20, 2010 at 3:13 PM, TuX RaceR <tuxracer69@gmail.com> wrote:
>   
>> Hello Hbase user List!
>>
>> The feature provided by IHbase is very appealing. It seems to correspond to
>> a use case very common in applications (at least in mine ;) )
>>
>> Dan Washusen wrote:
>>     
>>> Not at the moment.  It currently keeps a copy of each unique indexed
>>> value and each row key in memory...
>>>
>>>       
>> Is there a more robust indexing on the roadmap?
>> HBase if I understand well proposes an opensource version of Google
>> Bigtable.
>> To me the most striking difference between Hbase and Bigtable is for
>> narrowing searches; the example below shows what I mean by narrowing:
>>
>> If in Google you search for the word
>>
>> hbase:
>>
>> (i.e using:
>> http://www.google.com/search?q=hbase
>> )
>> you get a fast answer
>> (typically: Results *1* - *10* of about *249,000* for *hbase*. (*0.17*
>> seconds))
>>
>> Now if you search all pages coming for the hadoop.apache.org host name (or
>> base URL), that is with the query:
>>
>> hbase +site:hadoop.apache.org
>>
>> (i.e using the URL:
>> http://www.google.com/search?q=hbase+%2Bsite%3Ahadoop.apache.org
>> )
>> you get a pretty fast answer to:
>> (typically: Results *1* - *10* of about *2,510* from *hadoop.apache.org* for
>> *hbase*. (*0.12* seconds) )
>>
>> It seems to me that the second search uses a secondary index on a column
>> named 'site' to scan the 'hbase' based keys. Obviously Google found a good
>> way to implement this (good= fast and scalable)
>> Is this Google second indexing documented somewhere? Is that implemented
>> using something like IHbase or more something like THbase, or something
>> else?
>> Also, why IHbase stays in the 'contrib' tree? Is that because the code is
>> not at the same level as the main hbase code (not as tested, not as robust,
>> etc...)?
>>
>> Thanks
>> TuX
>>
>>
>>