From: TuX RaceR
Date: Sun, 21 Mar 2010 10:08:00 +0000
To: hbase-user@hadoop.apache.org
Subject: Re: IHBase indexes persistence

Thank you Ryan for your answer.

I really like Solr, but to me it does not scale the way HBase scales.
Solr 1.4 ships with index replication: that is very nice and easy to
use, but from a scaling point of view you are limited, for instance, by
the disk size.

Then you have shards: I'll have another look at Katta, but the
Katta-Solr integration Jira
(http://issues.apache.org/jira/browse/SOLR-1395) mentions rather long
search times: "The KattaClientTest test case shows a Katta cluster being
created locally, a couple of cores/shards being placed into the cluster,
then a query being executed that returns the correct number of results.
It takes about 30s - 1.5min to run".

And yes, Google seems to have a dedicated index structure
(http://infolab.stanford.edu/~backrub/google.html).
I looked at Nutch, which sounds like a direct open-source implementation
of Google search, but I do not yet understand how to extract the
distributed indexing part from the whole project (this is the part I am
really interested in, as I do not have to crawl the web).

Thanks
TuX

Ryan Rawson wrote:
> Hey guys,
>
> I hate to ruin it for you, but Google search does not use Bigtable at
> query time. If you would like an example of a good, robust search
> and indexing system, you could have a look at the Lucene library, the
> Solr system built on Lucene, and Katta, which is another system built
> on Lucene.
>
> -ryan
>
> On Sat, Mar 20, 2010 at 3:13 PM, TuX RaceR wrote:
>
>> Hello HBase user list!
>>
>> The feature provided by IHBase is very appealing. It seems to
>> correspond to a use case that is very common in applications (at
>> least in mine ;) )
>>
>> Dan Washusen wrote:
>>
>>> Not at the moment. It currently keeps a copy of each unique indexed
>>> value and each row key in memory...
>>>
>> Is there more robust indexing on the roadmap?
>> HBase, if I understand correctly, is an open-source version of Google
>> Bigtable.
>> To me the most striking difference between HBase and Bigtable is in
>> narrowing searches; the example below shows what I mean by narrowing:
>>
>> If in Google you search for the word
>>
>> hbase
>>
>> (i.e. using:
>> http://www.google.com/search?q=hbase
>> )
>> you get a fast answer
>> (typically: Results *1* - *10* of about *249,000* for *hbase*.
>> (*0.17* seconds))
>>
>> Now if you search only the pages coming from the hadoop.apache.org
>> host name (or base URL), that is, with the query:
>>
>> hbase +site:hadoop.apache.org
>>
>> (i.e. using the URL:
>> http://www.google.com/search?q=hbase+%2Bsite%3Ahadoop.apache.org
>> )
>> you get a pretty fast answer too:
>> (typically: Results *1* - *10* of about *2,510* from
>> *hadoop.apache.org* for *hbase*. (*0.12* seconds))
>>
>> It seems to me that the second search uses a secondary index on a
>> column named 'site' to scan the 'hbase'-based keys. Obviously Google
>> found a good way to implement this (good = fast and scalable).
>> Is this Google secondary indexing documented somewhere? Is it
>> implemented using something like IHBase, or more like THBase, or
>> something else?
>> Also, why does IHBase stay in the 'contrib' tree? Is that because the
>> code is not at the same level as the main HBase code (not as tested,
>> not as robust, etc.)?
>>
>> Thanks
>> TuX