Mailing-List: contact hbase-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hbase-user@hadoop.apache.org
MIME-Version: 1.0
In-Reply-To: <4BCC0EF4.40708@gmail.com>
References: <201003251042.17623.thomas@koch.ro> <4BCC0EF4.40708@gmail.com>
Date: Tue, 20 Apr 2010 22:30:39 -0700
Message-ID: <r2icd4bdd931004202230k1b9a288ep30323d099838510a@mail.gmail.com>
Subject: Re: ported lucandra: lucene index on HBase
From: Karthik K <oss.akk@gmail.com>
To: hbase-user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=0016368e2a6796bd150484b87e39

--0016368e2a6796bd150484b87e39
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: quoted-printable

Hi TuX -
   There is a different project that uses HBase as the backing store for a
TF-IDF index, at http://github.com/akkumar/hbasene . It addresses the same
problem, and I am speaking on behalf of that project.


On Mon, Apr 19, 2010 at 1:06 AM, TuX RaceR <tuxracer69@gmail.com> wrote:

> Hi Thomas,
>
> Thanks for sharing your code for lucehbase.
> The schema you used seems the same as the one used in lucandra:
>
> -------------------
> *Documents Ids are currently random and autogenerated.
>
> *Term keys and Document Keys are encoded as follows (using a random binary
> delimiter)
>
>     Term Key                     col name         value
>     "index_name/field/term" => { documentId , position vector }
>
>     Document Key
>     "index_name/documentId" => { fieldName , value }
> --------------------
>
> I have two questions:
> 1) for a given term key, the number of columns can potentially get very
> large. Have you tried another schema where the document id is put in the
> key, i.e.:
>
>     Term Key                          col name         value
>     "index_name/field/term/docid" => { info , position vector }
> That way you get trivial paging in the case where a lot of documents
> contain the term.
>
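
To make the alternative layout in question 1 concrete, here is a rough sketch
of paging over "index_name/field/term/docid" row keys. It is only an
illustration: a java.util.TreeMap stands in for an HBase table sorted by row
key, "/" is used as the delimiter instead of a random binary one, and the
class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical illustration: a sorted map stands in for a table ordered by row key.
final class DocidInKeySketch {
    private final TreeMap<String, String> rows = new TreeMap<>();

    // One row per (term, docid) pair; the value holds the position vector.
    void put(String index, String field, String term, String docId, String positions) {
        rows.put(index + "/" + field + "/" + term + "/" + docId, positions);
    }

    // Return up to pageSize row keys for the term, starting after lastKey
    // (pass null for the first page). This mimics a scanner over a key prefix.
    List<String> page(String index, String field, String term, String lastKey, int pageSize) {
        String prefix = index + "/" + field + "/" + term + "/";
        String from = (lastKey == null) ? prefix : lastKey;
        boolean inclusive = (lastKey == null);
        List<String> out = new ArrayList<>();
        for (String key : rows.tailMap(from, inclusive).keySet()) {
            if (!key.startsWith(prefix) || out.size() == pageSize) {
                break; // left the term's key range, or the page is full
            }
            out.add(key);
        }
        return out;
    }
}
```

Paging is trivial here because each page is a short range scan, but every
(term, docid) pair becomes its own row.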


The document ids are encoded using a compressed bitset in order to scale.
With the docid as part of the key, the number of rows grows as (docids *
unique terms), and that layout does not give the best locality of reference
for unions, intersections, range queries, etc.
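
As a rough sketch of why one bitset per field/term helps (this is not the
hbasene code: java.util.BitSet is uncompressed and only stands in for the
compressed encoding, and the class and method names are made up for
illustration):

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration: one bitset of docids per "field/term" row.
// java.util.BitSet is NOT compressed; it stands in for the real encoding.
final class TermBitsetSketch {
    private final Map<String, BitSet> postings = new HashMap<>();

    // Record that document docId contains the given field/term.
    void add(String fieldTerm, int docId) {
        postings.computeIfAbsent(fieldTerm, k -> new BitSet()).set(docId);
    }

    // Docids containing both terms: one AND over two row values.
    BitSet intersect(String a, String b) {
        BitSet result = copyOf(a);
        result.and(copyOf(b));
        return result;
    }

    // Docids containing either term: one OR over two row values.
    BitSet union(String a, String b) {
        BitSet result = copyOf(a);
        result.or(copyOf(b));
        return result;
    }

    // Copy the stored bitset so queries never mutate the index.
    private BitSet copyOf(String fieldTerm) {
        BitSet stored = postings.get(fieldTerm);
        return stored == null ? new BitSet() : (BitSet) stored.clone();
    }

    public static void main(String[] args) {
        TermBitsetSketch idx = new TermBitsetSketch();
        for (int d : new int[] {1, 2, 7}) idx.add("body/hbase", d);
        for (int d : new int[] {2, 7, 9}) idx.add("body/lucene", d);
        System.out.println(idx.intersect("body/hbase", "body/lucene")); // {2, 7}
        System.out.println(idx.union("body/hbase", "body/lucene"));     // {1, 2, 7, 9}
    }
}
```

Each boolean operation touches two row values rather than scanning one row
per (term, docid) pair.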

The HBase RPC is being modified so that a docid for an already existing
field/term can be appended to the compressed encoding stored under the
family / column name. This keeps the locality of reference and lets the
index scale with the number of documents.


>
> 2) once you get the list of docids, to get the document details (i.e the
> pairs { fieldName , value }), you will trigger a lot of random access
> queries to Hbase (where in 1, with the alternative schema
> "index_name/field/term/docid" you open a scanner and with the schema
> "index_name/field/term" you just get one row). I am wondering how you can
> get fast answers that way. If you have few fields, would it be a good idea
> to store also the values in the index (only the alternative schema
> "index_name/field/term/docid" allows this)?
>

Once the documents go into the index, for all practical purposes the
manipulation is done on numbers assigned to the user-specified id space.
More often than not, the only stored field is the "id", which is retrieved
after all the computation and can then be used to look up the other details
of the search schema in another store. Except in limited cases (sorting,
faceting, etc.), using the tf-idf representation to store a document's
fields goes against the format being used, so it is best done sparingly.
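
A minimal sketch of that two-step lookup, assuming in-memory maps in place of
the index and of the separate detail store (all names here are hypothetical
and only for illustration):

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: the index keeps only the "id" field per document
// number; everything else lives in a separate store keyed by that id.
final class IdLookupSketch {
    private final Map<Integer, String> storedId = new HashMap<>();   // doc number -> stored "id"
    private final Map<String, String> detailStore = new HashMap<>(); // user id -> other details

    void register(int docNumber, String userId, String details) {
        storedId.put(docNumber, userId);
        detailStore.put(userId, details);
    }

    // Step 1: the search yields a bitset of matching doc numbers.
    // Step 2: map each number to its stored id, then fetch details elsewhere.
    List<String> resolve(BitSet hits) {
        List<String> out = new ArrayList<>();
        for (int n = hits.nextSetBit(0); n >= 0; n = hits.nextSetBit(n + 1)) {
            String id = storedId.get(n);
            if (id != null) {
                out.add(id + " -> " + detailStore.get(id));
            }
        }
        return out;
    }
}
```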

There is a low-volume mailing list for discussion of hbasene at
http://groups.google.com/group/hbasene-user ; hop on if you are interested.


> Thanks
> TuX
>
>
>
>
> Thomas Koch wrote:
>
>> Hi,
>>
>> Lucandra stores a lucene index on cassandra:
>>
>> http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend
>>
>> As the author of lucandra writes: "I’m sure something similar could be
>> built on hbase."
>>
>> So here it is:
>> http://github.com/thkoch2001/lucehbase
>>
>> This is only a first prototype which has not been tested on anything real
>> yet. But if you're interested, please join me to get it production ready!
>>
>> I propose to keep this thread on hbase-user and java-dev only.
>> Would it make sense to aim this project to become an hbase contrib? Or a
>> lucene contrib?
>>
>> Best regards,
>>
>> Thomas Koch, http://www.koch.ro
>>
>>
>
>

--0016368e2a6796bd150484b87e39--