Subject: Re: ported lucandra: lucene index on HBase
From: Karthik K
To: hbase-user@hadoop.apache.org
Date: Tue, 20 Apr 2010 22:30:39 -0700

Hi TuX -

There is a separate project that addresses the same problem by using HBase
as the backing store for a TF-IDF index: http://github.com/akkumar/hbasene .
I am answering on behalf of that project.

On Mon, Apr 19, 2010 at 1:06 AM, TuX RaceR wrote:

> Hi Thomas,
>
> Thanks for sharing your code for lucehbase.
> The schema you used seems the same as the one used in lucandra:
>
> -------------------
> *Documents Ids are currently random and autogenerated.
>
> *Term keys and Document Keys are encoded as follows (using a random binary
> delimiter)
>
> Term Key                      col name       value
> "index_name/field/term" => { documentId , position vector }
>
> Document Key
> "index_name/documentId" => { fieldName , value }
> --------------------
>
> I have two questions:
> 1) for a given term key, the number of columns can get potentially very
> large.
> Have you tried another schema where the document id is put in the
> key, i.e.:
>
> Term Key                            col name   value
> "index_name/field/term/docid" => { info , position vector }
>
> That way you get trivial paging in the case where a lot of documents
> contain the term.
>

In hbasene, the documents for a term are encoded as a compressed bitset so
that the index scales. With the docid as part of the key, the number of keys
is on the order of (docids * unique terms), and that layout does not give
good locality of reference for unions, intersections, range queries, etc.
The HBase RPC is being modified so that a docid can be appended to the
compressed encoding stored under the family / column name of an already
existing field/term. That gives locality of reference and scales with the
number of documents.

>
> 2) once you get the list of docids, to get the document details (i.e. the
> pairs { fieldName , value }), you will trigger a lot of random access
> queries to HBase (where in 1, with the alternative schema
> "index_name/field/term/docid" you open a scanner and with the schema
> "index_name/field/term" you just get one row). I am wondering how you can
> get fast answers that way. If you have few fields, would it be a good idea
> to store also the values in the index (only the alternative schema
> "index_name/field/term/docid" allows this)?
>

Once the documents are in the index, for all practical purposes the
computation is done on numbers assigned to the user-specified id space.
More often than not, the only field that is stored is the "id". It is
retrieved after all the computation is done and can then be used as a key
into another store to fetch the other details of the search schema. Except
for limited cases (sorting, faceting, etc.), storing a document's fields in
the TF-IDF representation goes against the format being used, so it should
be done sparingly.

There is a low-volume mailing list for discussing this,
http://groups.google.com/group/hbasene-user , that you can hop on if you are
interested.

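To make the trade-off between the two layouts above concrete, here is a
minimal Python sketch. It is not hbasene or lucehbase code and makes several
assumptions: a sorted list stands in for HBase's sorted row keys, a plain
integer stands in for the compressed bitset, and every name in it (DOCS,
scan_docids, search_all, the "idx/body/..." keys) is made up for
illustration.

```python
from bisect import bisect_left

# Hypothetical toy data: internal doc number -> (user id, terms in field "body").
DOCS = [
    ("doc-a", ["hbase", "lucene"]),
    ("doc-b", ["hbase"]),
    ("doc-c", ["hbase", "lucene"]),
    ("doc-d", ["lucene"]),
]

# Layout A: docid inside the row key, one key per (term, document) pair.
# Doc numbers are zero-padded so string order matches numeric order.
row_keys = sorted(
    f"idx/body/{term}/{number:08d}"
    for number, (_, terms) in enumerate(DOCS)
    for term in terms
)

def scan_docids(keys, prefix, start_after=None, limit=10):
    """Return up to `limit` docids under `prefix`, resuming after `start_after`."""
    # Appending "\x00" gives the smallest key strictly greater than the last one seen.
    start = prefix if start_after is None else f"{prefix}{start_after}\x00"
    i = bisect_left(keys, start)
    page = []
    while i < len(keys) and keys[i].startswith(prefix) and len(page) < limit:
        page.append(keys[i][len(prefix):])
        i += 1
    return page

# Layout B: one entry per term; the value is a bitset of doc numbers.
# Assumption: an uncompressed int stands in for the compressed encoding.
postings = {}
for number, (_, terms) in enumerate(DOCS):
    for term in terms:
        key = f"idx/body/{term}"
        postings[key] = postings.get(key, 0) | (1 << number)

def doc_numbers(bits):
    """List the positions of the set bits, lowest first."""
    return [n for n in range(bits.bit_length()) if (bits >> n) & 1]

def search_all(term_keys):
    """Intersect the bitsets of all terms, then map doc numbers to user ids."""
    if not term_keys:
        return []
    bits = postings.get(term_keys[0], 0)
    for key in term_keys[1:]:
        bits &= postings.get(key, 0)
    # Only the stored "id" comes back; other fields live in another store.
    return [DOCS[n][0] for n in doc_numbers(bits)]

first_page = scan_docids(row_keys, "idx/body/hbase/", limit=2)
second_page = scan_docids(
    row_keys, "idx/body/hbase/", start_after=first_page[-1], limit=2
)
both_terms = search_all(["idx/body/hbase", "idx/body/lucene"])
```

Layout A pages through a term with a range scan but needs one key per
(term, document) pair. Layout B keeps all documents of a term in a single
value, so an AND query is a bitwise intersection, and only the resulting ids
are used to look up details in the other store.
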
> Thanks
> TuX
>
> Thomas Koch wrote:
>
>> Hi,
>>
>> Lucandra stores a lucene index on cassandra:
>>
>> http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend
>>
>> As the author of lucandra writes: "I'm sure something similar could be
>> built on hbase."
>>
>> So here it is:
>> http://github.com/thkoch2001/lucehbase
>>
>> This is only a first prototype which has not been tested on anything real
>> yet. But if you're interested, please join me to get it production ready!
>>
>> I propose to keep this thread on hbase-user and java-dev only.
>> Would it make sense to aim this project to become an hbase contrib? Or a
>> lucene contrib?
>>
>> Best regards,
>>
>> Thomas Koch, http://www.koch.ro