From: Andrey Stepachev
Date: Sun, 9 Jan 2011 00:59:19 +0300
Subject: Re: question about merge-join (or AND operator betwween colums)
To: user@hbase.apache.org

You can read more details on binary sorting here:
http://brunodumon.wordpress.com/2010/02/17/building-indexes-using-hbase-mapping-strings-numbers-and-dates-onto-bytes/

2011/1/8 Jack Levin

> Basic problem described:
>
> user uploads 1 image and creates some text -10 days ago, then creates 1000
> text messages on between 9 days ago and today:
>
> row key    | fm:type --> value
>
> 00days:uid | type:text --> text_id
> .
> .
> 09days:uid | type:text --> text_id
>
> 10days:uid | type:photo --> URL
>            | type:text --> text_id
>
> Skip all the way to 10days:uid row, without reading 00days:id - 09:uid
> rows.
> Ideally we do not want to read all 1000 entries that have _only_ text. We
> want to get to last entry in the most efficient way possible.
>
> -Jack
>
> On Sat, Jan 8, 2011 at 11:43 AM, Stack wrote:
> > Strike that. This is a Scan, so can't do blooms + filter. Sorry.
> > Sounds like a coprocessor then. You'd have your query 'lean' on the
> > column that you know has the lesser items and then per item, you'd do
> > a get inside the coprocessor against the column of many entries. The
> > get would go via blooms.
> >
> > St.Ack
> >
> > On Sat, Jan 8, 2011 at 11:39 AM, Stack wrote:
> >> On Sat, Jan 8, 2011 at 11:35 AM, Jack Levin wrote:
> >>> Yes, we thought about using filters, the issue is, if one family
> >>> column has 1ml values, and second family column has 10 values at the
> >>> bottom, we would end up scanning and filtering 99990 records and
> >>> throwing them away, which seems inefficient.
> >>
> >> Blooms+filters?
> >> St.Ack
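
For illustration only, here is a rough sketch of the "lean on the smaller column" idea from the quoted thread. It is plain Python over in-memory collections, not HBase or coprocessor API; the names lean_join, has_text, photo_rows and text_rows are hypothetical. The set membership test stands in for the per-item get that, in HBase, could be served via bloom filters.

```python
from typing import Callable, Iterable, List


def lean_join(small_side: Iterable[str],
              exists_in_large: Callable[[str], bool]) -> List[str]:
    """Drive the AND from the small side: one point lookup per small-side key.

    The large side is never scanned; it is only probed key by key.
    """
    return [row_key for row_key in small_side if exists_in_large(row_key)]


# Hypothetical data shaped like the example in the thread:
# every row 00days:uid .. 10days:uid has type:text, only 10days:uid has type:photo.
text_rows = {f"{day:02d}days:uid" for day in range(11)}
photo_rows = ["10days:uid"]

stats = {"lookups": 0}


def has_text(row_key: str) -> bool:
    # Stand-in for a get against the column with many entries.
    stats["lookups"] += 1
    return row_key in text_rows


matches = lean_join(photo_rows, has_text)
print(matches, stats["lookups"])
```

The cost is proportional to the number of entries on the small side (one lookup here), rather than to the number of text-only rows that a scan plus filter would read and discard.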