Subject: Re: question about merge-join (or AND operator betwween colums)
From: Jack Levin
To: user@hbase.apache.org
Date: Sat, 8 Jan 2011 14:57:57 -0800

In the future we plan to have millions of rows, probably across multiple
regions. Even if IO is not a problem, doing millions of filter operations
does not make much sense.

-Jack

On Sat, Jan 8, 2011 at 2:54 PM, Andrey Stepachev wrote:

> OK, understood.
>
> But did you check whether it is really an issue? I think it is only 1 IO
> here (especially if compression is used). Do you have big rows?
>
> 2011/1/9 Jack Levin
>
> > Sorting is not the issue; the data can be located at the beginning,
> > middle, or end, or any combination thereof. I only gave the worst-case
> > scenario as an example. I understand that filtering will produce the
> > results we want, but at the cost of examining every row and offloading
> > the AND/join logic to the application.
> >
> > -Jack
> >
> > On Sat, Jan 8, 2011 at 1:59 PM, Andrey Stepachev wrote:
> >
> > > You can read more details on binary sorting here:
> > >
> > > http://brunodumon.wordpress.com/2010/02/17/building-indexes-using-hbase-mapping-strings-numbers-and-dates-onto-bytes/
> > >
> > > 2011/1/8 Jack Levin
> > >
> > > > The basic problem, described:
> > > >
> > > > A user uploads 1 image and creates some text 10 days ago, then
> > > > creates 1000 text messages between 9 days ago and today:
> > > >
> > > > row key    | fm:type --> value
> > > >
> > > > 00days:uid | type:text --> text_id
> > > > .
> > > > .
> > > > 09days:uid | type:text --> text_id
> > > >
> > > > 10days:uid | type:photo --> URL
> > > >            | type:text --> text_id
> > > >
> > > > Skip all the way to the 10days:uid row, without reading the
> > > > 00days:uid - 09days:uid rows. Ideally we do not want to read all
> > > > 1000 entries that have _only_ text. We want to get to the last
> > > > entry in the most efficient way possible.
> > > >
> > > > -Jack
> > > >
> > > > On Sat, Jan 8, 2011 at 11:43 AM, Stack wrote:
> > > > > Strike that. This is a Scan, so can't do blooms + filter. Sorry.
> > > > > Sounds like a coprocessor then. You'd have your query 'lean' on the
> > > > > column that you know has the lesser items and then per item, you'd do
> > > > > a get inside the coprocessor against the column of many entries. The
> > > > > get would go via blooms.
> > > > >
> > > > > St.Ack
> > > > >
> > > > > On Sat, Jan 8, 2011 at 11:39 AM, Stack wrote:
> > > > >> On Sat, Jan 8, 2011 at 11:35 AM, Jack Levin wrote:
> > > > >>> Yes, we thought about using filters. The issue is, if one family
> > > > >>> column has 1 million values, and a second family column has 10
> > > > >>> values at the bottom, we would end up scanning and filtering
> > > > >>> 999,990 records and throwing them away, which seems inefficient.
> > > > >>
> > > > >> Blooms+filters?
> > > > >> St.Ack
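
The "lean on the column with the lesser items" idea from the thread can be sketched in a few lines. This is only a toy model under stated assumptions: the two columns are plain Python dicts keyed by row key, and `and_join`, `text_col`, `photo_col`, the row keys, and the example URL are all hypothetical names invented for illustration. It does not use any HBase or coprocessor API, and whether the sparse side can really be scanned cheaply depends on how the data is laid out, which the thread does not settle.

```python
def and_join(sparse_col, dense_col):
    """Return rows present in both columns, driving from the sparse one.

    Both arguments are plain dicts of row_key -> value. They stand in for
    two HBase columns and are NOT HBase API objects.
    """
    matches = {}
    point_gets = 0
    for row_key in sorted(sparse_col):   # "scan" only the sparse column
        point_gets += 1                  # one point lookup per sparse entry
        if row_key in dense_col:         # stands in for a get on the dense column
            matches[row_key] = (sparse_col[row_key], dense_col[row_key])
    return matches, point_gets


# Hypothetical data shaped like the example in the thread: many rows that
# carry only text, and a single row that carries both a photo and text.
text_col = {f"{i:04d}:uid": f"text_{i}" for i in range(1001)}
photo_col = {"1000:uid": "http://example.invalid/photo.jpg"}

matches, point_gets = and_join(photo_col, text_col)
print(matches)
print(point_gets)
```

The point of driving from the sparse side is that the work scales with the number of photo entries rather than with the number of text-only rows that a filter would have to examine and discard.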