Date: Fri, 9 Mar 2012 19:15:11 -0500
Subject: Re: filter on value ranges
From: Keith Turner
To: accumulo-user@incubator.apache.org
References: <35CFD224-DCD7-4C42-B93B-19298D9D5AA6@cordovas.org>

The scanner reads the data sequentially and pushes the filtering out to
the tablet servers sequentially. You can use a batch scanner to initiate
parallel filtering on the tablet servers; however, if a lot of data would
be returned, use map reduce.

On Fri, Mar 9, 2012 at 6:46 PM, Kini, Ameet M. wrote:
>
> Keith/Aaron,
>
> All good points. For the short term, I'm ok with a full scan for this query. The main requirement at this point is to push all filtering down to the tablet server.
>
> Not sure if I understand why map reduce is needed - how does the Scanner + Custom Filter approach fall short over map reduce?
>
> -Ameet Kini
>
>
> -----Original Message-----
> From: Keith Turner [mailto:keith@deenlo.com]
> Sent: Friday, March 09, 2012 3:50 PM
> To: accumulo-user@incubator.apache.org
> Subject: Re: filter on value ranges
>
> Using a filter vs building an index depends on the circumstances. Are
> you going to ask the question multiple times? How much data will the
> query match? Etc. If 50% of the rows will match the filter
> criteria, doing a full table scan is OK and you probably want to do that
> scan using map reduce. The map reduce job can use a filter to push
> computation to the tserver.
>
> Keith
>
> On Fri, Mar 9, 2012 at 3:31 PM, Aaron Cordova wrote:
>> If I may make one more argument for not using a filter and using a separate table as a secondary index instead, keep in mind you'll have to scan over the entire table to perform this query, since the rows containing the values you're after may appear anywhere in the table, i.e. all your queries will take a long time.
>>
>> For most users of Accumulo time is more precious than storage space (which keeps getting cheaper and more plentiful, unlike time), so creating a secondary index is the path usually chosen over full table scans.
>>
>>
>> On Mar 9, 2012, at 2:48 PM, Keith Turner wrote:
>>
>>> The WholeRowIterator can filter rows, just override it and implement
>>> the filter function.
>>>
>>> Also new in 1.4 is org.apache.accumulo.core.iterators.user.RowFilter.
>>> It provides similar functionality, but does not require reading the
>>> entire row into memory.
>>>
>>> Keith
>>>
>>> On Fri, Mar 9, 2012 at 1:11 PM, Kini, Ameet M. wrote:
>>>>
>>>> Thanks for the comments.
>>>>
>>>> I'm ok with rolling my own iterator/filter but not sure how to go about
>>>> doing it (see next para), so it'd be great to get pointers on it. I'd
>>>> prefer keeping the schema to how it is today where each employee is
>>>> represented by a row in the table with a properties cf containing name and
>>>> salary cq. Here's how it looks today
>>>>
>>>> rowID  colfam      colqual  value
>>>>
>>>> abc    properties  name     john
>>>> abc    properties  salary   10000
>>>> def    properties  name     alice
>>>> def    properties  salary   20000
>>>>
>>>> Part of my confusion lies in not knowing how to implement this range filter
>>>> class, because my query needs to get both the name as well as salary based
>>>> on a particular salary.
>>>> What I would like to do is something like a Filter
>>>> equivalent to WholeRowIterator, say WholeRowFilter whose accept(Key k, Value
>>>> v) was provided the entire row in the Value argument along with appropriate
>>>> encodeRow/decodeRow as in WholeRowIterator. If the accept method returns
>>>> true, the whole row is returned to the client. Then I could extend this
>>>> class by writing a MyRangeFilter which would look inside the row and make
>>>> row level accept/reject decisions based on values of particular cq.
>>>>
>>>> Maybe this WholeRowFilter is already there in some form?
>>>>
>>>> -Ameet Kini
>>>>
>>>>
>>>> From: Aaron Cordova [mailto:aaron@cordovas.org]
>>>> Sent: Friday, March 09, 2012 9:20 AM
>>>> To: accumulo-user@incubator.apache.org
>>>> Subject: Re: filter on value ranges
>>>>
>>>> To answer your question, I would not use built-in iterators for this.
>>>>
>>>> But if you were determined, you could use what is known as 'document
>>>> sharding' as opposed to 'term sharding' and use an intersecting iterator.
>>>>
>>>> Instructions on how to do this should be added to the manual ...
>>>>
>>>>
>>>> On Mar 9, 2012, at 9:07 AM, Kini, Ameet M. wrote:
>>>>
>>>> In 1.4, is there a way to use built-in iterators to run the following query:
>>>>
>>>>   "get the name and salary of all employees where the salary is between X
>>>> and Y"
>>>>
>>>> Assuming a straightforward schema where name and salary are both cq.
>>>>
>>>> I'd like both the cq restriction and the range predicate applied on the
>>>> tservers.
>>>>
>>>> I see that Scanner.setColumnQualifierRegex would take care of the cq
>>>> restriction. But I don't know of a built-in iterator for the range predicate
>>>> and I don't know how to compose those two iterators.
>>>>
>>>> Thanks,
>>>>
>>>> -Ameet Kini
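
The row-level accept/reject decision discussed in this thread can be sketched in plain Java. This is only an illustrative model under stated assumptions, not Accumulo code: it deliberately avoids the Accumulo API, and the names SalaryRangeRowFilter, Entry, acceptRow and filter are hypothetical. In Accumulo itself the same decision would presumably live in a server-side iterator, such as a subclass of the RowFilter mentioned above. The sketch also assumes that salaries are stored as decimal strings and should be compared numerically rather than lexicographically.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, Accumulo-free model of a row-level range filter.
final class SalaryRangeRowFilter {

    /** One table entry: rowID, column qualifier, value (column family omitted for brevity). */
    static final class Entry {
        final String row;
        final String qualifier;
        final String value;

        Entry(String row, String qualifier, String value) {
            this.row = row;
            this.qualifier = qualifier;
            this.value = value;
        }
    }

    private final long min;
    private final long max;

    SalaryRangeRowFilter(long min, long max) {
        this.min = min;
        this.max = max;
    }

    /** Accepts a whole row if its "salary" value lies in [min, max]. */
    boolean acceptRow(List<Entry> rowEntries) {
        for (Entry e : rowEntries) {
            if ("salary".equals(e.qualifier)) {
                try {
                    long salary = Long.parseLong(e.value.trim());
                    return salary >= min && salary <= max;
                } catch (NumberFormatException ex) {
                    return false; // unparsable salary: reject the row
                }
            }
        }
        return false; // no salary column: reject the row
    }

    /** Groups entries by row (keeping input order) and returns all entries of accepted rows. */
    List<Entry> filter(List<Entry> entries) {
        Map<String, List<Entry>> byRow = new LinkedHashMap<>();
        for (Entry e : entries) {
            byRow.computeIfAbsent(e.row, k -> new ArrayList<>()).add(e);
        }
        List<Entry> accepted = new ArrayList<>();
        for (List<Entry> row : byRow.values()) {
            if (acceptRow(row)) {
                accepted.addAll(row);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Sample data taken from the schema shown in the thread.
        List<Entry> table = new ArrayList<>();
        table.add(new Entry("abc", "name", "john"));
        table.add(new Entry("abc", "salary", "10000"));
        table.add(new Entry("def", "name", "alice"));
        table.add(new Entry("def", "salary", "20000"));

        for (Entry e : new SalaryRangeRowFilter(15000, 25000).filter(table)) {
            System.out.println(e.row + " " + e.qualifier + " " + e.value);
        }
    }
}
```

With the sample rows from the thread and the range 15000 to 25000, only row def is accepted, so both its name and its salary entries are returned. The design point is that the decision is made once per row and then applied to every entry in that row, which is what a per-entry accept(Key, Value) filter cannot do on its own.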