Mailing-List: contact accumulo-user-help@incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: accumulo-user@incubator.apache.org
Received-SPF: neutral (nike.apache.org: local policy)
MIME-Version: 1.0
In-Reply-To: <D4A4E7DA705CD141A76877C2206A60A20710531C@IMCMBX02.MITRE.ORG>
References: <D4A4E7DA705CD141A76877C2206A60A207105282@IMCMBX02.MITRE.ORG>
	<CAGUtCHo13mfvBZLOjuG2F3ix9sfZ6CdTq_js5bLuLpJ5Rg1X=A@mail.gmail.com>
	<35CFD224-DCD7-4C42-B93B-19298D9D5AA6@cordovas.org>
	<CAGUtCHqsgNBy+Tpwvk_+r74VJitxK4Td8-OGUtJUuOp5-bMJYg@mail.gmail.com>
	<D4A4E7DA705CD141A76877C2206A60A20710531C@IMCMBX02.MITRE.ORG>
Date: Fri, 9 Mar 2012 19:15:11 -0500
Message-ID: 
 <CAGUtCHoMp+9wZF54V+4vNDGcCBBjOkc_48aWV=nWdx4MpEyNLg@mail.gmail.com>
Subject: Re: filter on value ranges
From: Keith Turner <keith@deenlo.com>
To: accumulo-user@incubator.apache.org
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

The scanner reads the data sequentially, so the filtering is pushed
out to the tablet servers one at a time.  You can use a batch scanner to
initiate parallel filtering on the tablet servers; however, if a lot of
data would be returned, use map reduce.
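
To make the sequential vs. parallel difference concrete, here is a rough
sketch in plain Java. It is only a simulation and does not use the Accumulo
API: the class TabletFilterSketch, its methods, and the in-memory "tablets"
are all made up for illustration, and the sample rows are the ones from the
schema quoted further down. The first method filters one tablet after
another, roughly what a Scanner does; the second hands each tablet to its
own thread, which is the idea behind a batch scanner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Plain-Java simulation; none of these names are Accumulo API.
class TabletFilterSketch {

    // Build one in-memory "tablet": rowID -> (column qualifier -> value).
    static TreeMap<String, Map<String, String>> tablet(String[]... cells) {
        TreeMap<String, Map<String, String>> rows = new TreeMap<>();
        for (String[] c : cells) {
            rows.computeIfAbsent(c[0], r -> new TreeMap<>()).put(c[1], c[2]);
        }
        return rows;
    }

    // Two tablets holding the sample rows from the thread.
    static List<TreeMap<String, Map<String, String>>> sampleTablets() {
        return Arrays.asList(
            tablet(new String[] {"abc", "name", "john"},
                   new String[] {"abc", "salary", "10000"}),
            tablet(new String[] {"def", "name", "alice"},
                   new String[] {"def", "salary", "20000"}));
    }

    // Row-level decision: keep the row only if salary is within [lo, hi].
    static boolean acceptRow(Map<String, String> row, long lo, long hi) {
        String salary = row.get("salary");
        if (salary == null) {
            return false;
        }
        long s = Long.parseLong(salary);
        return lo <= s && s <= hi;
    }

    // The filtering work done "on" one tablet; returns "name salary" strings.
    static List<String> filterTablet(TreeMap<String, Map<String, String>> t,
                                     long lo, long hi) {
        List<String> out = new ArrayList<>();
        for (Map<String, String> row : t.values()) {
            if (acceptRow(row, lo, hi)) {
                out.add(row.get("name") + " " + row.get("salary"));
            }
        }
        return out;
    }

    // Scanner-like: tablets are filtered one after another.
    static List<String> sequentialFilter(
            List<TreeMap<String, Map<String, String>>> tablets, long lo, long hi) {
        List<String> out = new ArrayList<>();
        for (TreeMap<String, Map<String, String>> t : tablets) {
            out.addAll(filterTablet(t, lo, hi));
        }
        return out;
    }

    // Batch-scanner-like: every tablet is filtered on its own thread.
    static List<String> parallelFilter(
            List<TreeMap<String, Map<String, String>>> tablets, long lo, long hi) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, tablets.size()));
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (TreeMap<String, Map<String, String>> t : tablets) {
                Callable<List<String>> task = () -> filterTablet(t, lo, hi);
                futures.add(pool.submit(task));
            }
            List<String> out = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                out.addAll(f.get()); // collected in submission order in this sketch
            }
            return out;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(sequentialFilter(sampleTablets(), 15000, 25000)); // [alice 20000]
        System.out.println(parallelFilter(sampleTablets(), 5000, 25000));    // [john 10000, alice 20000]
    }
}
```

The sketch collects results in submission order only because that was the
simplest thing to write. A real batch scanner does not return keys in sorted
order, so do not rely on ordering there.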

On Fri, Mar 9, 2012 at 6:46 PM, Kini, Ameet M. <akini@mitre.org> wrote:
>
>
> Keith/Aaron,
>
> All good points. For the short term, I'm ok with a full scan for this query. The main requirement at this point is to push all filtering down to the tablet server.
>
> Not sure if I understand why map reduce is needed - how does the Scanner + Custom Filter approach fall short over map reduce?
>
> -Ameet Kini
>
>
> -----Original Message-----
> From: Keith Turner [mailto:keith@deenlo.com]
> Sent: Friday, March 09, 2012 3:50 PM
> To: accumulo-user@incubator.apache.org
> Subject: Re: filter on value ranges
>
> Using a filter vs building an index depends on the circumstances.  Are
> you going to ask the question multiple times?  How much data will the
> query match?  Etc.  If 50% of the rows will match the filter
> criteria, doing a full table scan is ok and you probably want to do that
> scan using map reduce.  The map reduce job can use a filter to push
> computation to the tserver.
>
> Keith
>
> On Fri, Mar 9, 2012 at 3:31 PM, Aaron Cordova <aaron@cordovas.org> wrote:
>> If I may make one more argument for not using a filter and using a separate table as a secondary index instead, keep in mind you'll have to scan over the entire table to perform this query, since the rows containing the values you're after may appear anywhere in the table, i.e. all your queries will take a long time.
>>
>> For most users of Accumulo time is more precious than storage space (which keeps getting cheaper and more plentiful, unlike time), so creating a secondary index is the path usually chosen over full table scans.
>>
>>
>> On Mar 9, 2012, at 2:48 PM, Keith Turner wrote:
>>
>>> The WholeRowIterator can filter rows, just override it and implement
>>> the filter function.
>>>
>>> Also new in 1.4 is org.apache.accumulo.core.iterators.user.RowFilter.
>>> It provides similar functionality, but does not require reading the
>>> entire row into memory.
>>>
>>> Keith
>>>
>>> On Fri, Mar 9, 2012 at 1:11 PM, Kini, Ameet M. <akini@mitre.org> wrote:
>>>>
>>>>
>>>>
>>>>
>>>> Thanks for the comments.
>>>>
>>>>
>>>>
>>>> I'm ok with rolling my own iterator/filter but not sure how to go about
>>>> doing it (see next para), so it'd be great to get pointers on it.  I'd
>>>> prefer keeping the schema to how it is today where each employee is
>>>> represented by a row in the table with a properties cf containing name and
>>>> salary cq. Here's how it looks today
>>>>
>>>>
>>>>
>>>> rowID  colfam      colqual  value
>>>>
>>>> abc    properties  name     john
>>>> abc    properties  salary   10000
>>>> def    properties  name     alice
>>>> def    properties  salary   20000
>>>>
>>>>
>>>>
>>>> Part of my confusion lies in not knowing how to implement this range filter
>>>> class, because my query needs to get both the name as well as salary based
>>>> on a particular salary. What I would like to do is something like a Filter
>>>> equivalent to WholeRowIterator, say WholeRowFilter whose accept(Key k, Value
>>>> v) was provided the entire row in the Value argument along with appropriate
>>>> encodeRow/decodeRow as in WholeRowIterator. If the accept method returns
>>>> true, the whole row is returned to the client. Then I could extend this
>>>> class by writing a MyRangeFilter which would look inside the row and make
>>>> row level accept/reject decisions based on values of particular cq.
>>>>
>>>>
>>>>
>>>> Maybe this WholeRowFilter is already there in some form?
>>>>
>>>>
>>>>
>>>> -Ameet Kini
>>>>
>>>>
>>>>
>>>> From: Aaron Cordova [mailto:aaron@cordovas.org]
>>>> Sent: Friday, March 09, 2012 9:20 AM
>>>> To: accumulo-user@incubator.apache.org
>>>> Subject: Re: filter on value ranges
>>>>
>>>>
>>>>
>>>> To answer your question, I would not use built-in iterators for this.
>>>>
>>>>
>>>>
>>>> But if you were determined, you could use what is known as 'document
>>>> sharding' as opposed to 'term sharding' and use an intersecting iterator.
>>>>
>>>>
>>>>
>>>> Instructions on how to do this should be added to the manual ...
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Mar 9, 2012, at 9:07 AM, Kini, Ameet M. wrote:
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> In 1.4, is there a way to use built-in iterators to run the following query:
>>>>
>>>>   "get the name and salary of all employees where the salary is between X
>>>> and Y"
>>>>
>>>>
>>>>
>>>> Assuming a straightforward schema where name and salary are both cq.
>>>>
>>>>
>>>>
>>>> I'd like both the cq restriction and the range predicate applied on the
>>>> tservers.
>>>>
>>>>
>>>>
>>>> I see that Scanner.setColumnQualifierRegex would take care of the cq
>>>> restriction. But I don't know of a built-in iterator for the range predicate
>>>> and I don't know how to compose those two iterators.
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> -Ameet Kini
>>>>
>>>>
>>>>
>>>>
>>