accumulo-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Kini, Ameet M." <ak...@mitre.org>
Subject RE: filter on value ranges
Date Fri, 09 Mar 2012 23:46:27 GMT


Keith/Aaron,

All good points. For the short term, I'm ok with a full scan for this query. The main requirement
at this point is to push all filtering down to the tablet server. 

Not sure if I understand why map reduce is needed - how does the Scanner + Custom Filter approach
fall short over map reduce? 

-Ameet Kini


-----Original Message-----
From: Keith Turner [mailto:keith@deenlo.com] 
Sent: Friday, March 09, 2012 3:50 PM
To: accumulo-user@incubator.apache.org
Subject: Re: filter on value ranges

Using a filter vs building an index depends on the circumstances.  Are
you going to ask the question multiple times?  How much data will the
query match?  Etc.    If 50% of the rows will match the filter
criteria, do a full table scan is ok and you probably want to do that
scan using map reduce.  The map reduce job can use a filter to push
computation to the tserver.

Keith

On Fri, Mar 9, 2012 at 3:31 PM, Aaron Cordova <aaron@cordovas.org> wrote:
> If I may make one more argument for not using a filter and using a separate table as
a secondary index instead, keep in mind you'll have to scan over the entire table to perform
this query, since the rows containing the values you're after may appear anywhere in the table,
i.e. all your queries will take a long time.
>
> For most users of Accumulo time is more precious than storage space (which keeps getting
cheaper and more plentiful, unlike time), so creating a secondary index is the path usually
chosen over full table scans.
>
>
> On Mar 9, 2012, at 2:48 PM, Keith Turner wrote:
>
>> The WholeRowIterator can filter rows, just override it and implement
>> the filter function.
>>
>> Also new in 1.4 is org.apache.accumulo.core.iterators.user.RowFilter.
>> If provides similar functionality, but does not require reading the
>> entire row into memory.
>>
>> Keith
>>
>> On Fri, Mar 9, 2012 at 1:11 PM, Kini, Ameet M. <akini@mitre.org> wrote:
>>>
>>>
>>>
>>>
>>> Thanks for the comments.
>>>
>>>
>>>
>>> I'm ok with rolling my own iterator/filter but not sure how to go about
>>> doing it (see next para), so it'd be great to get pointers on it.  I'd
>>> prefer keeping the schema to how it is today where each employee is
>>> represented by a row in the table with a properties cf containing name and
>>> salary cq. Here's how it looks today
>>>
>>>
>>>
>>> rowID colfam     colqual         value
>>>
>>>
>>>
>>> abc  properties name            john
>>>
>>> abc  properties salary          10000
>>>
>>> def  properties name            alice
>>>
>>> def  properties salary          20000
>>>
>>>
>>>
>>> Part of my confusion lies in not knowing how to implement this range filter
>>> class, because my query needs to get both the name as well as salary based
>>> on a particular salary. What I would like to do is something like a Filter
>>> equivalent to WholeRowIterator, say WholeRowFilter whose accept(Key k, Value
>>> v) was provided the entire row in the Value argument alongwith appropriate
>>> encodeRow/decodeRow as in WholeRowIterator. If the accept method returns
>>> true, the whole row is returned to the client. Then I could extend this
>>> class by writing a MyRangeFilter which would look inside the row and make
>>> row level accept/reject decisions based on values of particular cq.
>>>
>>>
>>>
>>> Maybe this WholeRowFilter is already there in some form?
>>>
>>>
>>>
>>> -Ameet Kini
>>>
>>>
>>>
>>> From: Aaron Cordova [mailto:aaron@cordovas.org]
>>> Sent: Friday, March 09, 2012 9:20 AM
>>> To: accumulo-user@incubator.apache.org
>>> Subject: Re: filter on value ranges
>>>
>>>
>>>
>>> To answer your question, I would not use built-in iterators for this.
>>>
>>>
>>>
>>> But if you were determined, you could use what is known as 'document
>>> sharding' as opposed to 'term sharding' and use an intersecting iterator.
>>>
>>>
>>>
>>> Instructions on how to do this should be added to the manual ...
>>>
>>>
>>>
>>>
>>>
>>> On Mar 9, 2012, at 9:07 AM, Kini, Ameet M. wrote:
>>>
>>>
>>>
>>>
>>>
>>> In 1.4, is there a way to use built-in iterators to run the following query
>>> :
>>>
>>>   "get the name and salary of all employees where the salary is between X
>>> and Y"
>>>
>>>
>>>
>>> Assuming a straightforward schema where name and salary are both cq.
>>>
>>>
>>>
>>> I'd like both the cq restriction and the range predicate applied on the
>>> tservers.
>>>
>>>
>>>
>>> I see that Scanner.setColumnQualifierRegex would take care of the cq
>>> restriction. But I don't know of a built-in iterator for the range predicate
>>> and I don't know of how to compose those two iterators.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> -Ameet Kini
>>>
>>>
>>>
>>>
>

Mime
View raw message