hbase-user mailing list archives

From Ryan Rawson <ryano...@gmail.com>
Subject Re: Behaviour of filters within scans
Date Mon, 19 Apr 2010 05:41:03 GMT
Yes, you are correct: filterRow() only offers the chance to reject the
row; editing the row was expected to be done in filterKeyValue().

The problem with the filter "interface" is it is highly tied to the
implementation, which is why things look perhaps a little weird and
not super generic. Previously the filter was expected to be run only
at the StoreScanner level, so that might explain a few things.

I think an additional edit call that gives a filter the final,
last-minute say over a row's worth of results might be workable now.

I'd review such a patch.
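
A minimal sketch of what such a row-level edit hook could look like, for
readers following the thread. It uses stand-in types rather than the real
HBase classes: `Cell`, `RowEditFilter`, `filterRow(List<Cell>)`, and the
"ok" condition are all hypothetical names chosen for illustration, not an
existing API. The example keeps a col-b cell only when a col-a cell with
the same timestamp passes the condition, which is the use case described
further down the thread.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Stand-in types only: none of this is the real HBase API.
final class RowPruneSketch {

    /** Hypothetical, simplified stand-in for a KeyValue. */
    static final class Cell {
        final String row;
        final String column;
        final long timestamp;
        final String value;

        Cell(String row, String column, long timestamp, String value) {
            this.row = row;
            this.column = column;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /** Hypothetical hook: called once per row with all gathered cells. */
    interface RowEditFilter {
        /** May remove entries from results; the list is edited in place. */
        void filterRow(List<Cell> results);
    }

    /** Keeps a col-b cell only if some col-a cell with the same timestamp is "ok". */
    static final class MatchingTimestampFilter implements RowEditFilter {
        @Override
        public void filterRow(List<Cell> results) {
            // First pass: collect the timestamps of col-a cells that pass the check.
            Set<Long> okTimestamps = new HashSet<>();
            for (Cell c : results) {
                if (c.column.equals("col-a") && c.value.equals("ok")) {
                    okTimestamps.add(c.timestamp);
                }
            }
            // Second pass: drop col-b cells without a matching col-a timestamp.
            results.removeIf(c ->
                c.column.equals("col-b") && !okTimestamps.contains(c.timestamp));
        }
    }
}
```

Because the hook sees the whole row at once, this sketch does not depend
on col-a sorting before col-b.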


On Sun, Apr 18, 2010 at 10:30 PM, Juhani Connolly <juhani@ninja.co.jp> wrote:
> Thanks for your response
> On 04/19/2010 12:59 PM, Ryan Rawson wrote:
>> I think all the functionality is there between these 2 calls:
>> Filter#filterKeyValue(KeyValue kv);
>> and
>> Filter#filterRow();
>> In the first call you can cache the KeyValues locally in the filter
>> state (in a List<KeyValue>  for example).  In the last call you can do
>> your custom logic based on all the KeyValues you have seen.  There is
>> little to no cost to do this, since retaining references to a KeyValue
>> is cheap (ish, relatively, etc).
> But ultimately the only thing I can do with Filter#filterRow() is drop the
> full row? Am I missing something here? Were I to store references to all the
> key values that have passed through at most I could zero out their buffers
> in the #filterRow call? I'm not sure what the consequences of this might be
> afterwards as the scanner tries to send a load of empty cells. Looking at
> HRegionServer#next(final long scannerId, int nbRows), it seems to me that
> they would get packed into Result to get sent back to the client. I could
> certainly cut down on a lot of transfer by just sending "empty" keyvalues,
> but it still seems like a lot of overhead that could be lost by a small api
> change. Or am I missing something here?
>> The filter implementation has changed a bit since August 2009, and it
>> might be possible to create a call like
>> Filter#filterRow(List<KeyValue>  results) that is called at the "end"
>> of a row... you can get the same effect as I noted above.  It is just
>> a matter of API, not of semantics.
> Having followed the code, it did seem like it would be trivial to implement
> such an extra api either before or after the Filter#filterRow(). I believe
> the option of having the ability to knock keyvals out of the list would save
> on processing later.
> I would be happy to try putting together the minor modification to
> RegionScanner and adding a unit test if such a modification were welcome.
>> I would generally discourage you from structuring your data to fit an
>> internal implementation detail.  While there are no current plans to
>> change sorting order, it would make your code more brittle.
> I certainly wouldn't want to do it :) I'm going to have to see how much
> overhead I get with a) just dealing with it client end or b) keeping
> references and zeroing the keyvals and go from there.
>> -ryan
>> On Sun, Apr 18, 2010 at 8:48 PM, Juhani Connolly <juhani@ninja.co.jp> wrote:
>>> I've spent some time looking through the regionscanner logic, in particular
>>> the filter related parts and would want to check if a) my current
>>> understanding is correct and b) if this may be subject to change.
>>> short/simplified version to avoid getting sidetracked:
>>> - A RegionScanner is built from a series of scanners attached to each Store.
>>> - This list of scanners is stored in a KeyValueHeap which compares KeyValues
>>> to sort the order in which entries are retrieved by RegionScanner->next
>>> - To check the order in which keys will be returned, and thus filtered, one
>>> can look at KeyValue.KeyComparator->compare. It's something like: sort by
>>> row, then column family, then column, then timestamp
>>> Filters are applied as described in
>>> http://hadoop.apache.org/hbase/docs/r0.20.3/api/org/apache/hadoop/hbase/filter/Filter.html
>>> In the end, when using filterKeyValue(KeyValue) one can expect the keyValues
>>> to be sent to it in a sorted order. Will this always be the case?
>>> I ask this because I currently plan to filter the values of col-b based on
>>> the values in col-a. This could be achieved by making sure col-a compares
>>> lower than col-b and storing some kind of data (e.g. a list of "ok"
>>> timestamps) within the custom filter. Does this all sound ok?
>>> Finally it would be nice to see the option to filter a full set, as naming
>>> columns to guarantee a certain sorting for filters seems pretty dubious:
>>> - Probably in HRegion.Regionserver->next after nextInternal, before
>>> filterRow?
>>> - This would allow a potential filter to go through the gathered results and
>>> prune them depending on intercolumn dependencies?
>>> - I believe it would unlock a lot of possibilities for custom filters that
>>> could cut down on a significant amount of transfer where a row's data could
>>> be pruned regionserver side rather than at the client. My particular
>>> application is to only store col-b where there is a col-a with a
>>> corresponding timestamp that matches specific conditions. In my particular
>>> case this results in massive reductions in the number of cells being sent
>>> from the regionserver.
>>> Any thoughts would be appreciated.
>>> As an aside, I believe HRegion.RegionScanner->nextInternal is doing
>>> filterRowKey for every key in a row even if it has passed once? Is this
>>> intentional behaviour (it seems somewhat unexpected)? Otherwise it could
>>> be optimised by just checking the samerow variable.
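
The ordering described in the original message (row, then column family,
then column, then timestamp) can be sketched as a plain comparator. This is
a simplification under stated assumptions, not the actual
KeyValue.KeyComparator: the `Key` class is a hypothetical stand-in, strings
are used where HBase compares raw bytes, and the newest-first timestamp
order reflects HBase's usual behaviour as far as I know.

```java
import java.util.Comparator;

// Stand-in types only: not the real HBase KeyValue or its comparator.
final class CellOrderSketch {

    /** Hypothetical, simplified key: HBase itself compares byte arrays. */
    static final class Key {
        final String row;
        final String family;
        final String qualifier;
        final long timestamp;

        Key(String row, String family, String qualifier, long timestamp) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
        }
    }

    /** Row, then family, then qualifier ascending; timestamp descending. */
    static final Comparator<Key> ORDER = (a, b) -> {
        int c = a.row.compareTo(b.row);
        if (c != 0) return c;
        c = a.family.compareTo(b.family);
        if (c != 0) return c;
        c = a.qualifier.compareTo(b.qualifier);
        if (c != 0) return c;
        // Assumption: newer timestamps sort first, so the operands are swapped.
        return Long.compare(b.timestamp, a.timestamp);
    };
}
```

Under this ordering every col-a cell of a row reaches the filter before any
col-b cell of the same row and family, which is the property the col-a/col-b
plan above relies on.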
