hbase-user mailing list archives

From Ryan Rawson <ryano...@gmail.com>
Subject Re: Behaviour of filters within scans
Date Mon, 19 Apr 2010 03:59:42 GMT
I think all the functionality is there between these 2 calls:

Filter#filterKeyValue(KeyValue kv);
Filter#filterRow();

In the first call you can cache the KeyValues locally in the filter
state (in a List<KeyValue> for example).  In the last call you can do
your custom logic based on all the KeyValues you have seen.  There is
little to no cost to do this, since retaining references to a KeyValue
is cheap (ish, relatively, etc).
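
A minimal sketch of that pattern, to make it concrete. It uses stand-in
types (Cell, ReturnCode) so that it compiles without HBase on the
classpath; they are assumptions for illustration, not the real
org.apache.hadoop.hbase classes. The "ok" check on col-a is likewise a
made-up condition, and the sketch assumes the scanner calls reset()
between rows.

```java
import java.util.ArrayList;
import java.util.List;

class RowCachingFilterSketch {

    /** Hypothetical stand-in for an HBase KeyValue (one cell). */
    static final class Cell {
        final String qualifier;
        final long timestamp;
        final String value;

        Cell(String qualifier, long timestamp, String value) {
            this.qualifier = qualifier;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /** Stand-in for the per-cell include/skip decision. */
    enum ReturnCode { INCLUDE, SKIP }

    /** Caches every cell of the current row and decides at the end of the row. */
    static final class RequireOkColAFilter {
        private final List<Cell> seen = new ArrayList<>();

        /** Assumed to be called before each new row: forget the previous row. */
        void reset() {
            seen.clear();
        }

        /** First call: only remember the cell; keeping a reference is cheap. */
        ReturnCode filterKeyValue(Cell kv) {
            seen.add(kv);
            return ReturnCode.INCLUDE;
        }

        /**
         * Last call: custom logic over everything seen in this row.
         * Returns true to drop the whole row, false to keep it.
         */
        boolean filterRow() {
            for (Cell kv : seen) {
                if (kv.qualifier.equals("col-a") && kv.value.equals("ok")) {
                    return false; // an acceptable col-a exists, keep the row
                }
            }
            return true; // no acceptable col-a, veto the row
        }
    }
}
```

Note that the decision in filterRow() does not depend on the order in
which the cells arrived, only on the full set cached for the row.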

The filter implementation has changed a bit since August 2009, and it
might be possible to create a call like
Filter#filterRow(List<KeyValue> results) that is called at the "end"
of a row... you can get the same effect as I noted above.  It is just
a matter of API, not of semantics.
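
For illustration only, here is roughly what such an end-of-row call
could look like for the col-a/col-b case from the original question.
The hook shape, the Cell stand-in type, and the "ok" condition are all
hypothetical; this is not an existing HBase API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RowPruneSketch {

    /** Hypothetical stand-in for an HBase KeyValue (one cell). */
    static final class Cell {
        final String qualifier;
        final long timestamp;
        final String value;

        Cell(String qualifier, long timestamp, String value) {
            this.qualifier = qualifier;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /**
     * Hypothetical end-of-row hook: receives all cells gathered for one
     * row and prunes them in place.
     */
    static void filterRow(List<Cell> results) {
        // Pass 1: collect the timestamps of col-a cells that meet the condition.
        Set<Long> okTimestamps = new HashSet<>();
        for (Cell c : results) {
            if (c.qualifier.equals("col-a") && c.value.equals("ok")) {
                okTimestamps.add(c.timestamp);
            }
        }
        // Pass 2: drop col-b cells that have no matching col-a timestamp.
        results.removeIf(c ->
                c.qualifier.equals("col-b") && !okTimestamps.contains(c.timestamp));
    }
}
```

Because the whole row is available at once, the result does not depend
on col-a sorting before col-b.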

I would generally discourage you from structuring your data to fit an
internal implementation detail.  While there are no current plans to
change sorting order, it would make your code more brittle.


On Sun, Apr 18, 2010 at 8:48 PM, Juhani Connolly <juhani@ninja.co.jp> wrote:
> I've spent some time looking through the regionscanner logic, in particular
> the filter-related parts, and would like to check a) whether my current
> understanding is correct and b) whether this may be subject to change.
> short/simplified version to avoid getting sidetracked:
> - A RegionScanner is built from a series of scanners attached to each Store.
> - This list of scanners is stored in a KeyValueHeap which compares KeyValues
> to sort the order in which entries are retrieved by RegionScanner->next
> - To check the order in which keys will be returned, and thus filtered, one
> can look at KeyValue.KeyComparator->compare. It's something like: sort by
> row, then column family, then column, then timestamp
> Filters are applied as described in
> http://hadoop.apache.org/hbase/docs/r0.20.3/api/org/apache/hadoop/hbase/filter/Filter.html
> In the end, when using filterKeyValue(KeyValue) one can expect the keyValues
> to be sent to it in a sorted order. Will this always be the case?
> I ask this because I currently plan to filter the values of col-b based on
> the values in col-a. This could be achieved by making sure col-a compares
> lower than col-b and storing some kind of data (e.g. a list of "ok"
> timestamps) within the custom filter. Does this all sound ok?
> Finally, it would be nice to see the option to filter a full set, as naming
> columns to guarantee a certain sorting for filters seems pretty dubious:
> - Probably in HRegion.Regionserver->next after nextInternal, before
> filterRow?
> - This would allow a potential filter to go through the gathered results and
> prune them depending on intercolumn dependencies?
> - I believe it would unlock a lot of possibilities for custom filters that
> could cut down on a significant amount of transfers where a row's data could
> be pruned regionserver side rather than at the client. My particular
> application is to only store col-b where there is a col-a with a
> corresponding timestamp that matches specific conditions. In my particular
> case this results in massive reductions in the amount of cells being sent
> from the regionserver.
> Any thoughts would be appreciated.
> As an aside, I believe HRegion.RegionScanner->nextInternal is doing
> filterRowKey for every key in a row even if it has passed once? Is this
> intentional behaviour (it seems somewhat unexpected)? Otherwise it could
> be optimised by just checking the samerow variable.
