hbase-dev mailing list archives

From Ryan Rawson <ryano...@gmail.com>
Subject Filter use cases
Date Wed, 27 May 2009 06:28:40 GMT
Hi all,

With HBASE-1304, it's time to normalize and review our filter API.

Here are a few givens:
- all calls must take (byte[] buffer, int offset, int length)
- maybe we can have calls for KeyValue (which encodes all parts of the key &
value as per the name)
- we'd like to get rid of the calls:
--   boolean filterRow(final SortedMap<byte [], Cell> columns);
--   boolean filterRow(final List<KeyValue> results);
These calls are expensive, and there is no reason to have them.

Here is a proposal. Imagine a filter sees this sequence of calls:
- reset()
- filterRowKey(byte[],int,int) - true to include row, false to skip to next
row
- filterKeyValue(KeyValue) - true to include key/value, false to skip
-- can choose to filter on family, qualifier, value, anything really.
- filterRow() - true to include entire row, false to post-hoc veto row

In this case one could implement the "filterIfColumnMissing" feature of
ColumnValueFilter by carrying state and returning false from filterRow() to
veto the row based on the columns/values we didn't see.
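
To make the sequence concrete, here is a rough sketch of how a stateful filter could veto a row whose required column never showed up. Nothing here is the actual HBase API: SketchKeyValue, SketchFilter, RequireColumnFilter and SketchScanner are hypothetical names made up for illustration, and the return values follow the convention in this mail (true to include, false to skip or veto).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for KeyValue; only qualifier and value are modeled.
final class SketchKeyValue {
    final byte[] qualifier;
    final byte[] value;

    SketchKeyValue(String qualifier, String value) {
        this.qualifier = qualifier.getBytes(StandardCharsets.UTF_8);
        this.value = value.getBytes(StandardCharsets.UTF_8);
    }
}

// The proposed call sequence (assumed shape): true = include, false = skip/veto.
interface SketchFilter {
    void reset();
    boolean filterRowKey(byte[] buffer, int offset, int length);
    boolean filterKeyValue(SketchKeyValue kv);
    boolean filterRow();
}

// Carries state across filterKeyValue() calls and vetoes the row in
// filterRow() if the required column was never seen.
final class RequireColumnFilter implements SketchFilter {
    private final byte[] requiredQualifier;
    private boolean sawRequiredColumn;

    RequireColumnFilter(String requiredQualifier) {
        this.requiredQualifier = requiredQualifier.getBytes(StandardCharsets.UTF_8);
    }

    @Override public void reset() {
        sawRequiredColumn = false; // forget state from the previous row
    }

    @Override public boolean filterRowKey(byte[] buffer, int offset, int length) {
        return true; // this filter does not look at the row key
    }

    @Override public boolean filterKeyValue(SketchKeyValue kv) {
        if (Arrays.equals(kv.qualifier, requiredQualifier)) {
            sawRequiredColumn = true;
        }
        return true; // keep every key/value; the decision is deferred
    }

    @Override public boolean filterRow() {
        return sawRequiredColumn; // false vetoes the whole row post hoc
    }
}

// Hypothetical caller that drives one row through the sequence.
final class SketchScanner {
    static List<SketchKeyValue> scanRow(SketchFilter filter, byte[] rowKey,
                                        List<SketchKeyValue> row) {
        filter.reset();
        if (!filter.filterRowKey(rowKey, 0, rowKey.length)) {
            return Collections.emptyList(); // skip to next row
        }
        List<SketchKeyValue> kept = new ArrayList<>();
        for (SketchKeyValue kv : row) {
            if (filter.filterKeyValue(kv)) {
                kept.add(kv);
            }
        }
        // A post-hoc veto discards everything collected for this row.
        return filter.filterRow() ? kept : Collections.<SketchKeyValue>emptyList();
    }
}
```

Note that the caller in this sketch still has to buffer a row's key/values until filterRow() returns, since the veto only arrives after the last filterKeyValue() call.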

In any of these cases, all these functions will be called quite frequently,
so efficiency of the code is paramount.  It's probable that filterRowKey()
will be 'cached' by the calling code, but filterKeyValue() is called for
nearly every single value we would normally return (i.e., it's applied _AFTER_
column matching and version and timestamp and delete tracking).

The goal is to:
(a) make the implementation easy and performant
(b) make the API normative and easy to code for
(c) make everything work

Thoughts?
-ryan
