hbase-dev mailing list archives

From Andrew Purtell <apurt...@apache.org>
Subject Re: Filter use cases
Date Wed, 27 May 2009 23:35:11 GMT
+1 on this API. Looks good. 




________________________________
From: Ryan Rawson <ryanobjc@gmail.com>
To: hbase-dev@hadoop.apache.org; jlist@streamy.com
Sent: Wednesday, May 27, 2009 12:06:31 AM
Subject: Re: Filter use cases

Here is a suggested API.  I included the call flow in the interface docs as
well.

I dropped rowProcessed() since only PageRowFilter used it, and it can get
the data another way.  I also dropped processAlways().  This seems like
internal workings of RowFilterSet, and should ideally be maintained there.

This row filter interface adds one feature we can't support efficiently
right now:
- filter up to N columns, then skip the rest.

The current API can do that, but not efficiently.

Remember, as we write filters, columns are seen in sorted order.  To be
efficient, at all steps we need to take advantage of the sorted order of
things.

/**
 * Interface for row and column filters directly applied within the
 * regionserver.
 * A filter can expect the following call sequence:
 *
 * - reset();
 * - filterAllRemaining() -> true indicates the scan is over, false means
 *   keep going.
 * - filterRowKey(byte[],int,int); -> true to drop this row
 * if false, we will also call:
 * - filterValue(KeyValue); -> a ReturnCode: INCLUDE, SKIP or NEXT_ROW
 * - filterRow(); -> last chance to drop the entire row based on the sequence
 *   of filterValue() calls. E.g. filter a row if it doesn't contain a
 *   specified column.
 *
 * Filter instances are created one per region/scan.
 */
public interface NewRowFilterInterface extends Writable {
  /**
   * Reset the state of the filter between rows.
   */
  public void reset();

  /**
   * Filters a row based on the row key. If this returns true, the entire
   * row will be excluded.  If false, each KeyValue in the row will be
   * passed to filterValue() below.
   *
   * @param buffer buffer containing row key
   * @param offset offset into buffer where row key starts
   * @param length length of the row key
   * @return true to remove the entire row, false to keep it (filterRow() may
   *         still veto it).
   */
  public boolean filterRowKey(byte [] buffer, int offset, int length);

  /**
   * If this returns true, the scan will terminate.
   *
   * @return true to end scan, false to continue.
   */
  public boolean filterAllRemaining();

  /**
   * A way to filter based on the column family, column qualifier and/or the
   * column value. Return code is described below.  This allows a filter to
   * include only a certain number of columns, then terminate without
   * matching every column.
   *
   * @param v the KeyValue in question
   * @return code as described below
   */
  public ReturnCode filterValue(KeyValue v);

  /**
   * Return codes for filterValue().
   */
  public enum ReturnCode {
    /**
     * Include the KeyValue
     */
    INCLUDE,
    /**
     * Skip this KeyValue
     */
    SKIP,
    /**
     * Done with columns, skip to next row. Note that filterRow() will
     * still be called.
     */
    NEXT_ROW,
  };

  /**
   * Last chance to veto row based on previous filterValue() calls. The
   * filter needs to retain state and then return a particular value from
   * this call if it wishes to exclude a row when a certain column is missing
   * (for example).
   *
   * @return true to exclude row, false to include row.
   */
  public boolean filterRow();

}
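
As an illustration of the "filter up to N columns, skip the rest" case, here
is a minimal sketch of how a filter and its caller might look. Everything in
it is an assumption made so the example compiles on its own: SketchCell stands
in for KeyValue, SketchRowFilter is a trimmed copy of the interface above
(Writable omitted), and ScanSketch.scanRow() is only a guess at how the
regionserver would drive the documented call sequence for a single row.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for KeyValue; only the qualifier is modeled.
final class SketchCell {
  final String qualifier;
  SketchCell(String qualifier) { this.qualifier = qualifier; }
}

// Trimmed copy of the proposed interface (Writable left out for brevity).
interface SketchRowFilter {
  enum ReturnCode { INCLUDE, SKIP, NEXT_ROW }
  void reset();
  boolean filterRowKey(byte[] buffer, int offset, int length);
  boolean filterAllRemaining();
  ReturnCode filterValue(SketchCell v);
  boolean filterRow();
}

// Keeps at most `limit` columns per row, then asks to move to the next row.
final class ColumnLimitFilter implements SketchRowFilter {
  private final int limit;
  private int seen;

  ColumnLimitFilter(int limit) { this.limit = limit; }

  public void reset() { seen = 0; }
  public boolean filterRowKey(byte[] buffer, int offset, int length) { return false; }
  public boolean filterAllRemaining() { return false; }

  public ReturnCode filterValue(SketchCell v) {
    if (seen >= limit) {
      return ReturnCode.NEXT_ROW;  // remaining columns are never examined
    }
    seen++;
    return ReturnCode.INCLUDE;
  }

  public boolean filterRow() { return false; }  // never vetoes the row
}

final class ScanSketch {
  // Applies the documented call sequence to one row and returns the
  // qualifiers that survive the filter.
  static List<String> scanRow(SketchRowFilter f, byte[] rowKey, List<SketchCell> cells) {
    List<String> kept = new ArrayList<>();
    f.reset();
    if (f.filterAllRemaining()) return kept;
    if (f.filterRowKey(rowKey, 0, rowKey.length)) return kept;
    for (SketchCell c : cells) {
      SketchRowFilter.ReturnCode rc = f.filterValue(c);
      if (rc == SketchRowFilter.ReturnCode.NEXT_ROW) break;
      if (rc == SketchRowFilter.ReturnCode.INCLUDE) kept.add(c.qualifier);
    }
    if (f.filterRow()) kept.clear();  // still called after NEXT_ROW
    return kept;
  }
}
```

With a limit of 2 and a row holding columns a, b and c, the loop stops at the
third filterValue() call instead of visiting every column, which is the
efficiency gain NEXT_ROW is meant to provide.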


On Tue, May 26, 2009 at 11:39 PM, Jonathan Gray <jlist@streamy.com> wrote:

> This sounds like a good initial approach for a new filter interface.
>
> +1 on moving forward with what you propose, allowing for modifications as
> we reimplement and integrate.
>
> Good stuff, Ryan!
>
> JG
>
> On Tue, May 26, 2009 11:28 pm, Ryan Rawson wrote:
> > Hi all,
> >
> >
> > With HBASE-1304, it's time to normalize and review our filter API.
> >
> >
> > Here are a few givens:
> > - all calls must be byte[] buffer, int offset, int length
> > - maybe we can have calls for KeyValue (which encodes all parts of the
> > key & value, as per the name)
> > - we'd like to get rid of the calls:
> > --   boolean filterRow(final SortedMap<byte [], Cell> columns);
> > --   boolean filterRow(final List<KeyValue> results);
> > These calls are expensive, and there is no reason to have them.
> >
> >
> > Here is a proposal; imagine a filter will see this sequence of calls:
> > - reset()
> > - filterRowKey(byte[],int,int) - true to include row, false to skip to
> >   next row
> > - filterKeyValue(KeyValue) - true to include key/value, false to skip
> >   -- can choose to filter on family, qualifier, value, anything really.
> > - filterRow() - true to include entire row, false to post-hoc veto row
> >
> >
> > In this case one could implement the "filterIfColumnMissing" feature of
> > ColumnValueFilter by carrying state and returning false from filterRow()
> > to veto the row based on the columns/values we didn't see.
> >
> > In any of these cases, all these functions will be called quite
> > frequently, so efficiency of the code is paramount.  It's probable that
> > filterRowKey() will be 'cached' by the calling code, but filterKeyValue()
> > is called for nearly every single value we would normally return (i.e.
> > it's applied _AFTER_ column matching and version and timestamp and delete
> > tracking).
> >
> > The goal is to:
> > (a) make the implementation easy and performant
> > (b) make the API normative and easy to code for
> > (c) make everything work
> >
> >
> > Thoughts?
> > -ryan
> >
> >
>
>
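
For the "filterIfColumnMissing" case mentioned in the quoted proposal, a rough
sketch of the state-carrying approach could look like the following. It uses
the convention of the revised interface (filterRow() returns true to exclude
the row), not the earlier draft's. The class name RequiredColumnSketch is
hypothetical, qualifiers are modeled as plain Strings within a single family,
and the early NEXT_ROW relies on the assumption that columns arrive in sorted
order.

```java
// Hypothetical sketch: vetoes a row unless a required qualifier was seen.
final class RequiredColumnSketch {
  enum ReturnCode { INCLUDE, SKIP, NEXT_ROW }

  private final String required;
  private boolean found;

  RequiredColumnSketch(String required) { this.required = required; }

  // Called between rows: forget what we saw in the previous row.
  void reset() { found = false; }

  ReturnCode filterValue(String qualifier) {
    int cmp = qualifier.compareTo(required);
    if (cmp == 0) {
      found = true;
    } else if (cmp > 0 && !found) {
      // Sorted order: we are already past where the required column would
      // have appeared, so the row will be vetoed and we can stop early.
      return ReturnCode.NEXT_ROW;
    }
    return ReturnCode.INCLUDE;
  }

  // true excludes the row, matching the revised interface's convention.
  boolean filterRow() { return !found; }
}
```

The filter only needs one boolean of state per row, and the sorted-order check
lets it give up on a row as soon as the required column can no longer appear.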



      