hbase-dev mailing list archives

From "Erik Bengtson" <e...@jpox.org>
Subject Re: Filter use cases
Date Fri, 29 May 2009 05:22:25 GMT


-----Original Message-----
From: Clint Morgan <clint.morgan@troove.net>

Date: Thu, 28 May 2009 13:45:57 
To: <hbase-dev@hadoop.apache.org>
Subject: Re: Filter use cases


Looks good to me. +1


On Thu, May 28, 2009 at 12:26 AM, Ryan Rawson <ryanobjc@gmail.com> wrote:

> Thanks all,
>
> The old RowFilterInterface will _sort of_ work.  The new code will call
> filterRowKey(byte[],int,int) and filterAllRemaining().  I unit tested the
> RowInclusiveStop, and Prefix filters along with the WhileMatchRowFilter to
> wrap them.  Tests pass.
>
> More complex filters such as ColumnMatchFilter won't work, and need to be
> ported to the new API before 0.20 (maybe tomorrow, eh?).   Stop filter is
> not necessary as a stop-row is built into the Scan specification now.
>
> We might need to wrap some of the client API so that existing use cases are
> translated to the new code.  E.g.: detect a stop-row filter and use the new
> Scan(start,end) API instead.
>
>
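A rough sketch of the stop-row translation mentioned above. Everything here is a stand-in invented for illustration: `LegacyInclusiveStop`, `ScanRange` and `LegacyScanTranslator` are hypothetical names, not HBase classes. It also assumes the new scan's end row is exclusive, which the thread does not state; under that assumption an inclusive stop row can be turned into an exclusive one by appending a zero byte.

```java
import java.util.Arrays;

// Hypothetical stand-in for a legacy filter that stops after stopRow (inclusive).
final class LegacyInclusiveStop {
    final byte[] stopRow;
    LegacyInclusiveStop(byte[] stopRow) { this.stopRow = stopRow; }
}

// Hypothetical stand-in for a scan range; ASSUMES [startRow, stopRow) semantics.
final class ScanRange {
    final byte[] startRow;
    final byte[] stopRow;

    ScanRange(byte[] startRow, byte[] stopRow) {
        this.startRow = startRow;
        this.stopRow = stopRow;
    }

    // True when row falls inside the half-open range.
    boolean contains(byte[] row) {
        return compare(row, startRow) >= 0 && compare(row, stopRow) < 0;
    }

    // Unsigned lexicographic comparison; a shorter key sorts before its extensions.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }
}

final class LegacyScanTranslator {
    // stopRow plus a trailing 0x00 is the smallest key that sorts after stopRow,
    // so using it as an exclusive end keeps stopRow itself inside the range.
    static ScanRange translate(byte[] startRow, LegacyInclusiveStop filter) {
        byte[] exclusiveStop = Arrays.copyOf(filter.stopRow, filter.stopRow.length + 1);
        return new ScanRange(startRow, exclusiveStop);
    }
}
```

With a start row of "a" and a legacy inclusive stop of "c", the translated range still contains "c" but no longer contains "ca".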
>
> On Wed, May 27, 2009 at 4:35 PM, Andrew Purtell <apurtell@apache.org>
> wrote:
>
> > +1 on this API. Looks good.
> >
> >
> >
> >
> > ________________________________
> > From: Ryan Rawson <ryanobjc@gmail.com>
> > To: hbase-dev@hadoop.apache.org; jlist@streamy.com
> > Sent: Wednesday, May 27, 2009 12:06:31 AM
> > Subject: Re: Filter use cases
> >
> > Here is a suggested API.  I included the call flow in the interface docs as
> > well.
> >
> > I dropped rowProcessed() since only PageRowFilter used it, and it can get
> > the data another way.  I also dropped processAlways().  This seems like
> > internal workings of RowFilterSet, and should ideally be maintained there.
> >
> > This row filter interface supports one feature we can't do well right now:
> > - filter up to N columns, skip the rest.
> >
> > Right now we can do that, but not efficiently.
> >
> > Remember, as we write filters, columns are seen in sorted order.  To be
> > efficient, at all steps we need to take advantage of the sorted order of
> > things.
> >
> > /**
> > * Interface for row and column filters directly applied within the
> > * regionserver.
> > * A filter can expect the following call sequence:
> > *
> > * - reset();
> > * - filterAllRemaining() -> true indicates the scan is over, false means
> > *   keep going.
> > * - filterRowKey(byte[],int,int); -> true to drop this row;
> > *   if false, we will also call:
> > * - filterValue(KeyValue); -> ReturnCode saying whether to include or skip
> > *   this key/value, or to move on to the next row
> > * - filterRow(); -> last chance to drop the entire row based on the
> > *   sequence of filterValue() calls. E.g.: filter a row if it doesn't
> > *   contain a specified column.
> > *
> > * Filter instances are created one per region/scan.
> > */
> > public interface NewRowFilterInterface extends Writable {
> >  /**
> >   * Reset the state of the filter between rows.
> >   */
> >  public void reset();
> >
> >  /**
> >   * Filters a row based on the row key. If this returns true, the entire
> >   * row will be excluded.  If false, each KeyValue in the row will be
> >   * passed to filterValue() below.
> >   *
> >   * @param buffer buffer containing row key
> >   * @param offset offset into buffer where row key starts
> >   * @param length length of the row key
> >   * @return true, remove entire row, false, include the row (maybe).
> >   */
> >  public boolean filterRowKey(byte [] buffer, int offset, int length);
> >
> >  /**
> >   * If this returns true, the scan will terminate.
> >   *
> >   * @return true to end scan, false to continue.
> >   */
> >  public boolean filterAllRemaining();
> >
> >  /**
> >   * A way to filter based on the column family, column qualifier and/or the
> >   * column value. Return code is described below.  This allows filters to
> >   * filter only a certain number of columns, then terminate without matching
> >   * every column.
> >   *
> >   * @param v the KeyValue in question
> >   * @return code as described below
> >   */
> >  public ReturnCode filterValue(KeyValue v);
> >
> >  /**
> >   * Return codes for filterValue().
> >   */
> >  public enum ReturnCode {
> >    /**
> >     * Include the KeyValue
> >     */
> >    INCLUDE,
> >    /**
> >     * Skip this KeyValue
> >     */
> >    SKIP,
> >    /**
> >     * Done with columns, skip to next row. Note that filterRow() will
> >     * still be called.
> >     */
> >    NEXT_ROW,
> >  };
> >
> >  /**
> >   * Last chance to veto row based on previous filterValue() calls. The filter
> >   * needs to retain state then return a particular value for this call if it
> >   * wishes to exclude a row if a certain column is missing (for example).
> >   *
> >   * @return true to exclude row, false to include row.
> >   */
> >  public boolean filterRow();
> >
> > }
> >
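For illustration only, here is a minimal sketch of the "filter up to N columns, skip the rest" case against the interface proposed above. To stay self-contained it uses stand-ins that are not part of the proposal: `Kv` replaces KeyValue, `RowFilterSketch` is a trimmed copy of the interface without Writable, and `FirstNColumnsFilter` and `FilterDriver` are hypothetical names. The driver loop is a guess at how a caller might follow the documented call sequence, not the regionserver's actual code.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Stand-in for KeyValue: only the fields this sketch needs.
final class Kv {
    final String row;
    final String column;
    final String value;
    Kv(String row, String column, String value) {
        this.row = row;
        this.column = column;
        this.value = value;
    }
}

// Return codes as listed in the proposed interface.
enum ReturnCode { INCLUDE, SKIP, NEXT_ROW }

// Trimmed copy of the proposed interface (Writable omitted).
interface RowFilterSketch {
    void reset();
    boolean filterAllRemaining();
    boolean filterRowKey(byte[] buffer, int offset, int length);
    ReturnCode filterValue(Kv kv);
    boolean filterRow();
}

// Hypothetical filter: include the first `limit` columns of a row, then jump ahead.
final class FirstNColumnsFilter implements RowFilterSketch {
    private final int limit;
    private int seen;

    FirstNColumnsFilter(int limit) { this.limit = limit; }

    public void reset() { seen = 0; }                       // new row, new count
    public boolean filterAllRemaining() { return false; }   // never ends the scan
    public boolean filterRowKey(byte[] buffer, int offset, int length) { return false; }
    public ReturnCode filterValue(Kv kv) {
        if (seen >= limit) return ReturnCode.NEXT_ROW;      // done with this row
        seen++;
        return ReturnCode.INCLUDE;
    }
    public boolean filterRow() { return false; }            // no post-hoc veto
}

// Hypothetical caller that walks rows in the documented order of calls.
final class FilterDriver {
    static List<String> scan(List<List<Kv>> rows, RowFilterSketch filter) {
        List<String> out = new ArrayList<>();
        for (List<Kv> row : rows) {
            if (row.isEmpty()) continue;
            filter.reset();
            if (filter.filterAllRemaining()) break;
            byte[] key = row.get(0).row.getBytes(StandardCharsets.UTF_8);
            if (filter.filterRowKey(key, 0, key.length)) continue;
            List<String> kept = new ArrayList<>();
            for (Kv kv : row) {
                ReturnCode rc = filter.filterValue(kv);
                if (rc == ReturnCode.NEXT_ROW) break;
                if (rc == ReturnCode.INCLUDE) kept.add(kv.row + "/" + kv.column);
            }
            if (!filter.filterRow()) out.addAll(kept);      // true would drop the row
        }
        return out;
    }
}
```

Because columns arrive in sorted order, returning NEXT_ROW after the Nth column lets the caller stop reading the row early instead of skipping the remaining values one by one.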
> >
> > On Tue, May 26, 2009 at 11:39 PM, Jonathan Gray <jlist@streamy.com> wrote:
> >
> > > This sounds like a good initial approach for a new filter interface.
> > >
> > > +1 on moving forward with what you propose, allowing for modifications as
> > > we reimplement and integrate.
> > >
> > > Good stuff, Ryan!
> > >
> > > JG
> > >
> > > On Tue, May 26, 2009 11:28 pm, Ryan Rawson wrote:
> > > > Hi all,
> > > >
> > > >
> > > > With HBASE-1304, it's time to normalize and review our filter API.
> > > >
> > > >
> > > > Here are a few givens:
> > > > - all calls must be byte[] buffer, int offset, int length
> > > > - maybe we can have calls for KeyValue (which encodes all parts of the
> > > >   key & value as per the name)
> > > > - we'd like to get rid of the calls:
> > > > --   boolean filterRow(final SortedMap<byte [], Cell> columns);
> > > > --   boolean filterRow(final List<KeyValue> results);
> > > > These calls are expensive, and there is no reason to have them.
> > > >
> > > >
> > > > Here is a proposal, imagine a filter will see this sequence of calls:
> > > > - reset()
> > > > - filterRowKey(byte[],int,int) - true to include row, false to skip to
> > > >   next row
> > > > - filterKeyValue(KeyValue) - true to include key/value, false to skip
> > > >   -- can choose to filter on family, qualifier, value, anything really.
> > > > - filterRow() - true to include entire row, false to post-hoc veto row
> > > >
> > > >
> > > > In this case one could implement the "filterIfColumnMissing" feature of
> > > > ColumnValueFilter by carrying state and returning false from filterRow()
> > > > to veto the row based on the columns/values we didn't see.
> > > >
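A small sketch of the "carry state, then veto in filterRow()" idea described above. It follows the boolean convention of this first proposal (true means include, false means skip or veto). `RequireColumnFilter` and `RequireColumnDemo` are hypothetical names, and filterKeyValue() takes plain column/value strings here as a stand-in for KeyValue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical filter: keep a row only if the required column was seen in it.
final class RequireColumnFilter {
    private final String requiredColumn;
    private boolean sawRequired;

    RequireColumnFilter(String requiredColumn) { this.requiredColumn = requiredColumn; }

    // Called between rows: forget what the previous row contained.
    void reset() { sawRequired = false; }

    // true = include the row so far; this filter never rejects on the key.
    boolean filterRowKey(byte[] buffer, int offset, int length) { return true; }

    // true = include the key/value; remember whether the required column showed up.
    boolean filterKeyValue(String column, String value) {
        if (column.equals(requiredColumn)) sawRequired = true;
        return true;
    }

    // false = post-hoc veto: the row never contained the required column.
    boolean filterRow() { return sawRequired; }
}

final class RequireColumnDemo {
    // Each row is given as its column names; returns the indexes of rows that survive.
    static List<Integer> keptRows(String[][] rows, RequireColumnFilter filter) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < rows.length; i++) {
            filter.reset();
            for (String column : rows[i]) {
                filter.filterKeyValue(column, "");
            }
            if (filter.filterRow()) kept.add(i);
        }
        return kept;
    }
}
```

The key point is that the decision cannot be made in filterKeyValue(): a missing column is only known once the whole row has been seen, which is why the veto lives in filterRow().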
> > > > In any of these cases, all these functions will be called quite
> > > > frequently, so efficiency of the code is paramount.  It's probable that
> > > > filterRowKey() will be 'cached' by the calling code, but filterKeyValue()
> > > > is called for nearly every single value we would normally return (ie: it's
> > > > applied _AFTER_ column matching and version and timestamp and delete
> > > > tracking).
> > > >
> > > > The goal is to:
> > > > (a) make the implementation easy and performant
> > > > (b) make the API normative and easy to code for
> > > > (c) make everything work
> > > >
> > > >
> > > > Thoughts?
> > > > -ryan
> > > >
> > > >
> > >
> > >
> >
> >
> >
> >
> >
>
