hbase-user mailing list archives

From Jerry Lam <chiling...@gmail.com>
Subject Re: setTimeRange and setMaxVersions seem to be inefficient
Date Tue, 28 Aug 2012 20:52:33 GMT
Hi Lars:

I see. Please refer to the inline comment below.

Best Regards,

Jerry

On Tue, Aug 28, 2012 at 2:21 PM, lars hofhansl <lhofhansl@yahoo.com> wrote:

> What I was saying was: It depends. :)
>
> First off, how do you get to 1000 versions? In 0.94++ older versions are
> pruned upon flush, so you need 333 flushes (assuming 3 versions on the CF)
> to get 1000 versions.
>

I forgot that the default number of versions to keep is 3. If this is what
people use most of the time, then yes, you are right for this type of scenario,
where the number of versions kept per column is small.

> By that time some compactions will have happened and you're back to close
> to 3 versions (maybe 9, 12, or 15 or so, depending on how many store files
> you have).
>
> Now, if you have that many versions because you set VERSIONS=>1000
> in your CF... Then imagine you have 100 columns with 1000 versions each.
>

Yes, imagine I set VERSIONS => Long.MAX_VALUE (i.e. I will manage the
versioning myself)

> In your scenario below you'd do 100000 comparisons if the filter would be
> evaluated after the version counting. But only 1100 with the current code.
> (or at least in that ball park)
>

This is where I don't quite understand what you mean.

If the framework counted the number of ReturnCode.INCLUDE results and stopped
feeding KeyValues into the filterKeyValue method once it reached the count
specified in setMaxVersions (i.e. 1 for the case we discussed), shouldn't
that be just 100 comparisons (at most) instead of 1100? Maybe I don't
understand how the current code works...
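
To make the numbers in this exchange easier to compare, here is a toy model of the 100-column, 1000-version example. It is only a sketch and not HBase code: the class `FilterCallModel`, the `Policy` names and the zero-padded column names (chosen so that the prefix matches exactly one column) are all assumptions made for illustration. A non-matching column is assumed to cost one filter call before it is skipped, which is how the seek hint is counted elsewhere in this thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the scan discussed in this thread; none of this is HBase code. */
class FilterCallModel {

    /** Hypothetical scan policies, named here for illustration only. */
    enum Policy {
        FILTER_EVERY_KV,         // the filter sees every version of every column
        FILTER_WITH_SEEK_HINT,   // a non-matching column is left after one filter call
        STOP_AFTER_MAX_VERSIONS  // as above, and a column is left once maxVersions KVs were included
    }

    /** Counts how often a prefix filter would be invoked for one row. */
    static long countFilterCalls(List<String> columns, int versionsPerColumn,
                                 String prefix, int maxVersions, Policy policy) {
        long calls = 0;
        for (String column : columns) {
            int included = 0;
            for (int version = 0; version < versionsPerColumn; version++) {
                calls++; // one filterKeyValue-style invocation
                if (column.startsWith(prefix)) {
                    included++;
                    if (policy == Policy.STOP_AFTER_MAX_VERSIONS && included >= maxVersions) {
                        break; // proposed behaviour: go to the next column
                    }
                } else if (policy != Policy.FILTER_EVERY_KV) {
                    break; // modelled seek hint: skip the remaining versions
                }
            }
        }
        return calls;
    }

    /** Builds column names col-001, col-002, ... */
    static List<String> columns(int n) {
        List<String> names = new ArrayList<>();
        for (int i = 1; i <= n; i++) {
            names.add(String.format("col-%03d", i));
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> cols = columns(100);
        for (Policy policy : Policy.values()) {
            // prints 100000, 1099 and 100 for the three policies, in that order
            System.out.println(policy + ": "
                + countFilterCalls(cols, 1000, "col-002", 1, policy));
        }
    }
}
```

Under these assumptions the three policies need 100000, 1099 and 100 filter calls respectively, which are the orders of magnitude being debated here.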



>
> The gist is: One can construct scenarios where one approach is better than
> the other. Only one order is possible.
> If you write a custom filter and you care about these things you should
> use the seek hints.
>
> -- Lars
>
>
> ----- Original Message -----
> From: Jerry Lam <chilinglam@gmail.com>
> To: user@hbase.apache.org; lars hofhansl <lhofhansl@yahoo.com>
> Cc:
> Sent: Tuesday, August 28, 2012 7:17 AM
> Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
>
> Hi Lars:
>
> Thanks for the reply.
> I need to understand if I misunderstood the perceived inefficiency because
> it seems you don't think quite the same.
>
> Let's say, as an example, we have 1 row with 2 columns (col-1 and col-2) in a
> table and each column has 1000 versions. Using the following code (the code
> may have errors and might not compile):
> /*
>  * This is a very simple use case of a ColumnPrefixFilter.
>  * In fact, all other filters that make use of filterKeyValue will see
>  * similar performance problems, which concern me when the number of
>  * versions per column could be huge.
>  */
> // 'table' is assumed to be an already opened HTable.
> Filter filter = new ColumnPrefixFilter(Bytes.toBytes("col-2"));
> Scan scan = new Scan();
> scan.setFilter(filter);
> ResultScanner scanner = table.getScanner(scan);
> try {
>     for (Result result : scanner) {
>         for (KeyValue kv : result.raw()) {
>             System.out.println("KV: " + kv + ", Value: "
>                 + Bytes.toString(kv.getValue()));
>         }
>     }
> } finally {
>     // Release the scanner even if iteration fails.
>     scanner.close();
> }
>
> Implicitly, the number of versions returned per column is 1 (the latest
> version). A user might expect that only 2 comparisons for the column prefix
> are needed (1 for col-1 and 1 for col-2), but in fact the filterKeyValue
> method in ColumnPrefixFilter is called 1001 times (1 for col-1 and 1000 for
> col-2, i.e. 1 per version), because all versions of a column obviously
> share the same prefix. For col-1, it will skip ahead using
> SEEK_NEXT_USING_HINT, which should skip the remaining 999 versions of col-1.
>
> In summary, the 1000 comparisons (5000 byte comparisons) for the column
> prefix "col-2" are wasted because only 1 version is returned to the user. Also,
> I believe this inefficiency is hidden from the user code but it affects all
> filters that use filterKeyValue as the main execution for filtering KVs. Do
> we have a case to improve HBase to handle this inefficiency? :) It seems
> valid unless you prove otherwise.
>
> Best Regards,
>
> Jerry
>
>
>
> On Tue, Aug 28, 2012 at 12:54 AM, lars hofhansl <lhofhansl@yahoo.com>
> wrote:
>
> > First off regarding "inefficiency"... If version counting would happen
> > first and then filter were executed we'd have folks "complaining" about
> > inefficiencies as well:
> > ("Why does the code have to go through the versioning stuff when my
> > filter filters the row/column/version anyway?")  ;-)
> >
> >
> > For your problem, you want to make use of "seek hints"...
> >
> > In addition to INCLUDE you can return NEXT_COL, NEXT_ROW, or even
> > SEEK_NEXT_USING_HINT from Filter.filterKeyValue(...).
> >
> > That way the scanning framework will know to skip ahead to the next
> > column, row, or a KV of your choosing. (see Filter.filterKeyValue and
> > Filter.getNextKeyHint).
> >
> > (as an aside, it would probably be nice if Filters also had
> > INCLUDE_AND_NEXT_COL, INCLUDE_AND_NEXT_ROW, internally used by
> > StoreScanner)
> >
> > Have a look at ColumnPrefixFilter as an example.
> > I also wrote a short post here:
> >
> > http://hadoop-hbase.blogspot.com/2012/01/filters-in-hbase-or-intra-row-scanning.html
> >
> > Does that help?
> >
> > -- Lars
> >
> >
> > ----- Original Message -----
> > From: Jerry Lam <chilinglam@gmail.com>
> > To: "user@hbase.apache.org" <user@hbase.apache.org>
> > Cc: "user@hbase.apache.org" <user@hbase.apache.org>
> > Sent: Monday, August 27, 2012 5:59 PM
> > Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
> >
> > Hi Lars:
> >
> > Thanks for confirming the inefficiency of the implementation for this
> > case. In my case, a column can have more than 10K versions, so I need a
> > quick way to stop the scan from digging into the column once there is a
> > match (ReturnCode.INCLUDE). It would be nice to have a ReturnCode that can
> > notify the framework to stop and go to the next column once the number of
> > versions specified in setMaxVersions is met.
> >
> > For now, I guess I have to hack it in the custom filter (i.e. I keep the
> > count myself)? If you have a better way to achieve this, please share :)
> >
> > Best Regards,
> >
> > Jerry
> >
> > Sent from my iPad (sorry for spelling mistakes)
> >
> > On 2012-08-27, at 20:11, lars hofhansl <lhofhansl@yahoo.com> wrote:
> >
> > > Currently filters are evaluated before we do version counting.
> > >
> > > Here's a comment from ScanQueryMatcher.java:
> > >     /**
> > >      * Filters should be checked before checking column trackers. If we do
> > >      * otherwise, as was previously being done, ColumnTracker may increment
> > >      * its counter for even that KV which may be discarded later on by
> > >      * Filter. This would lead to incorrect results in certain cases.
> > >      */
> > >
> > >
> > > So this is by design. (Doesn't mean it's correct or desirable, though.)
> > >
> > > -- Lars
> > >
> > >
> > > ----- Original Message -----
> > > From: Jerry Lam <chilinglam@gmail.com>
> > > To: user <user@hbase.apache.org>
> > > Cc:
> > > Sent: Monday, August 27, 2012 2:40 PM
> > > Subject: setTimeRange and setMaxVersions seem to be inefficient
> > >
> > > Hi HBase community:
> > >
> > > I tried to use setTimeRange and setMaxVersions to limit the number of KVs
> > > returned per column. The behaviour is as I would expect: setTimeRange(0, T + 1)
> > > and setMaxVersions(1) give me ONE version of a KV with a timestamp that
> > > is less than or equal to T.
> > > However, I noticed that all versions of the KeyValue for a particular
> > > column are processed through a custom filter I implemented, even though I
> > > specify setMaxVersions(1) and setTimeRange(0, T + 1). I expected that if
> > > ONE KV of a particular column has ReturnCode.INCLUDE, the framework will
> > > jump to the next COL instead of iterating through all versions of the
> > > column.
> > >
> > > Can someone confirm whether this is the expected behaviour (iterating
> > > through all versions of a column before setMaxVersions takes effect)? If
> > > this is the expected behaviour, what is your recommendation to speed this
> > > up?
> > >
> > > Best Regards,
> > >
> > > Jerry
> > >
> >
> >
>
>
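
The workaround mentioned in this thread (keeping the version count inside a custom filter and returning NEXT_COL once enough versions were included) might look roughly like the sketch below. The types `ReturnCode`, `Cell`, `CountingFilter` and the `scan` driver are stand-ins written for this example, not the HBase classes; in HBase the same decision would be made in Filter.filterKeyValue, and the per-row state would also need to be reset between rows, which is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in types for illustration only; these are not the HBase classes. */
class VersionLimitSketch {

    enum ReturnCode { INCLUDE, NEXT_COL }

    /** A minimal cell: column qualifier plus timestamp. */
    static final class Cell {
        final String qualifier;
        final long timestamp;
        Cell(String qualifier, long timestamp) {
            this.qualifier = qualifier;
            this.timestamp = timestamp;
        }
    }

    /** Includes at most maxVersions cells per column, then asks for the next column. */
    static final class CountingFilter {
        private final int maxVersions;
        private String currentQualifier;
        private int included;
        int calls; // instrumentation: how often the filter was invoked

        CountingFilter(int maxVersions) {
            this.maxVersions = maxVersions;
        }

        ReturnCode filterCell(Cell cell) {
            calls++;
            if (!cell.qualifier.equals(currentQualifier)) {
                currentQualifier = cell.qualifier; // a new column starts: reset the count
                included = 0;
            }
            if (included >= maxVersions) {
                return ReturnCode.NEXT_COL; // enough versions seen for this column
            }
            included++;
            return ReturnCode.INCLUDE;
        }
    }

    /** Simplified scan loop that honours NEXT_COL by skipping the rest of the column. */
    static List<Cell> scan(List<Cell> row, CountingFilter filter) {
        List<Cell> out = new ArrayList<>();
        int i = 0;
        while (i < row.size()) {
            Cell cell = row.get(i);
            if (filter.filterCell(cell) == ReturnCode.INCLUDE) {
                out.add(cell);
                i++;
            } else {
                while (i < row.size() && row.get(i).qualifier.equals(cell.qualifier)) {
                    i++; // skip the remaining versions of this column
                }
            }
        }
        return out;
    }

    /** Builds one row with the given number of columns, newest version first. */
    static List<Cell> buildRow(int columnCount, int versions) {
        List<Cell> row = new ArrayList<>();
        for (int c = 1; c <= columnCount; c++) {
            for (long ts = versions; ts >= 1; ts--) {
                row.add(new Cell("col-" + c, ts));
            }
        }
        return row;
    }

    public static void main(String[] args) {
        CountingFilter filter = new CountingFilter(1);
        List<Cell> result = scan(buildRow(2, 1000), filter);
        // prints "2 cells returned, 4 filter calls"
        System.out.println(result.size() + " cells returned, " + filter.calls + " filter calls");
    }
}
```

In this model each column costs two filter calls (one INCLUDE, one NEXT_COL) instead of one call per version. A combined code such as the INCLUDE_AND_NEXT_COL idea raised earlier in the thread would presumably save the second call.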
