hbase-dev mailing list archives

From lars hofhansl <la...@apache.org>
Subject Re: Simple statistics per region
Date Wed, 27 Feb 2013 00:27:13 GMT
Just had a discussion with the Phoenix folks (my cubicle neighbors :) ).
Turns out that the types of problems we're trying to solve for Phoenix would need equal-depth
histograms, whereas for decisions such as picking a secondary index equal-width histograms are
often used.
So a key piece here is a proper framework through which stats can be hooked up and calculated.
OSGi for coprocessors would be nice, but may also be overkill for this.
Maybe something like the chores framework would work.
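For context, here is a minimal, self-contained sketch of the two histogram flavors mentioned above; the class and method names are invented for illustration and are not from HBase or Phoenix. Equal-width buckets split the value range into intervals of the same size, while equal-depth buckets are chosen so that each holds roughly the same number of values.

import java.util.Arrays;

// Hypothetical illustration only; these names are not from HBase or Phoenix.
public class HistogramSketch {

  // Equal-width: split the value range into buckets of identical width and
  // count how many values fall into each one.
  static int[] equalWidthCounts(long[] sortedValues, int buckets) {
    long min = sortedValues[0];
    long max = sortedValues[sortedValues.length - 1];
    double width = Math.max(1.0, (double) (max - min + 1) / buckets);
    int[] counts = new int[buckets];
    for (long v : sortedValues) {
      counts[Math.min(buckets - 1, (int) ((v - min) / width))]++;
    }
    return counts;
  }

  // Equal-depth: pick bucket boundaries so each bucket holds roughly the same
  // number of values (the shape that helps estimate how big a key range is).
  static long[] equalDepthBoundaries(long[] sortedValues, int buckets) {
    long[] bounds = new long[buckets];
    int n = sortedValues.length;
    for (int i = 1; i <= buckets; i++) {
      bounds[i - 1] = sortedValues[Math.max(0, i * n / buckets - 1)];
    }
    return bounds;
  }

  public static void main(String[] args) {
    long[] values = {1, 2, 2, 3, 3, 3, 50, 90, 95, 99};
    System.out.println(Arrays.toString(equalWidthCounts(values, 4)));     // skew: most values land in one bucket
    System.out.println(Arrays.toString(equalDepthBoundaries(values, 4))); // boundaries adapt to the skew
  }
}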

In either case, there will be core stats (that would allow HBase to decide between a scan
and a multi get), and user defined stats to help higher layers such as Phoenix, or an indexing
library.


-- Lars



________________________________
 From: Enis Söztutar <enis.soz@gmail.com>
To: "dev@hbase.apache.org" <dev@hbase.apache.org> 
Sent: Tuesday, February 26, 2013 4:15 PM
Subject: Re: Simple statistics per region
 
+1 for core. I can see that histograms might help us in automatic splits
and merges as well.


On Tue, Feb 26, 2013 at 3:27 PM, Andrew Purtell <apurtell@apache.org> wrote:

> If this is going to be a CP then other CPs need an easy way to use the
> output stats. If a subsequent proposal from core requires statistics from
> this CP, does that then mandate that it itself must be a CP? What if that
> can't work?
>
> Putting the stats into a table addresses the first concern.
>
> For the second, it is an issue that comes up I think when building a
> generally useful shared function as a CP. Please consider inserting my
> earlier comments about OSGi here, in that we trend toward a real module
> system if we're not careful (unless that is the aim).
>
>
> On Tue, Feb 26, 2013 at 2:31 PM, Jesse Yates <jesse.k.yates@gmail.com> wrote:
>
> > TL;DR Making it part of the UI and ensuring that you don't load things the
> > wrong way seem to be the only reasons for making this part of core -
> > certainly not bad reasons. They are fairly easy to handle as a CP though,
> > so maybe it's not necessary immediately.
> >
> > I ended up writing a simple stats framework last week (ok, it's like 6
> > classes) that makes it easy to create your own stats for a table. It's all
> > coprocessor-based and, as Lars suggested, hooks up to the major compactions
> > to let you build per-column-per-region stats and writes them to a 'system'
> > table = "_stats_".
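A purely hypothetical sketch of the shape such a pluggable stats framework might take; none of these types exist in HBase or in the framework described here. The idea is that a compaction observer calls update() for every cell that survives the compaction and then writes snapshot() to the "_stats_" table, one row per region and column family.

import java.util.HashMap;
import java.util.Map;

// Hypothetical pluggable per-store statistic; names invented for illustration.
interface StatisticTracker {
  void update(byte[] row, byte[] qualifier, byte[] value);
  Map<String, byte[]> snapshot();   // stat name -> serialized value
}

// Example tracker: min and max row key seen during one compaction.
class MinMaxKeyTracker implements StatisticTracker {
  private byte[] min, max;

  @Override
  public void update(byte[] row, byte[] qualifier, byte[] value) {
    if (min == null || compare(row, min) < 0) min = row.clone();
    if (max == null || compare(row, max) > 0) max = row.clone();
  }

  @Override
  public Map<String, byte[]> snapshot() {
    Map<String, byte[]> out = new HashMap<>();
    if (min != null) out.put("min_key", min);
    if (max != null) out.put("max_key", max);
    return out;
  }

  // Unsigned lexicographic comparison, the same ordering HBase uses for row keys.
  private static int compare(byte[] a, byte[] b) {
    for (int i = 0; i < Math.min(a.length, b.length); i++) {
      int d = (a[i] & 0xff) - (b[i] & 0xff);
      if (d != 0) return d;
    }
    return a.length - b.length;
  }
}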
> >
> > With the framework you could easily write your own custom stats, from
> > simple things like min/max keys to things like fixed width or fixed depth
> > histograms, or even more complicated. There has been some internal
> > discussion around how to make this available to the community (as part of
> > Phoenix, core in HBase, an independent github project, ...?).
> >
> > The biggest issue around having it all CP-based is that you need to be
> > really careful to ensure that it comes _after_ all the other compaction
> > coprocessors. This way you know exactly what keys come out and have
> > correct statistics (for that point in time). Not a huge issue - you just
> > need to be careful. Baking the stats framework into HBase is really nice
> > in that we can be sure we never mess this up.
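One way to bias that ordering, sketched here only as an assumption: the observer class name is hypothetical, and this relies on my reading that coprocessors with numerically larger priority values are invoked later in the hook chain, which should be verified against the HBase version in use before relying on it.

import java.io.IOException;

import org.apache.hadoop.hbase.Coprocessor;
import org.apache.hadoop.hbase.HTableDescriptor;

// Register a (hypothetical) stats observer so it sorts after coprocessors
// loaded at the default user priority.
public class RegisterStatsObserver {
  public static void addStatsObserver(HTableDescriptor desc) throws IOException {
    desc.addCoprocessor(
        "org.example.stats.StatsCompactionObserver", // hypothetical observer class
        null,                                        // jar assumed to already be on the region server classpath
        Coprocessor.PRIORITY_USER + 1,               // assumption: larger value runs after default user CPs
        null);                                       // no per-table configuration for the observer
  }
}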
> >
> > Building it into the core of HBase isn't going to get us per-region
> > statistics without a whole bunch of pain - compactions per store make this
> > a pain to actualize; there isn't a real advantage here, as I'd like to
> > keep it per CF, if only not to change all the things.
> >
> > Further, this would be a great first use-case for real system tables.
> > Mixing this data with .META. is going to be a bit of a mess, especially
> > for doing clean scans, etc. to read the stats. Also, I'd be gravely
> > concerned to muck with such important state, especially if we make a
> > 'statistic' a pluggable element (so people can easily expand their own).
> >
> > And sure, we could make it make pretty graphs on the UI, no harm in it
> > and very little overhead :)
> >
> > -------------------
> > Jesse Yates
> > @jesse_yates
> > jyates.github.com
> >
> >
> > On Tue, Feb 26, 2013 at 2:08 PM, Stack <stack@duboce.net> wrote:
> >
> > > On Fri, Feb 22, 2013 at 10:40 PM, lars hofhansl <larsh@apache.org> wrote:
> > >
> > > > This topic comes up now and then (see recent discussion about
> > > > translating multi Gets into Scan+Filter).
> > > >
> > > > It's not that hard to keep statistics as part of compactions.
> > > > I envision two knobs:
> > > > 1. Max number of distinct values to track directly. If a column has
> > > > fewer than this # of values, keep track of their occurrences explicitly.
> > > > 2. Number of (equal width) histogram partitions to maintain.
> > > >
> > > > Statistics would be kept per store (i.e. per region per column family)
> > > > and stored into an HBase table (one row per store). Initially we could
> > > > just support major compactions that atomically insert a new version of
> > > > the statistics for the store.
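To make "one row per store" concrete, here is a hedged sketch using the 0.94-era client API; the row-key layout ("table,region,family") and the column names are invented for illustration. Packing everything into a single Put keeps the per-store snapshot atomic, matching the "atomically insert a new version" idea above.

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Builds the one Put a major compaction would write for a single store.
public class StatsRowExample {
  public static Put buildStatsRow(String table, String encodedRegion, String family,
      long distinctCount, long bucketWidth, long[] bucketCounts) {
    Put put = new Put(Bytes.toBytes(table + "," + encodedRegion + "," + family));
    byte[] cf = Bytes.toBytes("stats");
    put.add(cf, Bytes.toBytes("distinct_count"), Bytes.toBytes(distinctCount));
    put.add(cf, Bytes.toBytes("bucket_width"), Bytes.toBytes(bucketWidth));
    for (int i = 0; i < bucketCounts.length; i++) {
      put.add(cf, Bytes.toBytes("bucket_" + i), Bytes.toBytes(bucketCounts[i]));
    }
    return put;
  }
}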
> > > >
> > > >
> > > Sounds great.
> > >
> > > In .META. add columns for each cf on each region row?  Or another
> > > table?
> > >
> > > What kind of stats would you keep? Would they be useful for operators?
> > > Or just for stuff like, say, Phoenix making decisions?
> > >
> > >
> > >
> > > > A simple implementation (not knowing ahead of time how many values it
> > > > will see during the compaction) could start by keeping track of
> > > > individual values for columns. If it gets past the max # of distinct
> > > > values to track, start with equal width histograms (using the distinct
> > > > values picked up so far to estimate an initial partition width).
> > > > If the number of partitions gets larger than what was configured it
> > > > would increase the width and merge the previous counts into the new
> > > > width (which means the new partition width must be a multiple of the
> > > > previous size).
> > > > There's probably a lot of other fanciness that could be used here
> > > > (haven't spent a lot of time thinking about details).
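A minimal sketch of one way that adaptive scheme could look, based on my reading of the paragraph above rather than any actual HBase or Phoenix code; the class and knob names are invented. Exact counts are kept until the first knob is exceeded, then it falls back to equal-width buckets whose width doubles (and whose counts are merged) whenever the second knob would be exceeded.

import java.util.Map;
import java.util.TreeMap;

// Hypothetical adaptive equal-width histogram; not an actual implementation.
public class AdaptiveHistogram {
  private final int maxDistinct;                 // knob 1: max distinct values tracked exactly
  private final int maxBuckets;                  // knob 2: max number of equal-width partitions
  private final TreeMap<Long, Long> exact = new TreeMap<>();
  private TreeMap<Long, Long> buckets;           // bucket lower bound -> count; null while still exact
  private long width;

  public AdaptiveHistogram(int maxDistinct, int maxBuckets) {
    this.maxDistinct = maxDistinct;
    this.maxBuckets = maxBuckets;
  }

  public void add(long value) {
    if (buckets == null) {
      exact.merge(value, 1L, Long::sum);
      if (exact.size() > maxDistinct) {
        switchToHistogram();                     // too many distinct values: fall back to buckets
      }
    } else {
      addToBucket(value, 1L);
    }
  }

  // Estimate an initial width from the values seen so far, then re-bin them.
  private void switchToHistogram() {
    long min = exact.firstKey();
    long max = exact.lastKey();
    width = Math.max(1, (max - min + 1) / maxBuckets);
    buckets = new TreeMap<>();
    for (Map.Entry<Long, Long> e : exact.entrySet()) {
      addToBucket(e.getKey(), e.getValue());
    }
    exact.clear();
  }

  private void addToBucket(long value, long count) {
    buckets.merge(lowerBound(value), count, Long::sum);
    while (buckets.size() > maxBuckets) {
      widen();
    }
  }

  // Double the width and merge the previous counts; the new width is a multiple
  // of the old one, so every old bucket maps cleanly into exactly one new bucket.
  private void widen() {
    width *= 2;
    TreeMap<Long, Long> merged = new TreeMap<>();
    for (Map.Entry<Long, Long> e : buckets.entrySet()) {
      merged.merge(lowerBound(e.getKey()), e.getValue(), Long::sum);
    }
    buckets = merged;
  }

  private long lowerBound(long value) {
    return Math.floorDiv(value, width) * width;
  }
}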
> > > >
> > > >
> > > > Is this something that should be in core HBase or rather be
> > > > implemented as a coprocessor?
> > > >
> > >
> > >
> > > I think it could go in core if it generated pretty pictures.
> > >
> > > St.Ack
> > >
> >
>
>
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>