hbase-user mailing list archives

From "Buttler, David" <buttl...@llnl.gov>
Subject RE: feature request (count)
Date Mon, 06 Jun 2011 20:14:42 GMT
I will second the idea of having just a count of key-value entries.

However, I am not sure about Matt's idea of deriving the number of rows/KV entries from
numPuts / numDeletes.  If I have maxVersions=1 and I put 1000 KV entries with the same key,
wouldn't that change my count by a different amount than if maxVersions=1000?

Dave
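Dave's objection can be illustrated with a toy model of a versioned store (an illustrative sketch only, not HBase code; ToyStore and its methods are invented for this example):

```python
# Toy model of a versioned KV store: each put appends a version,
# and compaction trims each key down to at most max_versions cells.
class ToyStore:
    def __init__(self, max_versions):
        self.max_versions = max_versions
        self.cells = {}  # key -> list of versions

    def put(self, key, value):
        self.cells.setdefault(key, []).append(value)

    def compact(self):
        # Keep only the newest max_versions entries per key.
        for key in self.cells:
            self.cells[key] = self.cells[key][-self.max_versions:]

    def kv_count(self):
        return sum(len(versions) for versions in self.cells.values())

# 1000 puts to the same key change the post-compaction KV count by a
# different amount depending on max_versions:
for mv in (1, 1000):
    store = ToyStore(max_versions=mv)
    for i in range(1000):
        store.put("row1", i)
    store.compact()
    print(mv, store.kv_count())  # prints "1 1" then "1000 1000"
```

So a running count maintained from numPuts alone would drift as soon as compaction trims excess versions, which is exactly the discrepancy Dave is pointing at.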

-----Original Message-----
From: Matt Corgan [mailto:mcorgan@hotpads.com] 
Sent: Friday, June 03, 2011 4:12 PM
To: user@hbase.apache.org
Cc: billgraham@gmail.com
Subject: Re: feature request (count)

Storing numPuts, numDeletes, and maxVersions of each block in the block index
could be useful.  If a block is all puts, no deletes, and maxVersions=1,
then you are more sure of the count.  If the block indexes indicate that no
other blocks overlap, then the count could be correct without ever hitting
the disk.

Those metrics could be useful for speeding up compactions as well.  Maybe
you can avoid uncompressing and recompressing the data block.
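Matt's block-metadata idea could be sketched like this (a hypothetical sketch; BlockMeta and the index layout are invented for illustration and are not HBase internals):

```python
from dataclasses import dataclass

@dataclass
class BlockMeta:
    # Hypothetical per-block fields of the kind Matt suggests
    # storing in the block index.
    num_puts: int
    num_deletes: int
    max_versions: int
    first_row: str
    last_row: str

def exact_count_without_io(blocks):
    """Return the exact KV count if the metadata alone guarantees it,
    else None (meaning the blocks would have to be read from disk)."""
    # Every block must be pure puts with a single version per cell.
    if any(b.num_deletes > 0 or b.max_versions != 1 for b in blocks):
        return None
    # No two blocks may cover overlapping row ranges, otherwise the
    # same row could hold cells in more than one block.
    ordered = sorted(blocks, key=lambda b: b.first_row)
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.first_row <= prev.last_row:
            return None
    return sum(b.num_puts for b in blocks)
```

For example, two non-overlapping, delete-free blocks with maxVersions=1 let the count be answered from metadata alone; any deletes, extra versions, or range overlap force a real read. Note this gives a KV count, which matches a row count only when each row holds a single cell.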


On Fri, Jun 3, 2011 at 3:56 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:

> Stats are a good idea, having a fuzzy count is sometimes good enough.
> Getting exact counts without actually reading the data will be very
> difficult.  Perhaps there will be future clever ideas that make this
> easier?
>
> On Fri, Jun 3, 2011 at 3:50 PM, Bill Graham <billgraham@gmail.com> wrote:
> > One alternative option is to calculate some stats during compactions and
> > store that somewhere for retrieval. The metrics wouldn't be up to date of
> > course, since they'd be stats from the last compaction time. I think that
> > would still be useful info to have, but it's different than what's being
> > requested.
> >
> >
> > On Fri, Jun 3, 2011 at 3:40 PM, Jack Levin <magnito@gmail.com> wrote:
> >
> >> "Each HFile knows how many KV entries there are in it, but this does
> >> not map in a general way to the
> >> number of rows, or the number of rows with a specific column."
> >>
> >> It would be nice to have an index like that; it would solve a lot of
> >> issues for people migrating from MySQL.  I assume that without the
> >> 'count' feature, people are resorting to storing dataset elements in
> >> other engines, which is not great, since you then end up requiring a
> >> non-hbase index to be consistent and authoritative for all of your
> >> datasets that require counts.
> >>
> >> -Jack
> >>
> >>
> >> On Fri, Jun 3, 2011 at 3:24 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
> >> > This is a commonly requested feature, and it remains unimplemented
> >> > because it is actually quite hard.  Each HFile knows how many KV
> >> > entries there are in it, but this does not map in a general way to the
> >> > number of rows, or the number of rows with a specific column. Keeping
> >> > track of the row count as new rows are created is also not as easy as
> >> > it seems - this is because a Put does not know if a row already exists
> >> > or not.  Making it aware of that fact would require doing a get before
> >> > a put - not cheap.
> >> >
> >> > -ryan
> >> >
> >> > On Fri, Jun 3, 2011 at 3:20 PM, Jack Levin <magnito@gmail.com> wrote:
> >> >> I have a feature request:  There should be a native function called
> >> >> 'count', that produces count of rows based on specific family filter,
> >> >> that is internal to HBASE and won't be required to read CELLs off the
> >> >> disk/cache.  Just count up the rows in the most efficient way
> >> >> possible.  I realize that family definitions are part of the cells,
> >> >> so it would be nice to have an index that somehow can produce low IO/CPU
> >> >> hit to hbase when doing a count (for example enabling an index like
> >> >> that in table schema would be how you turn it on for a specific
> >> >> family).
> >> >>
> >> >> Best,
> >> >>
> >> >> -Jack
> >> >>
> >> >
> >>
> >
>
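Ryan's point that keeping an exact row count forces a read before every write can be sketched with a toy table (illustrative Python only, not the HBase write path; CountingTable and its fields are invented):

```python
class CountingTable:
    """Toy table that maintains an exact row count.  Every put must
    first check whether the row already exists -- the get-before-put
    that Ryan describes as not cheap."""
    def __init__(self):
        self.rows = {}           # row key -> {column: value}
        self.row_count = 0
        self.reads_performed = 0

    def put(self, row, column, value):
        # The read-before-write that makes exact counting expensive:
        self.reads_performed += 1
        if row not in self.rows:
            self.row_count += 1
            self.rows[row] = {}
        self.rows[row][column] = value

t = CountingTable()
t.put("r1", "f:a", 1)
t.put("r1", "f:b", 2)   # same row: count stays 1
t.put("r2", "f:a", 3)
print(t.row_count, t.reads_performed)  # prints "2 3"
```

In a plain put path none of those three reads would happen, which is why HBase's blind-write model makes a maintained row count so much harder than it first appears.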