hbase-dev mailing list archives

From Ryan Rawson <ryano...@gmail.com>
Subject Re: count of rows in table
Date Fri, 30 Jul 2010 03:13:28 GMT
The RowCounter job counts rows. Its answer is how many distinct row keys
were in the table, approximately, over a given time range.

Even if the implementation uses a first-KV filter, nothing I just said is
false.

A KeyValue counter would tell you how many cells and versions there were in
total, don't you think?
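
To make the distinction concrete, here is a minimal sketch in plain Python. It
is a toy model, not the HBase API: the table is assumed to be a list of
(row key, column, timestamp) tuples already sorted by row key, and the names
`cells`, `count_rows` and `count_key_values` are made up for illustration.
Looking only at the first KeyValue of each row still yields the number of
distinct row keys, while counting every KeyValue yields cells plus versions.

```python
from itertools import groupby

# Toy model of a table: each cell (KeyValue) is (row_key, column, timestamp),
# sorted by row key as a scan would return them. All values are invented.
cells = [
    ("row1", "d:a", 3),
    ("row1", "d:a", 2),  # older version of the same cell
    ("row1", "i:b", 3),
    ("row2", "d:a", 1),
    ("row3", "v:c", 5),
    ("row3", "v:c", 4),  # older version of the same cell
]


def count_rows(sorted_cells):
    """Count distinct row keys: one tick per group of cells sharing a row key."""
    return sum(1 for _row, _kvs in groupby(sorted_cells, key=lambda kv: kv[0]))


def count_key_values(sorted_cells):
    """Count every KeyValue, i.e. every cell and every version."""
    return len(sorted_cells)


print(count_rows(cells))        # 3
print(count_key_values(cells))  # 6
```

The row count only needs one KeyValue per row, which is why a first-KV filter
does not change the answer.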

On Jul 29, 2010 7:58 PM, "Angus He" <angushe@gmail.com> wrote:
> Column names are optional for the RowCounter job.
>
> To be more accurate, RowCounter is a KeyValueCounter.
> If no columns are specified, only the first KeyValue of each row is
> included, which gives the row count.
>
>
> On Fri, Jul 30, 2010 at 9:28 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>> If someone can share the commandline for running RowCounter, that would be
>> great.
>>
>> Also, hbase shell count doesn't require column name. Why does RowCounter
>> require it ?
>>
>> Thanks
>>
>> On Thu, Jul 29, 2010 at 4:55 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> That table appears to be empty.  Eg:
>>>
>>> 10/07/29 22:38:43 INFO mapred.JobClient:     Map input records=0
>>>
>>>
>>> So back to the count issue... Counting in databases is a classic
>>> problem. Unless your DB system is keeping stats on how many
>>> inserts/deletes and thus how big it thinks the table is, you have to
>>> count all the rows by reading them.  HBase is no different, and a
>>> little harder, because we have a variable length data format, so we
>>> can't just estimate row sizes from file sizes.  Keeping distributed
>>> stats is not impossible, but certainly not on any priority list to be
>>> implemented - of course JIRAs/patches welcome etc.
>>>
>>> -ryan
>>>
>>>
>>> On Thu, Jul 29, 2010 at 3:48 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>>> > We use HBase 0.20.5
>>> >
>>> > Here is the snippet from RowCounter output:
>>> >
>>> > 10/07/29 22:38:42 DEBUG client.HTable$ClientScanner: Finished with scanning
>>> > at REGION => {NAME =>
>>> > '2__HB_NOINC_ORCL_SQLLDR_0728-THREEGPPSPEECHCALLS-1280408509541-0,DFF46493EB352D0E31CBFA4652E6EC06,1280412540858',
>>> > STARTKEY => 'DFF46493EB352D0E31CBFA4652E6EC06', ENDKEY => '', ENCODED =>
>>> > 1375318608, TABLE => {{NAME =>
>>> > '2__HB_NOINC_ORCL_SQLLDR_0728-THREEGPPSPEECHCALLS-1280408509541-0', FAMILIES
>>> > => [{NAME => 'd', COMPRESSION => 'GZ', VERSIONS => '2', TTL => '31536000',
>>> > BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}, {NAME =>
>>> > 'i', COMPRESSION => 'GZ', VERSIONS => '2', TTL => '31536000', BLOCKSIZE =>
>>> > '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}, {NAME => 'v',
>>> > COMPRESSION => 'GZ', VERSIONS => '2', TTL => '31536000', BLOCKSIZE =>
>>> > '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}}
>>> > 10/07/29 22:38:42 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0
>>> > is done. And is in the process of commiting
>>> > 10/07/29 22:38:42 INFO mapred.LocalJobRunner:
>>> > 10/07/29 22:38:42 INFO mapred.TaskRunner: Task attempt_local_0001_m_000000_0
>>> > is allowed to commit now
>>> > 10/07/29 22:38:42 INFO mapred.FileOutputCommitter: Saved output of task
>>> > 'attempt_local_0001_m_000000_0' to
>>> > file:/usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc
>>> > 10/07/29 22:38:42 INFO mapred.LocalJobRunner:
>>> > 10/07/29 22:38:42 INFO mapred.TaskRunner: Task
>>> > 'attempt_local_0001_m_000000_0' done.
>>> > 10/07/29 22:38:43 INFO mapred.JobClient:  map 100% reduce 0%
>>> > 10/07/29 22:38:43 INFO mapred.JobClient: Job complete: job_local_0001
>>> > 10/07/29 22:38:43 INFO mapred.JobClient: Counters: 6
>>> > 10/07/29 22:38:43 INFO mapred.JobClient:   FileSystemCounters
>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     FILE_BYTES_READ=1592883
>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=1624956
>>> > 10/07/29 22:38:43 INFO mapred.JobClient:   Map-Reduce Framework
>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map input records=0
>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Spilled Records=0
>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map input bytes=0
>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map output records=0
>>> >
>>> > [sjc1-hadoop8.sjc1:hadoop 3705]ls -l
>>> > /usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc/part-00000
>>> > -rwxrwxrwx 1 hadoop users 0 Jul 29 22:38
>>> > /usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc/part-00000
>>> >
>>> > But there are many records in the table I was querying.
>>> >
>>> > Can someone comment ?
>>> >
>>> > On Thu, Jul 29, 2010 at 2:26 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>>> >
>>> >> In 0.89 you can specify CACHE for the count command. Set it higher (it
>>> >> defaults to 10 rows per call).
>>> >>
>>> >> Also you can use the RowCounter MR job.
>>> >>
>>> >> J-D
>>> >>
>>> >> On Thu, Jul 29, 2010 at 2:22 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>>> >> > Hi,
>>> >> > The count method in HBase shell is quite slow.
>>> >> > Is there a way to obtain count faster ?
>>> >> >
>>> >> > Thanks
>>> >> >
>>> >>
>>> >
>>>
>>
>
>
>
> --
> Regards
> Angus
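
On the CACHE suggestion quoted above: a rough way to see why a larger value
helps is to count scanner round trips. The sketch below is a simplified model
in plain Python, not HBase code. It assumes each call returns at most `cache`
rows and ignores region boundaries and the final empty call; the function name
and the one-million-row figure are made up for illustration.

```python
import math


def scanner_round_trips(total_rows, cache):
    """Approximate number of scanner calls needed to read total_rows rows
    when each call returns at most `cache` rows (simplified model)."""
    if cache < 1:
        raise ValueError("cache must be at least 1")
    return math.ceil(total_rows / cache)


# Hypothetical table of one million rows, counted with different cache sizes.
for cache in (10, 100, 1000):
    print(cache, scanner_round_trips(1_000_000, cache))
```

Under this model the default of 10 rows per call costs a hundred times as many
calls as a cache of 1000, which is consistent with the shell count feeling slow.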
