hbase-dev mailing list archives

From Angus He <angu...@gmail.com>
Subject Re: count of rows in table
Date Fri, 30 Jul 2010 03:54:35 GMT
By the way, if users specify multiple columns, the current
implementation of RowCounter appears to combine them with a logical OR:
a row is counted if it contains any one of the columns.

Wouldn't AND be more reasonable?
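
To make the two semantics concrete, here is a toy sketch in Python. It is
not the RowCounter code and does not use the HBase API. Rows are modeled
as plain dicts, and the names count_rows, mode, and the column qualifiers
(d:a, i:b, v:c) are made up for illustration only.

```python
from typing import Dict, Iterable, List

# Toy model of a row: "family:qualifier" -> latest value (an assumption,
# not how HBase stores data).
Row = Dict[str, bytes]


def count_rows(rows: Iterable[Row], columns: List[str], mode: str = "or") -> int:
    """Count rows that contain the requested columns.

    mode="or":  a row counts if it has at least one of the columns.
    mode="and": a row counts only if it has every one of the columns.
    With no columns given, every non-empty row counts.
    """
    if mode not in ("or", "and"):
        raise ValueError("mode must be 'or' or 'and'")
    match = any if mode == "or" else all
    count = 0
    for row in rows:
        if not columns:
            # No column restriction: any row with at least one cell counts.
            if row:
                count += 1
        elif match(col in row for col in columns):
            count += 1
    return count


# Hypothetical table with four rows.
table = [
    {"d:a": b"1", "i:b": b"2"},  # has both columns
    {"d:a": b"3"},               # has only d:a
    {"i:b": b"4"},               # has only i:b
    {"v:c": b"5"},               # has neither
]

print(count_rows(table, ["d:a", "i:b"], mode="or"))   # 3
print(count_rows(table, ["d:a", "i:b"], mode="and"))  # 1
```

With OR, the first three rows are counted; with AND, only the first row
is. Which result a user expects from "count rows for these columns" is
exactly the question.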



On Fri, Jul 30, 2010 at 11:13 AM, Ryan Rawson <ryanobjc@gmail.com> wrote:
> The RowCounter job counts rows. Its answer is approximately how many distinct
> row keys were in the table over a given time range.
>
> Even if the implementation uses a first-KeyValue filter, nothing I just
> said is false.
>
> A KeyValue counter would tell you how many cells and versions there were
> in total, don't you think?
>
> On Jul 29, 2010 7:58 PM, "Angus He" <angushe@gmail.com> wrote:
>> Column names are optional for the RowCounter job.
>>
>> To be more accurate, RowCounter is a KeyValue counter.
>> If no columns are specified, only the first KeyValue of each row is
>> included, which yields the row count.
>>
>>
>> On Fri, Jul 30, 2010 at 9:28 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>>> If someone can share the command line for running RowCounter, that would be
>>> great.
>>>
>>> Also, hbase shell count doesn't require a column name. Why does RowCounter
>>> require it?
>>>
>>> Thanks
>>>
>>> On Thu, Jul 29, 2010 at 4:55 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> That table appears to be empty, e.g.:
>>>>
>>>> 10/07/29 22:38:43 INFO mapred.JobClient:     Map input records=0
>>>>
>>>>
>>>> So back to the count issue... Counting in databases is a classic
>>>> problem. Unless your DB system is keeping stats on how many
>>>> inserts/deletes and thus how big it thinks the table is, you have to
>>>> count all the rows by reading them.  HBase is no different, and a
>>>> little harder, because we have a variable length data format, so we
>>>> can't just estimate row sizes from file sizes.  Keeping distributed
>>>> stats is not impossible, but certainly not on any priority list to be
>>>> implemented - of course JIRAs/patches welcome etc.
>>>>
>>>> -ryan
>>>>
>>>>
>>>> On Thu, Jul 29, 2010 at 3:48 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>>>> > We use HBase 0.20.5
>>>> >
>>>> > Here is the snippet from RowCounter output:
>>>> >
>>>> > 10/07/29 22:38:42 DEBUG client.HTable$ClientScanner: Finished with scanning
>>>> > at REGION => {NAME =>
>>>> > '2__HB_NOINC_ORCL_SQLLDR_0728-THREEGPPSPEECHCALLS-1280408509541-0,DFF46493EB352D0E31CBFA4652E6EC06,1280412540858',
>>>> > STARTKEY => 'DFF46493EB352D0E31CBFA4652E6EC06', ENDKEY => '', ENCODED =>
>>>> > 1375318608, TABLE => {{NAME =>
>>>> > '2__HB_NOINC_ORCL_SQLLDR_0728-THREEGPPSPEECHCALLS-1280408509541-0', FAMILIES
>>>> > => [{NAME => 'd', COMPRESSION => 'GZ', VERSIONS => '2', TTL => '31536000',
>>>> > BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}, {NAME =>
>>>> > 'i', COMPRESSION => 'GZ', VERSIONS => '2', TTL => '31536000', BLOCKSIZE =>
>>>> > '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}, {NAME => 'v',
>>>> > COMPRESSION => 'GZ', VERSIONS => '2', TTL => '31536000', BLOCKSIZE =>
>>>> > '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}}
>>>> > 10/07/29 22:38:42 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0
>>>> > is done. And is in the process of commiting
>>>> > 10/07/29 22:38:42 INFO mapred.LocalJobRunner:
>>>> > 10/07/29 22:38:42 INFO mapred.TaskRunner: Task attempt_local_0001_m_000000_0
>>>> > is allowed to commit now
>>>> > 10/07/29 22:38:42 INFO mapred.FileOutputCommitter: Saved output of task
>>>> > 'attempt_local_0001_m_000000_0' to
>>>> > file:/usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc
>>>> > 10/07/29 22:38:42 INFO mapred.LocalJobRunner:
>>>> > 10/07/29 22:38:42 INFO mapred.TaskRunner: Task
>>>> > 'attempt_local_0001_m_000000_0' done.
>>>> > 10/07/29 22:38:43 INFO mapred.JobClient:  map 100% reduce 0%
>>>> > 10/07/29 22:38:43 INFO mapred.JobClient: Job complete: job_local_0001
>>>> > 10/07/29 22:38:43 INFO mapred.JobClient: Counters: 6
>>>> > 10/07/29 22:38:43 INFO mapred.JobClient:   FileSystemCounters
>>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     FILE_BYTES_READ=1592883
>>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=1624956
>>>> > 10/07/29 22:38:43 INFO mapred.JobClient:   Map-Reduce Framework
>>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map input records=0
>>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Spilled Records=0
>>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map input bytes=0
>>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map output records=0
>>>> >
>>>> > [sjc1-hadoop8.sjc1:hadoop 3705]ls -l
>>>> > /usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc/part-00000
>>>> > -rwxrwxrwx 1 hadoop users 0 Jul 29 22:38
>>>> > /usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc/part-00000
>>>> >
>>>> > But there are many records in the table I was querying.
>>>> >
>>>> > Can someone comment?
>>>> >
>>>> > On Thu, Jul 29, 2010 at 2:26 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>>>> >
>>>> >> In 0.89 you can specify CACHE for the count command. Set it higher (it
>>>> >> defaults to 10 rows per call).
>>>> >>
>>>> >> Also you can use the RowCounter MR job.
>>>> >>
>>>> >> J-D
>>>> >>
>>>> >> On Thu, Jul 29, 2010 at 2:22 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>>>> >> > Hi,
>>>> >> > The count method in the HBase shell is quite slow.
>>>> >> > Is there a way to obtain the count faster?
>>>> >> >
>>>> >> > Thanks
>>>> >> >
>>>> >>
>>>> >
>>>>
>>>
>>
>>
>>
>> --
>> Regards
>> Angus
>



-- 
Regards
Angus
