hbase-user mailing list archives

From James Birchfield <jbirchfi...@stumbleupon.com>
Subject Re: HBase Table Row Count Optimization - A Solicitation For Help
Date Sat, 21 Sep 2013 01:34:41 GMT
Excellent!  Will do!

Birch
On Sep 20, 2013, at 6:32 PM, Ted Yu <yuzhihong@gmail.com> wrote:

> Please take a look at the javadoc
> for src/main/java/org/apache/hadoop/hbase/client/coprocessor/AggregationClient.java
> 
> As long as the machine can reach your HBase cluster, you should be able to
> run AggregationClient and utilize the AggregateImplementation endpoint in
> the region servers.
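> 
> A minimal client-side sketch (untested; the table name and column family
> below are placeholders, and it assumes the AggregateImplementation endpoint
> is already deployed on the region servers):
> 
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.hbase.HBaseConfiguration;
>     import org.apache.hadoop.hbase.client.Scan;
>     import org.apache.hadoop.hbase.client.coprocessor.AggregationClient;
>     import org.apache.hadoop.hbase.client.coprocessor.LongColumnInterpreter;
>     import org.apache.hadoop.hbase.util.Bytes;
> 
>     public class AggregationRowCount {
>       public static void main(String[] args) throws Throwable {
>         Configuration conf = HBaseConfiguration.create();   // points at your cluster
>         AggregationClient aggregationClient = new AggregationClient(conf);
>         Scan scan = new Scan();
>         scan.addFamily(Bytes.toBytes("cf"));                 // placeholder column family
>         long rows = aggregationClient.rowCount(
>             Bytes.toBytes("some_table"),                     // placeholder table name
>             new LongColumnInterpreter(), scan);
>         System.out.println("row count = " + rows);
>       }
>     }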
> 
> Cheers
> 
> 
> On Fri, Sep 20, 2013 at 6:26 PM, James Birchfield <
> jbirchfield@stumbleupon.com> wrote:
> 
>> Thanks Ted.
>> 
>> That was the direction I have been working towards as I am learning today.
>> Much appreciation to all the replies to this thread.
>> 
>> Whether I keep the MapReduce job or utilize the Aggregation coprocessor
>> (which is turning out that it should be possible for me here), I need to
>> make sure I am running the client in an efficient manner.  Lars may have
>> hit upon the core problem.  I am not running the map reduce job on the
>> cluster, but rather from a stand alone remote java client executing the job
>> in process.  This may very well turn out to be the number one issue.  I
>> would love it if this turns out to be true.  Would make this a great
>> learning lesson for me as a relative newcomer to working with HBase, and
>> potentially allow me to finish this initial task much quicker than I was
>> thinking.
>> 
>> So assuming the MapReduce jobs need to be run on the cluster instead of
>> locally, does a coprocessor endpoint client need to be run the same, or is
>> it safe to run it on a remote machine since the work gets distributed out
>> to the region servers?  Just wondering if I would run into the same issues
>> if what I said above holds true.
>> 
>> Thanks!
>> Birch
>> On Sep 20, 2013, at 6:17 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>> 
>>> In 0.94, we have AggregateImplementation, an endpoint coprocessor, which
>>> implements getRowNum().
>>> 
>>> Example is in AggregationClient.java
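>>> 
>>> If the endpoint is not already deployed on the region servers, one way to
>>> load it (per the coprocessor docs; the region servers need a restart for
>>> the property to take effect) is via hbase-site.xml:
>>> 
>>>     <property>
>>>       <name>hbase.coprocessor.user.region.classes</name>
>>>       <value>org.apache.hadoop.hbase.coprocessor.AggregateImplementation</value>
>>>     </property>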
>>> 
>>> Cheers
>>> 
>>> 
>>> On Fri, Sep 20, 2013 at 6:09 PM, lars hofhansl <larsh@apache.org> wrote:
>>> 
>>>> From your numbers below you have about 26k regions, thus each region is
>>>> about 545tb/26k = 20gb. Good.
>>>> 
>>>> How many mappers are you running?
>>>> And just to rule out the obvious, the M/R is running on the cluster and
>>>> not locally, right? (It will default to a local runner when it cannot use
>>>> the M/R cluster.)
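>>>> 
>>>> A quick way to check from the client is to print the relevant properties
>>>> from the job's Configuration before submitting - if they resolve to
>>>> "local", the job never leaves the client JVM. Which property matters
>>>> depends on whether the cluster runs MR1 or YARN, so treat this as a rough
>>>> sketch against the Job you already build:
>>>> 
>>>>     Configuration conf = job.getConfiguration();  // job from RowCounter.createSubmittableJob(...)
>>>>     System.out.println("mapred.job.tracker = " + conf.get("mapred.job.tracker"));
>>>>     System.out.println("mapreduce.framework.name = " + conf.get("mapreduce.framework.name"));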
>>>> 
>>>> Some back-of-the-envelope calculations tell me that, assuming 1GbE network
>>>> cards, the best you can expect for 110 machines to map through this data
>>>> is about 10h (so way faster than what you see).
>>>> (545 TB / (110 * 1/8 GB/s) ~ 40 ks ~ 11 h)
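>>>> 
>>>> (Spelled out: a 1GbE card moves roughly 1/8 GB/s = 0.125 GB/s, so 110
>>>> machines can stream about 110 * 0.125 = 13.75 GB/s in aggregate, and
>>>> 545 TB / 13.75 GB/s ~ 545,000 GB / 13.75 GB/s ~ 40,000 s ~ 11 hours.)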
>>>> 
>>>> 
>>>> We should really add a row-counting coprocessor to HBase and allow using it
>>>> via M/R.
>>>> 
>>>> -- Lars
>>>> 
>>>> 
>>>> 
>>>> ________________________________
>>>> From: James Birchfield <jbirchfield@stumbleupon.com>
>>>> To: user@hbase.apache.org
>>>> Sent: Friday, September 20, 2013 5:09 PM
>>>> Subject: Re: HBase Table Row Count Optimization - A Solicitation For Help
>>>> 
>>>> 
>>>> I did not implement accurate timing, but the current table being counted
>>>> has been running for about 10 hours, and the log is estimating the map
>>>> portion at 10%:
>>>> 
>>>> 2013-09-20 23:40:24,099 INFO  [main] Job: map 10% reduce 0%
>>>> 
>>>> So a loooong time.  Like I mentioned, we have billions, if not trillions
>>>> of rows potentially.
>>>> 
>>>> Thanks for the feedback on the approaches I mentioned.  I was not sure if
>>>> they would have any effect overall.
>>>> 
>>>> I will look further into coprocessors.
>>>> 
>>>> Thanks!
>>>> Birch
>>>> On Sep 20, 2013, at 4:58 PM, Vladimir Rodionov <vrodionov@carrieriq.com> wrote:
>>>> 
>>>>> How long does it take for the RowCounter job on your largest table to
>>>>> finish on your cluster?
>>>>> 
>>>>> Just curious.
>>>>> 
>>>>> On your options:
>>>>> 
>>>>> 1. Not worth it probably - you may overload your cluster
>>>>> 2. Not sure this one differs from 1. Looks the same to me but more complex.
>>>>> 3. The same as 1 and 2
>>>>> 
>>>>> Counting rows in an efficient way can be done if you sacrifice some
>>>>> accuracy:
>>>>> 
>>>>> http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html
>>>>> 
>>>>> Yeah, you will need coprocessors for that.
>>>>> 
>>>>> Best regards,
>>>>> Vladimir Rodionov
>>>>> Principal Platform Engineer
>>>>> Carrier IQ, www.carrieriq.com
>>>>> e-mail: vrodionov@carrieriq.com
>>>>> 
>>>>> ________________________________________
>>>>> From: James Birchfield [jbirchfield@stumbleupon.com]
>>>>> Sent: Friday, September 20, 2013 3:50 PM
>>>>> To: user@hbase.apache.org
>>>>> Subject: Re: HBase Table Row Count Optimization - A Solicitation For Help
>>>>> 
>>>>> Hadoop 2.0.0-cdh4.3.1
>>>>> 
>>>>> HBase 0.94.6-cdh4.3.1
>>>>> 
>>>>> 110 servers, 0 dead, 238.2364 average load
>>>>> 
>>>>> Some other info, not sure if it helps or not.
>>>>> 
>>>>> Configured Capacity: 1295277834158080 (1.15 PB)
>>>>> Present Capacity: 1224692609430678 (1.09 PB)
>>>>> DFS Remaining: 624376503857152 (567.87 TB)
>>>>> DFS Used: 600316105573526 (545.98 TB)
>>>>> DFS Used%: 49.02%
>>>>> Under replicated blocks: 0
>>>>> Blocks with corrupt replicas: 1
>>>>> Missing blocks: 0
>>>>> 
>>>>> It is hitting a production cluster, but I am not really sure how to
>>>>> calculate the load placed on the cluster.
>>>>> On Sep 20, 2013, at 3:19 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>>>>> 
>>>>>> How many nodes do you have in your cluster ?
>>>>>> 
>>>>>> When counting rows, what other load would be placed on the cluster?
>>>>>> 
>>>>>> What is the HBase version you're currently using / planning to use?
>>>>>> 
>>>>>> Thanks
>>>>>> 
>>>>>> 
>>>>>> On Fri, Sep 20, 2013 at 2:47 PM, James Birchfield <
>>>>>> jbirchfield@stumbleupon.com> wrote:
>>>>>> 
>>>>>>>     After reading the documentation and scouring the mailing list
>>>>>>> archives, I understand there is no real support for fast row counting in
>>>>>>> HBase unless you build some sort of tracking logic into your code.  In
>>>>>>> our case, we do not have such logic, and we have massive amounts of data
>>>>>>> already persisted.  I am running into very long execution times for the
>>>>>>> RowCounter MapReduce job against very large tables (multi-billion rows
>>>>>>> for many of them, by our estimate).  I understand why this issue exists
>>>>>>> and am slowly accepting it, but I am hoping I can solicit some possible
>>>>>>> ideas to help speed things up a little.
>>>>>>> 
>>>>>>>     My current task is to provide total row counts on about 600
>>>>>>> tables, some extremely large, some not so much.  Currently, I have a
>>>>>>> process that executes the MapReduce job in-process like so:
>>>>>>> 
>>>>>>>     Job job = RowCounter.createSubmittableJob(
>>>>>>>         ConfigManager.getConfiguration(), new String[] { tableName });
>>>>>>>     boolean waitForCompletion = job.waitForCompletion(true);
>>>>>>>     Counters counters = job.getCounters();
>>>>>>>     Counter rowCounter = counters.findCounter(hbaseadminconnection.Counters.ROWS);
>>>>>>>     return rowCounter.getValue();
>>>>>>> 
>>>>>>>     At the moment, each MapReduce job is executed in serial order, so
>>>>>>> counting one table at a time.  For the current implementation of this
>>>>>>> whole process, as it stands right now, my rough timing calculations
>>>>>>> indicate that fully counting all the rows of these 600 tables will take
>>>>>>> anywhere between 11 and 22 days.  This is not what I consider a
>>>>>>> desirable timeframe.
>>>>>>> 
>>>>>>>     I have considered three alternative approaches to speed things up.
>>>>>>> 
>>>>>>>     First, since the application is not heavily CPU bound, I could use
>>>>>>> a ThreadPool and execute multiple MapReduce jobs at the same time,
>>>>>>> looking at different tables.  I have never done this, so I am unsure if
>>>>>>> this would cause any unanticipated side effects.
>>>>>>> 
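>>>>>>>     Roughly, I am picturing something like the following (an untested
>>>>>>> sketch; the pool size is a placeholder, error handling is omitted, and it
>>>>>>> reuses the same ConfigManager/RowCounter call shown above):
>>>>>>> 
>>>>>>>     ExecutorService pool = Executors.newFixedThreadPool(4);   // placeholder pool size
>>>>>>>     Map<String, Future<Long>> pending = new HashMap<String, Future<Long>>();
>>>>>>>     for (final String tableName : tableNames) {               // the ~600 table names
>>>>>>>       pending.put(tableName, pool.submit(new Callable<Long>() {
>>>>>>>         public Long call() throws Exception {
>>>>>>>           Job job = RowCounter.createSubmittableJob(
>>>>>>>               ConfigManager.getConfiguration(), new String[] { tableName });
>>>>>>>           job.waitForCompletion(true);
>>>>>>>           return job.getCounters()
>>>>>>>               .findCounter(hbaseadminconnection.Counters.ROWS).getValue();
>>>>>>>         }
>>>>>>>       }));
>>>>>>>     }
>>>>>>>     for (Map.Entry<String, Future<Long>> entry : pending.entrySet()) {
>>>>>>>       System.out.println(entry.getKey() + " = " + entry.getValue().get());  // blocks until done
>>>>>>>     }
>>>>>>>     pool.shutdown();
>>>>>>> 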
>>>>>>>     Second, I could distribute the processes.  I could find as many
>>>>>>> machines as can successfully talk to the desired cluster, give them a
>>>>>>> subset of tables to work on, and then combine the results post-process.
>>>>>>> 
>>>>>>>     Third, I could combine both of the above approaches and run a
>>>>>>> distributed set of multithreaded processes to execute the MapReduce
>>>>>>> jobs in parallel.
>>>>>>> 
>>>>>>>     Although it seems to have been asked and answered many times, I
>>>>>>> will ask once again.  Without the need to change our current
>>>>>>> configurations or restart the clusters, is there a faster approach to
>>>>>>> obtain row counts?  FYI, my cache size for the Scan is set to 1000.
>>>>>>> I have experimented with different numbers, but nothing made a
>>>>>>> noticeable difference.  Any advice or feedback would be greatly
>>>>>>> appreciated!
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Birch
>>>>> 
>>>>> 

