hbase-user mailing list archives

From Ryan Rawson <ryano...@gmail.com>
Subject Re: Doubt in HBase
Date Fri, 21 Aug 2009 06:37:22 GMT
Well, the inputs to those reducers would be the empty set; they
wouldn't have anything to do, and their output would be empty as
well.

If you are doing something like this, and your operation is
commutative (and associative), consider using a combiner so that you
don't shuffle as much data. A large amount of shuffled data can make
map-reduce jobs slower. While map-reduce is a sorter, shuffling 1500 GB
just takes a little while, you know?
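
To make the combiner point concrete, here is a minimal sketch in plain Java with no Hadoop dependency. It only counts how many records would cross the shuffle with and without a map-side combine. The class name, the sample keys, and the sum-per-key operation are all assumptions chosen for illustration; in a real job you would register a combiner class on the job instead.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombinerSketch {

    // Hypothetical map outputs: each inner list holds the keys emitted by
    // one map task, with an implied value of 1 per record.
    static final List<List<String>> MAP_OUTPUTS = Arrays.asList(
            Arrays.asList("a", "b", "a", "a"),
            Arrays.asList("b", "b", "c", "a"));

    /** Map-side combine: collapse one task's (key, 1) records into one (key, sum) record per key. */
    static Map<String, Integer> combine(List<String> keys) {
        Map<String, Integer> sums = new TreeMap<>();
        for (String k : keys) {
            sums.merge(k, 1, Integer::sum);
        }
        return sums;
    }

    /** Records crossing the shuffle when every (key, 1) record is sent as-is. */
    static int shuffledWithoutCombiner(List<List<String>> mapOutputs) {
        int n = 0;
        for (List<String> task : mapOutputs) {
            n += task.size();
        }
        return n;
    }

    /** Records crossing the shuffle when each map task combines first. */
    static int shuffledWithCombiner(List<List<String>> mapOutputs) {
        int n = 0;
        for (List<String> task : mapOutputs) {
            n += combine(task).size();
        }
        return n;
    }

    /** Reduce: merge the per-task partial sums into final totals. */
    static Map<String, Integer> reduce(List<List<String>> mapOutputs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (List<String> task : mapOutputs) {
            for (Map.Entry<String, Integer> e : combine(task).entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        System.out.println("without combiner: " + shuffledWithoutCombiner(MAP_OUTPUTS));
        System.out.println("with combiner:    " + shuffledWithCombiner(MAP_OUTPUTS));
        System.out.println("totals:           " + reduce(MAP_OUTPUTS));
    }
}
```

The final totals are the same either way because addition is commutative and associative; only the number of shuffled records changes.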

You can also set the number of reducers, but the mapping of reduce
keys to reducer instances is done by hashing the key, iirc, so you
don't choose which reducer gets which key. The normal case, however,
is to have a large number of reduce keys rather than only a small
number.
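
As a rough illustration of the hashed mapping, the sketch below uses the same arithmetic as Hadoop's default HashPartitioner (hash code masked to non-negative, modulo the reducer count) and counts how many reducers receive no keys. It is plain Java with no Hadoop dependency. The class name and sample keys are hypothetical, and `String.hashCode()` stands in for the hash of a real Writable key, so the exact assignments in a real job would differ.

```java
import java.util.Arrays;
import java.util.List;

public class PartitionSketch {

    /** Picks a reducer for a key: non-negative hash modulo the reducer count. */
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Counts how many distinct keys land on each reducer. */
    static int[] keysPerReducer(List<String> distinctKeys, int numReducers) {
        int[] counts = new int[numReducers];
        for (String k : distinctKeys) {
            counts[partition(k, numReducers)]++;
        }
        return counts;
    }

    /** Number of reducers that receive no keys and so have nothing to do. */
    static int idleReducers(List<String> distinctKeys, int numReducers) {
        int idle = 0;
        for (int c : keysPerReducer(distinctKeys, numReducers)) {
            if (c == 0) {
                idle++;
            }
        }
        return idle;
    }

    public static void main(String[] args) {
        // Assumed scenario: 3 distinct keys but 8 reducers configured.
        List<String> keys = Arrays.asList("a", "b", "c");
        System.out.println(Arrays.toString(keysPerReducer(keys, 8)));
        System.out.println("idle reducers: " + idleReducers(keys, 8));
    }
}
```

With more reducers than distinct keys, at least the difference between the two sits idle, and hash collisions can leave even more of them empty.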

Generally speaking, use the combiner functionality. It keeps the data
sizes low. High reduce counts are better when you have to shuffle
a lot of data with many distinct reduce keys.

This is getting pretty OT; I suggest revisiting the map-reduce paper
and the Hadoop docs.

-ryan

On Thu, Aug 20, 2009 at 9:24 PM, john smith<js1987.smith@gmail.com> wrote:
> Thanks for all your replies, guys. As bharath said, what happens when the
> number of reducers exceeds the number of distinct map output keys?
>
> On Fri, Aug 21, 2009 at 9:39 AM, bharath vissapragada <bharathvissapragada1990@gmail.com> wrote:
>
>> Aamandeep, Gray and Purtell, thanks for your replies. I have found them
>> very useful.
>>
>> You said to increase the number of reduce tasks. Suppose the number of
>> reduce tasks is more than the number of distinct map output keys; would
>> some of the reduce processes go to waste? Is that the case?
>>
>> Also, I have one more doubt. I have 5 values for a given key on one
>> region and another 2 values on 2 different region servers.
>> Does Hadoop MapReduce take care of moving these 2 values to the region
>> with 5 values, instead of moving those 5 values to another system, to
>> minimize the dataflow? Is this what is happening inside?
>>
>> On Fri, Aug 21, 2009 at 9:03 AM, Andrew Purtell <apurtell@apache.org> wrote:
>>
>> > The behavior of TableInputFormat is to schedule one mapper for every
>> > table region.
>> >
>> > In addition to what others have said already, if your reducer is doing
>> > little more than storing data back into HBase (via TableOutputFormat),
>> > then you can consider writing results back to HBase directly from the
>> > mapper to avoid incurring the overhead of sort/shuffle/merge which
>> > happens within the Hadoop job framework as map outputs are input into
>> > reducers. For that type of use case -- using the Hadoop mapreduce
>> > subsystem as essentially a grid scheduler -- something like
>> > job.setNumReduceTasks(0) will do the trick.
>> >
>> > Best regards,
>> >
>> >   - Andy
>> >
>> >
>> >
>> >
>> > ________________________________
>> > From: john smith <js1987.smith@gmail.com>
>> > To: hbase-user@hadoop.apache.org
>> > Sent: Friday, August 21, 2009 12:42:36 AM
>> > Subject: Doubt in HBase
>> >
>> > Hi all,
>> >
>> > I have one small doubt. Kindly answer it even if it sounds silly.
>> >
>> > I am using MapReduce in HBase in distributed mode. I have a table which
>> > spans 5 region servers. I am using TableInputFormat to read the data
>> > from the table in the map. When I run the program, how many map tasks
>> > are created by default? Is it one per region server or more?
>> >
>> > Also, after the map task is over, the reduce task takes a bit more time.
>> > Is it due to moving the map output across the region servers, i.e.,
>> > moving the values of the same key to a particular reducer before it can
>> > start? Is there any way I can optimize the code (e.g. by storing data
>> > for the same reducer nearby)?
>> >
>> > Thanks :)
>> >
>> >
>> >
>> >
>>
>
