hadoop-common-user mailing list archives

From Serge Blazhievsky <Serge.Blazhiyevs...@nice.com>
Subject Re: how to fine tuning my map reduce job that is generating a lot of intermediate key-value pairs (a lot of I/O operations)
Date Wed, 04 Apr 2012 17:17:04 GMT
How many datanodes do you use for your job?

On 4/3/12 8:11 PM, "Jane Wayne" <jane.wayne2978@gmail.com> wrote:

>i don't have the option of setting the map heap size to 2 GB since my
>real environment is AWS EMR and the constraints are set.
>
>http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html this
>link is where i am currently reading on the meaning of io.sort.factor
>and io.sort.mb.
>
>it seems io.sort.mb tunes the map tasks and io.sort.factor tunes the
>shuffle/reduce task. am i correct to say then that io.sort.factor is
>not relevant here (yet, anyways)? since i don't really make it to the
>reduce phase (except for only a very small data size).
>
>in that link above, here is the description for, io.sort.mb:  The
>cumulative size of the serialization and accounting buffers storing
>records emitted from the map, in megabytes. there's a paragraph above
>the table that says this value is simply the threshold that triggers a sort
>and spill to the disk. furthermore, it says, "If either buffer fills
>completely while the spill is in progress, the map thread will block,"
>which is what i believe is happening in my case.
>
>this sentence concerns me, "Minimizing the number of spills to disk
>can decrease map time, but a larger buffer also decreases the memory
>available to the mapper." to minimize the number of spills, you need a
>larger buffer; however, this statement seems to suggest to NOT
>minimize the number of spills; a) you will not decrease map time, b)
>you will not decrease the memory available to the mapper. so, in your
>advice below, you say to increase, but i may actually want to decrease
>the value for io.sort.mb. (if i understood the documentation
>correctly, ????)
>
>it seems these three map tuning parameters, io.sort.mb,
>io.sort.record.percent, and io.sort.spill.percent are a pain-point
>trading off between speed and memory. to me, if you set them high,
>more serialized data + metadata are stored in memory before a spill
>(an I/O operation is performed). you also get fewer merges (fewer I/O
>operations?), but the negatives are blocking map operations and more
>memory requirements. if you set them low, there are more frequent
>spills (more I/O operations), but less memory requirements. it just
>seems like no matter what you do, you are stuck: you may stall the
>mapper if the values are high because of the amount of time required
>to spill an enormous amount of data; you may stall the mapper if the
>values are low because of the amount of I/O operations required
>(spill/merge).
>
>i must be understanding something wrong here because everywhere i
>read, hadoop is supposed to be #1 at sorting. but here, in dealing
>with the intermediary key-value pairs, in the process of sorting,
>mappers can stall for any number of reasons.
>
>does anyone know any competitive dynamic hadoop clustering service
>like AWS EMR? the reason why i ask is because AWS EMR does not use
>HDFS (it uses S3), and therefore, data locality is not possible. also,
>i have read the TCP protocol is not efficient for network transfers;
>if the S3 node and task nodes are far, this distance will certainly
>exacerbate the situation of slow speed. it seems there are a lot of
>factors working against me.
>
>any help is appreciated.
>
>On Tue, Apr 3, 2012 at 7:48 AM, Bejoy Ks <bejoy.hadoop@gmail.com> wrote:
>>
>> Jane,
>>       From my first look, properties that can help you could be
>> - Increase io sort factor to 100
>> - Increase io.sort.mb to 512Mb
>> - increase map task heap size to 2GB.
>>
>> If the task still stalls, try providing lesser input for each mapper.
>>
>> Regards
>> Bejoy KS
>>
>> On Tue, Apr 3, 2012 at 2:08 PM, Jane Wayne <jane.wayne2978@gmail.com> wrote:
>>
>> > i have a map reduce job that is generating a lot of intermediate key-value
>> > pairs. for example, when i am 1/3 complete with my map phase, i may have
>> > generated over 130,000,000 output records (which is about 9 gigabytes). to
>> > get to the 1/3 complete mark is very fast (less than 10 minutes), but at
>> > the 1/3 complete mark, it seems to stall. when i look at the counter logs,
>> > i do not see any logging of spilling yet. however, on the web job UI, i see
>> > that FILE_BYTES_WRITTEN and Spilled Records keeps increasing. needless to
>> > say, i have to dig deeper to see what is going on.
>> >
>> > my question is, how do i fine tune my map reduce job with the above
>> > properties? namely, the property of generating a lot of intermediate
>> > key-value pairs? it seems the I/O operations are negatively impacting the
>> > job speed. there are so many map- and reduce-side tuning properties (see
>> > Tom White, Hadoop, 2nd edition, pp 181-182), i am a little unsure about
>> > just how to approach the tuning parameters. since the slow down is
>> > happening during the map-phase/task, i assume i should narrow down on the
>> > map-side tuning properties.
>> >
>> > by the way, i am using the CPU-intensive c1.medium instances of amazon web
>> > service's (AWS) elastic map reduce (EMR) on hadoop v0.20. a compute node
>> > has 2 mappers, 1 reducers, and 384 MB JVM memory per task. this instance
>> > type is documented to have moderate I/O performance.
>> >
>> > any help on fine tuning my particular map reduce job is appreciated.
>> >
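
A rough sketch of the buffer arithmetic discussed in the quoted messages, in case it helps anyone following the thread. It assumes the model described in the r0.20.2 tutorial (16 bytes of accounting data per record, and a spill starting when either buffer reaches io.sort.spill.percent) and assumes the 0.20 defaults are io.sort.mb=100, io.sort.record.percent=0.05 and io.sort.spill.percent=0.80. The average record size is estimated from the figures quoted above (about 9 GB over 130,000,000 records), and the function names are made up for illustration; they are not Hadoop APIs.

```python
# Back-of-the-envelope model of the map-side sort buffer (not Hadoop code).
# Assumption: 16 bytes of accounting data per record, per the r0.20.2 tutorial.
ACCOUNTING_BYTES_PER_RECORD = 16


def spill_thresholds(io_sort_mb, record_percent, spill_percent, avg_record_bytes):
    """Estimate records collected before a spill, and which buffer triggers it."""
    total = io_sort_mb * 1024 * 1024
    accounting = total * record_percent   # space for per-record metadata
    data = total - accounting             # space for serialized keys and values
    by_accounting = int(accounting * spill_percent / ACCOUNTING_BYTES_PER_RECORD)
    by_data = int(data * spill_percent / avg_record_bytes)
    if by_accounting < by_data:
        return by_accounting, "accounting"
    return by_data, "serialization"


def balanced_record_percent(avg_record_bytes):
    """Fraction at which both buffers would fill at about the same time."""
    return ACCOUNTING_BYTES_PER_RECORD / (ACCOUNTING_BYTES_PER_RECORD + avg_record_bytes)


def fits_in_heap(io_sort_mb, task_heap_mb):
    """The sort buffer is allocated inside the task JVM, so it must be smaller."""
    return io_sort_mb < task_heap_mb


# Estimate from the thread: about 9 GB spread over 130,000,000 records.
avg_record = 9 * 1024**3 / 130_000_000

records, limit = spill_thresholds(100, 0.05, 0.80, avg_record)
print(f"average record size: {avg_record:.0f} bytes")
print(f"assumed defaults: spill after about {records} records ({limit} buffer fills first)")

suggested = balanced_record_percent(avg_record)
records, limit = spill_thresholds(100, suggested, 0.80, avg_record)
print(f"record.percent={suggested:.2f}: spill after about {records} records")

print("io.sort.mb=512 fits in a 384 MB heap:", fits_in_heap(512, 384))
```

Under these assumptions, records of roughly 74 bytes fill the accounting buffer long before the serialization buffer, so raising io.sort.record.percent may reduce the number of spills without a larger io.sort.mb. The last line also shows why io.sort.mb=512 is not an option with 384 MB of JVM memory per task.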

