hadoop-mapreduce-user mailing list archives

From sudhakara st <sudhakara...@gmail.com>
Subject Re: Streaming jobs getting poor locality
Date Thu, 23 Jan 2014 09:42:54 GMT
I think in order to configure a Hadoop job to read compressed input, you
have to specify the compression codec in code or on the command line, like:

  -D io.compression.codecs=org.apache.hadoop.io.compress.BZip2Codec
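For example, the generic -D option has to come before the other streaming
options, so the invocation might look like this (a sketch only; the jar path
and file paths are taken from the command in the quoted message below):

```
yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-101.jar \
  -D io.compression.codecs=org.apache.hadoop.io.compress.BZip2Codec \
  -files hdfs:///apps/local/count.pl \
  -input /foo/data/bz2 \
  -output /user/me/myoutput \
  -mapper "cut -f4,8 -d," \
  -reducer count.pl \
  -combiner count.pl
```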


On Thu, Jan 23, 2014 at 12:40 AM, Williams, Ken <Ken.Williams@windlogics.com> wrote:

>  Hi,
>
>
>
> I posted a question to Stack Overflow yesterday about an issue I’m seeing,
> but judging by the low interest (only 7 views in 24 hours, and 3 of them
> were probably me! :-) it seems like I should switch venues.  I’m pasting the
> same question here in hopes of finding someone with interest.
>
>
>
> Original SO post is at
> http://stackoverflow.com/questions/21266248/hadoop-jobs-getting-poor-locality.
>
>
>
> *****************
>
> I have some fairly simple Hadoop streaming jobs that look like this:
>
>
>
> yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-101.jar \
>
>   -files hdfs:///apps/local/count.pl \
>
>   -input /foo/data/bz2 \
>
>   -output /user/me/myoutput \
>
>   -mapper "cut -f4,8 -d," \
>
>   -reducer count.pl \
>
>   -combiner count.pl
>
>
>
> The count.pl script is just a simple script that accumulates counts in a
> hash and prints them out at the end - the details are probably not relevant
> but I can post it if necessary.
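> For reference, here is a minimal sketch of what such a count-accumulating
> reducer could look like. This is an illustrative Python stand-in, not the
> actual count.pl; the real script's details may differ:

```python
#!/usr/bin/env python
# Illustrative streaming reducer: accumulates a count per input line in a
# dict (the hash) and prints the totals at the end. A stand-in for count.pl.
import sys

def count_lines(lines):
    counts = {}
    for line in lines:
        key = line.rstrip("\n")
        counts[key] = counts.get(key, 0) + 1
    return counts

if __name__ == "__main__":
    for key, n in sorted(count_lines(sys.stdin).items()):
        print("%s\t%d" % (key, n))
```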
>
>
>
> The input is a directory containing 5 files encoded with bz2 compression,
> roughly the same size as each other, for a total of about 5GB (compressed).
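> As a rough sanity check on the mapper count (assuming the default 128 MB
> HDFS block size, which is an assumption about this cluster):

```python
# Rough estimate of input splits for splittable bzip2 input.
# Assumes dfs.blocksize = 128 MB; the actual cluster setting may differ.
total_mb = 5 * 1024          # ~5 GB of compressed input
block_mb = 128               # assumed block size
splits = total_mb // block_mb
print(splits)                # ~40 splits, in the ballpark of the 45 mappers
```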
>
>
>
> When I look at the running job, it has 45 mappers, but they're all running
> on one node. The particular node changes from run to run, but always only
> one node. Therefore I'm achieving poor data locality as data is transferred
> over the network to this node, and probably achieving poor CPU usage too.
>
>
>
> The entire cluster has 9 nodes, all the same basic configuration. The
> blocks of the data for all 5 files are spread out among the 9 nodes, as
> reported by the HDFS Name Node web UI.
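> One way to double-check that placement from the command line, using the
> input path above, is hdfs fsck, which prints each block and the datanodes
> holding it:

```
hdfs fsck /foo/data/bz2 -files -blocks -locations
```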
>
>
>
> I'm happy to share any requested info from my configuration, but this is a
> corporate cluster and I don't want to upload any full config files.
>
>
>
> It looks like this previous thread [ why map task always running on a
> single node -
> http://stackoverflow.com/questions/12135949/why-map-task-always-running-on-a-single-node ]
> is relevant but not conclusive.
>
>
>
> *****************
>
>
>
> Thanks.
>
>
>
> --
>
> Ken Williams, Senior Research Scientist
>
> WindLogics
>
> http://windlogics.com
>
>
>
> ------------------------------
>
> CONFIDENTIALITY NOTICE: This e-mail message is for the sole use of the
> intended recipient(s) and may contain confidential and privileged
> information. Any unauthorized review, use, disclosure or distribution of
> any kind is strictly prohibited. If you are not the intended recipient,
> please contact the sender via reply e-mail and destroy all copies of the
> original message. Thank you.
>



-- 

Regards,
...Sudhakara.st
