hadoop-mapreduce-user mailing list archives

From Madhav Sharan <msha...@usc.edu>
Subject Re: All nodes are not used
Date Tue, 09 Aug 2016 20:53:28 GMT
Thanks Mahesh

So far I have not been able to run the whole job within a limited time
period, so I am looking for optimizations and better resource utilization.
Maybe I can try tweaking the input split size, if that helps.

Thanks for your help; it explains the behaviour.
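
The ~24 mapper estimate in the quoted reply below can be reproduced with a little arithmetic. The following is a minimal sketch, not the job's actual code: it assumes the input is exactly 3 GB, is splittable, and that the job uses the default FileInputFormat split logic (split size = max(minSize, min(maxSize, blockSize)), with a 10% slack on the last split). The class and method names are hypothetical. In a real job the min and max sizes would typically come from the mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize properties.

```java
public class SplitEstimate {
    // Slack factor: the final split is allowed to be up to 10% larger than splitSize.
    static final double SPLIT_SLOP = 1.1;

    // Split size formula: clamp the block size between minSize and maxSize.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Counts the splits (and therefore map tasks) produced for one splittable file.
    static long countSplits(long fileBytes, long splitSize) {
        long splits = 0;
        long remaining = fileBytes;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // the tail becomes one last split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileBytes = 3L * 1024L * mb; // assumption: exactly 3 GB
        long blockSize = 128L * mb;       // block size mentioned in the thread

        // Default limits: minSize = 1, maxSize = Long.MAX_VALUE.
        long defaultSplit = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
        System.out.println("splits at 128 MB: " + countSplits(fileBytes, defaultSplit));

        // Hypothetical tuning: cap the split size at 32 MB to get more map tasks.
        long smallerSplit = computeSplitSize(blockSize, 1L, 32L * mb);
        System.out.println("splits at 32 MB: " + countSplits(fileBytes, smallerSplit));
    }
}
```

Under these assumptions a smaller maximum split size yields more map tasks, which gives the scheduler more work to spread across nodes, at the cost of more container start-up overhead per byte processed.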

--
Madhav Sharan


On Tue, Aug 9, 2016 at 1:28 PM, Mahesh Balija <balijamahesh.mca@gmail.com>
wrote:

> Hi Madhav,
>
> The behaviour sounds normal to me.
> If the block size is 128 MB, there could be ~24 mappers (i.e.,
> containers used).
> You cannot use the entire cluster, as the blocks may reside only on the
> nodes being used.
>
> You should not try to use the entire cluster's resources, for the
> following reason:
>
> The time required to initialize a container should be balanced against
> the time required to process its share of the data, so as to maximize
> container utilization. That is why a block size of 128 MB was chosen. In
> many cases the InputSplit size is increased to optimize container
> utilization, depending on the workload.
>
> Best,
> Mahesh.B.
>
>
>
> On Tue, Aug 9, 2016 at 12:19 AM, Madhav Sharan <msharan@usc.edu> wrote:
>
>> Hi Hadoop users,
>>
>> I am running an M/R job with an input file of 23 million records. I can
>> see that not all of our nodes are being used.
>>
>> What can I change to utilize all nodes?
>>
>>
>> Containers  Mem Used  Mem Avail  Vcores Used  Vcores Avail
>> 8           11.25 GB  0 B        8            0
>> 0           0 B       11.25 GB   0            8
>> 0           0 B       11.25 GB   0            8
>> 8           11.25 GB  0 B        8            0
>> 8           11.25 GB  0 B        8            0
>> 7           11.25 GB  0 B        7            1
>> 5           7.03 GB   4.22 GB    5            3
>> 0           0 B       11.25 GB   0            8
>> 0           0 B       11.25 GB   0            8
>>
>>
>> My command looks like this:
>>
>> hadoop jar target/pooled-time-series-1.0-SNAPSHOT-jar-with-dependencies.jar
>> gov.nasa.jpl.memex.pooledtimeseries.MeanChiSquareDistanceCalculation
>> /user/pts/output/MeanChiSquareAndSimilarityInput
>> /user/pts/output/MeanChiSquaredCalcOutput
>>
>> The directory */user/pts/output/MeanChiSquareAndSimilarityInput* has an
>> input file of 23 million records. The file size is ~3 GB.
>>
>> Code - https://github.com/smadha/pooled_time_series/blob/master/src/main/java/gov/nasa/jpl/memex/pooledtimeseries/MeanChiSquareDistanceCalculation.java#L135
>>
>>
>> --
>> Madhav Sharan
>>
>>
>
