hadoop-common-user mailing list archives

From: Mahesh Balija <balijamahesh....@gmail.com>
Subject: Re: All nodes are not used
Date: Tue, 09 Aug 2016 20:28:18 GMT
Hi Madhav,

This behaviour sounds normal to me.
With a 128 MB block size there would be roughly ~24 mappers (i.e., ~24
containers in use).
You cannot expect the entire cluster to be busy, because the map tasks are
scheduled on the nodes that actually hold the blocks.
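(For example, with the ~3 GB input you mention below: 3 GB / 128 MB per
block ≈ 24 blocks, so the job requests only about 24 map containers no
matter how many nodes are idle.)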

You should not try to use the entire cluster's resources, for the following reason:

The time spent initializing a container versus the time spent processing
its share of the data should be balanced to maximize container utilization;
that is why a block size of 128 MB was chosen. In many cases the InputSplit
size is actually increased, depending on the workload, to improve container
utilization.
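
If you really do want the work spread across more containers, the usual knob
is the split size. Below is only a rough sketch, not your actual driver; the
class name and the 64 MB figure are just for illustration:

// SplitSizeExample.java: illustrative only, using the standard
// org.apache.hadoop.mapreduce API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplitSizeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "split-size-example");
    job.setJarByClass(SplitSizeExample.class);

    // A smaller maximum split size produces more map tasks, so the work can
    // land on more nodes. This is the same setting as
    // mapreduce.input.fileinputformat.split.maxsize in the job configuration.
    FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);    // 64 MB

    // The opposite tuning (fewer, larger splits) is the "increase the
    // InputSplit size" case mentioned above:
    // FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024); // 256 MB

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Whether more, smaller splits actually help depends on how heavy the
per-record work is; if the map tasks become too short-lived, container
start-up overhead will dominate, which is exactly the trade-off described
above.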

Best,
Mahesh.B.



On Tue, Aug 9, 2016 at 12:19 AM, Madhav Sharan <msharan@usc.edu> wrote:

> Hi Hadoop users,
>
> I am running an m/r job with an input file of 23 million records. I can see
> that not all of our nodes are being used.
>
> What can I change to utilize all nodes?
>
>
> Containers  Mem Used  Mem Avail  VCores Used  VCores Avail
> 8           11.25 GB  0 B        8            0
> 0           0 B       11.25 GB   0            8
> 0           0 B       11.25 GB   0            8
> 8           11.25 GB  0 B        8            0
> 8           11.25 GB  0 B        8            0
> 7           11.25 GB  0 B        7            1
> 5           7.03 GB   4.22 GB    5            3
> 0           0 B       11.25 GB   0            8
> 0           0 B       11.25 GB   0            8
>
>
> My command looks like -
>
> hadoop jar target/pooled-time-series-1.0-SNAPSHOT-jar-with-dependencies.jar \
>     gov.nasa.jpl.memex.pooledtimeseries.MeanChiSquareDistanceCalculation \
>     /user/pts/output/MeanChiSquareAndSimilarityInput \
>     /user/pts/output/MeanChiSquaredCalcOutput
>
> Directory - */user/pts/output/MeanChiSquareAndSimilarityInput* has an input
> file of 23 million records. The file size is ~3 GB.
>
> Code - https://github.com/smadha/pooled_time_series/blob/master/src/main/java/gov/nasa/jpl/memex/pooledtimeseries/MeanChiSquareDistanceCalculation.java#L135
>
>
> --
> Madhav Sharan
>
>
