hadoop-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: hadoop/yarn and task parallelization on non-hdfs filesystems
Date Fri, 15 Aug 2014 11:15:02 GMT
Does your non-HDFS filesystem implement the getFileBlockLocations API
that MR relies on to know how to split files?

The API is at http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/fs/FileSystem.html#getFileBlockLocations(org.apache.hadoop.fs.FileStatus,%20long,%20long)
and MR calls it at
https://github.com/apache/hadoop-common/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L392
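
If your filesystem doesn't override it, the base FileSystem class just
returns a single localhost location spanning the whole file, so each
file tends to become very few splits. An untested sketch of what an
override could look like (ParallelFs and the 128 MB chunk size are
placeholders for your setup, not an existing class):

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;

  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.RawLocalFileSystem;

  // Hypothetical FileSystem for a parallel-filesystem mount. It reports
  // synthetic 128 MB "blocks" so FileInputFormat can carve each file
  // into many splits instead of one.
  public class ParallelFs extends RawLocalFileSystem {
    private static final long SYNTHETIC_BLOCK_SIZE = 128L * 1024 * 1024;

    @Override
    public long getDefaultBlockSize() {
      // FileInputFormat's computeSplitSize() also consults the file's
      // reported block size, so advertise the same chunk size here.
      return SYNTHETIC_BLOCK_SIZE;
    }

    @Override
    public BlockLocation[] getFileBlockLocations(FileStatus file,
        long start, long len) throws IOException {
      List<BlockLocation> blocks = new ArrayList<BlockLocation>();
      long end = Math.min(start + len, file.getLen());
      for (long off = start; off < end; off += SYNTHETIC_BLOCK_SIZE) {
        long length = Math.min(SYNTHETIC_BLOCK_SIZE, end - off);
        // A shared filesystem has no real locality, so "localhost" is
        // as good a hint as any; the scheduler just needs some host.
        blocks.add(new BlockLocation(new String[] {"localhost:50010"},
            new String[] {"localhost"}, off, length));
      }
      return blocks.toArray(new BlockLocation[blocks.size()]);
    }
  }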

If not, perhaps you can enforce a manual chunking by asking MR to use
custom min/max split size values via config properties:
https://github.com/apache/hadoop-common/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L66
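
Those properties are mapreduce.input.fileinputformat.split.minsize and
mapreduce.input.fileinputformat.split.maxsize (the SPLIT_MINSIZE /
SPLIT_MAXSIZE constants at the link above). A rough job-side sketch;
the 64 MB / 256 MB values are placeholders you'd tune for your cluster:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

  public class SplitTuning {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Job job = Job.getInstance(conf, "split-tuning-example");
      // Equivalent to setting, in mapred-site.xml or on the command line:
      //   mapreduce.input.fileinputformat.split.minsize = 67108864
      //   mapreduce.input.fileinputformat.split.maxsize = 268435456
      FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
      // Capping the max split size forces a large file into many splits
      // (and hence many map tasks) even when the filesystem reports it
      // as a single block.
      FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
      // ... set mapper/reducer classes, input/output paths, and submit
      // as usual.
    }
  }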

On Fri, Aug 15, 2014 at 10:16 AM, Calvin <iphcalvin@gmail.com> wrote:
> I've looked into this problem a bit more, and from what another
> person has written, HDFS is tuned to scale appropriately [1] given
> the number of input splits, etc.
>
> In the case of using the local filesystem (which is really a
> network share on a parallel filesystem), the settings might be set
> conservatively so as not to thrash the local disks or create a
> processing bottleneck.
>
> Since thrashing isn't a concern in my case, I'd rather tune the
> settings to utilize the local filesystem efficiently.
>
> Are there any pointers to where in the source code I could look in
> order to tweak such parameters?
>
> Thanks,
> Calvin
>
> [1] https://stackoverflow.com/questions/25269964/hadoop-yarn-and-task-parallelization-on-non-hdfs-filesystems
>
> On Tue, Aug 12, 2014 at 12:29 PM, Calvin <iphcalvin@gmail.com> wrote:
>> Hi all,
>>
>> I've instantiated a Hadoop 2.4.1 cluster and I've found that running
>> MapReduce applications will parallelize differently depending on what
>> kind of filesystem the input data is on.
>>
>> Using HDFS, a MapReduce job will spawn enough containers to maximize
>> use of all available memory. For example, on a 3-node cluster with
>> 172 GB of memory and each map task allocating 2 GB, about 86
>> application containers are created.
>>
>> On a filesystem that isn't HDFS (like NFS or, in my use case, a
>> parallel filesystem), a MapReduce job will only allocate a subset of
>> the available capacity (e.g., on the same 3-node cluster, only about
>> 25-40 containers are created). Since I'm using a parallel
>> filesystem, I'm not as concerned with the bottlenecks one would
>> encounter with NFS.
>>
>> Is there a YARN (yarn-site.xml) or MapReduce (mapred-site.xml)
>> configuration that will allow me to effectively maximize resource
>> utilization?
>>
>> Thanks,
>> Calvin



-- 
Harsh J
