hadoop-mapreduce-issues mailing list archives

From "dhruba borthakur (JIRA)" <j...@apache.org>
Subject [jira] Updated: (MAPREDUCE-1423) Improve performance of CombineFileInputFormat when multiple pools are configured
Date Wed, 03 Feb 2010 23:29:28 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated MAPREDUCE-1423:

    Attachment: CombineFileInputFormatPerformance.txt

The conversion of strings to Path objects now occurs only once. When multiple pools are
configured, this cuts split-creation time by roughly a factor of four: a job that needed
6 hours to create splits now takes about 1.5 hours.
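
The attached patch is not reproduced in this message, so the following is only a minimal
sketch of the general idea, not the actual change. It assumes java.net.URI as a stand-in
for Hadoop's Path, treats each pool as a simple path prefix, and uses hypothetical class,
method, and counter names. It contrasts converting each file string once per pool with
converting it once up front and reusing the result.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: java.net.URI stands in for Hadoop's Path, and every
// name here is hypothetical rather than taken from the MAPREDUCE-1423 patch.
class PathConversionSketch {
    // Counts how many String -> URI conversions were performed.
    static long conversions = 0;

    // Stand-in for the String -> Path conversion (and Path.toUri()).
    static URI convert(String file) {
        conversions++;
        return URI.create(file);
    }

    // Before: every pool converts every file string again.
    static int countMatchesNaive(List<String> files, List<String> poolPrefixes) {
        int matches = 0;
        for (String prefix : poolPrefixes) {
            for (String file : files) {
                if (convert(file).getPath().startsWith(prefix)) {
                    matches++;
                }
            }
        }
        return matches;
    }

    // After: convert each file once, then reuse the results for all pools.
    static int countMatchesHoisted(List<String> files, List<String> poolPrefixes) {
        List<String> paths = new ArrayList<>(files.size());
        for (String file : files) {
            paths.add(convert(file).getPath());
        }
        int matches = 0;
        for (String prefix : poolPrefixes) {
            for (String path : paths) {
                if (path.startsWith(prefix)) {
                    matches++;
                }
            }
        }
        return matches;
    }

    // Builds made-up pool prefixes such as "/data/pool7/".
    static List<String> samplePools(int pools) {
        List<String> prefixes = new ArrayList<>();
        for (int p = 0; p < pools; p++) {
            prefixes.add("/data/pool" + p + "/");
        }
        return prefixes;
    }

    // Builds made-up file names spread evenly across the pools.
    static List<String> sampleFiles(int count, int pools) {
        List<String> files = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            files.add("hdfs://namenode/data/pool" + (i % pools) + "/part-" + i);
        }
        return files;
    }

    public static void main(String[] args) {
        List<String> pools = samplePools(100);
        List<String> files = sampleFiles(300, 100);

        conversions = 0;
        int naiveMatches = countMatchesNaive(files, pools);
        System.out.println("naive:   matches=" + naiveMatches + " conversions=" + conversions);

        conversions = 0;
        int hoistedMatches = countMatchesHoisted(files, pools);
        System.out.println("hoisted: matches=" + hoistedMatches + " conversions=" + conversions);
    }
}
```

In this sketch the naive version performs pools x files conversions while the hoisted
version performs one per file. If the original code converted every file for every pool,
the 10000 pools and 30000 files in the report would imply up to 300 million conversions,
against 30000 when each path is converted once.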

> Improve performance of CombineFileInputFormat when multiple pools are configured
> --------------------------------------------------------------------------------
>                 Key: MAPREDUCE-1423
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1423
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: client
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>         Attachments: CombineFileInputFormatPerformance.txt
> I have a map-reduce job that uses CombineFileInputFormat. It is configured with 10000
> pools and 30000 files. Creating the splits takes more than an hour. The reason is
> that CombineFileInputFormat.getSplits() converts the same path from a String to a Path
> object multiple times, once for each pool. Similarly, it calls Path.toUri() multiple
> times. This code can be optimized.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
