hadoop-mapreduce-issues mailing list archives

From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] Updated: (MAPREDUCE-1374) Reduce memory footprint of FileSplit
Date Wed, 05 May 2010 00:46:05 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated MAPREDUCE-1374:

    Status: Open  (was: Patch Available)

* The unit test mixes JUnit3 and JUnit4; instead of extending {{TestCase}}, statically importing
the asserts would keep it consistently JUnit4.
* I agree with Todd/Amar/Tom on using a {{WeakHashMap}} instead of {{String::intern}} for
the hosts. The guarantees offered by the latter are much stronger than what is required to
support this case.
* Using {{String::intern}} for the input path is taking a good idea too far; for long-running
clients submitting many jobs, the cache footprint could be excessive. Further, if the file
is splittable, creating several splits with the same (immutable) {{Path}} reference is pretty
cheap. The space savings effected by making this member a {{String}} do not seem very compelling.
* If your tests suggest that caching input paths is important, then keeping a {{WeakHashMap<Path,String>}}
would avoid the overhead of {{URI::toString}} and the temporary objects it creates (as opposed
to computing the result and then looking it up in the cache).
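
To illustrate the {{WeakHashMap}} suggestion for host names, here is a minimal sketch. It is not
taken from any of the attached patches; the class name {{HostCache}} and its methods are
hypothetical. It assumes the goal is only to share one {{String}} instance per distinct host
while letting unused entries be garbage collected. The values are wrapped in {{WeakReference}}
because a strong reference from the value back to the key would keep the entry alive indefinitely.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical helper (not part of Hadoop): returns one shared instance per
// distinct host name without the JVM-wide lifetime of String.intern().
final class HostCache {
  // Key and value refer to the same String. The value is held weakly so the
  // entry can be collected once no split references that String any more.
  private final Map<String, WeakReference<String>> cache =
      new WeakHashMap<String, WeakReference<String>>();

  // Returns the cached instance equal to host, registering host itself as
  // the shared instance if no live entry exists yet.
  synchronized String canonical(String host) {
    WeakReference<String> ref = cache.get(host);
    String cached = (ref == null) ? null : ref.get();
    if (cached != null) {
      return cached;
    }
    cache.put(host, new WeakReference<String>(host));
    return host;
  }

  // Number of entries currently tracked (may include not-yet-expunged ones).
  synchronized int size() {
    return cache.size();
  }
}
```

In the loop from the issue description, a call such as {{cache.canonical(host.node.getName().split(":")[0])}}
would take the place of {{.intern()}}. The same shape would apply to a {{WeakHashMap<Path,String>}}
for input paths, where a plain {{String}} value is safe because it does not reference the key.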

> Reduce memory footprint of FileSplit
> ------------------------------------
>                 Key: MAPREDUCE-1374
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1374
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 0.20.1, 0.21.0, 0.22.0
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>             Fix For: 0.21.0, 0.22.0
>         Attachments: MAPREDUCE-1374.1.patch, MAPREDUCE-1374.2.patch, MAPREDUCE-1374.3.patch
> We can have many FileInput objects in the memory, depending on the number of mappers.
> It will save tons of memory on JobTracker and JobClient if we intern those Strings for
> host names.
> {code}
> FileInputFormat.java:
>       for (NodeInfo host: hostList) {
>         // Strip out the port number from the host name
> -        retVal[index++] = host.node.getName().split(":")[0];
> +        retVal[index++] = host.node.getName().split(":")[0].intern();
>         if (index == replicationFactor) {
>           done = true;
>           break;
>         }
>       }
> {code}
> More on String.intern(): http://www.javaworld.com/javaworld/javaqa/2003-12/01-qa-1212-intern.html
> It will also save a lot of memory by changing the class of {{file}} from {{Path}} to
> {{String}}. {{Path}} contains a {{java.net.URI}} which internally contains ~10 String fields.
> This will also be a huge saving.
> {code}
>   private Path file;
> {code}

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
