hadoop-common-issues mailing list archives

From "Harsh J (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-8045) org.apache.hadoop.mapreduce.lib.output.MultipleOutputs does not handle many files well
Date Thu, 09 Feb 2012 15:17:59 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-8045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204582#comment-13204582 ]

Harsh J commented on HADOOP-8045:

OK, that's a little odd. But the NN does exclude DNs based on their transfer-thread load, and that
is what is affecting you -- error at the DN or not, because of the 120 write requests per task (are
you sure you want small files?). You could also raise your settings 2x and see whether the problem
eases or goes away entirely.
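
The "settings" here are presumably the DataNode transfer-thread limit that the NN's load-based
exclusion keys off; in Hadoop 1.x / CDH3 that is the (historically misspelled)
dfs.datanode.max.xcievers property. A sketch of the 2x bump in hdfs-site.xml on each DataNode --
the value shown is illustrative, so double whatever you currently run with:

    <property>
      <name>dfs.datanode.max.xcievers</name>
      <!-- Illustrative: twice a common CDH3-era setting of 4096. -->
      <value>8192</value>
    </property>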

In any case, I'm +1 on adding a specific closing API to MultipleOutputs that closes a given named
output. Can you, however, add it to mapred.lib.MultipleOutputs (the Stable API) as well?

A few comments on the existing patch, by the way (a rough sketch incorporating them follows the list):
* The javadoc can reside directly over the new function you've added. Something like: "This
function is useful in reducers where, after writing out a particular key's output, you may close
it to save on filesystem connections."
* Once closed, the writer must be removed from the collection.
* The new addition requires test cases, as nothing covers this API call right now. Please add
a test case that exercises your new method. There are existing tests in TestMultipleOutputs
(Stable API -- you need to add coverage there) and TestMRMultipleOutputs (the new, Unstable
API -- touched by your patch).
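
For illustration only, a minimal sketch of what such a method might look like in the new-API
MultipleOutputs, assuming the class keeps its open writers in a map keyed by named output (the
method name closeNamedOutput and the field name recordWriters are assumptions here, not the
actual patch):

    /**
     * Closes and releases the RecordWriter for the given named output.
     * Useful in reducers where, after writing out a particular key's
     * output, you may close it to save on filesystem connections.
     * (Hypothetical sketch; names are assumptions, not the real patch.)
     */
    public void closeNamedOutput(String namedOutput)
        throws IOException, InterruptedException {
      // Remove the writer from the collection so it is not closed a
      // second time when close() runs in the task's cleanup.
      RecordWriter<?, ?> writer = recordWriters.remove(namedOutput);
      if (writer != null) {
        writer.close(context); // frees the underlying DFS output stream
      }
    }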

> org.apache.hadoop.mapreduce.lib.output.MultipleOutputs does not handle many files well
> --------------------------------------------------------------------------------------
>                 Key: HADOOP-8045
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8045
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: io
>    Affects Versions: 0.21.0, 1.0.0
>         Environment: Cloudera CH3 release.
>            Reporter: Tarjei Huse
>              Labels: patch
>         Attachments: hadoop-multiple-outputs.patch
> We were trying to use MultipleOutputs to write one file per key. This produced the following
> exception:
> org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
> /user/me/part6/_temporary/_attempt_201202071305_0017_r_000000_2/2011-11-18-22-attempt_201202071305_0017_r_000000_2-r-00000
> could only be replicated to 0 nodes, instead of 1
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1520)
>     at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:665)
>     at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
> The error appeared once the number of files processed grew past 20 on a single developer system.
> The solution proved to be to close each RecordWriter when the reducer was finished with a key,
> something that required extending MultipleOutputs to fetch the RecordWriter -- not a good solution.
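
For illustration, assuming the hypothetical closeNamedOutput(String) method sketched above, a
reducer could release each writer as soon as it finishes a key instead of holding them all open
until cleanup(). The class name PerKeyReducer is hypothetical:

    // Hypothetical usage sketch; assumes each named output has been
    // registered in the driver via MultipleOutputs.addNamedOutput(...).
    public class PerKeyReducer extends Reducer<Text, Text, Text, Text> {
      private MultipleOutputs<Text, Text> mos;

      @Override
      protected void setup(Context context) {
        mos = new MultipleOutputs<Text, Text>(context);
      }

      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        String name = key.toString(); // must be a valid, registered output name
        for (Text value : values) {
          mos.write(name, key, value);
        }
        mos.closeNamedOutput(name); // hypothetical API from the sketch above
      }

      @Override
      protected void cleanup(Context context)
          throws IOException, InterruptedException {
        mos.close(); // closes any writers still open
      }
    }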

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

