hadoop-common-user mailing list archives

From "Koji Noguchi" <knogu...@yahoo-inc.com>
Subject RE: Multiple outputs and getmerge?
Date Tue, 21 Apr 2009 18:00:23 GMT
Stuart, 

I once used MultipleOutputFormat to write
   (mapred.work.output.dir)/type1/part-_____
   (mapred.work.output.dir)/type2/part-_____
    ...

and the JobTracker took care of renaming them to
   (mapred.output.dir)/type{1,2}/part-_____
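
A minimal sketch of the path-routing idea described above, written as plain Java so it runs without Hadoop on the classpath. The class name TypeRouter, the method fileNameFor, and the "recordType" label are hypothetical. In the old org.apache.hadoop.mapred API this logic would most likely sit in an override of MultipleOutputFormat.generateFileNameForKeyValue, but that wiring is an assumption and is not shown here.

```java
// Hypothetical helper: picks the relative output path for one record.
final class TypeRouter {
    private TypeRouter() {}

    /**
     * @param recordType hypothetical label for the record, e.g. "type1"
     * @param leafName   the default leaf file name, e.g. "part-00000"
     * @return a relative path that places the leaf under a per-type directory
     */
    static String fileNameFor(String recordType, String leafName) {
        if (recordType == null || recordType.isEmpty()) {
            // No type available: keep the default location.
            return leafName;
        }
        // One subdirectory per type, so each type can be merged separately.
        return recordType + "/" + leafName;
    }
}
```

With each type in its own directory, getmerge can be pointed at one subdirectory at a time.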

Would that work for you?

Koji

-----Original Message-----
From: Stuart White [mailto:stuart.white1@gmail.com] 
Sent: Monday, April 20, 2009 1:15 PM
To: core-user@hadoop.apache.org
Subject: Multiple outputs and getmerge?

I've written an MR job with multiple outputs.  The "normal" output goes
to files named part-XXXXX, and my secondary output records go to files
I've chosen to name "ExceptionDocuments" (which are therefore named
"ExceptionDocuments-m-XXXXX").

I'd like to pull merged copies of these files to my local filesystem
(two separate merged files, one containing the "normal" output and one
containing the ExceptionDocuments output).  But since Hadoop writes
both of these outputs to files in the same directory, when I issue
"hadoop dfs -getmerge", I get a single file that contains both
outputs.

To get around this, I have to move files around on HDFS so that my
different outputs are in different directories.

Is this the best/only way to deal with this?  It would be better if
Hadoop offered the option of writing different outputs to different
output directories, or if getmerge accepted a file prefix selecting
which files to merge.
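
For illustration only, here is a rough sketch of what a prefix-aware merge could look like. It assumes the job's output directory has already been copied to the local filesystem, and it uses only the Java standard library, not the Hadoop FileSystem API. The class name PrefixMerge and the method mergeByPrefix are hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: concatenates local files whose names share a prefix.
final class PrefixMerge {
    private PrefixMerge() {}

    /**
     * Appends every regular file in dir whose name starts with prefix to
     * target, in file-name order. The target should live outside dir so it
     * is never picked up as an input. Returns the number of files merged.
     */
    static int mergeByPrefix(Path dir, String prefix, Path target) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)
                        && p.getFileName().toString().startsWith(prefix)) {
                    parts.add(p);
                }
            }
        }
        // Sort by name so part-00000 comes before part-00001, and so on.
        parts.sort(Comparator.comparing((Path p) -> p.getFileName().toString()));

        // Creates the target if missing, truncates it otherwise.
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path p : parts) {
                Files.copy(p, out);
            }
        }
        return parts.size();
    }
}
```

Calling mergeByPrefix once with "part-" and once with "ExceptionDocuments-" would produce the two separate merged files.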

Thanks!
