hadoop-common-user mailing list archives

From "Runping Qi" <runp...@YAHOO-INC.COM>
Subject RE: writing output files in hadoop streaming
Date Mon, 14 Jan 2008 21:06:13 GMT

One way to achieve your goal is to implement your own
OutputFormat/RecordWriter classes.
Your reducer will emit all the key/value pairs as in the normal case.
Your record writer class can open multiple output files and dispatch
each key/value pair to the appropriate file based on the actual values.
This way, the Hadoop framework takes care of all the issues related to
the namespace and the necessary cleanup of the output files.


> -----Original Message-----
> From: Yuri Pradkin [mailto:yuri@ISI.EDU]
> Sent: Monday, January 14, 2008 12:33 PM
> To: hadoop-user@lucene.apache.org
> Subject: writing output files in hadoop streaming
> Hi,
> We've been using Hadoop streaming for the last 3-4 months, and it has
> all worked out fine except for one little problem: in some situations
> a hadoop reduce job gets multiple key groups and needs to write a
> separate binary output file for each group.  However, when a reduce
> task takes too long and there is spare capacity, the task may be
> replicated on another node, and the two are basically racing each
> other.  One finishes cleanly and the other is terminated.  Hadoop
> takes care to remove the terminated job's output from HDFS, but since
> we're writing files from scripts, it's up to us to separate the output
> of cleanly finished tasks from the output of tasks that are terminated
> prematurely.
> Does somebody have answers to the following questions:
> 1. Is there an easy way for a script launched by Hadoop streaming to
>    tell whether it was terminated before it received complete input?
>    As far as I was able to ascertain, no signals are sent to the Unix
>    jobs; they just stop receiving data from STDIN.  The only way that
>    seems to work for me is to process all input, then write something
>    to STDOUT/STDERR and see if that causes a SIGPIPE.  But this is
>    ugly; I hope there is a better solution.
> 2. Is there a good way to write multiple HDFS files from a streaming
>    script *and have Hadoop clean up those files* when it decides to
>    destroy the task?  If there were just one file, I could simply use
>    STDOUT, but dumping multiple binary files to STDOUT is not pretty.
> We are writing output files to an NFS partition shared among all
> nodes, which makes it all slightly more complicated because of
> possible file overwrites.  Our current solution, which is not pretty
> but avoids this problem, is to write out files with random names
> (created with mktemp) and write to STDOUT a command renaming each file
> to its desired name.  As a post-processing stage, I execute all those
> commands and delete the remaining temporary files as
> duplicates/incompletes.
> Thanks,
>   -Yuri
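[Editorial note: the SIGPIPE probe described in question 1 above can be sketched as follows. This is a sketch of the workaround Yuri describes, not a supported Hadoop facility; the function name is hypothetical. The idea: after consuming all of STDIN, attempt one write downstream, and treat a broken pipe (EPIPE) as evidence the task was torn down.]

```python
import sys

def probe_output_alive(stream=None):
    """After consuming all of stdin, try to write one marker line
    downstream.  If the parent process has already torn the pipe down
    (e.g. this speculative task lost the race and was killed), the
    write fails with EPIPE, and the script can skip finalizing its
    side-effect files."""
    if stream is None:
        stream = sys.stdout
    try:
        stream.write("#done\n")
        stream.flush()
        return True
    except OSError:  # BrokenPipeError (EPIPE) is a subclass of OSError
        return False
```

As Yuri notes, this is ugly; it also only detects termination after all input has been processed, so wasted work is not avoided, only the commit step.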
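[Editorial note: the mktemp-plus-rename workaround described in the last paragraph of the question can be sketched like this. It is a minimal illustration under assumed conventions: the helper names, the part-tmp- prefix, and the single shared output directory are all hypothetical. The point of the pattern: only a task that runs to completion emits its rename command on STDOUT, so killed duplicates leave behind only unrenamed temp files, which the post-processing step sweeps away.]

```python
import os
import tempfile

def write_group_output(data, final_name, out_dir):
    """Write one key group's binary output under a random temp name and
    return the rename command the script would emit on STDOUT."""
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, prefix="part-tmp-")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return "mv %s %s" % (tmp_path, os.path.join(out_dir, final_name))

def post_process(commands, out_dir):
    """Post-processing stage: apply the collected rename commands, then
    delete leftover temp files from killed/duplicate tasks."""
    for cmd in commands:
        _, src, dst = cmd.split()  # assumes no spaces in paths
        os.replace(src, dst)
    for name in os.listdir(out_dir):
        if name.startswith("part-tmp-"):
            os.remove(os.path.join(out_dir, name))
```

The rename acts as the commit: a temp file whose command never reached the post-processor is, by construction, the output of a task that did not finish cleanly.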
