hadoop-common-user mailing list archives

From Yuri Pradkin <y...@isi.edu>
Subject Re: one key per output part file
Date Thu, 03 Apr 2008 16:07:52 GMT
Here is how we (attempt to) do it:

The reducer (in streaming) writes one file for each distinct key it receives as input.
Here's some example code in Perl:
    # mapred_output_dir is set by streaming; it points at this task's output directory
    my $envdir = $ENV{'mapred_output_dir'};
    # strip a leading "file:" scheme; if it was present, output goes to a local/NFS path
    my $fs = ($envdir =~ s/^file://);
    if ($fs) {
        # output goes onto NFS
        open(FILEOUT, ">$envdir/${filename}.png")
            or die "$0: cannot open $envdir/${filename}.png: $!\n";
    } else {
        # output specifies DFS; write locally first (or pipe to dfs -put)
        open(FILEOUT, ">/tmp/${filename}.png")
            or die "Cannot open /tmp/${filename}.png: $!\n";
    }
    ... # write FILEOUT, then close it
    if ($fs) {
        # for NFS just fix permissions
        chmod 0664, "$envdir/$filename.png";
        chmod 0775, "$envdir";
    } else {
        # for HDFS, -put the file and remove the local copy only if the put succeeded
        # (system() returns 0 on success, hence the "== 0" test)
        my $hadoop = $ENV{HADOOP_HOME} . "/bin/hadoop";
        $ENV{HADOOP_HEAPSIZE} = 300;
        system($hadoop, "dfs", "-put", "/tmp/${filename}.png", "$envdir/${filename}.png") == 0
            and unlink "/tmp/${filename}.png";
    }
   
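For context, here is a minimal sketch (not from our actual code) of the reducer loop that
would surround the snippet above.  It assumes the usual streaming convention of sorted
"key<TAB>value" lines on STDIN, and flush_key() is a hypothetical helper standing in for
the open/write/chmod-or-put logic shown above:

    #!/usr/bin/perl
    # Sketch only: group sorted "key\tvalue" input by key, emit one file per key.
    use strict;
    use warnings;

    my ($curkey, @values);
    while (my $line = <STDIN>) {
        chomp $line;
        my ($key, $value) = split /\t/, $line, 2;
        if (defined $curkey && $key ne $curkey) {
            flush_key($curkey, \@values);   # hypothetical helper: open/write/-put as above
            @values = ();
        }
        $curkey = $key;
        push @values, $value;
    }
    flush_key($curkey, \@values) if defined $curkey;   # don't forget the last key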
If the -output option to streaming specifies an NFS directory, everything works, except
that it doesn't scale.  We must use the mapred_output_dir environment variable because it
points to the task's temporary output directory, and you don't want two or more instances
of the same task writing to the same file.

If -output points to HDFS, however, the code above bombs while trying to -put a file,
with an error something like "could not reserve enough memory for java vm heap/libs",
at which point Java dies.  If anyone has any suggestions on how to fix that, I'd
appreciate it.

Thanks,

 -Yuri
 
On Tuesday 01 April 2008 05:57:31 pm Ashish Venugopal wrote:
> Hi, I am using Hadoop streaming and I am trying to create a MapReduce job that
> will generate output where a single key is found in a single output part file.
> Does anyone know how to ensure this condition?  I want each reduce task (no
> matter how many are specified) to receive key-value output for only a single
> key at a time, process the key-value pairs for that key, write an output
> part-XXX file, and only then process the next key.
>
> Here is the task that I am trying to accomplish:
>
> Input: Corpus T (lines of text), Corpus V (each line has 1 word)
> Output: Each part-XXX should contain the lines of T that contain the word
> from line XXX in V.
>
> Any help/ideas are appreciated.
>
> Ashish


