hadoop-common-user mailing list archives

From "Joydeep Sen Sarma" <jssa...@facebook.com>
Subject RE: one key per output part file
Date Wed, 02 Apr 2008 07:36:35 GMT
curious - why do we need a file per XXX?

- if further processing is going to be done in Hadoop itself - then it's hard to see a reason.
One can always have multiple entries in the same HDFS file. Note that it's possible to align
map task splits on sort-key boundaries in pre-sorted data (it's not something that Hadoop
supports natively right now - but you can write your own InputFormat to do this). Meaning -
subsequent processing that wants all entries corresponding to XXX in one group (as in a reducer)
can do so in the map phase itself (i.e., it's damned cheap and doesn't require sorting the data
all over again).
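[Editor's note: the split-alignment idea above can be illustrated outside Hadoop. The sketch below is a plain-Python stand-in for what a custom InputFormat's split computation would do; `align_splits`, its in-memory record list, and the target size parameter are illustrative assumptions, not a Hadoop API.]

```python
def align_splits(records, target_split_size):
    """Given pre-sorted (key, value) records, compute split boundaries
    that never fall inside a run of identical keys, so every map task
    sees all of the records for each key it receives.

    Stand-in for a custom InputFormat's getSplits(): real code would
    work with byte offsets in HDFS files, not an in-memory list."""
    splits = []
    start = 0
    n = len(records)
    while start < n:
        # tentative end of this split
        end = min(start + target_split_size, n)
        # advance past any records sharing the key at the boundary,
        # so a key group is never cut in half
        while end < n and records[end][0] == records[end - 1][0]:
            end += 1
        splits.append(records[start:end])
        start = end
    return splits
```

Because each key group lands wholly inside one split, a grouping step that would otherwise need a reduce can run in the map phase.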

- if the data needs to be exported (either to a SQL db or an external file system) - then
why not do so directly from the reducer (instead of trying to create these intermediate small
files in HDFS)? Data can be written to tmp tables/files and overwritten in case the
reducer re-runs (and then committed to the final location once the job is complete).
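[Editor's note: the write-to-tmp-then-commit pattern described above can be sketched with a local filesystem rename standing in for the final commit; `export_partition`, its tab-separated record format, and the file layout are assumptions for illustration.]

```python
import os
import tempfile


def export_partition(records, final_path):
    """Write a reducer's output to a temporary file first, then atomically
    rename it into place. If the reducer is re-run, the rename simply
    overwrites the previous attempt, so the commit is idempotent."""
    dirname = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "w") as f:
            for key, value in records:
                f.write(f"{key}\t{value}\n")
        os.replace(tmp_path, final_path)  # atomic commit step
    except BaseException:
        # a failed attempt leaves no partial output behind
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

The same shape works against a SQL db: write into a tmp table, then rename/swap it into the final table once the job completes.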

-----Original Message-----
From: arv.andrew@gmail.com on behalf of Ashish Venugopal
Sent: Tue 4/1/2008 6:42 PM
To: core-user@hadoop.apache.org
Subject: Re: one key per output part file
This seems like a reasonable solution - but I am using Hadoop streaming and
my reducer is a Perl script. Is it possible to handle side-effect files in
streaming? I haven't found anything that indicates that you can...


On Tue, Apr 1, 2008 at 9:13 PM, Ted Dunning <tdunning@veoh.com> wrote:

> Try opening the desired output file in the reduce method. Make sure that
> the output files are relative to the correct task-specific directory (look
> for side-effect files on the wiki).
> On 4/1/08 5:57 PM, "Ashish Venugopal" <arv@andrew.cmu.edu> wrote:
> > Hi, I am using Hadoop streaming and I am trying to create a MapReduce job
> > that will generate output where a single key is found in a single output
> > part file.
> > Does anyone know how to ensure this condition? I want each reduce task
> > (no matter how many are specified) to receive the key-value output for
> > only a single key at a time, process the key-value pairs for this key,
> > write an output part-XXX file, and only then process the next key.
> >
> > Here is the task that I am trying to accomplish:
> >
> > Input: Corpus T (lines of text), Corpus V (each line has 1 word)
> > Output: Each part-XXX should contain the lines of T that contain the
> > word from line XXX in V.
> >
> > Any help/ideas are appreciated.
> >
> > Ashish
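[Editor's note: Ted's open-a-file-per-key-in-the-reducer suggestion quoted above looks roughly like this in outline. A plain local directory stands in for the task-specific side-effect directory; `reduce_partition`, the grouped-records input, and the `part-<key>` naming are illustrative, not Hadoop APIs.]

```python
import os


def reduce_partition(grouped_records, work_dir):
    """Inside the reduce phase, open one output file per key under the
    task's work directory. In real Hadoop, files written to the
    task-specific side-effect directory are promoted to the job output
    only when the task commits, so re-run reducers leave no partial
    output; here a plain directory stands in for that mechanism."""
    for key, values in grouped_records:
        # one side-effect file per key
        path = os.path.join(work_dir, f"part-{key}")
        with open(path, "w") as f:
            for v in values:
                f.write(f"{key}\t{v}\n")
```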
