hadoop-common-user mailing list archives

From "Ashish Venugopal" <...@andrew.cmu.edu>
Subject Re: one key per output part file
Date Wed, 02 Apr 2008 15:48:30 GMT
Thanks for this information - I might be missing something here, but can my
Perl script reducer (which is run via streaming, and is not linked to HDFS
libraries) just start writing to HDFS?
I thought I would have to write it locally, i.e. in "." for the reduce script,
and then rely on the MapReduce mechanism to promote the file into the output
directory...
Thanks for all the help!

Ashish
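
The local-write approach described above can be sketched as a streaming reducer. This is a minimal sketch in Python rather than Perl, under several assumptions that the thread does not confirm: records arrive on stdin as tab-separated key/value lines grouped by key, each key is safe to use in a file name (for example the line number XXX of the word in V), and "per_key_output" is a hypothetical directory name relative to the task's working directory. Whether streaming then promotes such files into the job output directory is exactly the open question here, so the sketch only covers the writing step.

```python
import itertools
import os
import sys


def split_line(line):
    """Split one streaming record into (key, value) at the first tab."""
    key, _, value = line.rstrip("\n").partition("\t")
    return key, value


def write_one_file_per_key(lines, out_dir):
    """Write the values of each key to its own file and return (key, path) pairs.

    Assumes `lines` arrive grouped by key, as a streaming reducer
    sees them on stdin after the sort phase.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    records = (split_line(line) for line in lines)
    for key, group in itertools.groupby(records, key=lambda record: record[0]):
        # Assumption: the key is a safe file-name suffix, e.g. a line number.
        path = os.path.join(out_dir, "part-" + key)
        with open(path, "w") as handle:
            for _, value in group:
                handle.write(value + "\n")
        written.append((key, path))
    return written


def main():
    # "per_key_output" is a hypothetical name, relative to the task's ".".
    for key, path in write_one_file_per_key(sys.stdin, "per_key_output"):
        # Emit key -> file name so the regular part file serves as an index.
        print("%s\t%s" % (key, path))
```

Calling main() at the end of the script turns it into a reducer that reads from stdin. Because each file is opened only when its key starts and closed before the next key begins, one key's output is complete before the next key is processed, no matter how many keys a reduce task receives.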



On Wed, Apr 2, 2008 at 11:22 AM, Ted Dunning <tdunning@veoh.com> wrote:

>
>
> Writing to HDFS leaves the files as accessible as anything else, if not
> more so.
>
> You can retrieve a file using a URL of the form:
>
>  http://<name-server>/data/<hdfs-path>
>
> Similarly, you can list a directory using a similar URL (whose details I
> forget for the nonce).
>
> On 4/2/08 7:57 AM, "Ashish Venugopal" <arv@andrew.cmu.edu> wrote:
>
> > On Wed, Apr 2, 2008 at 3:36 AM, Joydeep Sen Sarma <jssarma@facebook.com>
> > wrote:
> >
> >> curious - why do we need a file per XXX?
> >>
> >> - if the data needs to be exported (either to a sql db or an external file
> >> system) - then why not do so directly from the reducer (instead of trying to
> >> create these intermediate small files in hdfs)? data can be written to tmp
> >> tables/files and can be overwritten in case the reducer re-runs (and then
> >> committed to final location once the job is complete)
> >>
> >
> > The second case (data needs to be exported) is the reason that I have. Each
> > of these small files is used in an external process. This seems like a good
> > solution - only question then is where can these files be written to safely?
> > Local directory? /tmp?
> >
> > Ashish
> >
> >
> >
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: arv.andrew@gmail.com on behalf of Ashish Venugopal
> >> Sent: Tue 4/1/2008 6:42 PM
> >> To: core-user@hadoop.apache.org
> >> Subject: Re: one key per output part file
> >>
> >> This seems like a reasonable solution - but I am using Hadoop streaming and
> >> my reducer is a perl script. Is it possible to handle side-effect files in
> >> streaming? I haven't found anything that indicates that you can...
> >>
> >> Ashish
> >>
> >> On Tue, Apr 1, 2008 at 9:13 PM, Ted Dunning <tdunning@veoh.com> wrote:
> >>
> >>>
> >>>
> >>> Try opening the desired output file in the reduce method.  Make sure that
> >>> the output files are relative to the correct task specific directory (look
> >>> for side-effect files on the wiki).
> >>>
> >>>
> >>>
> >>> On 4/1/08 5:57 PM, "Ashish Venugopal" <arv@andrew.cmu.edu> wrote:
> >>>
> >>>> Hi, I am using Hadoop streaming and I am trying to create a MapReduce that
> >>>> will generate output where a single key is found in a single output part
> >>>> file.
> >>>> Does anyone know how to ensure this condition? I want the reduce task (no
> >>>> matter how many are specified), to only receive key-value output from a
> >>>> single key each, process the key-value pairs for this key, write an output
> >>>> part-XXX file, and only then process the next key.
> >>>>
> >>>> Here is the task that I am trying to accomplish:
> >>>>
> >>>> Input: Corpus T (lines of text), Corpus V (each line has 1 word)
> >>>> Output: Each part-XXX should contain the lines of T that contain the word
> >>>> from line XXX in V.
> >>>>
> >>>> Any help/ideas are appreciated.
> >>>>
> >>>> Ashish
> >>>
> >>>
> >>
> >>
>
>
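
The retrieval URL quoted above, http://<name-server>/data/<hdfs-path>, can be built and fetched from an external process along these lines. This is a rough sketch, not a confirmed recipe: the helper names are hypothetical, the host and port passed in are placeholders for whatever the cluster's name server actually listens on, and dropping the leading "/" of the HDFS path before appending it is an assumption about how the two parts join.

```python
from urllib.parse import quote
from urllib.request import urlopen


def hdfs_data_url(name_server, hdfs_path):
    """Build http://<name-server>/data/<hdfs-path> from an HDFS path."""
    # Percent-encode the path but keep "/" as the separator.
    # Assumption: the leading "/" of the HDFS path is dropped when joining.
    return "http://%s/data/%s" % (name_server, quote(hdfs_path.lstrip("/")))


def fetch_hdfs_file(name_server, hdfs_path):
    """Return the file's bytes; only works against a reachable name server."""
    with urlopen(hdfs_data_url(name_server, hdfs_path)) as response:
        return response.read()
```

For example, hdfs_data_url("namenode.example.com:50070", "/user/ashish/out/part-1") produces "http://namenode.example.com:50070/data/user/ashish/out/part-1", where the host, port, and path are made-up values.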
