Mailing-List: contact core-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: core-user@hadoop.apache.org
Received-SPF: pass (athena.apache.org: domain of jack@yelp.com designates
 209.85.198.238 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <314098690902040720u2a0a60c2med3f7323a56647ae@mail.gmail.com>
References: <239f2f640902031949n42abfcefs491ad59c4f3721f6@mail.gmail.com>
	 <314098690902032126i43627af6i4785381356d25fb2@mail.gmail.com>
	 <314098690902032128i46a37b6flf31deb47e4220bd3@mail.gmail.com>
	 <f83401cd0902040615m6382953na996dab0d24fdc58@mail.gmail.com>
	 <314098690902040720u2a0a60c2med3f7323a56647ae@mail.gmail.com>
Date: Wed, 4 Feb 2009 16:53:11 -0800
Message-ID: <239f2f640902041653t5f41dde1uaecebe4578d92b9e@mail.gmail.com>
Subject: Re: Value-Only Reduce Output
From: Jack Stahl <jack@yelp.com>
To: core-user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=000e0cd156061d19a204622154e2

--000e0cd156061d19a204622154e2
Content-Type: text/plain; charset=ISO-8859-2
Content-Transfer-Encoding: quoted-printable

My (0.18.2) reduce src looks like this:

          write(key);
          clientOut_.write('\t');
          write(val);
          clientOut_.write('\n');

which explains why the trailing tab is unavoidable.
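
To make the point concrete, here is a minimal sketch of those four calls in isolation. The class and method names are hypothetical, and a ByteArrayOutputStream stands in for clientOut_; it models the quoted lines and is not the actual streaming source. Because the tab is a literal rather than a configured value, it is written even when the value is empty:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class HardCodedSeparatorSketch {
    // Hypothetical helper: key, literal tab, value, newline, in that order.
    static String writeRecord(String key, String val) {
        ByteArrayOutputStream clientOut = new ByteArrayOutputStream();
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        byte[] v = val.getBytes(StandardCharsets.UTF_8);
        clientOut.write(k, 0, k.length);
        clientOut.write('\t');            // hard-coded, no configuration lookup
        clientOut.write(v, 0, v.length);
        clientOut.write('\n');
        return new String(clientOut.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // An empty value still leaves the tab on the line.
        System.out.print(writeRecord("42", ""));
    }
}
```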

Thanks for your help, though, Jason!

2009/2/4 jason hadoop <jason.hadoop@gmail.com>

> For your reduce, the parameter is stream.reduce.input.field.separator, if
> you are supplying a reduce class and I believe the output format is
> TextOutputFormat...
>
> It looks like you have tried the map parameter for the separator, not the
> reduce parameter.
>
> From 0.19.0 PipeReducer:
> configure:
>      reduceOutFieldSeparator =
>          job_.get("stream.reduce.output.field.separator", "\t").getBytes("UTF-8");
>      reduceInputFieldSeparator =
>          job_.get("stream.reduce.input.field.separator", "\t").getBytes("UTF-8");
>      this.numOfReduceOutputKeyFields =
>          job_.getInt("stream.num.reduce.output.key.fields", 1);
>
> getInputSeparator:
>  byte[] getInputSeparator() {
>    return reduceInputFieldSeparator;
>  }
>
> reduce:
>          write(key);
> *          clientOut_.write(getInputSeparator());*
>          write(val);
>          clientOut_.write('\n');
>        } else {
>          // "identity reduce"
> *          output.collect(key, val);*
>         }
>
>
> On Wed, Feb 4, 2009 at 6:15 AM, Rasit OZDAS <rasitozdas@gmail.com> wrote:
>
> > I tried it myself; it doesn't work.
> > I've also tried the stream.map.output.field.separator and
> > map.output.key.field.separator parameters for this purpose; they
> > don't work either. When Hadoop sees an empty string, it uses the default
> > tab character instead.
> >
> > Rasit
> >
> > 2009/2/4 jason hadoop <jason.hadoop@gmail.com>
> > >
> > > Oops, you are using streaming, and I am not familiar with it.
> > > As a terrible hack, you could set mapred.textoutputformat.separator to the
> > > empty string, in your configuration.
> > >
> > > On Tue, Feb 3, 2009 at 9:26 PM, jason hadoop <jason.hadoop@gmail.com> wrote:
> > >
> > > > If you are using the standard TextOutputFormat, and the output collector is
> > > > passed a null for the value, there will not be a trailing tab character
> > > > added to the output line.
> > > >
> > > > output.collect( key, null );
> > > > Will give you the behavior you are looking for if your configuration is as
> > > > I expect.
> > > >
> > > >
> > > > On Tue, Feb 3, 2009 at 7:49 PM, Jack Stahl <jack@yelp.com> wrote:
> > > >
> > > >> Hello,
> > > >>
> > > > >> I'm interested in a map-reduce flow where I output only values (no keys)
> > > > >> in my reduce step.  For example, imagine the canonical word-counting
> > > > >> program where I'd like my output to be an unlabeled histogram of counts
> > > > >> instead of (word, count) pairs.
> > > > >>
> > > > >> I'm using HadoopStreaming (specifically, I'm using the dumbo module to run
> > > > >> my Python scripts).  When I simulate the map reduce using pipes and sort in
> > > > >> bash, it works fine.  However, in Hadoop, if I output a value with no tabs,
> > > > >> Hadoop appends a trailing "\t", apparently interpreting my output as a
> > > > >> (value, "") KV pair.  I'd like to avoid outputting this trailing tab if
> > > > >> possible.
> > > > >>
> > > > >> Is there a command line option that could be used to effect this?  More
> > > > >> generally, is there something wrong with outputting arbitrary strings,
> > > > >> instead of key-value pairs, in your reduce step?
> > > >>
> > > >
> > > >
> >
> >
> >
> > --
> > M. Raşit ÖZDAŞ
> >
>
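
For reference, Jason's earliest suggestion in the thread (passing a null value to the output collector) relies on the line writer skipping the separator when the value is null. Below is a simplified model of that behavior, assuming the writer emits key, separator, value, newline and omits the separator when either side is null. The class and method names are hypothetical, and this is not Hadoop's TextOutputFormat source:

```java
public class TextLineSketch {
    // Hypothetical model of a text line writer with a null-aware separator.
    static String formatLine(String key, String value, String separator) {
        boolean nullKey = (key == null);
        boolean nullValue = (value == null);
        if (nullKey && nullValue) {
            return "";                      // nothing to write at all
        }
        StringBuilder line = new StringBuilder();
        if (!nullKey) {
            line.append(key);
        }
        if (!nullKey && !nullValue) {
            line.append(separator);         // only when both sides are present
        }
        if (!nullValue) {
            line.append(value);
        }
        return line.append('\n').toString();
    }

    public static void main(String[] args) {
        System.out.print(formatLine("5", null, "\t")); // null value: no separator
        System.out.print(formatLine("5", "", "\t"));   // empty value: separator kept
    }
}
```

In this model an empty string is not the same as null: the empty value still produces the separator, which matches the (value, "") behavior described in the original question.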

--000e0cd156061d19a204622154e2--