hadoop-common-user mailing list archives

From Jack Stahl <j...@yelp.com>
Subject Re: Value-Only Reduce Output
Date Thu, 05 Feb 2009 00:53:11 GMT
My (0.18.2) reduce src looks like this:

          write(key);
          clientOut_.write('\t');
          write(val);
          clientOut_.write('\n');

which explains why the trailing tab is unavoidable.
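
To make the effect concrete, here is a minimal standalone sketch that mimics the order of the four writes above. The class and method names (HardCodedTabSketch, formatRecord, visible) are hypothetical and are not part of Hadoop; the sketch only assumes that the tab is written as a literal between key and value, as in the snippet.

```java
// Hypothetical stand-in for the four write calls quoted above.
// It is not the Hadoop class; it only mimics the order of the writes.
class HardCodedTabSketch {

    // Builds one record the same way: key, a literal tab, value, newline.
    static String formatRecord(String key, String val) {
        StringBuilder out = new StringBuilder();
        out.append(key);
        out.append('\t');   // always written, whatever val contains
        out.append(val);
        out.append('\n');
        return out.toString();
    }

    // Makes tab and newline visible when printing.
    static String visible(String s) {
        return s.replace("\t", "\\t").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        // An ordinary (key, value) record.
        System.out.println(visible(formatRecord("word", "3")));
        // A record with an empty value still carries the tab.
        System.out.println(visible(formatRecord("3", "")));
    }
}
```

With an empty value the record still contains the tab, because nothing in this write sequence consults a configurable separator.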

Thanks for your help, though, Jason!

2009/2/4 jason hadoop <jason.hadoop@gmail.com>

> For your reduce, the parameter is stream.reduce.input.field.separator, if
> you are supplying a reduce class and I believe the output format is
> TextOutputFormat...
>
> It looks like you have tried the map parameter for the separator, not the
> reduce parameter.
>
> From 0.19.0 PipeReducer:
> configure:
>      reduceOutFieldSeparator =
>          job_.get("stream.reduce.output.field.separator", "\t").getBytes("UTF-8");
>      reduceInputFieldSeparator =
>          job_.get("stream.reduce.input.field.separator", "\t").getBytes("UTF-8");
>      this.numOfReduceOutputKeyFields =
>          job_.getInt("stream.num.reduce.output.key.fields", 1);
>
> getInputSeparator:
>  byte[] getInputSeparator() {
>    return reduceInputFieldSeparator;
>  }
>
> reduce:
>          write(key);
> *          clientOut_.write(getInputSeparator());*
>          write(val);
>          clientOut_.write('\n');
>        } else {
>          // "identity reduce"
> *          output.collect(key, val);*
>         }
>
>
> On Wed, Feb 4, 2009 at 6:15 AM, Rasit OZDAS <rasitozdas@gmail.com> wrote:
>
> > I tried it myself, and it doesn't work.
> > I've also tried the stream.map.output.field.separator and
> > map.output.key.field.separator parameters for this purpose; they
> > don't work either. When Hadoop sees an empty string, it uses the
> > default tab character instead.
> >
> > Rasit
> >
> > 2009/2/4 jason hadoop <jason.hadoop@gmail.com>
> > >
> > > Oops, you are using streaming, and I am not familiar with it.
> > > As a terrible hack, you could set mapred.textoutputformat.separator to
> > > the empty string in your configuration.
> > >
> > > On Tue, Feb 3, 2009 at 9:26 PM, jason hadoop <jason.hadoop@gmail.com> wrote:
> > >
> > > > If you are using the standard TextOutputFormat, and the output
> > > > collector is passed a null for the value, there will not be a
> > > > trailing tab character added to the output line.
> > > >
> > > > output.collect( key, null );
> > > > Will give you the behavior you are looking for if your configuration
> > > > is as I expect.
> > > >
> > > >
> > > > On Tue, Feb 3, 2009 at 7:49 PM, Jack Stahl <jack@yelp.com> wrote:
> > > >
> > > >> Hello,
> > > >>
> > > >> I'm interested in a map-reduce flow where I output only values (no
> > > >> keys) in my reduce step.  For example, imagine the canonical
> > > >> word-counting program where I'd like my output to be an unlabeled
> > > >> histogram of counts instead of (word, count) pairs.
> > > >>
> > > >> I'm using HadoopStreaming (specifically, I'm using the dumbo module
> > > >> to run my Python scripts).  When I simulate the map-reduce using
> > > >> pipes and sort in bash, it works fine.  However, in Hadoop, if I
> > > >> output a value with no tabs, Hadoop appends a trailing "\t",
> > > >> apparently interpreting my output as a (value, "") KV pair.  I'd
> > > >> like to avoid outputting this trailing tab if possible.
> > > >>
> > > >> Is there a command line option that could be used to effect this?
> > > >> More generally, is there something wrong with outputting arbitrary
> > > >> strings, instead of key-value pairs, in your reduce step?
> > > >>
> > > >
> > > >
> >
> >
> >
> > --
> > M. Raşit ÖZDAŞ
> >
>
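
For reference, the two suggestions quoted above (collecting a null value, or setting an empty separator) both depend on how a text line writer joins key and value. The following is a simplified model of that rule, assuming the separator is written only when both key and value are non-null. LineWriterSketch and writeLine are hypothetical names used for illustration, not Hadoop APIs.

```java
// Simplified, hypothetical model of a key/value text line writer.
// Assumption: the separator is emitted only when both parts are present.
class LineWriterSketch {

    static String writeLine(String key, String value, String separator) {
        boolean nullKey = (key == null);
        boolean nullValue = (value == null);
        if (nullKey && nullValue) {
            return "";                     // nothing to write at all
        }
        StringBuilder out = new StringBuilder();
        if (!nullKey) {
            out.append(key);
        }
        if (!nullKey && !nullValue) {
            out.append(separator);         // skipped when the value is null
        }
        if (!nullValue) {
            out.append(value);
        }
        out.append('\n');
        return out.toString();
    }

    public static void main(String[] args) {
        // Null value: no separator is written.
        System.out.print(writeLine("3", null, "\t"));
        // Empty (non-null) value: the separator is still written.
        System.out.print(writeLine("3", "", "\t"));
        // Empty separator: the "terrible hack" from the thread.
        System.out.print(writeLine("3", "", ""));
    }
}
```

Under this model an empty string and a null behave differently: only the null value suppresses the separator, which is consistent with the trailing tab appearing when streaming hands over an empty value.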
