hadoop-common-user mailing list archives

From Dmitry Sivachenko <trtrmi...@gmail.com>
Subject Re: Writing output from streaming task without dealing with key/value
Date Wed, 10 Sep 2014 18:12:35 GMT

On 10 Sep 2014, at 22:05, Rich Haase <rdhaase@gmail.com> wrote:

> In Python, or any streaming program, just set the output value to the empty string and
> you will get something like "key"\t"".

I see, but I want to use many existing programs (like UNIX grep), and I don't want to have
an extra "\t" in the output.
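
To make the extra "\t" concrete, here is a rough Python model of what seems to happen to each
line of the script's stdout. It is only a sketch: split_streaming_line and format_text_record
are hypothetical names of mine, not Hadoop APIs, and the model assumes the default TAB separator
and that a line without a TAB becomes a key with an empty (non-null) value.

```python
def split_streaming_line(line, sep="\t"):
    """Model of the split: text before the first separator is the key, the rest is the value.

    Assumption: a line with no separator becomes (line, "").
    """
    key, _, value = line.partition(sep)
    return key, value


def format_text_record(key, value, sep="\t"):
    """Model of the default text output: key, separator, value (newline omitted)."""
    return key + sep + value


if __name__ == "__main__":
    for raw in ["no tab here", "left\tright"]:
        key, value = split_streaming_line(raw)
        # repr() makes the trailing TAB visible
        print(repr(format_text_record(key, value)))
```

In this model a line that already contains a TAB round-trips unchanged, while a line without
one gains a trailing TAB, which is exactly the output I want to avoid.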

Is there any way to achieve this?  Or maybe it is possible to write a custom XxxOutputFormat
to work around that issue?

(Something like the opposite of TextInputFormat: it passes each input line unmodified to the
script's stdin, so there should be a way to write stdout to a file "as is".)
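
The behaviour I have in mind for such an output format could be sketched like this. This is
Python pseudologic only: a real implementation would have to be a Java OutputFormat, and
write_as_is is a hypothetical name, not an existing Hadoop class or method.

```python
def write_as_is(key, value, sep="\t"):
    """Hypothetical record writer: emit the separator only when there is a value.

    Assumption: an empty value means the script's line contained no separator.
    """
    if value == "":
        return key + "\n"
    return key + sep + value + "\n"


if __name__ == "__main__":
    # repr() makes TABs and newlines visible
    print(repr(write_as_is("grep matched line", "")))
    print(repr(write_as_is("left", "right")))
```

One caveat of this sketch: a line that really did end with a single TAB would lose it, because
by the time the writer sees the pair it cannot tell "no separator" from "separator followed by
nothing".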


> On Wed, Sep 10, 2014 at 12:03 PM, Susheel Kumar Gadalay <skgadalay@gmail.com> wrote:
> If you don't want the key in the final output, you can set it like this in Java.
> job.setOutputKeyClass(NullWritable.class);
> It will just print the value in the output file.
> I don't know how to do it in Python.
> On 9/10/14, Dmitry Sivachenko <trtrmitya@gmail.com> wrote:
> > Hello!
> >
> > Imagine the following common task: I want to process a big text file
> > line-by-line using the streaming interface.
> > Run the UNIX grep command, for instance.  Or some other line-by-line processing,
> > e.g. line.upper().
> > I copy the file to HDFS.
> >
> > Then I run a map task on this file which reads one line, modifies it in some
> > way and then writes it to the output.
> >
> > TextInputFormat suits reading well: its key is the offset in bytes
> > (meaningless in my case) and the value is the line itself, so I can iterate
> > over the lines like this (in Python):
> > import sys
> > for line in sys.stdin:
> >   sys.stdout.write(line.upper())  # line already ends with a newline
> >
> > The problem arises with TextOutputFormat: it tries to split the resulting
> > line on mapreduce.output.textoutputformat.separator, which results in an extra
> > separator in the output if this character is missing from the line (for instance,
> > an extra TAB at the end if we stick to the defaults).
> >
> > Is there any way to write the result of a streaming task without any internal
> > processing, so it appears exactly as the script produces it?
> >
> > If it is impossible with Hadoop, which works with key/value pairs, maybe
> > there are other frameworks that work on top of HDFS and allow
> > this?
> >
> > Thanks in advance!
> -- 
> Kernighan's Law
> "Debugging is twice as hard as writing the code in the first place.  Therefore, if you
> write the code as cleverly as possible, you are, by definition, not smart enough to debug
