hadoop-hdfs-user mailing list archives

From Susheel Kumar Gadalay <skgada...@gmail.com>
Subject Re: Writing output from streaming task without dealing with key/value
Date Wed, 10 Sep 2014 18:03:15 GMT
If you don't want the key in the final output, you can set it like this in Java:

job.setOutputKeyClass(NullWritable.class);

It will then print just the value in the output file, with no key and no separator.
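To see why this works: TextOutputFormat only emits the separator when both a key and a value are present, and a NullWritable key is skipped entirely. A minimal Python sketch of that logic (a simplification for illustration; `text_output_line` is a hypothetical name, not the actual Hadoop source):

```python
def text_output_line(key, value, separator="\t"):
    """Mimic TextOutputFormat's line formatting.

    The separator is written only when both key and value are present;
    key=None stands in for a NullWritable key here.
    """
    if key is None:
        return value
    if value is None:
        return key
    return key + separator + value

print(text_output_line("0", "HELLO WORLD"))   # -> "0\tHELLO WORLD"
print(text_output_line(None, "HELLO WORLD"))  # -> "HELLO WORLD", no tab
```

With a None (NullWritable) key, only the value reaches the file, which is exactly the behavior the question asks for.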

I don't know how to do it in Python.
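For the streaming case described below, the trailing tab comes from the key/value split: by default, streaming treats everything up to the first tab of a mapper output line as the key and the rest as the value, so a line with no tab becomes a key with an empty value, and TextOutputFormat still appends the separator between them. A rough Python sketch of that round trip (a hypothetical helper for illustration, not Hadoop's actual code):

```python
def streaming_roundtrip(line, separator="\t"):
    # Split mapper output into key/value the way streaming does by default:
    # everything before the first separator is the key, the rest is the value.
    key, _, value = line.partition(separator)
    # TextOutputFormat then writes key + separator + value; the value is an
    # empty string here, not null, so the separator is still emitted.
    return key + separator + value

print(repr(streaming_roundtrip("HELLO WORLD")))  # -> 'HELLO WORLD\t', trailing tab
print(repr(streaming_roundtrip("a\tb")))         # -> 'a\tb', unchanged
```

Lines that already contain a tab pass through unchanged; only separator-free lines pick up the spurious trailing tab.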

On 9/10/14, Dmitry Sivachenko <trtrmitya@gmail.com> wrote:
> Hello!
>
> Imagine the following common task: I want to process big text file
> line-by-line using streaming interface.
> Run the Unix grep command, for instance, or some other line-by-line
> processing, e.g. line.upper().
> I copy file to HDFS.
>
> Then I run a map task on this file which reads one line, modifies it some
> way and then writes it to the output.
>
> TextInputFormat suits this well: its key is the byte offset (meaningless in
> my case) and the value is the line itself, so I can iterate over the lines
> like this (in Python):
> import sys
> for line in sys.stdin:
>   print(line.rstrip("\n").upper())
>
> The problem arises with TextOutputFormat: it tries to split the resulting
> line on mapreduce.output.textoutputformat.separator, which produces an extra
> separator in the output whenever that character is missing from the line
> (an extra TAB at the end of each line, if we stick to the defaults).
>
> Is there any way to write the result of streaming task without any internal
> processing so it appears exactly as the script produces it?
>
> If this is impossible with Hadoop, which works with key/value pairs, maybe
> there are other frameworks on top of HDFS that allow it?
>
> Thanks in advance!
