hadoop-common-user mailing list archives

From Dmitry Sivachenko <trtrmi...@gmail.com>
Subject Writing output from streaming task without dealing with key/value
Date Wed, 10 Sep 2014 15:51:40 GMT
Hello!

Imagine the following common task: I want to process a big text file line by line using the
streaming interface.
For instance, run the Unix grep command, or do some other line-by-line processing, e.g. line.upper().
I copy the file to HDFS.

Then I run a map task on this file which reads one line at a time, modifies it in some way,
and then writes it to the output.

TextInputFormat suits reading well: its key is the offset in bytes (meaningless in my
case) and the value is the line itself, so I can iterate over the lines like this (in Python):
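import sys
for line in sys.stdin:
    # each line already ends with a newline, so write() avoids adding a second one
    sys.stdout.write(line.upper())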

The problem arises with TextOutputFormat: it tries to split the resulting line on
mapreduce.output.textoutputformat.separator, which results in an extra separator in the output
if this character is missing from the line (for instance, an extra TAB at the end if we stick
to the defaults).
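
To make the symptom concrete, here is a minimal sketch in plain Python of what I assume is happening. It is a simplified model, not Hadoop's actual code: the function names split_key_value, format_record and emit are hypothetical, and the TAB separator is just the default mentioned above.

```python
def split_key_value(line, separator="\t"):
    """Split a line at the first separator.

    Assumption: a line without the separator becomes the key,
    and the value is left as an empty string.
    """
    key, _, value = line.partition(separator)
    return key, value


def format_record(key, value, separator="\t"):
    """Join key and value the way a key/value text writer would."""
    return key + separator + value


def emit(line, separator="\t"):
    """Model one line passing through the split and the re-join."""
    key, value = split_key_value(line, separator)
    return format_record(key, value, separator)


if __name__ == "__main__":
    # repr() makes a trailing TAB visible in the printed result
    print(repr(emit("HELLO WORLD")))
    print(repr(emit("HELLO\tWORLD")))
```

In this model a line that already contains a TAB comes back unchanged, while a line without one gains a trailing TAB, which matches what I see in the output.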

Is there any way to write the result of a streaming task without any internal processing, so
that it appears exactly as the script produces it?

If this is impossible with Hadoop, which works with key/value pairs, maybe there are other
frameworks that work on top of HDFS and allow this?

Thanks in advance!