hadoop-common-user mailing list archives

From anvesh ragi <annunarc...@gmail.com>
Subject hadoop 2.4.0 streaming generic parser options using TAB as separator
Date Wed, 10 Jun 2015 05:28:28 GMT
Hello all,

I know that tab is the default field separator for the following properties:

stream.map.output.field.separator
stream.reduce.input.field.separator
stream.reduce.output.field.separator
mapreduce.textoutputformat.separator

But when I set the generic parser option:

stream.map.output.field.separator=\t (or)
stream.map.output.field.separator="\t"

to test how Hadoop parses whitespace characters such as "\t" and "\n" when
they are used as separators, I observe that Hadoop reads the value as the
literal characters \t and not as an actual tab. I checked this by printing
each line in the reducer (Python) as it is read, using:

sys.stdout.write(str(line))
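
A minimal sketch of a more revealing version of this check, assuming a Python 3 reducer: writing repr(line) instead of str(line) shows a real tab as \t and a literal backslash followed by t as \\t, so the two cannot be confused with each other or with a plain space. The helper name dump_lines and the sample record are hypothetical.

```python
import io
import sys

def dump_lines(stream, out):
    """Write each input line in repr() form so whitespace is visible."""
    for line in stream:
        out.write(repr(line) + "\n")

# In a real reducer this would be: dump_lines(sys.stdin, sys.stdout)
# Here a sample record with a trailing tab stands in for reducer input.
sample = io.StringIO("key\tvalue1\tvalue2\t\n")
dump_lines(sample, sys.stdout)
```

With this in place, a trailing tab and a trailing space look different in the reducer's output.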

My mapper emits key/value pairs as: key value1 value2

using the call print(key, value1, value2, sep='\t', end='\n').

So I expected my reducer to read each line as key value1 value2 as well, but
instead sys.stdout.write(str(line)) printed:

key value1 value2 (with a trailing space)

From "Hadoop streaming - remove trailing tab from reducer output"
<http://stackoverflow.com/questions/18133290/hadoop-streaming-remove-trailing-tab-from-reducer-output>,
I understood that the trailing space appears because
mapreduce.textoutputformat.separator is not set and is left at its default.

So this confirmed my assumption that Hadoop treated my entire map output
line:

key value1 value2

as the key, and the value as an empty Text object, because it read the
separator from stream.map.output.field.separator=\t as the literal "\t"
characters instead of an actual tab.
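
The effect described above can be sketched in plain Python. This is a simplified model under stated assumptions, not Hadoop's actual code: it assumes the record is split into key and value at the first occurrence of the separator, that the whole line becomes the key when the separator is absent, and that key and value are then joined with a tab. The function name split_key_value is hypothetical.

```python
def split_key_value(line, separator):
    """Split at the first separator; if it is absent, the whole line is the key."""
    key, _, value = line.partition(separator)
    return key, value

record = "key\tvalue1\tvalue2"

real_tab = "\t"   # one character: an actual tab
literal = "\\t"   # two characters: a backslash followed by 't'

# With a real tab, the first field becomes the key.
print(split_key_value(record, real_tab))

# With the literal two-character string, nothing matches, so the whole
# record becomes the key and the value is empty.
key, value = split_key_value(record, literal)
print((key, value))

# Joining key and empty value with a tab leaves a trailing tab.
print(repr(key + "\t" + value))
```

Under these assumptions, the literal separator never matches the tab-separated record, which is consistent with the trailing whitespace seen in the reducer.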

Please help me understand this behavior. How can I use a tab (\t) as the
separator if I want to?

Thanks & Regards,
Anvesh R
