hadoop-common-user mailing list archives

From anvesh ragi <annunarc...@gmail.com>
Subject hadoop 2.4.0 streaming generic parser options using TAB as separator
Date Wed, 10 Jun 2015 05:28:28 GMT
Hello all,

I know that tab is the default input separator for fields:


But I tried setting the generic parser option:

stream.map.output.field.separator=\t (or)

to test how Hadoop parses whitespace characters such as "\t" and "\n" when
used as separators. I observed that Hadoop reads it as the literal
two-character string "\t" rather than as an actual tab character. I checked
this by printing each line in the reducer (Python) as it is read, using:


My mapper emits key/value pairs as: key value1 value2

using print(key, value1, value2, sep='\t', end='\n').
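
A minimal sketch of a mapper of this shape, for illustration only. The function name run_mapper, the whitespace-separated input layout, and the sample line are assumptions and not part of the actual job:

```python
import io


def run_mapper(lines, out):
    """Emit the first three whitespace-separated fields, joined by real tabs."""
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip lines that do not carry key, value1, value2
        key, value1, value2 = fields[:3]
        # sep='\t' inside Python source is a real tab character.
        print(key, value1, value2, sep='\t', end='\n', file=out)


# Hypothetical sample input; a streaming job would pass sys.stdin and
# sys.stdout instead of these in-memory objects.
sample_input = ["key value1 value2\n"]
buffer = io.StringIO()
run_mapper(sample_input, buffer)
print(repr(buffer.getvalue()))
```

The repr() call makes the tab characters in the emitted line visible as \t.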

So I expected my reducer to read each line as "key value1 value2" as well,
but instead sys.stdout.write(str(line)) printed:

key value1 value2 (with a trailing space)
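
One way to tell whether the trailing character is a space or a tab is to print the repr() of each line instead of the line itself. This is only a sketch: the helper name show_lines and the sample record are made up for illustration.

```python
import io


def show_lines(stream):
    """Return the repr of every line so tabs and spaces become visible."""
    return [repr(line.rstrip("\n")) for line in stream]


# Hypothetical reducer input with a trailing tab; in a real reducer the
# stream would be sys.stdin.
sample = io.StringIO("key\tvalue1\tvalue2\t\n")
for shown in show_lines(sample):
    print(shown)
```

In the repr form a tab shows up as \t and a space as a plain space, so the two cannot be confused.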

From "Hadoop streaming - remove trailing tab from reducer output"
I understood that the trailing space is due to
mapreduce.textoutputformat.separator not being set and being left at its
default.

So this confirmed my assumption that Hadoop treated my entire map output
line

key value1 value2

as the key, with an empty Text object as the value, since it read the
separator from stream.map.output.field.separator=\t as the literal
characters "\t" instead of an actual tab.
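
The behavior I am assuming can be sketched in Python as follows. This is a simplified model, not Hadoop's actual code: split_key_value and format_output are hypothetical helpers, and the model assumes the line is split at the first occurrence of the separator and that the output format writes key, a tab, then value.

```python
def split_key_value(line, separator):
    """Split at the first occurrence of separator.

    If the separator does not occur, the whole line becomes the key and
    the value is empty.
    """
    key, _found, value = line.partition(separator)
    return key, value


def format_output(key, value, out_sep="\t"):
    """Model of writing key, output separator, value on one line."""
    return key + out_sep + value


line = "key\tvalue1\tvalue2"
real_tab = "\t"   # one character: an actual tab
literal = "\\t"   # two characters: a backslash followed by 't'

print(split_key_value(line, real_tab))
print(split_key_value(line, literal))
print(repr(format_output(*split_key_value(line, literal))))
```

Under this model, the two-character separator never matches, so the value stays empty and the written line ends with the output separator, which would explain the trailing whitespace.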

Please help me understand this behavior. How can I use a tab (\t) as the
separator if I want to?

Thanks & Regards,
Anvesh R
