hadoop-hdfs-user mailing list archives

From Aleksandr Elbakyan <ramal...@yahoo.com>
Subject Re: Issue with partitioning of data using hadoop streaming
Date Wed, 30 Apr 2014 17:24:25 GMT

Any suggestions?



I am having an issue with partitioning data between mappers and reducers when the key is numeric.
When I switch the key to a one-character string it works fine, but I have more than 26 keys, so
I am looking for an alternative (see the rough sketch after the sample data below for how I
understand the key is being partitioned).

My data looks like this:

10 \t comment10 \t data
20 \t comment20 \t data
30 \t comment30 \t data
40 \t comment40 \t data

up to 250

The data is around 50 million lines.
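As far as I understand it, the partitioner just hashes the key text, so numeric keys like "10" and
"100" are treated as plain strings. Here is a rough Python sketch of how I think the reducer
assignment works, assuming Java String.hashCode-style hashing over the whole first field (the real
KeyFieldBasedPartitioner hashes the selected key bytes, so this is only an approximation, not the
actual Hadoop code):

    # Sketch: emulate hash-based partition assignment for textual keys.
    # Assumes Java String.hashCode semantics and (hash & MAX_INT) % reducers,
    # as in the default HashPartitioner; KeyFieldBasedPartitioner may hash
    # the selected key bytes slightly differently.

    def java_string_hash(s):
        h = 0
        for ch in s:
            h = (31 * h + ord(ch)) & 0xFFFFFFFF
        return h - 0x100000000 if h >= 0x80000000 else h

    def reducer_for(key, num_reducers=25):
        return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers

    if __name__ == "__main__":
        for key in ("10", "20", "30", "100", "250"):
            print(key, "-> reducer", reducer_for(key))

Here is the job I am running: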

hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-0.20.2+228-streaming.jar \
    -D mapred.task.timeout=3600000 \
    -D mapred.map.tasks=25 \
    -D stream.non.zero.exit.is.failure=true \
    -D mapred.reduce.tasks=25 \
    -D mapred.output.compress=true \
    -D mapred.text.key.partitioner.options=-k1,1n \
    -D mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
    -input "input" \
    -output "output" \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -jobconf stream.map.output.field.separator=. \
    -jobconf stream.num.map.output.key.fields=1 \
    -jobconf map.output.key.field.separator=\t \
    -jobconf num.key.fields.for.partition=1 \
    -mapper " cat
 " \
    -reducer " cat "

The other issue I have is with stream.map.output.field.separator: when I set it to a tab, it adds
a space into my data when the keys are greater than or equal to 100. A sketch of the workaround I
am considering is below.
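One idea is to swap the bare cat mapper for a small identity mapper that splits each line on the
first tab and re-emits the key and the rest explicitly, so the map output always has exactly one
tab between key and value. This is only a sketch; the file name mapper.py is just a placeholder:

    #!/usr/bin/env python
    # mapper.py - identity mapper sketch (placeholder name).
    # Splits each input line on the first tab and re-emits "key<TAB>rest",
    # so exactly one tab separates the key from the value on map output.
    import sys

    for line in sys.stdin:
        line = line.rstrip("\n")
        if not line:
            continue
        key, sep, rest = line.partition("\t")
        sys.stdout.write(key + "\t" + rest + "\n")

I think I would then pass it with -mapper mapper.py -file mapper.py instead of cat, but I have not
confirmed that this avoids the extra space.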

Any suggestions on how to fix this?