hadoop-hdfs-user mailing list archives

From Paul Hubenig <paul.hube...@gmail.com>
Subject streaming secondary sort not working?
Date Tue, 08 Jan 2013 02:38:27 GMT
hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.*.jar \
-input /export/home/phubenig/fileDataInput \
-output /export/home/phubenig/fileDataOutput \
-mapper /export/home/phubenig/fileDataJob/non_map.py \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
num.key.fields.for.partition=1 \
stream.num.map.output.key.fields=7 \
mapred.text.key.comparator.options="-k1,1 -k2,7" \
mapred.text.key.partitioner.options="-k1,1" \
-file /export/home/phubenig/fileDataInput/fileData.txt


Input file (tab separated):

C k d m n h b
A w g i w t l
A w f y m y h
C u r d h c b
A y q w m g k
B w b s d q g
C q j j d f b
C l n x a g f
C o r m a g p
C v w l a t f
B c l f n t u
B x t o e x p
A q m r d q i
C e i o u g l
A x m w u o i
A j p m d k r
C s t m r m t
B s w l f k y
B a f r v f x
A s z d v s h
C o x j c w r

The output is sorted on the first key field (the capital letters), but the
secondary sort on the remaining fields is not performed.  Does anyone see
the problem?  What am I missing?  It seems like it should work.
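
For reference, the ordering the job is expected to produce can be sketched locally in plain Python. This is only an illustration and not Hadoop code: the function name `expected_order` is hypothetical, the sample rows are a few records taken from the input above, and the sketch approximates "-k1,1 -k2,7" by comparing field 1 and then fields 2 through 7 as text.

```python
def expected_order(lines, sep="\t"):
    """Sort records by field 1, then by fields 2-7, using plain text comparison.

    Assumption: this only approximates what the comparator options
    "-k1,1 -k2,7" are meant to do; it does not call any Hadoop code.
    """
    def sort_key(line):
        fields = line.rstrip("\n").split(sep)
        # Primary key: field 1. Secondary key: fields 2..7, compared in order.
        return (fields[0], fields[1:7])

    return sorted(lines, key=sort_key)


# A few rows from the input file, with tabs restored as separators.
sample = [
    "C\tk\td\tm\tn\th\tb",
    "A\tw\tg\ti\tw\tt\tl",
    "A\tw\tf\ty\tm\ty\th",
    "B\tw\tb\ts\td\tq\tg",
    "A\tq\tm\tr\td\tq\ti",
]

for record in expected_order(sample):
    print(record)
```

Within each capital letter, records should come out ordered by the second field and onward, which is the behavior missing from the actual job output.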

Thanks for your time.



#!/usr/bin/env python

import sys

for line in sys.stdin:
    stripped = line.rstrip()
