hadoop-hdfs-user mailing list archives

From Jason Yang <lin.yang.ja...@gmail.com>
Subject How to split a sequence file
Date Wed, 12 Sep 2012 03:15:30 GMT

I have a sequence file written by SequenceFileOutputFormat with key/value
types <Text, BytesWritable>, like the one below:

Text     BytesWritable
id_A_01  7F2B3C687F2B3C687F2B3C68
id_A_02  2F2B3C687F2B3C687F2B3C686AB23C68D73C68D7
id_A_03  5F2B3C68D77F2B3C687F2B3A
id_B_01  1AB23C68D73C68D76AB23C68D73C68D7
id_B_02  5AB23C68D73C68D76AB68D76A1
id_B_03  F2B23C68D7B23C68D7B23C68D7

I want all records that share a key prefix to be processed by the same
mapper: for example, records with keys id_A_XX go to one mapper and records
with keys id_B_XX go to another. What should I do?
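
To make the grouping concrete, here is a minimal sketch in plain Java with
no Hadoop classes. It assumes the prefix is everything before the last
underscore. PrefixGrouping, prefixOf and groupByPrefix are names made up
for illustration and are not part of any Hadoop API. The sketch only shows
which records should end up together, not how to make the input splits
respect that.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PrefixGrouping {

    // Hypothetical helper: "id_A_01" -> "id_A".
    // Assumption: the prefix is everything before the last underscore.
    static String prefixOf(String key) {
        int cut = key.lastIndexOf('_');
        return cut < 0 ? key : key.substring(0, cut);
    }

    // Groups keys by prefix, keeping prefixes in first-seen order.
    static Map<String, List<String>> groupByPrefix(List<String> keys) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String key : keys) {
            groups.computeIfAbsent(prefixOf(key), p -> new ArrayList<>()).add(key);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Keys taken from the sample file above.
        List<String> keys = Arrays.asList(
                "id_A_01", "id_A_02", "id_A_03",
                "id_B_01", "id_B_02", "id_B_03");
        // Each printed line is the set of records one mapper should see.
        groupByPrefix(keys).forEach(
                (prefix, members) -> System.out.println(prefix + " -> " + members));
    }
}
```

With the sample keys this gives two groups, id_A and id_B, each holding
three records.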

Should I implement my own InputFormat that extends
SequenceFileInputFormat?

Any help would be appreciated.
