hadoop-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: How to split a sequence file
Date Wed, 12 Sep 2012 04:00:03 GMT
Hey Jason,

Is the file pre-sorted? If so, one solution is to override the
InputFormat's #getSplits method to return InputSplits at identified key
boundaries. This would require reading the file up-front (at
submit-time) and building the input splits out of it.
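
A rough sketch of the boundary-finding step, in plain Java with no Hadoop
types so it stands alone. Everything named here is hypothetical: the class
KeyBoundarySplits, the Range holder, and the rule that the prefix is
everything before the last '_'. It assumes you have already scanned the
file once and recorded the byte offset at which each record starts (for
example by calling SequenceFile.Reader#getPosition() before each next()).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: turns sorted (key, offset) pairs into one byte range per key prefix. */
final class KeyBoundarySplits {

    /** A contiguous byte range of the file, analogous to a FileSplit's (start, length). */
    static final class Range {
        final String prefix;
        final long start;
        final long length;

        Range(String prefix, long start, long length) {
            this.prefix = prefix;
            this.start = start;
            this.length = length;
        }
    }

    /** Assumed grouping rule: "id_A_01" belongs to prefix "id_A". */
    static String prefixOf(String key) {
        int i = key.lastIndexOf('_');
        return i < 0 ? key : key.substring(0, i);
    }

    /**
     * keys[i] is the key of the record starting at byte offsets[i];
     * fileEnd is the offset just past the last record.
     * Keys must be sorted so that each prefix forms one contiguous run.
     */
    static List<Range> compute(String[] keys, long[] offsets, long fileEnd) {
        List<Range> ranges = new ArrayList<>();
        if (keys.length == 0) {
            return ranges;
        }
        String current = prefixOf(keys[0]);
        long start = offsets[0];
        for (int i = 1; i < keys.length; i++) {
            String p = prefixOf(keys[i]);
            if (!p.equals(current)) {
                // Prefix changed: close the previous range at this record's offset.
                ranges.add(new Range(current, start, offsets[i] - start));
                current = p;
                start = offsets[i];
            }
        }
        // Close the final range at the end of the file.
        ranges.add(new Range(current, start, fileEnd - start));
        return ranges;
    }
}
```

In #getSplits, each Range would then become a FileSplit(path, start,
length, hosts). One caveat, as far as I recall: the stock
SequenceFileRecordReader skips forward to the next sync marker when a
split starts mid-file, so splits that begin at exact record offsets would
probably also need a RecordReader that seeks straight to the recorded
position, and this only makes sense for files that are not
block-compressed.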

On Wed, Sep 12, 2012 at 8:45 AM, Jason Yang <lin.yang.jason@gmail.com> wrote:
> Hi,
>
> I have a sequence file written by SequenceFileOutputFormat with key/value
> type of <Text, BytesWritable>, like below:
>
> Text     BytesWritable
> -------  ----------------------------------------
> id_A_01  7F2B3C687F2B3C687F2B3C68
> id_A_02  2F2B3C687F2B3C687F2B3C686AB23C68D73C68D7
> id_A_03  5F2B3C68D77F2B3C687F2B3A
> ...
> id_B_01  1AB23C68D73C68D76AB23C68D73C68D7
> id_B_02  5AB23C68D73C68D76AB68D76A1
> id_B_03  F2B23C68D7B23C68D7B23C68D7
>
> If I want all the records with the same key prefix to be processed by the
> same mapper, say records with key id_A_XX by one mapper and records with
> key id_B_XX by another mapper, what should I do?
>
> Should I implement my own InputFormat that extends
> SequenceFileInputFormat?
>
> Any help would be appreciated.
> --
> YANG, Lin
>



-- 
Harsh J
