hadoop-user mailing list archives

From Jason Yang <lin.yang.ja...@gmail.com>
Subject Re: How to split a sequence file
Date Wed, 12 Sep 2012 05:57:37 GMT
hey guys,

Thanks for all your suggestions.

To wrap up, there are two ways to achieve this:
1. Use multiple sequence files (one per key prefix), then write a
WholeFileInputFormat that treats each file as a single split by overriding
isSplitable() to return false;
2. Distribute the records with a partitioner and do the processing in the
reducers; however, the shuffle adds some network and I/O cost.
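
For option 2, here is a minimal sketch of the prefix-based partitioning logic in plain Java (no Hadoop dependency, so it runs on its own). The class name PrefixPartitionSketch and the helpers prefixOf and partitionFor are hypothetical, and the rule that the prefix is everything before the last '_' is an assumption based on the sample keys. In a real job the body of partitionFor would presumably go into getPartition() of a Partitioner<Text, BytesWritable> subclass.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PrefixPartitionSketch {

    // Hypothetical helper: "id_A_01" -> "id_A" (everything before the last '_').
    // Keys without an underscore are used as their own prefix.
    static String prefixOf(String key) {
        int idx = key.lastIndexOf('_');
        return idx < 0 ? key : key.substring(0, idx);
    }

    // Same shape as a partitioner's getPartition(): hash the prefix only,
    // clear the sign bit, and take the remainder by the partition count.
    static int partitionFor(String key, int numPartitions) {
        return (prefixOf(key).hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        String[] keys = {"id_A_01", "id_A_02", "id_A_03",
                         "id_B_01", "id_B_02", "id_B_03"};
        int numPartitions = 4; // assumed number of reduce tasks

        // Group the sample keys by the partition they would be sent to.
        Map<Integer, List<String>> byPartition = new TreeMap<>();
        for (String k : keys) {
            byPartition
                .computeIfAbsent(partitionFor(k, numPartitions), p -> new ArrayList<>())
                .add(k);
        }
        byPartition.forEach((p, ks) -> System.out.println("partition " + p + ": " + ks));
    }
}
```

Records that share a prefix always land in the same partition, but two different prefixes can hash to the same partition, so one reducer may still receive more than one group.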

BTW, since the computation can be parallelized in both the Mapper and the
Reducer, what's the difference between them?

2012/9/12 Ajay Srivastava <Ajay.Srivastava@guavus.com>

> Hi Jason,
> I am wondering about use case of distributing records on the basis of key
> to mapper. If possible, could you please share your scenario ?
> Is it map only job ? Why not distribute records using partitioner and do
> the processing in reducers ?
>
>
> Regards,
> Ajay Srivastava
>
>
> On 12-Sep-2012, at 8:45 AM, Jason Yang wrote:
>
> > Hi,
> >
> > I have a sequence file written by SequenceFileOutputFormat with
> > key/value type of <Text, BytesWritable>, like below:
> >
> > Text                             BytesWritable
> > -------------------------------------------------------------
> > id_A_01  7F2B3C687F2B3C687F2B3C68
> > id_A_02  2F2B3C687F2B3C687F2B3C686AB23C68D73C68D7
> > id_A_03  5F2B3C68D77F2B3C687F2B3A
> > ...
> > id_B_01  1AB23C68D73C68D76AB23C68D73C68D7
> > id_B_02  5AB23C68D73C68D76AB68D76A1
> > id_B_03  F2B23C68D7B23C68D7B23C68D7
> >
> > If I want all the records with the same key prefix to be processed by a
> > same mapper, say records with key id_A_XX are processed by a mapper and
> > records with key id_B_XX are processed by another mapper, what should I do?
> >
> > Should I implement our own InputFormat inherited from
> > SequenceFileInputFormat ?
> >
> > Any help would be appreciated.
> > --
> > YANG, Lin
> >
>
>


-- 
YANG, Lin
