hadoop-user mailing list archives

From Robert Dyer <psyb...@gmail.com>
Subject Re: How to split a sequence file
Date Wed, 12 Sep 2012 05:26:08 GMT
If the file is pre-sorted, why not just write multiple sequence files,
one for each split?

Then you don't have to compute InputSplits because the physical files
are already split.
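
A minimal sketch of the grouping step, in plain Java with no Hadoop
dependency. The class and method names are hypothetical, and the prefix
rule (everything before the last '_') is an assumption based on the
sample keys id_A_01, id_B_01 in the original question. Each resulting
group would then be written to its own sequence file.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: groups record keys by prefix so
// that each group can be written to a separate sequence file.
final class PrefixGrouper {
    private PrefixGrouper() {}

    // Assumption: keys look like "id_A_01", so the prefix is everything
    // before the last underscore. Keys without an underscore are their own prefix.
    static String prefixOf(String key) {
        int cut = key.lastIndexOf('_');
        return cut < 0 ? key : key.substring(0, cut);
    }

    // Returns prefix -> keys, preserving the order in which prefixes first appear.
    static Map<String, List<String>> groupByPrefix(List<String> keys) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String key : keys) {
            groups.computeIfAbsent(prefixOf(key), p -> new ArrayList<>()).add(key);
        }
        return groups;
    }
}
```

One caveat: a single file larger than one block may still be divided
into several splits by the input format. Overriding isSplitable to
return false in a SequenceFileInputFormat subclass is one way to keep
each file with one mapper.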

On Tue, Sep 11, 2012 at 11:00 PM, Harsh J <harsh@cloudera.com> wrote:
> Hey Jason,
>
> Is the file pre-sorted? You could override the InputFormat's
> #getSplits method to return InputSplits at identified key boundaries,
> as one solution - this would require reading the file up-front (at
> submit-time) and building the input splits out of it.
>
> On Wed, Sep 12, 2012 at 8:45 AM, Jason Yang <lin.yang.jason@gmail.com> wrote:
>> Hi,
>>
>> I have a sequence file written by SequenceFileOutputFormat with key/value
>> type of <Text, BytesWritable>, like below:
>>
>> Text     BytesWritable
>> -------  ----------------------------------------
>> id_A_01  7F2B3C687F2B3C687F2B3C68
>> id_A_02  2F2B3C687F2B3C687F2B3C686AB23C68D73C68D7
>> id_A_03  5F2B3C68D77F2B3C687F2B3A
>> ...
>> id_B_01  1AB23C68D73C68D76AB23C68D73C68D7
>> id_B_02  5AB23C68D73C68D76AB68D76A1
>> id_B_03  F2B23C68D7B23C68D7B23C68D7
>>
>> If I want all the records with the same key prefix to be processed by the
>> same mapper, say records with key id_A_XX are processed by one mapper and
>> records with key id_B_XX are processed by another mapper, what should I do?
>>
>> Should I implement my own InputFormat that extends
>> SequenceFileInputFormat?
>>
>> Any help would be appreciated.
>> --
>> YANG, Lin
>>
>
>
>
> --
> Harsh J
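
The submit-time scan described in the quoted reply can be sketched in
plain Java, with no Hadoop dependency. This is only an illustration of
finding key boundaries in a pre-sorted key sequence: the class names are
hypothetical, the prefix rule (everything before the last '_') is an
assumption based on the sample keys, and the ranges are record indices
rather than byte offsets. Turning them into real InputSplits would
additionally need the byte position of each boundary record and a
record reader that starts and stops exactly at those positions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: finds runs of consecutive keys
// that share a prefix in an already-sorted key list.
final class KeyBoundaries {
    private KeyBoundaries() {}

    // Half-open range [start, end) of record indices sharing one prefix.
    static final class Range {
        final String prefix;
        final int start;
        final int end;

        Range(String prefix, int start, int end) {
            this.prefix = prefix;
            this.start = start;
            this.end = end;
        }
    }

    // Assumption: the prefix is everything before the last underscore.
    static String prefixOf(String key) {
        int cut = key.lastIndexOf('_');
        return cut < 0 ? key : key.substring(0, cut);
    }

    // Single pass over the sorted keys; closes a range whenever the prefix
    // changes or the input ends. Returns an empty list for empty input.
    static List<Range> find(List<String> sortedKeys) {
        List<Range> ranges = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= sortedKeys.size(); i++) {
            boolean atEnd = i == sortedKeys.size();
            String current = prefixOf(sortedKeys.get(start));
            if (atEnd || !prefixOf(sortedKeys.get(i)).equals(current)) {
                ranges.add(new Range(current, start, i));
                start = i;
            }
        }
        return ranges;
    }
}
```

Each Range corresponds to one mapper's share of the file. The approach
relies on the file being sorted by key; on unsorted input the same
prefix would show up in several ranges.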
