hadoop-common-user mailing list archives

From Norbert Burger <norbert.bur...@gmail.com>
Subject Re: Splitting a big file into pieces with Hadoop Streaming?
Date Fri, 20 Mar 2009 17:21:55 GMT
If you're trying to split the results of your MR job, one natural
option is simply to add a second MR job that post-processes your
data.  The mapper of this second job emits as many unique keys as you
want splits, with your original record as the value.  The reducer
logic just strips the keys away.
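A minimal sketch of that second job as a Hadoop Streaming mapper/reducer pair in Python. The number of splits, the tab-delimited record format, and the use of CRC32 as the bucketing hash are all assumptions for illustration, not details from this thread:

```python
#!/usr/bin/env python
# Sketch of a Streaming mapper/reducer that redistributes records into
# NUM_SPLITS buckets.  NUM_SPLITS and the tab-delimited record layout
# are assumptions; adapt to your data.
import sys
import zlib

NUM_SPLITS = 8  # desired number of output partitions (assumption)

def mapper(lines):
    """Prefix each record with a synthetic bucket key so the shuffle
    groups records into NUM_SPLITS buckets."""
    for line in lines:
        record = line.rstrip("\n")
        # Bucket on a stable hash of the first field (before the first tab).
        key = record.split("\t", 1)[0]
        bucket = zlib.crc32(key.encode("utf-8")) % NUM_SPLITS
        yield "%d\t%s" % (bucket, record)

def reducer(lines):
    """Strip the synthetic bucket key, leaving the original record."""
    for line in lines:
        yield line.rstrip("\n").split("\t", 1)[1]

if __name__ == "__main__":
    # Run the same script as -mapper or -reducer via an argument.
    stage = sys.argv[1] if len(sys.argv) > 1 else "map"
    step = mapper if stage == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

If you run this with `-numReduceTasks N`, each reducer writes one part file, so the part files themselves become your splits.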

If this is too much work and your files are plain text, you can
always fall back to head/tail (which split on record boundaries),
split (which splits on line or byte counts), or a custom awk script.
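The fallback idea above can be sketched in Python as well: chunk a text file into pieces on line (record) boundaries, the way `split -l` would. The chunk size is an arbitrary assumption here:

```python
# Minimal sketch of record-boundary splitting: group lines into chunks
# of at most records_per_chunk, never cutting a record in half.
from itertools import islice

def split_records(lines, records_per_chunk):
    """Yield lists of at most records_per_chunk lines each."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, records_per_chunk))
        if not chunk:
            break
        yield chunk
```

Each yielded chunk can then be written out as its own piece of the index.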

Norbert

On Fri, Mar 20, 2009 at 10:10 AM, Nick Cen <cenyongh@gmail.com> wrote:
> I had a similar problem earlier, and I just used split and awk to split
> the file.
>
> 2009/3/20 Akira Kitada <akitada@gmail.com>
>
>> Hi,
>>
>> Can I split an input file into pieces based on the key (probably the
>> hash value of the key)?
>> Since Hadoop Streaming is essentially a shell pipeline,
>> it seems to be impossible to do this, but I wanted to double-check
>> to be sure.
>>
>> Background: the output (an index file) is so large (more than 10 GB)
>> that it slows down my applications using that file unless it is split
>> into pieces.
>>
>> Thanks in advance.
>>
>
>
>
> --
> http://daily.appspot.com/food/
>
