hadoop-mapreduce-user mailing list archives

From Sonal Goyal <sonalgoy...@gmail.com>
Subject Re: Best practice for batch file conversions
Date Tue, 08 Feb 2011 06:55:20 GMT
Hi,

You can use FileStreamInputFormat, which returns the file stream as the
value.

https://github.com/sonalgoyal/hiho/tree/hihoApache0.20/src/co/nubetech/hiho/mapreduce/lib/input

You need to remember that you lose data locality by trying to manipulate the
file as a whole, but in your case, the requirement probably demands it.
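
As a rough illustration only (the key/value types here and the
convertAndStore() helper are assumptions for the sketch, not hiho's actual
API; check the FileStreamInputFormat source linked above for the real
signature), a conversion mapper could look something like this:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch: assumes the input format hands each map() call an open stream
    // over one whole input file (key = file name, value = stream). Verify the
    // actual key/value classes against the hiho source.
    public class BinaryConversionMapper
            extends Mapper<Text, FSDataInputStream, Text, NullWritable> {

        @Override
        protected void map(Text fileName, FSDataInputStream in, Context context)
                throws IOException, InterruptedException {
            // Read the source format from the stream, convert it, and write the
            // result wherever it needs to go (e.g. back to HDFS).
            convertAndStore(fileName.toString(), in);
            context.write(fileName, NullWritable.get());
        }

        private void convertAndStore(String name, FSDataInputStream in)
                throws IOException {
            // placeholder: application-specific format conversion goes here
        }
    }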

Thanks and Regards,
Sonal
Connect Hadoop with databases, Salesforce, FTP servers and others
<https://github.com/sonalgoyal/hiho>
Nube Technologies <http://www.nubetech.co>
<http://in.linkedin.com/in/sonalgoyal>

On Tue, Feb 8, 2011 at 8:59 AM, Harsh J <qwertymaniac@gmail.com> wrote:

> Extend FileInputFormat, write your own binary-format-based
> implementation of it, and make it non-splittable (isSplitable should
> return false). This way, a Mapper would get a whole file, and you
> shouldn't have block-splitting issues.
>
> On Tue, Feb 8, 2011 at 6:37 AM, felix gao <gre1600@gmail.com> wrote:
> > Hello users of hadoop,
> > I have a task to convert large binary files from one format to another.
> > I am wondering what is the best practice to do this. Basically, I am
> > trying to get one mapper to work on each binary file and I am not sure
> > how to do that in Hadoop properly.
> > thanks,
> > Felix
>
>
>
> --
> Harsh J
> www.harshj.com
>
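
For reference, a minimal sketch of the non-splittable approach Harsh
describes above: a FileInputFormat subclass whose isSplitable() returns
false, paired with a RecordReader that loads one whole file into a single
record. The class names WholeFileInputFormat and WholeFileRecordReader are
illustrative, not part of Hadoop itself.

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class WholeFileInputFormat
            extends FileInputFormat<NullWritable, BytesWritable> {

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;   // one split per file, so one mapper sees the whole file
        }

        @Override
        public RecordReader<NullWritable, BytesWritable> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            return new WholeFileRecordReader();
        }

        public static class WholeFileRecordReader
                extends RecordReader<NullWritable, BytesWritable> {

            private FileSplit fileSplit;
            private TaskAttemptContext context;
            private final BytesWritable value = new BytesWritable();
            private boolean processed = false;

            @Override
            public void initialize(InputSplit split, TaskAttemptContext context) {
                this.fileSplit = (FileSplit) split;
                this.context = context;
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                if (processed) {
                    return false;
                }
                // Reads the entire file into memory; fine for moderate file
                // sizes, but large files would need a streaming record reader
                // (as hiho's FileStreamInputFormat does) instead.
                byte[] contents = new byte[(int) fileSplit.getLength()];
                Path file = fileSplit.getPath();
                FileSystem fs = file.getFileSystem(context.getConfiguration());
                FSDataInputStream in = null;
                try {
                    in = fs.open(file);
                    IOUtils.readFully(in, contents, 0, contents.length);
                    value.set(contents, 0, contents.length);
                } finally {
                    IOUtils.closeStream(in);
                }
                processed = true;
                return true;
            }

            @Override
            public NullWritable getCurrentKey() { return NullWritable.get(); }

            @Override
            public BytesWritable getCurrentValue() { return value; }

            @Override
            public float getProgress() { return processed ? 1.0f : 0.0f; }

            @Override
            public void close() { }
        }
    }

The job would then use it via job.setInputFormatClass(WholeFileInputFormat.class),
and each map() call receives one file's bytes as the value.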
