hadoop-common-dev mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: question about file input format
Date Thu, 18 Aug 2011 02:35:36 GMT

You'll require two things here, as you've deduced correctly:

In your InputFormat subclass:
- isSplitable -> return false, so that each file becomes a single split.
- getRecordReader -> return a simple implementation that reads the whole
file's bytes into an array (or your own construct) and hands it to the
mapper as a single record from next().
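
To make the two points above concrete, here is a minimal sketch against the
old (org.apache.hadoop.mapred) API that ships with 0.20.2. It is not the
code from the gist linked below, it is untested, and the class names
WholeFileInputFormat and WholeFileRecordReader are made up for
illustration. It assumes the Hadoop jars are on the classpath and emits
each file as one (NullWritable, BytesWritable) record.

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical name: an input format that hands each file to the mapper whole.
public class WholeFileInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(FileSystem fs, Path filename) {
    return false; // one split per file, regardless of block size
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new WholeFileRecordReader((FileSplit) split, job);
  }

  // Hypothetical name: produces exactly one record, the file's bytes.
  static class WholeFileRecordReader
      implements RecordReader<NullWritable, BytesWritable> {

    private final FileSplit split;
    private final JobConf conf;
    private boolean processed = false;

    WholeFileRecordReader(FileSplit split, JobConf conf) {
      this.split = split;
      this.conf = conf;
    }

    public boolean next(NullWritable key, BytesWritable value)
        throws IOException {
      if (processed) {
        return false; // the single record has already been returned
      }
      if (split.getLength() > Integer.MAX_VALUE) {
        throw new IOException("File too large to buffer: " + split.getPath());
      }
      byte[] contents = new byte[(int) split.getLength()];
      Path file = split.getPath();
      FileSystem fs = file.getFileSystem(conf);
      FSDataInputStream in = null;
      try {
        in = fs.open(file);
        IOUtils.readFully(in, contents, 0, contents.length);
        value.set(contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      processed = true;
      return true;
    }

    public NullWritable createKey() { return NullWritable.get(); }

    public BytesWritable createValue() { return new BytesWritable(); }

    public long getPos() throws IOException {
      return processed ? split.getLength() : 0;
    }

    public float getProgress() throws IOException {
      return processed ? 1.0f : 0.0f;
    }

    public void close() throws IOException {
      // nothing to close; the stream is closed inside next()
    }
  }
}
```

With the old API you would then register it in the driver via
conf.setInputFormat(WholeFileInputFormat.class).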

For example, here's a simple record reader impl you can return
(untested, but you'll get the idea of reading whole files, and porting
to new API is easy as well): https://gist.github.com/1153161

P.S. Since you are reading whole files into memory, keep an eye on
memory usage (the above example has a 10 MB limit per file, for
example). A task could easily run out of memory if you don't check
file sizes and handle oversized files explicitly.
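
As an illustration of that kind of guard, here is a small plain-Java sketch
(no Hadoop classes, Java 7+ assumed) that refuses to buffer a file larger
than a given cap. The class name BoundedFileReader, the method
readWholeFile and the MAX_BYTES constant are made up for this example; the
10 MB default only mirrors the limit mentioned above.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: reads a whole file into memory, but only up to a cap.
public class BoundedFileReader {

    // 10 MB, mirroring the per-file limit mentioned above (an assumption).
    public static final long MAX_BYTES = 10L * 1024 * 1024;

    public static byte[] readWholeFile(Path file, long maxBytes)
            throws IOException {
        long size = Files.size(file);
        if (size > maxBytes) {
            // Fail fast instead of risking an OutOfMemoryError later.
            throw new IOException("File " + file + " is " + size
                    + " bytes, over the limit of " + maxBytes);
        }
        byte[] contents = new byte[(int) size];
        try (InputStream in = Files.newInputStream(file)) {
            int offset = 0;
            // read() may return fewer bytes than requested, so loop.
            while (offset < contents.length) {
                int n = in.read(contents, offset, contents.length - offset);
                if (n < 0) {
                    throw new EOFException("Unexpected end of file: " + file);
                }
                offset += n;
            }
        }
        return contents;
    }
}
```

In a record reader the same check would be made against the split's
length before allocating the buffer. Whether to throw, skip the file, or
truncate it is a choice for your job.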

On Thu, Aug 18, 2011 at 4:28 AM, Zhixuan Zhu <zzhu@calpont.com> wrote:
> I'm new to Hadoop and currently using Hadoop 0.20.2 to try out some simple
> tasks. I'm trying to send each whole file of the input directory to the
> mapper without splitting them line by line. How should I set the input
> format class? I know I could derive a customized FileInputFormat class
> and override the isSplitable function. But I have no idea how to
> implement the record reader. Any suggestions or sample code would
> be greatly appreciated.
> Thanks in advance,
> Grace

Harsh J
