hadoop-common-user mailing list archives

From Alex Kozlov <ale...@cloudera.com>
Subject Re: Passing whole text file to a single map
Date Sun, 24 Jan 2010 00:30:22 GMT
By design, TextInputFormat splits the file into lines and passes each one
as a record.

If you override isSplitable() to return false, each file becomes a single
split, but the mapper still receives it as a series of line records.
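
For reference, the override itself is small. This is only a minimal sketch
against the old org.apache.hadoop.mapred API in 0.20.x; the class name
NonSplittableTextInputFormat is made up for illustration and is not something
that ships with Hadoop:

```java
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Hypothetical class name, chosen for illustration only.
class NonSplittableTextInputFormat extends TextInputFormat {
    // Returning false keeps each input file in a single split, so one map
    // task sees the whole file. The record reader is unchanged, so that
    // task still receives the file one line at a time.
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}
```

With this in place one map task handles one file, but map() is still called
once per line, which is why it does not by itself hand you the whole file.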

If you want the contents of a single file as one record, the best way is to
put the files into a SequenceFile, one file per record, with the file name as
the key and the file's bytes as the value.
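
A rough sketch of the packing step, assuming the 0.20-era SequenceFile API
and files small enough to hold in memory. The class name WholeFilePacker and
the method pack() are hypothetical, and the output path is assumed to lie
outside the input directory:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Hypothetical helper, for illustration only.
final class WholeFilePacker {
    // Writes every regular file in inputDir into one SequenceFile:
    // key = file name (Text), value = whole file contents (BytesWritable).
    // Assumes inputDir exists and each file is well under 2 GB.
    static void pack(FileSystem fs, Configuration conf, Path inputDir, Path output)
            throws IOException {
        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, output, Text.class, BytesWritable.class);
        try {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isDir()) {
                    continue; // skip subdirectories
                }
                byte[] contents = new byte[(int) status.getLen()];
                FSDataInputStream in = fs.open(status.getPath());
                try {
                    in.readFully(0, contents); // read the entire file from offset 0
                } finally {
                    in.close();
                }
                writer.append(new Text(status.getPath().getName()),
                              new BytesWritable(contents));
            }
        } finally {
            writer.close();
        }
    }
}
```

A job that reads the result with SequenceFileInputFormat should then get one
(Text, BytesWritable) record per original file.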

Alternatively, you can give the mapper a file in which each line is a file
name, and open the named file explicitly within the mapper.
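
The per-record logic for this approach can be sketched in plain Java,
independent of Hadoop. The names FileNameRecord and readWholeFile are
hypothetical. In a real mapper the same step would run inside map(), and
files on HDFS would be opened through Hadoop's FileSystem.open(Path) rather
than java.nio:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, for illustration only.
final class FileNameRecord {
    // Treats one input record (a line of the list file) as a file name
    // and returns that file's complete contents as bytes.
    static byte[] readWholeFile(String record) throws IOException {
        String fileName = record.trim(); // drop the trailing newline, if any
        return Files.readAllBytes(Paths.get(fileName));
    }
}
```

Because the list file is tiny, splitting it into lines is harmless; each line
just tells one map() call which file to load in full.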

On Sat, Jan 23, 2010 at 8:48 AM, prashant ullegaddi <
prashullegaddi@gmail.com> wrote:

> Why don't you extend FileInputFormat and override isSplitable()
> <http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/mapred/FileInputFormat.html#isSplitable%28org.apache.hadoop.fs.FileSystem,%20org.apache.hadoop.fs.Path%29>
> so that it returns false?
>
>
> On Sat, Jan 23, 2010 at 10:05 PM, stolikp <stolikp@o2.pl> wrote:
>
> >
> > I've got some text files in my input directory and I want to pass each
> > single text file (the whole file, not just a line) to a map (one file
> > per map). How can I do this? TextInputFormat splits text into lines and
> > I do not want this to happen.
> > I tried:
> >
> > http://hadoop.apache.org/common/docs/r0.20./streaming.html#How+do+I+process+files%2C+one+per+map%3F
> > but it doesn't work for me; the compiler doesn't know what
> > NonSplitableTextInputFormat.class is.
> > I'm using Hadoop 0.20.1.
> >
> >
>
>
> --
> Thanks,
> Prashant Ullegaddi,
> Search and Information Extraction Lab,
> IIIT-Hyderabad, INDIA.
>
