hadoop-common-user mailing list archives

From Aaron Kimball <aa...@cloudera.com>
Subject Re: Are SequenceFiles split? If so, how?
Date Mon, 20 Apr 2009 05:18:21 GMT
Yes, there can be more than one InputSplit per SequenceFile. The file will
be split more or less along HDFS block boundaries (64 MB by default). The
actual "edges" of the splits are adjusted to the next sync point between
blocks of key-value pairs, so a split might be a few kilobytes off.
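
To make the boundary arithmetic concrete, here is a rough sketch of where the nominal split edges fall for a single file. It is a simplification, not Hadoop's actual code: the class and method names are made up for illustration, and it ignores the adjustment to the next block of key-value pairs that the record reader performs at read time.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: cuts a file into nominal
// {offset, length} pairs every blockSize bytes.
class NominalSplits {
    static List<long[]> nominalSplits(long fileLength, long blockSize) {
        List<long[]> splits = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            // The last split is whatever remains after the full blocks.
            long length = Math.min(blockSize, fileLength - offset);
            splits.add(new long[] {offset, length});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Assumed example: a 200 MB file with a 64 MB block size.
        for (long[] s : nominalSplits(200 * mb, 64 * mb)) {
            System.out.println("offset=" + s[0] / mb + " MB, length=" + s[1] / mb + " MB");
        }
    }
}
```

For the assumed 200 MB file this yields four nominal splits: three of 64 MB and a final one of 8 MB.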

The SequenceFileInputFormat regards mapred.map.tasks (conf.setNumMapTasks())
as a hint, not a set-in-stone metric. (The number of reduce tasks, though,
is always 100% user-controlled.) If you need exact control over the number
of map tasks, you'll need to subclass it and modify this behavior. That
having been said -- are you sure you actually need to precisely control this
value? Or is it enough to know how many splits were created?
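
As a sketch of why the map-task count is only a hint: as far as I recall, the old-API FileInputFormat turns the requested number of maps into a "goal" split size and then caps it at the block size. The standalone class below only mimics that arithmetic. The class, method, and parameter names are my own, and the formula is an assumption about the implementation rather than a quote from it, so check it against the source of your Hadoop version.

```java
// Hypothetical standalone sketch, not Hadoop code.
class SplitSizeHint {
    // Assumed rule: split size = max(minSize, min(totalSize / requestedMaps, blockSize)).
    static long splitSize(long totalSize, int requestedMaps, long minSize, long blockSize) {
        long goalSize = totalSize / Math.max(requestedMaps, 1);
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Approximate split count: ceiling of totalSize / splitSize.
    static long approxSplits(long totalSize, int requestedMaps, long minSize, long blockSize) {
        long size = splitSize(totalSize, requestedMaps, minSize, blockSize);
        return (totalSize + size - 1) / size;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long total = 1024 * mb;  // assumed 1 GB of input
        long block = 64 * mb;    // assumed 64 MB block size
        System.out.println("asked for 2 maps  -> " + approxSplits(total, 2, 1, block) + " splits");
        System.out.println("asked for 64 maps -> " + approxSplits(total, 64, 1, block) + " splits");
    }
}
```

Under these assumptions, asking for 2 maps over 1 GB still gives about 16 splits because the block size caps the split size, while asking for 64 maps gives about 64 smaller splits. Getting fewer, larger splits would mean raising the minimum split size (or subclassing, as mentioned above) rather than lowering the requested map count.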

- Aaron

On Sun, Apr 19, 2009 at 7:23 PM, Barnet Wagman <b.wagman@comcast.net> wrote:

> Suppose a SequenceFile (containing keys and values that are BytesWritable)
> is used as input. Will it be divided into InputSplits?  If so, what
> criteria are used for splitting?
> I'm interested in this because I need to control the number of map tasks
> used, which (if I understand it correctly), is equal to the number of
> InputSplits.
> thanks,
> bw
