hadoop-common-user mailing list archives

From Aaron Kimball <aa...@cloudera.com>
Subject Re: Are SequenceFiles split? If so, how?
Date Thu, 23 Apr 2009 08:56:12 GMT
Explicitly controlling your splits will be very challenging. Taking the case
where you have expensive (X) and cheap (C) objects to process, you may have
a file where the records are lined up X C X C X C X X X X X C C C. In this
case, you'll need to scan through the whole file and build splits such that
the lengthy run of expensive objects is broken up into separate splits, but
the run of cheap objects is consolidated. I doubt you can do this without
scanning through the data (which is often what constitutes the bulk of the
time in a MapReduce program).
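To make that concrete, here's a rough standalone sketch (plain Java, with made-up per-record costs; none of this is Hadoop API) of the kind of one-pass scan you'd need: walk the record costs in order and close off a split whenever the running cost reaches a target, so expensive runs get broken up and cheap runs get consolidated.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: pack records into splits of roughly equal total cost.
// Costs and the target are illustrative values, not anything Hadoop measures.
public class CostBalancedSplitter {
    static List<int[]> buildSplits(int[] costs, int targetCostPerSplit) {
        List<int[]> splits = new ArrayList<>();
        int start = 0, running = 0;
        for (int i = 0; i < costs.length; i++) {
            running += costs[i];
            if (running >= targetCostPerSplit) {
                splits.add(new int[] {start, i + 1}); // [start, end) record range
                start = i + 1;
                running = 0;
            }
        }
        if (start < costs.length) splits.add(new int[] {start, costs.length});
        return splits;
    }

    public static void main(String[] args) {
        // X = expensive (cost 10), C = cheap (cost 1): X C X C X C X X X X X C C C
        int[] costs = {10, 1, 10, 1, 10, 1, 10, 10, 10, 10, 10, 1, 1, 1};
        for (int[] s : buildSplits(costs, 20)) {
            System.out.println("split: records " + s[0] + ".." + (s[1] - 1));
        }
    }
}
```

Note that even this toy version has to touch every record once, which is exactly the extra full scan you'd be trying to avoid.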

But how much data are you using? I would imagine that if you're operating at
the scale where Hadoop makes sense, then the high- and low-cost objects will
-- on average -- balance out and tasks will be roughly evenly proportioned.

In general, I would just dodge the problem by making sure your splits are
relatively small compared to the size of your input data. If you have 5
million objects to process, then make each split cover roughly, say, 20,000
of them. Then even if some splits take a long time to process and others
finish quickly, one CPU may dispatch a dozen cheap splits in the time it
takes one unlucky JVM to process a single very expensive split. Now you
haven't had to manually balance anything, and you still get to keep all
your CPUs full.
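As an illustration of why fine-grained splits balance out on their own, here's a toy simulation (plain Java with invented numbers; this is not Hadoop's actual scheduler): dispatch many small mixed-cost splits greedily to whichever worker frees up first, and compare the finish time against a perfectly balanced ideal.

```java
import java.util.PriorityQueue;
import java.util.Random;

// Hypothetical sketch: greedy list scheduling of small splits onto workers.
// With many small splits, the slowest worker finishes at most one split's
// cost later than the perfectly balanced ideal.
public class SplitSchedulingDemo {
    static long makespan(long[] splitCosts, int workers) {
        PriorityQueue<Long> load = new PriorityQueue<>();
        for (int w = 0; w < workers; w++) load.add(0L);
        long max = 0;
        for (long c : splitCosts) {
            long t = load.poll() + c; // next free worker takes the split
            load.add(t);
            max = Math.max(max, t);
        }
        return max;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int workers = 8;
        // 250 small splits, each cheap (cost 1) or expensive (cost 10) at random
        long[] costs = new long[250];
        long total = 0;
        for (int i = 0; i < costs.length; i++) {
            costs[i] = rnd.nextInt(4) == 0 ? 10 : 1;
            total += costs[i];
        }
        long ideal = (total + workers - 1) / workers; // perfect balance
        System.out.println("ideal per-worker time: " + ideal);
        System.out.println("greedy makespan:       " + makespan(costs, workers));
    }
}
```

The greedy makespan never exceeds the ideal by more than the cost of a single split, which is why keeping splits small keeps the gap small.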

- Aaron

On Mon, Apr 20, 2009 at 11:25 PM, Barnet Wagman <b.wagman@comcast.net> wrote:

> Thanks Aaron, that really helps. I probably do need to control the number
> of splits. My input 'data' consists of Java objects, and their size (in
> bytes) doesn't necessarily reflect the amount of time needed for each map
> operation. I need to ensure that I have enough map tasks so that all CPUs
> are utilized and the job gets done in a reasonable amount of time.
> (Currently I'm creating multiple input files and making them unsplittable,
> but subclassing SequenceFileInputFormat to explicitly control the number of
> splits sounds like a better approach.)
> Barnet
> Aaron Kimball wrote:
>> Yes, there can be more than one InputSplit per SequenceFile. The file will
>> be split more-or-less along 64 MB boundaries. (The actual "edges" of the
>> splits will be adjusted to hit the next block of key-value pairs, so they
>> might be a few kilobytes off.)
>> The SequenceFileInputFormat regards mapred.map.tasks
>> (conf.setNumMapTasks()) as a hint, not a set-in-stone metric. (The number
>> of reduce tasks, though, is always 100% user-controlled.) If you need
>> exact control over the number of map tasks, you'll need to subclass it and
>> modify this behavior. That having been said -- are you sure you actually
>> need to precisely control this value? Or is it enough to know how many
>> splits were created?
>> - Aaron
>> On Sun, Apr 19, 2009 at 7:23 PM, Barnet Wagman <b.wagman@comcast.net>
>> wrote:
>>> Suppose a SequenceFile (containing keys and values that are
>>> BytesWritable) is used as input. Will it be divided into InputSplits? If
>>> so, what's the criterion used for splitting?
>>> I'm interested in this because I need to control the number of map tasks
>>> used, which (if I understand it correctly) is equal to the number of
>>> InputSplits.
>>> thanks,
>>> bw
