hadoop-common-user mailing list archives

From Amareshwari Sriramadasu <amar...@yahoo-inc.com>
Subject Re: Hadoop Streaming Semantics
Date Fri, 30 Jan 2009 03:49:02 GMT
You can use NLineInputFormat for this: it splits the input at line
boundaries so that each split contains N lines (N=1 by default).
So each map task processes one line of your file list.
See 
http://hadoop.apache.org/core/docs/r0.19.0/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
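
For a streaming job you can select it from the command line. A rough,
untested sketch - the jar path, bucket names, and script name below are
placeholders for your setup; linespermap is the property the mapred.lib
NLineInputFormat reads:

  hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.19.0-streaming.jar \
    -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
    -jobconf mapred.line.input.format.linespermap=1 \
    -jobconf mapred.task.timeout=0 \
    -jobconf mapred.reduce.tasks=0 \
    -input s3n://your-bucket/file-list.txt \
    -output s3n://your-bucket/results \
    -mapper process_files.rb \
    -file process_files.rb

One caveat: with an input format other than TextInputFormat, streaming
may pass the byte-offset key and a tab in front of each line, so your
script may need to strip everything up to the first tab.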

-Amareshwari
S D wrote:
> Hello,
>
> I have a clarifying question about Hadoop streaming. I'm new to the list and
> didn't see anything posted that covers my question - my apologies if I
> overlooked a relevant post.
>
> I have an input file consisting of a list of files (one per line) that need
> to be processed independently of each other. Processing each file takes a
> significant amount of time - perhaps an hour. I'm using Hadoop streaming
> without a reduce function to process each file and save the results (back to
> S3 native in my case). To handle the long processing time of each file I've
> set mapred.task.timeout=0, and I have a straightforward Ruby script
> reading from STDIN:
>
> STDIN.each_line do |line|
>    file = line.chomp  # strip the trailing newline to get the file name
>    # Fetch and process the file (long running)
> end
>
> Currently I'm using a cluster of 3 workers, each of which can run up to
> 2 tasks simultaneously. I've noticed that if I have a single input file
> with many lines (more than 6, given my cluster), not all workers are
> allocated tasks; two workers get one task each while the third sits
> idle. If I split my input into multiple files (at least 6), then all
> workers are immediately allocated the maximum number of tasks they can
> handle.
>
> My interpretation of this is fuzzy. It seems that Hadoop streaming will take
> separate input files and allocate a new task per file (up to the maximum
> constraint), but when given a single input file it is unclear whether a
> new task is allocated per file or per line. My understanding of the Hadoop
> Java API is that (unlike Hadoop streaming) a single input file will
> be broken up into separate lines and the maximum number of map tasks will
> automagically be allocated to handle the lines of the file (assuming the use
> of TextInputFormat).
>
> Can someone clarify this?
>
> Thanks,
> SD
>

