hadoop-common-user mailing list archives

From S D <sd.codewarr...@gmail.com>
Subject Re: Hadoop Streaming Semantics
Date Fri, 30 Jan 2009 21:04:07 GMT
Thanks for your response, Amareshwari. I'm unclear on how to take advantage
of NLineInputFormat with Hadoop Streaming. Is the idea that I modify the
streaming jar file (contrib/streaming/hadoop-<version>-streaming.jar) to
include the NLineInputFormat class and then pass a command line
configuration param to indicate that NLineInputFormat should be used? If
this is the proper approach, can you point me to an example of what kind of
param should be specified? I appreciate your help.
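
For concreteness, is the invocation supposed to look something like the
following? (This is just my guess - process.rb, the paths, and the
linespermap property name below are placeholders I'm not sure about.)

  # my guess at the flags; process.rb and the paths are placeholders
  hadoop jar contrib/streaming/hadoop-<version>-streaming.jar \
      -input file-list.txt \
      -output results \
      -mapper process.rb \
      -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
      -jobconf mapred.line.input.format.linespermap=1 \
      -jobconf mapred.task.timeout=0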

Thanks,
SD

On Thu, Jan 29, 2009 at 10:49 PM, Amareshwari Sriramadasu <
amarsri@yahoo-inc.com> wrote:

> You can use NLineInputFormat for this; it makes one split per line (N=1
> by default), so each map task processes one line.
> See
> http://hadoop.apache.org/core/docs/r0.19.0/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
>
> -Amareshwari
>
> S D wrote:
>
>> Hello,
>>
>> I have a clarifying question about Hadoop streaming. I'm new to the list
>> and didn't see anything posted that covers my questions - my apologies if
>> I overlooked a relevant post.
>>
>> I have an input file consisting of a list of files (one per line) that
>> need to be processed independently of each other. The processing time for
>> each file is significant - perhaps an hour each. I'm using Hadoop
>> streaming without a reduce function to process each file and save the
>> results (back to S3 native in my case). To handle the long processing
>> time of each file I've set mapred.task.timeout=0, and I have a pretty
>> straightforward Ruby script reading from STDIN:
>>
>> STDIN.each_line do |line|
>>   # Get file from contents of line
>>   # Process file (long running)
>> end
>>
>> Currently I'm using a cluster of 3 workers, each of which can run up to
>> 2 tasks simultaneously. I've noticed that if I have a single input file
>> with many lines (more than 6, given my cluster), then not all workers
>> are allocated tasks; I've seen two workers allocated one task each while
>> the other worker sits idle. If I split my input into multiple files (at
>> least 6), then all workers are immediately allocated the maximum number
>> of tasks they can handle.
>>
>> My interpretation of this is fuzzy. It seems that Hadoop streaming takes
>> separate input files and allocates a new task per file (up to the maximum
>> constraint), but when given a single input file it is unclear whether a
>> new task is allocated per file or per line. My understanding of Hadoop
>> Java is that (unlike Hadoop streaming) when given a single input file,
>> the file will be broken up into separate lines and the maximum number of
>> map tasks will automagically be allocated to handle the lines of the
>> file (assuming the use of TextInputFormat).
>>
>> Can someone clarify this?
>>
>> Thanks,
>> SD
>>
