hadoop-common-user mailing list archives

From Amareshwari Sriramadasu <amar...@yahoo-inc.com>
Subject Re: Hadoop Streaming Semantics
Date Mon, 02 Feb 2009 04:00:46 GMT
Which version of Hadoop are you using?

You can directly use -inputformat 
org.apache.hadoop.mapred.lib.NLineInputFormat for your streaming job. 
You need not include it in your streaming jar.
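
For example, a submission along these lines should work - an untested
sketch, where file-list.txt, output-dir, and process_file.rb are
placeholder names for your file list, output location, and Ruby script:

  hadoop jar contrib/streaming/hadoop-<version>-streaming.jar \
      -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
      -jobconf mapred.line.input.format.linespermap=1 \
      -jobconf mapred.task.timeout=0 \
      -jobconf mapred.reduce.tasks=0 \
      -input file-list.txt \
      -output output-dir \
      -mapper process_file.rb \
      -file process_file.rb

With linespermap=1 (the default), each line of the input becomes its own
split, so every file in your list gets its own map task and all of your
task slots can fill up. (On newer releases you can pass these settings
with -D instead of -jobconf.)
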
-Amareshwari

S D wrote:
> Thanks for your response, Amareshwari. I'm unclear on how to take advantage
> of NLineInputFormat with Hadoop Streaming. Is the idea that I modify the
> streaming jar file (contrib/streaming/hadoop-<version>-streaming.jar) to
> include the NLineInputFormat class and then pass a command line
> configuration param to indicate that NLineInputFormat should be used? If
> this is the proper approach, can you point me to an example of what kind of
> param should be specified? I appreciate your help.
>
> Thanks,
> SD
>
> On Thu, Jan 29, 2009 at 10:49 PM, Amareshwari Sriramadasu
> <amarsri@yahoo-inc.com> wrote:
>
>> You can use NLineInputFormat for this, which makes each split contain one
>> line (N=1 by default).
>> So, each map task processes one line.
>> See
>> http://hadoop.apache.org/core/docs/r0.19.0/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
>>
>> -Amareshwari
>>
>> S D wrote:
>>
>>> Hello,
>>>
>>> I have a clarifying question about Hadoop streaming. I'm new to the list
>>> and didn't see anything posted that covers my question - my apologies if I
>>> overlooked a relevant post.
>>>
>>> I have an input file consisting of a list of files (one per line) that
>>> need to be processed independently of each other. The duration for
>>> processing each file is significant - perhaps an hour each. I'm using
>>> Hadoop streaming without a reduce function to process each file and save
>>> the results (back to S3 native in my case). To handle the long processing
>>> time of each file I've set mapred.task.timeout=0, and I have a pretty
>>> straightforward Ruby script reading from STDIN:
>>>
>>> STDIN.each_line do |line|
>>>   path = line.chomp  # Get file from contents of line
>>>   # Process file at 'path' (long running)
>>> end
>>>
>>> Currently I'm using a cluster of 3 workers in which each worker can have
>>> up to 2 tasks running simultaneously. I've noticed that if I have a single
>>> input file with many lines (more than 6, given my cluster), then not all
>>> workers are allocated tasks; two workers may each be allocated one task
>>> while the other worker sits idle. If I split my input file into multiple
>>> files (at least 6), then all workers are immediately allocated the maximum
>>> number of tasks that they can handle.
>>>
>>> My interpretation of this is fuzzy. It seems that Hadoop streaming will
>>> take separate input files and allocate a new task per file (up to the
>>> maximum constraint), but if given a single input file it is unclear whether
>>> a new task is allocated per file or per line. My understanding of Hadoop
>>> Java is that (unlike Hadoop streaming) when given a single input file, the
>>> file will be broken up into separate lines and the maximum number of map
>>> tasks will automagically be allocated to handle the lines of the file
>>> (assuming the use of TextInputFormat).
>>>
>>> Can someone clarify this?
>>>
>>> Thanks,
>>> SD
>>>

