hadoop-common-user mailing list archives

From Raj Vishwanathan <rajv...@yahoo.com>
Subject Re: Combining MultithreadedMapper threadpool size & map.tasks.maximum
Date Fri, 10 Feb 2012 22:39:21 GMT

Here is what I understand 

The RecordReader for the MultithreadedMapper takes the input split and cycles the records among the
available threads. It also ensures that the map outputs are synchronized. 

So what Bejoy says is what will happen for the wordcount program. 
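
To make that concrete, here is a rough sketch in plain Java (no Hadoop classes) of the behaviour as I understand it. It is a simulation under my own assumptions, not the actual MultithreadedMapper source: the class SharedReaderSketch, the Result holder and runMapTask() are names I made up, an Iterator stands in for the single RecordReader, and a shared map stands in for the synchronized map output. Each thread takes the next whole record from the shared reader and runs the word count map logic on it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SharedReaderSketch {

    /** Hypothetical holder for what one simulated map task produced. */
    public static final class Result {
        public final Map<String, Integer> wordCounts = new TreeMap<>();
        public final int[] recordsPerThread;

        Result(int threads) {
            recordsPerThread = new int[threads];
        }
    }

    /** Simulates one map task: one reader, numThreads threads calling the map logic. */
    public static Result runMapTask(List<String> split, int numThreads) {
        // Stand-in for the single RecordReader over the input split.
        final Iterator<String> reader = split.iterator();
        final Result result = new Result(numThreads);
        List<Thread> workers = new ArrayList<>();

        for (int t = 0; t < numThreads; t++) {
            final int id = t;
            Thread worker = new Thread(() -> {
                while (true) {
                    String line;
                    synchronized (reader) {          // reading stays serialized
                        if (!reader.hasNext()) {
                            return;                  // no records left for this thread
                        }
                        line = reader.next();        // a thread always gets a whole record
                    }
                    result.recordsPerThread[id]++;   // only this thread writes slot id

                    // The word count map() body; this part runs in parallel.
                    for (String word : line.trim().split("\\s+")) {
                        if (word.isEmpty()) {
                            continue;
                        }
                        synchronized (result.wordCounts) {   // output is synchronized
                            result.wordCounts.merge(word, 1, Integer::sum);
                        }
                    }
                }
            });
            workers.add(worker);
            worker.start();
        }

        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for map threads", e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // One line, two threads: the line is not cut into two halves.
        Result r = runMapTask(Arrays.asList("foo bar lambda beta"), 2);
        System.out.println(r.wordCounts);
        System.out.println(Arrays.toString(r.recordsPerThread));
    }
}
```

With a single input line and 2 threads, only one of the two threads ever receives a record in this sketch; the other finds the reader empty and exits. Which thread gets the line depends on scheduling. If the split has 2 lines, both threads can be busy at once, which is the case Bejoy describes.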


> From: "bejoy.hadoop@gmail.com" <bejoy.hadoop@gmail.com>
>To: common-user@hadoop.apache.org 
>Sent: Friday, February 10, 2012 11:15 AM
>Subject: Re: Combining MultithreadedMapper threadpool size & map.tasks.maximum
>Hi Rob
>       I'll try to answer this. From my understanding, if you are using MultithreadedMapper
>on the word count example with TextInputFormat, and you have 2 threads and 2 lines
>in your input split, the RecordReader would read line 1 and give it to map thread 1, and line
>2 to map thread 2. So the same map logic would run on these two
>lines in parallel. This would be the default behavior.
>Bejoy K S
>From handheld, Please excuse typos.
>-----Original Message-----
>From: Rob Stewart <robstewart57@gmail.com>
>Date: Fri, 10 Feb 2012 18:39:44 
>To: <common-user@hadoop.apache.org>
>Reply-To: common-user@hadoop.apache.org
>Subject: Re: Combining MultithreadedMapper threadpool size & map.tasks.maximum
>Thanks, this is a lot clearer. One final question...
>On 10 February 2012 14:20, Harsh J <harsh@cloudera.com> wrote:
>> Hello again,
>> On Fri, Feb 10, 2012 at 7:31 PM, Rob Stewart <robstewart57@gmail.com> wrote:
>>> OK, take word count. The <k,v> to the map is <null,"foo bar lambda
>>> beta">. The canonical Hadoop program would tokenize this line of text
>>> and output <"foo",1> and so on. How would the multithreadedmapper know
>>> how to further divide this line of text into, say: [<null,"foo
>>> bar">,<null,"lambda beta">] for 2 threads to run in parallel? Can you
>>> somehow provide an additional record reader to split the input to the
>>> map task into sub-inputs for each thread?
>> In MultithreadedMapper, the IO work is still single threaded, while
>> the map() calling post-read is multithreaded. But yes you could use a
>> mix of CombineFileInputFormat and some custom logic to have multiple
>> local splits per map task, and divide readers of them among your
>> threads. But why do all this when that's what slots at the TT are for?
>I'm still unsure how the multi-threaded mapper knows how to split the
>input value into chunks, one chunk for each thread. There is only one
>example in the Hadoop 0.23 trunk that offers an example:
>And in that source code, there is no custom logic for local splits per
>map task at all. Again, going back to the word count example. Given a
>line of text as input to a map, which comprises 6 words. I
>specify .setNumberOfThreads( 2 ), so ideally, I'd want 3 words
>analysed by one thread, and the other 3 by the other. Is that what would
>happen? i.e., I'm unsure whether the multithreadedmapper class does
>the splitting of inputs to map tasks...