hadoop-common-user mailing list archives

From bejoy.had...@gmail.com
Subject Re: Combining MultithreadedMapper threadpool size & map.tasks.maximum
Date Fri, 10 Feb 2012 19:15:08 GMT
Hi Rob
       I'll try to answer this. From my understanding, if you use MultithreadedMapper
on the word count example with TextInputFormat, and you have 2 threads and 2 lines in your
input split, the RecordReader would read line 1 and give it to map thread 1, and line 2 to map
thread 2. So the same map logic would run on these two lines in parallel, one whole line
per thread. This would be the default behavior.
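
To illustrate the point above, here is a rough stand-alone sketch in plain Java (no Hadoop
classes). It only simulates the behaviour: one reader loop hands whole lines to a pool of
map threads, and no line is ever divided between threads. The class name
RecordPerThreadSketch and its methods are made up for this illustration, and the real
MultithreadedMapper is implemented differently internally, so treat it as an approximation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical simulation only; this is not Hadoop code.
public class RecordPerThreadSketch {

    // Word-count style map(): receives ONE whole record (a line) and
    // emits <word, 1> for each token by incrementing a shared counter.
    static void map(String line, Map<String, Integer> output) {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                output.merge(word, 1, Integer::sum); // atomic on ConcurrentHashMap
            }
        }
    }

    // Simulates one map task: a single reader loop, several map threads.
    static Map<String, Integer> run(List<String> split, int numThreads) {
        Map<String, Integer> output = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        // The "record reader": this loop is the only place input is read,
        // and each whole line is handed to whichever thread is free.
        for (String line : split) {
            pool.submit(() -> map(line, output));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("map threads did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for map threads", e);
        }
        return output;
    }

    public static void main(String[] args) {
        // Two lines, two threads: each line is tokenized by a single thread.
        List<String> split = Arrays.asList("foo bar lambda beta", "foo beta");
        System.out.println(run(split, 2));
    }
}
```

With a split that holds only one line, the second thread in this sketch simply has nothing
to do; the line's words are not shared out between the threads.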
Regards
Bejoy K S

From handheld, Please excuse typos.

-----Original Message-----
From: Rob Stewart <robstewart57@gmail.com>
Date: Fri, 10 Feb 2012 18:39:44 
To: <common-user@hadoop.apache.org>
Reply-To: common-user@hadoop.apache.org
Subject: Re: Combining MultithreadedMapper threadpool size & map.tasks.maximum

Thanks, this is a lot clearer. One final question...

On 10 February 2012 14:20, Harsh J <harsh@cloudera.com> wrote:
> Hello again,
>
> On Fri, Feb 10, 2012 at 7:31 PM, Rob Stewart <robstewart57@gmail.com> wrote:
>> OK, take word count. The <k,v> to the map is <null,"foo bar lambda
>> beta">. The canonical Hadoop program would tokenize this line of text
>> and output <"foo",1> and so on. How would the multithreadedmapper know
>> how to further divide this line of text into, say: [<null,"foo
>> bar">,<null,"lambda beta">] for 2 threads to run in parallel? Can you
>> somehow provide an additional record reader to split the input to the
>> map task into sub-inputs for each thread?
>
> In MultithreadedMapper, the IO work is still single threaded, while
> the map() calling post-read is multithreaded. But yes you could use a
> mix of CombineFileInputFormat and some custom logic to have multiple
> local splits per map task, and divide readers of them among your
> threads. But why do all this when that's what slots at the TT are for?

I'm still unsure how the multi-threaded mapper knows how to split the
input value into chunks, one chunk for each thread. There is only one
example of its use in the Hadoop 0.23 trunk:
hadoop-mapreduce-project/src/test/mapred/org/apache/hadoop/mapreduce/lib/map/TestMultithreadedMapper.java

And in that source code, there is no custom logic for local splits per
map task at all. Again, going back to the word count example: given a
line of text as input to a map, which comprises 6 words, I
specify .setNumberOfThreads( 2 ), so ideally, I'd want 3 words
analysed by one thread, and the other 3 by the other. Is that what would
happen? i.e. - I'm unsure whether the multithreadedmapper class does
the splitting of inputs to map tasks...

Regards,