From: Robert Evans <evans@yahoo-inc.com>
To: common-dev@hadoop.apache.org, core-dev@hadoop.apache.org
Date: Thu, 26 Jul 2012 07:42:53 -0700
Subject: Re: MultithreadedMapper

In general, multithreading does not get you much in traditional Map/Reduce. If you want the mappers to run faster, you can drop the split size and get a similar result, because you get more parallelism. That is the use case we have typically concentrated on. About the only time MultithreadedMapper makes a lot of sense is when there is a lot of computation associated with each key/value pair, i.e. when your process is compute bound rather than I/O bound. Wordcount is typically going to be I/O bound.

I am not aware of any work being done to reduce lock contention in these cases. If you want to file a generic JIRA for the lock contention, that would be great. My gut feeling is that the lock is so coarse because the InputFormats themselves are not thread safe.
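For reference, wiring up either of the two approaches above (smaller splits, or MultithreadedMapper when map() is CPU-heavy) looks roughly like the sketch below. It assumes the new org.apache.hadoop.mapreduce API as shipped in 1.x; WordMapper, the 16MB max split size, and the 8 threads are placeholder choices, not recommendations.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

public class MultithreadedWordCountSetup {

  // Placeholder mapper: stands in for whatever per-record work you actually have.
  public static class WordMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (token.isEmpty()) {
          continue;
        }
        word.set(token);
        context.write(word, ONE);
      }
    }
  }

  public static Job configure(Configuration conf) throws IOException {
    Job job = new Job(conf, "wordcount");

    // Approach 1: more map-side parallelism the simple way -- smaller splits
    // mean more map tasks. 16MB is only an example value.
    FileInputFormat.setMaxInputSplitSize(job, 16L * 1024 * 1024);

    // Approach 2: MultithreadedMapper -- worthwhile only when map() does a lot
    // of CPU work per record, since all threads share one RecordReader.
    job.setMapperClass(MultithreadedMapper.class);
    MultithreadedMapper.setMapperClass(job, WordMapper.class);
    MultithreadedMapper.setNumberOfThreads(job, 8);

    return job;
  }
}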
Perhaps the simplest thing you could do is change it so that each thread gets its own "split" of the actual split, and then, if one thread finishes early, add some logic to share a "split" among a limited number of threads. But, as with anything performance related, never trust your gut: please profile it before making any code changes. (A toy illustration of that per-thread sub-split idea follows the quoted message below.)

--Bobby Evans

On 7/26/12 12:47 AM, "kenyh" wrote:

> Multithreaded MapReduce introduces multithreaded execution into the map task.
> In Hadoop 1.0.2, MultithreadedMapper implements multithreaded execution of
> the map function. But I found that synchronization is needed for record
> reading (reading the input key and value) and for writing results. This
> contention brings heavy performance overhead, which increases a 50MB
> wordcount task's execution time from 40 seconds to 1 minute. I wonder whether
> there is any optimization of the multithreaded mapper that decreases the
> contention on input reading and output?
>
> --
> View this message in context:
> http://old.nabble.com/MultithreadedMapper-tp34213805p34213805.html
> Sent from the Hadoop core-dev mailing list archive at Nabble.com.
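A minimal sketch of the per-thread sub-split idea described above, assuming nothing from the Hadoop API: a plain List stands in for the task's input split, and process() stands in for the per-record map() work. Because each thread owns a disjoint sub-range, the read path needs no shared lock.

import java.util.ArrayList;
import java.util.List;

public class SubSplitSketch {

  public static void main(String[] args) throws InterruptedException {
    // Toy "split": in a real task this would be the bytes behind one InputSplit.
    List<String> split = new ArrayList<String>();
    for (int i = 0; i < 1000; i++) {
      split.add("record-" + i);
    }

    int numThreads = 4;
    Thread[] workers = new Thread[numThreads];
    for (int t = 0; t < numThreads; t++) {
      // Carve the split into contiguous, non-overlapping sub-splits.
      final List<String> subSplit =
          split.subList(t * split.size() / numThreads,
                        (t + 1) * split.size() / numThreads);
      workers[t] = new Thread(new Runnable() {
        public void run() {
          // Each thread iterates only its own sub-split, so there is no
          // contention on a shared reader.
          for (String record : subSplit) {
            process(record);
          }
        }
      });
      workers[t].start();
    }
    for (Thread w : workers) {
      w.join();
    }
  }

  private static void process(String record) {
    // Stand-in for the real per-record work.
  }
}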