From: Robert Evans <evans@yahoo-inc.com>
To: common-dev@hadoop.apache.org, core-dev@hadoop.apache.org
Date: Thu, 26 Jul 2012 07:42:53 -0700
Subject: Re: MultithreadedMapper

In general, multithreading does not get you much in traditional Map/Reduce. If you want the mappers to run faster, you can drop the split size and get a similar result, because you get more parallelism. That is the use case we have typically concentrated on. About the only time MultithreadedMapper makes a lot of sense is when there is a lot of computation associated with each key/value pair, i.e. when your process is compute bound rather than I/O bound. Wordcount is typically going to be I/O bound.

I am not aware of any work being done to reduce lock contention in these cases. If you want to file a generic JIRA for the lock contention, that would be great. My gut feeling is that the lock is so coarse because the InputFormats themselves are not thread safe.
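For reference, wiring up either of the two approaches above (smaller splits, or MultithreadedMapper when map() is CPU-heavy) looks roughly like the sketch below. It assumes the new org.apache.hadoop.mapreduce API as shipped in 1.x; WordMapper, the 16MB max split size, and the 8 threads are placeholder choices, not recommendations.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

public class MultithreadedWordCountSetup {

  // Placeholder mapper: stands in for whatever per-record work you actually have.
  public static class WordMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (token.isEmpty()) {
          continue;
        }
        word.set(token);
        context.write(word, ONE);
      }
    }
  }

  public static Job configure(Configuration conf) throws IOException {
    Job job = new Job(conf, "wordcount");

    // Approach 1: more map-side parallelism the simple way -- smaller splits
    // mean more map tasks. 16MB is only an example value.
    FileInputFormat.setMaxInputSplitSize(job, 16L * 1024 * 1024);

    // Approach 2: MultithreadedMapper -- worthwhile only when map() does a lot
    // of CPU work per record, since all threads share one RecordReader.
    job.setMapperClass(MultithreadedMapper.class);
    MultithreadedMapper.setMapperClass(job, WordMapper.class);
    MultithreadedMapper.setNumberOfThreads(job, 8);

    return job;
  }
}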
Perhaps the simplest thing you could do is change it so that each thread gets its own "split" of the actual split, and then, if one thread finishes early, add some logic to share a "split" among a limited number of threads. But, as with anything performance related, never trust your gut: please profile it before making any code changes. (A toy illustration of that per-thread sub-split idea follows the quoted message below.)

--Bobby Evans

On 7/26/12 12:47 AM, "kenyh" wrote:

> Multithreaded MapReduce introduces multithreaded execution into the map task.
> In Hadoop 1.0.2, MultithreadedMapper implements multithreaded execution of
> the map function. But I found that synchronization is needed for record
> reading (reading the input key and value) and for writing results. This
> contention brings heavy performance overhead, which increases a 50MB
> wordcount task's execution time from 40 seconds to 1 minute. I wonder whether
> there is any optimization of the multithreaded mapper that decreases the
> contention on input reading and output?
>
> --
> View this message in context:
> http://old.nabble.com/MultithreadedMapper-tp34213805p34213805.html
> Sent from the Hadoop core-dev mailing list archive at Nabble.com.
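A minimal sketch of the per-thread sub-split idea described above, assuming nothing from the Hadoop API: a plain List stands in for the task's input split, and process() stands in for the per-record map() work. Because each thread owns a disjoint sub-range, the read path needs no shared lock.

import java.util.ArrayList;
import java.util.List;

public class SubSplitSketch {

  public static void main(String[] args) throws InterruptedException {
    // Toy "split": in a real task this would be the bytes behind one InputSplit.
    List<String> split = new ArrayList<String>();
    for (int i = 0; i < 1000; i++) {
      split.add("record-" + i);
    }

    int numThreads = 4;
    Thread[] workers = new Thread[numThreads];
    for (int t = 0; t < numThreads; t++) {
      // Carve the split into contiguous, non-overlapping sub-splits.
      final List<String> subSplit =
          split.subList(t * split.size() / numThreads,
                        (t + 1) * split.size() / numThreads);
      workers[t] = new Thread(new Runnable() {
        public void run() {
          // Each thread iterates only its own sub-split, so there is no
          // contention on a shared reader.
          for (String record : subSplit) {
            process(record);
          }
        }
      });
      workers[t].start();
    }
    for (Thread w : workers) {
      w.join();
    }
  }

  private static void process(String record) {
    // Stand-in for the real per-record work.
  }
}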