From: Felix Chern <idryman@gmail.com>
Subject: Re: All datanodes are bad IOException when trying to implement multithreading serialization
Date: Sun, 29 Sep 2013 18:58:23 -0700
To: user@hadoop.apache.org

The number of mappers is usually the same as the number of files you feed to the job.
To reduce the number, you can use CombineFileInputFormat.
I recently wrote an article about it; take a look and see if it fits your needs:
http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/
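In rough strokes, the driver-side setup looks something like the sketch below. This assumes Hadoop 2.x, where CombineTextInputFormat ships in mapreduce.lib.input; on older releases you subclass CombineFileInputFormat yourself, which is what the article walks through. The driver class name here is made up.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CombineSmallFilesDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "combine-small-files");
            job.setJarByClass(CombineSmallFilesDriver.class);

            // Pack many small files into a few splits; one mapper runs per split.
            job.setInputFormatClass(CombineTextInputFormat.class);
            // Cap each combined split at 256 MB (an example value; tune to your cluster).
            CombineTextInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }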


Felix

On Sep 29, 2013, at 6:45 PM, yunming zhang <zhangyunming1990@gmail.com> wrote:

> I am actually trying to reduce the number of mappers, because my application uses a lot of memory (on the order of 1-2 GB of RAM per mapper). I want to use only a few mappers but still keep CPU utilization high through multithreading within a single mapper. MultithreadedMapper doesn't work for me because it duplicates the in-memory data structures.
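(Side note: MultithreadedMapper creates one Mapper instance per thread, which is why per-instance state gets duplicated. If the big structure is read-only once loaded, a common workaround is to park it in a static holder so every thread in the task JVM shares a single copy. This is only a sketch; SharedStateMapper and its lookup table are illustrative names, not anything from the attached code.)

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Illustrative mapper for MultithreadedMapper: the expensive lookup
    // table lives in a static field, so the N mapper instances created
    // inside one task JVM share one copy instead of holding N copies.
    public class SharedStateMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {

        private static volatile Map<String, Long> lookup; // read-only after init

        private static Map<String, Long> getLookup() {
            if (lookup == null) {
                synchronized (SharedStateMapper.class) {
                    if (lookup == null) {
                        Map<String, Long> m = new HashMap<String, Long>();
                        // ... load the large structure here, once per JVM ...
                        lookup = m;
                    }
                }
            }
            return lookup;
        }

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            Long hit = getLookup().get(value.toString());
            if (hit != null) {
                ctx.write(value, new LongWritable(hit));
            }
        }
    }

It would be wired up in the driver with job.setMapperClass(MultithreadedMapper.class), MultithreadedMapper.setMapperClass(job, SharedStateMapper.class), and MultithreadedMapper.setNumberOfThreads(job, 4).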

> Thanks

> Yunming


> On Sun, Sep 29, 2013 at 6:59 PM, Sonal Goyal <sonalgoyal4@gmail.com> wrote:
> Wouldn't you rather just change your split size so that you can have more mappers work on your input? What else are you doing in the mappers?
> Sent from my iPad
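(For reference, the knob Sonal means is plain configuration. A minimal sketch using the Hadoop 2.x helper; on the 1.x branches the underlying key was mapred.max.split.size, and the 64 MB figure is just an example value.)

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitTuning {
        // A smaller maximum split size yields more splits, and Hadoop
        // schedules one map task per split.
        static void useSmallerSplits(Job job) {
            FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024); // 64 MB
        }
    }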

> On Sep 30, 2013, at 2:22 AM, yunming zhang <zhangyunming1990@gmail.com> wrote:

>> Hi,

>> I was playing with the Hadoop code, trying to have a single Mapper read an input split using multiple threads. I am getting an "All datanodes are bad" IOException, and I am not sure what the issue is.

>> The reason for this work is that I suspect my computation is slow because it takes too long to create the Text() objects from the input split using a single thread. I modified LineRecordReader (since I mostly use TextInputFormat) to provide additional methods for retrieving lines from the input split: getCurrentKey2(), getCurrentValue2(), and nextKeyValue2(). I created a second FSDataInputStream and a second LineReader object for getCurrentKey2() and getCurrentValue2() to read from. Essentially I am trying to open the input split twice at different start points (one at the very beginning, the other in the middle of the split) and read from it in parallel with two threads.
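(A sketch of the two-stream setup being described, with each thread owning its own stream so that no seek/read position is shared between threads; the class and variable names are illustrative, not the poster's code.)

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.util.LineReader;

    public class TwoReaders {
        // Open the same file twice so each reader thread gets an
        // independent stream position.
        static LineReader[] open(Path file, long midpoint, Configuration conf)
                throws IOException {
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in1 = fs.open(file); // thread 1: from the start
            FSDataInputStream in2 = fs.open(file); // thread 2: from the middle
            in2.seek(midpoint);
            return new LineReader[] {
                new LineReader(in1, conf), new LineReader(in2, conf)
            };
        }
    }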

>> In the org.apache.hadoop.mapreduce.Mapper.run() method, I modified it to read simultaneously using getCurrentKey() and getCurrentKey2() from Thread 1 and Thread 2 (both threads running at the same time):
>>     Thread 1:
>>       while (context.nextKeyValue()) {
>>           map(context.getCurrentKey(), context.getCurrentValue(), context);
>>       }
>>
>>     Thread 2:
>>       while (context.nextKeyValue2()) {
>>           map(context.getCurrentKey2(), context.getCurrentValue2(), context);
>>           // System.out.println("two iter");
>>       }

>> However, this causes me to see the "All datanodes are bad" exception. I am fairly sure I closed the second file. I have attached a copy of my LineRecordReader file to show what I changed in trying to enable two simultaneous reads of the input split.
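(One likely trouble spot, offered as a guess rather than a diagnosis: the shared Context is not thread-safe. MultithreadedMapper deals with this by deep-copying each key/value under a lock before mapping and by serializing writes through a synchronized wrapper. A hand-rolled two-thread run() needs the same guards, roughly as below; this is a sketch of the general technique, not the attached code.)

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.util.ReflectionUtils;

    // Sketch: serialize every touch of the shared context, the way
    // MultithreadedMapper's internal wrappers do, and deep-copy the
    // key/value because Hadoop reuses the same objects on each call.
    public class GuardedRunLoop<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
            extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

        @Override
        public void run(Context context) throws IOException, InterruptedException {
            Configuration conf = context.getConfiguration();
            while (true) {
                KEYIN key;
                VALUEIN value;
                synchronized (context) { // guard the shared read position
                    if (!context.nextKeyValue()) {
                        break;
                    }
                    key = ReflectionUtils.copy(conf, context.getCurrentKey(), null);
                    value = ReflectionUtils.copy(conf, context.getCurrentValue(), null);
                }
                // map() runs outside the lock, but any context.write() it
                // performs must also synchronize on the shared context.
                map(key, value, context);
            }
        }
    }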

>> I have modified other files (org.apache.hadoop.mapreduce.RecordReader.java, mapred.MapTask.java, ...) just to enable Mapper.run() to call LineRecordReader.getCurrentKey2() and the other access methods for the second file.


>> I would really appreciate it if anyone could give me a bit of advice, or just point me in a direction as to where the problem might be.

>> Thanks

>> Yunming

>> <LineRecordReader.java>

