hadoop-hdfs-user mailing list archives

From Felix Chern <idry...@gmail.com>
Subject Re: All datanodes are bad IOException when trying to implement multithreading serialization
Date Mon, 30 Sep 2013 01:58:23 GMT
The number of mappers is usually the same as the number of files you feed to the job.
To reduce it, you can use CombineFileInputFormat.
I recently wrote an article about it; take a look and see if it fits your needs:

http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/
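
In outline, the approach looks like the sketch below. Treat it as a minimal skeleton rather than the article's exact code: CombinedTextInputFormat and WrappedLineReader are placeholder names, and details vary a little between Hadoop versions.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    // Packs many small files into each split, so one mapper handles many files.
    public class CombinedTextInputFormat
        extends CombineFileInputFormat<LongWritable, Text> {

      @Override
      public RecordReader<LongWritable, Text> createRecordReader(
          InputSplit split, TaskAttemptContext context) throws IOException {
        // CombineFileRecordReader instantiates the wrapper below once per
        // file packed into the combined split.
        return new CombineFileRecordReader<LongWritable, Text>(
            (CombineFileSplit) split, context, WrappedLineReader.class);
      }

      // Adapter: CombineFileRecordReader requires a (split, context, index)
      // constructor, which the stock LineRecordReader does not have.
      public static class WrappedLineReader
          extends RecordReader<LongWritable, Text> {
        private final LineRecordReader delegate = new LineRecordReader();
        private final int index;  // which file of the combined split we read

        public WrappedLineReader(CombineFileSplit split,
            TaskAttemptContext context, Integer index) {
          this.index = index;
        }

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
          CombineFileSplit cs = (CombineFileSplit) split;
          // Re-expose the index-th packed file as an ordinary FileSplit.
          delegate.initialize(new FileSplit(cs.getPath(index),
              cs.getOffset(index), cs.getLength(index), cs.getLocations()),
              context);
        }

        @Override public boolean nextKeyValue() throws IOException {
          return delegate.nextKeyValue();
        }
        @Override public LongWritable getCurrentKey() {
          return delegate.getCurrentKey();
        }
        @Override public Text getCurrentValue() {
          return delegate.getCurrentValue();
        }
        @Override public float getProgress() throws IOException {
          return delegate.getProgress();
        }
        @Override public void close() throws IOException {
          delegate.close();
        }
      }
    }

Point the job at it with job.setInputFormatClass(CombinedTextInputFormat.class), and cap the combined split size (mapreduce.input.fileinputformat.split.maxsize, or mapred.max.split.size on older releases) to control how much data lands in one mapper.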

Felix

On Sep 29, 2013, at 6:45 PM, yunming zhang <zhangyunming1990@gmail.com> wrote:

> I am actually trying to reduce the number of mappers, because my application takes
> a lot of memory (on the order of 1-2 GB of RAM per mapper). I want to be able to use a
> few mappers but still maintain good CPU utilization through multithreading within a
> single mapper. MultithreadedMapper doesn't work because it duplicates the in-memory
> data structures.
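
For context on that point: the duplication is structural, because MultithreadedMapper creates one instance of the user's mapper class per thread inside the task. A rough setup sketch, just to show where the N copies come from (MyMapper and conf are stand-ins for the real names):

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

    Job job = Job.getInstance(conf);               // Hadoop 2 style; new Job(conf) on 1.x
    job.setMapperClass(MultithreadedMapper.class); // the multithreaded shell
    MultithreadedMapper.setMapperClass(job, MyMapper.class);
    // Four threads => four MyMapper instances => any 1-2 GB structure held
    // per mapper instance is allocated four times in the same JVM.
    MultithreadedMapper.setNumberOfThreads(job, 4);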
> 
> Thanks
> 
> Yunming
> 
> 
> On Sun, Sep 29, 2013 at 6:59 PM, Sonal Goyal <sonalgoyal4@gmail.com> wrote:
> Wouldn't you rather just change your split size so that you can have more mappers
> work on your input? What else are you doing in the mappers?
> Sent from my iPad
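
The split-size knob Sonal means, in the new-API form (the sizes are only illustrative; the old mapred API uses the mapred.max.split.size / mapred.min.split.size properties instead):

    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    // Smaller maximum split size => more splits => more mappers per file.
    FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB
    // Or, for the opposite effect (fewer, bigger mappers), raise the minimum:
    FileInputFormat.setMinInputSplitSize(job, 512L * 1024 * 1024);  // 512 MB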
> 
> On Sep 30, 2013, at 2:22 AM, yunming zhang <zhangyunming1990@gmail.com> wrote:
> 
>> Hi, 
>> 
>> I was playing with the Hadoop code, trying to have a single Mapper read an input
>> split using multiple threads. I am getting an "All datanodes are bad" IOException, and
>> I am not sure what the issue is.
>> 
>> The reason for this work is that I suspect my computation was slow because it took too
>> long to create the Text() objects from the input split using a single thread. I tried to
>> modify the LineRecordReader (since I am mostly using TextInputFormat) to provide additional
>> methods that retrieve lines from the input split: getCurrentKey2(), getCurrentValue2(),
>> nextKeyValue2(). I created a second FSDataInputStream and a second LineReader object for
>> getCurrentKey2() and getCurrentValue2() to read from. Essentially I am trying to open the
>> input split twice at different start points (one at the very beginning, the other in the
>> middle of the split) and read it in parallel using two threads.
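
Roughly, the second stream described here would be set up like this. This is a reconstruction, not the attached code: the variable names are invented, and split/start/end/conf are assumed to be the usual LineRecordReader fields.

    // Uses org.apache.hadoop.fs.FileSystem / FSDataInputStream and
    // org.apache.hadoop.util.LineReader.
    // Open the same file a second time and position it mid-split, so a
    // second thread can consume the back half of the split.
    FileSystem fs = split.getPath().getFileSystem(conf);
    FSDataInputStream in2 = fs.open(split.getPath());
    long mid = start + (end - start) / 2;
    in2.seek(mid);
    LineReader reader2 = new LineReader(in2, conf);
    // Same trick LineRecordReader uses when start != 0: discard the
    // (probably partial) line at the seek point so reading begins on a
    // line boundary.
    long pos2 = mid;
    pos2 += reader2.readLine(new Text(), 0,
        (int) Math.min(Integer.MAX_VALUE, end - pos2));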
>> 
>> I modified the org.apache.hadoop.mapreduce.Mapper.run() method so that Thread 1 and
>> Thread 2 read simultaneously through getCurrentKey()/getCurrentValue() and
>> getCurrentKey2()/getCurrentValue2() (both threads running at the same time):
>>       Thread 1:
>>         while (context.nextKeyValue()) {
>>             map(context.getCurrentKey(), context.getCurrentValue(), context);
>>         }
>> 
>>       Thread 2:
>>         while (context.nextKeyValue2()) {
>>             map(context.getCurrentKey2(), context.getCurrentValue2(), context);
>>         }
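
Spelled out, the scaffolding implied around those two loops would look something like the sketch below (context must be final for the anonymous classes, and the checked exceptions have to be caught inside the Runnables):

    Thread t1 = new Thread(new Runnable() {
      public void run() {
        try {
          while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
          }
        } catch (Exception e) {
          throw new RuntimeException(e);  // surface IOException/InterruptedException
        }
      }
    });
    Thread t2 = new Thread(new Runnable() {
      public void run() {
        try {
          while (context.nextKeyValue2()) {
            map(context.getCurrentKey2(), context.getCurrentValue2(), context);
          }
        } catch (Exception e) {
          throw new RuntimeException(e);
        }
      }
    });
    t1.start(); t2.start();
    t1.join(); t2.join();  // run() must not return until both halves finish
    // Note: both threads write through the same context, and the stock
    // MapContext is not synchronized (MultithreadedMapper wraps the context
    // for exactly this reason).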
>> 
>> However, this gives me the "All datanodes are bad" exception. I believe I made sure
>> that I closed the second file. I have attached a copy of my LineRecordReader to show
>> what I changed to enable two simultaneous reads of the input split.
>> 
>> I have modified other files (org.apache.hadoop.mapreduce.RecordReader.java,
>> mapred.MapTask.java, ...) just to enable Mapper.run() to call
>> LineRecordReader.getCurrentKey2() and the other access methods for the second file.
>> 
>> 
>> I would really appreciate it if anyone could give me a bit of advice, or just point
>> me in a direction as to where the problem might be.
>> 
>> Thanks
>> 
>> Yunming 
>> 
>> <LineRecordReader.java>
> 

