hbase-user mailing list archives

From Lars George <l...@worldlingo.com>
Subject Re: how to handle large volume reduce input value in mapreduce program?
Date Tue, 29 Sep 2009 10:46:08 GMT
Hi HB,

I'd say do a Put for each objid, then leave it to the flush of the write 
buffer to have them written out. So change the code to:

  byte[] family = Bytes.toBytes("oid");
  for (Writable objid : values) {
    // one Put per object id, all on the same row; the object id is used
    // as both qualifier and value
    Put put = new Put(((ImmutableBytesWritable) key).get());
    put.add(family, Bytes.toBytes(((Text) objid).toString()),
      Bytes.toBytes(((Text) objid).toString()));
    context.write((ImmutableBytesWritable) key, put);
    // report progress so long-running reduce calls do not time out
    context.progress();
  }
}
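As a side note, the Puts from the reducer go through the client-side write 
buffer (if I remember right, TableOutputFormat turns off auto-flush on its 
HTable), so they get batched and sent in chunks. If the flushes turn out to 
be too frequent you could bump the buffer size in the job configuration; a 
rough sketch only, assuming the "hbase.client.write.buffer" property (size in 
bytes) and that 8 MB suits your cluster:

  // driver/job setup - sketch only, the value is just an example
  HBaseConfiguration conf = new HBaseConfiguration();
  // larger client-side write buffer => more Puts batched per flush
  conf.setLong("hbase.client.write.buffer", 8 * 1024 * 1024);
  Job job = new Job(conf, "objectid-index");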


The first version will not work at all, as your Set obviously becomes too 
large. So you will have to go the route you chose in attempt #2, i.e. save 
the object ids as separate columns with qualifiers.
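Just to illustrate what you end up with: each object id becomes one qualifier 
in the "oid" family of the keyword's row. A rough sketch of reading them back 
(the table name "mytable" and the rowKey variable are placeholders for your 
setup):

  HTable table = new HTable(new HBaseConfiguration(), "mytable");
  Get get = new Get(rowKey);  // rowKey = the keyword bytes
  get.addFamily(Bytes.toBytes("oid"));
  Result result = table.get(get);
  // every qualifier in the "oid" family is one object id
  for (byte[] qualifier : result.getFamilyMap(Bytes.toBytes("oid")).keySet()) {
    String objectId = Bytes.toString(qualifier);
    // ... use objectId ...
  }

With several million qualifiers in one row you would probably not want to 
fetch the whole row in a single Get, but that is a separate topic.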

BTW, I threw in a context.progress() for good measure, as otherwise your 
tasks may time out. In earlier versions of Hadoop the "write()" call may also 
have triggered a status update, but with Hadoop 0.20 you must call 
"progress()" yourself to report a sign of life.
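And if the job still gets killed despite the progress() calls, you can raise 
the task timeout itself; a sketch, assuming the Hadoop 0.20 property name 
"mapred.task.timeout" (milliseconds, default 600000, 0 disables it):

  // driver/job setup - sketch only
  Configuration conf = new Configuration();
  // give each task attempt 20 minutes between progress reports
  conf.setLong("mapred.task.timeout", 20 * 60 * 1000);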

HTH,
Lars


Yin_Hongbin@emc.com wrote:
> Hi, all
>
> I am a newbie to Hadoop and just began playing with it in recent days. I am
> trying to write a MapReduce program that parses a large dataset (about 20G),
> extracts the object ids, and stores them in an HBase table. The issue is
> that one keyword is associated with several million object ids. Here is my
> first reduce program.
>
> <program1>
>
> public class MyReducer extends TableReducer<Writable, Writable, Writable> {
>
>     @Override
>     public void reduce(Writable key, Iterable<Writable> objectids,
>             Context context) throws IOException, InterruptedException {
>         Set<String> objectIDs = new HashSet<String>();
>         Put put = new Put(((ImmutableBytesWritable) key).get());
>         byte[] family = Bytes.toBytes("oid");
>         for (Writable objid : objectids) {
>             objectIDs.add(((Text) objid).toString());
>         }
>         put.add(family, null, Bytes.toBytes(objectIDs.toString()));
>         context.write((ImmutableBytesWritable) key, put);
>     }
> }
>
> In this program, the reduce failed because of a Java heap "out of memory"
> error. A rough count shows that the several million object ids consume
> about 900M of heap if they are all loaded into a Set at once. So I
> implemented the reduce in another way:
>
> <program2>
>
> public class IndexReducer extends TableReducer<Writable, Writable, Writable> {
>
>     @Override
>     public void reduce(Writable key, Iterable<Writable> values,
>             Context context) throws IOException, InterruptedException {
>         Put put = new Put(((ImmutableBytesWritable) key).get());
>         byte[] family = Bytes.toBytes("oid");
>         for (Writable objid : values) {
>             put.add(family, Bytes.toBytes(((Text) objid).toString()),
>                     Bytes.toBytes(((Text) objid).toString()));
>         }
>         context.write((ImmutableBytesWritable) key, put);
>     }
> }
>
> This time the reduce still failed, now with a "reduce task timed out" error.
> I doubled the reduce time-out, and then "out of memory" happened again; the
> error log shows that put.add() throws the "out of memory" error.
>
> By the way, there are 18 datanodes in total in the Hadoop/HBase environment,
> and the number of reduce tasks is 50.
>
> So, my question is: how do I handle a large volume of reduce input values
> in a MapReduce program? Increase memory? I don't think that is a reasonable
> option. Increase the number of reduce tasks?.........
>
> Sigh, I have no clue at all. What's your suggestion?
>
> Best Regards, 
> HB
