hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject RE: Best way to write to multiple tables in one map-only job
Date Wed, 05 Oct 2011 01:49:50 GMT

One other option:
Have your map() method emit a null writable (NullWritable), so the job itself writes nothing,
and handle the put() to the table(s) yourself within the map() method.
Since you then hold the HTable instances yourself, you can also set autoflush within your job.
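
To illustrate why turning autoflush off matters when the mapper handles its own puts, here is a minimal sketch in plain Java. It is a toy model, not HBase code: `BufferedTable`, `run`, and the byte counts are hypothetical names and numbers chosen for illustration. In the HTable client API discussed in this thread, the corresponding calls would be `setAutoFlush(false)`, `setWriteBufferSize(long)`, and a final `flushCommits()` (for example in the mapper's `cleanup()`).

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: no HBase classes are used, and the names below are hypothetical.
class WriteBufferSketch {

    /** Stand-in for one table handle, mimicking a client-side write buffer. */
    static class BufferedTable {
        private final boolean autoFlush;
        private final long writeBufferSize;            // threshold in bytes
        private final List<byte[]> pending = new ArrayList<>();
        private long pendingBytes = 0;
        int roundTrips = 0;                            // simulated batches sent to the server
        int rowsSent = 0;

        BufferedTable(boolean autoFlush, long writeBufferSize) {
            this.autoFlush = autoFlush;
            this.writeBufferSize = writeBufferSize;
        }

        /** Buffers one row; sends immediately if autoFlush is on or the buffer is full. */
        void put(byte[] row) {
            pending.add(row);
            pendingBytes += row.length;
            if (autoFlush || pendingBytes >= writeBufferSize) {
                flushCommits();
            }
        }

        /** Sends everything buffered so far as one simulated round trip. */
        void flushCommits() {
            if (pending.isEmpty()) {
                return;
            }
            roundTrips++;
            rowsSent += pending.size();
            pending.clear();
            pendingBytes = 0;
        }
    }

    /** Writes `rows` rows of `rowBytes` bytes each and returns {roundTrips, rowsSent}. */
    static int[] run(boolean autoFlush, long bufferBytes, int rows, int rowBytes) {
        BufferedTable table = new BufferedTable(autoFlush, bufferBytes);
        for (int i = 0; i < rows; i++) {
            table.put(new byte[rowBytes]);             // would happen inside map()
        }
        table.flushCommits();                          // final flush, e.g. in cleanup()
        return new int[] {table.roundTrips, table.rowsSent};
    }

    public static void main(String[] args) {
        int[] on = run(true, 1024, 1000, 100);
        int[] off = run(false, 1024, 1000, 100);
        System.out.println("autoFlush on:  " + on[0] + " round trips for " + on[1] + " rows");
        System.out.println("autoFlush off: " + off[0] + " round trips for " + off[1] + " rows");
    }
}
```

In this model the same rows reach the server either way; only the number of round trips changes. The one thing not to forget with autoflush off is the final flush, otherwise whatever is still buffered when the task ends is lost.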


> Date: Tue, 4 Oct 2011 16:20:25 +0200
> From: christopher.dorner@gmail.com
> To: user@hbase.apache.org
> Subject: Re: Best way to write to multiple tables in one map-only job
> 
> Thank you for the hint.
> 
> What about autoflush then? Is that also something I can set using the 
> config on job setup? Or does it only work with an HTable instance? 
> Somehow I can't really find the right information :)
> 
> Regards,
> Christopher
> 
> On 03.10.2011 19:20, Jean-Daniel Cryans wrote:
> > Options a) and b) are the same since MultiTableOutputFormat internally
> > uses multiple HTables. See for yourself:
> >
> > https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/mapreduce/MultiTableOutputFormat.java
> >
> > Also, you can set the write buffer by setting
> > hbase.client.write.buffer on the configuration that you pass in during
> > job setup.
> >
> > Using HTablePool in a single threaded application doesn't offer more
> > than just storage for your HTables.
> >
> > Hope that helps,
> >
> > J-D
> >
> > On Sat, Oct 1, 2011 at 4:05 AM, Christopher Dorner
> > <christopher.dorner@gmail.com>  wrote:
> >> Hello,
> >>
> >> I am building an RDF store using HBase and experimenting with different index
> >> tables and schema designs.
> >>
> >> For the input, I have a file where each line is an RDF triple in N3 format.
> >>
> >> I need to write to multiple tables since I need to build several index
> >> tables. To reduce IO and avoid reading the file several times, I
> >> want to do that in one map-only job. Later the file will contain a few
> >> million triples.
> >>
> >> I am experimenting in pseudo-distributed mode so far but will be able to run
> >> it on our cluster soon.
> >> Storing the data in the tables does not need to be speed-optimized at any
> >> cost, but I want to do it as simply and quickly as possible.
> >>
> >>
> >> What is the best way to write to more than one table in one map task?
> >>
> >> a)
> >> I can use "MultiTableOutputFormat.class" and write in map() using:
> >> Put put = new Put(key);
> >> put.add(kv);
> >> context.write(tableName, put);
> >>
> >> Can I write to, e.g., 6 tables in this way by creating a new Put for each
> >> table?
> >>
> >> But how can I turn off autoFlush and set writeBufferSize in this case?
> >> I think autoflush is not a good idea when putting lots of
> >> values.
> >>
> >>
> >> b)
> >> I can use an instance of HTable in the Mapper class. Then I can set
> >> autoFlush and writeBufferSize and write to the table using:
> >> HTable table = new HTable(config, tableName);
> >> table.put(put);
> >>
> >> But it is recommended to use only one instance of HTable, so I would need to
> >> do
> >> table = new HTable(config, tableName);
> >> for each table I want to write to. Is that still fine with 6 tables?
> >> I stumbled upon HTablePool. Is it meant for these scenarios?
> >>
> >>
> >> Thank you and regards,
> >> Christopher
> >>
> 
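
The fan-out asked about in option a) of the original question, one Put per index table for each input triple, can be sketched without any HBase classes. This is only an illustration under stated assumptions: the six `idx_*` table names, the `|` key separator, and the idea that each index is one ordering of subject, predicate, and object are all hypothetical and do not come from the thread, and the parser only handles simple lines of three whitespace-free terms followed by a dot. In the real job, each returned entry would become one Put written with `context.write(tableName, put)`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: table names, key layout, and parsing rules are assumptions.
class TripleFanOutSketch {

    // One hypothetical index table per ordering of subject (s), predicate (p), object (o).
    static final String[] ORDERS = {"spo", "sop", "pso", "pos", "osp", "ops"};

    /** Splits a line like "<s> <p> <o> ." into three terms; terms must not contain spaces. */
    static String[] parse(String line) {
        String trimmed = line.trim();
        if (trimmed.endsWith(".")) {
            trimmed = trimmed.substring(0, trimmed.length() - 1).trim();
        }
        String[] parts = trimmed.split("\\s+");
        if (parts.length != 3) {
            throw new IllegalArgumentException("not a simple triple: " + line);
        }
        return parts;
    }

    /** Returns table name -> row key, one entry per index table, for a single input line. */
    static Map<String, String> rowKeys(String line) {
        String[] t = parse(line);
        Map<String, String> out = new LinkedHashMap<>();
        for (String order : ORDERS) {
            StringBuilder key = new StringBuilder();
            for (char c : order.toCharArray()) {
                if (key.length() > 0) {
                    key.append('|');
                }
                key.append(c == 's' ? t[0] : c == 'p' ? t[1] : t[2]);
            }
            out.put("idx_" + order, key.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // Each printed pair corresponds to one Put destined for one table.
        rowKeys("<a> <b> <c> .").forEach(
            (table, key) -> System.out.println(table + " -> " + key));
    }
}
```

Because every input line is read once and yields all of its index entries in the same map() call, the input file does not have to be scanned once per table.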
 		 	   		  