hbase-user mailing list archives

From Christopher Dorner <christopher.dor...@gmail.com>
Subject Best way to write to multiple tables in one map-only job
Date Sat, 01 Oct 2011 11:05:38 GMT
Hello,

I am building an RDF store using HBase and experimenting with different 
index tables and schema designs.

For the input, I have a file where each line is an RDF triple in N3 format.
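
For example, a line looks roughly like this (URIs made up):

<http://example.org/alice> <http://xmlns.com/foaf/0.1/knows> <http://example.org/bob> .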

I need to write to multiple tables, since I have to build several index 
tables. To reduce I/O and avoid reading the file several times, I want 
to do that in one map-only job. Later the file will contain a few 
million triples.

So far I am experimenting in pseudo-distributed mode, but I will be able 
to run it on our cluster soon.
Storing the data in the tables does not need to be speed-optimized at 
any cost; I just want to do it as simply and quickly as possible.


What is the best way to write to more than one table in a single map task?

a)
I can either use "MultiTableOutputFormat.class" and write in map() using:
Put put = new Put(key);
put.add(kv);
context.write(tableName, put);

Can I write to, e.g., 6 tables this way by creating a new Put for each 
table?

But how can I turn off autoFlush and set the writeBufferSize in this 
case? I think autoFlush is not a good idea when putting lots of values.
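
To make option a) concrete, here is a rough sketch of what I have in 
mind (table names, column family, and row key layout are made up, the N3 
"parsing" is deliberately naive, and I am on the 0.9x-era API):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TripleIndexMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

  // two example index tables; in reality there would be six
  private static final ImmutableBytesWritable SPO =
      new ImmutableBytesWritable(Bytes.toBytes("spo_index"));
  private static final ImmutableBytesWritable POS =
      new ImmutableBytesWritable(Bytes.toBytes("pos_index"));
  private static final byte[] FAMILY = Bytes.toBytes("t");

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // naive split into subject, predicate, object
    String[] t = line.toString().split("\\s+", 3);
    if (t.length < 3) return;

    // one Put per index table, each with its own row key layout
    Put spoPut = new Put(Bytes.toBytes(t[0] + t[1] + t[2]));
    spoPut.add(FAMILY, Bytes.toBytes("v"), Bytes.toBytes(""));
    context.write(SPO, spoPut);

    Put posPut = new Put(Bytes.toBytes(t[1] + t[2] + t[0]));
    posPut.add(FAMILY, Bytes.toBytes("v"), Bytes.toBytes(""));
    context.write(POS, posPut);
  }
}

and the job setup would be roughly:

Configuration conf = HBaseConfiguration.create();
Job job = new Job(conf, "rdf-triple-indexer");
job.setJarByClass(TripleIndexMapper.class);
job.setMapperClass(TripleIndexMapper.class);
job.setOutputFormatClass(MultiTableOutputFormat.class);
job.setOutputKeyClass(ImmutableBytesWritable.class);
job.setOutputValueClass(Put.class);
job.setNumReduceTasks(0); // map-only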


b)
I can use an instance of HTable in the Mapper class. Then I can set 
autoFlush and the writeBufferSize and write to the table using:
HTable table = new HTable(config, tableName);
table.setAutoFlush(false);
table.setWriteBufferSize(8 * 1024 * 1024); // e.g. 8 MB
table.put(put);

But it is recommended to reuse a single HTable instance per table, so I 
would need to do
table = new HTable(config, tableName);
once for each table I want to write to. Is that still fine with 6 tables?
I also stumbled upon HTablePool. Is it meant for scenarios like this?
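
For illustration, option b) would look roughly like this (again with 
made-up table names and row keys; cleanup() flushes whatever is still 
buffered):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HTableTripleMapper
    extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

  private HTable spoTable;
  private HTable posTable; // ... plus four more in the real job

  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = HBaseConfiguration.create(context.getConfiguration());
    spoTable = new HTable(conf, "spo_index");
    spoTable.setAutoFlush(false);
    spoTable.setWriteBufferSize(8 * 1024 * 1024); // 8 MB, picked arbitrarily
    posTable = new HTable(conf, "pos_index");
    posTable.setAutoFlush(false);
    posTable.setWriteBufferSize(8 * 1024 * 1024);
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException {
    // same naive triple split as in the sketch above
    String[] t = line.toString().split("\\s+", 3);
    if (t.length < 3) return;

    Put spoPut = new Put(Bytes.toBytes(t[0] + t[1] + t[2]));
    spoPut.add(Bytes.toBytes("t"), Bytes.toBytes("v"), Bytes.toBytes(""));
    spoTable.put(spoPut); // buffered client-side until the write buffer fills

    Put posPut = new Put(Bytes.toBytes(t[1] + t[2] + t[0]));
    posPut.add(Bytes.toBytes("t"), Bytes.toBytes("v"), Bytes.toBytes(""));
    posTable.put(posPut);
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    // flush the remaining buffered puts and release the connections
    spoTable.flushCommits();
    spoTable.close();
    posTable.flushCommits();
    posTable.close();
  }
}

The job would then use NullOutputFormat, since the mapper talks to 
HBase directly.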


Thank you and regards,
Christopher
