hbase-user mailing list archives

From: Joep Rottinghuis <jrottingh...@gmail.com>
Subject: Re: hbase puts in map tasks don't seem to run in parallel
Date: Mon, 04 Jun 2012 00:26:28 GMT
This is probably more of a user@hbase.apache.org topic than common-user.

To answer your question: you will want to pre-split the table, as described in http://hbase.apache.org/book/perf.writing.html
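
For example, a minimal sketch of creating a pre-split table with the Java client of that era; the table name, column family, and row-key format below are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor desc = new HTableDescriptor("mytable"); // hypothetical table
        desc.addFamily(new HColumnDescriptor("f"));              // hypothetical family
        // Nine split points yield ten regions, so a ten-node cluster can take
        // writes on every region server from the first put, instead of funneling
        // everything through the single region a brand-new table starts with.
        byte[][] splits = new byte[9][];
        for (int i = 0; i < 9; i++) {
            // Assumes zero-padded numeric row keys spread evenly over 18M rows.
            splits[i] = Bytes.toBytes(String.format("%08d", (i + 1) * 1800000));
        }
        admin.createTable(desc, splits);
    }
}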

Cheers,

Joep

Sent from my iPhone

On Jun 3, 2012, at 3:45 PM, Jonathan Bishop <jbishop.rwc@gmail.com> wrote:

> Thanks Joep,
> 
> My table is empty when I start and will consist of 18M rows when completed.
> 
> So I guess I need to understand how to pick row keys such that the regions
> will be on that mapper's node. Any advice would be appreciated.
> 
> BTW, I do notice that the region servers on other nodes become busy, but
> only after a large number of rows have been processed - say 10%. It would
> be better if I could deliberately control which regions/region servers were
> going to be used, though, to prevent the network traffic of sending rows to
> region servers on other nodes.
> 
> Jon
> 
> On Sun, Jun 3, 2012 at 12:02 PM, Joep Rottinghuis <jrottinghuis@gmail.com> wrote:
> 
>> How large is your table?
>> If it is newly created and still almost empty then it will probably
>> consist of only one region, which will be hosted on one region server.
>> 
>> Even as the table grows and gets split into multiple regions, you will
>> have to split your mappers in such a way that each writes to the key ranges
>> corresponding to the regions hosted locally on the corresponding region
>> server.
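
A rough sketch of inspecting those key ranges with the client API of the time (HTable#getStartKeys and HTable#getRegionLocation are the real methods; the table name, and the idea of comparing each region's host against the mapper's own hostname, are assumptions for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class RegionHosts {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable"); // hypothetical table name
        // One start key per region; a mapper could favor key ranges whose
        // region is hosted on the same node the task is running on.
        for (byte[] startKey : table.getStartKeys()) {
            HRegionLocation loc = table.getRegionLocation(startKey);
            System.out.println(Bytes.toStringBinary(startKey) + " -> " + loc.getHostname());
        }
        table.close();
    }
}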
>> 
>> Cheers,
>> 
>> Joep
>> 
>> Sent from my iPhone
>> 
>> On Jun 2, 2012, at 6:25 PM, Jonathan Bishop <jbishop.rwc@gmail.com> wrote:
>> 
>>> Hi,
>>> 
>>> I am new to Hadoop and HBase, but have spent the last few weeks learning
>>> as much as I can...
>>> 
>>> I am attempting to populate an HBase table during a Hadoop job by simply
>>> doing puts to the table from each map task. I am hoping that each map task
>>> will use the region server on its node so that all 10 of my nodes are
>>> putting values into the table at the same time.
>>> 
>>> Here is my map class below. The Node class is a simple data structure which
>>> knows how to parse a line of input and create a Put for HBase.
>>> 
>>> When I run this I see that only one region server is active for the table I
>>> am creating. I know that my input file is split among all 10 of my data
>>> nodes, and I know that if I do not do puts to the HBase table, everything
>>> runs in parallel on all 10 machines. It is only when I start doing HBase
>>> puts that the run times go way up.
>>> 
>>> Thanks,
>>> 
>>> Jon
>>> 
>>> import java.io.IOException;
>>> import java.text.ParseException; // assuming Node.parseNode throws this one
>>>
>>> import org.apache.hadoop.hbase.client.HTable;
>>> import org.apache.hadoop.hbase.client.HTableInterface;
>>> import org.apache.hadoop.hbase.client.Put;
>>> import org.apache.hadoop.io.IntWritable;
>>> import org.apache.hadoop.io.Text;
>>> import org.apache.hadoop.mapreduce.Mapper;
>>>
>>> public static class MapClass extends Mapper<Object, Text, IntWritable, Node> {
>>>
>>>     HTableInterface table = null;
>>>
>>>     @Override
>>>     protected void setup(Context context) throws IOException, InterruptedException {
>>>         String tableName = context.getConfiguration().get(TABLE);
>>>         // Pass the job configuration instead of using the deprecated
>>>         // HTable(String) constructor.
>>>         table = new HTable(context.getConfiguration(), tableName);
>>>     }
>>>
>>>     @Override
>>>     public void map(Object key, Text value, Context context)
>>>             throws IOException, InterruptedException {
>>>         Node node;
>>>         try {
>>>             node = Node.parseNode(value.toString());
>>>         } catch (ParseException e) {
>>>             throw new IOException(e); // keep the cause rather than a bare IOException
>>>         }
>>>         Put put = node.getPut();
>>>         table.put(put); // with auto-flush on (the default), each put is its own RPC
>>>     }
>>>
>>>     @Override
>>>     protected void cleanup(Context context) throws IOException, InterruptedException {
>>>         table.close(); // flushes any buffered writes
>>>     }
>>> }
>> 

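One knob the thread does not mention, but which bears directly on the per-put cost in the mapper above: by default the 0.90-0.94-era HTable auto-flushes, sending one RPC per Put. A hedged sketch of buffering writes instead (same setup() as in the quoted code; whether this alone fixes the slowdown here is an assumption):

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    String tableName = context.getConfiguration().get(TABLE);
    HTable htable = new HTable(context.getConfiguration(), tableName);
    htable.setAutoFlush(false);                  // buffer puts client-side
    htable.setWriteBufferSize(4L * 1024 * 1024); // flush roughly every 4 MB
    table = htable;
}

The table.close() in cleanup() then flushes whatever remains in the buffer.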