hbase-user mailing list archives

From Sean McNamara <Sean.McNam...@Webtrends.com>
Subject Re: Parallel reading advice
Date Wed, 28 Nov 2012 17:37:29 GMT
Hi J-D

Really good questions.  I will check for a misconfiguration.

> I'm not sure what you're talking about here. Which master?

I am using Spark (http://spark-project.org/), so the master I am referring
to is really the Spark driver.  Spark can read from a Hadoop InputFormat and
populate itself that way, but then you don't control which slave/worker the
data lands on.  My goal is to use Spark to reach in for slices of data that
are in HBase and perform set operations on them in parallel.  Being able to
load a partition onto the right node is important, so that I don't have to
reshuffle the data just to get it onto the node that handles a particular
data partition.
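
To make the partition-to-node idea concrete, here is a minimal sketch in plain Python. It is not Spark or HBase code, and it assumes a one-byte salt prefix with contiguous bucket values starting at 0. The names `salt_ranges` and `assign_ranges` and the worker labels are hypothetical. It computes one [start, stop) row-key range per prefix and pins each range to a fixed worker, so the same prefix always lands on the same node.

```python
def salt_ranges(num_buckets):
    """Return one (start_row, stop_row) pair per one-byte salt prefix.

    The stop row is exclusive. An empty stop row follows the HBase
    convention of "scan to the end of the table".
    """
    if not 1 <= num_buckets <= 256:
        raise ValueError("a one-byte salt allows 1 to 256 buckets")
    ranges = []
    for salt in range(num_buckets):
        start = bytes([salt])
        # The next prefix value bounds this bucket; 0xff has no successor.
        stop = bytes([salt + 1]) if salt < 255 else b""
        ranges.append((start, stop))
    return ranges


def assign_ranges(ranges, workers):
    """Pin each range to a worker round-robin (hypothetical placement rule)."""
    if not workers:
        raise ValueError("need at least one worker")
    return {rng: workers[i % len(workers)] for i, rng in enumerate(ranges)}


if __name__ == "__main__":
    workers = ["worker-%d" % i for i in range(40)]  # assumed worker labels
    for rng, worker in assign_ranges(salt_ranges(40), workers).items():
        print(worker, rng)
```

Because the assignment depends only on the prefix, the driver can compute it once and hand each worker its range, rather than each worker working it out on every request.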

> BTW why can't you keep the connections around?

The Spark API is totally functional; AFAIK it's not possible to set up a
connection and keep it around (I am asking on that mailing list to be

> Since this is something done within the HBase client, doing it
> externally sounds terribly hacky

Yup.  The reason I am entertaining this route is that using an InputFormat
with Spark I was able to load in way more data, and it was all sub-second.
Since moving to having the Spark slaves pull in their own data (not
using the InputFormat) it seems slower for some reason.  I figured it
might be because with an InputFormat the slaves were told what to load,
vs. each of the 40 slaves having to do more work to find what to load.
Perhaps my assumption is wrong?  Thoughts?
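
One way to test that assumption is to time each phase of a slave's work separately. The sketch below is plain Python with hypothetical stand-ins: `connect`, `open_scanner` and `read_rows` are placeholder callables, not HBase client calls. It records how long connecting, opening the scanner, and reading rows each take, which would show whether the fixed cost sits in connection setup or in locating the data.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(phase, timings):
    """Record the wall-clock duration of one phase into the timings dict."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[phase] = time.perf_counter() - start


def run_scan(connect, open_scanner, read_rows):
    """Run one worker's scan, timing each phase separately.

    All three arguments are hypothetical callables standing in for the
    real client work done by a slave.
    """
    timings = {}
    with timed("connect", timings):
        conn = connect()
    with timed("open_scanner", timings):
        scanner = open_scanner(conn)
    with timed("read_rows", timings):
        rows = read_rows(scanner)
    return rows, timings


if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs without a cluster.
    rows, timings = run_scan(
        lambda: "conn",
        lambda conn: iter([b"row1", b"row2"]),
        list,
    )
    for phase, seconds in timings.items():
        print("%s: %.6f s" % (phase, seconds))
```

If the connect phase dominates regardless of how much data is read, that points at setup overhead rather than at the work of finding what to load.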

I really appreciate your insights.  Thanks!

On 11/28/12 3:10 AM, "Jean-Daniel Cryans" <jdcryans@apache.org> wrote:

>On Wed, Nov 28, 2012 at 7:28 AM, Sean McNamara
>> I have a table whose keys are prefixed with a byte to help distribute
>> keys so scans don't hotspot.
>> I also have a bunch of slave processes that work to scan the prefix
>> partitions in parallel.  Currently each slave sets up their own hbase
>> connection, scanner, etc..  Most of the slave processes finish their
>> and return within 2-3 seconds.  It tends to take the same amount of time
>> regardless of if there's lots of data, or very little.  So I think that
>> sec overhead is there because each slave will set up a new connection on
>> each request (I am unable to reuse connections in the slaves).
>2 secs sounds way too high. I recommend you check into this and see where
>the time is spent as you may find underlying issues like misconfiguration.
>> I'm wondering if I could remove some of that overhead by using the
>> (which can reuse its hbase connection) to determine the splits, and
>> delegating that information out to each slave. I think I could possibly
>> TableInputFormat/TableRecordReader to accomplish this?  Would this route
>> make sense?
>I'm not sure what you're talking about here. Which master? HBase's or
>there's something in your infrastructure that's also called "master"? Then
>I'm not sure what you are trying to achieve by "determine the splits",
>mean finding the regions you need to contact from your slaves? Since this
>is something done within the HBase client, doing it externally sounds
>terribly hacky. BTW why can't you keep the connections around? Is it a
>problem of JVMs being re-spawned? If so, there are techniques you can use
>to keep them around for reuse and then you would also benefit from reusing
>Hope this helps,
