hbase-user mailing list archives

From: Sean McNamara <Sean.McNam...@Webtrends.com>
Subject: Re: Parallel reading advice
Date: Wed, 28 Nov 2012 22:25:06 GMT

Turns out there is a way to reuse the connection in Spark.  I was also
forgetting to call setCaching (that was the primary reason). So it's very
fast now and I have the data where I need it.
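
To illustrate why the missing setCaching call can dominate scan time, here is a rough back-of-the-envelope model. It is not HBase client code and not the code from this thread: the class name, the method, and the 10,000-row partition size are made up for illustration. The only real API it refers to is Scan.setCaching(int), which sets how many rows the client fetches per scanner round trip. With a small caching value (the default in HBase releases of this period was, as far as I know, 1 row per round trip), each row costs a network round trip.

```java
// Back-of-the-envelope model, NOT HBase client code: estimates how many
// scanner round trips a scan needs for a given caching value.
final class ScanRoundTrips {

    // rows:    number of rows the scan returns (hypothetical input)
    // caching: rows fetched per round trip, i.e. the value one would
    //          pass to Scan.setCaching(int)
    static long roundTrips(long rows, int caching) {
        if (rows < 0 || caching < 1) {
            throw new IllegalArgumentException("need rows >= 0 and caching >= 1");
        }
        return (rows + caching - 1) / caching; // ceiling division
    }

    public static void main(String[] args) {
        long rows = 10_000; // hypothetical partition size
        for (int caching : new int[] {1, 100, 1000}) {
            System.out.println("caching=" + caching + " -> "
                    + roundTrips(rows, caching) + " round trips");
        }
    }
}
```

The real client adds a few extra calls (opening and closing the scanner), so treat the numbers as an estimate of the trend only. Larger caching values also mean more memory held per scanner.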

The first request still takes 2-3 seconds to set up and see data
(regardless of how much), but after that it's super fast.
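
The message does not say how the connection is reused, so the following is only a sketch of one common pattern, not necessarily the one used here: keep the expensive object in a per-JVM, lazily initialized holder so the first caller pays the setup cost and later callers in the same worker JVM get the same instance. LazyHolder, ConnectionReuseDemo, and the String standing in for a real connection are all hypothetical names for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Thread-safe lazy holder: the factory runs at most once per holder.
final class LazyHolder<T> {
    private final Supplier<T> factory;
    private volatile T instance;

    LazyHolder(Supplier<T> factory) {
        this.factory = factory;
    }

    T get() {
        T local = instance;
        if (local == null) {
            synchronized (this) {
                local = instance;
                if (local == null) {
                    local = factory.get(); // expensive setup happens here, once
                    instance = local;
                }
            }
        }
        return local;
    }
}

final class ConnectionReuseDemo {
    // Counts how often the (hypothetical) expensive setup actually runs.
    static final AtomicInteger SETUPS = new AtomicInteger();

    // A String stands in for a real client/connection object.
    static final LazyHolder<String> CONNECTION =
            new LazyHolder<>(() -> "connection-" + SETUPS.incrementAndGet());

    public static void main(String[] args) {
        for (int task = 0; task < 3; task++) {
            System.out.println("task " + task + " uses " + CONNECTION.get());
        }
        System.out.println("setups: " + SETUPS.get());
    }
}
```

Because the holder is a static field, it lives as long as the worker JVM does, which matches the observed behavior of a slow first request followed by fast ones.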


On 11/28/12 10:37 AM, "Sean McNamara" <Sean.McNamara@Webtrends.com> wrote:

>Hi J-D
>Really good questions.  I will check for a misconfiguration.
>> I'm not sure what you're talking about here. Which master
>I am using http://spark-project.org/ , so the master I am referring to is
>really the spark driver.  Spark can read from a hadoop InputFormat and
>populate itself that way, but you don't have control over which
>slave/worker data will land on using it.  My goal is to use spark to reach
>in for slices of data that are in HBase, and be able to perform set
>operations on the data in parallel using spark.  Being able to load a
>partition onto the right node is important. This is so that I don't have
>to reshuffle the data, just to get it onto the right node that handles a
>particular data partition.
>> BTW why can't you keep the connections around?
>The spark api is totally functional, AFAIK it's not possible to set up a
>connection and keep it around (I am asking on that mailing list to be
>> Since this is something done within the HBase client, doing it
>>externally sounds terribly hacky
>Yup.  The reason I am entertaining this route is that using an InputFormat
>with spark I was able to load in way more data, and it was all sub-second.
> Since moving to having the spark slaves handle pulling in their data (not
>using the InputFormat) it seems slower for some reason.  I figured it
>might be because using an InputFormat the slaves were told what to load,
>vs. each of the 40 slaves having to do more work to find what to load.
>Perhaps my assumption is wrong?  Thoughts?
>I really appreciate your insights.  Thanks!
>On 11/28/12 3:10 AM, "Jean-Daniel Cryans" <jdcryans@apache.org> wrote:
>>On Wed, Nov 28, 2012 at 7:28 AM, Sean McNamara
>>> I have a table whose keys are prefixed with a byte to help distribute
>>> keys so scans don't hotspot.
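
For readers unfamiliar with the prefix-byte layout described above, a minimal sketch follows. The thread does not say how the prefix byte is chosen, so the hash-modulo rule, the bucket count of 4, and all names here are assumptions for illustration. The idea: each key gets a one-byte bucket prefix, and each parallel reader scans one bucket using a start row (inclusive) and stop row (exclusive) that would be handed to an HBase Scan.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SaltedScanRanges {

    // Prepends a one-byte bucket prefix to the key. Deriving the bucket from
    // a hash of the key is an ASSUMPTION; the thread does not specify it.
    static byte[] saltedKey(byte[] key, int buckets) {
        checkBuckets(buckets);
        int bucket = Math.floorMod(Arrays.hashCode(key), buckets);
        byte[] salted = new byte[key.length + 1];
        salted[0] = (byte) bucket;
        System.arraycopy(key, 0, salted, 1, key.length);
        return salted;
    }

    // One {startRow, stopRow} pair per bucket; start inclusive, stop exclusive.
    static List<byte[][]> ranges(int buckets) {
        checkBuckets(buckets);
        List<byte[][]> ranges = new ArrayList<>();
        for (int b = 0; b < buckets; b++) {
            byte[] start = {(byte) b};
            byte[] stop = {(byte) (b + 1)};
            ranges.add(new byte[][] {start, stop});
        }
        return ranges;
    }

    // Capped at 255 so that bucket + 1 still fits in a single unsigned byte.
    private static void checkBuckets(int buckets) {
        if (buckets < 1 || buckets > 255) {
            throw new IllegalArgumentException("buckets must be in 1..255");
        }
    }

    public static void main(String[] args) {
        for (byte[][] r : ranges(4)) {
            System.out.println("scan [" + (r[0][0] & 0xFF) + ", " + (r[1][0] & 0xFF) + ")");
        }
    }
}
```

Each range can then be handed to a separate worker, which is the parallel per-prefix scan the rest of this message describes.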
>>> I also have a bunch of slave processes that work to scan the prefix
>>> partitions in parallel.  Currently each slave sets up their own hbase
>>> connection, scanner, etc.  Most of the slave processes finish their
>>> and return within 2-3 seconds.  It tends to take the same amount of time
>>> regardless of if there's lots of data, or very little.  So I think that 2
>>> sec overhead is there because each slave will set up a new connection on
>>> each request (I am unable to reuse connections in the slaves).
>>2 secs sounds way too high. I recommend you check into this and see where
>>the time is spent as you may find underlying issues like misconfiguration.
>>> I'm wondering if I could remove some of that overhead by using the master
>>> (which can reuse its hbase connection) to determine the splits, and
>>> delegating that information out to each slave. I think I could possibly
>>> TableInputFormat/TableRecordReader to accomplish this?  Would this
>>> make sense?
>>I'm not sure what you're talking about here. Which master? HBase's or
>>there's something in your infrastructure that's also called "master"?
>>I'm not sure what you are trying to achieve by "determine the splits", do you
>>mean finding the regions you need to contact from your slaves? Since this
>>is something done within the HBase client, doing it externally sounds
>>terribly hacky. BTW why can't you keep the connections around? Is it a
>>problem of JVMs being re-spawned? If so, there are techniques you can use
>>to keep them around for reuse and then you would also benefit from
>>Hope this helps,
