From: Sean McNamara
To: user@hbase.apache.org
Subject: Re: Parallel reading advice
Date: Wed, 28 Nov 2012 22:25:06 +0000

Turns out there is a way to reuse the connection in Spark. I was also
forgetting to call setCaching (that was the primary reason). So it's very
fast now and I have the data where I need it. The first request still
takes 2-3 seconds to set up and see data (regardless of how much), but
after that it's super fast.

Sean

On 11/28/12 10:37 AM, "Sean McNamara" wrote:

>Hi J-D
>
>Really good questions. I will check for a misconfiguration.
>
>
>> I'm not sure what you're talking about here. Which master?
>
>I am using http://spark-project.org/ , so the master I am referring to is
>really the Spark driver. Spark can read from a Hadoop InputFormat and
>populate itself that way, but you don't have control over which
>slave/worker the data will land on. My goal is to use Spark to reach
>in for slices of data that are in HBase, and be able to perform set
>operations on the data in parallel using Spark. Being able to load a
>partition onto the right node is important, so that I don't have to
>reshuffle the data just to get it onto the node that handles a
>particular data partition.
>
>
>> BTW why can't you keep the connections around?
>
>The Spark API is purely functional; AFAIK it's not possible to set up a
>connection and keep it around (I am asking on that mailing list to be
>sure).
>
>
>> Since this is something done within the HBase client, doing it
>> externally sounds terribly hacky
>
>Yup. The reason I am entertaining this route is that using an InputFormat
>with Spark I was able to load in way more data, and it was all sub-second.
> Since moving to having the Spark slaves pull in their own data (not
>using the InputFormat), it seems slower for some reason. I figured it
>might be because with an InputFormat the slaves were told what to load,
>vs. each of the 40 slaves having to do more work to find what to load.
>Perhaps my assumption is wrong? Thoughts?
>
>
>I really appreciate your insights. Thanks!
>
>
>On 11/28/12 3:10 AM, "Jean-Daniel Cryans" wrote:
>
>>Inline.
>>
>>J-D
>>
>>On Wed, Nov 28, 2012 at 7:28 AM, Sean McNamara wrote:
>>
>>> I have a table whose keys are prefixed with a byte to help distribute
>>> the keys so scans don't hotspot.
>>>
>>> I also have a bunch of slave processes that work to scan the prefix
>>> partitions in parallel. Currently each slave sets up its own HBase
>>> connection, scanner, etc. Most of the slave processes finish their
>>> scan and return within 2-3 seconds. It tends to take the same amount
>>> of time regardless of whether there's lots of data or very little. So
>>> I think that 2-second overhead is there because each slave sets up a
>>> new connection on each request (I am unable to reuse connections in
>>> the slaves).
>>>
>>
>>2 secs sounds way too high. I recommend you check into this and see
>>where the time is spent, as you may find underlying issues like
>>misconfiguration.
>>
>>
>>> I'm wondering if I could remove some of that overhead by using the
>>> master (which can reuse its HBase connection) to determine the splits,
>>> and then delegate that information out to each slave. I think I could
>>> possibly use TableInputFormat/TableRecordReader to accomplish this?
>>> Would this route make sense?
>>>
>>
>>I'm not sure what you're talking about here. Which master? HBase's, or
>>is there something in your infrastructure that's also called "master"?
>>Then I'm not sure what you are trying to achieve by "determine the
>>splits"; do you mean finding the regions you need to contact from your
>>slaves?
>>Since this is something done within the HBase client, doing it
>>externally sounds terribly hacky. BTW why can't you keep the connections
>>around? Is it a problem of JVMs being re-spawned? If so, there are
>>techniques you can use to keep them around for reuse, and then you would
>>also benefit from reusing connections.
>>
>>Hope this helps,
>>
>>J-D
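For anyone following the thread later: the prefix-partition scheme Sean describes (a one-byte salt on every row key so scans don't hotspot, with one slave scanning each salt bucket) can be sketched in a few lines of pure logic. This is an illustrative Python sketch, not HBase client code; the function names (`salted_key`, `salt_scan_ranges`) and the byte-sum hash are my own assumptions, stand-ins for whatever stable hash the real keys use.

```python
def salted_key(raw_key: bytes, num_buckets: int) -> bytes:
    """Prefix a logical row key with a one-byte salt so writes and
    scans spread across num_buckets ranges instead of hotspotting.

    The byte-sum hash is just a deterministic stand-in; any stable
    hash of the logical key works, as long as reads use the same one.
    """
    return bytes([sum(raw_key) % num_buckets]) + raw_key


def salt_scan_ranges(num_buckets: int) -> list:
    """Return (start_row, stop_row) byte-string pairs, one per salt
    bucket, for num_buckets <= 255.

    Each slave scans exactly one bucket; stop_row is exclusive, which
    matches how scan boundaries behave in HBase.
    """
    return [(bytes([b]), bytes([b + 1])) for b in range(num_buckets)]


# Example: with 4 buckets, a key lands in exactly one bucket's range.
key = salted_key(b"user123", 4)
start, stop = salt_scan_ranges(4)[key[0]]
assert start <= key < stop
```

The point of handing the (start, stop) pairs out from a central place, as Sean proposes, is that each slave can open its scanner directly on its own range instead of re-deriving the partitioning on every request.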