From: James Taylor
Date: Mon, 8 Apr 2013 18:53:07 -0700
Subject: Re: Essential column family performance

Good idea, Sergey. We'll rerun with larger non essential column family
values and see if there's a crossover point. One other difference for us
is that we're using FAST_DIFF encoding. We'll try with no encoding too.
Our table has 20 million rows across four region servers.

Regarding the parallelization we do, we run multiple scans in parallel
instead of one single scan over the table. We use the region boundaries
of the table to divide up the work evenly, adding a start/stop key for
each scan that corresponds to the region boundaries. Our client then does
a final merge/aggregation step (i.e. adding up the count it gets back
from the scan for each region).

On 04/08/2013 01:34 PM, Sergey Shelukhin wrote:
> IntegrationTestLazyCfLoading uses randomly distributed keys with the
> following condition for filtering:
> 1 == (Long.parseLong(Bytes.toString(rowKey, 0, 4), 16) & 1); where rowKey
> is hex string of MD5 key.
> Then, there are 2 "lazy" CFs, each of which has a value of 4-64k.
> This test also showed significant improvement IIRC, so random distribution
> and high percentage of values selected should not be a problem as such.
>
> My hunch would be that the additional cost of seeks/merging the results
> from two CFs outweighs the benefit of lazy loading on such small values
> for the "lazy" CF with lots of data selected. This feature definitely makes
> no sense if you are selecting all values, because then extra work is being
> done for no benefit (everything is read anyway).
> So the use cases would be larger "lazy" CFs and/or low percentage of values
> selected.
>
> Can you try to increase the 2nd CF values' size and rerun the test?
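The per-region parallelization described at the top of this message can
be sketched without any HBase classes. This is a minimal illustration
under stated assumptions, not the actual client code: the table is a
hypothetical in-memory sorted map from row key to "passes the filter",
the split keys stand in for region boundaries, and the class and method
names are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a table: sorted row key -> whether the row passes the filter.
final class ParallelRegionCount {

    // Count matching rows in [startKey, stopKey); a null bound means "open ended".
    static long countRange(NavigableMap<String, Boolean> table, String startKey, String stopKey) {
        NavigableMap<String, Boolean> range = table;
        if (startKey != null) {
            range = range.tailMap(startKey, true);
        }
        if (stopKey != null) {
            range = range.headMap(stopKey, false);
        }
        long count = 0;
        for (boolean matches : range.values()) {
            if (matches) {
                count++;
            }
        }
        return count;
    }

    // One task per "region"; region i covers [splitKeys[i-1], splitKeys[i]).
    static long parallelCount(NavigableMap<String, Boolean> table, List<String> splitKeys) {
        int regions = splitKeys.size() + 1;
        ExecutorService pool = Executors.newFixedThreadPool(regions);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            for (int i = 0; i < regions; i++) {
                final String start = (i == 0) ? null : splitKeys.get(i - 1);
                final String stop = (i == regions - 1) ? null : splitKeys.get(i);
                Callable<Long> task = () -> countRange(table, start, stop);
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // final client-side merge: add up per-region counts
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("scan task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        TreeMap<String, Boolean> table = new TreeMap<>();
        for (int i = 0; i < 20; i++) {
            table.put(String.format("row%02d", i), i % 2 == 0);
        }
        System.out.println(parallelCount(table, Arrays.asList("row05", "row10", "row15")));
    }
}
```

Because each range is half-open, every row is counted by exactly one
task, so the summed result equals what a single scan over the whole map
would return.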
>
>
> On Mon, Apr 8, 2013 at 10:38 AM, James Taylor wrote:
>
>> In the TestJoinedScanners.java, is the 40% randomly distributed or
>> sequential?
>>
>> In our test, the % is randomly distributed. Also, our custom filter does
>> the same thing that SingleColumnValueFilter does. On the client-side, we'd
>> execute the query in parallel, through multiple scans along the region
>> boundaries. Would that have a negative impact on performance for this
>> "essential column family" feature?
>>
>> Thanks,
>>
>> James
>>
>>
>> On 04/08/2013 10:10 AM, Anoop John wrote:
>>
>>> Agree here. The effectiveness depends on what % of data satisfies the
>>> condition and how it is distributed across HFile blocks. We will get a
>>> performance gain when we are able to skip some HFile blocks (from
>>> non essential CFs). Can you test with a different HFile block size
>>> (lower value)?
>>>
>>> -Anoop-
>>>
>>>
>>> On Mon, Apr 8, 2013 at 8:19 PM, Ted Yu wrote:
>>>
>>>> I made the following change in TestJoinedScanners.java:
>>>> - int flag_percent = 1;
>>>> + int flag_percent = 40;
>>>>
>>>> The test took longer but still favors joined scanner.
>>>> I got some new results:
>>>>
>>>> 2013-04-08 07:46:06,959 INFO [main] regionserver.TestJoinedScanners(157):
>>>> Slow scanner finished in 7.424388 seconds, got 2050 rows
>>>> ...
>>>> 2013-04-08 07:46:12,010 INFO [main] regionserver.TestJoinedScanners(157):
>>>> Joined scanner finished in 5.05063 seconds, got 2050 rows
>>>>
>>>> 2013-04-08 07:46:18,358 INFO [main] regionserver.TestJoinedScanners(157):
>>>> Slow scanner finished in 6.348517 seconds, got 2050 rows
>>>> ...
>>>> 2013-04-08 07:46:22,946 INFO [main] regionserver.TestJoinedScanners(157):
>>>> Joined scanner finished in 4.587545 seconds, got 2050 rows
>>>>
>>>> Looks like effectiveness of joined scanner is affected by distribution of
>>>> data.
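As a rough illustration of the distribution point above: a block of the
non essential CF can be skipped only if none of the rows stored in it
match. The sketch below assumes every row matches independently with the
same probability, which is a simplification and not something measured in
this thread; the rows-per-block values and all names are hypothetical.

```java
// Back-of-the-envelope model only; it does not use or describe HBase internals.
final class BlockSkipEstimate {

    // Probability that a block holding rowsPerBlock rows contains no matching row,
    // assuming each row matches independently with probability matchFraction.
    static double skipProbability(double matchFraction, int rowsPerBlock) {
        return Math.pow(1.0 - matchFraction, rowsPerBlock);
    }

    public static void main(String[] args) {
        double[] matchFractions = {0.01, 0.10, 0.40}; // fraction of rows passing the filter
        int[] rowsPerBlock = {1, 10, 100, 1000};       // hypothetical block fill levels
        for (double p : matchFractions) {
            for (int n : rowsPerBlock) {
                System.out.printf("match=%.0f%% rows/block=%d skippable=%.4f%n",
                        p * 100, n, skipProbability(p, n));
            }
        }
    }
}
```

Under this model, fewer rows per block (a smaller block size or larger
values) raises the chance that a whole block can be skipped, while
randomly scattered matches at a high selection rate leave almost no
skippable blocks.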
>>>>
>>>> Cheers
>>>>
>>>> On Sun, Apr 7, 2013 at 8:52 PM, lars hofhansl wrote:
>>>>
>>>>> Looking at the joined scanner test code, it sets it up such that 1% of
>>>>> the rows match, which would somewhat be in line with James' results.
>>>>>
>>>>> In my own testing a while ago I found a 100% improvement with 0% match.
>>>>>
>>>>>
>>>>> -- Lars
>>>>>
>>>>>
>>>>>
>>>>> ________________________________
>>>>> From: Ted Yu
>>>>> To: user@hbase.apache.org
>>>>> Sent: Sunday, April 7, 2013 4:13 PM
>>>>> Subject: Re: Essential column family performance
>>>>>
>>>>> I have attached 5416-TestJoinedScanners-0.94.txt to HBASE-5416 for
>>>>> your reference.
>>>>>
>>>>> On my MacBook, I got the following results from the test:
>>>>>
>>>>> 2013-04-07 16:08:17,474 INFO [main] regionserver.TestJoinedScanners(157):
>>>>> Slow scanner finished in 7.973822 seconds, got 100 rows
>>>>> ...
>>>>> 2013-04-07 16:08:17,946 INFO [main] regionserver.TestJoinedScanners(157):
>>>>> Joined scanner finished in 0.47235 seconds, got 100 rows
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Sun, Apr 7, 2013 at 4:03 PM, Ted Yu wrote:
>>>>>
>>>>>> Looking at
>>>>>> https://issues.apache.org/jira/secure/attachment/12564340/5416-0.94-v3.txt ,
>>>>>> I found that it didn't contain TestJoinedScanners which shows
>>>>>> difference in scanner performance:
>>>>>>
>>>>>> LOG.info((slow ? "Slow" : "Joined") + " scanner finished in " +
>>>>>>     Double.toString(timeSec)
>>>>>>     + " seconds, got " + Long.toString(rows_count/2) + " rows");
>>>>>>
>>>>>> The test uses SingleColumnValueFilter:
>>>>>>
>>>>>> SingleColumnValueFilter filter = new SingleColumnValueFilter(
>>>>>>     cf_essential, col_name, CompareFilter.CompareOp.EQUAL, flag_yes);
>>>>>>
>>>>>> It is possible that the custom filter you were using would exhibit
>>>>>> different access pattern compared to SingleColumnValueFilter. e.g. does
>>>>>> your filter utilize hint ?
>>>>>> It would be easier for me and other people to reproduce the issue you
>>>>>> experienced if you put your scenario in some test similar to
>>>>>> TestJoinedScanners.
>>>>>>
>>>>>> Will take a closer look at the code Monday.
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> On Sun, Apr 7, 2013 at 11:37 AM, James Taylor wrote:
>>>>>>
>>>>>>> Yes, on 0.94.6. We have our own custom filter derived from FilterBase,
>>>>>>> so filterIfMissing isn't the issue - the results of the scan are correct.
>>>>>>> I can see that if the essential column family has more data compared
>>>>>>> to the non essential column family that the results would eventually
>>>>>>> even out.
>>>>>>> I was hoping to always be able to enable the essential column family
>>>>>>> feature. Is there an inherent reason why performance would degrade
>>>>>>> like this? Does it boil down to a single sequential scan versus many
>>>>>>> seeks?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> James
>>>>>>>
>>>>>>>
>>>>>>> On 04/07/2013 07:44 AM, Ted Yu wrote:
>>>>>>>
>>>>>>>> James:
>>>>>>>> Your test was based on 0.94.6.1, right ?
>>>>>>>>
>>>>>>>> What Filter were you using ?
>>>>>>>>
>>>>>>>> If you used SingleColumnValueFilter, have you seen my comment here ?
>>>>>>>> https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=13541229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13541229
>>>>>>>>
>>>>>>>> BTW the use case Max Lapan tried to address has non essential column
>>>>>>>> family carrying considerably more data compared to essential column
>>>>>>>> family.
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Apr 6, 2013 at 11:05 PM, James Taylor <jtaylor@salesforce.com> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> We're doing some performance testing of the essential column family
>>>>>>>>> feature, and we're seeing some performance degradation when
>>>>>>>>> comparing with and without the feature enabled:
>>>>>>>>>
>>>>>>>>>                           Performance of scan relative
>>>>>>>>> % of rows selected        to not enabling the feature
>>>>>>>>> ---------------------     ----------------------------
>>>>>>>>> 100%                      1.0x
>>>>>>>>> 80%                       2.0x
>>>>>>>>> 60%                       2.3x
>>>>>>>>> 40%                       2.2x
>>>>>>>>> 20%                       1.5x
>>>>>>>>> 10%                       1.0x
>>>>>>>>> 5%                        0.67x
>>>>>>>>> 0%                        0.30x
>>>>>>>>>
>>>>>>>>> In our scenario, we have two column families. The key value from the
>>>>>>>>> essential column family is used in the filter, while the key value
>>>>>>>>> from the other, non essential column family is returned by the scan.
>>>>>>>>> Each row contains values for both key values, with the values being
>>>>>>>>> relatively narrow (less than 50 bytes). In this scenario, the only
>>>>>>>>> time we're seeing a performance gain is when less than 10% of the
>>>>>>>>> rows are selected.
>>>>>>>>>
>>>>>>>>> Is this a reasonable test? Has anyone else measured this?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> James
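
The trade-off in the two-column-family scenario above (cells not read
versus extra seeks) can be made concrete with a toy count. This is only
an accounting sketch under assumed rules, not how HBase implements joined
scanners: it assumes one cell per row in each family, that a plain scan
reads both cells of every row, and that a joined scan reads the essential
cell of every row and then does one seek plus one read in the other
family per matching row. All names are hypothetical.

```java
// Toy accounting of work per scan; the cost rules are assumptions, not HBase internals.
final class JoinedScanToy {

    // Plain scan: both families are read for every row, with no extra seeks.
    // Returns {cellsRead, seeks}.
    static long[] fullScan(boolean[] rowMatches) {
        long cellsRead = 2L * rowMatches.length;
        return new long[] {cellsRead, 0L};
    }

    // Joined scan: the essential cell is always read; the second family is
    // touched (one seek, one read) only for rows that pass the filter.
    // Returns {cellsRead, seeks}.
    static long[] joinedScan(boolean[] rowMatches) {
        long cellsRead = 0;
        long seeks = 0;
        for (boolean matches : rowMatches) {
            cellsRead++;
            if (matches) {
                seeks++;
                cellsRead++;
            }
        }
        return new long[] {cellsRead, seeks};
    }

    public static void main(String[] args) {
        boolean[] rows = {true, false, false, true, false};
        long[] full = fullScan(rows);
        long[] joined = joinedScan(rows);
        System.out.println("full:   cells=" + full[0] + " seeks=" + full[1]);
        System.out.println("joined: cells=" + joined[0] + " seeks=" + joined[1]);
    }
}
```

In this model the joined scan saves one cell read per non-matching row
and pays one seek per matching row, so with narrow values the saving per
skipped row is small and the seek cost can dominate once many rows match.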