hbase-user mailing list archives

From James Taylor <jtay...@salesforce.com>
Subject Re: Essential column family performance
Date Mon, 08 Apr 2013 17:38:28 GMT
In the TestJoinedScanners.java, is the 40% randomly distributed or 
sequential?

In our test, the % is randomly distributed. Also, our custom filter does 
the same thing that SingleColumnValueFilter does.  On the client-side, 
we'd execute the query in parallel, through multiple scans along the 
region boundaries. Would that have a negative impact on performance for 
this "essential column family" feature?

Thanks,

     James
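
A rough way to see why the random-versus-sequential question matters: a block of the non-essential column family can only be skipped when none of its rows match. If matches are spread randomly, the chance of that is (1 - p)^k for match fraction p and k rows per block. The sketch below is a back-of-the-envelope model, not HBase code. The class and method names are hypothetical, the 100 bytes per row is an assumption (a narrow value plus key overhead), and 64 KB is the default HFile block size.

```java
// Back-of-the-envelope model only; none of these names exist in HBase.
final class BlockSkipModel {

    /**
     * Probability that a block holding rowsPerBlock rows contains no matching
     * row, when each row matches independently with probability matchFraction.
     */
    static double skipProbability(double matchFraction, int rowsPerBlock) {
        return Math.pow(1.0 - matchFraction, rowsPerBlock);
    }

    public static void main(String[] args) {
        int blockBytes = 64 * 1024;  // default HFile block size
        int bytesPerRow = 100;       // assumed: narrow value plus key overhead
        int rowsPerBlock = blockBytes / bytesPerRow;

        // Print the skip probability for a few randomly distributed match rates.
        for (double p : new double[] {0.0, 0.01, 0.05, 0.40}) {
            System.out.printf("match=%.0f%%  P(skip block)=%.3e%n",
                    p * 100, skipProbability(p, rowsPerBlock));
        }
    }
}
```

Under these assumptions the skip probability falls off very quickly as rows per block grows, which is in line with the suggestion quoted below to retest with a smaller HFile block size. Matches that sit in contiguous runs are not covered by this model.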

On 04/08/2013 10:10 AM, Anoop John wrote:
> Agree here. The effectiveness depends on what % of data satisfies the
> condition and how it is distributed across HFile blocks. We will get a
> performance gain when we are able to skip some HFile blocks (from
> non-essential CFs). Can you test with a different (lower) HFile block size?
>
> -Anoop-
>
>
> On Mon, Apr 8, 2013 at 8:19 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>
>> I made the following change in TestJoinedScanners.java:
>>
>> -      int flag_percent = 1;
>> +      int flag_percent = 40;
>>
>> The test took longer but still favors joined scanner.
>> I got some new results:
>>
>> 2013-04-08 07:46:06,959 INFO  [main] regionserver.TestJoinedScanners(157):
>> Slow scanner finished in 7.424388 seconds, got 2050 rows
>> ...
>> 2013-04-08 07:46:12,010 INFO  [main] regionserver.TestJoinedScanners(157):
>> Joined scanner finished in 5.05063 seconds, got 2050 rows
>>
>> 2013-04-08 07:46:18,358 INFO  [main] regionserver.TestJoinedScanners(157):
>> Slow scanner finished in 6.348517 seconds, got 2050 rows
>> ...
>> 2013-04-08 07:46:22,946 INFO  [main] regionserver.TestJoinedScanners(157):
>> Joined scanner finished in 4.587545 seconds, got 2050 rows
>>
>> Looks like effectiveness of joined scanner is affected by distribution of
>> data.
>>
>> Cheers
>>
>> On Sun, Apr 7, 2013 at 8:52 PM, lars hofhansl <larsh@apache.org> wrote:
>>
>>> Looking at the joined scanner test code, it sets it up such that 1% of the
>>> rows match, which would somewhat be in line with James' results.
>>>
>>> In my own testing a while ago I found a 100% improvement with 0% match.
>>>
>>>
>>> -- Lars
>>>
>>>
>>>
>>> ________________________________
>>>   From: Ted Yu <yuzhihong@gmail.com>
>>> To: user@hbase.apache.org
>>> Sent: Sunday, April 7, 2013 4:13 PM
>>> Subject: Re: Essential column family performance
>>>
>>> I have attached 5416-TestJoinedScanners-0.94.txt to HBASE-5416 for your
>>> reference.
>>>
>>> On my MacBook, I got the following results from the test:
>>>
>>> 2013-04-07 16:08:17,474 INFO  [main] regionserver.TestJoinedScanners(157):
>>> Slow scanner finished in 7.973822 seconds, got 100 rows
>>> ...
>>> 2013-04-07 16:08:17,946 INFO  [main] regionserver.TestJoinedScanners(157):
>>> Joined scanner finished in 0.47235 seconds, got 100 rows
>>>
>>> Cheers
>>>
>>> On Sun, Apr 7, 2013 at 4:03 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>>>
>>>> Looking at
>>>> <https://issues.apache.org/jira/secure/attachment/12564340/5416-0.94-v3.txt>,
>>>> I found that it didn't contain TestJoinedScanners which shows
>>>> difference in scanner performance:
>>>>
>>>>     LOG.info((slow ? "Slow" : "Joined") + " scanner finished in "
>>>>         + Double.toString(timeSec) + " seconds, got "
>>>>         + Long.toString(rows_count/2) + " rows");
>>>>
>>>> The test uses SingleColumnValueFilter:
>>>>
>>>>     SingleColumnValueFilter filter = new SingleColumnValueFilter(
>>>>         cf_essential, col_name, CompareFilter.CompareOp.EQUAL, flag_yes);
>>>>
>>>> It is possible that the custom filter you were using would exhibit a
>>>> different access pattern compared to SingleColumnValueFilter, e.g. does
>>>> your filter utilize hints?
>>>> It would be easier for me and other people to reproduce the issue you
>>>> experienced if you put your scenario in some test similar to
>>>> TestJoinedScanners.
>>>>
>>>> Will take a closer look at the code Monday.
>>>>
>>>> Cheers
>>>>
>>>> On Sun, Apr 7, 2013 at 11:37 AM, James Taylor <jtaylor@salesforce.com>
>>>> wrote:
>>>>
>>>>> Yes, on 0.94.6. We have our own custom filter derived from FilterBase, so
>>>>> filterIfMissing isn't the issue - the results of the scan are correct.
>>>>>
>>>>> I can see that if the essential column family has more data compared to
>>>>> the non essential column family that the results would eventually even out.
>>>>> I was hoping to always be able to enable the essential column family
>>>>> feature. Is there an inherent reason why performance would degrade like
>>>>> this? Does it boil down to a single sequential scan versus many seeks?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> James
>>>>>
>>>>>
>>>>> On 04/07/2013 07:44 AM, Ted Yu wrote:
>>>>>
>>>>>> James:
>>>>>> Your test was based on 0.94.6.1, right ?
>>>>>>
>>>>>> What Filter were you using ?
>>>>>>
>>>>>> If you used SingleColumnValueFilter, have you seen my comment here?
>>>>>> https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=13541229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13541229
>>>>>>
>>>>>> BTW the use case Max Lapan tried to address has non essential column
>>>>>> family
>>>>>> carrying considerably more data compared to essential column family.
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, Apr 6, 2013 at 11:05 PM, James Taylor <jtaylor@salesforce.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>> We're doing some performance testing of the essential column family
>>>>>>> feature, and we're seeing some performance degradation when comparing
>>>>>>> with and without the feature enabled:
>>>>>>>
>>>>>>>                             Performance of scan relative
>>>>>>> % of rows selected        to not enabling the feature
>>>>>>> ---------------------    --------------------------------
>>>>>>>
>>>>>>> 100%                            1.0x
>>>>>>>    80%                            2.0x
>>>>>>>    60%                            2.3x
>>>>>>>    40%                            2.2x
>>>>>>>    20%                            1.5x
>>>>>>>    10%                            1.0x
>>>>>>>     5%                            0.67x
>>>>>>>     0%                            0.30x
>>>>>>>
>>>>>>> In our scenario, we have two column families. The key value from the
>>>>>>> essential column family is used in the filter, while the key value from
>>>>>>> the other, non essential column family is returned by the scan. Each row
>>>>>>> contains values for both key values, with the values being relatively
>>>>>>> narrow (less than 50 bytes). In this scenario, the only time we're
>>>>>>> seeing a performance gain is when less than 10% of the rows are selected.
>>>>>>>
>>>>>>> Is this a reasonable test? Has anyone else measured this?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> James
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
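
For comparison across the thread, the TestJoinedScanners timings Ted Yu posted can be reduced to speedup ratios (slow scanner time divided by joined scanner time). The sketch below only redoes that arithmetic on the logged numbers; the class and method names are hypothetical and not part of the HBase test.

```java
// Arithmetic on the timings quoted in this thread; names are illustrative only.
final class ScannerSpeedup {

    /** Slow-scanner time over joined-scanner time; above 1 means joined was faster. */
    static double speedup(double slowSeconds, double joinedSeconds) {
        if (joinedSeconds <= 0) {
            throw new IllegalArgumentException("joinedSeconds must be positive");
        }
        return slowSeconds / joinedSeconds;
    }

    public static void main(String[] args) {
        // {flag_percent, slow seconds, joined seconds}, copied from the logs above.
        double[][] runs = {
            {1, 7.973822, 0.47235},
            {40, 7.424388, 5.05063},
            {40, 6.348517, 4.587545},
        };
        for (double[] run : runs) {
            System.out.printf("flag_percent=%.0f  speedup=%.2fx%n",
                    run[0], speedup(run[1], run[2]));
        }
    }
}
```

On these numbers the joined scanner is roughly 17x faster with 1% of rows flagged and about 1.4x to 1.5x faster with 40% flagged, on a single machine and with only one or two runs per setting.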

