hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: Essential column family performance
Date Sun, 07 Apr 2013 23:03:36 GMT
Looking at
https://issues.apache.org/jira/secure/attachment/12564340/5416-0.94-v3.txt,
I found that it doesn't contain TestJoinedScanners, which shows the
difference in scanner performance:

    LOG.info((slow ? "Slow" : "Joined") + " scanner finished in " +
        Double.toString(timeSec) +
        " seconds, got " + Long.toString(rows_count/2) + " rows");

The test uses SingleColumnValueFilter:

    SingleColumnValueFilter filter = new SingleColumnValueFilter(
        cf_essential, col_name, CompareFilter.CompareOp.EQUAL, flag_yes);
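
To make the idea behind the joined scanner concrete, here is a rough conceptual model: the filter is evaluated against the essential family only, and the other family is read only for rows that pass. This is a sketch under assumptions, not HBase code. The class JoinedScanSketch, the method joinedScan and the counter nonEssentialReads are hypothetical names, and plain in-memory maps stand in for the two column families.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Conceptual model only: plain maps stand in for column families.
// None of these names are HBase APIs.
class JoinedScanSketch {
    // Counts how many rows had their non-essential family read.
    static int nonEssentialReads = 0;

    static List<String> joinedScan(Map<String, String> essential,
                                   Map<String, String> nonEssential,
                                   Predicate<String> filter) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : essential.entrySet()) {
            // Step 1: evaluate the filter using only the essential family.
            if (!filter.test(e.getValue())) {
                continue; // row rejected; the other family is never touched
            }
            // Step 2: fetch the non-essential family only for accepted rows.
            nonEssentialReads++;
            out.add(nonEssential.get(e.getKey()));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> flags = new LinkedHashMap<>();
        Map<String, String> payload = new LinkedHashMap<>();
        for (int i = 0; i < 10; i++) {
            flags.put("row" + i, i % 5 == 0 ? "Y" : "N");
            payload.put("row" + i, "value" + i);
        }
        List<String> rows = joinedScan(flags, payload, "Y"::equals);
        System.out.println(rows + " with " + nonEssentialReads + " non-essential reads");
    }
}
```

In this model the number of lookups into the second family equals the number of rows the filter accepts. What each of those lookups costs in a real region is not modeled here.
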

It is possible that the custom filter you were using exhibits a
different access pattern from SingleColumnValueFilter. For example, does
your filter use hints?

It would be easier for me and other people to reproduce the issue you
experienced if you put your scenario in a test similar to
TestJoinedScanners.
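
A reproduction could follow the same timing pattern as the log line quoted above. The following is only a sketch of that pattern: ScanTimer and timeScan are hypothetical names, and the scan itself is replaced by a stand-in supplier that returns a row count, so nothing here touches HBase.

```java
import java.util.function.LongSupplier;

// Hypothetical harness; the scan itself is supplied by the caller.
class ScanTimer {
    // Runs the scan once and formats a line shaped like the quoted log message.
    static String timeScan(boolean slow, LongSupplier scan) {
        long start = System.nanoTime();
        long rows = scan.getAsLong(); // caller returns the number of rows seen
        double timeSec = (System.nanoTime() - start) / 1e9;
        return (slow ? "Slow" : "Joined") + " scanner finished in "
            + Double.toString(timeSec) + " seconds, got "
            + Long.toString(rows) + " rows";
    }

    public static void main(String[] args) {
        // Stand-in workload; a real test would iterate over scan results here.
        LongSupplier fakeScan = () -> {
            long n = 0;
            for (int i = 0; i < 1_000_000; i++) {
                n++;
            }
            return n;
        };
        System.out.println(timeScan(true, fakeScan));
        System.out.println(timeScan(false, fakeScan));
    }
}
```

Running the same workload once with the feature off ("Slow") and once with it on ("Joined") at each selectivity level would give numbers directly comparable to the table in the original report.
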

Will take a closer look at the code Monday.

Cheers

On Sun, Apr 7, 2013 at 11:37 AM, James Taylor <jtaylor@salesforce.com> wrote:

> Yes, on 0.94.6. We have our own custom filter derived from FilterBase, so
> filterIfMissing isn't the issue - the results of the scan are correct.
>
> I can see that if the essential column family has more data than the
> non-essential column family, the results would eventually even out.
> I was hoping to always be able to enable the essential column family
> feature. Is there an inherent reason why performance would degrade like
> this? Does it boil down to a single sequential scan versus many seeks?
>
> Thanks,
>
> James
>
>
> On 04/07/2013 07:44 AM, Ted Yu wrote:
>
>> James:
>> Your test was based on 0.94.6.1, right?
>>
>> What Filter were you using?
>>
>> If you used SingleColumnValueFilter, have you seen my comment here?
>> https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=13541229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13541229
>>
>> BTW the use case Max Lapan tried to address has the non-essential column
>> family carrying considerably more data than the essential column family.
>>
>> Cheers
>>
>>
>>
>> On Sat, Apr 6, 2013 at 11:05 PM, James Taylor <jtaylor@salesforce.com> wrote:
>>
>>> Hello,
>>> We're doing some performance testing of the essential column family
>>> feature, and we're seeing some performance degradation when comparing
>>> with and without the feature enabled:
>>>
>>>                           Performance of scan relative
>>> % of rows selected        to not enabling the feature
>>> ---------------------     ----------------------------
>>> 100%                      1.0x
>>>  80%                      2.0x
>>>  60%                      2.3x
>>>  40%                      2.2x
>>>  20%                      1.5x
>>>  10%                      1.0x
>>>   5%                      0.67x
>>>   0%                      0.30x
>>>
>>> In our scenario, we have two column families. The key value from the
>>> essential column family is used in the filter, while the key value from
>>> the other, non-essential column family is returned by the scan. Each row
>>> contains values for both key values, with the values being relatively
>>> narrow (less than 50 bytes). In this scenario, the only time we're
>>> seeing a performance gain is when less than 10% of the rows are selected.
>>>
>>> Is this a reasonable test? Has anyone else measured this?
>>>
>>> Thanks,
>>>
>>> James
