hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject Re: Essential column family performance
Date Mon, 08 Apr 2013 18:07:49 GMT

I think that JM brings up a good point.
Keep in mind that row-level locking (RLL) in HBase is not the same as row-level locking in
transactional systems.
Depending on the use case... you can keep things in separate tables and not worry about the
issues with CFs.

So when you think about your design... separate tables may be a valid design. 

IMHO, more thought is needed before using CFs.

The Essential column family feature sounds like it's more beneficial for edge cases and not so much
for the primary use case.
Again, IMHO if you're using it for your primary use case, then I think you should rethink
your schema design. 
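
To make the edge-case point concrete, here is a toy model in Java. It is a sketch, not HBase code: the class JoinedScanModel, its methods, and the fixed rows-per-block layout are all assumptions made up for illustration. It counts how many blocks of a non essential CF hold at least one row that passed the filter, since only those blocks would need to be read when that CF is loaded on demand.

```java
import java.util.BitSet;

// Hypothetical toy model; none of these names come from HBase.
public class JoinedScanModel {

    /** Number of fixed-size blocks needed to hold totalRows rows. */
    public static int totalBlocks(int totalRows, int rowsPerBlock) {
        return (totalRows + rowsPerBlock - 1) / rowsPerBlock;
    }

    /** Blocks of the non essential CF that contain at least one matching row. */
    public static int blocksTouched(BitSet matches, int totalRows, int rowsPerBlock) {
        int touched = 0;
        for (int start = 0; start < totalRows; start += rowsPerBlock) {
            int end = Math.min(start + rowsPerBlock, totalRows);
            int next = matches.nextSetBit(start); // first match at or after block start
            if (next >= 0 && next < end) {
                touched++;
            }
        }
        return touched;
    }

    /** Every step-th row matches the filter. */
    public static BitSet evenlySpread(int totalRows, int step) {
        BitSet matches = new BitSet(totalRows);
        for (int row = 0; row < totalRows; row += step) {
            matches.set(row);
        }
        return matches;
    }

    /** The first count rows match the filter. */
    public static BitSet clustered(int count) {
        BitSet matches = new BitSet(count);
        matches.set(0, count);
        return matches;
    }

    public static void main(String[] args) {
        int rows = 10_000;
        for (int rowsPerBlock : new int[] {100, 10}) {
            int total = totalBlocks(rows, rowsPerBlock);
            int spread = blocksTouched(evenlySpread(rows, 100), rows, rowsPerBlock);
            int packed = blocksTouched(clustered(100), rows, rowsPerBlock);
            System.out.println("rowsPerBlock=" + rowsPerBlock + ": " + total + " blocks, "
                    + spread + " touched when spread evenly, " + packed + " when clustered");
        }
    }
}
```

In this model, with 1% of 10,000 rows matching and 100 rows per block, evenly spread matches touch all 100 blocks (nothing is skipped), while clustered matches touch a single block. Shrinking the block to 10 rows leaves only 100 of 1,000 blocks touched in the evenly spread case, which is why the block size and the distribution of the data matter as much as the match percentage.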

To Ted's point, keeping like data within CFs makes it easier to process the data within
an M/R framework, since your scanner will work against the CFs in the table.

Yet, I have to ask why you would filter on one CF when pulling data from a second? Why not
duplicate the data and store it in both?  Again, this is highly dependent on the use case.

Just saying...


On Apr 8, 2013, at 12:23 PM, Ted Yu <yuzhihong@gmail.com> wrote:

> Currently, atomicity support in HBase is limited to a single table, single region.
> 
> If the user chooses separate tables, it might be harder to implement the
> business logic.
> 
> On Mon, Apr 8, 2013 at 10:19 AM, Jean-Marc Spaggiari <jean-marc@spaggiari.org> wrote:
> 
>> Something I'm not getting: why not use separate tables instead of
>> CFs for a single table? Simply name your table tablename_cfname, and then
>> you get rid of the CF# limitation?
>> 
>> Or are there big pros to having CFs?
>> 
>> JM
>> 
>> 2013/4/8 Anoop John <anoop.hbase@gmail.com>:
>>> Agree here. The effectiveness depends on what % of the data satisfies the
>>> condition and how it is distributed across HFile blocks. We will get a
>>> performance gain when we are able to skip some HFile blocks (from
>>> non essential CFs). Can you test with a different HFile block size (lower value)?
>>> 
>>> -Anoop-
>>> 
>>> 
>>> On Mon, Apr 8, 2013 at 8:19 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>>> 
>>>> I made the following change in TestJoinedScanners.java:
>>>> 
>>>> -      int flag_percent = 1;
>>>> +      int flag_percent = 40;
>>>> 
>>>> The test took longer but still favors joined scanner.
>>>> I got some new results:
>>>> 
>>>> 2013-04-08 07:46:06,959 INFO  [main] regionserver.TestJoinedScanners(157):
>>>> Slow scanner finished in 7.424388 seconds, got 2050 rows
>>>> ...
>>>> 2013-04-08 07:46:12,010 INFO  [main] regionserver.TestJoinedScanners(157):
>>>> Joined scanner finished in 5.05063 seconds, got 2050 rows
>>>> 
>>>> 2013-04-08 07:46:18,358 INFO  [main] regionserver.TestJoinedScanners(157):
>>>> Slow scanner finished in 6.348517 seconds, got 2050 rows
>>>> ...
>>>> 2013-04-08 07:46:22,946 INFO  [main] regionserver.TestJoinedScanners(157):
>>>> Joined scanner finished in 4.587545 seconds, got 2050 rows
>>>> 
>>>> Looks like the effectiveness of the joined scanner is affected by the
>>>> distribution of data.
>>>> 
>>>> Cheers
>>>> 
>>>> On Sun, Apr 7, 2013 at 8:52 PM, lars hofhansl <larsh@apache.org> wrote:
>>>> 
>>>>> Looking at the joined scanner test code, it sets it up such that 1% of the
>>>>> rows match, which would be somewhat in line with James' results.
>>>>> 
>>>>> In my own testing a while ago I found a 100% improvement with 0% match.
>>>>> 
>>>>> 
>>>>> -- Lars
>>>>> 
>>>>> 
>>>>> 
>>>>> ________________________________
>>>>> From: Ted Yu <yuzhihong@gmail.com>
>>>>> To: user@hbase.apache.org
>>>>> Sent: Sunday, April 7, 2013 4:13 PM
>>>>> Subject: Re: Essential column family performance
>>>>> 
>>>>> I have attached 5416-TestJoinedScanners-0.94.txt to HBASE-5416 for your
>>>>> reference.
>>>>> 
>>>>> On my MacBook, I got the following results from the test:
>>>>> 
>>>>> 2013-04-07 16:08:17,474 INFO  [main] regionserver.TestJoinedScanners(157):
>>>>> Slow scanner finished in 7.973822 seconds, got 100 rows
>>>>> ...
>>>>> 2013-04-07 16:08:17,946 INFO  [main] regionserver.TestJoinedScanners(157):
>>>>> Joined scanner finished in 0.47235 seconds, got 100 rows
>>>>> 
>>>>> Cheers
>>>>> 
>>>>> On Sun, Apr 7, 2013 at 4:03 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>>>>> 
>>>>>> Looking at
>>>>>> https://issues.apache.org/jira/secure/attachment/12564340/5416-0.94-v3.txt ,
>>>>>> I found that it didn't contain TestJoinedScanners, which shows the
>>>>>> difference in scanner performance:
>>>>>> 
>>>>>>   LOG.info((slow ? "Slow" : "Joined") + " scanner finished in " +
>>>>>>       Double.toString(timeSec)
>>>>>>       + " seconds, got " + Long.toString(rows_count/2) + " rows");
>>>>>> 
>>>>>> The test uses SingleColumnValueFilter:
>>>>>> 
>>>>>>    SingleColumnValueFilter filter = new SingleColumnValueFilter(
>>>>>>        cf_essential, col_name, CompareFilter.CompareOp.EQUAL, flag_yes);
>>>>>> 
>>>>>> It is possible that the custom filter you were using would exhibit a
>>>>>> different access pattern compared to SingleColumnValueFilter, e.g. does
>>>>>> your filter utilize hints?
>>>>>> It would be easier for me and other people to reproduce the issue you
>>>>>> experienced if you put your scenario in some test similar to
>>>>>> TestJoinedScanners.
>>>>>> 
>>>>>> Will take a closer look at the code Monday.
>>>>>> 
>>>>>> Cheers
>>>>>> 
>>>>>> On Sun, Apr 7, 2013 at 11:37 AM, James Taylor <jtaylor@salesforce.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> Yes, on 0.94.6. We have our own custom filter derived from FilterBase, so
>>>>>>> filterIfMissing isn't the issue - the results of the scan are correct.
>>>>>>> 
>>>>>>> I can see that if the essential column family has more data compared to
>>>>>>> the non essential column family that the results would eventually even out.
>>>>>>> I was hoping to always be able to enable the essential column family
>>>>>>> feature. Is there an inherent reason why performance would degrade like
>>>>>>> this? Does it boil down to a single sequential scan versus many seeks?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> James
>>>>>>> 
>>>>>>> 
>>>>>>> On 04/07/2013 07:44 AM, Ted Yu wrote:
>>>>>>> 
>>>>>>>> James:
>>>>>>>> Your test was based on 0.94.6.1, right ?
>>>>>>>> 
>>>>>>>> What Filter were you using ?
>>>>>>>> 
>>>>>>>> If you used SingleColumnValueFilter, have you seen my comment here ?
>>>>>>>> https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=13541229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13541229
>>>>>>>> 
>>>>>>>> BTW the use case Max Lapan tried to address has the non essential column
>>>>>>>> family carrying considerably more data compared to the essential column
>>>>>>>> family.
>>>>>>>> 
>>>>>>>> Cheers
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Sat, Apr 6, 2013 at 11:05 PM, James Taylor <jtaylor@salesforce.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Hello,
>>>>>>>>> We're doing some performance testing of the essential column family
>>>>>>>>> feature, and we're seeing some performance degradation when comparing
>>>>>>>>> with and without the feature enabled:
>>>>>>>>> 
>>>>>>>>>                           Performance of scan relative
>>>>>>>>> % of rows selected        to not enabling the feature
>>>>>>>>> ---------------------    ----------------------------
>>>>>>>>> 100%                            1.0x
>>>>>>>>>  80%                            2.0x
>>>>>>>>>  60%                            2.3x
>>>>>>>>>  40%                            2.2x
>>>>>>>>>  20%                            1.5x
>>>>>>>>>  10%                            1.0x
>>>>>>>>>   5%                            0.67x
>>>>>>>>>   0%                            0.30x
>>>>>>>>> 
>>>>>>>>> In our scenario, we have two column families. The key value from the
>>>>>>>>> essential column family is used in the filter, while the key value from
>>>>>>>>> the other, non essential column family is returned by the scan. Each row
>>>>>>>>> contains values for both key values, with the values being relatively
>>>>>>>>> narrow (less than 50 bytes). In this scenario, the only time we're
>>>>>>>>> seeing a performance gain is when less than 10% of the rows are selected.
>>>>>>>>> 
>>>>>>>>> Is this a reasonable test? Has anyone else measured this?
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> 
>>>>>>>>> James
>>>>>>>>> 
>> 
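
For reference, the TestJoinedScanners timings quoted in this thread can be turned into speedup ratios with a few lines of Java. This is only a sketch: the class and method names are hypothetical, the timings are copied from Ted's logs above, and the flag_percent of 1 for the first run is inferred from the test default mentioned in the thread rather than stated in that log.

```java
// Hypothetical helper; only the timings come from the thread.
public class ScannerSpeedup {

    /** How many times faster the joined scanner ran than the slow scanner. */
    public static double speedup(double slowSeconds, double joinedSeconds) {
        return slowSeconds / joinedSeconds;
    }

    public static void main(String[] args) {
        // Each row: {flag_percent, slow scanner seconds, joined scanner seconds}.
        double[][] runs = {
            {1, 7.973822, 0.47235},
            {40, 7.424388, 5.05063},
            {40, 6.348517, 4.587545},
        };
        for (double[] run : runs) {
            System.out.printf("flag_percent=%d%%: joined scanner is %.2fx faster%n",
                    (int) run[0], speedup(run[1], run[2]));
        }
    }
}
```

The ratio drops from roughly 17x at 1% matching rows to roughly 1.4x to 1.5x at 40%, which is consistent with the observation that the benefit shrinks as more rows pass the filter.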

