hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Nicolas Spiegelberg (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get
Date Mon, 12 Jul 2010 21:14:51 GMT

    [ https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887537#action_12887537
] 

Nicolas Spiegelberg commented on HBASE-2794:
--------------------------------------------

IRC conversation about this...

 krispyjala: nspiegelberg: but is the test you want related to HBASE-2794 or just using bloom
filter in general (e.g. when to use it and when not to)?
 [1:41pm] nspiegelberg: it's related to 2794
 [1:42pm] nspiegelberg: an easy example of why you need good measurements is the case of calling
bloom.contains() for 100 row+col in a 1% false positive bloom.  You are getting almost 100%
false positives then, so the bloom is an obvious perf drop
 [1:43pm] krispyjala: nspiegelberg: ok i think i understand
 [1:44pm] krispyjala: nspiegelberg: but wait 100% false positive?
 [1:46pm] nspiegelberg: right, so io.hfile.bloom.error.rate == .01 by default.  so 1%
 [1:46pm] krispyjala: ok
 [1:46pm] krispyjala: how does that add up to 100% for 100 lookups?
 [1:46pm] nspiegelberg: therefore, if you call bloom.contains() 5 times and OR the result,
the false positive rate is 5%
 [1:49pm] nspiegelberg: krispyjala: so a simple example. call bloom.contains() 10 times =
10% error rate = (10ms/seek * 10%) + time(bloom.contains)
 [1:50pm] krispyjala: nspiegelberg: but is it really OR'ing all of them? In the code if even
one column lookup returns true we return true and don't look up any other columns
 [1:51pm] nspiegelberg: right, that's the same thing as ORing them
 [1:51pm] nspiegelberg: logical OR =>  ||
 [1:52pm] krispyjala: nspiegelberg: but the point is we're probably not looking up 100 columns
every time for that operation, even theoretically yes we do a logical OR
 [1:52pm] krispyjala: if we hit true on the 5th column, we quit the loop and return right
away
 [1:53pm] nspiegelberg: the only way you win with blooms is if all bloom.contains() return
false and you don't have to do the lookup
 [1:53pm] krispyjala: yes
 [1:53pm] nspiegelberg: so, you're right, we do an average of 50 lookups per false positive
in this case.
 [1:54pm] nspiegelberg: I'm just saying, what is the cost of those 50 lookups?  If 1ms, then
every HFile seek costs 11ms with blooms enabled versus 10 ms without using them
 [1:55pm] krispyjala: but wait i thought the code was to determine whether to add the StoreScanner
to the list or not...or are you saying then that the point is in the case of 100 columns we
should just not even bother doing bloom multicolumn check because perhaps it's better to just
load it than wasting time with the 100 lookups (potentially)
 [1:55pm] nspiegelberg: exactly
 [1:55pm] krispyjala: nspiegelberg: lol ok got it
 [1:56pm] krispyjala: but realistically, who does gets on 100 columns? I don't know the HBase
internals well yet (that's why i picked the noob ticket lol)...wouldn't it be better to just
do a get on the row?
 [1:57pm] nspiegelberg: never under-estimate the naivete of users
 [1:57pm] krispyjala: nspiegelberg: sigh lol, i guess that's why the bloom is off by default?
 [1:58pm] nspiegelberg: yes
 [1:58pm] nspiegelberg: so, it's obvious that you never want to run bloom code with 101 columns
+ 1% error rate
 [1:58pm] krispyjala: correct
 [1:59pm] nspiegelberg: so, really it's just timing testBloomPerf with various lookup counts
on various size blooms
 [2:00pm] krispyjala: nspiegelberg: this talk has helped me think about how to test like you
said
 [2:00pm] • St^Ack hopes the above good-stuff(tm) 'lesson' makes it back into the issue....
 [2:00pm] nspiegelberg: looks like ryan didn't give you any concrete numbers, so you might
have to just start with some assumptions (like, don't use blooms if avg key < 1KB) and
run with that
 [2:01pm] krispyjala: nspiegelberg: and perhaps once we kind of know where the tradeoff is,
would it be wrong to limit in the code saying if there are more than say 10 column lookups
might as well just return true?
 [2:01pm] krispyjala: cuz it's not worth looking up in bloom at that point
 [2:01pm] nspiegelberg: I think that's exactly what we need to do
 [2:01pm] krispyjala: whatever the threshold is
 [2:02pm] nspiegelberg: if we pretend that the cost of bloom.contains() == 0, then maybe we
want to say if (column.count * error.rate > 10%) return true;
 [2:02pm] dj_ryan: well it's hard to say where the tradeoff goes
 [2:02pm] krispyjala: pastebin? lol jk
 [2:02pm] dj_ryan: but the hard number is 9 bits/item
 [2:03pm] dj_ryan: you can then calculate how much ram you are spending on blooms
 [2:03pm] dj_ryan: and decide if its worth it
 [2:03pm] nspiegelberg: the hard # for 1% error rate blooms is 9 bits/item
 [2:03pm] dj_ryan: we never implemented blooms because it seemed 12gb of ram would be better
off caching
 [2:03pm] krispyjala: dj_ryan: so your suggestion the onus is on the user and not hbase code
 [2:03pm] nspiegelberg: with .1% error rate, it's ~12 bits/item
 [2:04pm] krispyjala: or should we allow customizations of the limits? then we don't need
to come up with the "recommended" threshold
 [2:05pm] dj_ryan: well
 [2:05pm] dj_ryan: it is up to the user
 [2:05pm] nspiegelberg: I think the onus for figuring out whether to use blooms or not is
on the user, but we should still have a 'this is too stupid' early exit
 [2:05pm] dj_ryan: i mean maybe we could put a lot of metrics to detect when a bloom filter
might be useful
 [2:05pm] dj_ryan: but im not sure that's worth it 
 [2:06pm] krispyjala: dj_ryan: yes, but right now they can either just turn it on or off,
and with my patch they will be forced to look up all the columns if they have more than one
 [2:07pm] nspiegelberg: krispyjala: I think just running testBloomPerf on a couple different
sizes will give you a goo timing measurement.  Unless my initial thoughts are off, you can
probably just get away with saying: (column.count * error.rate > 10%) return true;
 [2:07pm] krispyjala: nspiegelberg: yeah I agree we should have the early exit strat
 [2:08pm] krispyjala: nspiegelberg: ok i will do some testing this evening on it
 [2:08pm] nspiegelberg: then, when somebody asks why you chose 10%, you can say that it obviously
makes sense when below 10% and they should run some numbers for you if they want to pump it
up 
 

> ROWCOL bloom filter not used if multiple columns within same family are requested in
a Get
> ------------------------------------------------------------------------------------------
>
>                 Key: HBASE-2794
>                 URL: https://issues.apache.org/jira/browse/HBASE-2794
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>
> Noticed the following snippet in StoreFile.java:Scanner:shouldSeek():
> {code}
>         switch(bloomFilterType) {
>           case ROW:
>             key = row;
>             break;
>           case ROWCOL:
>             if (columns.size() == 1) {
>               byte[] col = columns.first();
>               key = Bytes.add(row, col);
>               break;
>             }
>             //$FALL-THROUGH$
>           default:
>             return true;
>         }
> {code}
> If columns.size > 1, then we currently don't take advantage of the bloom filter. 
We should optimize this to check bloom for each of columns and if none of the columns are
present in the bloom avoid opening the file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message