hbase-issues mailing list archives

From "Igor Kuzmitshov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-6618) Implement FuzzyRowFilter with ranges support
Date Fri, 28 Feb 2014 12:45:26 GMT

    [ https://issues.apache.org/jira/browse/HBASE-6618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915732#comment-13915732
] 

Igor Kuzmitshov commented on HBASE-6618:
----------------------------------------

The description above says that rule ????(0001 - 0999) means <any 4 bytes><any
4-byte value between "0001" and "0999">, so I assumed that the value in the fixed part is
checked as a whole. The code, however, checks each of its bytes in isolation, so the rule is
effectively ????0(0 - 9)(0 - 9)(1 - 9).

It's fine for ranges like this, but let's take another: ??(53 - 97). I would expect aa68 to
satisfy the rule, but in the proposed implementation it doesn't (because bytes are checked
in isolation and 8 is outside the range [3, 7]). Could you clarify whether this is the intended
behaviour?
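
To make the two readings concrete, here is a minimal sketch of both checks applied to the ranged part of the key only. The class and method names are hypothetical and are not taken from the HBASE-6618 patch; both bounds and the value are assumed to have the same length.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: hypothetical helpers, not code from the HBASE-6618 patch.
class RangeSemanticsSketch {

    // Per-byte semantics: every byte must lie within [lo[i], hi[i]] on its own.
    static boolean inRangePerByte(byte[] value, byte[] lo, byte[] hi) {
        for (int i = 0; i < value.length; i++) {
            int b = value[i] & 0xFF;
            if (b < (lo[i] & 0xFF) || b > (hi[i] & 0xFF)) {
                return false;
            }
        }
        return true;
    }

    // Whole-value semantics: lo <= value <= hi in unsigned lexicographic order.
    static boolean inRangeWhole(byte[] value, byte[] lo, byte[] hi) {
        return compareUnsigned(lo, value) <= 0 && compareUnsigned(value, hi) <= 0;
    }

    // Unsigned lexicographic comparison of two arrays of equal length.
    static int compareUnsigned(byte[] a, byte[] b) {
        for (int i = 0; i < a.length; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return 0;
    }

    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] lo = ascii("53"), hi = ascii("97"), v = ascii("68");
        System.out.println(inRangePerByte(v, lo, hi)); // false: '8' is outside ['3', '7']
        System.out.println(inRangeWhole(v, lo, hi));   // true: "53" <= "68" <= "97"
    }
}
```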

If yes, i.e. aa68 should not satisfy rule ??(53 - 97):
It would be nice to make it clearer in the description that all bytes are checked in isolation
and that there are actually no n-byte values. In this case, there is a bug: for rule ??(50 - 97)
and value MM58 (where M is max byte \xFF), satisfies() returns SatisfiesCode.NO_NEXT because
nextRowKeyCandidateExists is only updated for non-fixed positions. It should return NEXT_EXISTS,
because MM60 should be the next key.
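
As a rough illustration of why a next key exists under the per-byte reading, the sketch below computes the smallest key not less than a given row in which every byte lies within its own bounds. It is not the patch's getNextForFuzzyRule(); the method name and the representation (non-fixed positions modelled as the full range [0x00, 0xFF]) are assumptions made for the example.

```java
import java.util.Arrays;

// Illustration only: hypothetical helper, not code from the HBASE-6618 patch.
class NextKeySketch {

    // Returns the smallest key >= row whose byte i lies in [lo[i], hi[i]] for every i,
    // or null if no such key exists. All arrays are assumed to have equal length.
    static byte[] nextCandidate(byte[] row, byte[] lo, byte[] hi) {
        // Find the first position whose byte is outside its range.
        int bad = -1;
        for (int i = 0; i < row.length; i++) {
            int b = row[i] & 0xFF;
            if (b < (lo[i] & 0xFF) || b > (hi[i] & 0xFF)) {
                bad = i;
                break;
            }
        }
        if (bad < 0) {
            return row.clone(); // the row already satisfies the rule
        }
        byte[] next = row.clone();
        int resetFrom;
        if ((row[bad] & 0xFF) < (lo[bad] & 0xFF)) {
            // Too small: raising this byte and all later ones to their lower bounds is enough.
            resetFrom = bad;
        } else {
            // Too large: increment the nearest earlier byte that is still below its upper bound.
            int j = bad - 1;
            while (j >= 0 && (row[j] & 0xFF) >= (hi[j] & 0xFF)) {
                j--;
            }
            if (j < 0) {
                return null; // nothing to the left can grow, so no next key exists
            }
            next[j] = (byte) ((row[j] & 0xFF) + 1);
            resetFrom = j + 1;
        }
        System.arraycopy(lo, resetFrom, next, resetFrom, row.length - resetFrom);
        return next;
    }

    public static void main(String[] args) {
        byte m = (byte) 0xFF;
        byte[] lo = {0, 0, '5', '0'};   // rule ??(50 - 97), per-byte lower bounds
        byte[] hi = {m, m, '9', '7'};   // per-byte upper bounds
        byte[] next = nextCandidate(new byte[] {m, m, '5', '8'}, lo, hi);
        System.out.println(Arrays.equals(next, new byte[] {m, m, '6', '0'})); // true: MM60
    }
}
```

In this sketch the next key for MM58 is found by incrementing a ranged position ('5' to '6'), even though neither non-fixed position can be incremented.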

If no, i.e. aa68 should satisfy rule ??(53 - 97):
In this case, satisfies() should be fixed. I made a patch with the fix and can add it if needed.
It also has a small optimisation for when there is no need to check less significant bytes. For
example: for range [120, 500] and key 345, it will compare the first byte (3) only, as it's
clear that the whole value is in the range.
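
The early exit can be sketched roughly as follows. This is not the patch mentioned above, only one possible way to implement a whole-value check that stops once the result is decided; the names are hypothetical, the counter exists purely to show how many bytes were inspected, and the bounds are assumed to satisfy lo <= hi and to have the same length as the value.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: hypothetical helper, not the patch referred to in the comment.
class EarlyExitRangeSketch {

    // Number of bytes inspected by the most recent call (for demonstration only).
    static int lastBytesInspected;

    // Whole-value check lo <= value <= hi in unsigned lexicographic order,
    // stopping as soon as the value is strictly inside both bounds.
    static boolean inRange(byte[] value, byte[] lo, byte[] hi) {
        boolean tiedWithLo = true; // prefix so far equals the prefix of lo
        boolean tiedWithHi = true; // prefix so far equals the prefix of hi
        lastBytesInspected = 0;
        for (int i = 0; i < value.length; i++) {
            lastBytesInspected++;
            int b = value[i] & 0xFF;
            if (tiedWithLo) {
                int l = lo[i] & 0xFF;
                if (b < l) return false;
                if (b > l) tiedWithLo = false;
            }
            if (tiedWithHi) {
                int h = hi[i] & 0xFF;
                if (b > h) return false;
                if (b < h) tiedWithHi = false;
            }
            if (!tiedWithLo && !tiedWithHi) {
                return true; // strictly between the bounds: later bytes cannot change the result
            }
        }
        return true; // value equals lo and/or hi on every remaining tied byte
    }

    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        boolean ok = inRange(ascii("345"), ascii("120"), ascii("500"));
        System.out.println(ok + " after " + lastBytesInspected + " byte(s)"); // true after 1 byte(s)
    }
}
```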

In any case, the tests should also cover satisfies() with ranges (the current patch only adds
tests for getNextForFuzzyRule() with ranges).

> Implement FuzzyRowFilter with ranges support
> --------------------------------------------
>
>                 Key: HBASE-6618
>                 URL: https://issues.apache.org/jira/browse/HBASE-6618
>             Project: HBase
>          Issue Type: New Feature
>          Components: Filters
>            Reporter: Alex Baranau
>            Assignee: Alex Baranau
>            Priority: Minor
>             Fix For: 0.99.0
>
>         Attachments: HBASE-6618-algo-desc-bits.png, HBASE-6618-algo.patch, HBASE-6618.patch,
HBASE-6618_2.path, HBASE-6618_3.path
>
>
> Apart from current ability to specify fuzzy row filter e.g. for <userId_actionId>
format as ????_0004 (where 0004 - actionId) it would be great to also have ability to specify
the "fuzzy range" , e.g. ????_0004, ..., ????_0099.
> See initial discussion here: http://search-hadoop.com/m/WVLJdX0Z65
> Note: currently it is possible to provide multiple fuzzy row rules to existing FuzzyRowFilter,
but in case when the range is big (contains thousands of values) it is not efficient.
> Filter should perform efficient fast-forwarding during the scan (this is what distinguishes
it from regex row filter).
> While such functionality may seem like a proper fit for custom filter (i.e. not including
into standard filter set) it looks like the filter may be very re-useable. We may judge based
on the implementation that will hopefully be added.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
