hbase-user mailing list archives

From Eugeny Morozov <emoro...@griddynamics.com>
Subject Re: Custom Filter and SEEK_NEXT_USING_HINT issue
Date Sun, 20 Jan 2013 21:22:24 GMT
Ted, thanks for the question.
Here are the results of my investigation.

It seems I was mistaken. I thought that a scanner is assigned to each
region (so that the regions are scanned in parallel), which would mean that
each scanner starts from the beginning of its region and then seeks forward
to the required record.

But currently the table has 256 splits, by the first byte of the key value:
start - end
NA  - \x01
\x01 - \x02
\x02 - \x03
...
\xFE - \xFF
\xFF - NA

And it turns out that the values I've seen are values from different
regions, except for the last two, which both reside in the same region:
AAAA1Q7iQ9JA : [0  <-- the value's first byte (i.e. the particular region)
AQAAnA96rxTg : [1
AgAADQWPSIDw : [2
...
EwAAEwqVQrTw : [19
FAAACQqVQrTw : [20
FQAAIAqVQrTw : [21
FgAAeAWPSIDw : [22
FwAAAw33Zb9Q : [23
F7dt8QWPSIDw : [23
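
The first-byte column above can be reproduced from the printed keys. The
sketch below assumes that the logged keys are URL-safe Base64 renderings of
the binary row keys (the `-` and `_` characters in some of them suggest this,
but it is an assumption), and the class name RegionOfKey is made up for the
illustration.

```java
import java.util.Base64;

// Hypothetical helper, not part of HBase or of the filter.
final class RegionOfKey {

    // Assumption: the printed key is URL-safe Base64 text of the binary row key.
    static int firstByte(String printedKey) {
        byte[] raw = Base64.getUrlDecoder().decode(printedKey);
        return raw[0] & 0xFF; // unsigned value of the leading byte
    }

    public static void main(String[] args) {
        String[] keys = {"AAAA1Q7iQ9JA", "AQAAnA96rxTg", "FwAAAw33Zb9Q", "F7dt8QWPSIDw"};
        for (String key : keys) {
            // Same layout as the listing above: key, then its first byte.
            System.out.println(key + " : [" + firstByte(key));
        }
    }
}
```

Under that assumption the last two keys both decode to a first byte of 23,
which matches the listing.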

1. I still don't understand why it skips the required value.
2. The only explanation I've found for this output is that the scan goes
through the regions one by one until it finds the value. Is that how it
should work? Shouldn't it start from the beginning of each region (if there
is no setStartRow), in parallel for all regions at once, and then, in a
second step (after the filter's getNextKeyHint method), know exactly where
to go?
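
The region-by-region explanation in point 2 can be modelled with a toy scan.
This is only a sketch under stated assumptions: the regions are plain sorted
sets, the keys and the class name SequentialScanSketch are invented, and the
"filter" simply answers with a fixed hint until it sees a key at or past that
hint. It is not HBase code and it says nothing about point 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Toy model only: none of these names are HBase classes.
final class SequentialScanSketch {

    // Visits the regions one after another and records what a hint-based filter would see.
    static List<String> scan(List<TreeSet<String>> regions, String hint) {
        List<String> log = new ArrayList<>();
        for (TreeSet<String> region : regions) {
            // No start row: each region is entered at its first key.
            String row = region.isEmpty() ? null : region.first();
            while (row != null) {
                log.add("Fzzy: " + row);
                if (row.compareTo(hint) >= 0) {
                    return log; // at or past the hinted key: the toy filter stops
                }
                log.add("Next fzzy: " + hint);
                // Seek forward inside this region only; null means the hint
                // lies beyond the region, so the scan moves to the next one.
                row = region.ceiling(hint);
            }
        }
        return log;
    }

    // Three hypothetical regions with made-up keys.
    static List<TreeSet<String>> demoRegions() {
        List<TreeSet<String>> regions = new ArrayList<>();
        regions.add(new TreeSet<>(Arrays.asList("a1", "a5")));
        regions.add(new TreeSet<>(Arrays.asList("b2", "b7")));
        regions.add(new TreeSet<>(Arrays.asList("c3", "c6", "c9")));
        return regions;
    }

    public static void main(String[] args) {
        for (String line : scan(demoRegions(), "c5")) {
            System.out.println(line);
        }
    }
}
```

With the hint "c5" the toy log shows the first key of each earlier region
followed by the same hint, and ends on "c6", the first key past the hint in
the last region. That is the same shape as the output quoted below.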


On Sat, Jan 19, 2013 at 5:16 PM, Ted <yuzhihong@gmail.com> wrote:

> In your original email you said the first key looked like the start key of
> a region. Can you verify that?
>
> Thanks
>
> On Jan 19, 2013, at 1:36 AM, Eugeny Morozov <emorozov@griddynamics.com>
> wrote:
>
> > Ted,
> >
> > that is correct.
> > It is HBase 0.92.x, and we use part of the patch from HBASE-6509.
> >
> > I use the filter as a custom filter; it lives in a separate jar file that
> > is on HBase's classpath. I did not patch HBase.
> > Moreover, I do not use the protobuf descriptions that come with the
> > filter in the patch. I have only two classes: FuzzyRowFilter itself and
> > its test class.
> >
> > It works perfectly on a small dataset of about 100 rows (1 region). But
> > when my dataset is larger than 10 million rows (260 regions), it somehow
> > loses rows. I'm not sure, but it seems to me that this is not the
> > filter's fault.
> >
> >
> > On Sat, Jan 19, 2013 at 3:56 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> >
> >> To my knowledge CDH-4.1.2 is based on HBase 0.92.x
> >>
> >> Looks like you were using patch from HBASE-6509 which was integrated to
> >> trunk only.
> >> Please confirm.
> >>
> >> Copying Alex who wrote the patch.
> >>
> >> Cheers
> >>
> >> On Fri, Jan 18, 2013 at 3:28 PM, Eugeny Morozov
> >> <emorozov@griddynamics.com> wrote:
> >>
> >>> Hi, folks!
> >>>
> >>> HBase, Hadoop, etc version is CDH-4.1.2
> >>>
> >>> I'm using a custom FuzzyRowFilter, which I got from
> >>> http://blog.sematext.com/2012/08/09/consider-using-fuzzyrowfilter-when-in-need-for-secondary-indexes-in-hbase/
> >>> and suddenly, after quite some time, we found that it had started
> >>> losing data.
> >>>
> >>> Basically, the idea of FuzzyRowFilter is that it tries to find the key
> >>> that has been provided, and if there is no such key, but more keys
> >>> exist in the table, it returns SEEK_NEXT_USING_HINT. Then, in
> >>> getNextKeyHint(...), it builds the required key. As I understand it,
> >>> HBase uses this hint to fast-forward to the required key, which should
> >>> be similar or identical to running a Scan with setStartRow.
> >>>
> >>> I'm trying to find the key F7dt8QWPSIDw. It is definitely in HBase: I'm
> >>> able to get it using Scan.setStartRow.
> >>> For the FuzzyRowFilter I'm using an empty Scan: I didn't specify a start
> >>> row, a stop row, or anything related.
> >>> This is what happens:
> >>>
> >>> Fzzy: AAAA1Q7iQ9JA
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: AQAAnA96rxTg
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: AgAADQWPSIDw
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: AwAA-Q33Zb9Q
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: BAAAOg8oyu7A
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: BQAA9gqVQrTw
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: BgABZQ7iQ9JA
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: BwAAbgrpAojg
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: CAAAUQWPSIDw
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: CQABVgqVQrTw
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: CgAAOQ7iQ9JA
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: CwAALwqVQrTw
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: DAAAMwWPSIDw
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: DQAADgjqzsIQ
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: DgAAOgCcWv9g
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: DwAAKg7iQ9JA
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: EAAAugqVQrTw
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: EQAAJAqVQrTw
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: EgAABgIOMBgg
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: EwAAEwqVQrTw
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: FAAACQqVQrTw
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: FQAAIAqVQrTw
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: FgAAeAWPSIDw
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: FwAAAw33Zb9Q
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: F7dt8QWPSIDw
> >>>
> >>> It's obvious that my FuzzyRowFilter knows what to search for, and every
> >>> time it repeats its request.
> >>> The very first key, I suppose, is just the first key of the region
> >>> where my key is located.
> >>> The very last key is already bigger than the one I'm trying to find;
> >>> that's the reason why the FuzzyRowFilter stopped there.
> >>>
> >>> Do you know of any issue with SEEK_NEXT_USING_HINT? I've searched, but
> >>> unsuccessfully.
> >>> Do you have any idea how to explain these many attempts?
> >>>
> >>> Thanks in advance.
> >>> --
> >>> Evgeny Morozov
> >>> Developer Grid Dynamics
> >>> Skype: morozov.evgeny
> >>> www.griddynamics.com
> >>> emorozov@griddynamics.com



-- 
Evgeny Morozov
Developer Grid Dynamics
Skype: morozov.evgeny
www.griddynamics.com
emorozov@griddynamics.com
