hbase-user mailing list archives

From Anoop Sam John <anoo...@huawei.com>
Subject RE: Custom Filter and SEEK_NEXT_USING_HINT issue
Date Mon, 21 Jan 2013 08:59:28 GMT
> I suppose that if the scanning process had started on all regions at once,
> I would find at least one value per region in the log files, but I have
> found one value per region only for those regions that reside before the
> particular one.

@Eugeny - FuzzyRowFilter, like any other filter, works on the server side. A scan issued from
the client is sequential: it starts at the first region (the region with an empty start key, or
the region containing the start key you specified in your Scan). The client sends the scan
request to the RegionServer (RS) hosting that region. Once that region is exhausted, the client
contacts the next region, and so on. There is no parallel scanning of multiple regions from the
client side. [This applies when using the HTable scan APIs.]

When MapReduce is used for scanning, the regions are scanned in parallel, with one mapper per
region. A normal scan from the client side, however, goes through the regions sequentially,
not in parallel.
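
To illustrate the sequential behaviour described above, here is a minimal sketch in plain Java. It is a toy model, not HBase code: the class SequentialRegionScan, the method regionsVisited, and the region start keys and row keys are all made up for illustration. Regions are modelled as a sorted map from start key to rows, and the "client" walks them one at a time, which is why log output appears only for regions up to the one where the scan ends.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SequentialRegionScan {
    /** Returns the start keys of the regions a client-side scan contacts, in order. */
    public static List<String> regionsVisited(TreeMap<String, List<String>> regions,
                                              String stopAtRow) {
        List<String> visited = new ArrayList<>();
        // Regions are keyed by start key, so iteration follows row-key order.
        for (Map.Entry<String, List<String>> region : regions.entrySet()) {
            visited.add(region.getKey());
            // The client moves to the next region only when this one is exhausted.
            for (String row : region.getValue()) {
                if (row.compareTo(stopAtRow) >= 0) {
                    return visited; // scan ends here; later regions are never contacted
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        TreeMap<String, List<String>> regions = new TreeMap<>();
        regions.put("", Arrays.asList("AAAA", "ABBB")); // first region: empty start key
        regions.put("B", Arrays.asList("BAAA", "BCCC"));
        regions.put("F", Arrays.asList("F7dt", "FZZZ"));
        regions.put("M", Arrays.asList("MAAA"));
        // Only the regions starting at "", "B" and "F" are visited; "M" is not.
        System.out.println(regionsVisited(regions, "F7dt"));
    }
}
```

A MapReduce scan would instead hand each map entry to its own mapper, so every region would produce log output regardless of where the target row lives.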

-Anoop-
________________________________________
From: Eugeny Morozov [emorozov@griddynamics.com]
Sent: Monday, January 21, 2013 1:46 PM
To: user@hbase.apache.org
Cc: Alex Baranau
Subject: Re: Custom Filter and SEEK_NEXT_USING_HINT issue

Finally, the mystery has been solved.

Small remark before I explain everything.

The situation with only one region is exactly the same:
Fzzy: AAAA1Q7iQ9JA
Next fzzy: F7dtxwqVQ_Pw  <-- the value I'm trying to find.
Fzzy: F7dt8QWPSIDw
Somehow FuzzyRowFilter has simply skipped my value here.


So, the explanation.
In the javadoc for FuzzyRowFilter, a question mark is used as a placeholder
for an unknown value. Of course, it is possible to use anything, including
zero, instead of the question mark.
For quite some time we used text literals to encode our keys, like the ones
you've already seen: AAAA1Q7iQ9JA or F7dt8QWPSIDw. But that is the Base64 form
of just 8 bytes, which requires 1.5 times more space. So we decided to
store the raw version, just a byte[8]. Unfortunately, the symbol '?' (byte
value 0x3F) falls in the middle of the ASCII table
(http://www.asciitable.com/), which means that with FuzzyRowFilter we skip
half of the values in some cases. At the same time, the question mark sorts
before any letter that could be used in a key.
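
The size and ordering claims above can be checked with a few lines of standard Java. This is only a sketch: the class name Base64KeySize and the helper encodedLength are made up here, and the URL-safe Base64 alphabet is an assumption based on the '-' and '_' characters visible in the logged keys.

```java
import java.util.Base64;

public class Base64KeySize {
    /** Length of the URL-safe Base64 text (with padding) for a raw key. */
    public static int encodedLength(byte[] rawKey) {
        return Base64.getUrlEncoder().encodeToString(rawKey).length();
    }

    public static void main(String[] args) {
        byte[] rawKey = new byte[8];               // an 8-byte raw key
        System.out.println(encodedLength(rawKey)); // 12 characters, i.e. 1.5x the raw size
        System.out.println((int) '?');             // 63 (0x3F)
        System.out.println('?' < 'A');             // true: '?' sorts before every letter
    }
}
```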

Although we have integration tests, it is just a coincidence that none of
them covers such a case.

So, a piece of advice: always use zero instead of the question mark for
FuzzyRowFilter.
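
Why zero is the safe placeholder can be shown with a deliberately simplified model. This is not the FuzzyRowFilter code: it assumes, purely for illustration, that a seek hint is built by copying the placeholder byte into the unknown position, and the names PlaceholderSeekModel, hint, seek and sampleRows are hypothetical. The one fact it relies on is that HBase orders row keys lexicographically by unsigned byte value.

```java
import java.util.Comparator;
import java.util.TreeSet;

public class PlaceholderSeekModel {
    // Unsigned lexicographic order, the order in which HBase sorts row keys.
    static final Comparator<byte[]> UNSIGNED = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) return cmp;
        }
        return Integer.compare(a.length, b.length);
    };

    /** Hypothetical hint: one fixed byte followed by the placeholder byte. */
    public static byte[] hint(byte fixed, byte placeholder) {
        return new byte[] {fixed, placeholder};
    }

    /** First stored row at or after the hint, or null if there is none. */
    public static byte[] seek(TreeSet<byte[]> rows, byte[] hint) {
        return rows.ceiling(hint);
    }

    /** Three made-up rows that all match "0x10 followed by any byte". */
    public static TreeSet<byte[]> sampleRows() {
        TreeSet<byte[]> rows = new TreeSet<>(UNSIGNED);
        rows.add(new byte[] {0x10, 0x05});
        rows.add(new byte[] {0x10, 0x3F});
        rows.add(new byte[] {0x10, (byte) 0x80});
        return rows;
    }

    public static void main(String[] args) {
        TreeSet<byte[]> rows = sampleRows();
        // With '?' the seek lands on row 10 3F, so row 10 05 is never seen.
        System.out.println(seek(rows, hint((byte) 0x10, (byte) '?'))[1]); // 63
        // With zero the seek lands on the first matching row, 10 05.
        System.out.println(seek(rows, hint((byte) 0x10, (byte) 0))[1]);   // 5
    }
}
```

With Base64 text keys the problem stays hidden for letters, because '?' sorts before every letter, which fits the observation that it only surfaced after switching to raw bytes.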

Thanks to everyone!

P.S. But the question about the region scanning order is still open. I do not
understand why, with FuzzyRowFilter, the scan goes from one region to another
until it stops at the value. I suppose that if the scanning process had
started on all regions at once, I would find at least one value per region in
the log files, but I have found one value per region only for those regions
that reside before the particular one.


On Mon, Jan 21, 2013 at 4:22 AM, Michael Segel <michael_segel@hotmail.com> wrote:

> If it's the same class and it's not a patch, then the first class loaded
> wins.
>
> So if you have a class Foo and HBase has a class Foo, your code will never
> see the light of day.
>
> Perhaps I'm stating the obvious, but it's something to think about when
> working with Hadoop.
>
> On Jan 19, 2013, at 3:36 AM, Eugeny Morozov <emorozov@griddynamics.com>
> wrote:
>
> > Ted,
> >
> > that is correct.
> > HBase 0.92.x, and we use part of the HBASE-6509 patch.
> >
> > I use the filter as a custom filter; it lives in a separate jar file that
> > is on HBase's classpath. I did not patch HBase.
> > Moreover, I do not use the protobuf descriptions that come with the filter
> > in the patch. I have only two classes: FuzzyRowFilter itself and its test
> > class.
> >
> > It works perfectly on a small dataset of about 100 rows (1 region). But
> > when my dataset is more than 10 million rows (260 regions), it somehow
> > loses rows. I'm not sure, but it seems to me that it is not the filter's
> > fault.
> >
> >
> > On Sat, Jan 19, 2013 at 3:56 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> >
> >> To my knowledge CDH-4.1.2 is based on HBase 0.92.x
> >>
> >> It looks like you were using the patch from HBASE-6509, which was
> >> integrated into trunk only.
> >> Please confirm.
> >>
> >> Copying Alex who wrote the patch.
> >>
> >> Cheers
> >>
> >> On Fri, Jan 18, 2013 at 3:28 PM, Eugeny Morozov
> >> <emorozov@griddynamics.com> wrote:
> >>
> >>> Hi, folks!
> >>>
> >>> HBase, Hadoop, etc version is CDH-4.1.2
> >>>
> >>> I'm using a custom FuzzyRowFilter, which I got from
> >>> http://blog.sematext.com/2012/08/09/consider-using-fuzzyrowfilter-when-in-need-for-secondary-indexes-in-hbase/
> >>> and suddenly, after quite some time, we found that it had started losing
> >>> data.
> >>>
> >>> Basically, the idea of FuzzyRowFilter is that it tries to find the key
> >>> that has been provided, and if the current key does not match but more
> >>> keys exist in the table, it returns SEEK_NEXT_USING_HINT. In
> >>> getNextKeyHint(...) it builds the required key. As I understand it,
> >>> HBase then fast-forwards to that key; it should be similar to, or the
> >>> same as, a Scan with setStartRow.
> >>>
> >>> I'm trying to find the key F7dt8QWPSIDw. It is definitely in HBase: I'm
> >>> able to get it using Scan.setStartRow.
> >>> For FuzzyRowFilter I'm using an empty Scan: I didn't specify a start row,
> >>> a stop row, or anything related.
> >>> This is what happens:
> >>>
> >>> Fzzy: AAAA1Q7iQ9JA
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: AQAAnA96rxTg
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: AgAADQWPSIDw
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: AwAA-Q33Zb9Q
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: BAAAOg8oyu7A
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: BQAA9gqVQrTw
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: BgABZQ7iQ9JA
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: BwAAbgrpAojg
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: CAAAUQWPSIDw
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: CQABVgqVQrTw
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: CgAAOQ7iQ9JA
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: CwAALwqVQrTw
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: DAAAMwWPSIDw
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: DQAADgjqzsIQ
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: DgAAOgCcWv9g
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: DwAAKg7iQ9JA
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: EAAAugqVQrTw
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: EQAAJAqVQrTw
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: EgAABgIOMBgg
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: EwAAEwqVQrTw
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: FAAACQqVQrTw
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: FQAAIAqVQrTw
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: FgAAeAWPSIDw
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: FwAAAw33Zb9Q
> >>> Next fzzy: F7dtxwqVQ_Pw
> >>> Fzzy: F7dt8QWPSIDw
> >>>
> >>> It's obvious that my FuzzyRowFilter knows what to search for, and every
> >>> time it repeats its request.
> >>> The very first key, I suppose, is just the first key of the region where
> >>> my key is located.
> >>> The very last key is already bigger than the one I'm trying to find,
> >>> which is why FuzzyRowFilter stopped there.
> >>>
> >>> Do you know of any issue with SEEK_NEXT_USING_HINT? I've searched, but
> >>> without success.
> >>> Do you have any idea how to explain these many attempts?
> >>>
> >>> Thanks in advance.
> >>> --
> >>> Evgeny Morozov
> >>> Developer Grid Dynamics
> >>> Skype: morozov.evgeny
> >>> www.griddynamics.com
> >>> emorozov@griddynamics.com
> >>>
> >>
> >
>
>


--
Evgeny Morozov
Developer Grid Dynamics
Skype: morozov.evgeny
www.griddynamics.com
emorozov@griddynamics.com