hbase-user mailing list archives

From anil gupta <anilgupt...@gmail.com>
Subject Re: Pagination with HBase - getting previous page of data
Date Sun, 03 Feb 2013 17:39:19 GMT
Inline...
On Sun, Feb 3, 2013 at 9:25 AM, Toby Lazar <tlazar@gmail.com> wrote:

> Quick question - if you perform the pagination client-side and just
> call scanner.iterator().next()
> to get to the necessary results, doesn't this add unnecessary network
> traffic for the unused results?


Anil: It depends on the solution. If 95% of your scans are limited to a single
region, then there won't be unnecessary network I/O.

>  If you want results 100-120, does the
> client need to first read results 1-100 over the network?


Anil: If you do a simple scan and you want results 100-120, then I would say
yes. You might be able to fetch only 100-120 by using a pagination filter or
by writing a custom filter or coprocessor. As I mentioned earlier in this
thread, we won't be allowing the user to jump to 100-120 directly. So the
user first needs to go through results 1-100. Hence, I will know the rowkey
of the 100th result, and that rowkey will become my startKey for results
100-120. So, no unnecessary network I/O.
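
A minimal sketch of this start-key approach, written as a plain Python
simulation rather than HBase client code. The sorted list stands in for a
table ordered by rowkey, and the names Pager, next_page and prev_page are
hypothetical. It assumes the start row of a scan is inclusive, so each
follow-up read fetches one extra row and drops the overlapping one. Keeping
the first rowkey of every visited page is one possible way to get back to
the previous page.

```python
from bisect import bisect_left


class Pager:
    """Client-side pager that remembers only row keys, never whole pages."""

    def __init__(self, sorted_keys, page_size):
        self.keys = sorted_keys    # stand-in for a table sorted by row key
        self.page_size = page_size
        self.starts = []           # first row key of every page visited so far
        self.last_key = None       # last row key of the current page

    def _scan(self, start_key, limit):
        # Stand-in for a scan with an inclusive start row and a row limit.
        lo = 0 if start_key is None else bisect_left(self.keys, start_key)
        return self.keys[lo:lo + limit]

    def next_page(self):
        # Start at the last key already shown and read one extra row,
        # because the start row is inclusive.
        rows = self._scan(self.last_key, self.page_size + 1)
        if rows and rows[0] == self.last_key:
            rows = rows[1:]        # drop the row the user has already seen
        rows = rows[:self.page_size]
        if rows:
            self.starts.append(rows[0])
            self.last_key = rows[-1]
        return rows

    def prev_page(self):
        if len(self.starts) < 2:
            return []              # already on the first page
        self.starts.pop()          # forget the current page
        rows = self._scan(self.starts[-1], self.page_size)
        self.last_key = rows[-1]
        return rows


# Hypothetical table with 250 row keys and a page size of 20.
table = [f"row-{i:04d}" for i in range(250)]
pager = Pager(table, 20)
first = pager.next_page()
second = pager.next_page()
back = pager.prev_page()
```

Each call reads at most page_size + 1 rows from the position it needs, so
rows of earlier pages are not re-read. The sketch ignores rows inserted or
deleted between calls, which would shift the page boundaries.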

>  Couldn't a
> filter help prevent some of that unneeded traffic?  Or, is the data only
> transferred when inspecting the result object?
>

Anil: Filters might help reduce unnecessary traffic. It all depends on your
use case.
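
One caveat with a page-size filter, based on Anoop's description quoted
further down: the filter is instantiated per region, so the limit holds per
region and not for the scan as a whole. The sketch below is a plain Python
simulation of that behaviour, not HBase code. The function names are
hypothetical, and the 9 + 10 row split is the example from this thread.

```python
def scan_with_per_region_limit(regions, page_size):
    """Simulate a scan whose row limit is enforced separately in each region."""
    results = []
    for region_rows in regions:                    # regions are visited one after the other
        results.extend(region_rows[:page_size])    # fresh limit for every region
    return results


def paged_scan(regions, page_size):
    """Same scan, but the client stops once it has one full page."""
    page = []
    for region_rows in regions:
        for row in region_rows[:page_size]:
            if len(page) == page_size:
                return page                        # client-side cut-off
            page.append(row)
    return page


# 9 matching rows in Region1 and 10 matching rows in Region2, page size 10.
regions = [[f"a{i}" for i in range(9)], [f"b{i}" for i in range(10)]]
unbounded = scan_with_per_region_limit(regions, 10)   # 19 rows
bounded = paged_scan(regions, 10)                     # 10 rows
```

So even with such a filter, the application still has to count rows itself
and stop when the page is full.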

>
> Thanks,
>
> Toby
> On Sun, Feb 3, 2013 at 11:07 AM, Anoop John <anoop.hbase@gmail.com> wrote:
>
> > >lets say for a scan setCaching is
> > 10 and scan is done across two regions. 9 Results(satisfying the filter)
> > are in Region1 and 10 Results(satisfying the filter) are in Region2. Then
> > will this scan return 19 (9+10) results?
> >
> > @Anil.
> > No, it will return only 10 results, not 19. The client here takes into
> > account the number of results received from the previous region. But a
> > filter is different. The filter has no logic to run at the client side. It
> > is fully executed at the server side. This is the way it is designed.
> > Personally, I would prefer to do the pagination in the app alone, using a
> > plain scan with caching (to avoid so many RPCs) and app-level logic.
> >
> > -Anoop-
> >
> > On Sat, Feb 2, 2013 at 1:32 PM, anil gupta <anilgupta84@gmail.com> wrote:
> >
> > > Hi Anoop,
> > >
> > > Please find my reply inline.
> > >
> > > Thanks,
> > > Anil
> > >
> > > On Wed, Jan 30, 2013 at 3:31 AM, Anoop Sam John <anoopsj@huawei.com>
> > > wrote:
> > >
> > > > @Anil
> > > >
> > > > >I could not understand why it goes to multiple regionservers in
> > > > parallel. Why can it not guarantee results <= page size (my guess: due
> > > > to multiple RS scans)? If you have used it then maybe you can explain
> > > > the behaviour?
> > > >
> > > > A scan from the client side never goes to multiple RSs in parallel. A
> > > > scan from the HTable API is sequential, one region after the other. For
> > > > every region it opens a scanner in the RS and does next() calls. The
> > > > filter is instantiated at the server side, per region ...
> > > >
> > > > Say you need 100 rows in the page, you create a Scan at the client side
> > > > with the filter, and there are 2 regions. First the scanner is opened
> > > > for region1 and the scan runs. The filter ensures that at most 100 rows
> > > > are retrieved from that region. But when the region boundary is crossed
> > > > and the client automatically opens a scanner for region2, it passes the
> > > > filter with max 100 rows there as well, so up to 100 rows can come from
> > > > that region too. So overall, at the client side we cannot guarantee
> > > > that the scan will only return 100 rows as a whole from the table.
> > > >
> > > >
> > >
> > > I agree with other people on this email chain that the 2nd region should
> > > only return (100 - no. of rows returned by Region1), if possible.
> > >
> > > When the region boundary is crossed and the client automatically opens a
> > > scanner for region2, why doesn't the scanner in Region2 know that some of
> > > the rows were already fetched from region1? Do you mean to say that by
> > > default, for a scan spanning multiple regions, every region has its own
> > > count of the number of rows that it is going to return? i.e. let's say
> > > for a scan setCaching is 10 and the scan is done across two regions. 9
> > > Results (satisfying the filter) are in Region1 and 10 Results (satisfying
> > > the filter) are in Region2. Then will this scan return 19 (9+10) results?
> > >
> > > >
> > > > I think I am making it clear. I have not used PageFilter at all. I am
> > > > just explaining as per my knowledge of the scan flow and general filter
> > > > usage.
> > > >
> > > > "This is because the filter is applied separately on different region
> > > > servers. It does however optimize the scan of individual HRegions by
> > > > making sure that the page size is never exceeded locally. "
> > > >
> > > > I guess it needs to say "This is because the filter is applied
> > > > separately on different regions".
> > > >
> > > > -Anoop-
> > > >
> > > > ________________________________________
> > > > From: anil gupta [anilgupta84@gmail.com]
> > > > Sent: Wednesday, January 30, 2013 1:33 PM
> > > > To: user@hbase.apache.org
> > > > Subject: Re: Pagination with HBase - getting previous page of data
> > > >
> > > > Hi Mohammad,
> > > >
> > > > You are most welcome to join the discussion. I have never used
> > > > PageFilter, so I don't really have concrete input.
> > > > I had a look at
> > > >
> > > >
> > >
> >
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/PageFilter.html
> > > > I could not understand why it goes to multiple regionservers in
> > > > parallel. Why can it not guarantee results <= page size (my guess: due
> > > > to multiple RS scans)? If you have used it then maybe you can explain
> > > > the behaviour?
> > > >
> > > > Thanks,
> > > > Anil
> > > >
> > > > On Tue, Jan 29, 2013 at 7:32 PM, Mohammad Tariq <dontariq@gmail.com>
> > > > wrote:
> > > >
> > > > > I'm kinda hesitant to put my leg in between the pros ;) But, does it
> > > > > sound sane to use PageFilter for both rows and columns and have some
> > > > > additional logic to handle the 'nth' page? It'll help us in both
> > > > > kinds of paging.
> > > > >
> > > > > On Wednesday, January 30, 2013, Jean-Marc Spaggiari <
> > > > > jean-marc@spaggiari.org>
> > > > > wrote:
> > > > > > Hi Anil,
> > > > > >
> > > > > > > I think it really depends on the way you want to use the
> > > > > > > pagination.
> > > > > > >
> > > > > > > Do you need to be able to jump to page X? Are you ok if you miss a
> > > > > > > line or 2? Is your data growing fast? Or slowly? Is it ok if your
> > > > > > > page indexes are a day old? Do you need to paginate over 300
> > > > > > > columns? Or just 1? Do you need to always have the exact same
> > > > > > > number of entries in each page?
> > > > > > >
> > > > > > > For my use case I need to be able to jump to page X and I don't
> > > > > > > have any content. I have hundreds of millions of lines. Only the
> > > > > > > rowkey matters for me and I'm fine if sometimes I have 50 entries
> > > > > > > displayed, and sometimes only 45. So I'm thinking about
> > > > > > > calculating which row is the first one for each page, and storing
> > > > > > > that separately. Then I just need to run the MR daily.
> > > > > > >
> > > > > > > It's not a perfect solution I agree, but this might do the job for
> > > > > > > me. I'm totally open to any other ideas which might do the job too.
> > > > > >
> > > > > > JM
> > > > > >
> > > > > > 2013/1/29, anil gupta <anilgupta84@gmail.com>:
> > > > > > >> Yes, your suggested solution only works for RowKey based
> > > > > > >> pagination. It will fail when you start filtering on the basis
> > > > > > >> of columns.
> > > > > > >>
> > > > > > >> Still, I would say it's comparatively easier to maintain this at
> > > > > > >> the application level rather than creating tables for pagination.
> > > > > > >>
> > > > > > >> What if you have 300 columns in your schema? Will you create 300
> > > > > > >> tables? What about handling of pagination when filtering is done
> > > > > > >> based on multiple columns ("and" and "or" conditions)?
> > > > > >>
> > > > > >> On Tue, Jan 29, 2013 at 1:08 PM, Jean-Marc Spaggiari <
> > > > > >> jean-marc@spaggiari.org> wrote:
> > > > > >>
> > > > > >>> No, no killer solution here ;)
> > > > > >>>
> > > > > > >>> But I'm still thinking about that because I might have to
> > > > > > >>> implement some pagination options soon...
> > > > > > >>>
> > > > > > >>> As you are saying, it only works on the row key, but if you want
> > > > > > >>> to do the same thing on non-rowkey columns, you might have to
> > > > > > >>> create a secondary index table...
> > > > > >>>
> > > > > >>> JM
> > > > > >>>
> > > > > >>> 2013/1/27, anil gupta <anilgupta84@gmail.com>:
> > > > > > >>> > That's alright. I thought that you had come up with a killer
> > > > > > >>> > solution, so I got curious to hear your ideas. ;)
> > > > > > >>> > It seems like your solution mentioned below will not work for
> > > > > > >>> > filtering on non-row-key columns, since when you are deciding
> > > > > > >>> > the page numbers you are only considering the rowkey.
> > > > > >>> >
> > > > > >>> > Thanks,
> > > > > >>> > Anil
> > > > > >>> >
> > > > > > >>> > On Fri, Jan 25, 2013 at 6:58 PM, Jean-Marc Spaggiari <
> > > > > > >>> > jean-marc@spaggiari.org> wrote:
> > > > > >>> >
> > > > > >>> >> Hi Anil,
> > > > > >>> >>
> > > > > > >>> >> I don't have a solution. I never thought about that ;) But I
> > > > > > >>> >> was thinking about something like this: you create a 2nd
> > > > > > >>> >> table where you place the row number (4 bytes), then the row
> > > > > > >>> >> key. To go directly to a specific page, you query by the
> > > > > > >>> >> number, find the key, and you know where to start your scan
> > > > > > >>> >> in the main table.
> > > > > > >>> >>
> > > > > > >>> >> The issue is probably the number for each line, since with an
> > > > > > >>> >> MR you don't know where you are from the beginning. But you
> > > > > > >>> >> can build something where you store the line number from the
> > > > > > >>> >> beginning of the region, then when all regions are parsed you
> > > > > > >>> >> can reconstruct the total numbering... That should work...
> > > > > >>> >>
> > > > > >>> >> JM
> > > > > >>> >>
> > > > > >>> >> 2013/1/25, anil gupta <anilgupta84@gmail.com>:
> > > > > >>> >> > Inline...
> > > > > >>> >> >
> > > > > > >>> >> > On Fri, Jan 25, 2013 at 9:17 AM, Jean-Marc Spaggiari <
> > > > > > >>> >> > jean-marc@spaggiari.org> wrote:
> > > > > > >>> >> >
> > > > > > >>> >> >> Hi Anil,
> > > > > > >>> >> >>
> > > > > > >>> >> >> The issue is that all the other subsequent page starts
> > > > > > >>> >> >> should be moved too...
> > > > > > >>> >> >>
> > > > > > >>> >> > Yes, this is a possibility. Hence the developer has to take
> > > > > > >>> >> > care of this case. It might also be possible that the
> > > > > > >>> >> > pageSize is not a hard limit on the number of results (more
> > > > > > >>> >> > like a hint or suggestion on size). I would say it varies
> > > > > > >>> >> > by use case.
> > > > > > >>> >> >
> > > > > > >>> >> >>
> > > > > > >>> >> >> so if you want to jump directly to page n, you might be
> > > > > > >>> >> >> totally shifted because of all the data inserted in the
> > > > > > >>> >> >> meantime...
> > > > > > >>> >> >>
> > > > > > >>> >> >> If you want a real complete pagination feature, you might
> > > > > > >>> >> >> want to have a coprocessor or an MR updating another table
> > > > > > >>> >> >> referring to the pages....
> > > > > > >>> >> >>
> > > > > > >>> >> > Well, the solution depends on the use case. I will be doing
> > > > > > >>> >> > pagination
> > > > > >
> > > > >
> > > > > --
> > > > > Warm Regards,
> > > > > Tariq
> > > > > https://mtariq.jux.com/
> > > > > cloudfront.blogspot.com
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Thanks & Regards,
> > > > Anil Gupta
> > > >
> > >
> > >
> > >
> > > --
> > > Thanks & Regards,
> > > Anil Gupta
> > >
> >
>



-- 
Thanks & Regards,
Anil Gupta
