hbase-user mailing list archives

From Jean-Marc Spaggiari <jean-m...@spaggiari.org>
Subject Re: Pagination with HBase - getting previous page of data
Date Wed, 30 Jan 2013 12:18:25 GMT
Hi Anoop,

So does that mean the scanner can send back up to LIMIT*2-1 lines? Reading
100 rows from the 2nd region costs extra time and resources. Why
not ask only for the number of lines still missing?
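
To make the question concrete, here is a minimal sketch in plain Python. It is a simulation and not the HBase client API: regions are modelled as sorted lists of row keys, and the names scan_region, scan_fixed_limit and scan_remaining_limit are hypothetical. It also assumes the client stops opening new regions once it holds a full page.

```python
from typing import List


def scan_region(rows: List[str], start_row: str, limit: int) -> List[str]:
    """Simulate one region scanner that returns at most `limit` rows."""
    return [r for r in rows if r >= start_row][:limit]


def scan_fixed_limit(regions: List[List[str]], start_row: str, page_size: int) -> List[str]:
    """Pass the same limit to every region, like a filter re-created per region."""
    out: List[str] = []
    for region in regions:  # regions are visited sequentially, in key order
        if len(out) >= page_size:  # assumption: client stops once the page is full
            break
        out.extend(scan_region(region, start_row, page_size))
    return out


def scan_remaining_limit(regions: List[List[str]], start_row: str, page_size: int) -> List[str]:
    """Ask each region only for the rows still missing from the page."""
    out: List[str] = []
    for region in regions:
        missing = page_size - len(out)
        if missing <= 0:
            break
        out.extend(scan_region(region, start_row, missing))
    return out


region1 = [f"row{i:05d}" for i in range(99)]       # 99 rows in the first region
region2 = [f"row{i:05d}" for i in range(99, 249)]  # 150 rows in the second region
print(len(scan_fixed_limit([region1, region2], "row00000", 100)))      # 199
print(len(scan_remaining_limit([region1, region2], "row00000", 100)))  # 100
```

With 99 rows in the first region and a limit of 100, the fixed-limit variant returns 199 rows (LIMIT*2-1), while the variant that asks only for the missing rows returns exactly 100.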

JM

2013/1/30, Anoop Sam John <anoopsj@huawei.com>:
> @Anil
>
>>I could not understand why it goes to multiple regionservers in
> parallel. Why can it not guarantee results <= page size (my guess: due to
> multiple RS scans)? If you have used it, then maybe you can explain the
> behaviour?
>
> A scan from the client side never goes to multiple RSs in parallel. A scan
> through the HTable API is sequential, one region after the other. For every
> region it opens a scanner on the RS and makes next() calls. The filter is
> instantiated on the server side, per region ...
>
> Say you need 100 rows in the page, you create a Scan on the client side
> with the filter, and there are 2 regions. First the scanner is opened for
> region1 and the scan runs there. It will ensure that at most 100 rows are
> retrieved from that region. But when the region boundary is crossed and the
> client automatically opens a scanner for region2, it passes the filter with
> max 100 rows there as well, so at most 100 rows can come from there too. So
> overall, on the client side we cannot guarantee that the scan will return
> only 100 rows in total from the table.
>
> I hope that makes it clear. I have not used PageFilter at all.. I am just
> explaining based on my knowledge of the scan flow and general filter usage.
>
> "This is because the filter is applied separately on different region
> servers. It does however optimize the scan of individual HRegions by making
> sure that the page size is never exceeded locally. "
>
> I guess it should say "This is because the filter is applied
> separately on different regions".
>
> -Anoop-
>
> ________________________________________
> From: anil gupta [anilgupta84@gmail.com]
> Sent: Wednesday, January 30, 2013 1:33 PM
> To: user@hbase.apache.org
> Subject: Re: Pagination with HBase - getting previous page of data
>
> Hi Mohammad,
>
> You are most welcome to join the discussion. I have never used PageFilter,
> so I don't really have concrete input.
> I had a look at
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/PageFilter.html
> I could not understand why it goes to multiple regionservers in
> parallel. Why can it not guarantee results <= page size (my guess: due to
> multiple RS scans)? If you have used it, then maybe you can explain the
> behaviour?
>
> Thanks,
> Anil
>
> On Tue, Jan 29, 2013 at 7:32 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
>
>> I'm kinda hesitant to put my leg in between the pros ;) But does it sound
>> sane to use PageFilter for both rows and columns, with some additional
>> logic to handle the 'nth' page? It'll help us with both kinds of
>> paging.
>>
>> On Wednesday, January 30, 2013, Jean-Marc Spaggiari <
>> jean-marc@spaggiari.org>
>> wrote:
>> > Hi Anil,
>> >
>> > I think it really depends on the way you want to use the pagination.
>> >
>> > Do you need to be able to jump to page X? Are you ok if you miss a
>> > line or 2? Is your data growing fast? Or slowly? Is it ok if your
>> > page indexes are a day old? Do you need to paginate over 300 columns?
>> > Or just 1? Do you need to always have the exact same number of entries
>> > in each page?
>> >
>> > For my use case I need to be able to jump to page X, and I don't
>> > have any content. I have hundreds of millions of lines. Only the rowkey
>> > matters for me, and I'm fine if sometimes I have 50 entries displayed
>> > and sometimes only 45. So I'm thinking about calculating which row is
>> > the first one for each page, and storing that separately. Then I just
>> > need to run the MR daily.
>> >
>> > It's not a perfect solution, I agree, but this might do the job for me.
>> > I'm totally open to any other ideas which might do the job too.
>> >
>> > JM
>> >
>> > 2013/1/29, anil gupta <anilgupta84@gmail.com>:
>> >> Yes, your suggested solution only works for RowKey-based pagination.
>> >> It will fail when you start filtering on the basis of columns.
>> >>
>> >> Still, I would say it's comparatively easier to maintain this at the
>> >> application level rather than creating tables for pagination.
>> >>
>> >> What if you have 300 columns in your schema? Will you create 300
>> >> tables?
>> >> What about handling pagination when filtering is done on multiple
>> >> columns ("and" and "or" conditions)?
>> >>
>> >> On Tue, Jan 29, 2013 at 1:08 PM, Jean-Marc Spaggiari <
>> >> jean-marc@spaggiari.org> wrote:
>> >>
>> >>> No, no killer solution here ;)
>> >>>
>> >>> But I'm still thinking about that because I might have to implement
>> >>> some pagination options soon...
>> >>>
>> >>> As you are saying, it's only working on the row-key, but if you want
>> >>> to do the same-thing on non-rowkey, you might have to create a
>> >>> secondary index table...
>> >>>
>> >>> JM
>> >>>
>> >>> 2013/1/27, anil gupta <anilgupta84@gmail.com>:
>> >>> > That's alright.. I thought that you had come up with a killer
>> >>> > solution, so I got curious to hear your ideas. ;)
>> >>> > It seems like the solution you mention below will not work when
>> >>> > filtering on non-row-key columns, since you only consider the
>> >>> > rowkey when deciding the page numbers.
>> >>> >
>> >>> > Thanks,
>> >>> > Anil
>> >>> >
>> >>> > On Fri, Jan 25, 2013 at 6:58 PM, Jean-Marc Spaggiari <
>> >>> > jean-marc@spaggiari.org> wrote:
>> >>> >
>> >>> >> Hi Anil,
>> >>> >>
>> >>> >> I don't have a solution. I never thought about that ;) But I was
>> >>> >> thinking about something like this: you create a 2nd table where
>> >>> >> you place the row number (4 bytes), then the row key. To go
>> >>> >> directly to a specific page, you query by the number, find the
>> >>> >> key, and you know where to start your scan in the main table.
>> >>> >>
>> >>> >> The issue is probably the numbering of the lines, since with an MR
>> >>> >> job you don't know where you are from the beginning. But you can
>> >>> >> build something where you store the line number from the beginning
>> >>> >> of the region, then when all regions are parsed you can
>> >>> >> reconstruct the total numbering... That should work...
>> >>> >>
>> >>> >> JM
>> >>> >>
>> >>> >> 2013/1/25, anil gupta <anilgupta84@gmail.com>:
>> >>> >> > Inline...
>> >>> >> >
>> >>> >> > On Fri, Jan 25, 2013 at 9:17 AM, Jean-Marc Spaggiari <
>> >>> >> > jean-marc@spaggiari.org> wrote:
>> >>> >> >
>> >>> >> >> Hi Anil,
>> >>> >> >>
>> >>> >> >> The issue is that the start of every subsequent page should
>> >>> >> >> be moved too...
>> >>> >> >>
>> >>> >> > Yes, this is a possibility. Hence the developer has to take care
>> >>> >> > of this case. It might also be possible that the pageSize is not
>> >>> >> > a hard limit on the number of results (more like a hint or
>> >>> >> > suggestion on size). I would say it varies by use case.
>> >>> >> >
>> >>> >> >>
>> >>> >> >> so if you want to jump directly to page n, you might be totally
>> >>> >> >> shifted because of all the data inserted in the meantime...
>> >>> >> >>
>> >>> >> >> If you want a real, complete pagination feature, you might want
>> >>> >> >> to have a coprocessor or an MR job updating another table
>> >>> >> >> referring to the pages....
>> >>> >> >>
>> >>> >> > Well, the solution depends on the use case. I will be doing
>> >>> >> > pagination
>> >
>>
>> --
>> Warm Regards,
>> Tariq
>> https://mtariq.jux.com/
>> cloudfront.blogspot.com
>>
>
>
>
> --
> Thanks & Regards,
> Anil Gupta
