hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: How to get Last 1000 records from 1 millions records
Date Thu, 25 Aug 2016 11:33:41 GMT
A proper data structure on the client side can avoid sorting.

For example:
https://docs.oracle.com/javase/7/docs/api/java/util/LinkedList.html#addFirst(E)
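For instance, if a reversed scan delivers rows newest-first, prepending each row with addFirst() rebuilds ascending key order with no explicit sort. A minimal stand-alone sketch (plain Java, not tied to the HBase API):

```java
import java.util.LinkedList;
import java.util.List;

public class AddFirstDemo {
    /** Rebuilds ascending order from rows delivered newest-first. */
    static List<String> ascendingFromReversed(String[] reversedOrder) {
        LinkedList<String> rows = new LinkedList<>();
        for (String row : reversedOrder) {
            rows.addFirst(row);  // prepend: the newest row ends up last
        }
        return rows;
    }

    public static void main(String[] args) {
        // Rows as a reversed scan would deliver them: largest key first.
        String[] reversed = {"A_9811111111_108", "A_9811111111_107", "A_9811111111_106"};
        System.out.println(ascendingFromReversed(reversed));
        // [A_9811111111_106, A_9811111111_107, A_9811111111_108]
    }
}
```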

> On Aug 25, 2016, at 2:45 AM, ramkrishna vasudevan <ramkrishna.s.vasudevan@gmail.com> wrote:
> 
> And reading through the mail chain: as Ted suggested, if you set the scan
> to reversed and swap your stop and start rows, you can simply keep a count
> in your row filter until 10k is reached and then skip all the remaining
> results.
> 
> In the other approach that I described, you may have to sort before
> returning the collected results. In the reverse-scan case too, if you need
> the results in lexicographical order you may need to sort them on the
> client side.
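The HBase calls themselves (Scan.setReversed(true) with swapped start/stop rows) need a live cluster, but the counting idea can be sketched stand-alone: in a reversed scan the newest rows arrive first, so accepting only the first N rows yields the last N records. The class below is illustrative, not an HBase Filter implementation; in practice the built-in PageFilter does something similar, though it limits per region server, so the client may still need a final trim.

```java
/**
 * Stand-alone sketch of the counting idea from the thread: accept the
 * first `limit` rows seen, skip everything after. The name and shape
 * are made up for illustration.
 */
public class RowLimiter {
    private final int limit;
    private int seen = 0;

    public RowLimiter(int limit) { this.limit = limit; }

    /** Returns true while the row should be kept, false once the limit is hit. */
    public boolean accept(String rowKey) {
        return seen++ < limit;
    }
}
```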
> 
> Regards
> Ram
> 
> On Thu, Aug 25, 2016 at 3:11 PM, ramkrishna vasudevan <ramkrishna.s.vasudevan@gmail.com> wrote:
> 
>> Hi Manjeet
>> 
>> For your first question regarding fetching last 1000 records
>> 
>> First, in your scan, set your start row to the bytes corresponding to
>> (A_9811111111_) and let the end row be the byte representation of
>> (A_9811111111_) + 1, i.e. add 1 to the last byte of (A_9811111111_). This
>> will ensure you scan only the rows corresponding to (A_9811111111_).
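The "+1 on the last byte" arithmetic can be sketched in plain Java. The helper name below is made up (HBase ships similar internal logic); note the carry needed when trailing bytes are 0xFF:

```java
import java.util.Arrays;

public class StopRowUtil {
    /**
     * Returns the smallest row key strictly greater than every key that
     * starts with `prefix`, by incrementing the last byte and carrying
     * past any trailing 0xFF bytes. Illustrative helper, not an HBase API.
     */
    static byte[] exclusiveStopRow(byte[] prefix) {
        byte[] stop = Arrays.copyOf(prefix, prefix.length);
        for (int i = stop.length - 1; i >= 0; i--) {
            if (stop[i] != (byte) 0xFF) {
                stop[i]++;
                return Arrays.copyOf(stop, i + 1);  // drop carried-over tail
            }
        }
        // All bytes were 0xFF: no finite exclusive upper bound; scan to table end.
        return new byte[0];
    }

    public static void main(String[] args) {
        byte[] stop = exclusiveStopRow("A_9811111111_".getBytes());
        // '_' is 0x5F, so the stop row ends in 0x60 ('`'): every key with
        // the prefix "A_9811111111_" sorts strictly below it.
        System.out.println(new String(stop));
    }
}
```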
>> 
>> Just thinking aloud, the first thing I can see is that it may be easier to
>> do this with coprocessors (CPs) than with filters, because filters deal
>> with individual cells or rows. Accumulating the results and maintaining
>> the last 10k records may be difficult. I would have to look at it in detail.
>> 
>> Do you know the number of columns you have? If there are multiple columns
>> then it is quite tricky. But if you have only one column per row, or you
>> want only the row keys:
>> 
>> You can implement a user coprocessor and in it implement
>> preStoreScannerOpen(). Say, for example, you have only one family; in
>> preStoreScannerOpen() you create your own StoreScanner, and in
>> StoreScanner.next() you skip all KeyValues while collecting the cells.
>> Ensure you collect the cells row-wise by adding them to a list, keeping
>> only the latest 10000 cells in the list at any time.
>> 
>> Every time, check whether the row has reached the stopRow that is set in
>> the scan (so maybe it moves to A_9811111112_). Once you see this
>> condition, replace the list given by the StoreScanner.next() call with the
>> list you have collected and send it to the client. I have not yet tried
>> it, but it should give you an idea of what is possible with CPs.
>> 
>> With filters I am not sure, as I said, since I would need to read through
>> the flow and see whether there are any APIs that can mimic the above.
>> 
>> PS: Don't take this as a working algorithm. There may be reasons why it
>> won't work, but you can read about CPs and see whether something like the
>> above can work out.
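The "keep only the latest 10000 cells" bookkeeping in the coprocessor idea amounts to a sliding window. A stdlib-only sketch of just that part (the coprocessor wiring via preStoreScannerOpen and StoreScanner.next is omitted):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/**
 * Sliding window of the latest N items, as the coprocessor sketch above
 * would maintain while the scanner walks cells. Illustrative only.
 */
public class LastNBuffer<T> {
    private final int capacity;
    private final Deque<T> window = new ArrayDeque<>();

    public LastNBuffer(int capacity) { this.capacity = capacity; }

    public void add(T cell) {
        if (window.size() == capacity) {
            window.pollFirst();  // drop the oldest to stay at N
        }
        window.addLast(cell);
    }

    /** The latest N items, oldest first. */
    public List<T> snapshot() { return new ArrayList<>(window); }
}
```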
>> 
>> Regards
>> Ram
>> 
>> 
>> 
>> 
>> On Thu, Aug 25, 2016 at 2:16 PM, Manjeet Singh <manjeet.chandhok@gmail.com> wrote:
>> 
>>> Hi All
>>> 
>>> I have another question for the same case.
>>> 
>>> Below is my sample HBase data. As we all know, HBase stores data sorted
>>> by rowkey. The keys below are IPs; as you can see, 2.168.129.81_1 comes
>>> last, whereas I expected it to come just after 1.168.129.81_2.
>>> 
>>> 
>>> 
>>> 1.168.129.81_0
>>> column=c2:D_com.stackoverflow/questions/4, timestamp=1472104396288,
>>> value=4
>>> 1.168.129.81_1
>>> column=c2:D_com.stackoverflow/questions/1, timestamp=1472104396288,
>>> value=1
>>> 1.168.129.81_1
>>> column=c2:D_com.stackoverflow/questions/2, timestamp=1472104396288,
>>> value=2
>>> 1.168.129.81_2
>>> column=c2:D_com.stackoverflow/questions/0, timestamp=1472104396288,
>>> value=0
>>> 192.168.129.81_1
>>> column=c2:D_com.stackoverflow/questions/2, timestamp=1472104386671,
>>> value=2
>>> 192.168.129.81_1
>>> column=c2:D_com.stackoverflow/questions/4, timestamp=1472104386671,
>>> value=4
>>> 192.168.129.81_2
>>> column=c2:D_com.stackoverflow/questions/1, timestamp=1472104386671,
>>> value=1
>>> 192.168.129.81_3
>>> column=c2:D_com.stackoverflow/questions/0, timestamp=1472104386671,
>>> value=0
>>> 192.168.129.81_3
>>> column=c2:D_com.stackoverflow/questions/3, timestamp=1472104386671,
>>> value=3
>>> 2.168.129.81_1
>>> column=c2:D_com.stackoverflow/questions/0, timestamp=1472104404609,
>>> value=0
>>> 2.168.129.81_1
>>> column=c2:D_com.stackoverflow/questions/1, timestamp=1472104404609,
>>> value=1
>>> 2.168.129.81_1
>>> column=c2:D_com.stackoverflow/questions/2, timestamp=1472104404609,
>>> value=2
>>> 2.168.129.81_3
>>> column=c2:D_com.stackoverflow/questions/4, timestamp=1472104404609,
>>> value=4
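The order above is expected: HBase compares row keys byte by byte, so "192.168..." sorts before "2.168..." because '1' < '2', even though 192 > 2 numerically. One common fix (an illustration, not the only option) is to zero-pad each octet so byte order matches numeric order:

```java
import java.util.Arrays;

public class IpKeyOrder {
    /** Zero-pads each octet to 3 digits so byte order matches numeric order. */
    static String padIp(String ip) {
        StringBuilder sb = new StringBuilder();
        for (String octet : ip.split("\\.")) {
            if (sb.length() > 0) sb.append('.');
            sb.append(String.format("%03d", Integer.parseInt(octet)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] raw = {"2.168.129.81", "192.168.129.81", "1.168.129.81"};
        Arrays.sort(raw);  // lexicographic: 1.168... < 192.168... < 2.168...
        System.out.println(Arrays.toString(raw));

        String[] padded = {padIp("2.168.129.81"), padIp("192.168.129.81"), padIp("1.168.129.81")};
        Arrays.sort(padded);  // numeric order restored: 001... < 002... < 192...
        System.out.println(Arrays.toString(padded));
    }
}
```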
>>> 
>>> 
>>> 
>>> On Thu, Aug 25, 2016 at 12:36 PM, Manjeet Singh <manjeet.chandhok@gmail.com> wrote:
>>> 
>>>> I am using a logical salt: I have a mobile number in my row key, and I
>>>> use an algorithm to map the mobile number to some ASCII character. Each
>>>> time, I know what the salt will be, so it is deterministic and never
>>>> changes the order. For example, if my algorithm gives A for 9811111111,
>>>> then it will always return A for 9811111111. So my row keys look like:
>>>> A_9811111111_101
>>>> A_9811111111_102
>>>> A_9811111111_103
>>>> A_9811111111_104
>>>> A_9811111111_105
>>>> A_9811111111_106
>>>> A_9811111111_107
>>>> A_9811111111_108
>>>> 
>>>> It will sort my row keys in the same manner as shown above. Now there
>>>> are millions of records and I want to get the last 10000. Is there any
>>>> way to do this? My concern is to perform all calculation on the server
>>>> side, not the client side.
>>>> 
>>>> 
>>>> Thanks
>>>> Manjeet
>>>> 
>>>> 
>>>> On Thu, Aug 25, 2016 at 1:06 AM, Esteban Gutierrez <esteban@cloudera.com> wrote:
>>>> 
>>>>> As long as new rows are added to the latest region that "might" work.
>>>>> But if the table is using hashed keys or rows are added randomly to the
>>>>> table, then retrieving the last million will be trickier and you will
>>>>> have to scan based on timestamp (if not modified) and then filter one
>>>>> more time.
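With hashed or random keys, row order says nothing about recency, so selection has to be by timestamp. A stdlib-only sketch of the "select by time, then take the latest N" step (the Row type is illustrative; against a real cluster you would use Scan.setTimeRange and read cell timestamps):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Picks the N most recent rows by timestamp, ignoring key order entirely. */
public class LatestByTimestamp {
    static final class Row {
        final String key;
        final long ts;
        Row(String key, long ts) { this.key = key; this.ts = ts; }
    }

    /** Returns the keys of the `n` most recent rows. */
    static List<String> latestKeys(List<Row> rows, int n) {
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparingLong((Row r) -> r.ts).reversed());
        List<String> out = new ArrayList<>();
        for (int i = 0; i < Math.min(n, sorted.size()); i++) {
            out.add(sorted.get(i).key);
        }
        return out;
    }
}
```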
>>>>> 
>>>>> esteban.
>>>>> 
>>>>> 
>>>>> --
>>>>> Cloudera, Inc.
>>>>> 
>>>>> 
>>>>>> On Wed, Aug 24, 2016 at 12:31 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>>>>>> 
>>>>>> The following API should help in your case:
>>>>>> 
>>>>>>   public Scan setReversed(boolean reversed)
>>>>>> 
>>>>>> Cheers
>>>>>> 
>>>>>> On Wed, Aug 24, 2016 at 12:05 PM, Manjeet Singh <
>>>>>> manjeet.chandhok@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi all
>>>>>>> 
>>>>>>> HBase doesn't provide sorting on columns, but rowkeys are stored in
>>>>>>> sorted form, smallest value first and greatest value last,
>>>>>>> 
>>>>>>> example
>>>>>>> 1
>>>>>>> 2
>>>>>>> 3
>>>>>>> 4
>>>>>>> 5
>>>>>>> 6
>>>>>>> 7
>>>>>>> and so on
>>>>>>> 
>>>>>>> Assume I have 1 million records but I want to look at the last 1000
>>>>>>> records only. Is there any way to do this? I don't want to perform
>>>>>>> any calculation on the client side, so maybe a filter can help?
>>>>>>> 
>>>>>>> Thanks
>>>>>>> Manjeet
>>>>>>> 
>>>>>>> --
>>>>>>> luv all
>>>> 
>>>> 
>>>> 
>>>> --
>>>> luv all
>>> 
>>> 
>>> 
>>> --
>>> luv all
>> 
>> 
