hbase-user mailing list archives

From Oleg Ruchovets <oruchov...@gmail.com>
Subject Re: how to model data based on "time bucket"
Date Wed, 30 Jan 2013 09:57:04 GMT
Hi Rodrigo,
    Regarding the solution with 2 tables (one main table and one index table):
I have ~50 million records. In my case I need to scan the whole table, so I
would end up issuing 50 million scans, which would kill performance.

Is there any other approach to model my use case using HBase?

Thanks
Oleg.


On Mon, Jan 28, 2013 at 6:27 PM, Rodrigo Ribeiro <
rodriguinho@jusbrasil.com.br> wrote:

> In the approach that I mentioned, you would need a table to retrieve the
> time of a certain event (if this information can be retrieved in another way,
> you may ignore this table). It would be like the one you posted:
> event_id | time
> =============
> event1 | 10:07
> event2 | 10:10
> event3 | 10:12
> event4 | 10:20
>
> And a secondary table would be like:
> rowkey
> ===========
> 10:07:event1
> 10:10:event2
> 10:12:event3
> 10:20:event4
>
> That way, for your first example, you first retrieve the time of
> "event1" from the main table, and then scan starting from its position in the
> secondary table ("10:07:event1") until the end of the window.
> In this case (T=7) the scan range will be ["10:07:event1", "10:15").
>
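For illustration, the two-step lookup described above could be sketched as
follows. This is a minimal in-memory model, not HBase client code: a sorted
Python list stands in for the lexicographically ordered rowkeys of the
secondary table, and the names main_table, index_rows, stop_time, range_scan
and events_in_window are hypothetical. It assumes zero-padded HH:MM times (so
that string order matches time order) and treats the window as inclusive of
minute start+T, which yields the exclusive stop key "10:15" for a start of
10:07 and T=7.

```python
from bisect import bisect_left
from datetime import datetime, timedelta

# Main table: event_id -> time. Times are zero-padded HH:MM (an assumption),
# so comparing them as strings gives the same order as comparing them as times.
main_table = {
    "event1": "10:07",
    "event2": "10:10",
    "event3": "10:12",
    "event4": "10:20",
}

# Secondary (index) table: rowkeys "{time}:{event_id}", kept sorted the way
# HBase keeps rowkeys sorted.
index_rows = sorted(f"{t}:{e}" for e, t in main_table.items())


def stop_time(start, minutes):
    """Exclusive stop time: the first minute after the inclusive window.

    Ignores midnight rollover; real keys would need a full date or timestamp.
    """
    end = datetime.strptime(start, "%H:%M") + timedelta(minutes=minutes + 1)
    return end.strftime("%H:%M")


def range_scan(rows, start_key, stop_key):
    """Return rows in the half-open range [start_key, stop_key)."""
    lo = bisect_left(rows, start_key)
    hi = bisect_left(rows, stop_key)
    return rows[lo:hi]


def events_in_window(event_id, minutes):
    """Step 1: look up the event time. Step 2: range-scan the index."""
    start = main_table[event_id]
    start_key = f"{start}:{event_id}"
    return range_scan(index_rows, start_key, stop_time(start, minutes))


print(events_in_window("event1", 7))
```

With a real cluster, the dictionary lookup would correspond to a Get on the
main table and range_scan to a Scan with a start row and a stop row on the
secondary table.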
> As Michel Segel mentioned, there is a hotspot problem on insertion when using
> this approach alone.
> Using multiple buckets (the bucket could be a hash of the eventId) would
> distribute the writes better, but requires scanning all buckets of the second
> table to get all events in the time window.
>
> Assuming you use 3 buckets, it would look like:
> rowkey
> ===========
> 1_10:07:event1
> 2_10:10:event2
> 3_10:12:event3
> 2_10:20:event4
>
> The scans would be: ["1_10:07:event1", "1_10:15"), ["2_10:07:event1",
> "2_10:15"), and ["3_10:07:event1", "3_10:15"). You can then combine the
> results.
>
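The bucketed variant could be sketched along the same lines. Again this is an
in-memory model under stated assumptions, not HBase client code: a sorted list
stands in for the secondary table, the bucket is taken as a CRC32 hash of the
event id modulo 3 (the thread only says "a hash from the eventId", so the
choice of CRC32 is an assumption), and the names bucket_of, bucketed_key and
scan_all_buckets are hypothetical. Which bucket each event lands in will
therefore differ from the hand-written example above.

```python
from bisect import bisect_left
from zlib import crc32

NUM_BUCKETS = 3  # as in the example above


def bucket_of(event_id):
    """Deterministic bucket number in 1..NUM_BUCKETS (CRC32 is an assumption)."""
    return crc32(event_id.encode()) % NUM_BUCKETS + 1


def bucketed_key(time, event_id):
    """Rowkey of the form "{bucket}_{time}:{event_id}"."""
    return f"{bucket_of(event_id)}_{time}:{event_id}"


events = {
    "event1": "10:07",
    "event2": "10:10",
    "event3": "10:12",
    "event4": "10:20",
}

# Secondary table with bucket-prefixed rowkeys, kept sorted like HBase rowkeys.
index_rows = sorted(bucketed_key(t, e) for e, t in events.items())


def scan_all_buckets(rows, start_time, event_id, stop_time):
    """Run one half-open range scan per bucket, then combine the results."""
    hits = []
    for bucket in range(1, NUM_BUCKETS + 1):
        start_key = f"{bucket}_{start_time}:{event_id}"
        stop_key = f"{bucket}_{stop_time}"
        lo = bisect_left(rows, start_key)
        hi = bisect_left(rows, stop_key)
        hits.extend(rows[lo:hi])
    # Drop the bucket prefix and restore global time order.
    return sorted(key.split("_", 1)[1] for key in hits)


print(scan_all_buckets(index_rows, "10:07", "event1", "10:15"))
```

The trade-off is the one described above: writes are spread over several key
ranges, at the cost of one scan per bucket and a merge step on the client.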
> Hope it helps.
>
> On Mon, Jan 28, 2013 at 12:49 PM, Oleg Ruchovets <oruchovets@gmail.com>
> wrote:
>
> > Hi Rodrigo.
> >   Can you please explain your solution in more detail? You said that I will
> > have another table. How many tables will I have? Will I have 2 tables? What
> > will be the schema of the tables?
> >
> > Let me try to explain what I am trying to achieve:
> >     I have ~50 million records like {time|event}. I want to store the data
> > in HBase in such a way that, for an event at time X, I can get all events
> > that occurred after it within T minutes (for example, within 7 minutes).
> > So I will be able to scan the whole table and get groups like:
> >
> >   {event1:10:02} corresponds to events {event2:10:03} , {event3:10:05} ,
> > {event4:10:06}
> >   {event2:10:30} corresponds to events {event5:10:32} , {event3:10:33} ,
> > {event3:10:36}.
> >
> > Thanks
> > Oleg.
> >
> >
> > On Mon, Jan 28, 2013 at 5:17 PM, Rodrigo Ribeiro <
> > rodriguinho@jusbrasil.com.br> wrote:
> >
> > > You can use another table as an index, using a rowkey like
> > > '{time}:{event_id}', and then scan in the range ["10:07", "10:15").
> > >
> > > On Mon, Jan 28, 2013 at 10:06 AM, Oleg Ruchovets <oruchovets@gmail.com>
> > > wrote:
> > >
> > > > Hi ,
> > > >
> > > > I have the following row data structure:
> > > >
> > > > event_id | time
> > > > =============
> > > > event1 | 10:07
> > > > event2 | 10:10
> > > > event3 | 10:12
> > > >
> > > > event4 | 10:20
> > > > event5 | 10:23
> > > > event6 | 10:25
> > > >
> > > >
> > > > The number of records is 50-100 million.
> > > >
> > > >
> > > > Question:
> > > >
> > > > I need to find the group of events that starts from eventX and falls
> > > > within the time window (bucket) T.
> > > >
> > > >
> > > > For example, if T = 7 minutes:
> > > > Starting from event1, {event1, event2, event3} were detected within
> > > > 7 minutes.
> > > >
> > > > Starting from event2, {event2, event3} were detected within 7
> > > > minutes.
> > > >
> > > > Starting from event4, {event4, event5, event6} were detected within
> > > > 7 minutes.
> > > > Is there a way to model the data in HBase to get this?
> > > >
> > > > Thanks
> > > >
> > >
> > >
> > >
> > > --
> > >
> > > *Rodrigo Pereira Ribeiro*
> > > Software Developer
> > > www.jusbrasil.com.br
> > >
> >
>
>
>
> --
>
> *Rodrigo Pereira Ribeiro*
> Software Developer
> T (71) 3033-6371
> C (71) 8612-5847
> rodriguinho@jusbrasil.com.br
> www.jusbrasil.com.br
>
