hbase-user mailing list archives

From Oleg Ruchovets <oruchov...@gmail.com>
Subject Re: how to model data based on "time bucket"
Date Mon, 28 Jan 2013 17:45:52 GMT
I think I didn't explain it correctly.
    I want to read from 2 tables in the context of 1 MapReduce job.
I mean I want to read one key from the main table and scan a range from
another in the same MapReduce job. I only found MultiTableOutputFormat, and
there is no MultiTableInputFormat. Is there any workaround to read from 2
tables in one MapReduce job?
   By the way, I can use bulk loading to prevent hotspots, and it gives the
capability of fast scans.
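
A minimal sketch of one such workaround for reading 2 tables in one MapReduce
job: drive the job with TableInputFormat over the main table as usual, and
open the second table with an HTable inside the mapper, doing the range scan
per input record. The table names ("events", "event_index"), the column
"d:time", and the HH:mm row keys are assumptions for illustration only.

import java.io.IOException;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

public class TimeWindowMapper extends TableMapper<Text, Text> {

  private static final int WINDOW_MINUTES = 7;   // T
  private HTable indexTable;                     // the second table, opened by hand

  @Override
  protected void setup(Context context) throws IOException {
    // TableInputFormat feeds us the main table; the index table is opened here.
    indexTable = new HTable(context.getConfiguration(), "event_index");
  }

  @Override
  protected void map(ImmutableBytesWritable key, Result row, Context context)
      throws IOException, InterruptedException {
    String eventId = Bytes.toString(key.get(), key.getOffset(), key.getLength());
    String time = Bytes.toString(row.getValue(Bytes.toBytes("d"), Bytes.toBytes("time")));

    // Range scan on the index table: ["{time}:{eventId}", "{time + T}")
    Scan scan = new Scan(Bytes.toBytes(time + ":" + eventId),
                         Bytes.toBytes(plusMinutes(time, WINDOW_MINUTES)));
    ResultScanner scanner = indexTable.getScanner(scan);
    try {
      for (Result r : scanner) {
        context.write(new Text(eventId), new Text(Bytes.toString(r.getRow())));
      }
    } finally {
      scanner.close();
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    indexTable.close();
  }

  // "10:07" + 7 -> "10:14"
  private static String plusMinutes(String hhmm, int minutes) {
    String[] parts = hhmm.split(":");
    int total = Integer.parseInt(parts[0]) * 60 + Integer.parseInt(parts[1]) + minutes;
    return String.format("%02d:%02d", (total / 60) % 24, total % 60);
  }
}

The driver would then wire the mapper up with
TableMapReduceUtil.initTableMapperJob("events", new Scan(),
TimeWindowMapper.class, Text.class, Text.class, job).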

Thanks
Oleg.


On Mon, Jan 28, 2013 at 7:24 PM, Rodrigo Ribeiro <
rodriguinho@jusbrasil.com.br> wrote:

> Yes, it's possible.
> Check this solution:
>
> http://stackoverflow.com/questions/11353911/extending-hadoops-tableinputformat-to-scan-with-a-prefix-used-for-distribution
>
> On Mon, Jan 28, 2013 at 2:07 PM, Oleg Ruchovets <oruchovets@gmail.com> wrote:
>
> >  Yes.
> > This is a very interesting approach.
> >
> >        Is it possible to read the main key and scan from the other table
> > using map/reduce? I don't want to read from a single client. I use HBase
> > version 0.94.2.21.
> >
> > Thanks
> > Oleg.
> >
> >
> > On Mon, Jan 28, 2013 at 6:27 PM, Rodrigo Ribeiro <
> > rodriguinho@jusbrasil.com.br> wrote:
> >
> > > In the approach that I mentioned, you would need a table to retrieve
> > > the time of a certain event (if this information can be retrieved in
> > > another way, you may ignore this table). It would be like you posted:
> > > event_id | time
> > > =============
> > > event1 | 10:07
> > > event2 | 10:10
> > > event3 | 10:12
> > > event4 | 10:20
> > >
> > > And a secondary table would be like:
> > > rowkey
> > > ===========
> > > 10:07:event1
> > > 10:10:event2
> > > 10:12:event3
> > > 10:20:event4
> > >
> > > That way, for your first example, you first retrieve the time of
> > > "event1" from the main table, and then scan starting from its position
> > > in the secondary table ("10:07:event1") until the end of the window.
> > > In this case (T=7) the scan will range over ["10:07:event1", "10:15").
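> > >
> > > A minimal client-side sketch of this lookup-then-scan, assuming
> > > hypothetical table names "events" (main) and "event_index" (secondary)
> > > and the event time stored in a column "d:time":
> > >
> > > import org.apache.hadoop.conf.Configuration;
> > > import org.apache.hadoop.hbase.HBaseConfiguration;
> > > import org.apache.hadoop.hbase.client.Get;
> > > import org.apache.hadoop.hbase.client.HTable;
> > > import org.apache.hadoop.hbase.client.Result;
> > > import org.apache.hadoop.hbase.client.ResultScanner;
> > > import org.apache.hadoop.hbase.client.Scan;
> > > import org.apache.hadoop.hbase.util.Bytes;
> > >
> > > public class WindowLookup {
> > >   public static void main(String[] args) throws Exception {
> > >     Configuration conf = HBaseConfiguration.create();
> > >     HTable events = new HTable(conf, "events");
> > >     HTable index = new HTable(conf, "event_index");
> > >
> > >     // 1. Retrieve the time of "event1" from the main table.
> > >     Result r = events.get(new Get(Bytes.toBytes("event1")));
> > >     String time = Bytes.toString(r.getValue(Bytes.toBytes("d"),
> > >         Bytes.toBytes("time")));                        // "10:07"
> > >
> > >     // 2. Scan the secondary table over ["10:07:event1", "10:15").
> > >     Scan scan = new Scan(Bytes.toBytes(time + ":event1"),
> > >         Bytes.toBytes("10:15"));
> > >     ResultScanner scanner = index.getScanner(scan);
> > >     for (Result row : scanner) {
> > >       System.out.println(Bytes.toString(row.getRow())); // e.g. 10:10:event2
> > >     }
> > >     scanner.close();
> > >     events.close();
> > >     index.close();
> > >   }
> > > }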
> > >
> > > As Michel Segel mentioned, there is a hotspot problem on insertion
> > > using this approach alone.
> > > Using multiple buckets (e.g., a hash of the eventId) would distribute
> > > the load better, but it requires scanning all buckets of the second
> > > table to get all events in the time window.
> > >
> > > Assuming you use 3 buckets, it would look like:
> > > rowkey
> > > ===========
> > > 1_10:07:event1
> > > 2_10:10:event2
> > > 3_10:12:event3
> > > 2_10:20:event4
> > >
> > > The scans would be: ["1_10:07:event1", "1_10:15"),
> > > ["2_10:07:event1", "2_10:15"), and ["3_10:07:event1", "3_10:15");
> > > you can then combine the results.
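> > >
> > > A minimal sketch of those three scans, assuming the same hypothetical
> > > "event_index" table with "{bucket}_{HH:mm}:{event_id}" row keys; each
> > > bucket is scanned separately and the results are merged:
> > >
> > > import java.util.ArrayList;
> > > import java.util.List;
> > >
> > > import org.apache.hadoop.conf.Configuration;
> > > import org.apache.hadoop.hbase.HBaseConfiguration;
> > > import org.apache.hadoop.hbase.client.HTable;
> > > import org.apache.hadoop.hbase.client.Result;
> > > import org.apache.hadoop.hbase.client.ResultScanner;
> > > import org.apache.hadoop.hbase.client.Scan;
> > > import org.apache.hadoop.hbase.util.Bytes;
> > >
> > > public class BucketedWindowScan {
> > >   public static void main(String[] args) throws Exception {
> > >     Configuration conf = HBaseConfiguration.create();
> > >     HTable index = new HTable(conf, "event_index");
> > >     List<String> found = new ArrayList<String>();
> > >
> > >     // One scan per bucket, all covering the same time window.
> > >     for (int bucket = 1; bucket <= 3; bucket++) {
> > >       Scan scan = new Scan(Bytes.toBytes(bucket + "_10:07:event1"),
> > >                            Bytes.toBytes(bucket + "_10:15"));
> > >       ResultScanner scanner = index.getScanner(scan);
> > >       for (Result r : scanner) {
> > >         // Strip the "{bucket}_" prefix before combining the results.
> > >         found.add(Bytes.toString(r.getRow()).substring(2));
> > >       }
> > >       scanner.close();
> > >     }
> > >     index.close();
> > >     System.out.println(found);
> > >   }
> > > }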
> > >
> > > Hope it helps.
> > >
> > > On Mon, Jan 28, 2013 at 12:49 PM, Oleg Ruchovets <oruchovets@gmail.com> wrote:
> > >
> > > > Hi Rodrigo.
> > > >   Can you please explain your solution in more detail? You said that
> > > > I will have another table. How many tables will I have? Will I have
> > > > 2 tables? What will be the schema of the tables?
> > > >
> > > > I'll try to explain what I'm trying to achieve:
> > > >     I have ~50 million records like {time|event}. I want to put the
> > > > data in HBase in such a way that, for an event at time X, I get all
> > > > events that occurred after event X during T minutes (for example,
> > > > during 7 minutes).
> > > > So I will be able to scan the whole table and get groups like:
> > > >
> > > >   {event1:10:02} corresponds to events {event2:10:03}, {event3:10:05},
> > > > {event4:10:06}
> > > >   {event2:10:30} corresponds to events {event5:10:32}, {event3:10:33},
> > > > {event3:10:36}.
> > > >
> > > > Thanks
> > > > Oleg.
> > > >
> > > >
> > > > On Mon, Jan 28, 2013 at 5:17 PM, Rodrigo Ribeiro <
> > > > rodriguinho@jusbrasil.com.br> wrote:
> > > >
> > > > > You can use another table as an index, using a rowkey like
> > > > > '{time}:{event_id}', and then scan in the range ["10:07", "10:15").
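> > > > >
> > > > > A minimal sketch of writing one row into such an index table,
> > > > > assuming a hypothetical table name "event_index" and a placeholder
> > > > > column "d:e" (the row key alone carries the information):
> > > > >
> > > > > import org.apache.hadoop.conf.Configuration;
> > > > > import org.apache.hadoop.hbase.HBaseConfiguration;
> > > > > import org.apache.hadoop.hbase.client.HTable;
> > > > > import org.apache.hadoop.hbase.client.Put;
> > > > > import org.apache.hadoop.hbase.util.Bytes;
> > > > >
> > > > > public class IndexWriter {
> > > > >   public static void main(String[] args) throws Exception {
> > > > >     Configuration conf = HBaseConfiguration.create();
> > > > >     HTable index = new HTable(conf, "event_index");
> > > > >     // Row key "{time}:{event_id}", e.g. event1 that happened at 10:07.
> > > > >     Put put = new Put(Bytes.toBytes("10:07:event1"));
> > > > >     put.add(Bytes.toBytes("d"), Bytes.toBytes("e"), new byte[0]); // empty marker value
> > > > >     index.put(put);
> > > > >     index.close();
> > > > >   }
> > > > > }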
> > > > >
> > > > > On Mon, Jan 28, 2013 at 10:06 AM, Oleg Ruchovets <oruchovets@gmail.com> wrote:
> > > > >
> > > > > > Hi ,
> > > > > >
> > > > > > I have the following data structure:
> > > > > >
> > > > > > event_id | time
> > > > > > =============
> > > > > > event1 | 10:07
> > > > > > event2 | 10:10
> > > > > > event3 | 10:12
> > > > > >
> > > > > > event4 | 10:20
> > > > > > event5 | 10:23
> > > > > > event6 | 10:25
> > > > > >
> > > > > >
> > > > > > The number of records is 50-100 million.
> > > > > >
> > > > > >
> > > > > > Question:
> > > > > >
> > > > > > I need to find the group of events that, starting from eventX,
> > > > > > falls into a time window bucket = T.
> > > > > >
> > > > > > For example, if T=7 minutes:
> > > > > > Starting from event1, {event1, event2, event3} were detected
> > > > > > during 7 minutes.
> > > > > >
> > > > > > Starting from event2, {event2, event3} were detected during 7
> > > > > > minutes.
> > > > > >
> > > > > > Starting from event4, {event4, event5, event6} were detected
> > > > > > during 7 minutes.
> > > > > >
> > > > > > Is there a way to model the data in HBase to achieve this?
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > >
> > > > > *Rodrigo Pereira Ribeiro*
> > > > > Software Developer
> > > > > www.jusbrasil.com.br
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > >
> > > *Rodrigo Pereira Ribeiro*
> > > Software Developer
> > > T (71) 3033-6371
> > > C (71) 8612-5847
> > > rodriguinho@jusbrasil.com.br
> > > www.jusbrasil.com.br
> > >
> >
>
>
>
> --
>
> *Rodrigo Pereira Ribeiro*
> Software Developer
> www.jusbrasil.com.br
>
