hbase-user mailing list archives

From Leon Mergen <l...@solatis.com>
Subject Re: Splits and MapReduce
Date Tue, 15 May 2012 14:53:21 GMT
Hello Himanish,

Thanks for the advice. It looks like they are using a compound key of a
"metric id" in addition to the timestamp:

http://opentsdb.net/schema.html

This sounds like a good solution for their use case but, unfortunately, we
have a lot of MapReduce jobs which *only* filter based on the timestamp,
and thus would result in a big table scan. However, I did find this little
gem:

https://bugzilla.mozilla.org/show_bug.cgi?id=566340

It looks like the Mozilla Socorro project ran into a similar issue, and
they have chosen to use a salt for their row keys: prefix the timestamp
with the first digit of an OOID to ensure a certain amount of parallelism
when writing.
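
To make the idea concrete, here is a minimal sketch of how I understand the
salting scheme. It is not Socorro's actual code: the key layout, the 16
buckets (one per hex digit), and the function names salted_key and
scan_ranges are my own assumptions for illustration, and it uses plain byte
strings rather than the HBase client API.

```python
import struct

# Assumption: the record id starts with a hex digit, giving 16 salt buckets.
SALTS = "0123456789abcdef"


def salted_key(timestamp: int, record_id: str) -> bytes:
    """Build a row key: salt char + 8-byte big-endian timestamp + record id.

    Big-endian packing keeps byte-wise (lexicographic) order equal to
    numeric timestamp order within one salt bucket.
    """
    salt = record_id[0].lower()
    if salt not in SALTS:
        raise ValueError("record id must start with a hex digit")
    return salt.encode("ascii") + struct.pack(">Q", timestamp) + record_id.encode("ascii")


def scan_ranges(start_ts: int, stop_ts: int) -> list[tuple[bytes, bytes]]:
    """Return one [start_row, stop_row) pair per salt bucket.

    A time-range query over salted keys needs one bounded scan per bucket
    instead of a single scan, but none of them is a full table scan.
    """
    return [
        (s.encode("ascii") + struct.pack(">Q", start_ts),
         s.encode("ascii") + struct.pack(">Q", stop_ts))
        for s in SALTS
    ]


if __name__ == "__main__":
    key = salted_key(1337093601, "a3f9c2")          # hypothetical record id
    ranges = scan_ranges(1337040000, 1337126400)    # example epoch-second bounds
    hits = [r for r in ranges if r[0] <= key < r[1]]
    print(len(ranges), len(hits))
```

If this is right, writes spread over the 16 key prefixes, while a MapReduce
job filtering only on the timestamp would be fed 16 bounded row ranges
rather than the whole table.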

What are the thoughts of the experts here about this solution?


Regards,

Leon Mergen




On Tue, May 15, 2012 at 4:28 PM, Himanish Kushary <himanish@gmail.com> wrote:

> Hi,
>
> You could take a look into *OpenTSDB*. I think they are addressing some
> of the issues that you mention here.
>
> Thanks
>
>
> On Tue, May 15, 2012 at 10:09 AM, Leon Mergen <leon@solatis.com> wrote:
>
> > Hello all,
> >
> > We are currently orienting on HBase as a possible way to store our log
> > data in a structured way, and I want to verify a few things I was not
> > able to find online. Specifically, what we are trying to achieve:
> >
> >  * be able to quickly search for logs within a specific time range;
> >  * limit the amount of maps in our mapreduce jobs to only those areas
> > we're interested in.
> >
> > As I understand it, there is a tradeoff:
> >
> > * if you use a timestamp as a split key, be prepared for a tradeoff: a
> > single region server can become a hotspot. This is bad when writing data
> > at a high load;
> > * if we do not have the timestamp as the first key of the splitkeys, a
> > MapReduce job will have to do a TableScan and have a huge amount of maps.
> >
> > Is there a known solution / workaround for this problem that people have
> > used? Since our timespan queries are usually limited based on days, we
> > were considering adding a new table for each day, but that looked like a
> > bit of an ugly hack.
> >
> > Any ideas / suggestions about this ?
> >
> > Regards,
> >
> > Leon Mergen
> >
>
>
>
> --
> Thanks & Regards
> Himanish
>
