hbase-user mailing list archives

From T Vinod Gupta <tvi...@readypulse.com>
Subject Re: size and column count recommendations for rows in hbase
Date Tue, 10 Jan 2012 19:32:10 GMT
Thanks St.Ack and Kisalay.
In my case, I have primary users and people who interact with my primary
users; let's call them secondary users.
Kisalay, you are right, and I already have the primary user, metric name,
and timestamp in my row key. Did you mean having the secondary user also be
part of the row key, as the suffix? If yes, I might consider that.
St.Ack: yeah, I have all secondary users in the same CF. Even if I add new
CFs, most of the data is the secondary-user data, so it would all stack up
in the new CF.

Thanks


On Tue, Jan 10, 2012 at 11:20 AM, kisalay <kisalay@gmail.com> wrote:

> Would it make sense to convert your fat table into a tall table by keeping
> the source of the metric as part of the row key (maybe as the suffix)?
> For accessing all the metrics associated with a particular user, metric,
> and time, you would resort to a prefix match on your key.
> Also, all the keys for a particular user, metric, and time will sort
> adjacently, so they fall in the same or neighbouring regions.
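The tall-table layout suggested above can be sketched as follows. This is a minimal model, not HBase code: the `user|metric|day|source` key format and the sample ids are hypothetical, and a plain sorted list stands in for HBase's lexicographically sorted row keys.

```python
import bisect

# Hypothetical tall-table row keys: the primary user, metric name, and
# day form the prefix, and the secondary-user (source) id is the
# suffix -- one row per source instead of one column per source.
rows = sorted([
    "user42|mentions|20120110|alice",
    "user42|mentions|20120110|bob",
    "user42|mentions|20120111|carol",
    "user7|mentions|20120110|dave",
])

def prefix_scan(rows, prefix):
    # Model of an HBase prefix scan: rows are kept in sorted key
    # order, so every key sharing a prefix is contiguous and can be
    # read with a single bounded range scan.
    lo = bisect.bisect_left(rows, prefix)
    hi = bisect.bisect_left(rows, prefix + "\xff")
    return rows[lo:hi]

# All sources that mentioned user42 on 2012-01-10:
hits = prefix_scan(rows, "user42|mentions|20120110|")
print(hits)
```

Because the sort order groups every source for one (user, metric, day) together, the later aggregation job becomes a contiguous range scan rather than a read of one enormous row.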
>
>
>
> On Tue, Jan 10, 2012 at 11:41 PM, Stack <stack@duboce.net> wrote:
>
> > On Tue, Jan 10, 2012 at 3:17 AM, T Vinod Gupta <tvinod@readypulse.com>
> > wrote:
> > > I was scanning through different questions that people asked on this
> > > mailing list about choosing the right schema so that map reduce jobs
> > > can be run appropriately and hot regions avoided due to sequential
> > > accesses.
> > > Somewhere, I got the impression that it is OK for a row to have
> > > millions of columns and/or a large volume of data per region. But then
> > > my map reduce job to copy rows failed because the row size was too
> > > large (121MB). So now I am confused about what the recommended way is.
> > > Does it mean that the default region size and other configuration
> > > parameters need to be tweaked?
> > >
> >
> > Yeah, if you request all of the row, it's going to try and give it to
> > you even if there are millions of columns.  You can ask the scan to
> > give you back a bounded number of columns per iteration so that you
> > read through the big row a piece at a time.
> >
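A minimal sketch of the bounded-batch idea, with a plain list standing in for the wide row. In the real Java client this cap is set with Scan.setBatch(n), which limits how many columns come back per call to next(); everything else here is illustrative.

```python
def read_in_batches(columns, batch_size):
    # Model of a batched read over one very wide row: instead of
    # materializing every column at once (which is what blew past the
    # 121MB row in the question), hand back at most batch_size
    # columns per iteration.
    for i in range(0, len(columns), batch_size):
        yield columns[i:i + batch_size]

wide_row = ["mention:%d" % i for i in range(10)]
batches = list(read_in_batches(wide_row, 4))
print(len(batches))  # 3 batches, of sizes 4, 4, and 2
```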
> > > In my use case, my system receives lots of metrics for different
> > > users, and I need to maintain daily counters for each of them. It is
> > > at day granularity, not a typical TSD series. My row key has the user
> > > id and metric name as the prefix and the day timestamp as the suffix,
> > > and I keep incrementing the values. The scale issue happens because I
> > > also store information about the source of the metric, e.g. the id of
> > > the person who mentioned my user in a tweet. I am storing all that
> > > information in different columns of the same row. So the pattern here
> > > is variable: a million people can tweet about someone, and just 2
> > > people can tweet about someone else on a given day. Is it a bad idea
> > > to use columns here? I did it this way because it makes it easy for a
> > > different process to run later and aggregate information, such as
> > > listing all the people who mentioned my user during a given date
> > > range.
> > >
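The daily-counter pattern described in the quoted question could be modelled like this. The row-key layout and the `increment` helper are hypothetical stand-ins for HBase's HTable.incrementColumnValue, not the poster's actual schema.

```python
from collections import defaultdict

# Counters keyed by "user|metric|day": day granularity, so repeated
# events on the same day land on the same row key and just increment.
counters = defaultdict(int)

def increment(user, metric, day, delta=1):
    # Stand-in for HTable.incrementColumnValue(row, family, qualifier,
    # amount): build the row key from the user id, metric name, and
    # day timestamp, then bump the counter.
    key = "%s|%s|%s" % (user, metric, day)
    counters[key] += delta
    return counters[key]

increment("user42", "mentions", "20120110")
increment("user42", "mentions", "20120110")
increment("user42", "mentions", "20120111")
print(counters["user42|mentions|20120110"])  # 2
```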
> >
> > All in one column family?  Would it make sense to have more than one CF?
> >
> > St.Ack
> >
>
