hbase-user mailing list archives

From Stack <st...@duboce.net>
Subject Re: size and column count recommendations for rows in hbase
Date Tue, 10 Jan 2012 18:11:56 GMT
On Tue, Jan 10, 2012 at 3:17 AM, T Vinod Gupta <tvinod@readypulse.com> wrote:
> i was scanning through different questions that people asked in this
> mailing list regarding choosing the right schema so that map reduce jobs
> can be run appropriately and hot regions avoided due to sequential
> accesses.
> somewhere, i got the impression that it is ok for a row to have millions of
> columns and/or have large volume of data per region. but then my map reduce
> job to copy rows failed due to row size being too large (121MB). so now i
> am confused about whats the recommended way. does it mean that default
> region size and other configuration parameters need to be tweaked?

Yeah, if you request the whole row, it's going to try and give it to
you even if it has millions of columns.  You can ask the scan to give
you back a bounded number of columns per iteration so you read through
the big row a piece at a time.
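
Here is a rough sketch of what "a piece at a time" means. It does not talk
to a cluster; it only models splitting one wide row into bounded chunks.
In the HBase Java client the matching knob is Scan.setBatch(int), which
caps how many values come back per call to next() on the scanner. The
class name WideRowBatcher, the method batches, and the "mention:" column
names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class WideRowBatcher {

    /**
     * Splits the columns of one wide row into chunks of at most batchSize,
     * preserving order. This mimics how a batched scan hands back a large
     * row in several partial results instead of one huge one.
     */
    public static List<List<String>> batches(List<String> columns, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> out = new ArrayList<>();
        for (int start = 0; start < columns.size(); start += batchSize) {
            int end = Math.min(start + batchSize, columns.size());
            // Copy the slice so each chunk is independent of the source list.
            out.add(new ArrayList<>(columns.subList(start, end)));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical row: one column per person who mentioned the user.
        List<String> row = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            row.add("mention:" + i);
        }
        // Process the row four columns at a time.
        for (List<String> chunk : batches(row, 4)) {
            System.out.println(chunk.size() + " columns: " + chunk);
        }
    }
}
```

With a real scan, the client only ever holds one chunk of the row in
memory, so a 121MB row no longer has to fit in a single response.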

> in my use case, my system is receiving lots of metrics for different users
> and i need to maintain daily counters for each of them. it is at day
> granularity and not a typical TSD series. my row key has user id, metric
> name as prefix and day timestamp as suffix. and i keep incrementing the
> values. the scale issue happens because i store information about the
> source of the metric too. e.g. i store the id of the person who mentioned
> my user in a tweet.. I am storing all that information in different columns
> of the same row. so the pattern here is variable - you can have a million
> people tweet about someone and just 2 people tweet about someone else on a
> given day. is it a bad idea to use columns here? i did it this way because
> it makes it easy for a different process to run later and aggregate
> information such as list all people who mentioned my user during a given
> date range.

All in one column family?  Would it make sense to have more than one CF?

