hbase-user mailing list archives

From Tom Brown <tombrow...@gmail.com>
Subject Add client complexity or use a coprocessor?
Date Mon, 09 Apr 2012 16:48:18 GMT
To whom it may concern,

Ignoring the complexities of gathering the data, assume that I will be
tracking millions of unique viewers. Updates from each of our millions
of clients are gathered in a centralized platform and spread among a
group of machines for processing and inserting into HBase (assume that
this group can be scaled horizontally). The data is stored in an OLAP
cube format and one of the metrics I'm tracking across various
attributes is viewership (how many people from Y are watching X).

I'm writing this to ask for your thoughts as to the most appropriate
way to structure my data so I can count unique TV viewers (assume a
service like Netflix or Hulu).

Here are the solutions I'm considering:

1. Store each unique user ID as the cell name within the cube(s) in
which it occurs. This has the advantage of 100% accuracy, but the
downside is the enormous space required to store each unique cell.
Consuming this data is also problematic, as the only way to provide a
viewership count is to count every cell. To save the overhead of
sending each cell over the network, the counting could be done by a
coprocessor on the region server, but that still doesn't avoid the
cost of reading every cell from disk. I'm also not sure what happens
if a single row grows larger than an entire region (48 bytes per
user ID * 10,000,000 users = ~480MB). (I've put a rough sketch of the
counting I have in mind after this list.)

2. Store a byte array that allows estimating unique viewers (with a
small margin of error*). Add a coprocessor for updating this column
so I can guarantee that updates to a specific OLAP cell are atomic.
The main benefit of this path is that the nodes that update HBase can
be less complex. Another benefit I see is that I can just add more
HBase regions as scale requires. However, I'm not sure if I can use a
coprocessor the way I want: can I observe updates to a particular
table and replace the provided data with my own? (The client calls
"put" with the actual user ID, my coprocessor replaces it with a
computed value, and the actual user ID never gets stored in HBase.)
(There's a sketch of the observer I'm imagining after this list.)

3. Store a byte array that allows estimating unique viewers (with a
small margin of error*). Re-arrange my architecture so that each OLAP
cell is only updated by a single node. The main benefit of this would
be that I don't need to worry about atomic operations in HBase, since
all updates to a single cell would be serialized through one writer.
The biggest downside is that I believe it would add significant
complexity to my overall architecture. (The third sketch after this
list shows the kind of update loop I mean.)
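
For #1, here is roughly the counting I'm picturing if it's done from
the client side instead of a coprocessor. The table name, family, and
row layout are placeholders I made up, and I'm on the 0.92 client API:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ViewerCount {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "olap_cube");      // placeholder table name
    byte[] row = Bytes.toBytes("showX|regionY");       // one OLAP cell per row

    // Scan only this row, but pull the qualifiers back in batches so a
    // 10,000,000-column row doesn't arrive as one enormous Result.
    Scan scan = new Scan(row, Bytes.add(row, new byte[] { 0 }));
    scan.addFamily(Bytes.toBytes("v"));                // family holding user-ID qualifiers
    scan.setBatch(10000);                              // max columns per Result
    scan.setCaching(10);

    long viewers = 0;
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result chunk : scanner) {
        viewers += chunk.size();                       // one KeyValue per unique user ID
      }
    } finally {
      scanner.close();
      table.close();
    }
    System.out.println("unique viewers: " + viewers);
  }
}

Even with the batching, every KeyValue still has to be read and
counted somewhere, which is exactly the cost I'd like to avoid.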

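For #2, the observer I have in mind would look something like the
sketch below (BaseRegionObserver on 0.92; the family/qualifier names
and hashToSketchUpdate() are placeholders I made up, and this only
rewrites the incoming KeyValue; folding it into the already-stored
estimator atomically is the part I'm unsure about):

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

public class ViewershipObserver extends BaseRegionObserver {
  private static final byte[] FAMILY = Bytes.toBytes("v");        // placeholder family
  private static final byte[] RAW_QUAL = Bytes.toBytes("viewer"); // qualifier the client writes

  @Override
  public void prePut(ObserverContext<RegionCoprocessorEnvironment> e,
      Put put, WALEdit edit, boolean writeToWAL) throws IOException {
    List<KeyValue> kvs = put.getFamilyMap().get(FAMILY);
    if (kvs == null) {
      return;
    }
    for (int i = 0; i < kvs.size(); i++) {
      KeyValue kv = kvs.get(i);
      if (Bytes.equals(kv.getQualifier(), RAW_QUAL)) {
        // Swap the raw user ID for a derived value so the ID itself is never
        // persisted. hashToSketchUpdate() is the piece I'd still have to
        // write; it would fold the ID into the estimator's byte array.
        byte[] derived = hashToSketchUpdate(kv.getValue());
        kvs.set(i, new KeyValue(put.getRow(), FAMILY, RAW_QUAL, derived));
      }
    }
  }

  private byte[] hashToSketchUpdate(byte[] userId) {
    // placeholder: the real version would be the probabilistic-counter update
    return Bytes.toBytes(Bytes.hashCode(userId));
  }
}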

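And for #3, since only one node would ever touch a given cell, the
update on that node could be a plain read-modify-write with no
HBase-side atomicity at all. I've used a simple set-a-bit bitmap as a
stand-in for the estimator described in the article below; the names
and sizes are placeholders:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class SingleWriterUpdate {
  private static final byte[] FAMILY = Bytes.toBytes("v");      // placeholder family
  private static final byte[] QUAL = Bytes.toBytes("sketch");   // the estimator column
  private static final int SKETCH_BYTES = 64 * 1024;            // fixed-size estimator

  // Safe without HBase-side atomicity only because this node is the sole
  // writer for this row, so get/merge/put is naturally serialized.
  public static void recordViewer(HTable table, byte[] row, byte[] userId)
      throws IOException {
    Get get = new Get(row);
    get.addColumn(FAMILY, QUAL);
    Result result = table.get(get);
    byte[] sketch = result.getValue(FAMILY, QUAL);
    if (sketch == null) {
      sketch = new byte[SKETCH_BYTES];
    }

    // Stand-in for the real estimator update: hash the user ID and set a bit.
    int h = (Bytes.hashCode(userId) & 0x7fffffff) % (SKETCH_BYTES * 8);
    sketch[h >>> 3] |= (byte) (1 << (h & 7));

    Put put = new Put(row);
    put.add(FAMILY, QUAL, sketch);
    table.put(put);
  }
}

In practice I'd probably buffer these updates in memory and flush the
sketch periodically rather than re-writing the whole byte array on
every view.
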
Thanks for your time, and I look forward to hearing your thoughts.

Sincerely,
Tom Brown

*(For information about the byte array mentioned in #2 and #3, see:
http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html)
