hbase-user mailing list archives

From Kireet Reddy <kir...@feedly.com>
Subject talk list table
Date Mon, 15 Apr 2013 13:09:10 GMT
We are planning to create a "scheduled task list" table in our hbase cluster. Essentially we
will define a table keyed by timestamp, and the row contents will be all the tasks that
need to be processed within that second (or whatever time period). I am trying to follow the "reasonably
wide rows" design mentioned in the hbasecon opentsdb talk. A couple of questions:

1. Should we use append or put to create tasks? Since these rows will not live forever, storage
space is not a concern; read/write performance is more important. As concurrency increases
I would guess the row lock may become an issue with append? Can appends be batched by the client
or do they execute immediately?

2. I am a little worried about hotspots. This basic design may cause issues in terms of the
table's performance. Many tasks will execute and reschedule themselves using the same interval,
t + 1 hour for example. So many of the writes may all go to the same block. Also, we have a
lot of other data so I am worried it may impact performance of unrelated data if the region
server gets too busy servicing the task list table. I can think of 2 strategies to avoid this.
One would be to create N different tables and read/write tasks to them randomly. This may
spread load across servers, but there is no guarantee hbase will place the tables on different
region servers, correct? The other would be to prefix the timestamp row key with a random
leading byte. Then when reading from the task list table, consumers could scan from any/all
possible values of the random byte + current timestamp to obtain tasks. Both strategies seem
like they could spread out load, but at the cost of more work/complexity to read tasks from
the table. Does either of those approaches make sense?
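
To make the second strategy concrete, here is a rough sketch of the key layout I have in mind. It is plain Java with no hbase calls; the class name, method names, and the choice of 16 salt values are placeholders I made up for illustration, and the timestamp is assumed to be whole seconds since the epoch.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: all names and the bucket count are illustrative assumptions.
final class SaltedTaskKey {
    // Number of distinct leading-byte values (assumed; would need tuning).
    static final int BUCKETS = 16;

    // Row key layout: 1 salt byte followed by the timestamp as 8 big-endian bytes.
    static byte[] rowKey(int salt, long timestampSeconds) {
        if (salt < 0 || salt >= BUCKETS) {
            throw new IllegalArgumentException("salt out of range: " + salt);
        }
        return ByteBuffer.allocate(1 + Long.BYTES)
                .put((byte) salt)
                .putLong(timestampSeconds)
                .array();
    }

    // Writers pick a random salt so tasks due at the same second spread across prefixes.
    static byte[] writeKey(long timestampSeconds) {
        return rowKey(ThreadLocalRandom.current().nextInt(BUCKETS), timestampSeconds);
    }

    // Readers must check every salt value to see all tasks due at a given second.
    static List<byte[]> readKeys(long timestampSeconds) {
        List<byte[]> keys = new ArrayList<>(BUCKETS);
        for (int salt = 0; salt < BUCKETS; salt++) {
            keys.add(rowKey(salt, timestampSeconds));
        }
        return keys;
    }
}
```

The extra read cost is visible here: a consumer issues one lookup per salt value for each timestamp, or each consumer could be assigned a subset of the salt values.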

On the read side, it seems like a similar problem exists in that all consumers will be reading
rows based on the current timestamp. Is this good because the block will very likely be cached
or bad because the region server may become overloaded? I have a feeling the answer is going
to be "it depends". :)

I did see the previous posts on queues and the tips there - use zookeeper for coordination,
schedule major compactions, etc. Sorry if these questions are basic, I am pretty new to hbase.