accumulo-user mailing list archives

From Russ Weeks
Subject RowID format tradeoffs
Date Sun, 06 Apr 2014 07:16:21 GMT

I'm looking for advice re. the best way to structure my row IDs.
Monotonically increasing IDs have the very appealing property that I can
quickly scan all recently-ingested unprocessed rows, particularly because I
maintain a "checkpoint" of the most-recently processed row.
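A minimal sketch of the checkpoint scan described above, assuming non-negative numeric IDs encoded as fixed-width, zero-padded strings so that lexicographic order (the order in which Accumulo sorts row IDs) matches numeric order. The names `encode_row_id` and `rows_after` and the 12-digit width are hypothetical, and a sorted Python list stands in for the table; none of this is the Accumulo API.

```python
from bisect import bisect_right

ID_WIDTH = 12  # assumed width; must be wide enough for the largest ID


def encode_row_id(n: int) -> str:
    """Zero-pad a non-negative ID so that string order equals numeric order."""
    return str(n).zfill(ID_WIDTH)


def rows_after(sorted_rows, checkpoint_row):
    """Return every row ID that sorts strictly after the checkpoint row."""
    return sorted_rows[bisect_right(sorted_rows, checkpoint_row):]


# A sorted list standing in for the table's row IDs.
table = sorted(encode_row_id(n) for n in [1, 2, 9, 10, 11, 100])

# Everything ingested after ID 9 is one contiguous range.
checkpoint = encode_row_id(9)
unprocessed = rows_after(table, checkpoint)
```

Without the padding, "10" would sort before "9", and the rows after the checkpoint would no longer form a single contiguous range.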

Of course, the problem with increasing IDs is that it's the lowest-order
bits which are changing, which (I think?) means it's less optimal for
distributing data across my cluster. I guess that the ways to get around
this are to either reverse the ID or to define partitions, and use the
partition ID as the high-order bits of the row id? Reversing the ID will
destroy the property I describe above; I guess that using partitions may
preserve it as long as I use a BatchScanner, but would a BatchScanner play
nicely with AccumuloInputFormat? So many questions.
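To make the partition idea concrete, here is a rough sketch, assuming the partition is derived as `id % NUM_PARTITIONS` and prepended to a zero-padded ID. The row format, the partition count, and the helper names are all hypothetical, and the "scan" is simulated over a sorted Python list rather than done with a real BatchScanner. The point is only that a single numeric checkpoint turns into one range per partition.

```python
NUM_PARTITIONS = 4  # assumed partition count
ID_WIDTH = 12       # assumed zero-padding width


def partitioned_row_id(n: int) -> str:
    """Prefix the padded ID with its partition so writes spread across prefixes."""
    p = n % NUM_PARTITIONS
    return f"{p:02d}_{n:0{ID_WIDTH}d}"


def ranges_after(checkpoint: int):
    """One (start_exclusive, end_inclusive) row range per partition."""
    ranges = []
    for p in range(NUM_PARTITIONS):
        start = f"{p:02d}_{checkpoint:0{ID_WIDTH}d}"
        end = f"{p:02d}_" + "9" * ID_WIDTH
        ranges.append((start, end))
    return ranges


def scan(sorted_rows, ranges):
    """Simulated multi-range scan: keep rows that fall inside any range."""
    return [r for r in sorted_rows
            if any(start < r <= end for start, end in ranges)]


# IDs 1..10 spread over four partitions; everything up to 6 is processed.
table = sorted(partitioned_row_id(n) for n in range(1, 11))
unprocessed = scan(table, ranges_after(6))
```

Results come back grouped by partition prefix, not in global ID order, so the checkpoint has to be advanced by the highest numeric ID seen rather than by the last row returned.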

Anyways, I think there's a pretty good chance that I'm missing something
obvious in this analysis. For instance, if it's easy to "rebalance" the
data across my tablet servers periodically, then I'd probably just stick
with increasing IDs.

Very interested to hear your advice, or the pros and cons of any of these
