accumulo-user mailing list archives

From Russ Weeks <>
Subject Re: RowID format tradeoffs
Date Fri, 11 Apr 2014 16:56:34 GMT
Hi, Chris,

Thanks for your response, sorry it's taken me so long to process it.

I guess there needs to be some sort of relationship between the number of
shards and the number of tablet servers, right? Would you typically set
numShards to the greatest number of tablet servers you'd anticipate needing,
and then maintain a mapping on the client side that says, "OK, right now my
optimal range set is ..."? Do you see what I'm getting at? Because it seems
like the alternative is to re-encode all your row IDs as the number of
shards changes, and I'd *really* like my row IDs to be immutable.
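For what it's worth, here's roughly the client-side mapping I have in mind, assuming row IDs of the form "ShardID_IngestDate" as described below (the class and method names, and the zero-padded two-digit shard prefix, are my own assumptions, not anything from the Accumulo API):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the client-side mapping: the shard count was fixed when the rows
// were written, and the client expands a date window into one scan range per
// shard. All names here are illustrative, not Accumulo API.
public class ShardRanges {
    // Row IDs are assumed to look like "07_20140411": a zero-padded shard
    // prefix followed by the ingest date (an assumed encoding).
    public static List<String[]> rangesFor(int numShards, String startDate, String endDate) {
        List<String[]> ranges = new ArrayList<>();
        for (int shard = 0; shard < numShards; shard++) {
            String prefix = String.format("%02d_", shard);
            // Each pair is (startRow, endRow); with the real client these
            // would become org.apache.accumulo.core.data.Range objects handed
            // to a BatchScanner or to AccumuloInputFormat.
            ranges.add(new String[] { prefix + startDate, prefix + endDate });
        }
        return ranges;
    }
}
```

The point is that only this query-time expansion depends on numShards; the stored row IDs never change.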

Right now we're using a distributed service very similar to Snowflake[1]
for row ID generation. It's not exactly monotonically increasing, but
nearly so.
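To be concrete about the generator: it packs a millisecond timestamp, a worker ID, and a per-worker sequence into 64 bits, roughly the way Snowflake does. The bit widths below (41/10/12) and the epoch are Snowflake's; our service differs in details, so treat this as a sketch:

```java
// Sketch of a Snowflake-style 64-bit ID: 41 bits of milliseconds since a
// custom epoch, 10 bits of worker ID, 12 bits of per-millisecond sequence.
// These widths follow Twitter's Snowflake; our service differs in details.
public class RoughId {
    private static final long EPOCH = 1288834974657L; // Snowflake's custom epoch

    public static long makeId(long timestampMs, long workerId, long sequence) {
        return ((timestampMs - EPOCH) << 22) // reserve 10 + 12 low bits
                | ((workerId & 0x3FF) << 12) // 10-bit worker ID
                | (sequence & 0xFFF);        // 12-bit sequence
    }
}
```

IDs from a single worker are strictly increasing; across workers they can interleave within a millisecond, which is the "not exactly monotonic" part.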



On Sun, Apr 6, 2014 at 7:48 AM, Christopher <> wrote:

> You could try sharding:
> If your RowID is the ingest date (to let you scan over recently ingested
> data, as you describe), you could use a RowID of "ShardID_IngestDate"
> instead, where:
> ShardID = hash(row) % numShards
> This results in numShards rows for each IngestDate, where numShards is a
> value you choose to suit your cluster. You can pre-split your table, one
> split per ShardID, for better ingest and query performance.
> As for AccumuloInputFormat, it uses a regular scanner internally, but it
> supports multiple ranges, just like the BatchScanner, creating a separate
> mapper for each one. All you need to do is query numShards ranges.
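In code, the RowID construction described above might look like the following. Only the formula ShardID = hash(row) % numShards comes from the description; the zero-padded two-digit prefix and the use of the data's own hash code are assumed encoding choices:

```java
// Sketch of "ShardID_IngestDate" row IDs as described above. The zero-padded
// two-digit shard prefix is an assumed encoding choice, not prescribed.
public class ShardedRowId {
    public static String rowId(String data, String ingestDate, int numShards) {
        // floorMod keeps the shard non-negative even for negative hash codes
        int shardId = Math.floorMod(data.hashCode(), numShards);
        return String.format("%02d_%s", shardId, ingestDate);
    }
}
```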
> Note: It sounds like you're currently using a 1-up increasing value for
> the current RowID. You may want to consider using IngestDate as I
> described above (to whatever degree of precision you need, as in
> YYYYMMddHHmmss...) or similar. This lets you avoid maintaining a counter
> synchronized across your ingesters, could help you scale ingest by
> parallelizing it with fewer concurrency issues, and gives you the ability
> to analyze your ingest performance over time and to run queries like
> "data processed in the last day". However, you'll lose the ability to ask
> for the "last 100 rows processed". The sharding approach works with
> either, though.
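A minimal sketch of an IngestDate RowID at second precision, using the YYYYMMddHHmmss pattern mentioned above (the UTC zone is an assumption; the main thing is to pick one zone and keep it, so the IDs sort in time order):

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Sketch: an ingest-date RowID at second precision. Because the pattern puts
// the most significant field first, lexicographic order matches time order.
public class IngestDateRowId {
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss").withZone(ZoneOffset.UTC);

    public static String rowId(Instant ingestTime) {
        return FMT.format(ingestTime);
    }
}
```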
> --
> Christopher L Tubbs II
> On Sun, Apr 6, 2014 at 3:16 AM, Russ Weeks <>
> wrote:
> > Hi,
> >
> > I'm looking for advice re. the best way to structure my row IDs.
> > Monotonically increasing IDs have the very appealing property that I can
> > quickly scan all recently-ingested unprocessed rows, particularly
> because I
> > maintain a "checkpoint" of the most-recently processed row.
> >
> > Of course, the problem with increasing IDs is that it's the lowest-order
> > bits which are changing, which (I think?) means it's less optimal for
> > distributing data across my cluster. I guess that the ways to get around
> > this are to either reverse the ID or to define partitions, and use the
> > partition ID as the high-order bits of the row id? Reversing the ID will
> > destroy the property I describe above; I guess that using partitions may
> > preserve it as long as I use a BatchScanner, but would a BatchScanner
> play
> > nicely with AccumuloInputFormat? So many questions.
> >
> > Anyways, I think there's a pretty good chance that I'm missing something
> > obvious in this analysis. For instance, if it's easy to "rebalance" the
> data
> > across my tablet servers periodically, then I'd probably just stick with
> > increasing IDs.
> >
> > Very interested to hear your advice, or the pros and cons of any of these
> > approaches.
> >
> > Thanks,
> > -Russ
