accumulo-user mailing list archives

From Christopher <>
Subject Re: RowID format tradeoffs
Date Sun, 06 Apr 2014 14:48:19 GMT
You could try sharding:

If your RowID is the ingest date (so that you can scan over recently
ingested data, as you describe), you could use a RowID of
"ShardID_IngestDate" instead, where:

ShardID = hash(row) % numShards

This will result in numShards rows for each IngestDate, where
numShards is a value you choose to suit your cluster. You can
pre-split your table at each ShardID for better ingest and query.
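
A minimal sketch of that row ID scheme in plain Java, with no Accumulo
dependency. The class and method names, the use of String.hashCode() as
the hash, and the three-digit zero padding (which assumes numShards is at
most 1000) are illustrative assumptions, not something prescribed above.
"key" stands for whatever natural identifier of the record you hash.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ShardedRowId {
    // Assumed precision: YYYYMMddHHmmss; extend the pattern if you need more.
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    // ShardID = hash(key) % numShards. floorMod keeps the result
    // non-negative even when hashCode() is negative.
    static int shardId(String key, int numShards) {
        return Math.floorMod(key.hashCode(), numShards);
    }

    // Builds "ShardID_IngestDate". The shard is zero-padded so that row IDs
    // sort lexicographically by shard first, then by ingest time.
    static String rowId(String key, int numShards, LocalDateTime ingestTime) {
        return String.format(Locale.ROOT, "%03d_%s",
                shardId(key, numShards), ingestTime.format(FMT));
    }

    // Candidate pre-split points, one per shard boundary: "001", "002", ...
    // In a real deployment these strings would be handed to the table's
    // split mechanism; that step is omitted here.
    static List<String> splitPoints(int numShards) {
        List<String> splits = new ArrayList<>();
        for (int s = 1; s < numShards; s++) {
            splits.add(String.format(Locale.ROOT, "%03d", s));
        }
        return splits;
    }
}
```

Splitting at "001", "002", and so on gives each shard its own tablet, so
writes for a single ingest date land on numShards tablets instead of one.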

As for AccumuloInputFormat, it uses a regular scanner internally, but
it supports multiple ranges, just like the BatchScanner, creating
separate mappers for each one. All you need to do is query numShards
number of ranges.
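
As a rough illustration of "query numShards number of ranges", the sketch
below computes the start and end row of one range per shard for a given
ingest-time window. It is plain Java; the class name, method name, and
three-digit shard padding are assumptions made for the example. Each pair
would then be wrapped in an Accumulo Range and the collection passed to
the input format or a BatchScanner.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ShardRanges {
    // Returns one {startRow, endRow} pair per shard covering the window
    // [startDate, endDate], where both dates use the same format as the
    // IngestDate part of the row ID (e.g. YYYYMMddHHmmss).
    static List<String[]> rangesFor(int numShards, String startDate, String endDate) {
        List<String[]> ranges = new ArrayList<>();
        for (int shard = 0; shard < numShards; shard++) {
            String prefix = String.format(Locale.ROOT, "%03d_", shard);
            ranges.add(new String[] { prefix + startDate, prefix + endDate });
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Example: everything ingested on 5 April 2014, across 4 shards.
        for (String[] r : rangesFor(4, "20140405000000", "20140405235959")) {
            // With Accumulo on the classpath, each pair would become
            // new Range(r[0], r[1]) before being passed to setRanges(...).
            System.out.println(r[0] + " .. " + r[1]);
        }
    }
}
```

A "checkpoint" of the most recently processed ingest time can serve as
startDate, which keeps the scan-since-last-checkpoint pattern at the cost
of issuing numShards ranges instead of one.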

Note: It sounds like you're currently using a 1-up increasing value
for the current RowID. You may want to consider using IngestDate as I
described above (to whatever degree of precision you need, as in
YYYYMMddHHmmss...) or similar. This allows you to avoid maintaining a
counter synchronized across your ingest, and could help you scale your
ingest by parallelizing it with fewer concurrency issues. It also
lets you analyze your ingest performance over time and run queries
like "data processed in the last day". However, you'll lose the
ability to do "last 100 rows processed". The sharding approach would
work with either, though.

Christopher L Tubbs II

On Sun, Apr 6, 2014 at 3:16 AM, Russ Weeks <> wrote:
> Hi,
> I'm looking for advice re. the best way to structure my row IDs.
> Monotonically increasing IDs have the very appealing property that I can
> quickly scan all recently-ingested unprocessed rows, particularly because I
> maintain a "checkpoint" of the most-recently processed row.
> Of course, the problem with increasing IDs is that it's the lowest-order
> bits which are changing, which (I think?) means it's less optimal for
> distributing data across my cluster. I guess that the ways to get around
> this are to either reverse the ID or to define partitions, and use the
> partition ID as the high-order bits of the row id? Reversing the ID will
> destroy the property I describe above; I guess that using partitions may
> preserve it as long as I use a BatchScanner, but would a BatchScanner play
> nicely with AccumuloInputFormat? So many questions.
> Anyways, I think there's a pretty good chance that I'm missing something
> obvious in this analysis. For instance, if it's easy to "rebalance" the data
> across my tablet servers periodically, then I'd probably just stick with
> increasing IDs.
> Very interested to hear your advice, or the pros and cons of any of these
> approaches.
> Thanks,
> -Russ
