accumulo-user mailing list archives

From Ariel Valentin <ar...@arielvalentin.com>
Subject Re: RowID format tradeoffs
Date Sun, 06 Apr 2014 11:58:24 GMT
Russ,

I ran into the same problem. In the end we took another property, used it as a
prefix on the row ID, and then presplit the tables.
E.g. apples\0454316778
We still have situations where nodes run hot during peak usage, but we are able
to live with it.
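
A rough sketch of the prefix-and-presplit idea in plain Java, with no Accumulo
dependency. Everything here is an assumption for illustration: that the "\0" in
the example is a NUL separator between the prefix and the ID, that the ID is
zero-padded to a fixed width (9 here) so it sorts lexicographically, and that
one split point per prefix is what you want. The class and method names are
hypothetical. In a real setup the split points would be converted to Text and
handed to TableOperations.addSplits.

```java
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class PrefixedRowIds {
    // Assumption: the prefix and the ID are separated by a NUL character.
    static final char SEP = '\0';
    // Assumption: IDs are zero-padded to this width so string order matches numeric order.
    static final int ID_WIDTH = 9;

    /** Builds a row ID of the form prefix + NUL + zero-padded ID. */
    static String rowId(String prefix, long id) {
        return prefix + SEP + String.format("%0" + ID_WIDTH + "d", id);
    }

    /**
     * Returns one candidate split point per prefix (prefix + NUL), sorted.
     * Every row ID built by rowId() for a prefix sorts after that prefix's split point
     * and before the next prefix's split point (for ASCII prefixes).
     */
    static SortedSet<String> splitPoints(List<String> prefixes) {
        SortedSet<String> splits = new TreeSet<>();
        for (String prefix : prefixes) {
            splits.add(prefix + SEP);
        }
        return splits;
    }

    public static void main(String[] args) {
        // Hypothetical prefixes; "apples" is the only one taken from the example above.
        List<String> prefixes = List.of("apples", "bananas", "cherries");

        for (String split : splitPoints(prefixes)) {
            // Show the NUL separator visibly when printing.
            System.out.println("split: " + split.replace(String.valueOf(SEP), "\\0"));
        }
        String row = rowId("apples", 454316778L);
        System.out.println("row:   " + row.replace(String.valueOf(SEP), "\\0"));
    }
}
```

Within a single prefix the IDs still increase monotonically, so new writes for
that prefix keep landing at the end of its range. That is consistent with nodes
still running hot at peak.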

Thanks,
Ariel
---
Sent from my mobile device. Please excuse any errors.

> On Apr 6, 2014, at 3:16 AM, Russ Weeks <rweeks@newbrightidea.com> wrote:
> 
> Hi,
> 
> I'm looking for advice re. the best way to structure my row IDs. Monotonically increasing
> IDs have the very appealing property that I can quickly scan all recently-ingested unprocessed
> rows, particularly because I maintain a "checkpoint" of the most-recently processed row.
> 
> Of course, the problem with increasing IDs is that it's the lowest-order bits which are
> changing, which (I think?) means it's less optimal for distributing data across my cluster.
> I guess that the ways to get around this are to either reverse the ID or to define partitions,
> and use the partition ID as the high-order bits of the row id? Reversing the ID will destroy
> the property I describe above; I guess that using partitions may preserve it as long as I
> use a BatchScanner, but would a BatchScanner play nicely with AccumuloInputFormat? So many
> questions.
> 
> Anyways, I think there's a pretty good chance that I'm missing something obvious in this
> analysis. For instance, if it's easy to "rebalance" the data across my tablet servers periodically,
> then I'd probably just stick with increasing IDs.
> 
> Very interested to hear your advice, or the pros and cons of any of these approaches.
> 
> Thanks,
> -Russ
