accumulo-user mailing list archives

From Christopher <ctubb...@apache.org>
Subject Re: RowID format tradeoffs
Date Fri, 11 Apr 2014 19:53:08 GMT
Russ-

Yeah, there's probably some relationship between numShards and
numTServers. Something like numShards >= n * numTServers, where n >=
1. You may want to experiment with different values of n.

As David alludes to, there's probably some penalty if n is too high.
I'd suggest some experimentation for your circumstance.

David-

The relationship is: numShards (per hash(row)) <= numTablets
You can anticipate adding nodes by choosing a higher initial n, or by
adding a prefix to your ShardIDs for new data. Your query would then
have to know about both the old shard format and the new shard format,
to include all the relevant ranges for batch scanning / map-reducing.

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii


On Fri, Apr 11, 2014 at 1:02 PM, David Medinets
<david.medinets@gmail.com> wrote:
> What is the penalty if you set number of shards to be 999 when you only have
> 20 nodes?
> How is the shard number related to the number of tablets?
> What would happen if you doubled the number of nodes?
>
>
> On Fri, Apr 11, 2014 at 12:56 PM, Russ Weeks <rweeks@newbrightidea.com>
> wrote:
>>
>> Hi, Chris,
>>
>> Thanks for your response, sorry it's taken me so long to process it.
>>
>> I guess there needs to be some sort of relationship between the number of
>> shards and the number of tablet servers, right? Would you typically set
>> numShards to be the greatest # of tablet servers you'd anticipate needing,
>> and then maintain a mapping on the client side to say, "OK, right now my
>> optimal range set is
>> ([shard0_mindate...shard15_maxdate],[shard16_mindate...shard32_maxdate]...)"
>> do you see what I'm getting at? Because it seems like the alternative is to
>> re-encode all your row IDs as the number of shards changes, and I'd *really*
>> like my row ids to be immutable.
>>
>> Right now we're using a distributed service very similar to Snowflake[1]
>> for row ID generation. It's not exactly monotonically increasing, but nearly
>> so.
>>
>> -Russ
>>
>> [1]: https://github.com/twitter/snowflake/
>>
>>
>> On Sun, Apr 6, 2014 at 7:48 AM, Christopher <ctubbsii@apache.org> wrote:
>>>
>>> You could try sharding:
>>>
>>> If your RowID is ingest date (to achieve ability to scan over recently
>>> ingested data, as you describe), you could use RowID of
>>> "ShardID_IngestDate" instead, where:
>>>
>>> ShardID = hash(row) % numShards
>>>
>>> This will result in numShards rows for each IngestDate; numShards is
>>> a value you choose to suit your cluster. You can pre-split your
>>> table, one split per ShardID, for better ingest and query.
>>>
>>> As for AccumuloInputFormat, it uses a regular scanner internally, but
>>> it supports multiple ranges, just like the BatchScanner, creating
>>> separate mappers for each one. All you need to do is query numShards
>>> number of ranges.
>>>
>>> Note: It sounds like you're currently using a 1-up increasing value
>>> for the current RowID. You may want to consider using IngestDate as I
>>> described above (to whatever degree of precision you need, as in
>>> YYYYMMddHHmmss...) or similar. This allows you to avoid maintaining a
>>> counter synchronized across your ingest, and could help you scale your
>>> ingest by parallelizing it with fewer concurrency issues. It also
>>> gives you the ability to analyze your ingest performance over time. It
>>> also allows you to do queries like "data processed in the last day".
>>> However, you'll lose the ability to do "last 100 rows processed". The
>>> sharding approach would work with either though.
>>>
>>> --
>>> Christopher L Tubbs II
>>> http://gravatar.com/ctubbsii
>>>
>>>
>>> On Sun, Apr 6, 2014 at 3:16 AM, Russ Weeks <rweeks@newbrightidea.com>
>>> wrote:
>>> > Hi,
>>> >
>>> > I'm looking for advice re. the best way to structure my row IDs.
>>> > Monotonically increasing IDs have the very appealing property that I
>>> > can
>>> > quickly scan all recently-ingested unprocessed rows, particularly
>>> > because I
>>> > maintain a "checkpoint" of the most-recently processed row.
>>> >
>>> > Of course, the problem with increasing IDs is that it's the
>>> > lowest-order
>>> > bits which are changing, which (I think?) means it's less optimal for
>>> > distributing data across my cluster. I guess that the ways to get
>>> > around
>>> > this are to either reverse the ID or to define partitions, and use the
>>> > partition ID as the high-order bits of the row id? Reversing the ID
>>> > will
>>> > destroy the property I describe above; I guess that using partitions
>>> > may
>>> > preserve it as long as I use a BatchScanner, but would a BatchScanner
>>> > play
>>> > nicely with AccumuloInputFormat? So many questions.
>>> >
>>> > Anyways, I think there's a pretty good chance that I'm missing
>>> > something
>>> > obvious in this analysis. For instance, if it's easy to "rebalance" the
>>> > data
>>> > across my tablet servers periodically, then I'd probably just stick
>>> > with
>>> > increasing IDs.
>>> >
>>> > Very interested to hear your advice, or the pros and cons of any of
>>> > these
>>> > approaches.
>>> >
>>> > Thanks,
>>> > -Russ
>>
>>
>
