accumulo-user mailing list archives

From "Terry P." <>
Subject Re: How to pre-split a table for UUID rowkeys
Date Mon, 05 Aug 2013 15:17:25 GMT
Thanks Eric. I came into work today after kicking off a 100 million test
data load and was pleasantly surprised to find the following distribution:

server1: 33.6 million docs
server2: 32.8 million docs
server3: 33.6 million docs

So it looks like my 5 million record load just didn't get big enough to
need to split (and now I recall that my 5M load used 500-byte records
rather than the 1000-byte size I was later told is closer to reality).

With a replication factor of 3 across the 3 nodes, total consumed space is
248GB against roughly 300GB of raw data (100 million 1KB records times 3
replicas), so compression saved about 18% on this random data.  Real data
will compress better I'm sure, as this test data is just the RandomBatchWriter
tweaked to use a UUID as the RowKey instead of the monotonically increasing
number, to better match our app.  Hopefully later today the full test suite
will be ready so I can ingest real data.
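For reference, the RowKey tweak amounts to stripping the dashes from a
random java.util.UUID.  A minimal sketch in plain Java (no Accumulo
dependencies; the class and method names here are just for illustration):

```java
import java.util.UUID;

public class UuidRowKey {
    // Row key format used in the test load: a random UUID with the
    // '-' dashes stripped, leaving 32 lowercase hex characters.
    static String newRowKey() {
        return UUID.randomUUID().toString().replace("-", "");
    }

    public static void main(String[] args) {
        // In the tweaked RandomBatchWriter this string would become the
        // row of each Mutation, replacing the monotonically increasing number.
        System.out.println(newRowKey());
    }
}
```

Because the first hex digit of a random UUID is uniform over 0-f, keys like
this should spread evenly across tablets once splits exist.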

Thanks for the addsplits syntax example -- I like that idea more than
working with a splits file, as it's easier to script and one less
dependency, if you will.  I'll pre-split with that, re-test, and see if the
distribution occurs sooner than it did last week.
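For scripting the pre-split, the fifteen single-hex-digit split points from
Eric's example can be generated rather than hand-typed.  A sketch in plain
Java (the actual addSplits call is left as a comment so this compiles and
runs without a cluster):

```java
import java.util.SortedSet;
import java.util.TreeSet;

public class HexSplits {
    // Build the split points from Eric's addsplits example:
    // 1 2 3 4 5 6 7 8 9 a b c d e f.  No "0" split is needed, because
    // the first tablet already covers every key below "1".
    static SortedSet<String> hexSplitPoints() {
        SortedSet<String> splits = new TreeSet<>();
        for (char c : "123456789abcdef".toCharArray()) {
            splits.add(String.valueOf(c));
        }
        return splits;
    }

    public static void main(String[] args) {
        // These strings could be written to a splits file, or wrapped in
        // org.apache.hadoop.io.Text and passed to
        // connector.tableOperations().addSplits(tableName, textSplits)
        // on a live cluster.
        System.out.println(String.join(" ", hexSplitPoints()));
    }
}
```

Fifteen split points yield sixteen tablets, which on a 3-node cluster gives
each tablet server five or six tablets to balance across.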

Thanks again Eric, the info you and the other folks on this list give out
every week is invaluable.

On Fri, Aug 2, 2013 at 5:35 PM, Eric Newton <> wrote:

> Apparently 5M 1K documents isn't enough to split the tablet.  I'm guessing
> that your documents are compressing well, or you are able to fit them all
> in memory.  You could try flushing the table and see if it splits.
> shell > flush -t table -w
> Or, you could just add splits if you know the UUIDs are uniformly
> distributed:
> shell > addsplits -t table 1 2 3 4 5 6 7 8 9 a b c d e f
> Or, if you just want Accumulo to split at a certain size under the 1G
> default:
> shell > config -t table -s table.split.threshold=10M
> -Eric
> On Fri, Aug 2, 2013 at 5:41 PM, Terry P. <> wrote:
>> Greetings folks,
>> I have a bit of a non-typical use case: Accumulo serves as the backend
>> data store for a search index, providing fault tolerance should the index
>> get corrupted.  Max docs stored in Accumulo will be under 1 billion at full
>> volume.
>> The search index is used to "find" the data a user is interested in, and
>> the application then retrieves the document from Accumulo using the RowKey
>> obtained from the index.  The RowKey is a java.util.UUID string that has
>> had the '-' dashes stripped out.
>> I have a 3 node cluster and as a quick test have ingested 5 million 1K
>> documents into it, yet they all went to a single TabletServer.  I was kind
>> of surprised -- I knew this would be the case for a row key using a
>> monotonically increasing number, but I thought with a UUID type rowkey the
>> entries would have been spread across the TabletServers at least some, even
>> without pre-splitting the table.
>> Clearly my understanding of how Accumulo spreads the data out is lacking.
>>  Can anyone shed more light on it?  And possibly recommend a table split
>> strategy for a 3-node cluster such as I have described?
>> Many thanks in advance,
>> Terry
