accumulo-user mailing list archives

From Andrew Hulbert
Subject Re: Unbalanced tablets or extra rfiles
Date Tue, 07 Jun 2016 21:48:02 GMT
Yeah, it looks like in both cases there are tablets that have ~del markers
but are also referenced as entries for tablets. I assume there's no
problem with both? Most are many, many months old.

Many actually seem to have multiple file: assignments (multiple rows in 
metadata table) ...which shouldn't happen, right?

I also assume that the files in the directory don't particularly matter 
since they are assigned to other tablets in the metadata table.
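
The cross-check described above (files that carry a ~del marker while still being referenced, and files referenced by more than one tablet) can be sketched over a text dump of a metadata table scan. This is only a rough sketch under stated assumptions: it assumes each line looks like "row family:qualifier [visibility] value", that file references use the "file:" column with the path as the qualifier, that delete markers are rows starting with "~del" followed by the same path string, and that rows contain no whitespace. The function names and sample paths are hypothetical, and the exact metadata layout may differ between Accumulo versions.

```python
from collections import defaultdict


def parse_metadata_scan(lines):
    """Collect file references per tablet and ~del markers from scan output.

    Assumes (not verified against any particular Accumulo version) that each
    line has the form: <row> <family>:<qualifier> [<visibility>] <value>
    """
    refs = defaultdict(set)  # file path -> set of tablet rows referencing it
    deleted = set()          # file paths that carry a ~del marker
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        row = parts[0]
        if row.startswith("~del"):
            deleted.add(row[len("~del"):])
        elif len(parts) > 1 and parts[1].startswith("file:"):
            refs[parts[1][len("file:"):]].add(row)
    return refs, deleted


def suspicious_files(lines):
    """Return (files shared by several tablets, files referenced AND ~del-marked)."""
    refs, deleted = parse_metadata_scan(lines)
    shared = {f: sorted(rows) for f, rows in refs.items() if len(rows) > 1}
    both = sorted(f for f in refs if f in deleted)
    return shared, both


# Hypothetical sample lines, made up for illustration only.
sample = [
    "2;m file:hdfs://nn/accumulo/tables/2/t-0001/F0001.rf []    1000,10",
    "2< file:hdfs://nn/accumulo/tables/2/t-0001/F0001.rf []    1000,10",
    "2< file:hdfs://nn/accumulo/tables/2/t-0002/C0002.rf []    2000,20",
    "~delhdfs://nn/accumulo/tables/2/t-0002/C0002.rf : []    ",
]
shared, both = suspicious_files(sample)
```

A file showing up under several tablets is not necessarily a problem: tablets produced by a split can keep referencing the same file until they are compacted. The second list (referenced and ~del-marked at once) is the one that matches the situation described here.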

Cool & thanks again. Fun to learn the internals.


On 06/07/2016 05:34 PM, Josh Elser wrote:
> re #1, you can try grep'ing over the Accumulo metadata table to see if 
> there are references to the file. It's possible that some files might 
> be kept around for table snapshots (but these should eventually be 
> compacted per Mike's point in #3, I believe).
> Mike Drob wrote:
>> 1) Is your Accumulo Garbage Collector process running? It will delete
>> un-referenced files.
>> 2) I've heard it said that 200 tablets per tserver is the sweet spot,
>> but it depends a lot on your read and write patterns.
>> 3)

>> On Tue, Jun 7, 2016 at 4:03 PM, Andrew Hulbert wrote:
>>     Hi all,
>>     A few questions on behavior if you have any time...
>>     1. When looking in accumulo's HDFS directories I'm seeing a
>>     situation where "tablets" aka "directories" for a table have more
>>     than the default 1G split threshold worth of rfiles in them. In one
>>     large instance, we have 400G worth of rfiles in the default_tablet
>>     directory (a mix of A, C, and F-type rfiles). We took one of these
>>     tables and compacted it and now there are appropriately ~1G worth of
>>     files in HDFS. On an unrelated table we have tablets with 100+G of
>>     bulk imported rfiles in the tablet's HDFS directory.
>>     This seems to be common across multiple clouds. All the ingest is
>>     done via batch writing. Is anyone aware of why this would happen or
>>     if it is even important? Perhaps these are leftover rfiles from some
>>     process. Their timestamps cover large date ranges.
>>     2. There's been some discussion on the number of files per tserver
>>     for efficiency. Are there any limits on the size of rfiles for
>>     efficiency? For instance, I assume that compacting all the files
>>     into a single rfile per 1G split is more efficient bc it avoids
>>     merging (but maybe decreases concurrency). However, would it be
>>     better to have 500 tablets per node on a table with 1G splits versus
>>     having 50 tablets with 10G splits. Assuming HDFS and Accumulo don't
>>     mind 10G files!
>>     3. Is there any way to force idle tablets to actually major compact
>>     other than the shell? Seems like it never happens.
>>     Thanks!
>>     Andrew
