Subject: Re: Unbalanced tablets or extra rfiles
To: user@accumulo.apache.org
From: Andrew Hulbert
Date: Tue, 7 Jun 2016 17:48:02 -0400

Yeah, it looks like in both cases there are files that have ~del
markers but are also referenced as file: entries for tablets. I assume
there's no problem with having both? Most are many, many months old.

Many actually seem to have multiple file: assignments (multiple rows in
the metadata table)... which shouldn't happen, right?

I also assume that the files in the directory don't particularly matter
since they are assigned to other tablets in the metadata table.

Cool & thanks again. Fun to learn the internals.

-Andrew

On 06/07/2016 05:34 PM, Josh Elser wrote:
> re #1, you can try grep'ing over the Accumulo metadata table to see if
> there are references to the file. It's possible that some files might
> be kept around for table snapshots (but these should eventually be
> compacted per Mike's point in #3, I believe).
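
For anyone digging through the archives later, a rough sketch of what
that metadata grep might look like in the 1.7 shell (the rfile name
below is made up):

  root@instance> table accumulo.metadata
  root@instance accumulo.metadata> grep F00001ab.rf
  root@instance accumulo.metadata> scan -c file
  root@instance accumulo.metadata> scan -b ~del -e ~dem

The grep turns up any tablet rows still carrying a file: entry for that
rfile; the scan over the ~del row range lists the garbage collector's
deletion candidates.
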
> Mike Drob wrote:
>> 1) Is your Accumulo Garbage Collector process running? It will
>> delete un-referenced files.
>> 2) I've heard it said that 200 tablets per tserver is the sweet
>> spot, but it depends a lot on your read and write patterns.
>> 3) https://accumulo.apache.org/1.7/accumulo_user_manual#_table_compaction_major_everything_idle
>>
>> On Tue, Jun 7, 2016 at 4:03 PM, Andrew Hulbert wrote:
>>>
>>> Hi all,
>>>
>>> A few questions on behavior if you have any time...
>>>
>>> 1. When looking in Accumulo's HDFS directories I'm seeing a
>>> situation where "tablets" aka "directories" for a table have more
>>> than the default 1G split threshold worth of rfiles in them. In one
>>> large instance, we have 400G worth of rfiles in the default_tablet
>>> directory (a mix of A-, C-, and F-type rfiles). We took one of
>>> these tables and compacted it, and now there is appropriately ~1G
>>> worth of files in HDFS. On an unrelated table we have tablets with
>>> 100+G of bulk-imported rfiles in the tablet's HDFS directory.
>>>
>>> This seems to be common across multiple clouds. All the ingest is
>>> done via batch writing. Is anyone aware of why this would happen or
>>> if it is even important? Perhaps these are leftover rfiles from
>>> some process. Their timestamps cover large date ranges.
>>>
>>> 2. There's been some discussion on the number of files per tserver
>>> for efficiency. Are there any limits on the size of rfiles for
>>> efficiency? For instance, I assume that compacting all the files
>>> into a single rfile per 1G split is more efficient because it
>>> avoids merging (but maybe decreases concurrency). However, would it
>>> be better to have 500 tablets per node on a table with 1G splits
>>> versus having 50 tablets with 10G splits? Assuming HDFS and
>>> Accumulo don't mind 10G files!
>>>
>>> 3. Is there any way to force idle tablets to actually major compact
>>> other than the shell? It seems like it never happens.
>>>
>>> Thanks!
>>>
>>> Andrew
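
Re: #3, also for the archives: the property behind Mike's link can be
set per table, and a compaction can be forced and waited on from the
shell. A rough sketch; the table name is made up:

  root@instance> config -t mytable -s table.compaction.major.everything.idle=30m
  root@instance> compact -t mytable -w

The first line lowers the idle threshold so tablets with no recent
writes may get compacted down to one file sooner (per the manual,
there's no hard guarantee an idle tablet will be compacted); the second
forces a full major compaction immediately and blocks until it
completes.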