accumulo-user mailing list archives

From Keith Turner <>
Subject Re: Unbalanced tablets or extra rfiles
Date Tue, 07 Jun 2016 21:58:11 GMT
On Tue, Jun 7, 2016 at 5:48 PM, Andrew Hulbert <> wrote:

> Yeah, it looks like in both cases there are files that have ~del markers
> but are also referenced as entries for tablets. I assume there's no
> problem with both? Most are many, many months old.
> Many actually seem to have multiple file: assignments (multiple rows in
> the metadata table)... which shouldn't happen, right?

It's OK for multiple tablets (rows in the metadata table) to reference the
same file.  When a tablet splits, both children may reference some of the
parent's files.  When a file is bulk imported, it may go to multiple tablets.
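
For example, you can see shared references from the Accumulo shell.  File
entries in the metadata table are keyed by tablet row, so the same file
path showing up under several rows is expected (the output and file name
below are illustrative, not from a real cluster):

    root@instance> table accumulo.metadata
    root@instance accumulo.metadata> scan -c file
    3;m file:hdfs://nn/accumulo/tables/3/t-0000001/F0000abc.rf []    12345678,100000
    3< file:hdfs://nn/accumulo/tables/3/t-0000001/F0000abc.rf []    23456789,200000

The shell's grep command works too if you want to hunt for one specific
file name rather than scan everything.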

> I also assume that the files in the directory don't particularly matter
> since they are assigned to other tablets in the metadata table.
> Cool & thanks again. Fun to learn the internals.
> -Andrew
> On 06/07/2016 05:34 PM, Josh Elser wrote:
>> re #1, you can try grep'ing over the Accumulo metadata table to see if
>> there are references to the file. It's possible that some files might be
>> kept around for table snapshots (but these should eventually be compacted
>> per Mike's point in #3, I believe).
>> Mike Drob wrote:
>>> 1) Is your Accumulo Garbage Collector process running? It will delete
>>> un-referenced files.
>>> 2) I've heard it said that 200 tablets per tserver is the sweet spot,
>>> but it depends a lot on your read and write patterns.
>>> 3)
>>> On Tue, Jun 7, 2016 at 4:03 PM, Andrew Hulbert <> wrote:
>>>     Hi all,
>>>     A few questions on behavior if you have any time...
>>>     1. When looking in accumulo's HDFS directories I'm seeing a
>>>     situation where "tablets" aka "directories" for a table have more
>>>     than the default 1G split threshold worth of rfiles in them. In one
>>>     large instance, we have 400G worth of rfiles in the default_tablet
>>>     directory (a mix of A, C, and F-type rfiles). We took one of these
>>>     tables and compacted it, and now there is, as expected, ~1G worth of
>>>     files in HDFS. On an unrelated table we have tablets with 100+G of
>>>     bulk imported rfiles in the tablet's HDFS directory.
>>>     This seems to be common across multiple clouds. All the ingest is
>>>     done via batch writing. Is anyone aware of why this would happen or
>>>     if it is even important? Perhaps these are leftover rfiles from some
>>>     process. Their timestamps cover large date ranges.
>>>     2. There's been some discussion on the number of files per tserver
>>>     for efficiency. Are there any limits on the size of rfiles for
>>>     efficiency? For instance, I assume that compacting all the files
>>>     into a single rfile per 1G split is more efficient because it avoids
>>>     merging (but maybe decreases concurrency). However, would it be
>>>     better to have 500 tablets per node on a table with 1G splits versus
>>>     having 50 tablets with 10G splits? Assuming HDFS and Accumulo don't
>>>     mind 10G files!
>>>     3. Is there any way to force idle tablets to actually major compact
>>>     other than the shell? It seems like it never happens.
>>>     Thanks!
>>>     Andrew
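
Regarding #3, idle compactions are controlled by the
table.compaction.major.everything.idle table property, which sets how long
a tablet must go without writes before all of its files may be compacted
into one.  There is no guarantee an idle tablet will actually be compacted,
but lowering the threshold from the default may help.  A sketch from the
shell (the table name is just an example):

    root@instance> config -t mytable -s table.compaction.major.everything.idle=30m

Regular major compactions are still driven by table.compaction.major.ratio;
the idle setting only covers compacting a quiet tablet's files down to one.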
