accumulo-user mailing list archives

From Keith Turner <ke...@deenlo.com>
Subject Re: Unbalanced tablets or extra rfiles
Date Tue, 07 Jun 2016 21:58:11 GMT
On Tue, Jun 7, 2016 at 5:48 PM, Andrew Hulbert <ahulbert@ccri.com> wrote:

> Yeah, it looks like in both cases there are files that have ~del markers but
> are also referenced as entries for tablets. I assume there's no problem
> with both? Most are many, many months old.
>
> Many actually seem to have multiple file: assignments (multiple rows in the
> metadata table), which shouldn't happen, right?
>

It's OK for multiple tablets (rows in the metadata table) to reference the
same file.  When a tablet splits, both children may reference some of the
parent's files.  When a file is bulk imported, it may go to multiple tablets.
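
A quick way to see this is to scan the file column family over one table's
metadata rows. A minimal sketch with the Java client API (the instance name,
ZooKeepers, credentials, and table id "2" are placeholders; you can look up
your table's id with the shell's "tables -l" command):

    import java.util.Map;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.hadoop.io.Text;

    public class FilesPerTablet {
      public static void main(String[] args) throws Exception {
        // Placeholder connection details; substitute your own.
        Connector conn = new ZooKeeperInstance("myInstance", "zkhost:2181")
            .getConnector("root", new PasswordToken("secret"));

        // Tablets for table id "2" live in metadata rows "2;<endRow>",
        // plus "2<" for the last tablet.
        Scanner s = conn.createScanner("accumulo.metadata", Authorizations.EMPTY);
        s.setRange(new Range(new Text("2;"), true, new Text("2<"), true));
        s.fetchColumnFamily(new Text("file"));

        // The qualifier is the rfile path; the same path showing up under
        // more than one row means the file is shared by multiple tablets.
        for (Map.Entry<Key,Value> e : s)
          System.out.println(e.getKey().getRow() + " -> " + e.getKey().getColumnQualifier());
      }
    }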


>
> I also assume that the files in the directory don't particularly matter
> since they are assigned to other tablets in the metadata table.
>
> Cool & thanks again. Fun to learn the internals.
>
> -Andrew
>
>
>
> On 06/07/2016 05:34 PM, Josh Elser wrote:
>
>> re #1, you can try grepping over the Accumulo metadata table to see if
>> there are references to the file. It's possible that some files might be
>> kept around for table snapshots (but these should eventually be compacted
>> per Mike's point in #3, I believe).
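
A programmatic equivalent of that grep, as a rough sketch (the rfile name and
connection details here are hypothetical; substitute the file you found in
HDFS):

    import java.util.Map;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.hadoop.io.Text;

    public class FindFileReferences {
      public static void main(String[] args) throws Exception {
        String rfile = "F0000abc.rf"; // hypothetical file name

        // Placeholder connection details; substitute your own.
        Connector conn = new ZooKeeperInstance("myInstance", "zkhost:2181")
            .getConnector("root", new PasswordToken("secret"));

        // Tablet file references live in the "file" column family.
        Scanner s = conn.createScanner("accumulo.metadata", Authorizations.EMPTY);
        s.fetchColumnFamily(new Text("file"));

        for (Map.Entry<Key,Value> e : s)
          if (e.getKey().getColumnQualifier().toString().contains(rfile))
            System.out.println("still referenced by tablet " + e.getKey().getRow());
      }
    }

If nothing prints, no tablet currently references the file.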
>>
>> Mike Drob wrote:
>>
>>> 1) Is your Accumulo Garbage Collector process running? It will delete
>>> unreferenced files. (A sketch for listing delete candidates follows
>>> this list.)
>>> 2) I've heard it said that 200 tablets per tserver is the sweet spot,
>>> but it depends a lot on your read and write patterns.
>>> 3) See the manual section on idle major compactions (a sketch for
>>> setting the property follows this list):
>>>
>>> https://accumulo.apache.org/1.7/accumulo_user_manual#_table_compaction_major_everything_idle
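
Re 1): deletion candidates are recorded in the metadata table under rows
beginning with "~del", so you can peek at what the GC has queued up. A rough
sketch (connection details are placeholders):

    import java.util.Map;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;

    public class ListDeleteCandidates {
      public static void main(String[] args) throws Exception {
        // Placeholder connection details; substitute your own.
        Connector conn = new ZooKeeperInstance("myInstance", "zkhost:2181")
            .getConnector("root", new PasswordToken("secret"));

        // Files awaiting garbage collection are flagged with "~del" rows.
        Scanner s = conn.createScanner("accumulo.metadata", Authorizations.EMPTY);
        s.setRange(Range.prefix("~del"));

        for (Map.Entry<Key,Value> e : s)
          System.out.println(e.getKey().getRow());
      }
    }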
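
Re 3): the property described there, table.compaction.major.everything.idle,
can be set per table. A minimal sketch, assuming the property behaves as that
manual section describes (table name, idle time, and connection details are
placeholders):

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;

    public class SetIdleCompaction {
      public static void main(String[] args) throws Exception {
        // Placeholder connection details; substitute your own.
        Connector conn = new ZooKeeperInstance("myInstance", "zkhost:2181")
            .getConnector("root", new PasswordToken("secret"));

        // After this much idle time, a tablet becomes a candidate for a
        // full major compaction (per the linked manual section).
        conn.tableOperations().setProperty("mytable",
            "table.compaction.major.everything.idle", "30m");
      }
    }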
>>>
>>> On Tue, Jun 7, 2016 at 4:03 PM, Andrew Hulbert <ahulbert@ccri.com> wrote:
>>>
>>>     Hi all,
>>>
>>>     A few questions on behavior if you have any time...
>>>
>>>     1. When looking in Accumulo's HDFS directories, I'm seeing a
>>>     situation where "tablets" (aka directories) for a table have more
>>>     than the default 1G split threshold's worth of rfiles in them. In one
>>>     large instance, we have 400G worth of rfiles in the default_tablet
>>>     directory (a mix of A-, C-, and F-type rfiles). We took one of these
>>>     tables and compacted it, and now there is appropriately ~1G worth of
>>>     files in HDFS. On an unrelated table we have tablets with 100+G of
>>>     bulk-imported rfiles in the tablet's HDFS directory.
>>>
>>>     This seems to be common across multiple clouds. All the ingest is
>>>     done via batch writing. Is anyone aware of why this would happen, or
>>>     whether it is even important? Perhaps these are leftover rfiles from
>>>     some process; their timestamps cover large date ranges. (A sketch
>>>     for measuring a tablet directory's size follows these questions.)
>>>
>>>     2. There's been some discussion on the number of files per tserver
>>>     for efficiency. Are there any limits on the size of rfiles for
>>>     efficiency? For instance, I assume that compacting all the files
>>>     into a single rfile per 1G split is more efficient because it avoids
>>>     merging (but maybe decreases concurrency). However, would it be
>>>     better to have 500 tablets per node on a table with 1G splits versus
>>>     having 50 tablets with 10G splits? Assuming HDFS and Accumulo don't
>>>     mind 10G files! (A sketch for adjusting the split threshold follows.)
>>>
>>>     3. Is there any way to force idle tablets to actually major compact
>>>     other than the shell? It seems like it never happens. (A sketch
>>>     using the Java client API follows these questions.)
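
Re question 1: to reproduce the measurement, a rough sketch using the Hadoop
FileSystem API (the path is a placeholder; substitute your table id and
tablet directory):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.ContentSummary;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TabletDirSize {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Placeholder path; substitute your table id and tablet directory.
        Path dir = new Path("/accumulo/tables/2/default_tablet");
        ContentSummary cs = fs.getContentSummary(dir);

        System.out.println(dir + ": " + cs.getLength() + " bytes in "
            + cs.getFileCount() + " files");
      }
    }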
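
Re question 2: if you want to try the 10G-split experiment, the threshold is
a per-table property. A minimal sketch (table name, size, and connection
details are placeholders):

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;

    public class SetSplitThreshold {
      public static void main(String[] args) throws Exception {
        // Placeholder connection details; substitute your own.
        Connector conn = new ZooKeeperInstance("myInstance", "zkhost:2181")
            .getConnector("root", new PasswordToken("secret"));

        // Raise the split threshold from the 1G default to 10G.
        conn.tableOperations().setProperty("mytable",
            "table.split.threshold", "10G");
      }
    }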
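
Re question 3: besides the shell, the Java client API can trigger a major
compaction directly. A minimal sketch (table name and connection details are
placeholders):

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;

    public class ForceCompaction {
      public static void main(String[] args) throws Exception {
        // Placeholder connection details; substitute your own.
        Connector conn = new ZooKeeperInstance("myInstance", "zkhost:2181")
            .getConnector("root", new PasswordToken("secret"));

        // Full major compaction over the whole table: null start/end rows,
        // flush in-memory entries first, and wait for it to finish.
        conn.tableOperations().compact("mytable", null, null, true, true);
      }
    }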
>>>
>>>     Thanks!
>>>
>>>     Andrew
>>>
>>>
>>>
>
