accumulo-user mailing list archives

From: Josh Elser <josh.el...@gmail.com>
Subject: Re: Unbalanced tablets or extra rfiles
Date: Tue, 07 Jun 2016 21:34:26 GMT
Re #1, you can try grepping over the Accumulo metadata table to see if 
there are still references to the file. It's possible that some files are 
kept around for table snapshots (but those should eventually be compacted 
away, per Mike's point in #3, I believe).
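
A rough sketch from the Accumulo shell (the file name below is made up; 
on 1.7 the metadata table is accumulo.metadata):

    # scan the metadata table for entries referencing a suspect rfile
    root@instance> grep F0000abc.rf -t accumulo.metadata

If nothing comes back, the file is unreferenced and the GC should be able 
to remove it.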

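And for #3, the idle-compaction knob Mike linked is a per-table property 
you can set from the shell (a sketch; the table name and interval are 
just examples):

    # major compact tablets that have sat write-idle for an hour
    root@instance> config -t mytable -s table.compaction.major.everything.idle=1h
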
Mike Drob wrote:
> 1) Is your Accumulo Garbage Collector process running? It will delete
> un-referenced files.
> 2) I've heard it said that 200 tablets per tserver is the sweet spot,
> but it depends a lot on your read and write patterns.
> 3)
> https://accumulo.apache.org/1.7/accumulo_user_manual#_table_compaction_major_everything_idle
>
> On Tue, Jun 7, 2016 at 4:03 PM, Andrew Hulbert <ahulbert@ccri.com> wrote:
>
>     Hi all,
>
>     A few questions on behavior if you have any time...
>
>     1. When looking in Accumulo's HDFS directories I'm seeing a
>     situation where "tablets" aka "directories" for a table have more
>     than the default 1G split threshold's worth of rfiles in them. In one
>     large instance, we have 400G worth of rfiles in the default_tablet
>     directory (a mix of A-, C-, and F-type rfiles). We took one of these
>     tables and compacted it, and now there are, appropriately, ~1G worth
>     of files in HDFS. On an unrelated table we have tablets with 100+G of
>     bulk-imported rfiles in the tablet's HDFS directory.
>
>     This seems to be common across multiple clouds. All the ingest is
>     done via batch writing. Is anyone aware of why this would happen, or
>     whether it is even important? Perhaps these are leftover rfiles from
>     some process. Their timestamps cover large date ranges.
>
>     2. There's been some discussion on the number of files per tserver
>     for efficiency. Are there any limits on the size of rfiles for
>     efficiency? For instance, I assume that compacting all the files
>     into a single rfile per 1G split is more efficient because it avoids
>     merging (but maybe decreases concurrency). However, would it be
>     better to have 500 tablets per node on a table with 1G splits, versus
>     having 50 tablets with 10G splits? Assuming HDFS and Accumulo don't
>     mind 10G files!
>
>     3. Is there any way to force idle tablets to actually major compact
>     other than through the shell? It seems like it never happens.
>
>     Thanks!
>
>     Andrew
>
>
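
Also, re Mike's #1: a quick way to sanity-check that the GC process is up 
on the host where it should be running (assuming jps is on your path):

    # the GC runs as its own JVM; look for its main class
    $ jps -l | grep org.apache.accumulo.gc.SimpleGarbageCollector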
