accumulo-user mailing list archives

From Andrew Hulbert <ahulb...@ccri.com>
Subject Unbalanced tablets or extra rfiles
Date Tue, 07 Jun 2016 21:03:18 GMT
Hi all,

A few questions on behavior if you have any time...

1. When looking in Accumulo's HDFS directories I'm seeing a situation 
where "tablets" aka "directories" for a table have more than the default 
1G split threshold worth of rfiles in them. In one large instance, we 
have 400G worth of rfiles in the default_tablet directory (a mix of A, 
C, and F-type rfiles). We took one of these tables and compacted it and 
now there are appropriately ~1G worth of files in HDFS. On an unrelated 
table we have tablets with 100+G of bulk imported rfiles in the tablet's 
HDFS directory.

This seems to be common across multiple clouds. All the ingest is done 
via batch writing. Is anyone aware of why this would happen or if it is 
even important? Perhaps these are leftover rfiles from some process. 
Their timestamps cover large date ranges.
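
For reference, a minimal sketch of how a per-directory tally like the one above could be produced. It assumes the standard `hdfs dfs -ls -R` line layout (permissions, replication, owner, group, size, date, time, path) and groups rfile bytes by tablet directory and by the first letter of the file name. The sample paths, table id, and sizes are hypothetical, and the sketch only sums what sits in each directory; it does not consult the metadata table.

```python
from collections import defaultdict

SPLIT_THRESHOLD = 1 << 30  # 1G, the default split threshold mentioned above


def tablet_sizes(ls_lines):
    """Sum rfile bytes per tablet directory and per file-name prefix (A, C, F, ...)."""
    totals = defaultdict(lambda: defaultdict(int))
    for line in ls_lines:
        fields = line.split()
        # Skip directories, short lines, and anything that is not an rfile.
        if len(fields) < 8 or not fields[0].startswith("-"):
            continue
        path = fields[-1]
        if not path.endswith(".rf"):
            continue
        tablet_dir, name = path.rsplit("/", 1)
        totals[tablet_dir][name[0]] += int(fields[4])
    return totals


def oversized(totals, threshold=SPLIT_THRESHOLD):
    """Return {tablet_dir: total_bytes} for directories holding more than threshold."""
    sums = {d: sum(by_prefix.values()) for d, by_prefix in totals.items()}
    return {d: s for d, s in sums.items() if s > threshold}


# Hypothetical listing; real input would be the output of `hdfs dfs -ls -R <table dir>`.
sample = [
    "drwxr-xr-x   - accumulo supergroup          0 2016-06-07 21:03 /accumulo/tables/2/default_tablet",
    "-rw-r--r--   3 accumulo supergroup  900000000 2016-06-07 21:03 /accumulo/tables/2/default_tablet/A0000001.rf",
    "-rw-r--r--   3 accumulo supergroup  500000000 2016-06-07 21:03 /accumulo/tables/2/default_tablet/F0000002.rf",
    "-rw-r--r--   3 accumulo supergroup  200000000 2016-06-07 21:03 /accumulo/tables/2/t-0000003/C0000004.rf",
]

totals = tablet_sizes(sample)
for tablet_dir, size in oversized(totals).items():
    print(tablet_dir, size, dict(totals[tablet_dir]))
```

With the sample listing, only the default_tablet directory is reported, since its A and F files together exceed 1G.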

2. There's been some discussion on the number of files per tserver for 
efficiency. Are there any limits on the size of rfiles for efficiency? 
For instance, I assume that compacting all the files into a single rfile 
per 1G split is more efficient because it avoids merging (but maybe decreases 
concurrency). However, would it be better to have 500 tablets per node 
on a table with 1G splits versus having 50 tablets with 10G splits, 
assuming HDFS and Accumulo don't mind 10G files?
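
To make the comparison concrete, a small sketch of the arithmetic behind the two layouts. It assumes, as a simplification, that every tablet is filled right up to its split threshold, so the counts are a lower bound on the real number of tablets; the function name and the 500G per-node figure (implied by 500 tablets of 1G) are illustrative only.

```python
G = 1 << 30  # bytes in 1G


def tablets_per_node(data_bytes_per_node, split_threshold_bytes):
    """Minimum tablets needed to hold the data if each tablet is full (ceiling division)."""
    return -(-data_bytes_per_node // split_threshold_bytes)


data_per_node = 500 * G  # hypothetical: 500 tablets x 1G splits

small_splits = tablets_per_node(data_per_node, 1 * G)
large_splits = tablets_per_node(data_per_node, 10 * G)

print(small_splits, large_splits)  # 500 50
```

Both layouts hold the same amount of data per node; the question is only how it is divided among tablets and files.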

3. Is there any way to force idle tablets to actually major compact 
other than the shell? Seems like it never happens.

Thanks!

Andrew
