accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: Unbalanced tablets or extra rfiles
Date Tue, 07 Jun 2016 22:15:52 GMT


Keith Turner wrote:
>
>
> On Tue, Jun 7, 2016 at 5:48 PM, Andrew Hulbert <ahulbert@ccri.com> wrote:
>
>     Yeah, it looks like in both cases there are tablets that have ~del
>     markers but are also referenced as entries for tablets. I assume
>     there's no problem with both? Most are many, many months old.

Yeah, nothing inherently wrong with it. It's easier to create the ~del
entry as soon as we know one tablet is done with the file. The GC still
checks the tablet row-space to make sure no tablet still holds a reference
to it (per Keith's point about how multiple tablets can refer to the same
file).
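
To make that check concrete, here is a rough sketch of the idea in plain
Python. It is a toy model, not Accumulo's GC code or client API: the
deletable_files helper, the (row, family, qualifier) tuples, and the row and
path formats are all simplified assumptions for illustration only.

```python
# Toy model of the GC check described above. This is NOT Accumulo's real
# schema or API; rows, paths, and the helper name are hypothetical.

def deletable_files(metadata):
    """Return ~del candidates that no tablet row still references."""
    candidates = set()   # files named by ~del entries
    referenced = set()   # files still listed under some tablet's "file" column
    for row, family, qualifier in metadata:
        if row.startswith("~del"):
            candidates.add(row[len("~del"):])
        elif family == "file":
            referenced.add(qualifier)
    # A candidate is only safe to delete once no tablet refers to it.
    return candidates - referenced


metadata = [
    # Two tablets (e.g. children of a split) share the same file.
    ("2;m", "file", "/t-0001/F0000abc.rf"),
    ("2<", "file", "/t-0001/F0000abc.rf"),
    ("2<", "file", "/default_tablet/C0000xyz.rf"),
    # One tablet was done with F0000abc.rf, so a ~del entry exists for it.
    ("~del/t-0001/F0000abc.rf", "", ""),
    # Nothing references A0000old.rf any more.
    ("~del/t-0002/A0000old.rf", "", ""),
]

print(sorted(deletable_files(metadata)))
```

In this model F0000abc.rf has a ~del entry but is kept because two tablets
still reference it, which is the harmless situation Andrew is seeing; only
A0000old.rf would be removed.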

>     Many actually seem to have multiple file: assignments (multiple rows
>     in metadata table) ...which shouldn't happen, right?
>
>
> It's OK for multiple tablets (rows in the metadata table) to reference the
> same file.  When a tablet splits, both children may reference some of
> the parent's files.  When a file is bulk imported, it may go to multiple
> tablets.
>
>
>     I also assume that the files in the directory don't particularly
>     matter since they are assigned to other tablets in the metadata table.
>
>     Cool & thanks again. Fun to learn the internals.
>
>     -Andrew
>
>
>
>     On 06/07/2016 05:34 PM, Josh Elser wrote:
>
>         re #1, you can try grep'ing over the Accumulo metadata table to
>         see if there are references to the file. It's possible that some
>         files might be kept around for table snapshots (but these should
>         eventually be compacted per Mike's point in #3, I believe).
>
>         Mike Drob wrote:
>
>             1) Is your Accumulo Garbage Collector process running? It will
>             delete un-referenced files.
>             2) I've heard it said that 200 tablets per tserver is the sweet
>             spot, but it depends a lot on your read and write patterns.
>             3)
>             https://accumulo.apache.org/1.7/accumulo_user_manual#_table_compaction_major_everything_idle
>
>
>             On Tue, Jun 7, 2016 at 4:03 PM, Andrew Hulbert
>             <ahulbert@ccri.com> wrote:
>
>                  Hi all,
>
>                  A few questions on behavior if you have any time...
>
>                  1. When looking in Accumulo's HDFS directories I'm seeing a
>                  situation where "tablets" aka "directories" for a table have
>                  more than the default 1G split threshold worth of rfiles in
>                  them. In one large instance, we have 400G worth of rfiles in
>                  the default_tablet directory (a mix of A, C, and F-type
>                  rfiles). We took one of these tables and compacted it and now
>                  there are appropriately ~1G worth of files in HDFS. On an
>                  unrelated table we have tablets with 100+G of bulk imported
>                  rfiles in the tablet's HDFS directory.
>
>                  This seems to be common across multiple clouds. All the
>                  ingest is done via batch writing. Is anyone aware of why this
>                  would happen or if it is even important? Perhaps these are
>                  leftover rfiles from some process. Their timestamps cover
>                  large date ranges.
>
>                  2. There's been some discussion on the number of files per
>                  tserver for efficiency. Are there any limits on the size of
>                  rfiles for efficiency? For instance, I assume that compacting
>                  all the files into a single rfile per 1G split is more
>                  efficient because it avoids merging (but maybe decreases
>                  concurrency). However, would it be better to have 500 tablets
>                  per node on a table with 1G splits versus having 50 tablets
>                  with 10G splits? Assuming HDFS and Accumulo don't mind 10G
>                  files!
>
>                  3. Is there any way to force idle tablets to actually major
>                  compact other than the shell? Seems like it never happens.
>
>                  Thanks!
>
>                  Andrew
>
>
>
>
