accumulo-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Josh Elser <>
Subject Re: comparing different rfile "densities"
Date Tue, 11 Nov 2014 16:49:30 GMT
You could also affirm your thoughts about RFile usage by generating a 
histogram over the metadata table for your table.

The table ID is the common prefix in the metadata table. Each unique row 
is a tablet, and will contain zero to many keys with a column family of 
'file'. The column qualifier is a URI to the file (can you make a Path 
object from it). The value is a CSV where the first element is the 
approximate size of the file (bytes) and the second element is the 
number of key-value pairs in the file.

This might help you make a more quantitative analysis over the 
distribution of files as you tweak things.

To reaffirm what Mike said, trying to get close to 1:1 files to tablets 
is definitely ideal but can be difficult to manage when considering all 
of potential knobs you can turn (for both ingest and query characteristics).

Mike Drob wrote:
> I'm not sure how to quantify this and give you a way to verify, but in
> my experience you want to be producing rflies that load into a single
> tablet. Typically, this means number of reducers equal to the number of
> tablets in the table that you will be importing and perhaps a custom
> partitioner. I think your intuition is spot on, here.
> Of course, if that means that you have a bunch of tiny files, then maybe
> it's time to rethink your split strategy.
> On Tue, Nov 11, 2014 at 5:56 AM, Jeff Turner <
> <>> wrote:
>     is there a good way to compare the overall system effect of
>     bulk loading different sets of rfiles that have the same data,
>     but very different "densities"?
>     i've been working on a way to re-feed a lot of data in to a table,
>     and have started to believe that our default scheme for creating
>     rfiles - mapred in to ~100-200 splits, sampled from 50k tablets -
>     is actually pretty bad.  subjectively, it feels like rfiles that "span"
>     300 or 400 tablets is bad in at least two ways for the tservers -
>     until the files are compacted, all of the "potential" tservers have
>     to check the file, right?  and then, during compaction, do portions
>     of that rfile get volleyed around the cloud until all tservers
>     have grabbed their portion?  (so, there's network overhead, repeatedly
>     reading files and skipping most of the data, ...)
>     if my new idea works, i will have a lot more control over the density
>     of rfiles, and most of them will span just one or two tablets.
>     so, is there a way to measure/simulate overall system benefit or cost
>     of different approaches to building bulk-load data (destined for an
>     established table, across N tservers, ...)?
>     i guess that a related question would be "are 1000 smaller and denser
>     bulk files better than 100 larger bulk files produced under a typical
>     getSplits() scheme?"
>     thanks,
>     jeff

View raw message