accumulo-user mailing list archives

From Jeff Turner <>
Subject Re: comparing different rfile "densities"
Date Thu, 13 Nov 2014 11:36:32 GMT
i had a conversation about this yesterday that confused me.
if this is all documented somewhere, or described in slides somewhere,
let me know.

how much does the goal of "approximately one rfile per tserver" help
with building "good" (i.e., dense) rfiles for bulk-load?

i had not heard of the notion of one file per tserver (as an
efficiency trick).
i assumed that, since tserver/task-tracker are usually 1:1, the goal
was really to establish splits that would use a reasonable number of
reducers, which is a compromise from rfile-per-tablet when tables have
very large numbers of tablets.
so, two questions:

- do tservers generally get a contiguous range of tablets?
   - if an rfile spans five consecutive tablets, what is the
     likelihood that they are all on one tserver?
   - if tablets are less systematically allocated, then the rfile
     would cover multiple tservers

- even if we can generate rfiles that "fit into one tserver", isn't
there still an issue that the new rfile may cover 100's of the
tablets owned by a tserver?  so any scan of any of those tablets will
have to peek in the new file (until it gets compacted away)?
i think i'm getting close to, or am already past, the point of
diminishing returns as far as optimizing.  now i'm just curious.
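(for what it's worth, the "how many tablets does an rfile span" question above can be checked mechanically. a minimal sketch, assuming the table's split points are available as a sorted list -- the keys and splits here are made up:)

```python
import bisect

def tablets_spanned(first_key, last_key, split_points):
    """Count how many tablets the key range [first_key, last_key] of an
    rfile overlaps, given the table's sorted split points.

    A table with N split points has N + 1 tablets; tablet i holds keys
    in (split[i-1], split[i]].
    """
    lo = bisect.bisect_left(split_points, first_key)
    hi = bisect.bisect_left(split_points, last_key)
    return hi - lo + 1

# hypothetical splits: 5 tablets (-inf,b] (b,d] (d,f] (f,h] (h,+inf)
splits = ["b", "d", "f", "h"]
print(tablets_spanned("a", "a", splits))  # 1 -- dense file, one tablet
print(tablets_spanned("c", "g", splits))  # 3 -- spans (b,d], (d,f], (f,h]
```

a file's first/last key can be read from the rfile itself; the split points come from the table config. running this over a set of candidate bulk files gives the "density" distribution directly.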

On 11/11/14 2:57 PM, Adam Fuchs wrote:
> Jeff,
>
> "Density" is an interesting measure here, because RFiles are going to
> be sorted such that, even when the file is split between tablets, a
> read of the file is going to be (mostly) a sequential scan. I think
> instead you might want to look at a few other metrics: network
> overhead, name node operation rates, and number of files per tablet.
>
> The network overhead is going to be more of an issue of locality than
> density, so you'd have to do more than just have separate files per
> tablet to optimize that. You'll need some way of specifying or hinting
> at where the files should be generated. As an aside, we generally want
> to shoot for "probabilistic locality", so that the aggregate traffic
> over top-level switches is much smaller than the total data processed
> (smaller enough that it isn't a bottleneck). This is generally a
> little easier than guaranteeing that files are always read from a
> local drive. You might be able to measure this by monitoring your
> network usage, assuming those sensors are available to you. Accumulo
> also prints out information on entries/s and bytes/s for compactions
> in the debug logs.
>
> Impact on the namenode is more or less proportional to the number of
> blocks+files that you generate. As long as you're not generating a
> large number of files that are smaller than your block size (64MB?
> 128MB?), you're probably going to be close to optimal here. I'm not
> sure at what point the number of files+blocks becomes a bottleneck,
> but I've seen it happen when generating a very large number of tiny
> files. This is something that may cause you problems if you generate
> 50K files per ingest cycle rather than 500 or 5K. Measure this by
> looking at the size of the files that are being ingested.
>
> Number of files per tablet has a big effect on performance -- much
> more so than number of tablets per file. Query latency and aggregate
> scan performance are directly proportional to the number of files per
> tablet. Generating one file per tablet or one file per group of
> tablets doesn't really change this metric. You can measure this by
> scanning the metadata table as Josh suggested.
>
> I'm very interested in this subject, so please let us know what you find.
>
> Cheers,
> Adam
>
> On Tue, Nov 11, 2014 at 6:56 AM, Jeff Turner <> wrote:
>> is there a good way to compare the overall system effect of
>> bulk loading different sets of rfiles that have the same data,
>> but very different "densities"?
>>
>> i've been working on a way to re-feed a lot of data into a table,
>> and have started to believe that our default scheme for creating
>> rfiles - mapred into ~100-200 splits, sampled from 50k tablets -
>> is actually pretty bad.  subjectively, it feels like rfiles that
>> "span" 300 or 400 tablets are bad in at least two ways for the
>> tservers - until the files are compacted, all of the "potential"
>> tservers have to check the file, right?  and then, during
>> compaction, do portions of that rfile get volleyed around the cloud
>> until all tservers have grabbed their portion?  (so, there's network
>> overhead, repeatedly reading files and skipping most of the data, ...)
>>
>> if my new idea works, i will have a lot more control over the
>> density of rfiles, and most of them will span just one or two tablets.
>>
>> so, is there a way to measure/simulate overall system benefit or
>> cost of different approaches to building bulk-load data (destined
>> for an established table, across N tservers, ...)?
>>
>> i guess that a related question would be "are 1000 smaller and
>> denser bulk files better than 100 larger bulk files produced under
>> a typical getSplits() scheme?"
>>
>> thanks,
>> jeff
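(the files-per-tablet metric adam points at can be tallied offline once you have (tablet, file) pairs, e.g. from a scan of the metadata table's file entries. a hedged sketch -- the sample pairs below are invented, not real metadata output:)

```python
from collections import Counter

# hypothetical (tablet, file) pairs: one entry per rfile referenced
# by a tablet, as a metadata-table scan would yield
entries = [
    ("tablet_1", "F0001.rf"), ("tablet_1", "F0002.rf"), ("tablet_1", "F0003.rf"),
    ("tablet_2", "F0003.rf"), ("tablet_2", "F0004.rf"),
    ("tablet_3", "F0003.rf"),
]

files_per_tablet = Counter(t for t, _ in entries)   # drives query latency
tablets_per_file = Counter(f for _, f in entries)   # the "density" measure

# query cost tracks files per tablet, not tablets per file:
print(max(files_per_tablet.values()))  # 3 -- worst-case files a scan must merge
print(tablets_per_file["F0003.rf"])    # 3 -- one wide bulk file, three tablets
```

comparing this tally before and after a bulk load shows whether a "denser" file layout actually changed the per-tablet file counts that scans pay for.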
