accumulo-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Adam Fuchs <>
Subject Re: comparing different rfile "densities"
Date Thu, 13 Nov 2014 15:13:31 GMT

I also suspect that this is reaching the point of over-optimization.
While there are some situations in which contiguous tablets will be
likely to be co-hosted on the same tserver, this is improbable without
a custom load balancer. However, that might not really matter. When
bulk loading files, Accumulo will only assign the file to tablets for
which that RFile has keys. If you partition your data non-contiguously
when creating your files to be bulk loaded then you can theoretically
align your data exactly with the tablets on a tablet server. This
would require something other than the RangePartitioner that we
typically recommend for MapReduce-based bulk ingest.

More generally, I'm not sure I agree with the "one file per tserver"
concept. I would think you would want at least enough files per
tserver to leverage the parallelism of your hardware in generating
those files. RFile generation is typically CPU-bound, so that puts you
closer to the 20-40 per server range.


On Thu, Nov 13, 2014 at 6:36 AM, Jeff Turner <> wrote:
> i had a conversation about this yesterday that confused me.
> if this is all documented somewhere, or is described in ppt, let me know.
> how much does the goal of "approximately one rfile per tserver" help
> with building "good" (i.e., dense) rfiles for bulk-load?
> i had not heard of the the notion of one file per tserver (as an efficiency
> trick).
> i assumed that, since tserver/task-tracker are usually 1:1, the goal was
> really to
> establish splits that would use a reasonable number of reducers. which is a
> compromise from rfile-per-tablet, when tables have very large numbers of
> tablets.
> so, two questions:
> - do tservers generally get a contiguous range of tablets?
>   - if an rfile spans five consecutive tablets, what is the likelihood that
> they are all on one tserver?
>   - if tablets are less systematically allocated, then the rfile would cover
> multiple tservers
> - even if we can generate rfiles that "fit in to one tserver", isn't there
> still an issue that
>    the new rfile may cover 100's of the tablets owned by a tserver?  so any
> scan of
>    any of those tablets will have to peek in the new file (until
> compaction).
> i think i'm getting close to, or already passed by, the point of diminishing
> returns
> as far as optimizing.  now i'm just curious.
> On 11/11/14 2:57 PM, Adam Fuchs wrote:
>> Jeff,
>> "Density" is an interesting measure here, because RFiles are going to
>> be sorted such that, even when the file is split between tablets, a
>> read of the file is going to be (mostly) a sequential scan. I think
>> instead you might want to look at a few other metrics: network
>> overhead, name node operation rates, and number of files per tablet.
>> The network overhead is going to be more an issue of locality than
>> density, so you'd have to do more than just have separate files per
>> tablet to optimize that. You'll need some way of specifying or hinting
>> at where the files should be generated. As an aside, we generally want
>> to shoot for "probabilistic locality" so that the aggregate traffic
>> over top level switches is much smaller than the total data processed
>> (smaller enough so that it isn't a bottleneck). This is generally a
>> little easier than guaranteeing that files are always read from a
>> local drive. You might be able to measure this by monitoring your
>> network usage, assuming those sensors are available to you. Accumulo
>> also prints out information on entries/s and bytes/s for compactions
>> in the debug logs.
>> Impact on the namenode is more or less proportional to the number of
>> blocks+files that you generate. As long as you're not generating a
>> large number of files that are smaller than your block size (64MB?
>> 128MB?) you're probably going to be close to optimal here. I'm not
>> sure at what point the number of files+blocks becomes a bottleneck,
>> but I've seen it happen when generating a very large number of tiny
>> files. This is something that may cause you problems if you generate
>> 50K files per ingest cycle rather than 500 or 5K. Measure this by
>> looking at the size of files that are being ingested.
>> Number of files per tablet has a big effect on performance -- much
>> more so than number of tablets per file. Query latency and aggregate
>> scan performance are directly proportional to the number of files per
>> tablet. Generating one file per tablet or one file per group of
>> tablets doesn't really change this metric. You can measure this by
>> scanning the metadata table as Josh suggested.
>> I'm very interested in this subject, so please let us know what you find.
>> Cheers,
>> Adam
>> On Tue, Nov 11, 2014 at 6:56 AM, Jeff Turner <> wrote:
>>> is there a good way to compare the overall system effect of
>>> bulk loading different sets of rfiles that have the same data,
>>> but very different "densities"?
>>> i've been working on a way to re-feed a lot of data in to a table,
>>> and have started to believe that our default scheme for creating
>>> rfiles - mapred in to ~100-200 splits, sampled from 50k tablets -
>>> is actually pretty bad.  subjectively, it feels like rfiles that "span"
>>> 300 or 400 tablets is bad in at least two ways for the tservers -
>>> until the files are compacted, all of the "potential" tservers have
>>> to check the file, right?  and then, during compaction, do portions
>>> of that rfile get volleyed around the cloud until all tservers
>>> have grabbed their portion?  (so, there's network overhead, repeatedly
>>> reading files and skipping most of the data, ...)
>>> if my new idea works, i will have a lot more control over the density
>>> of rfiles, and most of them will span just one or two tablets.
>>> so, is there a way to measure/simulate overall system benefit or cost
>>> of different approaches to building bulk-load data (destined for an
>>> established table, across N tservers, ...)?
>>> i guess that a related question would be "are 1000 smaller and denser
>>> bulk files better than 100 larger bulk files produced under a typical
>>> getSplits() scheme?"
>>> thanks,
>>> jeff

View raw message