accumulo-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jeff Turner <>
Subject Re: comparing different rfile "densities"
Date Tue, 11 Nov 2014 19:57:48 GMT
is a file listed in metadata under *all* of the tablets that might have 
entries in the file?

(this example is probably bad, but i hope get get the gist):

if a table has tablets for rows A, B, C, D, and E, and a new rfile has
entries for B and E, would tablets B, C, D, and E all have pointers to 
the new file,
in !METADATA?  or is there some cleverness, and only B and E point to 
the file?

On 11/11/14 11:49 AM, Josh Elser wrote:
> You could also affirm your thoughts about RFile usage by generating a 
> histogram over the metadata table for your table.
> The table ID is the common prefix in the metadata table. Each unique 
> row is a tablet, and will contain zero to many keys with a column 
> family of 'file'. The column qualifier is a URI to the file (can you 
> make a Path object from it). The value is a CSV where the first 
> element is the approximate size of the file (bytes) and the second 
> element is the number of key-value pairs in the file.
> This might help you make a more quantitative analysis over the 
> distribution of files as you tweak things.
> To reaffirm what Mike said, trying to get close to 1:1 files to 
> tablets is definitely ideal but can be difficult to manage when 
> considering all of potential knobs you can turn (for both ingest and 
> query characteristics).
> Mike Drob wrote:
>> I'm not sure how to quantify this and give you a way to verify, but in
>> my experience you want to be producing rflies that load into a single
>> tablet. Typically, this means number of reducers equal to the number of
>> tablets in the table that you will be importing and perhaps a custom
>> partitioner. I think your intuition is spot on, here.
>> Of course, if that means that you have a bunch of tiny files, then maybe
>> it's time to rethink your split strategy.
>> On Tue, Nov 11, 2014 at 5:56 AM, Jeff Turner <
>> <>> wrote:
>>     is there a good way to compare the overall system effect of
>>     bulk loading different sets of rfiles that have the same data,
>>     but very different "densities"?
>>     i've been working on a way to re-feed a lot of data in to a table,
>>     and have started to believe that our default scheme for creating
>>     rfiles - mapred in to ~100-200 splits, sampled from 50k tablets -
>>     is actually pretty bad.  subjectively, it feels like rfiles that 
>> "span"
>>     300 or 400 tablets is bad in at least two ways for the tservers -
>>     until the files are compacted, all of the "potential" tservers have
>>     to check the file, right?  and then, during compaction, do portions
>>     of that rfile get volleyed around the cloud until all tservers
>>     have grabbed their portion?  (so, there's network overhead, 
>> repeatedly
>>     reading files and skipping most of the data, ...)
>>     if my new idea works, i will have a lot more control over the 
>> density
>>     of rfiles, and most of them will span just one or two tablets.
>>     so, is there a way to measure/simulate overall system benefit or 
>> cost
>>     of different approaches to building bulk-load data (destined for an
>>     established table, across N tservers, ...)?
>>     i guess that a related question would be "are 1000 smaller and 
>> denser
>>     bulk files better than 100 larger bulk files produced under a 
>> typical
>>     getSplits() scheme?"
>>     thanks,
>>     jeff

View raw message