accumulo-dev mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: How to best measure how the lack of data-locality affects query performance
Date Wed, 07 Oct 2015 16:58:24 GMT


Jeff Kubina wrote:
> Per my thread "How does Accumulo process r-files for bulk ingesting?"
> on the user@ list I would like to test/measure how a lack of
> data-locality of bulk ingested files affects query performance. I seek
> comments/suggestions on the outline of the design for the test:
>
> Outline:
> 1. Create a table and pre-split it to have m tablets where m="total tservers".
> 2. Create 1 r-file containing m*n records that evenly distribute
> across the m tablets.
> 3. Bulk ingest the r-file.
> 4. Query each of the split ranges in the table and log their times.
> 5. Compact the table and wait for the compaction to complete.
> 6. Query each of the split ranges in the table and log their times.
> 7. Compute the ratio of the median times from steps 4 and 6.
>
> Questions:
> 1. Instead of compacting the table should I create a new table by
> generating the m r-files whose ranges intersect only one of the
> tablets and bulk ingest them?

If you can arrange for the data to be evenly balanced in your
non-data-local case, you could just do one import, run the queries,
compact, and then rerun the queries on the same table.
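
A minimal sketch of the comparison in step 7 of the outline, assuming the
per-range query times from before and after the compaction have already been
collected into two lists. The function name and the timing values are
hypothetical and are not measurements from any cluster.

```python
from statistics import median


def median_ratio(before_ms, after_ms):
    """Return median(before) / median(after) for two lists of query times."""
    if not before_ms or not after_ms:
        raise ValueError("need at least one timing per phase")
    return median(before_ms) / median(after_ms)


# Hypothetical timings in milliseconds, one entry per split range.
non_local_ms = [120.0, 135.0, 128.0, 140.0]  # after bulk import (step 4)
local_ms = [80.0, 82.0, 79.0, 85.0]          # after compaction (step 6)

ratio = median_ratio(non_local_ms, local_ms)
print(f"median slowdown without locality: {ratio:.2f}x")
```

A ratio above 1 would suggest the non-data-local queries were slower. With
only m samples per phase, repeating each query several times is probably
worthwhile before reading much into the number.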

You'd just want to make sure you have a decent distribution of the data
across all servers in both the data-local and non-data-local cases.
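
One way to check the distribution before ingesting, sketched under the
assumption that rows are zero-padded integers and that the table is pre-split
into m tablets with m-1 split points. The helper names are hypothetical. The
sketch relies on a tablet's end row being inclusive, so each split point is
the last row of its tablet.

```python
from bisect import bisect_left
from collections import Counter


def make_rows(m, n, width=10):
    """Return m*n zero-padded row IDs in sorted order."""
    return [f"{i:0{width}d}" for i in range(m * n)]


def make_splits(m, n, width=10):
    """Return m-1 split points; each is the last row of its tablet."""
    return [f"{k * n - 1:0{width}d}" for k in range(1, m)]


def rows_per_tablet(rows, splits):
    """Count rows per tablet index, treating each split as an inclusive end row."""
    # bisect_left places a row equal to a split point in that split's tablet.
    return Counter(bisect_left(splits, row) for row in rows)


counts = rows_per_tablet(make_rows(4, 3), make_splits(4, 3))
print(dict(counts))
```

The same counting could be applied to real row keys and real split points to
confirm that no tablet ends up much larger than the others.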

> 2. What is a good size for n, the number of records per tablet server?

I'm wondering if it depends on the type of workload that you're looking 
to run. Does it make a difference if you're just running randomized 
point queries? Or doing a scan over the entire table?

Assuming you're just doing one tablet per server for your table (I
don't see a reason that would make for a lesser test), I'd guess a
couple hundred MB worth of records per tablet would be good. That is
enough to get a few HDFS blocks per RFile, but not enough that Accumulo
would automatically split the tablet out from under you. You could also
try increasing the split threshold and putting more data per file.
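
A rough sizing sketch for n, assuming a 128 MB HDFS block size and a 1 GB
split threshold. Both values are assumptions to check against the cluster's
own configuration, and the 100-byte record size is hypothetical. RFile
compression will usually make the on-disk size smaller than the raw record
size, so treat the result as an upper bound on the file size.

```python
import math

MB = 1024 * 1024


def size_tablet(target_bytes, record_bytes,
                block_bytes=128 * MB, split_threshold_bytes=1024 * MB):
    """Estimate record count and HDFS block count for one tablet's data."""
    return {
        "records": target_bytes // record_bytes,
        "hdfs_blocks": math.ceil(target_bytes / block_bytes),
        "below_split_threshold": target_bytes < split_threshold_bytes,
    }


# About 300 MB per tablet with hypothetical 100-byte records.
plan = size_tablet(300 * MB, 100)
print(plan)
```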
