accumulo-dev mailing list archives

From Jeff Kubina <jeff.kub...@gmail.com>
Subject How to best measure how the lack of data-locality affects query performance
Date Wed, 07 Oct 2015 16:46:27 GMT
Per my thread "How does Accumulo process r-files for bulk ingesting?"
on the user@ list, I would like to measure how a lack of data locality
in bulk-ingested files affects query performance. I would appreciate
comments and suggestions on the following outline of the test design:

Outline:
1. Create a table and pre-split it into m tablets, where m is the total
number of tservers.
2. Create 1 r-file containing m*n records distributed evenly across
the m tablets (n records per tablet).
3. Bulk ingest the r-file.
4. Query each of the split ranges in the table and log their times.
5. Compact the table and wait for the compaction to complete.
6. Query each of the split ranges in the table and log their times.
7. Compute the ratio of the median times from steps 4 and 6.
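
The bookkeeping for steps 1, 2, and 7 can be sketched as follows. This is a minimal sketch, not a definitive implementation: the zero-padded decimal row-key scheme and the helper names row_key, make_splits, make_rows, and median_ratio are assumptions made for illustration. It relies on each tablet covering the rows after the previous split point up to and including its own split point, so each split is chosen as the last row of its tablet.

```python
import statistics

def row_key(i, width=10):
    # Hypothetical key scheme: zero-padded decimal so that lexicographic
    # order matches numeric order. width must be large enough for m*n - 1.
    return f"{i:0{width}d}"

def make_splits(m, n, width=10):
    # m tablets need m - 1 split points. Each split is the last row of
    # its tablet, so every tablet holds exactly n rows.
    return [row_key(t * n - 1, width) for t in range(1, m)]

def make_rows(m, n, width=10):
    # The m*n sorted row keys that would be written to the single r-file.
    return [row_key(i, width) for i in range(m * n)]

def median_ratio(times_before, times_after):
    # Step 7: ratio of the median query time before compaction (step 4)
    # to the median query time after compaction (step 6).
    return statistics.median(times_before) / statistics.median(times_after)
```

A ratio above 1 from median_ratio would indicate that queries were slower before the compaction than after it.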

Questions:
1. Instead of compacting the table, should I create a new table by
generating m r-files, each of whose ranges intersects only one
tablet, and bulk ingesting them?

2. What is a good size for n, the number of records per tablet server?
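
For the alternative in question 1, the records would first have to be grouped so that each r-file covers exactly one tablet. A minimal sketch of that grouping, assuming sorted string split points and the hypothetical helper name partition_by_tablet (writing the actual r-files is not shown):

```python
from bisect import bisect_left

def partition_by_tablet(rows, splits):
    # splits must be sorted. Tablet i holds rows greater than splits[i-1]
    # and less than or equal to splits[i], so bisect_left gives the index
    # of the tablet that owns each row.
    groups = [[] for _ in range(len(splits) + 1)]
    for row in sorted(rows):
        groups[bisect_left(splits, row)].append(row)
    return groups
```

Each inner list would then become the content of one r-file, giving m files for m tablets.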
