accumulo-user mailing list archives

From Adam Fuchs <scubafu...@gmail.com>
Subject Re: Seeking Iterator
Date Mon, 12 Jan 2015 22:29:14 GMT
On Mon, Jan 12, 2015 at 4:10 PM, Josh Elser <josh.elser@gmail.com> wrote:
> seek()'ing doesn't always imply an increase in performance -- remember that
> RFiles (the files that back Accumulo tables) are composed of multiple
> blocks/sections with an index of them. A seek consists of using that
> index to find the block/section of the RFile and then a linear scan forward
> to find the first key for the range you seek()'ed to.
>
> Thus, if you're repeatedly re-seek()'ing within the same block, you'll waste
> a lot of time re-reading the same data. In your situation, it sounds like the
> cost of re-reading the data after a seek is about the same as naively
> consuming the records.
>
> You can try altering table.file.compress.blocksize (and then compacting your
> table) to see how this changes.
>

There is actually some fairly well-optimized code in the RFile seek
that minimizes the re-reading of RFile data and index blocks. Seeking
forward by one key adds a couple of key comparisons and function
calls, but that's about it. Incidentally, key comparisons are pretty
high up on my list of things that could use some performance
optimization.
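
To make the idea concrete, here is a rough toy model of a block-indexed seek
that skips reloading a block when the target lies ahead of the current
position in the block that is already loaded. This is only a sketch and not
Accumulo's actual RFile code: the class ToyBlockReader, its fields, and the
fixed-size blocking are all made up for illustration.

```python
from bisect import bisect_left


class ToyBlockReader:
    """Toy stand-in for a block-indexed sorted file (NOT Accumulo's RFile)."""

    def __init__(self, sorted_keys, block_size):
        # Split the sorted keys into fixed-size blocks (an assumption made
        # for simplicity; real files size blocks by bytes, not key count).
        self.blocks = [sorted_keys[i:i + block_size]
                       for i in range(0, len(sorted_keys), block_size)]
        # The index stores the last key of each block.
        self.index = [block[-1] for block in self.blocks]
        self.block_no = None   # currently loaded block, if any
        self.pos = 0           # position inside the loaded block
        self.block_loads = 0   # how many times a block was (re)loaded

    def _load(self, block_no):
        self.block_no = block_no
        self.pos = 0
        self.block_loads += 1

    def seek(self, target):
        """Position at the first key >= target and return it, or None."""
        cur = self.blocks[self.block_no] if self.block_no is not None else None
        # Cheap check: is the target at or ahead of the current position,
        # and still inside the block that is already loaded?
        in_current = (cur is not None
                      and self.pos < len(cur)
                      and cur[self.pos] <= target <= cur[-1])
        if not in_current:
            # Otherwise consult the index for the first block whose
            # last key is >= target, and load that block.
            block_no = bisect_left(self.index, target)
            if block_no == len(self.blocks):
                return None  # target is past the last key in the file
            self._load(block_no)
            cur = self.blocks[block_no]
        # Linear scan forward from the current position.
        while cur[self.pos] < target:
            self.pos += 1
        return cur[self.pos]


reader = ToyBlockReader(list(range(100)), block_size=10)
reader.seek(5)    # loads block 0
reader.seek(6)    # same block, ahead of position: no reload
print(reader.block_loads)  # 1
reader.seek(25)   # different block: one more load
print(reader.block_loads)  # 2
```

In this model, seeking forward by one key costs a few comparisons and no
extra block load, while a backward seek or a seek into another block goes
back through the index.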

Adam
