accumulo-notifications mailing list archives

From "Christopher Tubbs (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (ACCUMULO-775) Optimize iterator seek() method when seeking forward
Date Tue, 18 Dec 2012 17:08:12 GMT

     [ https://issues.apache.org/jira/browse/ACCUMULO-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christopher Tubbs updated ACCUMULO-775:
---------------------------------------

    Fix Version/s:     (was: 1.5.0)
    
> Optimize iterator seek() method when seeking forward
> ----------------------------------------------------
>
>                 Key: ACCUMULO-775
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-775
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: tserver
>            Reporter: Christopher Tubbs
>            Assignee: Keith Turner
>            Priority: Minor
>              Labels: iterator, scan, seek
>
> At present, seeking is a very expensive operation. Yet it is very common, especially when writing filtering/consuming/skipping iterators, to seek to the next possible match (perhaps in the next row, when matching a column family with a regular expression) rather than continue to iterate. A common workaround for the scan-or-seek decision is to keep scanning for some threshold (~10-20 entries), hoping to just "run into" the next possible match, rather than waste resources seeking directly to it.
> This pattern can be rolled into the lower-level iterator, so that iterators on top don't have to implement it themselves. They can simply seek, and the underlying source iterator can consume the next X entries when that makes sense, rather than waste resources seeking.
> I could be wrong (please comment and correct me below if I am), but I imagine this makes the most sense when the data being sought is in the current compressed block of the underlying file, especially if it lies forward of the current pointer. A better seek method should be able to tell where it currently is and whether the requested data is within reach, without performing all the expensive operations: re-seeking to the compressed block that is already loaded, reloading it, decompressing it, and scanning to the requested starting point.
> Such an optimization would eliminate the need for users to calibrate their own scan vs. seek heuristic by guessing whether their data is in the current block or another one, while still giving them the same performance benefit.
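
A minimal sketch of the idea described above, in plain Java. It does not use Accumulo's actual iterator or RFile classes: the class name, the fixed-size "blocks" over an in-memory sorted list, and the two counters are all hypothetical, chosen only to illustrate a source that scans forward when the seek target lies ahead of the current position within the already-loaded block, and falls back to a full (expensive) seek otherwise.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not Accumulo's API or file format. */
class BlockAwareSeekSketch {
    private final List<String> keys;   // sorted keys, standing in for file entries
    private final int blockSize;       // entries per simulated "compressed block"
    private int pos = 0;               // index of the current top entry

    int expensiveSeeks = 0;            // full seeks (index lookup + block reload)
    int entriesSkipped = 0;            // entries consumed by scanning forward instead

    BlockAwareSeekSketch(List<String> sortedKeys, int blockSize) {
        this.keys = new ArrayList<>(sortedKeys);
        this.blockSize = blockSize;
    }

    /** Positions the source at the first key >= target. */
    void seek(String target) {
        boolean forward = pos < keys.size() && target.compareTo(keys.get(pos)) >= 0;
        if (forward) {
            // Exclusive end index of the block that holds the current entry.
            int blockEnd = Math.min(keys.size(), (pos / blockSize + 1) * blockSize);
            String lastInBlock = keys.get(blockEnd - 1);
            if (target.compareTo(lastInBlock) <= 0) {
                // Target is within the loaded block: scan instead of re-seeking.
                while (keys.get(pos).compareTo(target) < 0) {
                    pos++;
                    entriesSkipped++;
                }
                return;
            }
        }
        fullSeek(target);
    }

    /** Stand-in for the expensive path: locate, reload and decompress a block. */
    private void fullSeek(String target) {
        expensiveSeeks++;
        int lo = 0, hi = keys.size();
        while (lo < hi) {              // binary search for the first key >= target
            int mid = (lo + hi) >>> 1;
            if (keys.get(mid).compareTo(target) < 0) lo = mid + 1; else hi = mid;
        }
        pos = lo;
    }

    boolean hasTop() { return pos < keys.size(); }

    String getTopKey() { return keys.get(pos); }
}
```

With this shape, an iterator stacked on top can always call seek() and leave the scan-or-seek decision to the source. A backward seek, or a seek past the end of the current block, still takes the expensive path.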

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
