accumulo-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Adam Fuchs (Created) (JIRA)" <j...@apache.org>
Subject [jira] [Created] (ACCUMULO-543) why are we still reusing RFile.Readers?
Date Wed, 18 Apr 2012 20:18:41 GMT
why are we still reusing RFile.Readers?
---------------------------------------

                 Key: ACCUMULO-543
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-543
             Project: Accumulo
          Issue Type: Improvement
          Components: tserver
            Reporter: Adam Fuchs
            Assignee: Adam Fuchs


The TabletServer includes a bunch of code to manage reuse of RFile.Reader objects across scans.
There are a few reasons for doing this that certainly made sense in the past, including:
* Reducing index reads. If we have already found a location in the RFile index, and we can
easily determine that this is the location that we need for the next read, then we can skip
doing a binary search in the RFile index. Index caching and hierarchical indexes may improve
this aspect to where we don't need this optimization.
* Reducing re-reading of the same data block. Re-reading a block used to mean that HDFS would
create a new socket rather than reusing the current socket. Seeking to the same spot HDFS
now reuses connections, so this is not as big a deal.
* Reducing namenode operations. Re-opening an RFile requires knowledge of the RFile's size.
Even if we have the block locations cached, the file size still includes an HDFS namenode
operation. This is something we should probably be caching within the TabletServer, since
our RFiles are immutable.

There are also a few problems with reusing RFiles, including:
* Increased TabletServer complexity. We need to manage the resource pool among many scans.
Not reusing readers along with ACCUMULO-416 may make it possible to eliminate coordination
of RFile.Readers at the TabletServer level altogether.
* Risk of information leakage. RFile.Readers hold state, and we don't want that state passing
between scan sessions. Sharing RFile.Readers makes it more likely that a bug in this class
exists.
* Risk of interference. RFile.Readers are not thread safe now, but even if they were we wouldn't
want separate scan sessions (or maybe even multiple threads in the same scan session) affecting
each other through the RFile.Reader. Sharing RFile.Readers also makes it more likely that
a bug in this class exists.

The relevant question is what would be the effect on creating a new RFile.Reader anytime we
want to read an RFile instead of grabbing one from the pool? ACCUMULO-416 is certainly a prerequisite
for this ticket, and we know we need to cache the RFile length. Are there any other prerequisites?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message