accumulo-notifications mailing list archives

From "Keith Turner (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-4164) Avoid copy of RFile Index blocks when in cache
Date Tue, 13 Sep 2016 19:00:23 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-4164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15488088#comment-15488088 ]

Keith Turner commented on ACCUMULO-4164:
----------------------------------------

I finally got around to creating something to do this.  I was going to use the new RFile API
in 1.8.0 to show the change, but 1.8.0 has always had this improvement.  So I wrote something
that uses the internal RFile APIs in 1.6 and 1.7.  The following repo has a random seek test
that continually does 1,000 random seeks against an RFile with 10M key values.  Running this
test against 1.7.1, the times converge to ~97ms for 1,000 random seeks.  Running this test
against 1.7.2, the times converge to ~11ms for 1,000 random seeks.

https://github.com/keith-turner/rfile-pert-test
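
For readers who do not want to clone the repo, the shape of such a test can be sketched roughly
as follows.  This is not the code from the repo above: the class name, the method signature and
the `seek` callback are all hypothetical, and the callback stands in for whatever actually
performs one seek against an opened RFile.

```java
import java.util.Random;
import java.util.function.LongConsumer;

// Hypothetical harness, not taken from the linked repo.
final class RandomSeekTimer {

    /**
     * Runs `rounds` batches of `seeksPerBatch` random seeks and returns the
     * elapsed milliseconds of each batch, so the caller can watch the
     * per-batch times converge.
     *
     * @param seek    assumed callback that performs one seek for the given key number
     * @param numKeys number of key values in the file (10M in the comment above)
     */
    static double[] run(LongConsumer seek, long numKeys, int seeksPerBatch, int rounds, long seed) {
        Random rand = new Random(seed);
        double[] millis = new double[rounds];
        for (int r = 0; r < rounds; r++) {
            long start = System.nanoTime();
            for (int i = 0; i < seeksPerBatch; i++) {
                // Uniform key number in [0, numKeys)
                long key = (long) (rand.nextDouble() * numKeys);
                seek.accept(key);
            }
            millis[r] = (System.nanoTime() - start) / 1e6;
        }
        return millis;
    }
}
```

A caller would pass something like `RandomSeekTimer.run(k -> reader.seek(k), 10_000_000L, 1000, 50, 42L)`,
where `reader.seek` is again only a placeholder, and print the returned array.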

> Avoid copy of RFile Index blocks when in cache
> ----------------------------------------------
>
>                 Key: ACCUMULO-4164
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4164
>             Project: Accumulo
>          Issue Type: Improvement
>    Affects Versions: 1.6.5, 1.7.1
>            Reporter: Keith Turner
>            Assignee: Keith Turner
>             Fix For: 1.6.6, 1.7.2, 1.8.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I have been doing performance experiments with RFile.  During the course of these experiments
I noticed that RFile is not as fast as it should be in the case where index blocks are in
cache and the RFile is not already open.  The reason is that the RFile code copies and deserializes
the index data even though it's already in memory.
> I made the following changes to RFile in a branch.
>  * Avoid copy of index data when it's in cache
>  * Deserialize offsets lazily (instead of upfront) during binary search
>  * Stopped calling lots of synchronized methods during deserialization of index info.
 The existing code uses ByteArrayInputStream, which results in lots of fine grained synchronization.
 Switching to an input stream that offers the same functionality w/o sync showed a measurable
performance difference.  
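
The three changes listed above can be illustrated with a small sketch.  It is not the RFile code:
the byte layout (`[count][offset table][length-prefixed keys]`) is invented for the example, and
absolute `ByteBuffer` reads are used as a stand-in for the unsynchronized input stream mentioned
in the last bullet.  The point is only that the cached array is wrapped rather than copied, and
that an offset is decoded only when the binary search visits that entry.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical layout, NOT the real RFile index format:
// [int count][int offset_0 .. offset_{count-1}][entries: int keyLen, key bytes]
final class LazyIndexSketch {
    private final ByteBuffer data; // read-only view over the cached bytes, no copy
    private final int count;

    LazyIndexSketch(byte[] cachedBlock) {
        this.data = ByteBuffer.wrap(cachedBlock).asReadOnlyBuffer();
        this.count = data.getInt(0);
    }

    // An offset is decoded only when the search asks for entry i.
    private int offsetOf(int i) {
        return data.getInt(4 + 4 * i);
    }

    // Compares the key stored at entry i with target, as unsigned bytes.
    private int compareKeyAt(int i, byte[] target) {
        int pos = offsetOf(i);
        int len = data.getInt(pos);
        int n = Math.min(len, target.length);
        for (int j = 0; j < n; j++) {
            int c = (data.get(pos + 4 + j) & 0xff) - (target[j] & 0xff);
            if (c != 0) return c;
        }
        return len - target.length;
    }

    /** Returns the position of the first key >= target, or count if there is none. */
    int lowerBound(byte[] target) {
        int lo = 0, hi = count;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (compareKeyAt(mid, target) < 0) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    /** Builds a block in the hypothetical layout; keys must already be sorted. */
    static byte[] serialize(String... sortedKeys) {
        byte[][] keys = new byte[sortedKeys.length][];
        int size = 4 + 4 * sortedKeys.length;
        for (int i = 0; i < keys.length; i++) {
            keys[i] = sortedKeys[i].getBytes(StandardCharsets.UTF_8);
            size += 4 + keys[i].length;
        }
        ByteBuffer buf = ByteBuffer.allocate(size);
        buf.putInt(keys.length);
        int pos = 4 + 4 * keys.length;
        for (byte[] k : keys) { buf.putInt(pos); pos += 4 + k.length; }
        for (byte[] k : keys) { buf.putInt(k.length); buf.put(k); }
        return buf.array();
    }
}
```

A binary search over N entries touches roughly log2(N) offsets, so decoding all N upfront is
mostly wasted work when the block is only consulted for a few seeks.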
> These changes lead to better performance in the following two situations:
>  * When an RFile's data is in cache, but it's not open on the tserver.  
>  * For RFiles with multilevel indexes with index data in cache.   Currently an open RFile
only keeps the root node in memory.   Lower level index nodes are always read from the cache
or DFS.   The changes I made would always avoid the copy and deserialization of lower level
index nodes when in cache.
> I have seen significant performance improvements testing with the two cases above.  My
tests are currently based on a new API I am creating for RFile, so I cannot easily share them
until I get that pushed.  
> For the case where a tserver has all files frequently in use already open and those files
have a single level index, these changes should not make a significant performance difference.
> These changes should result in less memory use for opening the same rfile multiple times
for different scans (when data is in cache).  In this case all of the RFiles would share the
same byte array holding the serialized index data.
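
The memory effect described in the last paragraph can be shown with a toy example.  Everything
here is hypothetical (the map standing in for the index block cache, the `Reader` class, both
`open` methods); it only contrasts a copy per open with several readers holding a reference to
the same cached array.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model, not Accumulo code: a map stands in for the index block cache.
final class SharedIndexSketch {
    private static final Map<String, byte[]> CACHE = new ConcurrentHashMap<>();

    /** Hypothetical reader that just holds on to its serialized index bytes. */
    static final class Reader {
        final byte[] indexBytes;
        Reader(byte[] indexBytes) { this.indexBytes = indexBytes; }
    }

    static void cache(String file, byte[] serializedIndex) {
        CACHE.put(file, serializedIndex);
    }

    // Copy on every open: each reader pays for its own array.
    static Reader openWithCopy(String file) {
        return new Reader(CACHE.get(file).clone());
    }

    // Share on open: every reader of the same file references one array.
    // Readers must treat the array as read-only for this to be safe.
    static Reader openShared(String file) {
        return new Reader(CACHE.get(file));
    }
}
```

With sharing, opening the same file for N concurrent scans costs one index array instead of N.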



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
