accumulo-notifications mailing list archives

From "Ivan Bella (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-4667) LocalityGroupIterator very inefficient with large locality groups
Date Tue, 27 Jun 2017 19:45:00 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065362#comment-16065362
] 

Ivan Bella commented on ACCUMULO-4667:
--------------------------------------

[~kturner] You are correct.  I believe that is what the count is used for in the map passed
into the seek call.  I will use that to pre-filter the locality groups, as is currently being
done in the seek.

> LocalityGroupIterator very inefficient with large locality groups
> -----------------------------------------------------------------
>
>                 Key: ACCUMULO-4667
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4667
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: tserver
>    Affects Versions: 1.6.6, 1.7.3, 1.8.1, 2.0.0
>            Reporter: Ivan Bella
>            Assignee: Ivan Bella
>             Fix For: 1.8.2, 2.0.0
>
>
> On one of our systems we tracked some scans that were taking an extremely long time to
complete (many hours).  As it turns out the scan was relatively simple in that it was scanning
a tablet for all keys that had a specific column family.  Note that there was very little
data that actually matched this column family.  Upon tracing the code we found that it was
spending a large amount of time in the LocalityGroupIterator.  Stack traces continually found
the code to be at line 128 or 129 of the LocalityGroupIterator.  Those line numbers are consistent
from the 1.6 series all the way to 2.0.0 (master).  In this case the column family being searched
for was included in one of a dozen or so locality groups on that table, and the locality group
itself had 40 or so column families.  We see several things that can be done here:
> 1) The code that checks the group column families against those being searched for can
exit as soon as it finds a match.
> 2) The code that checks the group column families against those being searched for can
look at the relative size of those two sets and invert the logic appropriately for a more
efficient loop.
> 3) We could create a cached map of column families to locality groups allowing us to
avoid examining each locality group every time we seek.
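
The three ideas above can be sketched roughly as follows. This is a minimal illustration only,
not the actual LocalityGroupIterator code: the class and method names are hypothetical, column
families are modeled as plain Strings rather than Accumulo's byte sequences, the default
locality group and the inclusive flag of seek are ignored, and Java 9+ is assumed for
List.of and Set.of.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; names and types are assumptions, not Accumulo's API.
final class LocalityGroupLookupSketch {

  // Ideas 1 and 2: iterate over the smaller set, probe the larger one,
  // and return as soon as one common column family is found.
  static boolean intersects(Set<String> groupFamilies, Set<String> requested) {
    Set<String> smaller = groupFamilies.size() <= requested.size() ? groupFamilies : requested;
    Set<String> larger = smaller == groupFamilies ? requested : groupFamilies;
    for (String cf : smaller) {
      if (larger.contains(cf)) {
        return true; // early exit on the first match
      }
    }
    return false;
  }

  // Idea 3: build a column family -> group index map once, so a seek does
  // not have to examine every locality group.
  static Map<String, Integer> buildFamilyIndex(List<Set<String>> groups) {
    Map<String, Integer> index = new HashMap<>();
    for (int i = 0; i < groups.size(); i++) {
      for (String cf : groups.get(i)) {
        index.put(cf, i);
      }
    }
    return index;
  }

  // Resolve the requested families to the indexes of the groups that hold them.
  // Families not present in any group are simply skipped here.
  static Set<Integer> groupsFor(Map<String, Integer> index, Set<String> requested) {
    Set<Integer> result = new TreeSet<>();
    for (String cf : requested) {
      Integer group = index.get(cf);
      if (group != null) {
        result.add(group);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<Set<String>> groups =
        List.of(Set.of("name", "email"), Set.of("clicks", "views", "purchases"));
    Map<String, Integer> index = buildFamilyIndex(groups);

    System.out.println(intersects(groups.get(1), Set.of("views")));   // true
    System.out.println(groupsFor(index, Set.of("views", "unknown"))); // [1]
  }
}
```

With the cached map, the cost of a seek depends on the number of requested column families
rather than on the number of locality groups times their size, which is the case that hurt in
the scan described above.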



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
