Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Date: Fri, 12 Jul 2013 17:45:49 +0000 (UTC)
From: "Enis Soztutar (JIRA)" <jira@apache.org>
To: issues@hbase.apache.org
Message-ID: <JIRA.12657395.1373594018415.40708.1373651149606@arcas>
In-Reply-To: <JIRA.12657395.1373594018415@arcas>
References: <JIRA.12657395.1373594018415@arcas>
Subject: [jira] [Commented] (HBASE-8935) IntegrationTestBigLinkedList fails
 under load on 0.94 due to some scan issues
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HBASE-8935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707141#comment-13707141 ] 

Enis Soztutar commented on HBASE-8935:
--------------------------------------

+1 on the patch. 
                
> IntegrationTestBigLinkedList fails under load on 0.94 due to some scan issues
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-8935
>                 URL: https://issues.apache.org/jira/browse/HBASE-8935
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HBASE-8935-v0-94-logging.patch
>
>
> We observe occasional failures (4-5k undefined and unreferenced nodes in the list, running) when the cluster is stressed while running IntegrationTestBigLinkedList on 0.94. Unfortunately debug logging is disabled.
> If Verify is rerun after the fact, the data is there, so it's not data loss. All the missing keys come from very narrow range from the very beginning of  one region; most of this region is not affected.
> In the case I'm looking at, region range is
> {code}\xD0\x0C`=%\xA2\xD3\xA8\x0A\xF8\x917\x05\x94\x1D> .. \xD4\x0Bx\xAF\x88\x0C\xF4"7\x9A\x9F{\xCE\x0E\x8A{code}
> and problematic renage
> {code}\xD0\x0C`QLD\xED2\xD5c\x8D\xDB5\x01\xD2H] ... \xD0\x0D\x89)\x11\x0E8\xC5\x08o\xD7\xE3$\xB3\xAAu{code}
> There are no splits and compactions on the region during MR job; there are no compactions after that could have affected the data, although there is one flush that happened long after, and region was moved and reopened (before I ran verify manually that showed that data is in HBase).
> One thing that happened was that the scanners used by all the map tasks had lease expirations during the MR job that had missing data, some of them twice.
> First scanner expiration for each task I looked at was exactly 1 minute after "Processing split" log message, when the scanner is opened.
> Only the tasks where the scanners have expired twice (2 of them) logged any errors; presumably one expiration in each task happened after the reading was finished, because everything was slow and scanner wasn't closed in time - no errors or exceptions are logged in the tasks where scanner only expired once.
> The job I ran afterwards had no scanner expirations so it does close them correctly in normal case...
> The task that was reading the problematic range had one scanner expiration and didn't log any errors.
> It's a little bit special (or it may be a total red herring) in that in other tasks, scanner expired while the task was splitting partial outputs (which may mean end of input reading?), whereas in the task that lost data spilling started long (~40s) after the scanner expired.
> However there's one another such task (spilling started 25s after scanner expired) and it didn't log any errors and didn't lose data.
> At this point I am not sure about the root cause but I suspect it might be related to scanner expiration handling, given the patterns.
> One thing for sure is that there isn't enough logging... so I will start with adding that.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira