accumulo-notifications mailing list archives

From "Keith Turner (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-1453) Track tablet migrations and failed loads
Date Thu, 23 May 2013 19:09:20 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13665507#comment-13665507 ]

Keith Turner commented on ACCUMULO-1453:
----------------------------------------

One of the usual suspects for failed loads is walog recovery problems.  The problem with RFiles
mentioned in the issue description seems trickier to isolate.  If a tablet has a problematic
file, the tablet will likely load successfully.  The failure will occur later, when a scan of
the RFile is attempted.  When a tablet server fails, how do you know which tablet(s) caused
the problem?

Another approach to solving this issue may be to identify what can cause tablet server failure
and try to defend against those causes.  One possible cause is a key/value in an RFile that
exceeds available memory.  This would be easy to defend against by making Accumulo refuse to
load key/values that are too large.  Another possible cause is an iterator that runs amok and
consumes all memory.  This is harder to defend against; ACCUMULO-1188 is one approach.
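
A minimal sketch of the size-limit defense described above, written in Python for brevity.
It assumes a hypothetical record layout in which each key and each value is preceded by a
4-byte big-endian length; this is not the actual RFile format. The names read_field,
read_entries, encode_entry and EntryTooLargeError are illustrative only and are not Accumulo
APIs. The idea is that the declared length is checked before any of the field is read into
memory, so an oversized entry is refused rather than loaded.

```python
import io
import struct


class EntryTooLargeError(Exception):
    """Raised when a declared key or value length exceeds the limit."""


def read_field(stream, max_bytes):
    """Read one length-prefixed field, or return None at a clean end of stream."""
    header = stream.read(4)
    if header == b"":
        return None  # nothing left to read
    if len(header) < 4:
        raise EOFError("truncated length prefix")
    (length,) = struct.unpack(">I", header)
    # Check the declared size BEFORE reading the payload into memory.
    if length > max_bytes:
        raise EntryTooLargeError(
            f"declared length {length} exceeds limit {max_bytes}"
        )
    data = stream.read(length)
    if len(data) < length:
        raise EOFError("truncated field")
    return data


def read_entries(stream, max_bytes):
    """Yield (key, value) pairs, refusing any field larger than max_bytes."""
    while True:
        key = read_field(stream, max_bytes)
        if key is None:
            return
        value = read_field(stream, max_bytes)
        if value is None:
            raise EOFError("key without a value")
        yield key, value


def encode_entry(key, value):
    """Helper that writes one entry in the hypothetical layout (for the demo)."""
    return (
        struct.pack(">I", len(key)) + key
        + struct.pack(">I", len(value)) + value
    )
```

For example, with a limit of 10 bytes, a stream built from encode_entry(b"row1", b"v1")
followed by an entry with a 100-byte value yields the first pair and then raises
EntryTooLargeError on the second, without reading the 100-byte value.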


                
> Track tablet migrations and failed loads
> ----------------------------------------
>
>                 Key: ACCUMULO-1453
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1453
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: master, tserver
>    Affects Versions: 1.5.0, 1.4.3
>            Reporter: Mike Drob
>
> If a bad RFile or Tablet somehow gets in the system and brings down a tserver, then as
> the master migrates it to other servers it will likely cause cascading failures.
> It might be a good idea to keep track of how many consecutive failures to load there
> are for a given tablet, and either warn or refuse to host the tablet if this value exceeds
> a given threshold.
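
The bookkeeping proposed in the issue description could look roughly like the following
sketch, written in Python for brevity. The class name LoadFailureTracker, its method names,
the "host"/"warn"/"refuse" decisions and the default thresholds are all assumptions made for
illustration; none of them come from Accumulo. The counter tracks consecutive failures only,
so a successful load resets it.

```python
from collections import defaultdict


class LoadFailureTracker:
    """Counts consecutive load failures per tablet (hypothetical helper)."""

    def __init__(self, warn_threshold=1, refuse_threshold=2):
        # Both thresholds are illustrative defaults, not Accumulo settings.
        self.warn_threshold = warn_threshold
        self.refuse_threshold = refuse_threshold
        self._failures = defaultdict(int)

    def record_failure(self, tablet_id):
        """Note one more failed load and return the new consecutive count."""
        self._failures[tablet_id] += 1
        return self._failures[tablet_id]

    def record_success(self, tablet_id):
        """A successful load ends the streak, so forget the tablet."""
        self._failures.pop(tablet_id, None)

    def decision(self, tablet_id):
        """Return 'host', 'warn' or 'refuse' for the next assignment attempt."""
        count = self._failures.get(tablet_id, 0)
        # "Exceeds a given threshold" is read here as strictly greater than.
        if count > self.refuse_threshold:
            return "refuse"
        if count > self.warn_threshold:
            return "warn"
        return "host"
```

A master-side caller would call record_failure when a tablet load fails (or the hosting
server dies during the load), record_success when the load completes, and consult decision
before assigning the tablet to another server.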

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
