accumulo-notifications mailing list archives

From "Josh Elser (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (ACCUMULO-1940) Data file in !METADATA differs from in memory data
Date Wed, 27 Nov 2013 17:59:36 GMT

     [ https://issues.apache.org/jira/browse/ACCUMULO-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Elser updated ACCUMULO-1940:
---------------------------------

    Description: 
Found during CI run with agitation.

Got the first two error messages 5 times (presumably inside a retry-on-failure block):

{noformat}
Failed to do close consistency check for tablet c;79d0ab;7870a
	java.lang.RuntimeException: Data file in !METADATA differ from in memory data c;79d0ab;7870a
 {/t-0005h1j/A0005n8k.rf=797350457 19198312, /t-0005h1j/C0005skm.rf=798078368 19322025, /t-0005h1j/C0005tet.rf=89783168 2196349, /t-0005h1j/C0005u20.rf=90979448 2227972, /t-0005h1j/F0005u0v.rf=23410023 582233, /t-0005h1j/F0005u2p.rf=21958551 547159, /t-0005h1j/F0005u3g.rf=14395121 358893}
 {/t-0005h1j/A0005n8k.rf=797350457 19198312, /t-0005h1j/C0005skm.rf=798078368 19322025, /t-0005h1j/C0005tet.rf=89783168 2196349, /t-0005h1j/C0005u20.rf=90979448 2227972, /t-0005h1j/F0005u2p.rf=21958551 547159, /t-0005h1j/F0005u3g.rf=14395121 358893}
		at org.apache.accumulo.server.tabletserver.Tablet.closeConsistencyCheck(Tablet.java:2847)
		at org.apache.accumulo.server.tabletserver.Tablet.completeClose(Tablet.java:2780)
		at org.apache.accumulo.server.tabletserver.Tablet.close(Tablet.java:2658)
		at org.apache.accumulo.server.tabletserver.TabletServer$UnloadTabletHandler.run(TabletServer.java:2357)
		at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
		at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
		at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
		at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
		at java.lang.Thread.run(Thread.java:744)
{noformat}

Then we logged that the consistency check had failed:

{noformat}
Consistency check fails, retrying java.lang.RuntimeException: Failed to do close consistency check for tablet c;79d0ab;7870a
{noformat}

In the end, we gave up and closed the tablet anyway:

{noformat}
Tablet closed consistency check has failed for c;79d0ab;7870a giving up and closing
{noformat}
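
The three log excerpts above suggest a bounded retry followed by a forced close. The following is only a minimal sketch of that pattern, not the actual code in Tablet.completeClose: the class name, method name, and the log parameter are hypothetical, and the attempt count of 5 is taken from the number of repeated errors observed in this run rather than from the source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a retry-then-give-up consistency check.
// This is NOT the Accumulo implementation; names are illustrative only.
public class CloseRetrySketch {

  // Runs 'check' up to maxAttempts times and records a log line per failure.
  // Returns true if the check passed on some attempt, false if every attempt
  // failed and the caller should close the tablet anyway.
  public static boolean checkWithRetries(Runnable check, int maxAttempts, List<String> log) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        check.run();
        return true;
      } catch (RuntimeException e) {
        log.add("Consistency check fails, retrying " + e);
      }
    }
    log.add("consistency check has failed, giving up and closing");
    return false;
  }

  public static void main(String[] args) {
    List<String> log = new ArrayList<>();
    // A check that always fails, standing in for the mismatch reported above.
    boolean passed = checkWithRetries(() -> {
      throw new RuntimeException("files differ");
    }, 5, log);
    System.out.println(passed + ", log lines: " + log.size());
  }
}
```

With a check that always fails and 5 attempts, the sketch records five retry lines and one final give-up line. It does not wait between attempts; whether the real code does is not stated in this report.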

Before all of this happened, we tried to bring this tablet online on a new tserver after a
failure. During the minor compaction (minc) that is part of the recovery process, we failed
to get the lease on the .rf_tmp file we tried to create. This failed a couple of times, but
we eventually got the tmp file we needed, the recovery process completed, and we could bring
the tablet online. The difference between the in-memory version and the !METADATA version was
the one flushed RFile that we created during this recovery process.
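
To make the mismatch concrete, here is a small sketch that compares the two file maps printed in the exception message. It is a hypothetical helper, not Accumulo code: the class and method names are made up, and the values after each "=" are treated as opaque strings copied from the log.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of Accumulo: finds datafile entries that
// appear in one file map but are missing from (or different in) the other.
public class DatafileDiff {

  // Returns the keys of 'left' whose value is absent from or different in 'right'.
  public static Set<String> onlyIn(Map<String, String> left, Map<String, String> right) {
    Set<String> diff = new TreeSet<>();
    for (Map.Entry<String, String> e : left.entrySet()) {
      if (!e.getValue().equals(right.get(e.getKey()))) {
        diff.add(e.getKey());
      }
    }
    return diff;
  }

  // First file map from the exception message, copied verbatim.
  public static Map<String, String> firstMap() {
    Map<String, String> m = new TreeMap<>();
    m.put("/t-0005h1j/A0005n8k.rf", "797350457 19198312");
    m.put("/t-0005h1j/C0005skm.rf", "798078368 19322025");
    m.put("/t-0005h1j/C0005tet.rf", "89783168 2196349");
    m.put("/t-0005h1j/C0005u20.rf", "90979448 2227972");
    m.put("/t-0005h1j/F0005u0v.rf", "23410023 582233");
    m.put("/t-0005h1j/F0005u2p.rf", "21958551 547159");
    m.put("/t-0005h1j/F0005u3g.rf", "14395121 358893");
    return m;
  }

  // Second file map from the exception message: same entries minus one file.
  public static Map<String, String> secondMap() {
    Map<String, String> m = firstMap();
    m.remove("/t-0005h1j/F0005u0v.rf");
    return m;
  }

  public static void main(String[] args) {
    System.out.println(onlyIn(firstMap(), secondMap())); // [/t-0005h1j/F0005u0v.rf]
    System.out.println(onlyIn(secondMap(), firstMap())); // []
  }
}
```

The only entry that differs is /t-0005h1j/F0005u0v.rf, which is present in the first map and absent from the second.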

The problem eventually fixed itself: the tablet was migrated to a different server, which
simply took what was (correctly) in the !METADATA table.

It is still unknown how the flushed RFile went missing from the DatafileManager's copy.

  was:
Found during CI run with agitation.

Got the first two error messages 5 times (presumably inside a retry-on-failure block):

{noformat}
Failed to do close consistency check for tablet c;79d0ab;7870a
	java.lang.RuntimeException: Data file in !METADATA differ from in memory data c;79d0ab;7870a
 {/t-0005h1j/A0005n8k.rf=797350457 19198312, /t-0005h1j/C0005skm.rf=798078368 19322025, /t-0005h1j/C0005tet.rf=89783168 2196349, /t-0005h1j/C0005u20.rf=90979448 2227972, /t-0005h1j/F0005u0v.rf=23410023 582233, /t-0005h1j/F0005u2p.rf=21958551 547159, /t-0005h1j/F0005u3g.rf=14395121 358893}
 {/t-0005h1j/A0005n8k.rf=797350457 19198312, /t-0005h1j/C0005skm.rf=798078368 19322025, /t-0005h1j/C0005tet.rf=89783168 2196349, /t-0005h1j/C0005u20.rf=90979448 2227972, /t-0005h1j/F0005u2p.rf=21958551 547159, /t-0005h1j/F0005u3g.rf=14395121 358893}
		at org.apache.accumulo.server.tabletserver.Tablet.closeConsistencyCheck(Tablet.java:2847)
		at org.apache.accumulo.server.tabletserver.Tablet.completeClose(Tablet.java:2780)
		at org.apache.accumulo.server.tabletserver.Tablet.close(Tablet.java:2658)
		at org.apache.accumulo.server.tabletserver.TabletServer$UnloadTabletHandler.run(TabletServer.java:2357)
		at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
		at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
		at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
		at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
		at java.lang.Thread.run(Thread.java:744)
{noformat}

Then we logged that the consistency check had failed:

{noformat}
Consistency check fails, retrying java.lang.RuntimeException: Failed to do close consistency check for tablet c;79d0ab;7870a
{noformat}

In the end, we gave up and closed the tablet anyway:

{noformat}
Tablet closed consistency check has failed for c;79d0ab;7870a giving up and closing
{noformat}

This left me with some table problems, but everything appears to be working fine at the moment.
I won't know whether something was lost silently until I shut down and run a verify, but it
certainly looks ominous.


> Data file in !METADATA differs from in memory data
> --------------------------------------------------
>
>                 Key: ACCUMULO-1940
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1940
>             Project: Accumulo
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 1.5.0
>            Reporter: Josh Elser



--
This message was sent by Atlassian JIRA
(v6.1#6144)
