X-Original-To: apmail-accumulo-notifications-archive@minotaur.apache.org
Delivered-To: apmail-accumulo-notifications-archive@minotaur.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3])
    by minotaur.apache.org (Postfix) with SMTP id 7135010AB5;
    Tue, 22 Oct 2013 17:18:45 +0000 (UTC)
Received: (qmail 6490 invoked by uid 500); 22 Oct 2013 17:18:43 -0000
Delivered-To: apmail-accumulo-notifications-archive@accumulo.apache.org
Received: (qmail 6449 invoked by uid 500); 22 Oct 2013 17:18:42 -0000
Mailing-List: contact notifications-help@accumulo.apache.org; run by ezmlm
Precedence: bulk
Reply-To: jira@apache.org
Delivered-To: mailing list notifications@accumulo.apache.org
Received: (qmail 6428 invoked by uid 99); 22 Oct 2013 17:18:42 -0000
Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28)
    by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 22 Oct 2013 17:18:42 +0000
Date: Tue, 22 Oct 2013 17:18:42 +0000 (UTC)
From: "Eric Newton (JIRA)"
To: notifications@accumulo.apache.org
Subject: [jira] [Resolved] (ACCUMULO-1740) intermittent integration test failure
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

     [ https://issues.apache.org/jira/browse/ACCUMULO-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Newton resolved ACCUMULO-1740.
-----------------------------------

    Resolution: Fixed

Calling this fixed, but please reopen if you see it again.

> intermittent integration test failure
> -------------------------------------
>
>                 Key: ACCUMULO-1740
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1740
>             Project: Accumulo
>          Issue Type: Bug
>          Components: test
>            Reporter: Eric Newton
>            Assignee: Eric Newton
>
> Some of the recovery integration tests fail with a very long timeout (10 minutes).
> After a restart of the tablet servers, the WAL is sorted, and the root tablet is assigned. After that, the master does not assign the !METADATA tablets.
> I've managed to jstack the master, and it seems to be stuck scanning. I turned on DEBUG log messages and I see this:
> {noformat}
> 2013-09-25 17:27:46,340 [impl.TabletServerBatchReaderIterator] DEBUG: Server : rd6ul-14706v.tycho.ncsc.mil:37957 msg : java.net.SocketTimeoutException: 120000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.0.1:33362 remote=rd6ul-14706v.tycho.ncsc.mil/10.0.0.1:37957]
> 2013-09-25 17:27:46,340 [impl.TabletServerBatchReaderIterator] DEBUG: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: 120000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.0.1:33362 remote=rd6ul-14706v.tycho.ncsc.mil/10.0.0.1:37957]
> java.io.IOException: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: 120000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.0.1:33362 remote=rd6ul-14706v.tycho.ncsc.mil/10.0.0.1:37957]
>     at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:705)
>     at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator$QueryTask.run(TabletServerBatchReaderIterator.java:364)
>     at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>     at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
>     at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
>     at java.lang.Thread.run(Thread.java:662)
> Caused by: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: 120000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.0.1:33362 remote=rd6ul-14706v.tycho.ncsc.mil/10.0.0.1:37957]
>     at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
>     at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
>     at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
>     at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
>     at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
>     at org.apache.accumulo.core.client.impl.ThriftTransportPool$CachedTTransport.readAll(ThriftTransportPool.java:254)
>     at org.apache.thrift.protocol.TCompactProtocol.readByte(TCompactProtocol.java:601)
>     at org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:470)
>     at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
>     at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.recv_startMultiScan(TabletClientService.java:310)
>     at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.startMultiScan(TabletClientService.java:290)
>     at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:650)
>     ... 7 more
> Caused by: java.net.SocketTimeoutException: 120000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.0.1:33362 remote=rd6ul-14706v.tycho.ncsc.mil/10.0.0.1:37957]
>     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>     at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
>     ... 18 more
> {noformat}
> The tablet server does put the root tablet online.
> There are 8 tests that restart tablet servers; this usually happens to only one of the tests per run, making it difficult to track down.

--
This message was sent by Atlassian JIRA
(v6.1#6144)