From: "Perko, Ralph J" <Ralph.Perko@pnnl.gov>
To: "user@accumulo.apache.org" <user@accumulo.apache.org>
Date: Thu, 19 Jul 2012 11:55:03 -0700
Subject: Re: table data missing

Thanks for the help.  It is fixed and was related to the loggers as you
said.  Here is what I did:

Environment:
7-node managed cluster

Problem:
The walogs directory was configured to use a shared directory which was
used by all the nodes.  When the loggers tried to start, only the first
one could get the .lock file; the others failed to start and my tables
were not visible (though I am not sure why, since the one running logger
had access to the walog files).

Solution:
I created a new walogs directory on partitions unique to each node.  I
then copied the contents of the original walogs directory to the new
walogs directory on each node.  I restarted accumulo and all the tables
were back.
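
For anyone hitting the same symptom, a quick way to see which loggers are
actually up is to try a TCP connection to each node's logger port.  The
following is only a minimal sketch, not part of Accumulo: the function
names are made up for illustration, and the port is simply the 11224 that
appears in the quoted master log, so adjust it if your configuration
differs.

```python
import socket

# Port taken from the quoted master log (192.168.1.244:11224).
# This is an assumption about your setup; change it if your loggers differ.
LOGGER_PORT = 11224


def logger_is_listening(host, port=LOGGER_PORT, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "Connection refused", timeouts, and unreachable hosts.
        return False


def report(hosts, port=LOGGER_PORT):
    """Map each host to True/False for whether its logger port accepts connections."""
    return {host: logger_is_listening(host, port) for host in hosts}
```

Calling report() with the list of node addresses (for example
report(["192.168.1.244"])) returns a dict showing which nodes refuse
connections, which would have pointed at the loggers that failed to start.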

Thanks again,
Ralph


On 7/19/12 9:43 AM, "Eric Newton" <eric.newton@gmail.com> wrote:

>You should have as many loggers as you have tablet servers.
>
>Your log recovery is failing because the loggers are not running.
>
>Please start all your loggers, and/or determine why they are going
>down.  Then restart the master and the system should recover.
>
>-Eric
>
>On Thu, Jul 19, 2012 at 12:39 PM, Perko, Ralph J <Ralph.Perko@pnnl.gov>
>wrote:
>> From the master log file at startup:
>>
>> 9 08:38:40,612 [master.CoordinateRecoveryTask] WARN : Unable to recover 192.168.1.244:11224/65911601-d684-43e8-94b3-cdf959590298(java.io.IOException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused)
>> java.io.IOException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
>>         at org.apache.accumulo.server.tabletserver.log.RemoteLogger.<init>(RemoteLogger.java:99)
>>         at org.apache.accumulo.server.master.CoordinateRecoveryTask$RecoveryJob.startCopy(CoordinateRecoveryTask.java:132)
>>         at org.apache.accumulo.server.master.CoordinateRecoveryTask$RecoveryJob.access$400(CoordinateRecoveryTask.java:114)
>>         at org.apache.accumulo.server.master.CoordinateRecoveryTask.recover(CoordinateRecoveryTask.java:289)
>>         at org.apache.accumulo.server.master.Master$TabletGroupWatcher.run(Master.java:1351)
>> Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
>>         at org.apache.accumulo.core.client.impl.ThriftTransportPool.createNewTransport(ThriftTransportPool.java:475)
>>         at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:464)
>>         at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:441)
>>         at org.apache.accumulo.core.util.ThriftUtil.getClient(ThriftUtil.java:67)
>>         at org.apache.accumulo.server.tabletserver.log.RemoteLogger.<init>(RemoteLogger.java:96)
>>         ... 4 more
>> Caused by: java.net.ConnectException: Connection refused
>>         at sun.nio.ch.Net.connect(Native Method)
>>         at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:500)
>>         at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:81)
>>         at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:65)
>>         at org.apache.accumulo.core.util.TTimeoutTransport.create(TTimeoutTransport.java:39)
>>         at org.apache.accumulo.core.client.impl.ThriftTransportPool.createNewTransport(ThriftTransportPool.java:473)
>>         ... 8 more
>> 19 08:38:40,652 [master.CoordinateRecoveryTask] WARN : Recovery of 192.168.1.244:11224:65911601-d684-43e8-94b3-cdf959590298 failed
>> 19 08:38:45,071 [master.CoordinateRecoveryTask] INFO : Deleting recovery directory org.apache.hadoop.fs.FileStatus@75641fd
>> 19 09:08:40,848 [master.CoordinateRecoveryTask] WARN : Recovery taking too long, giving up
>> 19 09:08:40,848 [master.EventCoordinator] INFO : Log recovery 192.168.1.244:11224/65911601-d684-43e8-94b3-cdf959590298 complete
>>
>>
>>
>>
>> On 7/19/12 9:34 AM, "Keith Turner" <keith@deenlo.com> wrote:
>>
>>>What you are describing sounds like ZooKeeper is up and running (this
>>>is where table config info is stored, so that's why you can list
>>>tables), but no tablets are assigned to tablet servers.  You need to
>>>determine why no tablets are assigned.  Look in the master log for
>>>anything suspicious related to tablet assignment.
>>>
>>>
>>>On Thu, Jul 19, 2012 at 12:28 PM, Perko, Ralph J <Ralph.Perko@pnnl.gov>
>>>wrote:
>>>> Hi,
>>>>
>>>> I restarted my cluster and now the Accumulo Overview page says there
>>>> are 0 tables.  However, when I go to the Table List page, all my tables
>>>> are listed with a status of "ONLINE" but nothing else.  From the
>>>> Accumulo shell I cannot access any of my tables but I can list them,
>>>> like the web site.  Hadoop is up and healthy.  The tablet servers are
>>>> up but each states 0 for Hosted Tablets.  Do you know what is causing
>>>> this and how to fix it?
>>>>
>>>> Thanks,
>>>> Ralph
>>>>
>>>>
>>