accumulo-user mailing list archives

From Andrew Hulbert <ahulb...@ccri.com>
Subject Re: Recovery file versus directory
Date Fri, 18 Mar 2016 13:51:13 GMT
Looks like the only thing we have in the gc logs are:

DEBUG: deleted [hdfs://../accumulo/wal/<uuid> ...]
DEBUG: Removing sorted WAL hdfs://...<uuid>

I can't tell whether these were logged before or after I deleted the 
file

hdfs://accumulo/wal/<uuid>/failed
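
In case it helps anyone hitting the same thing, here's roughly what the 
check-and-delete of that marker amounts to via the Hadoop FileSystem 
API. This is only a sketch; the class name and the path argument are 
placeholders, not anything shipped with Accumulo.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckFailedMarker {
  public static void main(String[] args) throws IOException {
    // Placeholder argument, e.g. hdfs:///accumulo/recovery/<uuid>/failed
    Path failed = new Path(args[0]);
    FileSystem fs = FileSystem.get(failed.toUri(), new Configuration());

    if (!fs.exists(failed)) {
      System.out.println("no failed marker at " + failed);
      return;
    }

    FileStatus status = fs.getFileStatus(failed);
    if (status.isDirectory()) {
      System.out.println(failed + " is a directory, leaving it alone");
    } else {
      // The symptom described in this thread: a 0-byte plain file.
      System.out.println(failed + " is a file, length=" + status.getLen());
      // Deleting it is what let recovery proceed for us.
      fs.delete(failed, false);
    }
  }
}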

Here's the other issue we were looking at:

https://issues.apache.org/jira/browse/ACCUMULO-3727

FYI, I originally bumped the number of WALs up to 8 cluster-wide to 
help batch-write ingest. I've since changed it so only the tables that 
need the heavy ingest get the higher setting, reset the cluster-wide 
value back to 3, and I haven't had any errors since (3 days). I'm not 
sure why the higher setting would be a problem except for the few 
times the metadata table was involved.
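
For what it's worth, the per-table change boils down to something like 
the snippet below, assuming the property involved is 
table.compaction.minor.logs.threshold (default 3). The connection 
details and table name are placeholders.

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;

public class PerTableWalThreshold {
  public static void main(String[] args) throws Exception {
    // Placeholder connection details -- adjust for your instance.
    ZooKeeperInstance inst = new ZooKeeperInstance("myInstance", "zoo1:2181");
    Connector conn = inst.getConnector("root", new PasswordToken("secret"));

    // Raise the WAL-per-tablet threshold only for the heavy-ingest table.
    conn.tableOperations().setProperty("ingest_table",
        "table.compaction.minor.logs.threshold", "8");

    // If the higher value had been set cluster-wide through the API, drop
    // that override so everything else falls back to the default of 3.
    conn.instanceOperations().removeProperty("table.compaction.minor.logs.threshold");
  }
}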

Andrew

On 03/18/2016 09:43 AM, Andrew Hulbert wrote:
> I'll tar them up and see what I can find! Thanks.
>
> On 03/17/2016 08:18 PM, Michael Wall wrote:
>> Andrew,
>>
>> Sounds a lot like 
>> https://issues.apache.org/jira/browse/ACCUMULO-4157. I'll look to see 
>> if what you describe could also happen with this bug.  If you still 
>> have the gc logs, can you look for a message like "Removing WAL for 
>> offline server" with the uuid?
>>
>> Mike
>>
>> On Tue, Mar 8, 2016 at 11:28 AM, Andrew Hulbert <ahulbert@ccri.com> 
>> wrote:
>>
>>     Hi folks,
>>
>>     We experienced a problem this morning with a recovery on 1.6.1
>>     that went something like this:
>>
>>     FileNotFoundException: File does not exist:
>>     hdfs:///accumulo/recovery/<uuid>/failed/data
>>
>>     at Tablet.java:1410
>>     at Tablet.java:1233
>>     etc.
>>     at TabletServer:2923
>>
>>     Interestingly enough, hdfs:///accumulo/recovery/<uuid>/failed was
>>     a 0-byte file, not a directory, and it was preventing tablets
>>     from getting assigned. (I'm not sure what caused the original
>>     failure, but I believe a tserver node was going down; the master
>>     indicated it was trying to shut down a tserver that was in such
>>     bad shape that someone just rekicked the node.)
>>
>>     I looked through the fixes for 1.6.2, .3, .4, and .5 but didn't
>>     see anything related on the release notes pages, though I haven't
>>     gone through all the tickets yet. I haven't been able to get
>>     anyone to upgrade to 1.6.5 yet, so perhaps it's already fixed.
>>
>>     Just wondering if that's something that has been seen before?
>>
>>     In order to fix it I just deleted the failed file and recovery
>>     proceeded.
>>
>>     Thanks!
>>
>>     Andrew
>>
>>
>

