accumulo-user mailing list archives

From "Adam J. Shook" <adamjsh...@gmail.com>
Subject Re: Question on missing RFiles
Date Wed, 16 May 2018 15:25:02 GMT
I tried building a timeline but the logs are just not there.  We weren't
sending the debug logs to Splunk due to the verbosity, but we may be
tweaking the log4j settings a bit to make sure we get the log data stored
in the event this happens again.  This very well could be attributed to the
recovery failure; hard to say.  I'll be upgrading to 1.9.1 soon.
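A log4j tweak like the one described might look like the following sketch. This is a hypothetical log4j 1.x properties fragment; the appender name, file path, and sizes are assumptions, not from the thread. The idea is to keep DEBUG output in a local rolling file even though only INFO and above ships to Splunk:

```properties
# Hypothetical fragment: retain DEBUG logs locally in a rolling file
# (appender name "local", path, and rollover sizes are made up).
log4j.rootLogger=DEBUG, local
log4j.appender.local=org.apache.log4j.RollingFileAppender
log4j.appender.local.File=/var/log/accumulo/tserver.debug.log
log4j.appender.local.MaxFileSize=100MB
log4j.appender.local.MaxBackupIndex=10
log4j.appender.local.layout=org.apache.log4j.PatternLayout
log4j.appender.local.layout.ConversionPattern=%d{ISO8601} [%c] %-5p: %m%n
```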

On Mon, May 14, 2018 at 8:53 AM, Michael Wall <mjwall@gmail.com> wrote:

> Can you pick some of the files that are missing and search through your
> logs to put together a timeline?  See if you can find that file for a
> specific tablet.  Then grab all the logs for when a file was created as a
> result of a compaction, and when a file was included in a compaction for
> that tablet.  Follow compactions for that tablet until you started getting
> errors.  Then see what logs you have for WAL replay during that time for
> that tablet and the metadata table, and try to correlate.
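That grep pass can be sketched as a small helper. The file name and log paths below are hypothetical examples, not values from the thread:

```shell
# trace_rfile: print every log line mentioning a given rfile, in
# chronological order (assumes timestamps lead each log line, as in the
# default Accumulo log4j pattern).
trace_rfile() {
  local pat="$1"; shift
  grep -h "$pat" "$@" | sort
}

# Usage (hypothetical file name and log paths):
# trace_rfile A0000xyz.rf /var/log/accumulo/tserver_*.log
```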
>
> It's a shame you don't have the GC logs.  If you saw a file was GC'd and
> then showed up in the metadata table again, that would help explain what
> happened.  Like Christopher mentioned, this could be related to a recovery
> failure.
>
> Mike
>
> On Sat, May 12, 2018 at 5:26 PM Adam J. Shook <adamjshook@gmail.com>
> wrote:
>
>> WALs are turned on.  Durability is set to flush for all tables except for
>> root and metadata, which are sync.  The current rfile names on HDFS and
>> in the metadata table are greater than the names of the missing files.
>> I searched through all of our current and historical logs in Splunk (which
>> are only INFO level or higher).  Issues from the logs:
>>
>> * Problem reports saying the files are not found
>> * IllegalStateException saying the rfile is closed when it tried to load
>> the Bloom filter (likely the flappy DataNode)
>> * IOException when reading the file saying Stream is closed (likely the
>> flappy DataNode)
>>
>> Nothing in the GC logs -- all the above errors are in the tablet server
>> logs.  The logs may have rolled over, though, and our debug logs don't make
>> it into Splunk.
>>
>> --Adam
>>
>> On Fri, May 11, 2018 at 6:16 PM, Christopher <ctubbsii@apache.org> wrote:
>>
>>> Oh, it occurs to me that this may be related to the WAL bugs that Keith
>>> fixed for 1.9.1... which could affect the metadata table recovery after a
>>> failure.
>>>
>>> On Fri, May 11, 2018 at 6:11 PM Michael Wall <mjwall@gmail.com> wrote:
>>>
>>>> Adam,
>>>>
>>>> Do you have GC logs?  Can you see if those missing RFiles were removed
>>>> by the GC process?  That could indicate you somehow got old metadata info
>>>> replayed.  Also, the rfiles increment, so compare the current rfile names
>>>> in the srv.dir directory vs what is in the metadata table.  Are the
>>>> existing files after the files in the metadata table?  Finally, pick a
>>>> few of the missing files and grep all your master and tserver logs to
>>>> see if you can learn anything.  This sounds ungood.
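That comparison could be sketched as follows. The table id, paths, and credentials in the comments are assumptions; the only part meant to run as-is is the list diff at the end:

```shell
# Hypothetical sketch: find rfiles the metadata table references but HDFS
# no longer has.  The table id "2b" and credentials are made up.
#
# List rfile basenames on HDFS:
#   hdfs dfs -ls -R /accumulo/tables/2b | awk '{print $NF}' | grep '\.rf$' \
#     | xargs -n1 basename | sort -u > hdfs_files.txt
#
# List rfile basenames referenced by the metadata table:
#   accumulo shell -u root -e 'scan -t accumulo.metadata -c file -np' \
#     | grep -o '[^/ ]*\.rf' | sort -u > meta_files.txt

# Names present in the second list (metadata) but not the first (HDFS):
missing_in_hdfs() {
  comm -13 <(sort -u "$1") <(sort -u "$2")
}

# missing_in_hdfs hdfs_files.txt meta_files.txt
```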
>>>>
>>>> Mike
>>>>
>>>> On Fri, May 11, 2018 at 6:06 PM Christopher <ctubbsii@apache.org>
>>>> wrote:
>>>>
>>>>> This is strange. I've only ever seen this when HDFS has reported
>>>>> problems, such as missing blocks, or another obvious failure. What are
>>>>> your durability settings (were WALs turned on)?
>>>>>
>>>>> On Fri, May 11, 2018 at 12:45 PM Adam J. Shook <adamjshook@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello all,
>>>>>>
>>>>>> On one of our clusters, there are a good number of missing RFiles
>>>>>> from HDFS, however HDFS is not/has not reported any missing blocks.
 We
>>>>>> were experiencing issues with HDFS; some flapping DataNode processes
that
>>>>>> needed more heap.
>>>>>>
>>>>>> I don't anticipate I can do much besides create a bunch of empty
>>>>>> RFiles (open to suggestions).  My question is, Is it possible that
Accumulo
>>>>>> could have written the metadata for these RFiles but failed to write
it to
>>>>>> HDFS?  In which case it would have been re-tried later and the data
was
>>>>>> persisted to a different RFile?  Or is it an 'RFile is in Accumulo
metadata
>>>>>> if and only if it is in HDFS' situation?
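For the empty-RFile approach, Accumulo ships a CreateEmpty utility. The wrapper below only prints the commands (a dry run); the example path is hypothetical. Drop the echo to actually run them:

```shell
# Dry-run sketch: emit one CreateEmpty command per missing rfile path.
# The HDFS path used in the usage example is made up.
create_empty_rfiles() {
  for f in "$@"; do
    echo accumulo org.apache.accumulo.core.file.rfile.CreateEmpty "$f"
  done
}

# Usage (hypothetical path):
# create_empty_rfiles /accumulo/tables/2b/default_tablet/F0001.rf
```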
>>>>>>
>>>>>> Accumulo 1.8.1 on HDFS 2.6.0.
>>>>>>
>>>>>> Thank you,
>>>>>> --Adam
>>>>>>
>>>>>
>>
