asterixdb-dev mailing list archives

From Jianfeng Jia <jianfeng....@gmail.com>
Subject Re: Help! Any idea to stop AsterixDB from recovering?
Date Tue, 17 May 2016 07:37:59 GMT
I’m trying to reproduce the issue and will capture thread dumps when it happens.
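
For anyone else who hits this, here is a minimal sketch of how several thread dumps could be captured in a row, so that a thread stuck in the same frames stands out. It assumes the JDK's `jstack` tool is on the PATH and that you already know the NC's process id; the function names, the dump count, and the interval are illustrative choices, not part of AsterixDB.

```python
import subprocess
import time


def jstack_command(pid):
    # `jstack -l <pid>` ships with the JDK; -l adds lock information.
    return ["jstack", "-l", str(pid)]


def capture_dumps(pid, count=3, interval_s=5.0, run=None, sleep=time.sleep):
    """Return `count` thread dumps of process `pid`, taken `interval_s` apart.

    `run` and `sleep` are injectable so the loop can be exercised without a JVM.
    """
    if run is None:
        def run(cmd):
            # Raises CalledProcessError if jstack cannot attach to the process.
            return subprocess.run(
                cmd, capture_output=True, text=True, check=True
            ).stdout
    dumps = []
    for i in range(count):
        dumps.append(run(jstack_command(pid)))
        if i < count - 1:
            sleep(interval_s)  # no need to wait after the last dump
    return dumps
```

A thread that shows the same stack in every dump while one CPU is busy is the likely spinner.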

> On May 16, 2016, at 10:57 PM, Michael Blow <mblow.apache@gmail.com> wrote:
> 
> It would be good to get thread dumps if this happens again.
> On Mon, May 16, 2016 at 10:56 PM Jianfeng Jia <jianfeng.jia@gmail.com>
> wrote:
> 
>> I revisited the logs, and luckily they haven’t been cleared. Here is part of
>> nc1’s log:
>> 
>> May 15, 2016 1:04:10 PM
>> org.apache.hyracks.storage.common.buffercache.BufferCache openFile
>> INFO: Opening file: 14 in cache:
>> org.apache.hyracks.storage.common.buffercache.BufferCache@2a7f1f10
>> May 15, 2016 1:04:10 PM
>> org.apache.hyracks.storage.common.buffercache.BufferCache openFile
>> INFO: Opening file: 13 in cache:
>> org.apache.hyracks.storage.common.buffercache.BufferCache@2a7f1f10
>> May 15, 2016 1:04:10 PM
>> org.apache.hyracks.storage.common.buffercache.BufferCache createFile
>> INFO: Creating file:
>> /nc1/iodevice1/storage/partition_0/hackathon/log_device_idx_log_device/2016-05-15-12-56-48-712_2016-05-15-12-23-31-225_f
>> in cache: org.apache.hyracks.storage.common.buffercache.BufferCache@2a7f1f10
>> May 15, 2016 1:04:10 PM
>> org.apache.hyracks.storage.common.buffercache.BufferCache openFile
>> INFO: Opening file: 15 in cache:
>> org.apache.hyracks.storage.common.buffercache.BufferCache@2a7f1f10
>> ——————————————————————————————————————
>> /// I shut down the cluster here and restarted the server right away.
>> ——————————————————————————————————————
>> May 15, 2016 1:43:12 PM
>> org.apache.asterix.transaction.management.service.recovery.RecoveryManager
>> startRecoveryRedoPhase
>> INFO: Logs REDO phase completed. Redo logs count: 1197
>> May 15, 2016 1:43:12 PM org.apache.hyracks.storage.am.lsm.common.impls.LSMHarness
>> flush
>> INFO: Started a flush operation for index: LSMBTree
>> [/nc1/iodevice1/storage/partition_0/Metadata/Dataset_idx_Dataset/] ...
>> May 15, 2016 1:43:12 PM
>> org.apache.hyracks.storage.common.buffercache.BufferCache createFile
>> INFO: Creating file:
>> /nc1/iodevice1/storage/partition_0/Metadata/Dataset_idx_Dataset/2016-05-15-13-43-12-680_2016-05-15-13-43-12-680_f
>> in cache: org.apache.hyracks.storage.common.buffercache.BufferCache@2a7f1f10
>> 
>> No logs were generated in those 43 minutes. During that time one CPU was
>> maxed out, and as far as I remember no file was touched or generated in the
>> asterix folder. So the problem may not be the buffer cache in the recovery
>> phase?
>> 
>> 
>> 
>> 
>> 
>>> On May 16, 2016, at 9:28 PM, Mike Carey <dtabass@gmail.com> wrote:
>>> 
>>> Agreed and agreed.  But is the spinning on recovery?
>>> 
>>> (What's the role of the buffer cache during recovery?)
>>> 
>>> 
>>> On 5/17/16 2:10 AM, Jianfeng Jia wrote:
>>>> I think the BufferCache is the core issue; the recovery process may just
>>>> run into the same spin trap where it was stopped.
>>>> I also created another issue: we should be able to abort the task so
>>>> that we don’t need to restart the server.
>>>> 
>>>>> On May 16, 2016, at 7:24 AM, Michael Blow <mblow.apache@gmail.com> wrote:
>>>>> 
>>>>> This might be related: (ASTERIXDB-1438) BufferCache spins indefinitely
>>>>> when cache is exceeded.
>>>>> 
>>>>> https://issues.apache.org/jira/browse/ASTERIXDB-1438
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> -MDB
>>>>> 
>>>>> On Mon, May 16, 2016 at 1:52 AM Mike Carey <dtabass@gmail.com> wrote:
>>>>> 
>>>>>> Glad it worked out - can someone also capture the core issue in
>>>>>> JIRA?  Thx!
>>>>>> On May 15, 2016 11:40 PM, "Jianfeng Jia" <jianfeng.jia@gmail.com> wrote:
>>>>>> 
>>>>>>> Great! The server is back now. Thanks a lot!
>>>>>>>> On May 15, 2016, at 2:26 PM, Murtadha Hubail <hubailmor@gmail.com> wrote:
>>>>>>>> You can delete the existing log files and create new empty ones with
>>>>>>>> incremented log file number, but it is very important that you don't
>>>>>>>> delete the checkpoint file.
>>>>>>>> Of course any data in the old log files will be lost, but the data
>>>>>>>> already on disk will be available.
>>>>>>>>> On May 15, 2016, at 1:23 PM, Jianfeng Jia <jianfeng.jia@gmail.com> wrote:
>>>>>>>>> Hi,
>>>>>>>>> We submitted a long-running join+insert query and then stopped the
>>>>>>>>> cluster to stop it. However, when the cluster restarted, recovery ran
>>>>>>>>> forever; the logs show that it is creating a lot of files in the
>>>>>>>>> buffer cache.
>>>>>>>>> 
>>>>>>>>> In order to bring the cluster back to answer the query, is there any
>>>>>>>>> hacky solution, such as removing the recovery txnlogs? I’m worried that
>>>>>>>>> it will ruin the cluster somehow.
>>>>>>>>> We are in a contest, so any early help is really appreciated! Thanks!
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Best,
>>>>>>>>> 
>>>>>>>>> Jianfeng Jia
>>>>>>>>> PhD Candidate of Computer Science
>>>>>>>>> University of California, Irvine
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Best,
>>>>>>> 
>>>>>>> Jianfeng Jia
>>>>>>> PhD Candidate of Computer Science
>>>>>>> University of California, Irvine
>>>>>>> 
>>>>>>> 
>>>> 
>>>> 
>>>> Best,
>>>> 
>>>> Jianfeng Jia
>>>> PhD Candidate of Computer Science
>>>> University of California, Irvine
>>>> 
>>>> 
>>> 
>> 
>> 
>> 
>> Best,
>> 
>> Jianfeng Jia
>> PhD Candidate of Computer Science
>> University of California, Irvine
>> 
>> 
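
To locate a quiet period like the one in the nc1 log above without scrolling through the file, a small script can look for the largest gap between consecutive log timestamps. This is only a sketch: it assumes every entry starts with a `java.util.logging` style timestamp such as `May 15, 2016 1:04:10 PM`, as in the excerpt, and the function names are made up for illustration.

```python
import re
from datetime import datetime

# Timestamp that opens each log entry in the excerpt, e.g. "May 15, 2016 1:04:10 PM".
STAMP = re.compile(r"[A-Z][a-z]{2} \d{1,2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M")
STAMP_FORMAT = "%b %d, %Y %I:%M:%S %p"


def entry_times(lines):
    """Collect the timestamps found in the given log lines, in file order."""
    times = []
    for line in lines:
        match = STAMP.search(line)
        if match:
            times.append(datetime.strptime(match.group(0), STAMP_FORMAT))
    return times


def largest_gap(lines):
    """Return (before, after) for the widest gap between consecutive entries.

    Returns None when fewer than two timestamps are found.
    """
    times = entry_times(lines)
    if len(times) < 2:
        return None
    return max(zip(times, times[1:]), key=lambda pair: pair[1] - pair[0])
```

Calling `largest_gap(open(path))` on an NC log returns the two timestamps that bracket the silence. Note that the gap is measured between logged entries only, so it need not line up exactly with when the server was stopped or restarted.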



Best,

Jianfeng Jia
PhD Candidate of Computer Science
University of California, Irvine
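
P.S. For the record, the workaround Murtadha described (delete the existing log files, create a new empty one with an incremented number, leave the checkpoint file alone) could be scripted roughly as below. This is a sketch under assumptions the thread does not confirm: that the transaction log files are named `transaction_log_<n>`, that a single empty file numbered one past the highest existing log is enough, and that everything else in the directory (including the checkpoint file) must be left untouched. It defaults to a dry run, because any data in the old log files is lost once they are deleted.

```python
import re
from pathlib import Path

# Assumed naming scheme for transaction log files; verify it on your install.
LOG_NAME = re.compile(r"^transaction_log_(\d+)$")


def plan_log_reset(log_dir):
    """Return (old_logs, new_log) without touching the file system."""
    log_dir = Path(log_dir)
    numbered = {}
    for path in log_dir.iterdir():
        match = LOG_NAME.match(path.name)
        if match and path.is_file():
            numbered[path] = int(match.group(1))
    if not numbered:
        raise ValueError(f"no transaction log files found in {log_dir}")
    next_number = max(numbered.values()) + 1
    old_logs = sorted(numbered, key=numbered.get)
    return old_logs, log_dir / f"transaction_log_{next_number}"


def reset_logs(log_dir, apply=False):
    """Delete the old logs and create one empty successor; dry run by default."""
    old_logs, new_log = plan_log_reset(log_dir)
    if apply:
        for path in old_logs:
            path.unlink()  # un-replayed updates in this log are lost
        new_log.touch()    # new empty log with the incremented number
    return old_logs, new_log
```

Run it with the cluster stopped, check the returned plan first, and only then call `reset_logs(path, apply=True)`. Files that do not match the assumed log name pattern are never deleted.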

