asterixdb-dev mailing list archives

From Murtadha Hubail <hubail...@gmail.com>
Subject Re: Help! Any idea to stop AsterixDB from recovering?
Date Sat, 21 May 2016 00:22:10 GMT
How large is the memory component size, and how long is the checkpoint interval? If there is a lot
of data to recover, many small recovery partition files will be written to disk under the logs
directory; you might want to check how many files/directories are there. Also, if the JVM memory is
almost fully consumed, then every time a recovery partition is about to be loaded into memory, any
other partition already in memory will be evicted and will have to be reloaded later when it is
needed. That could explain the busy behavior.
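
To make this concrete, here is a rough, self-contained sketch of the pattern (illustrative only, not
the actual RecoveryManager code, and the sizes are made up): with only enough memory for one spilled
partition at a time, every entity-commit lookup that has to scan the partitions evicts the resident
one and re-reads the next file, so the redo phase spends most of its time reloading files:

import java.util.*;

// Simplified stand-in for the spilled entity-commit partitions described above (not AsterixDB code).
public class RecoveryThrashSketch {

    /** Stand-in for one spilled partition file of entity-commit log records. */
    static class SpilledPartition {
        final Set<Long> committedTxnIds;
        SpilledPartition(Set<Long> ids) { this.committedTxnIds = ids; }
    }

    private final List<SpilledPartition> onDisk = new ArrayList<>();
    private Set<Long> inMemory; // only one partition fits in the memory budget
    private long diskLoads;     // how many times a partition file had to be (re)read

    /** Pretend to read a partition file back into memory, evicting whatever was resident. */
    private void loadPartitionToMemory(int i) {
        inMemory = onDisk.get(i).committedTxnIds; // evict-and-replace
        diskLoads++;
    }

    /** Scan partitions until the txn id is found, reloading each file along the way. */
    boolean containsEntityCommit(long txnId) {
        for (int i = 0; i < onDisk.size(); i++) {
            loadPartitionToMemory(i); // reloaded even if it was resident for the previous log record
            if (inMemory.contains(txnId)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        RecoveryThrashSketch r = new RecoveryThrashSketch();
        long txn = 0;
        for (int p = 0; p < 200; p++) {           // 200 spilled partitions of 10k entity commits each
            Set<Long> ids = new HashSet<>();
            for (int k = 0; k < 10_000; k++) ids.add(txn++);
            r.onDisk.add(new SpilledPartition(ids));
        }
        for (long t = 0; t < 1_000; t++) {        // redoing only 1,000 log records...
            r.containsEntityCommit(t * 2_000);
        }
        System.out.println("partition (re)loads during redo: " + r.diskLoads); // ...already ~100k file loads
    }
}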

Also, if you are on the current master, I merged a patch earlier this week to make the spilled
partition size configurable. You might want to increase it and see if the recovery time improves.
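
As a back-of-the-envelope illustration of why a bigger budget helps (the numbers below are made up,
not defaults): the number of spilled partition files, and therefore the amount of load/evict churn
above, shrinks roughly in proportion to the budget.

public class SpillBudgetMath {
    public static void main(String[] args) {
        long entityCommitBytes = 8L << 30; // assume 8 GB of entity-commit records to track during redo
        for (long budgetBytes : new long[] {64L << 20, 256L << 20, 1L << 30}) {
            long partitions = (entityCommitBytes + budgetBytes - 1) / budgetBytes; // ceiling division
            System.out.printf("budget %4d MB -> ~%d spilled partition files%n",
                    budgetBytes >> 20, partitions);
        }
    }
}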

Cheers,
Murtadha

> On May 20, 2016, at 5:00 PM, Jianfeng Jia <jianfeng.jia@gmail.com> wrote:
> 
> I am able to reproduce the issue. The thread dump is pasted here: http://pastebin.com/KKvLVVg9
> It looks pretty busy with loadPartitionToMemory.
> 
> "main" #1 prio=5 os_prio=0 tid=0x00007f33f400d800 nid=0x14 runnable [0x00007f33fc27a000]
>   java.lang.Thread.State: RUNNABLE
>    at java.io.FileInputStream.read(Native Method)
>    at org.apache.asterix.transaction.management.service.recovery.RecoveryManager$JobEntityCommits.loadPartitionToMemory(RecoveryManager.java:1033)
>    at org.apache.asterix.transaction.management.service.recovery.RecoveryManager$JobEntityCommits.serachPartition(RecoveryManager.java:983)
>    at org.apache.asterix.transaction.management.service.recovery.RecoveryManager$JobEntityCommits.containsEntityCommitForTxnId(RecoveryManager.java:947)
>    at org.apache.asterix.transaction.management.service.recovery.RecoveryManager.startRecoveryRedoPhase(RecoveryManager.java:334)
>    - locked <0x00000006c03dc1c0> (a org.apache.asterix.transaction.management.service.recovery.RecoveryManager)
>    at org.apache.asterix.transaction.management.service.recovery.RecoveryManager.replayPartitionsLogs(RecoveryManager.java:203)
>    - locked <0x00000006c03dc1c0> (a org.apache.asterix.transaction.management.service.recovery.RecoveryManager)
>    at org.apache.asterix.transaction.management.service.recovery.RecoveryManager.startRecovery(RecoveryManager.java:195)
>    at org.apache.asterix.hyracks.bootstrap.NCApplicationEntryPoint.start(NCApplicationEntryPoint.java:142)
>    at org.apache.hyracks.control.nc.NodeControllerService.startApplication(NodeControllerService.java:324)
>    at org.apache.hyracks.control.nc.NodeControllerService.start(NodeControllerService.java:259)
>    at org.apache.hyracks.control.nc.NCDriver.main(NCDriver.java:48)
> 
> 
> 
>> On May 17, 2016, at 12:37 AM, Jianfeng Jia <jianfeng.jia@gmail.com> wrote:
>> 
>> I’m trying to reproduce the issue and will dump it then. 
>> 
>>> On May 16, 2016, at 10:57 PM, Michael Blow <mblow.apache@gmail.com> wrote:
>>> 
>>> It would be good to get thread dumps if this happens again.
>>> On Mon, May 16, 2016 at 10:56 PM Jianfeng Jia <jianfeng.jia@gmail.com> wrote:
>>> 
>>>> I revisited the logs, and luckily they haven’t been cleared. Here is part of the nc1’s log:
>>>> 
>>>> May 15, 2016 1:04:10 PM
>>>> org.apache.hyracks.storage.common.buffercache.BufferCache openFile
>>>> INFO: Opening file: 14 in cache:
>>>> org.apache.hyracks.storage.common.buffercache.BufferCache@2a7f1f10
>>>> May 15, 2016 1:04:10 PM
>>>> org.apache.hyracks.storage.common.buffercache.BufferCache openFile
>>>> INFO: Opening file: 13 in cache:
>>>> org.apache.hyracks.storage.common.buffercache.BufferCache@2a7f1f10
>>>> May 15, 2016 1:04:10 PM
>>>> org.apache.hyracks.storage.common.buffercache.BufferCache createFile
>>>> INFO: Creating file:
>>>> /nc1/iodevice1/storage/partition_0/hackathon/log_device_idx_log_device/2016-05-15-12-56-48-712_2016-05-15-12-23-31-225_f
>>>> in cache: org.apache.hyracks.storage.common.buffercache.BufferCache@2a7f1f10
>>>> May 15, 2016 1:04:10 PM
>>>> org.apache.hyracks.storage.common.buffercache.BufferCache openFile
>>>> INFO: Opening file: 15 in cache:
>>>> org.apache.hyracks.storage.common.buffercache.BufferCache@2a7f1f10
>>>> ——————————————————————————————————————
>>>> /// I shut down the cluster here and started the server right away.
>>>> ——————————————————————————————————————
>>>> May 15, 2016 1:43:12 PM
>>>> org.apache.asterix.transaction.management.service.recovery.RecoveryManager
>>>> startRecoveryRedoPhase
>>>> INFO: Logs REDO phase completed. Redo logs count: 1197
>>>> May 15, 2016 1:43:12 PM org.apache.hyracks.storage.am.lsm.common.impls.LSMHarness
>>>> flush
>>>> INFO: Started a flush operation for index: LSMBTree
>>>> [/nc1/iodevice1/storage/partition_0/Metadata/Dataset_idx_Dataset/] ...
>>>> May 15, 2016 1:43:12 PM
>>>> org.apache.hyracks.storage.common.buffercache.BufferCache createFile
>>>> INFO: Creating file:
>>>> /nc1/iodevice1/storage/partition_0/Metadata/Dataset_idx_Dataset/2016-05-15-13-43-12-680_2016-05-15-13-43-12-680_f
>>>> in cache: org.apache.hyracks.storage.common.buffercache.BufferCache@2a7f1f10
>>>> 
>>>> No logs were generated in those 43 minutes. During that time one CPU was fully busy, and I
>>>> remember no file was touched or created in the asterix folder. So it may not be a problem
>>>> with the buffer cache in the recovery phase?
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On May 16, 2016, at 9:28 PM, Mike Carey <dtabass@gmail.com> wrote:
>>>>> 
>>>>> Agreed and agreed.  But is the spinning on recovery?
>>>>> 
>>>>> (What's the role of the buffer cache during recovery?)
>>>>> 
>>>>> 
>>>>> On 5/17/16 2:10 AM, Jianfeng Jia wrote:
>>>>>> I think the BufferCache is the core issue; the recovery process may just run into the same
>>>>>> spin trap where it was stopped.
>>>>>> And I created another issue that we should be able to abort the task so that we don’t need
>>>>>> to restart the server.
>>>>>> 
>>>>>>> On May 16, 2016, at 7:24 AM, Michael Blow <mblow.apache@gmail.com> wrote:
>>>>>>> 
>>>>>>> This might be related: (ASTERIXDB-1438) BufferCache spins indefinitely when cache is exceeded.
>>>>>>> 
>>>>>>> https://issues.apache.org/jira/browse/ASTERIXDB-1438
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> -MDB
>>>>>>> 
>>>>>>> On Mon, May 16, 2016 at 1:52 AM Mike Carey <dtabass@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Glad it worked out - can someone also capture the core issue in JIRA? Thx!
>>>>>>>> On May 15, 2016 11:40 PM, "Jianfeng Jia" <jianfeng.jia@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Great! The server is back now. Thanks a lot!
>>>>>>>>>> On May 15, 2016, at 2:26 PM, Murtadha Hubail <hubailmor@gmail.com> wrote:
>>>>>>>>>> You can delete the existing log files and create new empty ones with an incremented log
>>>>>>>>>> file number, but it is very important that you don't delete the checkpoint file.
>>>>>>>>>> Of course any data in the old log files will be lost, but the data already on disk will
>>>>>>>>>> be available.
>>>>>>>>>>> On May 15, 2016, at 1:23 PM, Jianfeng Jia <jianfeng.jia@gmail.com> wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>> We submitted a long-running join+insert query and stopped the cluster to stop running
>>>>>>>>>>> it. However, when it restarted it ran the recovery forever; the logs show that it is
>>>>>>>>>>> creating a lot of files in the buffer cache.
>>>>>>>>>>> 
>>>>>>>>>>> In order to bring the cluster back to answer the query, are there any hacky solutions,
>>>>>>>>>>> such as removing the recovery txn logs? I’m worried that it will ruin the cluster
>>>>>>>>>>> somehow. We are in a contest, so any early help is really appreciated! Thanks!
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> 
>>>>>>>>>>> Jianfeng Jia
>>>>>>>>>>> PhD Candidate of Computer Science
>>>>>>>>>>> University of California, Irvine
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Best,
>>>>>>>>> 
>>>>>>>>> Jianfeng Jia
>>>>>>>>> PhD Candidate of Computer Science
>>>>>>>>> University of California, Irvine
>>>>>>>>> 
>>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Best,
>>>>>> 
>>>>>> Jianfeng Jia
>>>>>> PhD Candidate of Computer Science
>>>>>> University of California, Irvine
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> Best,
>>>> 
>>>> Jianfeng Jia
>>>> PhD Candidate of Computer Science
>>>> University of California, Irvine
>>>> 
>>>> 
>> 
>> 
>> 
>> Best,
>> 
>> Jianfeng Jia
>> PhD Candidate of Computer Science
>> University of California, Irvine
>> 
> 
> 
> 
> Best,
> 
> Jianfeng Jia
> PhD Candidate of Computer Science
> University of California, Irvine

