falcon-dev mailing list archives

From "Ajay Yadava (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FALCON-1391) OOM in Integration tests arbitrarily
Date Wed, 12 Aug 2015 19:41:45 GMT

    [ https://issues.apache.org/jira/browse/FALCON-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14694060#comment-14694060 ]

Ajay Yadava commented on FALCON-1391:
-------------------------------------

I took a heap dump and found that the culprit object is the following char[]:

{code}
<coordinator-app xmlns="uri:oozie:coordinator:0.3" name="FALCON_PROCESS_DEFAULT_p1439387945322"
frequency="5" timezone="UTC" freq_timeunit="MINUTE" end_of_duration="NONE" instance-number="19998"
action-nominal-time="2012-06-28T10:25Z" action-actual-time="2015-08-12T13:59Z">
  <controls>
    <timeout>30</timeout>
    <concurrency>1</concurrency>
    <execution>LAST_ONLY</execution>
    <throttle>12</throttle>
  </controls>
  <action>
    <workflow>
      <app-path>jail://global:00/projects/falcon/staging/falcon/workflows/process/p1439387945322/fcf6989cb5e377cd8ce3bb2035e54dcc_1439387946542/DEFAULT</app-path>
      <configuration>
        <property>
          <name>feedInstancePaths</name>
          <value>IGNORE</value>
        </property>
        <property>
          <name>falconInPaths</name>
          <value>IGNORE</value>
        </property>
        <property>
          <name>userJMSNotificationEnabled</name>
          <value>true</value>
        </property>
        <property>
          <name>drSourceClusterFS</name>
          <value>jail://global:00</value>
        </property>
        <property>
          <name>drSourceDir</name>
          <value>jail://global:00/tmp/test1</value>
        </property>
        <property>
          <name>drNotificationReceivers</name>
          <value>NA</value>
        </property>
        <property>
          <name>falconInputFeeds</name>
          <value>NONE</value>
        </property>
        <property>
          <name>timeStamp</name>
          <value>${coord:formatTime(coord:actualTime(), 'yyyy-MM-dd-HH-mm')}</value>
        </property>
        <property>
          <name>distcpMapBandwidth</name>
          <value>100</value>
        </property>
        <property>
          <name>falconInputNames</name>
          <value>IGNORE</value>
        </property>
        <property>
          <name>feedNames</name>
          <value>IGNORE</value>
        </property>
        <property>
          <name>distcpMaxMaps</name>
          <value>1</value>
        </property>
        <property>
          <name>drTargetDir</name>
          <value>/tmp/test1</value>
        </property>
        <property>
          <name>falcon.recipe.processName</name>
          <value />
        </property>
        <property>
          <name>oozie.wf.subworkflow.classpath.inheritance</name>
          <value>true</value>
        </property>
        <property>
          <name>nominalTime</name>
          <value>${coord:formatTime(coord:nominalTime(), 'yyyy-MM-dd-HH-mm')}</value>
        </property>
        <property>
          <name>oozie.wf.external.id</name>
          <value>p1439387945322/DEFAULT/${coord:nominalTime()}</value>
        </property>
        <property>
          <name>drTargetClusterFS</name>
          <value>jail://global:00</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
{code}

It seemed that we had a faulty process in our integration tests. Based on the nominal time and
instance-number, I was able to trace the start of the process to 2012-04-20T00:00Z.
The properties looked like those of a replication job, but they belonged to a process, so it
appeared to be an HDFS_REPLICATION recipe.
The frequency of minutes(5) and the very high instance number looked suspicious and probably related to the issue.
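Tracing back from the nominal time and instance-number to the start date is simple arithmetic: start = nominal time - (instance-number - 1) × frequency. The Java snippet below is a minimal sketch of that calculation. It is not Falcon or Oozie code. It assumes that instance numbers count from 1 at the coordinator start and that the frequency is a fixed number of minutes. That holds for minutes(5) but not for month-based frequencies. The class `CoordStart` is hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not part of Falcon or Oozie.
// Assumes instance numbers start at 1 at the coordinator start time and that
// the frequency is a fixed number of minutes (e.g. minutes(5)).
final class CoordStart {

    // Instance n is scheduled (n - 1) frequency periods after the start,
    // so walking back from any action's nominal time recovers the start.
    static Instant startFrom(Instant nominalTime, int instanceNumber, int frequencyMinutes) {
        long offsetMinutes = (long) (instanceNumber - 1) * frequencyMinutes;
        return nominalTime.minus(Duration.ofMinutes(offsetMinutes));
    }

    public static void main(String[] args) {
        // action-nominal-time, instance-number and frequency from the heap dump
        Instant nominal = Instant.parse("2012-06-28T10:25:00Z");
        System.out.println(startFrom(nominal, 19998, 5)); // prints 2012-04-20T00:00:00Z
    }
}
```

With the values from the dump, 19997 periods of 5 minutes is 69 days, 10 hours and 25 minutes. That lands exactly on 2012-04-20T00:00Z.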


With the above information, I traced the origin of these values to
/falcon/webapp/src/test/resources/hdfs-replication.properties

A usage search showed that this file is used in FalconCLIIT. Raising the frequency to a much larger
interval, or moving the start date later, solved the OOM issue (though another, SSL-related exception
then appears). Reverting only the tests added to FalconCLIIT in 1188 solved the issue completely
for me.
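One plausible reading of why a later start date or a coarser frequency helps is the catch-up backlog. This is an assumption; the heap dump alone does not prove it. With a start of 2012-04-20T00:00Z and minutes(5), a test running in August 2015 leaves the coordinator with a very large number of past instances. The hypothetical helper below counts them. It assumes a fixed frequency in minutes and uses the action-actual-time from the dump as "now".

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not part of Falcon or Oozie.
// Counts the instances whose nominal time falls in [start, now], assuming a
// fixed frequency in minutes. This is the catch-up backlog a coordinator
// faces when its start date lies far in the past; how many actions Oozie
// materializes at once is governed by its own settings.
final class CatchUpBacklog {

    static long instancesUpTo(Instant start, Instant now, int frequencyMinutes) {
        if (now.isBefore(start)) {
            return 0; // nothing scheduled yet
        }
        long elapsedMinutes = Duration.between(start, now).toMinutes();
        return elapsedMinutes / frequencyMinutes + 1; // +1 for the instance at start
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2015-08-12T13:59:00Z");   // action-actual-time in the dump
        Instant start = Instant.parse("2012-04-20T00:00:00Z"); // start traced from the dump
        System.out.println(instancesUpTo(start, now, 5));       // prints 348360
        System.out.println(instancesUpTo(start, now, 24 * 60)); // daily: prints 1210
        Instant recent = now.minus(Duration.ofHours(1));
        System.out.println(instancesUpTo(recent, now, 5));      // prints 13
    }
}
```

Under these assumptions, the 5-minute process has 348,360 scheduled instances up to 2015-08-12T13:59Z. A daily frequency gives 1,210, and a start one hour earlier gives 13. Both workarounds shrink the backlog by orders of magnitude, which is consistent with the OOM going away.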



> OOM in Integration tests arbitrarily
> ------------------------------------
>
>                 Key: FALCON-1391
>                 URL: https://issues.apache.org/jira/browse/FALCON-1391
>             Project: Falcon
>          Issue Type: Sub-task
>            Reporter: Ajay Yadava
>
> The following is the error I am seeing. It is not caused by ActiveMQ; it happens because of an OutOfMemoryError.
> (http://stackoverflow.com/questions/12128134/active-mq-giving-outofmemory-error)
> {code}
> Running org.apache.falcon.validation.FeedEntityValidationIT
> Exception in thread "InactivityMonitor WriteCheck" java.lang.OutOfMemoryError: Java heap space
> at org.apache.activemq.transport.InactivityMonitor$5.newThread(InactivityMonitor.java:366)
> at java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:610)
> at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:924)
> at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1371)
> at org.apache.activemq.transport.InactivityMonitor.writeCheck(InactivityMonitor.java:148)
> at org.apache.activemq.transport.InactivityMonitor$2.run(InactivityMonitor.java:114)
> at org.apache.activemq.thread.SchedulerTimerTask.run(SchedulerTimerTask.java:33)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
> Exception in thread "ActiveMQ Journal Checkpoint Worker" java.lang.OutOfMemoryError: Java heap space
> at org.apache.kahadb.util.DataByteArrayOutputStream.<init>(DataByteArrayOutputStream.java:45)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
