manifoldcf-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Karl Wright <daddy...@gmail.com>
Subject Re: Job hanging on "Starting up" with never ending external query.
Date Tue, 22 Jan 2013 17:53:05 GMT
Hmm.

The following threads are of interest here:

Thread 29975: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may
be imprecise)
 - java.lang.Thread.join(long) @bci=38, line=1203 (Compiled frame)
 - java.lang.Thread.join() @bci=2, line=1256 (Compiled frame)
 - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection$JDBCPSResultSet.<init>(org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection,
java.lang.String, java.util.ArrayList, int) @bci=39, line=1058
(Interpreted frame)
 - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection.executeUncachedQuery(java.lang.String,
java.util.ArrayList, int) @bci=23, line=256 (Interpreted frame)
 - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector.addSeedDocuments(org.apache.manifoldcf.crawler.interfaces.ISeedingActivity,
org.apache.manifoldcf.crawler.interfaces.DocumentSpecification, long,
long, int) @bci=106, line=246 (Interpreted frame)
 - org.apache.manifoldcf.crawler.system.StartupThread.run() @bci=636,
line=179 (Interpreted frame)

... which is probably waiting in this one:

Thread 24457: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may
be imprecise)
 - java.lang.Object.wait() @bci=2, line=502 (Interpreted frame)
 - org.apache.manifoldcf.core.jdbcpool.ConnectionPool.getConnection()
@bci=80, line=80 (Interpreted frame)
 - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnectionFactory.getConnection(java.lang.String,
java.lang.String, java.lang.String, java.lang.String,
java.lang.String) @bci=433, line=128 (Interpreted frame)
 - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection$PreparedStatementQueryThread.run()
@bci=36, line=1212 (Interpreted frame)

... which is waiting to obtain a JDBC connection, and the reason it
can't is because it thinks that the only available JDBC connection is
currently in use.

Since you have only a single connection around, and nothing else is
active, it stands to reason that a JDBC connection handle has somehow
been leaked, which is a challenge since connections are typically
freed in a try/finally block through ManifoldCF.

I notice that your stack frames are pretty unusual - what JDK is this
that you are using?

Karl


Since
On Tue, Jan 22, 2013 at 12:00 PM, Anthony Leonard
<anthony.leonard@york.ac.uk> wrote:
> Dear Karl,
>
> Our DBA noticed that each time our job was run 10 Oracle connections were
> created. So, we dropped the "Max connections" parameter on the repository
> connection config to 1 and re-ran the job with the DBA watching. The job
> worked fine but the DBA reported that 1 connection was created and then 10
> more briefly ...
>
> Out of curiosity we re-ran the job again with no further changes and this
> time got the following results:
>
> * the job hung in the "Starting Up" phase again, with the same logging and
> symptoms as detailed before on this thread.
> * the DBA reported seeing no connections at all this time.
> * I have attached a thread dump created by "jstack -F <pid>". This is
> reporting all threads as blocked.
>
> Any ideas? Any help with this would certainly be very gratefully received.
>
> Best wishes,
> Anthony.
>
> --
> Dr Anthony Leonard
> System Integrator, Information Directorate
> University of York, Heslington, York, UK, YO10 5DD
> Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
> Times Higher Education University of the Year 2010
>
>
> On Mon, Jan 21, 2013 at 2:15 PM, Karl Wright <daddywri@gmail.com> wrote:
>>
>> kill -QUIT should not abort the agents process, just cause a thread
>> dump.  kill -9 is a different story.
>>
>> You can also do the same thing by using jstack, in the jvm bin directory.
>>
>> Karl
>>
>>
>> On Mon, Jan 21, 2013 at 9:04 AM, Anthony Leonard
>> <anthony.leonard@york.ac.uk> wrote:
>> > Dear Karl,
>> >
>> > Many thanks for your insights. I'll do a kill -QUIT next time we have
>> > this
>> > issue which should hopefully give me the thread dump. However we've
>> > noticed
>> > that killing processes means we have to run the "locks-clean" script so
>> > it's
>> > not our favourite way of doing it.
>> >
>> > Also I definitely think a timeout for queries would be a good thing.
>> >
>> > I guess we go back to checking that the connection to the database
>> > should
>> > have been ok last night...
>> >
>> > Best wishes,
>> > Anthony.
>> >
>> > --
>> > Dr Anthony Leonard
>> > System Integrator, Information Directorate
>> > University of York, Heslington, York, UK, YO10 5DD
>> > Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
>> > Times Higher Education University of the Year 2010
>> >
>> >
>> > On Mon, Jan 21, 2013 at 1:25 PM, Karl Wright <daddywri@gmail.com> wrote:
>> >>
>> >> Hi Anthony,
>> >>
>> >> What happens between the framework recognizing that the job should be
>> >> started (which it does fine in both cases), and actually achieving a
>> >> correct job start, is the "seeding" phase, which is going to try to
>> >> execute the seeding query against your Oracle database.  If something
>> >> happens at that time to hang the JDBC connection's seeding query, then
>> >> it precisely explains the behavior you are seeing.
>> >>
>> >> It is also the case that the timeout on the queries that the JDBC
>> >> connector does is effectively infinite.  This makes me suspicious that
>> >> what is happening is an Oracle query is going out but there is no
>> >> response ever coming back.
>> >>
>> >> The other possibility is that the JDBC connector is in fact correctly
>> >> throwing a ServiceInterruption, but that the ManifoldCF code is either
>> >> not handling it properly, or the connector is not forming it properly.
>> >>  In that case, when you notice a hung job, the startup thread will be
>> >> a particular place in the code, and I can diagnose it that way.
>> >>
>> >> The first order of business is therefore to get a thread dump when the
>> >> system is "hung".  That will help confirm the picture.  There are a
>> >> number of additional questions here.
>> >>
>> >> (1) Why is this happening?  Is there any possibility that the Oracle
>> >> database you are crawling is (very occasionally) not able to properly
>> >> respond to a JDBC query?  I can imagine that, under some network
>> >> conditions, it might be possible for the Oracle JDBC driver to wind up
>> >> waiting indefinitely for a response that never comes.
>> >>
>> >> (2) Given that we can't always control the infrastructure we're trying
>> >> to crawl through, should we attempt to provide a reasonable
>> >> workaround?  For example, a timeout on JDBC connector queries, where
>> >> we throw a ServiceInterruption if the timeout is exceeded?
>> >>
>> >> Karl
>> >>
>> >> On Mon, Jan 21, 2013 at 7:57 AM, Anthony Leonard
>> >> <anthony.leonard@york.ac.uk> wrote:
>> >> > Hi there,
>> >> >
>> >> > We have recently started running a nightly job 2AM in ManifoldCF to
>> >> > extract
>> >> > data from an Oracle repository and populate a Solr index. Most nights
>> >> > this
>> >> > works fine, but occasionally the job has been hanging at the
>> >> > "Starting
>> >> > up"
>> >> > phase. We have observed this on our test setup also occasionally. A
>> >> > restart
>> >> > of ManifoldCF usually solves this.
>> >> >
>> >> > Using the simple history reports today I looked up all records and
>> >> > sorted
>> >> > them by the "Time" column, largest first, and found the following:
>> >> >
>> >> > "Start Time","Activity","Identifier","Result
>> >> > Code","Bytes","Time","Result
>> >> > Description"
>> >> > "11-12-2012 05:00:05.941","external query" "... SQL QUERY
>> >> > ...","ERROR","0","1926607529","Interrupted: null"
>> >> > "01-21-2013 02:00:11.843","external query" "... SQL QUERY
>> >> > ...","ERROR","0","31644956","Interrupted: null"
>> >> > "01-17-2013 02:00:03.600","external query" "... SQL QUERY
>> >> > ...","ERROR","0","31637594","Interrupted: null"
>> >> > "12-04-2012 12:12:19.860","external query" "... SQL QUERY
>> >> > ...","OK","0","17511",""
>> >> > ... etc ...
>> >> >
>> >> > If the Time column is in millis that means the first query was
>> >> > hanging
>> >> > for
>> >> > 22 days! (This was in the period before we went live when our live
>> >> > server
>> >> > was sitting idle for a while.) The other two occasions it was hanging
>> >> > for
>> >> > about 8 hours until we arrived to restart the job in the morning. I
>> >> > have
>> >> > confirmed that the Oracle database we are connecting to was available
>> >> > throughout these periods. These times are also too long for any
>> >> > network
>> >> > or
>> >> > database timeouts, which makes me suspect that it's a problem with
>> >> > the
>> >> > application.
>> >> >
>> >> > We have the following logging config in properties.xml
>> >> >
>> >> >   <property name="org.apache.manifoldcf.jobs" value="ALL"/>
>> >> >   <property name="org.apache.manifoldcf.connectors" value="ALL"/>
>> >> >   <property name="org.apache.manifoldcf.agents" value="ALL"/>
>> >> >   <property name="org.apache.manifoldcf.misc" value="ALL"/>
>> >> >
>> >> > The job failed again last night and when I checked at 10:40 AM this
>> >> > morning
>> >> > the last few lines of manifoldcf.log were:
>> >> >
>> >> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) - Checking if job
>> >> > 1352455005553 needs to be started; it was last checked at
>> >> > 1358733575454,
>> >> > and
>> >> > now it is 1358733585635
>> >> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) -  No time match
>> >> > found
>> >> > within interval 1358733575454 to 1358733585635
>> >> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) - Checking if job
>> >> > 1352455005553 needs to be started; it was last checked at
>> >> > 1358733585636,
>> >> > and
>> >> > now it is 1358733595662
>> >> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) -  No time match
>> >> > found
>> >> > within interval 1358733585636 to 1358733595662
>> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Checking if job
>> >> > 1352455005553 needs to be started; it was last checked at
>> >> > 1358733595663,
>> >> > and
>> >> > now it is 1358733605813
>> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) -  Time match FOUND
>> >> > within
>> >> > interval 1358733595663 to 1358733605813
>> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Job
>> >> > '1352455005553'
>> >> > is
>> >> > within run window at 1358733605813 ms. (which starts at 1358733600000
>> >> > ms.)
>> >> > DEBUG 2013-01-21 02:00:05,830 (Job start thread) - Signalled for job
>> >> > start
>> >> > for job 1352455005553
>> >> > DEBUG 2013-01-21 02:00:11,674 (Startup thread) - Marked job
>> >> > 1352455005553
>> >> > for startup
>> >> > DEBUG 2013-01-21 02:00:11,843 (Thread-951922) - JDBC: The connect
>> >> > string
>> >> > is
>> >> > 'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
>> >> >
>> >> > After that - nothing. Once I restarted manifoldcf this morning the
>> >> > job
>> >> > magically restarted itself and the following log messages were added
>> >> > where
>> >> > it had left off:
>> >> >
>> >> > DEBUG 2013-01-21 10:47:36,852 (Startup thread) - Setting job
>> >> > 1352455005553
>> >> > back to 'ReadyForStartup' state
>> >> > DEBUG 2013-01-21 10:48:57,538 (Thread-98) - Resetting due to restart
>> >> > DEBUG 2013-01-21 10:48:57,958 (Thread-98) - Reset complete
>> >> > DEBUG 2013-01-21 10:48:58,200 (Startup thread) - Marked job
>> >> > 1352455005553
>> >> > for startup
>> >> > DEBUG 2013-01-21 10:48:58,367 (Thread-184) - JDBC: The connect string
>> >> > is
>> >> > 'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
>> >> > DEBUG 2013-01-21 10:49:02,071 (Startup thread) - Job 1352455005553
is
>> >> > now
>> >> > started
>> >> > ... etc ...
>> >> >
>> >> > So it appears that the application is running fine with a Job start
>> >> > thread
>> >> > logging nicely every ten seconds until it tries to start this job and
>> >> > hangs
>> >> > entirely until the system is shutdown - even though it performed that
>> >> > same
>> >> > tasks perfectly the night before and later the same day.
>> >> >
>> >> > Can anyone advise on what might be happening here? We are running
>> >> > ManifoldCF
>> >> > version 1.0.1 on Ubuntu 10.04.
>> >> >
>> >> > Best wishes,
>> >> > Anthony.
>> >> >
>> >> > --
>> >> > Dr Anthony Leonard
>> >> > System Integrator, Information Directorate
>> >> > University of York, Heslington, York, UK, YO10 5DD
>> >> > Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
>> >> > Times Higher Education University of the Year 2010
>> >
>> >
>
>

Mime
View raw message