manifoldcf-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Anthony Leonard <anthony.leon...@york.ac.uk>
Subject Re: Job hanging on "Starting up" with never ending external query.
Date Mon, 21 Jan 2013 14:04:52 GMT
Dear Karl,

Many thanks for your insights. I'll do a kill -QUIT next time we have this
issue which should hopefully give me the thread dump. However we've noticed
that killing processes means we have to run the "locks-clean" script so
it's not our favourite way of doing it.

Also I definitely think a timeout for queries would be a good thing.

I guess we go back to checking that the connection to the database should
have been ok last night...

Best wishes,
Anthony.

-- 
Dr Anthony Leonard
System Integrator, Information Directorate
University of York, Heslington, York, UK, YO10 5DD
Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
Times Higher Education University of the Year 2010


On Mon, Jan 21, 2013 at 1:25 PM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Anthony,
>
> What happens between the framework recognizing that the job should be
> started (which it does fine in both cases), and actually achieving a
> correct job start, is the "seeding" phase, which is going to try to
> execute the seeding query against your Oracle database.  If something
> happens at that time to hang the JDBC connection's seeding query, then
> it precisely explains the behavior you are seeing.
>
> It is also the case that the timeout on the queries that the JDBC
> connector does is effectively infinite.  This makes me suspicious that
> what is happening is an Oracle query is going out but there is no
> response ever coming back.
>
> The other possibility is that the JDBC connector is in fact correctly
> throwing a ServiceInterruption, but that the ManifoldCF code is either
> not handling it properly, or the connector is not forming it properly.
>  In that case, when you notice a hung job, the startup thread will be
> a particular place in the code, and I can diagnose it that way.
>
> The first order of business is therefore to get a thread dump when the
> system is "hung".  That will help confirm the picture.  There are a
> number of additional questions here.
>
> (1) Why is this happening?  Is there any possibility that the Oracle
> database you are crawling is (very occasionally) not able to properly
> respond to a JDBC query?  I can imagine that, under some network
> conditions, it might be possible for the Oracle JDBC driver to wind up
> waiting indefinitely for a response that never comes.
>
> (2) Given that we can't always control the infrastructure we're trying
> to crawl through, should we attempt to provide a reasonable
> workaround?  For example, a timeout on JDBC connector queries, where
> we throw a ServiceInterruption if the timeout is exceeded?
>
> Karl
>
> On Mon, Jan 21, 2013 at 7:57 AM, Anthony Leonard
> <anthony.leonard@york.ac.uk> wrote:
> > Hi there,
> >
> > We have recently started running a nightly job 2AM in ManifoldCF to
> extract
> > data from an Oracle repository and populate a Solr index. Most nights
> this
> > works fine, but occasionally the job has been hanging at the "Starting
> up"
> > phase. We have observed this on our test setup also occasionally. A
> restart
> > of ManifoldCF usually solves this.
> >
> > Using the simple history reports today I looked up all records and sorted
> > them by the "Time" column, largest first, and found the following:
> >
> > "Start Time","Activity","Identifier","Result Code","Bytes","Time","Result
> > Description"
> > "11-12-2012 05:00:05.941","external query" "... SQL QUERY
> > ...","ERROR","0","1926607529","Interrupted: null"
> > "01-21-2013 02:00:11.843","external query" "... SQL QUERY
> > ...","ERROR","0","31644956","Interrupted: null"
> > "01-17-2013 02:00:03.600","external query" "... SQL QUERY
> > ...","ERROR","0","31637594","Interrupted: null"
> > "12-04-2012 12:12:19.860","external query" "... SQL QUERY
> > ...","OK","0","17511",""
> > ... etc ...
> >
> > If the Time column is in millis that means the first query was hanging
> for
> > 22 days! (This was in the period before we went live when our live server
> > was sitting idle for a while.) The other two occasions it was hanging for
> > about 8 hours until we arrived to restart the job in the morning. I have
> > confirmed that the Oracle database we are connecting to was available
> > throughout these periods. These times are also too long for any network
> or
> > database timeouts, which makes me suspect that it's a problem with the
> > application.
> >
> > We have the following logging config in properties.xml
> >
> >   <property name="org.apache.manifoldcf.jobs" value="ALL"/>
> >   <property name="org.apache.manifoldcf.connectors" value="ALL"/>
> >   <property name="org.apache.manifoldcf.agents" value="ALL"/>
> >   <property name="org.apache.manifoldcf.misc" value="ALL"/>
> >
> > The job failed again last night and when I checked at 10:40 AM this
> morning
> > the last few lines of manifoldcf.log were:
> >
> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) - Checking if job
> > 1352455005553 needs to be started; it was last checked at 1358733575454,
> and
> > now it is 1358733585635
> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) -  No time match found
> > within interval 1358733575454 to 1358733585635
> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) - Checking if job
> > 1352455005553 needs to be started; it was last checked at 1358733585636,
> and
> > now it is 1358733595662
> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) -  No time match found
> > within interval 1358733585636 to 1358733595662
> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Checking if job
> > 1352455005553 needs to be started; it was last checked at 1358733595663,
> and
> > now it is 1358733605813
> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) -  Time match FOUND
> within
> > interval 1358733595663 to 1358733605813
> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Job '1352455005553' is
> > within run window at 1358733605813 ms. (which starts at 1358733600000
> ms.)
> > DEBUG 2013-01-21 02:00:05,830 (Job start thread) - Signalled for job
> start
> > for job 1352455005553
> > DEBUG 2013-01-21 02:00:11,674 (Startup thread) - Marked job 1352455005553
> > for startup
> > DEBUG 2013-01-21 02:00:11,843 (Thread-951922) - JDBC: The connect string
> is
> > 'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
> >
> > After that - nothing. Once I restarted manifoldcf this morning the job
> > magically restarted itself and the following log messages were added
> where
> > it had left off:
> >
> > DEBUG 2013-01-21 10:47:36,852 (Startup thread) - Setting job
> 1352455005553
> > back to 'ReadyForStartup' state
> > DEBUG 2013-01-21 10:48:57,538 (Thread-98) - Resetting due to restart
> > DEBUG 2013-01-21 10:48:57,958 (Thread-98) - Reset complete
> > DEBUG 2013-01-21 10:48:58,200 (Startup thread) - Marked job 1352455005553
> > for startup
> > DEBUG 2013-01-21 10:48:58,367 (Thread-184) - JDBC: The connect string is
> > 'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
> > DEBUG 2013-01-21 10:49:02,071 (Startup thread) - Job 1352455005553 is now
> > started
> > ... etc ...
> >
> > So it appears that the application is running fine with a Job start
> thread
> > logging nicely every ten seconds until it tries to start this job and
> hangs
> > entirely until the system is shutdown - even though it performed that
> same
> > tasks perfectly the night before and later the same day.
> >
> > Can anyone advise on what might be happening here? We are running
> ManifoldCF
> > version 1.0.1 on Ubuntu 10.04.
> >
> > Best wishes,
> > Anthony.
> >
> > --
> > Dr Anthony Leonard
> > System Integrator, Information Directorate
> > University of York, Heslington, York, UK, YO10 5DD
> > Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
> > Times Higher Education University of the Year 2010
>

Mime
View raw message