manifoldcf-user mailing list archives

From Michael Le <michael.aaron...@gmail.com>
Subject JDBC Connection Exception
Date Mon, 07 May 2012 05:25:58 GMT
Hello,

Using a JDBC repository connection to an Oracle 11g database, I've run into
an issue where, during the initial seeding stage, the connection to the
database is closed in the middle of processing the result set.  The data
table I'm trying to index holds about 10 million records, and with the
original code I could never get past roughly 750K of them.
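
For reference, the failure shows up even in a bare JDBC loop like the
sketch below (the connection details and query are placeholders, not our
actual configuration); somewhere past a few hundred thousand rows,
rs.next() throws an SQLException reporting a closed connection:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class SeedingSketch {
    public static void main(String[] args) throws SQLException {
        // Placeholder connection details -- substitute your own.
        String url = "jdbc:oracle:thin:@//dbhost:1521/ORCL";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement()) {
            // Oracle's default row prefetch is 10; a larger fetch size cuts
            // round trips while the driver streams the big result set.
            stmt.setFetchSize(500);
            // Placeholder query standing in for the seeding query.
            try (ResultSet rs = stmt.executeQuery("SELECT id FROM documents")) {
                long count = 0;
                while (rs.next()) {    // for us this started failing around
                    rs.getString(1);   // ~750K rows with a closed-connection
                    count++;           // SQLException
                }
                System.out.println("Seeded " + count + " ids");
            }
        }
    }
}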

I spent some time tuning the parameters of the bitmechanic connection
pool, but its API docs and source no longer seem to be available; even the
original author no longer has the code or specs.  The parameter changes
let me get through the first stage of processing a 2M-row subset, but
during the second stage, where the documents themselves are fetched, the
connections again started being closed.  I ended up replacing the
connection pool code with an Oracle implementation, and it's churning
through the documents happily.  As a footnote, on my sample subset of
about 400K documents, throughput went from about 10 documents/s to 19
documents/s, but this may just be a side effect of Oracle database load or
network traffic.
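
In case it's useful to anyone, the idea behind the replacement can be
sketched with Oracle's Universal Connection Pool (UCP).  This is a
simplified illustration rather than our actual integration, and the pool
sizes are made up:

import java.sql.Connection;
import java.sql.SQLException;

import oracle.ucp.jdbc.PoolDataSource;
import oracle.ucp.jdbc.PoolDataSourceFactory;

public class OraclePoolSketch {
    public static PoolDataSource buildPool() throws SQLException {
        PoolDataSource pds = PoolDataSourceFactory.getPoolDataSource();
        pds.setConnectionFactoryClassName("oracle.jdbc.pool.OracleDataSource");
        pds.setURL("jdbc:oracle:thin:@//dbhost:1521/ORCL"); // placeholder
        pds.setUser("user");
        pds.setPassword("password");
        // Illustrative sizes -- tune to the crawler's worker thread count.
        pds.setInitialPoolSize(5);
        pds.setMinPoolSize(5);
        pds.setMaxPoolSize(30);
        // The part that matters for our symptom: validate each connection as
        // it is borrowed, so one the server has silently dropped is replaced
        // instead of failing mid-crawl.
        pds.setValidateConnectionOnBorrow(true);
        pds.setInactiveConnectionTimeout(300); // seconds
        return pds;
    }

    public static void main(String[] args) throws SQLException {
        try (Connection conn = buildPool().getConnection()) {
            System.out.println("Connection open: " + !conn.isClosed());
        }
    }
}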

Has anyone else had issues processing a large Oracle repository?  I've
noted that the benchmarks were done with 300K documents, and even in our
own initial testing with about 500K documents, no issues arose.

The second and more pressing issue is the jobqueues table.  In the process
of debugging the database connection issues, jobs were started, stopped,
deleted, and aborted, and various WHERE clauses were applied to the
seeding queries/jobs.  MCF is now reporting that there are long-running
queries against this table.  In the past, I've just truncated the
jobqueues table, but this had the side effect of stuffing a document into
Solr (the output connector) multiple times.  What API calls or SQL can I
run to clean up the jobqueues table?  Should I just wait for all jobs to
finish and then truncate the table at that point?  I've broken my data
into several smaller subsets of around 1-2 million rows each, but that has
the side effect of a jobqueues table that is 6-8 million rows.
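
For clarity, the past cleanup I'm referring to is just the blunt truncate
sketched below, run against MCF's own database rather than the Oracle
repository (the URL is a placeholder, and the table name is as I've been
calling it here -- please verify it against your MCF schema before running
anything):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class JobQueueCleanupSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder URL for MCF's internal database (PostgreSQL here).
        String url = "jdbc:postgresql://localhost:5432/manifoldcf";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement()) {
            // The blunt cleanup described above; its side effect is that
            // documents can be pushed to the Solr output connector again.
            // Verify the exact table name against your MCF schema first.
            stmt.execute("TRUNCATE TABLE jobqueues");
        }
    }
}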

Any support would be greatly appreciated.

Thanks,
-Michael Le
