manifoldcf-user mailing list archives

From "Ian Zapczynski" <Ian.Zapczyn...@veritablelp.com>
Subject Re: Need examples of expressions used to specify multiple folders to index
Date Fri, 20 Mar 2015 18:25:04 GMT
Thanks for the help, Karl. Yup, I was using the simple-to-set-up single-process configuration, and silly me: after I restarted from scratch at one point, I completely failed to update the combined-options-env.win config file that you referred to, so MCF was still set to use only 256 MB despite my thinking otherwise. I've bumped it up to 4 GB, and the job recovered and is finally moving along again.
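For the record, in my install that file just lists one JVM switch per line, so the fix amounted to editing the heap options to something like the following (values from my setup; your file name and defaults may differ by version):

```
-Xms1024m
-Xmx4096m
```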
-Ian

>>> Karl Wright <daddywri@gmail.com> 3/20/2015 10:55 AM >>>
Hi Ian,

HSQLDB is an interesting database in that it is *not* memory constrained; it attempts to keep everything in memory.

If you want to keep using HSQLDB, I'd strongly suggest giving the MCF agents process a lot more memory, say 2 GB. A better choice, though, would be PostgreSQL or MySQL. There's a configuration file where you can put Java switches for all of the processes; start by doing that.
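If you do move off HSQLDB, the database is selected in properties.xml; the switch to PostgreSQL looks something like this (property name taken from the deployment docs; verify it against your version, and you'll also need the database name/user/password properties described there):

```xml
<property name="org.apache.manifoldcf.databaseimplementationclass" value="org.apache.manifoldcf.core.database.DBInterfacePostgreSQL"/>
```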

Thanks,
Karl




On Fri, Mar 20, 2015 at 9:29 AM, Ian Zapczynski <Ian.Zapczynski@veritablelp.com> wrote:


Hi Karl,
I have Solr and ManifoldCF running with Tomcat on a Windows 2012 R2 server. Linux would have been my preference, but various logistics prevented me from using it. I have set the maximum document length to 3072000. I chose a larger size than might be normal because when I first ran a test, I could see that a lot of docs were getting rejected based on size, and it seems folks around here don't shrink their PDFs.

The errors from the log are below. I was busy paying attention to the errors spit out to the console, which didn't so obviously point to the backend database as the culprit. I'm guessing that I'm pushing the database too hard and should really be using PostgreSQL, right? I don't know why, but I didn't see or reference the deployment documentation that covers using other databases until now. I was working off of the ManifoldCF End User Documentation as well as a (mostly) helpful blog post I found elsewhere.
Much thanks,
-Ian
WARN 2015-03-19 18:30:48,030 (Worker thread '34') - Solr exception during indexing file://///host.domain.com/FileShare1/Data/Manager%20Information/<foldername>/file1.pdf
(500): Server at http://localhost:8983/solr returned non ok status:500, message:Server Error
org.apache.solr.common.SolrException: Server at http://localhost:8983/solr returned non ok
status:500, message:Server Error
at org.apache.manifoldcf.agents.output.solr.ModifiedHttpSolrServer.request(ModifiedHttpSolrServer.java:303)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:894)
WARN 2015-03-19 18:30:48,030 (Worker thread '34') - Service interruption reported for job
1426796577848 connection 'MACLSTR file server': Solr exception during indexing file://///host.domain.com/FileShare1/Data/Manager%20Information/<foldername>/file3.pdf
(500): Server at http://localhost:8983/solr returned non ok status:500, message:Server Error
ERROR 2015-03-19 18:31:45,730 (Job delete thread) - Job delete thread aborting and restarting
due to database connection reset: Database exception: SQLException doing query (S1000): java.lang.OutOfMemoryError:
GC overhead limit exceeded
ERROR 2015-03-19 18:31:45,309 (Finisher thread) - Finisher thread aborting and restarting
due to database connection reset: Database exception: SQLException doing query (S1000): java.lang.OutOfMemoryError:
GC overhead limit exceeded
ERROR 2015-03-19 18:31:43,043 (Set priority thread) - Set priority thread aborting and restarting
due to database connection reset: Database exception: SQLException doing query (S1000): java.lang.OutOfMemoryError:
GC overhead limit exceeded
ERROR 2015-03-19 18:32:02,292 (Job notification thread) - Job notification thread aborting
and restarting due to database connection reset: Database exception: SQLException doing query
(S1000): java.lang.OutOfMemoryError: GC overhead limit exceeded
FATAL 2015-03-19 18:32:05,870 (Thread-3838608) - C:\apache-manifoldcf-2.0.1\example\.\./dbname.data
getFromFile out of mem 531146
WARN 2015-03-19 18:32:09,167 (Assessment thread) - Found a long-running query (64919 ms):
[SELECT id,status,connectionname FROM jobs WHERE assessmentstate=? FOR UPDATE]
WARN 2015-03-19 18:32:09,167 (Assessment thread) - Parameter 0: 'N'
ERROR 2015-03-19 18:32:09,167 (Job reset thread) - Job reset thread aborting and restarting
due to database connection reset: Database exception: SQLException doing query (S1000): java.lang.RuntimeException:
Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data
getFromFile out of mem 531146 java.lang.RuntimeException: Logging failed when attempting to
log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: SQLException
doing query (S1000): java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data
getFromFile out of mem 531146 java.lang.RuntimeException: Logging failed when attempting to
log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146
at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.finishUp(Database.java:702)
at org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:728)
at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:771)
at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1444)
at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:146)
at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:191)
at org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.performModification(DBInterfaceHSQLDB.java:750)
at org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.performUpdate(DBInterfaceHSQLDB.java:296)
at org.apache.manifoldcf.core.database.BaseTable.performUpdate(BaseTable.java:80)
at org.apache.manifoldcf.crawler.jobs.JobQueue.noDocPriorities(JobQueue.java:967)
at org.apache.manifoldcf.crawler.jobs.JobManager.noDocPriorities(JobManager.java:8148)
at org.apache.manifoldcf.crawler.jobs.JobManager.finishJobStops(JobManager.java:8123)
at org.apache.manifoldcf.crawler.system.JobResetThread.run(JobResetThread.java:69)
Caused by: java.sql.SQLException: java.lang.RuntimeException: Logging failed when attempting
to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146 java.lang.RuntimeException:
Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data
getFromFile out of mem 531146
at org.hsqldb.jdbc.JDBCUtil.sqlException(Unknown Source)
at org.hsqldb.jdbc.JDBCUtil.sqlException(Unknown Source)
at org.hsqldb.jdbc.JDBCPreparedStatement.fetchResult(Unknown Source)
at org.hsqldb.jdbc.JDBCPreparedStatement.executeUpdate(Unknown Source)
at org.apache.manifoldcf.core.database.Database.execute(Database.java:903)
at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:683)
Caused by: org.hsqldb.HsqlException: java.lang.RuntimeException: Logging failed when attempting
to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem 531146
at org.hsqldb.error.Error.error(Unknown Source)
at org.hsqldb.result.Result.newErrorResult(Unknown Source)
at org.hsqldb.StatementDMQL.execute(Unknown Source)
at org.hsqldb.Session.executeCompiledStatement(Unknown Source)
at org.hsqldb.Session.execute(Unknown Source)
... 4 more
Caused by: java.lang.RuntimeException: Logging failed when attempting to log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data
getFromFile out of mem 531146
at org.hsqldb.lib.FrameworkLogger.privlog(Unknown Source)
at org.hsqldb.lib.FrameworkLogger.severe(Unknown Source)
at org.hsqldb.persist.Logger.logSevereEvent(Unknown Source)
at org.hsqldb.persist.DataFileCache.logSevereEvent(Unknown Source)
at org.hsqldb.persist.DataFileCache.getFromFile(Unknown Source)
at org.hsqldb.persist.DataFileCache.get(Unknown Source)
at org.hsqldb.persist.RowStoreAVLDisk.get(Unknown Source)
at org.hsqldb.index.NodeAVLDisk.findNode(Unknown Source)
at org.hsqldb.index.NodeAVLDisk.getRight(Unknown Source)
at org.hsqldb.index.IndexAVL.next(Unknown Source)
at org.hsqldb.index.IndexAVL.next(Unknown Source)
at org.hsqldb.index.IndexAVL$IndexRowIterator.getNextRow(Unknown Source)
at org.hsqldb.RangeVariable$RangeIteratorMain.findNext(Unknown Source)
at org.hsqldb.RangeVariable$RangeIteratorMain.next(Unknown Source)
at org.hsqldb.StatementDML.executeUpdateStatement(Unknown Source)
at org.hsqldb.StatementDML.getResult(Unknown Source)
... 7 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
... 23 more
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
WARN 2015-03-19 18:32:09,167 (Assessment thread) - Plan: isDistinctSelect=[false]
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: isGrouped=[false]
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: isAggregated=[false]
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: columns=[ COLUMN: PUBLIC.JOBS.ID
not nullable
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: COLUMN: PUBLIC.JOBS.STATUS not nullable
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: COLUMN: PUBLIC.JOBS.CONNECTIONNAME
not nullable
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: 
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: [range variable 1
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: join type=INNER
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: table=JOBS
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: cardinality=5
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: access=FULL SCAN
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: join condition = [index=SYS_IDX_SYS_PK_10234_10237
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: other condition=[
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: EQUAL arg_left=[ COLUMN: PUBLIC.JOBS.ASSESSMENTSTATE
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ] arg_right=[ DYNAMIC PARAM: , TYPE
= CHARACTER
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]]
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]]
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: PARAMETERS=[
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: @0[DYNAMIC PARAM: , TYPE = CHARACTER
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: ]]
WARN 2015-03-19 18:32:09,182 (Assessment thread) - Plan: SUBQUERIES[]
WARN 2015-03-19 18:32:09,182 (Assessment thread) - 
FATAL 2015-03-19 18:32:09,198 (Job notification thread) - JobNotificationThread initialization
error tossed: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
FATAL 2015-03-19 18:32:09,198 (Set priority thread) - SetPriorityThread initialization error
tossed: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
FATAL 2015-03-19 18:32:09,198 (Finisher thread) - FinisherThread initialization error tossed:
GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
FATAL 2015-03-19 18:32:09,198 (Job delete thread) - JobDeleteThread initialization error tossed:
GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
FATAL 2015-03-19 18:32:09,198 (Seeding thread) - SeedingThread initialization error tossed:
GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded

>>> Karl Wright <daddywri@gmail.com> 3/19/2015 3:34 PM >>>
Hi Ian, 

ManifoldCF operates under what is known as a "bounded" memory model. That means you should always be able to find a memory size that works (and that isn't huge).

The only exception is Solr indexing that does *not* go via the extracting update handler. The standard update handler unfortunately *requires* that the entire document fit in memory. If that is what you are doing, you must take steps to limit the maximum document size to prevent OOMs.
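As a sketch of what that size limit does (purely illustrative; the helper and threshold below are made up, not MCF code — MCF's job-level "maximum document length" setting plays this role for you):

```python
import os

# Illustrative cap, not an MCF default; the job's "maximum document
# length" setting is the real knob.
MAX_DOC_BYTES = 3_072_000

def small_enough(path: str, max_bytes: int = MAX_DOC_BYTES) -> bool:
    """True only for files that could safely be buffered whole in memory
    by the standard update handler; anything larger should be skipped
    (or sent through the extracting update handler instead)."""
    return os.path.getsize(path) <= max_bytes
```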

160,000 documents is quite small by MCF standards (we do 10 million to 50 million on some setups), so let's diagnose your problem before taking any drastic action. Can you provide an out-of-memory dump from the log, for instance? And can you let us know which deployment model you are using (e.g. single-process)?

Thanks,
Karl


On Thu, Mar 19, 2015 at 3:07 PM, Ian Zapczynski <Ian.Zapczynski@veritablelp.com> wrote:


Hello all. I am using ManifoldCF to index a Windows share containing well over 160,000 files (.xls, .pdf, .doc). I keep getting memory errors when I try to index the whole folder at once, and I have not been able to resolve this by throwing memory and CPU at Tomcat and the VM, so I thought I'd try a different approach.
What I'd like to do now is break what was a single job into multiple jobs. Each job should index all indexable files under a parent folder: one job indexing folders whose names begin with the letters A-G (including all subfolders and files within), another job for H-M (likewise with all subfolders and files), and so on. My problem is that I can't figure out what expression to use to make it index what I want.
In the Job settings under Paths, I have specified the parent folder, and within it I've tried:
1. Include file(s) or directory(s) matching * (this works, but indexes every file in every folder within the parent, eventually causing unresolvable GC overhead errors)
2. Include file(s) or directory(s) matching ^(?i)[A-G]* (this does not work; it indexes one file and then quits)
3. Include file(s) or directory(s) matching A* (this does not work; it indexes one file and then quits, even though there are many folders directly under the parent that begin with 'A')
Can anyone confirm what type of expression I should use in the paths to accomplish this?
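My working guess (and it is only a guess at the connector's matching semantics) is that each rule is matched against individual file and folder names rather than whole paths, in which case A* would admit the right top-level folders but prune everything beneath them. In Python terms, with hypothetical folder names:

```python
from fnmatch import fnmatch  # shell-style wildcards, like the "A*" rule

# Hypothetical layout of the share
top_level = ["Accounting", "Brokerage", "Holdings"]
under_accounting = ["Reports", "Invoices"]

# "A*" admits the right folders at the top level...
a_to_g = [d for d in top_level if fnmatch(d, "A*")]          # ['Accounting']
# ...but re-applied to subfolder names it matches nothing,
# so the crawl would stop one level down:
deeper = [d for d in under_accounting if fnmatch(d, "A*")]   # []
```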
Alternatively, if you think I should be able to index 160,000+ files in one job without getting GC overhead errors, I'm open to suggestions on resolving those. All I know to do is increase the maximum memory in Tomcat as well as on the OS, and that didn't help at all.
Thanks much!


-Ian




