manifoldcf-user mailing list archives

From: Karl Wright <daddy...@gmail.com>
Subject: Re: Need examples of expressions used to specify multiple folders to index
Date: Fri, 20 Mar 2015 14:55:32 GMT
Hi Ian,

HSQLDB is an interesting database in that it does *not* constrain its
memory use: it attempts to keep everything in memory.

If you want to keep using HSQLDB, I'd strongly suggest giving the MCF
agents process a lot more memory, say 2G.  A better choice, though, would
be PostgreSQL or MySQL.  There's a configuration file where you can put
Java switches for all of the processes; start by doing that.
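As a minimal sketch (assuming the single-process example deployment,
where the start scripts read the Java switches from options.env.win or
options.env.unix in the example directory; file names vary by version),
raising the heap would look like:

  -Xms512m
  -Xmx2048m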

Thanks,
Karl




On Fri, Mar 20, 2015 at 9:29 AM, Ian Zapczynski <
Ian.Zapczynski@veritablelp.com> wrote:

>  Hi Karl,
>
> I have Solr and ManifoldCF running with Tomcat on a Windows 2012 R2
> server.  Linux would have been my preference, but various logistics
> prevented me from using it.  I have set the maximum document length to
> 3072000.  I chose a larger size than might be typical because when I
> first ran a test, I could see that a lot of docs were getting rejected
> based on size, and it seems folks around here don't shrink the size of
> their PDFs.
>
>  The errors from the log are below.  I had been paying more attention to
> the errors spit out to the console, which didn't so obviously point to the
> backend database as the culprit.  I'm guessing that I'm pushing the
> database too hard and should really be using PostgreSQL, right?  I don't
> know why, but until now I hadn't seen or referenced the deployment
> documentation that covers using other databases.  I was working off of
> the ManifoldCF End User Documentation as well as a (mostly) helpful blog
> post I found elsewhere.
>
> Much thanks,
>
> -Ian
>
>
>  WARN 2015-03-19 18:30:48,030 (Worker thread '34') - Solr exception during
> indexing
> file://///host.domain.com/FileShare1/Data/Manager%20Information/<foldername>/file1.pdf
> (500): Server at http://localhost:8983/solr returned non ok status:500,
> message:Server Error
> org.apache.solr.common.SolrException: Server at http://localhost:8983/solr
> returned non ok status:500, message:Server Error
>  at
> org.apache.manifoldcf.agents.output.solr.ModifiedHttpSolrServer.request(ModifiedHttpSolrServer.java:303)
>  at
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
>  at
> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
>  at
> org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:894)
>  WARN 2015-03-19 18:30:48,030 (Worker thread '34') - Service interruption
> reported for job 1426796577848 connection 'MACLSTR file server': Solr
> exception during indexing
> file://///host.domain.com/FileShare1/Data/Manager%20Information/<foldername>/file3.pdf
> (500): Server at http://localhost:8983/solr returned non ok status:500,
> message:Server Error
> ERROR 2015-03-19 18:31:45,730 (Job delete thread) - Job delete thread
> aborting and restarting due to database connection reset: Database
> exception: SQLException doing query (S1000): java.lang.OutOfMemoryError: GC
> overhead limit exceeded
> ERROR 2015-03-19 18:31:45,309 (Finisher thread) - Finisher thread aborting
> and restarting due to database connection reset: Database exception:
> SQLException doing query (S1000): java.lang.OutOfMemoryError: GC overhead
> limit exceeded
> ERROR 2015-03-19 18:31:43,043 (Set priority thread) - Set priority thread
> aborting and restarting due to database connection reset: Database
> exception: SQLException doing query (S1000): java.lang.OutOfMemoryError: GC
> overhead limit exceeded
> ERROR 2015-03-19 18:32:02,292 (Job notification thread) - Job notification
> thread aborting and restarting due to database connection reset: Database
> exception: SQLException doing query (S1000): java.lang.OutOfMemoryError: GC
> overhead limit exceeded
> FATAL 2015-03-19 18:32:05,870 (Thread-3838608) -
> C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem
> 531146
>  WARN 2015-03-19 18:32:09,167 (Assessment thread) - Found a long-running
> query (64919 ms): [SELECT id,status,connectionname FROM jobs WHERE
> assessmentstate=? FOR UPDATE]
>  WARN 2015-03-19 18:32:09,167 (Assessment thread) -   Parameter 0: 'N'
> ERROR 2015-03-19 18:32:09,167 (Job reset thread) - Job reset thread
> aborting and restarting due to database connection reset: Database
> exception: SQLException doing query (S1000): java.lang.RuntimeException:
> Logging failed when attempting to log:
> C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem
> 531146 java.lang.RuntimeException: Logging failed when attempting to log:
> C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem
> 531146
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database
> exception: SQLException doing query (S1000): java.lang.RuntimeException:
> Logging failed when attempting to log:
> C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem
> 531146 java.lang.RuntimeException: Logging failed when attempting to log:
> C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem
> 531146
>  at
> org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.finishUp(Database.java:702)
>  at
> org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:728)
>  at
> org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:771)
>  at
> org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1444)
>  at
> org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:146)
>  at
> org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:191)
>  at
> org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.performModification(DBInterfaceHSQLDB.java:750)
>  at
> org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.performUpdate(DBInterfaceHSQLDB.java:296)
>  at
> org.apache.manifoldcf.core.database.BaseTable.performUpdate(BaseTable.java:80)
>  at
> org.apache.manifoldcf.crawler.jobs.JobQueue.noDocPriorities(JobQueue.java:967)
>  at
> org.apache.manifoldcf.crawler.jobs.JobManager.noDocPriorities(JobManager.java:8148)
>  at
> org.apache.manifoldcf.crawler.jobs.JobManager.finishJobStops(JobManager.java:8123)
>  at
> org.apache.manifoldcf.crawler.system.JobResetThread.run(JobResetThread.java:69)
> Caused by: java.sql.SQLException: java.lang.RuntimeException: Logging
> failed when attempting to log:
> C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem
> 531146 java.lang.RuntimeException: Logging failed when attempting to log:
> C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem
> 531146
>  at org.hsqldb.jdbc.JDBCUtil.sqlException(Unknown Source)
>  at org.hsqldb.jdbc.JDBCUtil.sqlException(Unknown Source)
>  at org.hsqldb.jdbc.JDBCPreparedStatement.fetchResult(Unknown Source)
>  at org.hsqldb.jdbc.JDBCPreparedStatement.executeUpdate(Unknown Source)
>  at org.apache.manifoldcf.core.database.Database.execute(Database.java:903)
>  at
> org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:683)
> Caused by: org.hsqldb.HsqlException: java.lang.RuntimeException: Logging
> failed when attempting to log:
> C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of mem
> 531146
>  at org.hsqldb.error.Error.error(Unknown Source)
>  at org.hsqldb.result.Result.newErrorResult(Unknown Source)
>  at org.hsqldb.StatementDMQL.execute(Unknown Source)
>  at org.hsqldb.Session.executeCompiledStatement(Unknown Source)
>  at org.hsqldb.Session.execute(Unknown Source)
>  ... 4 more
> Caused by: java.lang.RuntimeException: Logging failed when attempting to
> log: C:\apache-manifoldcf-2.0.1\example\.\./dbname.data getFromFile out of
> mem 531146
>  at org.hsqldb.lib.FrameworkLogger.privlog(Unknown Source)
>  at org.hsqldb.lib.FrameworkLogger.severe(Unknown Source)
>  at org.hsqldb.persist.Logger.logSevereEvent(Unknown Source)
>  at org.hsqldb.persist.DataFileCache.logSevereEvent(Unknown Source)
>  at org.hsqldb.persist.DataFileCache.getFromFile(Unknown Source)
>  at org.hsqldb.persist.DataFileCache.get(Unknown Source)
>  at org.hsqldb.persist.RowStoreAVLDisk.get(Unknown Source)
>  at org.hsqldb.index.NodeAVLDisk.findNode(Unknown Source)
>  at org.hsqldb.index.NodeAVLDisk.getRight(Unknown Source)
>  at org.hsqldb.index.IndexAVL.next(Unknown Source)
>  at org.hsqldb.index.IndexAVL.next(Unknown Source)
>  at org.hsqldb.index.IndexAVL$IndexRowIterator.getNextRow(Unknown Source)
>  at org.hsqldb.RangeVariable$RangeIteratorMain.findNext(Unknown Source)
>  at org.hsqldb.RangeVariable$RangeIteratorMain.next(Unknown Source)
>  at org.hsqldb.StatementDML.executeUpdateStatement(Unknown Source)
>  at org.hsqldb.StatementDML.getResult(Unknown Source)
>  ... 7 more
> Caused by: java.lang.reflect.InvocationTargetException
>  at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
>  at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:483)
>  ... 23 more
> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
>  WARN 2015-03-19 18:32:09,167 (Assessment thread) -  Plan:
> isDistinctSelect=[false]
>  WARN 2015-03-19 18:32:09,182 (Assessment thread) -  Plan:
> isGrouped=[false]
>  WARN 2015-03-19 18:32:09,182 (Assessment thread) -  Plan:
> isAggregated=[false]
>  WARN 2015-03-19 18:32:09,182 (Assessment thread) -  Plan: columns=[
> COLUMN: PUBLIC.JOBS.ID not nullable
>  WARN 2015-03-19 18:32:09,182 (Assessment thread) -  Plan:   COLUMN:
> PUBLIC.JOBS.STATUS not nullable
>  WARN 2015-03-19 18:32:09,182 (Assessment thread) -  Plan:   COLUMN:
> PUBLIC.JOBS.CONNECTIONNAME not nullable
>  WARN 2015-03-19 18:32:09,182 (Assessment thread) -  Plan:
>  WARN 2015-03-19 18:32:09,182 (Assessment thread) -  Plan: ]
>  WARN 2015-03-19 18:32:09,182 (Assessment thread) -  Plan: [range variable
> 1
>  WARN 2015-03-19 18:32:09,182 (Assessment thread) -  Plan:   join
> type=INNER
>  WARN 2015-03-19 18:32:09,182 (Assessment thread) -  Plan:   table=JOBS
>  WARN 2015-03-19 18:32:09,182 (Assessment thread) -  Plan:   cardinality=5
>  WARN 2015-03-19 18:32:09,182 (Assessment thread) -  Plan:   access=FULL
> SCAN
>  WARN 2015-03-19 18:32:09,182 (Assessment thread) -  Plan:   join
> condition = [index=SYS_IDX_SYS_PK_10234_10237
>  WARN 2015-03-19 18:32:09,182 (Assessment thread) -  Plan:     other
> condition=[
>  WARN 2015-03-19 18:32:09,182 (Assessment thread) -  Plan:     EQUAL
> arg_left=[     COLUMN: PUBLIC.JOBS.ASSESSMENTSTATE
>  WARN 2015-03-19 18:32:09,182 (Assessment thread) -  Plan: ]
> arg_right=[     DYNAMIC PARAM: , TYPE = CHARACTER
>  WARN 2015-03-19 18:32:09,182 (Assessment thread) -  Plan: ]]
>  WARN 2015-03-19 18:32:09,182 (Assessment thread) -  Plan:   ]
>  WARN 2015-03-19 18:32:09,182 (Assessment thread) -  Plan:   ]]
>  WARN 2015-03-19 18:32:09,182 (Assessment thread) -  Plan: PARAMETERS=[
>  WARN 2015-03-19 18:32:09,182 (Assessment thread) -  Plan: @0[DYNAMIC
> PARAM: , TYPE = CHARACTER
>  WARN 2015-03-19 18:32:09,182 (Assessment thread) -  Plan: ]]
>  WARN 2015-03-19 18:32:09,182 (Assessment thread) -  Plan: SUBQUERIES[]
>  WARN 2015-03-19 18:32:09,182 (Assessment thread) -
> FATAL 2015-03-19 18:32:09,198 (Job notification thread) -
> JobNotificationThread initialization error tossed: GC overhead limit
> exceeded
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> FATAL 2015-03-19 18:32:09,198 (Set priority thread) - SetPriorityThread
> initialization error tossed: GC overhead limit exceeded
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> FATAL 2015-03-19 18:32:09,198 (Finisher thread) - FinisherThread
> initialization error tossed: GC overhead limit exceeded
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> FATAL 2015-03-19 18:32:09,198 (Job delete thread) - JobDeleteThread
> initialization error tossed: GC overhead limit exceeded
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> FATAL 2015-03-19 18:32:09,198 (Seeding thread) - SeedingThread
> initialization error tossed: GC overhead limit exceeded
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>
> >>> Karl Wright <daddywri@gmail.com> 3/19/2015 3:34 PM >>>
> Hi Ian,
>
> ManifoldCF operates under what is known as a "bounded" memory model. That
> means you should always be able to find a memory size that works (and that
> isn't huge).
>
> The only exception to this is Solr indexing that does *not* go via the
> extracting update handler. The standard update handler unfortunately
> *requires* that the entire document fit in memory. If this is what you are
> doing, you must take steps to limit the maximum document size to prevent
> OOMs.
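>
> For illustration (the collection name, document id, and file name here
> are hypothetical, not taken from your setup), the extracting update
> handler is Solr's /update/extract endpoint, which parses the raw file
> on the Solr side rather than in the sender's memory:
>
>   curl "http://localhost:8983/solr/collection1/update/extract?literal.id=doc1&commit=true" \
>     -F "myfile=@file1.pdf"
>
> The standard /update handler, by contrast, takes the extracted content
> inline in the request, which is why the whole document has to fit in
> memory first.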
>
> 160,000 documents is quite small by MCF standards (we do 10 million to 50
> million on some setups). So let's diagnose your problem before taking any
> drastic actions. Can you provide an out-of-memory dump from the log, for
> instance? Can you let us know what deployment model you are using (e.g.
> single-process)?
>
> Thanks,
> Karl
>
>
> On Thu, Mar 19, 2015 at 3:07 PM, Ian Zapczynski <
> Ian.Zapczynski@veritablelp.com> wrote:
>
>>  Hello all. I am using ManifoldCF to index a Windows share containing
>> well over 160,000 files (.xls, .pdf, .doc). I keep getting memory errors
>> when I try to index the whole folder at once and have not been able to
>> resolve this by throwing memory and CPU at Tomcat and the VM, so I thought
>> I'd try this a different way.
>>  What I'd like to do now is break what was a single job up into multiple
>> jobs. Each job should index all indexable files under a parent folder:
>> one job indexing folders whose names begin with the letters A-G, along with
>> all subfolders and files within; another job for H-M, also with all
>> subfolders/files; and so on. My problem is that I can't figure out what
>> expression to use to make it index what I want.
>>  In the Job settings under Paths, I have specified the parent folder,
>> and within there I've tried:
>>  1. Include file(s) or directory(s) matching * (this works, but indexes
>> every file in every folder within the parent, eventually causing me
>> unresolvable GC memory overhead errors)
>> 2. Include file(s) or directory(s) matching ^(?i)[A-G]* (this does not
>> work; it supposedly indexes one file and then quits)
>> 3. Include file(s) or directory(s) matching A* (this does not work; it
>> supposedly indexes one file and then quits, and there are many folders
>> directly under the parent that begin with 'A')
>>  Can anyone confirm what type of expression I should use in the
>> paths to accomplish this?
>>  Or alternatively, if you think I should be able to index 160,000+ files in
>> one job without getting GC memory overhead errors, I'm open to hearing your
>> suggestions on resolving those. All I know to do is increase the maximum
>> memory in Tomcat as well as on the OS, and that didn't help at all.
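>>  (For reference, a sketch of the usual mechanism on a stock Windows
>> install: Tomcat picks up extra JVM options from a bin\setenv.bat file
>> that catalina.bat reads at startup, e.g.
>>  set CATALINA_OPTS=-Xmx2048m
>> where 2048m is just an example value.)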
>>  Thanks much!
>>  -Ian
>>
>
>
