manifoldcf-user mailing list archives

From Shashank Raj <shashank.raj2...@gmail.com>
Subject Re: ManifoldCF heap error.
Date Thu, 18 Jan 2018 15:41:41 GMT
Hi Karl,
I changed the number of worker threads to 6, but the problem persists
when I use ManifoldCF's Tika. With "null" as the output connection,
there is no problem. I also tried Solr without the Tika transformation
connection, and that works fine too.
But as soon as I switch to ManifoldCF's Tika transformation connection, I get
the same error. I have tried increasing the heap size as well as decreasing
the number of workers.
Also, I have not selected "use extract update handler".

Approximate size of the directory to crawl: 200 GB
Expected size in the future: 10 TB
Size of the largest file in this directory: 2 GB

NOTE: I am using Tomcat 8.0 to run ManifoldCF, connected to PostgreSQL 9.3,
with Solr 6.6.
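
The sizing rule in Karl's quoted reply below (worker threads times the memory
one document can consume) can be sketched as simple arithmetic. This is only an
illustration: the function name is hypothetical, the numbers come from this
thread, and Tika's own memory use is not modeled here.

```python
def estimate_output_heap_mb(worker_threads: int, max_doc_mb: int) -> int:
    """Rough worst case, assuming every worker thread holds one
    maximum-size document in memory at the same time.
    (Hypothetical helper, not part of ManifoldCF.)"""
    return worker_threads * max_doc_mb


# Karl's example: 100 worker threads, 50 MB per document.
karl_example_mb = estimate_output_heap_mb(100, 50)

# The setup described in this message: 6 worker threads, 50 MB max doc size.
current_setup_mb = estimate_output_heap_mb(6, 50)

print(f"100 threads x 50 MB = {karl_example_mb} MB")
print(f"  6 threads x 50 MB = {current_setup_mb} MB")
```

By this rule alone, 6 threads at 50 MB come to about 300 MB, well under the
7 GB heap, so the estimate covers only the document buffering side and says
nothing about what Tika itself needs.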

On 18-Jan-2018 6:21 PM, "Karl Wright" <daddywri@gmail.com> wrote:

> Hi Shashank,
>
> ManifoldCF's memory consumption is bounded but scales by the number of
> worker threads you allow.  If you have 100 worker threads and each doc can
> consume 50mb then you need to have at least 5gb right there for Solr
> output.  Tika is also quite expensive memory-wise so I'd allocate at least
> 10gb for ManifoldCF to support the pipeline you have set up.
>
> The best way to control memory, therefore, is probably to reduce the
> number of worker threads.
>
> (I assume you are using the combined war here, otherwise Tomcat would not
> be involved.)
>
> Karl
>
>
> On Thu, Jan 18, 2018 at 6:44 AM, Shashank Raj <shashank.raj2009@gmail.com>
> wrote:
>
>> Hello Karl,
>> A "GC overhead" heap error occurs each time and Tomcat shuts down. The heap
>> allocated is 7 GB (Xmx). Is there any other reason this issue is coming up?
>> I am using ManifoldCF's Tika.
>> I have unchecked "Use Update Extract" and set the max doc size to 50 MB.
>>
>>
>>
>
