manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: ManifoldCF heap error.
Date Thu, 18 Jan 2018 15:52:13 GMT
Hmm, it might be worth asking this question in the Tika user list.  We've
not seen this kind of issue before with Tika transformation.

Also, I think it's worth downloading MCF 2.9.1, which updates the Tika
version to 1.17 from 1.16.  There were issues in 2.9 with incompatibilities
between our Tika version and the Apache POI version.  MCF 2.9.1 is now
publicly available, but the web site has not yet been updated, so change 2.9
to 2.9.1 in the download URL to get the point release.

Thanks,
Karl


On Thu, Jan 18, 2018 at 10:41 AM, Shashank Raj <shashank.raj2009@gmail.com>
wrote:

> Hi Karl,
> I changed the number of worker threads to 6, but the problem still persists
> when I use ManifoldCF's Tika. With "null" as the output connection, there
> seems to be no problem. I also tried Solr without the Tika transformation
> connection. That also works fine.
> But as soon as I switch to ManifoldCF's Tika transformation connection, I
> get the same error. I have tried increasing the heap size as well as
> decreasing the number of workers.
> Also, I have not selected "use extract update handler".
>
> Approximate size of directory to crawl: 200 GB
> Expected future size: 10 TB
> Size of largest file in this directory: 2 GB
>
> NOTE: I am using Tomcat 8.0 to run ManifoldCF, connected to PostgreSQL 9.3
> with Solr 6.6.
>
> On 18-Jan-2018 6:21 PM, "Karl Wright" <daddywri@gmail.com> wrote:
>
>> Hi Shashank,
>>
>> ManifoldCF's memory consumption is bounded but scales by the number of
>> worker threads you allow.  If you have 100 worker threads and each doc can
>> consume 50mb then you need to have at least 5gb right there for Solr
>> output.  Tika is also quite expensive memory-wise so I'd allocate at least
>> 10gb for ManifoldCF to support the pipeline you have set up.
>>
>> The best way to control memory, therefore, is probably to reduce the
>> number of worker threads.
>>
>> (I assume you are using the combined war here, otherwise Tomcat would not
>> be involved.)
>>
>> Karl
>>
>>
>> On Thu, Jan 18, 2018 at 6:44 AM, Shashank Raj <shashank.raj2009@gmail.com>
>> wrote:
>>
>>> Hello Karl,
>>> A GC overhead heap error occurs each time and Tomcat shuts down. The heap
>>> allocated is 7 GB (Xmx). Is there any other reason this issue is coming up? I
>>> am using ManifoldCF's Tika.
>>> I have unchecked "Use Update Extract" and set the max doc size to 50 MB.
>>>
>>>
>>>
>>
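
The sizing rule quoted above (worker threads multiplied by the largest
document each thread may hold) can be checked with a little arithmetic. What
follows is a rough sketch only, not part of ManifoldCF: the class and method
names are hypothetical, and it models only the per-thread document buffering
described in the thread, not Tika's own memory use, which would come on top.

```java
// Hypothetical helper, not a ManifoldCF API: it only restates the
// "threads x max document size" rule of thumb from the thread.
public final class HeapEstimate {
    private HeapEstimate() {}

    /** Worst-case megabytes held in flight if every worker thread holds one maximum-size document. */
    static long inFlightMb(int workerThreads, long maxDocMb) {
        if (workerThreads < 0 || maxDocMb < 0) {
            throw new IllegalArgumentException("arguments must be non-negative");
        }
        // multiplyExact throws ArithmeticException on overflow instead of wrapping silently
        return Math.multiplyExact((long) workerThreads, maxDocMb);
    }

    /** Largest worker thread count whose in-flight documents fit within budgetMb. */
    static int maxThreadsFor(long budgetMb, long maxDocMb) {
        if (budgetMb < 0 || maxDocMb <= 0) {
            throw new IllegalArgumentException("budget must be >= 0 and document size > 0");
        }
        return (int) Math.min(Integer.MAX_VALUE, budgetMb / maxDocMb);
    }

    public static void main(String[] args) {
        System.out.println(inFlightMb(100, 50));      // 5000 MB: 100 threads, 50 MB documents
        System.out.println(inFlightMb(6, 50));        // 300 MB: the 6-thread setup
        System.out.println(maxThreadsFor(7168, 50));  // 143 threads within a 7 GB (7168 MB) budget
    }
}
```

By this estimate alone, 6 worker threads with a 50 MB document limit account
for about 300 MB, far below the 7 GB heap, so the figure says nothing about
where the rest of the memory goes once Tika extraction is in the pipeline.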
