manifoldcf-user mailing list archives

From Shashank Raj <shashank.raj2...@gmail.com>
Subject Re: ManifoldCF heap error.
Date Thu, 08 Feb 2018 11:10:41 GMT
Hi Karl,
Sorry for the late reply, but changing the worker threads as well as the
throttling to 2-3 did help, and I am no longer getting this error.

On 18-Jan-2018 9:44 PM, "Steph van Schalkwyk" <steph@remcam.net> wrote:

> Also check if all the files are successfully parsed by Tika.
>
>
>
> *Steph van Schalkwyk*
> Principal, Remcam Search Engines
> +1.314.452.2896    steph@remcam.net   http://remcam.net
> Skype: svanschalkwyk
> http://linkedin.com/in/vanschalkwyk
>
> On Thu, Jan 18, 2018 at 9:55 AM, Karl Wright <daddywri@gmail.com> wrote:
>
>> Oh, also the maximum number of Tika connections should be limited to the
>> number of threads to be sure you're not wasting memory on extra Tika
>> instances (which might be expensive).
>>
>> Karl
>>
>>
>> On Thu, Jan 18, 2018 at 10:52 AM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Hmm, it might be worth asking this question in the Tika user list.
>>> We've not seen this kind of issue before with Tika transformation.
>>>
>>> Also, I think it's worth downloading MCF 2.9.1, which updates the Tika
>>> version to 1.17 from 1.16.  There were issues in 2.9 with incompatibilities
>>> between our Tika version and the Apache POI version.  The 2.9.1 release is
>>> now publicly available, but the web site has not yet been updated, so change
>>> 2.9 to 2.9.1 in the download URL to get the point release.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Thu, Jan 18, 2018 at 10:41 AM, Shashank Raj <
>>> shashank.raj2009@gmail.com> wrote:
>>>
>>>> Hi Karl,
>>>> I changed the number of worker threads to 6, but the problem still
>>>> persists when I use ManifoldCF's Tika. With "null" as the output
>>>> connection, there seems to be no problem. I also tried Solr without the
>>>> Tika transformation connection, and that works fine too.
>>>> But as soon as I switch to ManifoldCF's Tika transformation connection, I
>>>> get the same error. I have tried increasing the heap size as well as
>>>> decreasing the number of workers.
>>>> Also, I have not selected "use extract update handler".
>>>>
>>>> Approx. size of directory to crawl: 200 GB
>>>> Expected future size: 10 TB
>>>> Size of largest file in this directory: 2 GB
>>>>
>>>> NOTE: I am using Tomcat 8.0 to run ManifoldCF, connected to PostgreSQL
>>>> 9.3 with Solr 6.6.
>>>>
>>>> On 18-Jan-2018 6:21 PM, "Karl Wright" <daddywri@gmail.com> wrote:
>>>>
>>>>> Hi Shashank,
>>>>>
>>>>> ManifoldCF's memory consumption is bounded but scales with the number of
>>>>> worker threads you allow.  If you have 100 worker threads and each doc can
>>>>> consume 50 MB, then you need at least 5 GB right there for Solr
>>>>> output.  Tika is also quite expensive memory-wise, so I'd allocate at least
>>>>> 10 GB for ManifoldCF to support the pipeline you have set up.
>>>>>
>>>>> The best way to control memory, therefore, is probably to reduce the
>>>>> number of worker threads.
>>>>>
>>>>> (I assume you are using the combined war here, otherwise Tomcat would
>>>>> not be involved.)
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Thu, Jan 18, 2018 at 6:44 AM, Shashank Raj <
>>>>> shashank.raj2009@gmail.com> wrote:
>>>>>
>>>>>> Hello Karl,
>>>>>> A GC overhead heap error occurs each time and Tomcat shuts down. The
>>>>>> heap allocated is 7 GB (Xmx). Is there any other reason this issue is
>>>>>> coming up? I am using ManifoldCF's Tika.
>>>>>> I have unchecked "Use Update Extract" and set the max doc size to 50 MB.
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>
>>
>
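
The sizing rule Karl gives in the thread (worker threads times per-document size for the Solr output, then more on top for Tika) can be sketched as a quick back-of-the-envelope calculation. This is only an illustration: the function name `estimate_heap_mb` and the `tika_factor` multiplier are assumptions made here, with the factor of 2 chosen simply to match the "5 GB for Solr output, at least 10 GB overall" figures in Karl's message. It is not a ManifoldCF API and it does not predict real Tika memory use, which depends on the documents being parsed.

```python
from typing import Tuple


def estimate_heap_mb(worker_threads: int, max_doc_mb: int,
                     tika_factor: int = 2) -> Tuple[int, int]:
    """Rough heap estimate following the rule of thumb from the thread.

    worker_threads: number of ManifoldCF worker threads
    max_doc_mb:     largest document each thread may hold, in MB
    tika_factor:    hypothetical multiplier covering Tika's extra memory
                    (assumed to be 2, to match the 5 GB -> 10 GB example)
    Returns (solr_output_mb, total_mb).
    """
    # Each worker thread can hold one full document for the Solr output.
    solr_output_mb = worker_threads * max_doc_mb
    # Scale up to leave room for Tika extraction in the same JVM.
    total_mb = solr_output_mb * tika_factor
    return solr_output_mb, total_mb


if __name__ == "__main__":
    # Figures from the thread: 100 worker threads, 50 MB max document size.
    solr_mb, total_mb = estimate_heap_mb(100, 50)
    print(f"solr_output_mb={solr_mb} total_mb={total_mb}")
    # prints "solr_output_mb=5000 total_mb=10000"
```

With 6 worker threads and the same 50 MB limit, the same arithmetic gives 300 MB for the output stage and 600 MB overall, well under the 7 GB heap that was still failing. That is consistent with the suggestions in the thread to look at Tika itself (parse failures, the number of Tika connections, the Tika version) rather than only at the thread count.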
