manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Question about ManifoldCF 2.8
Date Wed, 30 Aug 2017 17:53:25 GMT
Hi Beelz,

File-based sync is deprecated because people often have trouble getting
file permissions right and do not know how to shut processes down
cleanly.  ZooKeeper is resilient against both problems, so I highly
recommend using ZooKeeper sync.
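
If you want to rule out the permissions problem first, a quick check along
these lines may help.  This is only a minimal sketch, assuming Python is
available on the crawl machine; the function name and the probe file name
are made up for illustration, and it does not reproduce ManifoldCF's own
locking code.  It simply tries to create, write, and delete a file in a
directory, which is roughly what file-based sync needs to be able to do.

```python
import os
import tempfile


def can_create_lock_file(sync_dir):
    """Try to create, write, and delete a probe file in sync_dir.

    Returns (True, "ok") on success, or (False, <OS error text>) on failure.
    The probe file name is hypothetical, not one ManifoldCF uses.
    """
    probe = os.path.join(sync_dir, "probe-%d.lock" % os.getpid())
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(probe, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        try:
            os.write(fd, b"x")
        finally:
            os.close(fd)
        os.remove(probe)
        return True, "ok"
    except OSError as exc:
        # "Access is denied" or a missing directory ends up here.
        return False, str(exc)


if __name__ == "__main__":
    # Replace with your own synch directory; the temp dir is a placeholder.
    print(can_create_lock_file(tempfile.gettempdir()))
```

Run it as the same user that starts the ManifoldCF processes, since a
check made under a different account says nothing about their permissions.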

ManifoldCF is engineered to not put files into memory so you do not need
huge amounts of memory.  The default values are more than enough for 35,000
files, which is a pretty small job for ManifoldCF.

Thanks,
Karl


On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <i93othman@gmail.com> wrote:

> I'm actually not using ZooKeeper. I would like to know how ZooKeeper sync
> differs from file-based sync. I also need some guidance on how to manage
> my PC's memory. How many GB should I allocate for the start-agents process
> of ManifoldCF? Is 4 GB enough to crawl 35K files?
>
> Othman.
>
> On Wed, 30 Aug 2017 at 16:11, Karl Wright <daddywri@gmail.com> wrote:
>
>> Your disk is not writable for some reason, and that's interfering with
>> ManifoldCF 2.8 locking.
>>
>> I would suggest two things:
>>
>> (1) Use Zookeeper for sync instead of file-based sync.
>> (2) Have a look if you still get failures after that.
>>
>> Thanks,
>> Karl
>>
>>
>> On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <i93othman@gmail.com>
>> wrote:
>>
>>> Hi Mr Karl,
>>>
>>> Thank you for your quick response. I have looked into the ManifoldCF
>>> log file and extracted the following warnings:
>>>
>>> - Attempt to set file lock
>>> 'D:\xxxx\apache_manifoldcf-2.8\multiprocess-file-example\.\.\synch area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES (Lowercase) Synapses.lock'
>>> failed : Access is denied.
>>>
>>>
>>> - Couldn't write to lock file; disk may be full. Shutting down process;
>>> locks may be left dangling. You must cleanup before restarting.
>>>
>>> "ES (lowercase) synapses" is the name of the Elasticsearch output
>>> connection. The job also uses Tika to extract metadata, with a file
>>> system as the repository connection. During the job, I don't extract the
>>> content of the documents. I was wondering whether the issue comes from
>>> Elasticsearch?
>>>
>>> Othman.
>>>
>>>
>>>
>>> On Wed, 30 Aug 2017 at 14:08, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> Hi Othman,
>>>>
>>>> ManifoldCF aborts a job if there's an error that looks like it might go
>>>> away on retry, but does not.  It can be either on the repository side or
>>>> on the output side.  If you look at the Simple History in the UI, or at
>>>> the manifoldcf.log file, you should be able to get a better sense of
>>>> what went wrong.  Without further information, I can't say any more.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>
>>>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I'm Othman Belhaj, a software engineer at Société Générale in
>>>>> France. I'm currently using the recent ManifoldCF 2.8 release to build
>>>>> an internal search engine, with ManifoldCF indexing documents on
>>>>> Windows shares. I encountered a serious problem while crawling 35K
>>>>> documents. Most of the time, when ManifoldCF starts crawling a large
>>>>> document (19 MB, for example), it ends the job with the following
>>>>> error: repeated service interruptions - failure processing document :
>>>>> software caused connection abort: socket write error.
>>>>> Can you give me some tips on how to solve this problem, please?
>>>>>
>>>>> I use PostgreSQL 9.3.x and Elasticsearch 2.1.0.
>>>>> I'm looking forward to your response.
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Othman BELHAJ
>>>>>
>>>>
>>>>
>>
