manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Question about ManifoldCF 2.8
Date Thu, 31 Aug 2017 11:53:27 GMT
If you are amenable, there is another workaround you could try.
Specifically:

(1) Shut down all MCF processes.
(2) Move the following two files from connector-common-lib to lib:

xmlbeans-2.6.0.jar
poi-ooxml-schemas-3.15.jar

(3) Restart everything and see if your crawl resumes.
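
For anyone who prefers to script step (2), here is a minimal Python sketch. It assumes that all MCF processes are already stopped and that both directories sit directly under the installation directory you pass in; the function name move_jars and the install_dir argument are illustrative and not part of ManifoldCF.

```python
import shutil
from pathlib import Path

# Jar names taken from step (2) of the message.
JARS = ["xmlbeans-2.6.0.jar", "poi-ooxml-schemas-3.15.jar"]


def move_jars(install_dir, jars=JARS):
    """Move each jar from connector-common-lib to lib under install_dir.

    Returns the names of the jars that were moved. Jars that are not
    present in connector-common-lib are skipped. install_dir is an
    assumption: point it at your own ManifoldCF installation.
    """
    src = Path(install_dir) / "connector-common-lib"
    dst = Path(install_dir) / "lib"
    if not dst.is_dir():
        raise FileNotFoundError(f"target directory not found: {dst}")
    moved = []
    for name in jars:
        source_file = src / name
        if source_file.is_file():
            shutil.move(str(source_file), str(dst / name))
            moved.append(name)
    return moved
```

Call move_jars with the path of your own installation, then restart everything as in step (3). If the returned list is shorter than two entries, one of the jars was not where the sketch expected it.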

Please let me know what happens.

Karl



On Thu, Aug 31, 2017 at 7:33 AM, Karl Wright <daddywri@gmail.com> wrote:

> I created a ticket for this: CONNECTORS-1450.
>
> One simple workaround is to use the external Tika server transformer
> rather than the embedded Tika Extractor.  I'm still looking into why the
> jar is not being found.
>
> Karl
>
>
> On Thu, Aug 31, 2017 at 7:08 AM, Beelz Ryuzaki <i93othman@gmail.com>
> wrote:
>
>> Yes, I'm actually using the latest binary version, and my job got stuck
>> on that specific file.
>> The job status is still Running. You can see it in the attached file. For
>> your information, the job started yesterday.
>>
>> Thanks,
>>
>> Othman
>>
>> On Thu, 31 Aug 2017 at 13:04, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> It looks like a dependency of Apache POI is missing.
>>> I think we will need a ticket to address this, if you are indeed using
>>> the binary distribution.
>>>
>>> Thanks!
>>> Karl
>>>
>>> On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>> wrote:
>>>
>>>> I'm actually using the binary version. For security reasons, I can't
>>>> send any files from my computer. I have copied the stack trace and
>>>> scanned it with my cellphone. I hope it will be helpful. Meanwhile, I
>>>> have read the documentation about how to restrict the crawl, and I don't
>>>> think the '|' works as specified. For instance, I would like to restrict
>>>> the crawl to the documents that contain the word 'sound'. I proceed as
>>>> follows: *(SON)*. The document is in capital letters, and I noticed that
>>>> it was not taken into consideration.
>>>>
>>>> Thanks,
>>>> Othman
>>>>
>>>>
>>>>
>>>> On Thu, 31 Aug 2017 at 12:40, Karl Wright <daddywri@gmail.com> wrote:
>>>>
>>>>> Hi Othman,
>>>>>
>>>>> The way you restrict documents with the windows share connector is by
>>>>> specifying information on the "Paths" tab in jobs that crawl windows
>>>>> shares.  There is end-user documentation, both online and distributed
>>>>> with all binary distributions, that describes how to do this.  Have you
>>>>> found it?
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello Karl,
>>>>>>
>>>>>> Thank you for your response. I will start using zookeeper and I will
>>>>>> let you know if it works. I have another question. I need to apply
>>>>>> some filters while crawling: I don't want to crawl some files and some
>>>>>> folders. Could you give me an example of how to use the regex? Does
>>>>>> the regex allow /i to ignore case?
>>>>>>
>>>>>> Thanks,
>>>>>> Othman
>>>>>>
>>>>>> On Wed, 30 Aug 2017 at 19:53, Karl Wright <daddywri@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Beelz,
>>>>>>>
>>>>>>> File-based sync is deprecated because people often have problems
>>>>>>> with getting file permissions right, and they do not understand how
>>>>>>> to shut processes down cleanly, and zookeeper is resilient against
>>>>>>> that.  I highly recommend using zookeeper sync.
>>>>>>>
>>>>>>> ManifoldCF is engineered to not put files into memory so you do not
>>>>>>> need huge amounts of memory.  The default values are more than enough
>>>>>>> for 35,000 files, which is a pretty small job for ManifoldCF.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki
>>>>>>> <i93othman@gmail.com> wrote:
>>>>>>>
>>>>>>>> I'm actually not using zookeeper. I would like to know how zookeeper
>>>>>>>> differs from file-based sync. I also need guidance on how to manage
>>>>>>>> my PC's memory. How many GB should I allocate for the start-agent of
>>>>>>>> ManifoldCF? Is 4 GB enough to crawl 35K files?
>>>>>>>>
>>>>>>>> Othman.
>>>>>>>>
>>>>>>>> On Wed, 30 Aug 2017 at 16:11, Karl Wright <daddywri@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Your disk is not writable for some reason, and that's interfering
>>>>>>>>> with ManifoldCF 2.8 locking.
>>>>>>>>>
>>>>>>>>> I would suggest two things:
>>>>>>>>>
>>>>>>>>> (1) Use Zookeeper for sync instead of file-based sync.
>>>>>>>>> (2) Have a look if you still get failures after that.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <
>>>>>>>>> i93othman@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Mr Karl,
>>>>>>>>>>
>>>>>>>>>> Thank you Mr Karl for your quick response. I have looked into the
>>>>>>>>>> ManifoldCF log file and extracted the following warnings:
>>>>>>>>>>
>>>>>>>>>> - Attempt to set file lock 'D:\xxxx\apache_manifoldcf-2.8
>>>>>>>>>> \multiprocess-file-example\.\.\synch
>>>>>>>>>> area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES (Lowercase)
>>>>>>>>>> Synapses.lock' failed : Access is denied.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> - Couldn't write to lock file; disk may be full. Shutting down
>>>>>>>>>> process; locks may be left dangling. You must cleanup before
>>>>>>>>>> restarting.
>>>>>>>>>>
>>>>>>>>>> ES (lowercase) synapses is the elasticsearch output connection.
>>>>>>>>>> Moreover, the job uses Tika to extract metadata and a file system
>>>>>>>>>> as a repository connection. During the job, I don't extract the
>>>>>>>>>> content of the documents. I was wondering whether the issue comes
>>>>>>>>>> from elasticsearch?
>>>>>>>>>>
>>>>>>>>>> Othman.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, 30 Aug 2017 at 14:08, Karl Wright <daddywri@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>
>>>>>>>>>>> ManifoldCF aborts a job if there's an error that looks like it
>>>>>>>>>>> might go away on retry, but does not.  It can be either on the
>>>>>>>>>>> repository side or on the output side.  If you look at the Simple
>>>>>>>>>>> History in the UI, or at the manifoldcf.log file, you should be
>>>>>>>>>>> able to get a better sense of what went wrong.  Without further
>>>>>>>>>>> information, I can't say any more.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki
>>>>>>>>>>> <i93othman@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello,
>>>>>>>>>>>>
>>>>>>>>>>>> I'm Othman Belhaj, a software engineer from Société Générale in
>>>>>>>>>>>> France. I'm currently using your recent version, ManifoldCF 2.8.
>>>>>>>>>>>> I'm working on an internal search engine. For this reason, I'm
>>>>>>>>>>>> using ManifoldCF to index documents on windows shares. I
>>>>>>>>>>>> encountered a serious problem while crawling 35K documents. Most
>>>>>>>>>>>> of the time, when ManifoldCF starts crawling a large document
>>>>>>>>>>>> (19 MB, for example), it ends the job with the following error:
>>>>>>>>>>>> repeated service interruptions - failure processing document :
>>>>>>>>>>>> software caused connection abort: socket write error.
>>>>>>>>>>>> Can you give me some tips on how to solve this problem, please?
>>>>>>>>>>>>
>>>>>>>>>>>> I use PostgreSQL 9.3.x and elasticsearch 2.1.0.
>>>>>>>>>>>> I'm looking forward to your response.
>>>>>>>>>>>>
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>
>>>>>>>>>>>> Othman BELHAJ
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>
>
