manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Question about ManifoldCF 2.8
Date Thu, 31 Aug 2017 11:33:46 GMT
I created a ticket for this: CONNECTORS-1450.

One simple workaround is to use the external Tika server transformer rather
than the embedded Tika Extractor.  I'm still looking into why the jar is
not being found.

Karl


On Thu, Aug 31, 2017 at 7:08 AM, Beelz Ryuzaki <i93othman@gmail.com> wrote:

> Yes, I'm actually using the latest binary version, and my job got stuck on
> that specific file.
> The job status is still Running. You can see it in the attached file. For
> your information, the job started yesterday.
>
> Thanks,
>
> Othman
>
> On Thu, 31 Aug 2017 at 13:04, Karl Wright <daddywri@gmail.com> wrote:
>
>> It looks like a dependency of Apache POI is missing.
>> I think we will need a ticket to address this, if you are indeed using
>> the binary distribution.
>>
>> Thanks!
>> Karl
>>
>> On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki <i93othman@gmail.com>
>> wrote:
>>
>>> I'm actually using the binary version. For security reasons, I can't
>>> send any files from my computer, so I copied the stack trace and scanned
>>> it with my cellphone. I hope it will be helpful. Meanwhile, I have read the
>>> documentation about how to restrict the crawl, and I don't think '|'
>>> works in the specification. For instance, I would like to restrict the
>>> crawl to the documents that contain the word 'sound'. I entered
>>> *(SON)*. The document is in capital letters, and I noticed that it was not
>>> taken into consideration.
>>>
>>> Thanks,
>>> Othman
>>>
>>>
>>>
>>> On Thu, 31 Aug 2017 at 12:40, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> Hi Othman,
>>>>
>>>> The way you restrict documents with the windows share connector is by
>>>> specifying information on the "Paths" tab in jobs that crawl windows
>>>> shares.  There is end-user documentation both online and distributed with
>>>> all binary distributions that describe how to do this.  Have you found it?
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello Karl,
>>>>>
>>>>> Thank you for your response. I will start using zookeeper and I will
>>>>> let you know if it works. I have another question: I need to apply
>>>>> some filters while crawling, because I don't want to crawl certain files
>>>>> and folders. Could you give me an example of how to use the regex? Does
>>>>> the regex syntax allow /i to ignore case?
>>>>>
>>>>> Thanks,
>>>>> Othman
>>>>>
>>>>> On Wed, 30 Aug 2017 at 19:53, Karl Wright <daddywri@gmail.com> wrote:
>>>>>
>>>>>> Hi Beelz,
>>>>>>
>>>>>> File-based sync is deprecated because people often have problems
>>>>>> getting file permissions right, and they do not understand how to shut
>>>>>> processes down cleanly, and zookeeper is resilient against that.  I
>>>>>> highly recommend using zookeeper sync.
>>>>>>
>>>>>> ManifoldCF is engineered to not put files into memory so you do not
>>>>>> need huge amounts of memory.  The default values are more than enough
>>>>>> for 35,000 files, which is a pretty small job for ManifoldCF.
>>>>>>
>>>>>> Thanks,
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I'm actually not using zookeeper. I want to know how zookeeper is
>>>>>>> different from file-based sync. I also need guidance on how to manage
>>>>>>> my PC's memory. How many GB should I allocate to ManifoldCF's
>>>>>>> start-agent? Is 4 GB enough to crawl 35K files?
>>>>>>>
>>>>>>> Othman.
>>>>>>>
>>>>>>> On Wed, 30 Aug 2017 at 16:11, Karl Wright <daddywri@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Your disk is not writable for some reason, and that's interfering
>>>>>>>> with ManifoldCF 2.8 locking.
>>>>>>>>
>>>>>>>> I would suggest two things:
>>>>>>>>
>>>>>>>> (1) Use Zookeeper for sync instead of file-based sync.
>>>>>>>> (2) Have a look if you still get failures after that.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Mr Karl,
>>>>>>>>>
>>>>>>>>> Thank you, Mr Karl, for your quick response. I have looked into the
>>>>>>>>> ManifoldCF log file and extracted the following warnings:
>>>>>>>>>
>>>>>>>>> - Attempt to set file lock 'D:\xxxx\apache_manifoldcf-2.
>>>>>>>>> 8\multiprocess-file-example\.\.\synch
>>>>>>>>> area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES (Lowercase)
>>>>>>>>> Synapses.lock' failed : Access is denied.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> - Couldn't write to lock file; disk may be full. Shutting down
>>>>>>>>> process; locks may be left dangling. You must cleanup before restarting.
>>>>>>>>>
>>>>>>>>> "ES (lowercase) synapses" is the Elasticsearch output connection.
>>>>>>>>> Moreover, the job uses Tika to extract metadata and a file system as
>>>>>>>>> the repository connection. During the job, I don't extract the content
>>>>>>>>> of the documents. I was wondering whether the issue comes from
>>>>>>>>> Elasticsearch?
>>>>>>>>>
>>>>>>>>> Othman.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, 30 Aug 2017 at 14:08, Karl Wright <daddywri@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Othman,
>>>>>>>>>>
>>>>>>>>>> ManifoldCF aborts a job if there's an error that looks like it
>>>>>>>>>> might go away on retry, but does not.  It can be either on the
>>>>>>>>>> repository side or on the output side.  If you look at the Simple
>>>>>>>>>> History in the UI, or at the manifoldcf.log file, you should be able
>>>>>>>>>> to get a better sense of what went wrong.  Without further
>>>>>>>>>> information, I can't say any more.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <
>>>>>>>>>> i93othman@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> I'm Othman Belhaj, a software engineer at Société Générale in
>>>>>>>>>>> France. I'm using the recent ManifoldCF 2.8 release to build an
>>>>>>>>>>> internal search engine, and I use ManifoldCF to index documents on
>>>>>>>>>>> Windows shares. I encountered a serious problem while crawling 35K
>>>>>>>>>>> documents. Most of the time, when ManifoldCF starts crawling a large
>>>>>>>>>>> document (19 MB, for example), it ends the job with the following
>>>>>>>>>>> error: repeated service interruptions - failure processing document :
>>>>>>>>>>> software caused connection abort: socket write error.
>>>>>>>>>>> Can you give me some tips on how to solve this problem, please?
>>>>>>>>>>>
>>>>>>>>>>> I use PostgreSQL 9.3.x and Elasticsearch 2.1.0.
>>>>>>>>>>> I'm looking forward to your response.
>>>>>>>>>>>
>>>>>>>>>>> Best regards,
>>>>>>>>>>>
>>>>>>>>>>> Othman BELHAJ
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>
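
Editor's note on the "Access is denied" lock warning quoted in the thread: a quick way to rule out a plain permissions problem is to check whether the account running the agents process can create and lock a file in the sync directory. The sketch below does only that. The class name SyncDirProbe and the method canLockIn are hypothetical, and this is not how ManifoldCF itself takes its locks; it is just a standalone check using standard java.nio calls.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class SyncDirProbe {
    /**
     * Returns true if the current user can create, lock and delete a file in dir.
     * Hypothetical helper for diagnosis only; not part of ManifoldCF.
     */
    public static boolean canLockIn(Path dir) {
        Path probe = dir.resolve("probe-" + System.nanoTime() + ".lock");
        try (FileChannel channel = FileChannel.open(probe,
                StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE)) {
            // tryLock() returns null when another process already holds the lock.
            FileLock lock = channel.tryLock();
            if (lock == null) {
                return false;
            }
            lock.release();
            return true;
        } catch (IOException e) {
            // Covers a missing directory and access-denied failures.
            return false;
        } finally {
            // The channel is already closed here, so the probe file can be removed.
            try {
                Files.deleteIfExists(probe);
            } catch (IOException ignored) {
                // Nothing useful to do if cleanup fails.
            }
        }
    }

    public static void main(String[] args) {
        // Pass the sync directory as the first argument; defaults to the current directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(dir.toAbsolutePath() + " lockable: " + canLockIn(dir));
    }
}
```

Run it as the same user that starts the ManifoldCF processes. If it reports false for the sync directory, the problem is on the file system side rather than in Elasticsearch.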

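Editor's note on the "/i" question quoted in the thread: ManifoldCF is written in Java, and Java's java.util.regex has no trailing /i modifier. Case-insensitive matching is requested with the inline flag (?i) or with Pattern.CASE_INSENSITIVE. The thread does not confirm whether the "Paths" tab accepts regular expressions at all, so the sketch below only illustrates the Java syntax, including '|' alternation. The class name and file names are hypothetical.

```java
import java.util.regex.Pattern;

public class CaseInsensitiveFilter {
    // (?i) turns on case-insensitive matching for the whole pattern.
    // The alternation (son|sound) accepts either word anywhere in the name.
    private static final Pattern SOUND = Pattern.compile("(?i).*(son|sound).*");

    /** Returns true when the name contains "son" or "sound" in any letter case. */
    public static boolean matches(String fileName) {
        return SOUND.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        // Hypothetical file names, used only to exercise the pattern.
        System.out.println(matches("Rapport_SON_2017.docx"));
        System.out.println(matches("notes.txt"));
    }
}
```

By default (?i) folds case for US-ASCII letters only. Adding the u flag, as in (?iu), extends it to Unicode, which matters for accented file names.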