manifoldcf-user mailing list archives

From Beelz Ryuzaki <i93oth...@gmail.com>
Subject Re: Question about ManifoldCF 2.8
Date Thu, 31 Aug 2017 13:01:56 GMT
I have tried what you told me to do and, as you expected, the crawling resumed.
How about the regular expressions? How can I write complex regular
expressions in the job's Paths tab?

Thank you very much for your help.

Othman.
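
On the /i question raised further down the thread: a trailing /i is Perl/JavaScript syntax. In Java regular expressions (ManifoldCF is written in Java), case-insensitive matching is requested with the inline flag (?i), and Python's re module accepts the same flag. This thread does not confirm whether the Paths tab of a given connector takes full regular expressions or only wildcards, so the sketch below only illustrates the regex semantics. The file names are hypothetical.

```python
import re

# Hypothetical file names, for illustration only; none come from the thread.
names = ["RAPPORT_SON_2017.docx", "son_archive.pdf", "budget.xlsx", "Maison.txt"]

# (?i) at the start makes the whole pattern case-insensitive;
# '|' separates the alternatives inside the group.
insensitive = re.compile(r"(?i).*(son|sound).*")

# The same pattern without the flag only matches lowercase "son"/"sound".
sensitive = re.compile(r".*(son|sound).*")

matched = [n for n in names if insensitive.fullmatch(n)]
strict = [n for n in names if sensitive.fullmatch(n)]

print(matched)  # ['RAPPORT_SON_2017.docx', 'son_archive.pdf', 'Maison.txt']
print(strict)   # ['son_archive.pdf', 'Maison.txt']
```

Note that "Maison.txt" matches too, because the pattern looks for the letters anywhere in the name. A tighter pattern would be needed to match whole words only.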


On Thu, 31 Aug 2017 at 14:47, Beelz Ryuzaki <i93othman@gmail.com> wrote:

> Ok, I will try it right away and let you know if it works.
>
> Othman.
>
> On Thu, 31 Aug 2017 at 14:15, Karl Wright <daddywri@gmail.com> wrote:
>
>> Oh, and you also may need to edit your options.env files to include them
>> in the classpath for startup.
>>
>> Karl
>>
>>
>> On Thu, Aug 31, 2017 at 7:53 AM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> If you are amenable, there is another workaround you could try.
>>> Specifically:
>>>
>>> (1) Shut down all MCF processes.
>>> (2) Move the following two files from connector-common-lib to lib:
>>>
>>> xmlbeans-2.6.0.jar
>>> poi-ooxml-schemas-3.15.jar
>>>
>>> (3) Restart everything and see if your crawl resumes.
>>>
>>> Please let me know what happens.
>>>
>>> Karl
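
Step (2) above could be scripted roughly as follows. This is a minimal sketch, assuming both directories sit directly under the ManifoldCF install directory. The function name and its `mcf_home` argument are hypothetical, and steps (1) and (3), stopping and restarting the processes, are still done by hand.

```python
import shutil
from pathlib import Path

# The two jar names are taken from the message above.
JARS = ("xmlbeans-2.6.0.jar", "poi-ooxml-schemas-3.15.jar")

def move_jars(mcf_home, jars=JARS):
    """Move each jar from connector-common-lib to lib; return the names moved."""
    src_dir = Path(mcf_home) / "connector-common-lib"
    dst_dir = Path(mcf_home) / "lib"
    moved = []
    for name in jars:
        src = src_dir / name
        if src.is_file():
            shutil.move(str(src), str(dst_dir / name))
            moved.append(name)
        else:
            # Report rather than fail, so a partial layout is easy to spot.
            print(f"not found, skipped: {src}")
    return moved
```

Call it as `move_jars(r"D:\path\to\apache_manifoldcf-2.8")`, with the path replaced by the actual install directory, and only after all MCF processes have been shut down.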
>>>
>>>
>>>
>>> On Thu, Aug 31, 2017 at 7:33 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> I created a ticket for this: CONNECTORS-1450.
>>>>
>>>> One simple workaround is to use the external Tika server transformer
>>>> rather than the embedded Tika Extractor.  I'm still looking into why the
>>>> jar is not being found.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Thu, Aug 31, 2017 at 7:08 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>>> wrote:
>>>>
>>>>> Yes, I'm actually using the latest binary version, and my job got
>>>>> stuck on that specific file.
>>>>> The job status is still Running. You can see it in the attached file.
>>>>> For your information, the job started yesterday.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Othman
>>>>>
>>>>> On Thu, 31 Aug 2017 at 13:04, Karl Wright <daddywri@gmail.com> wrote:
>>>>>
>>>>>> It looks like a dependency of Apache POI is missing.
>>>>>> I think we will need a ticket to address this, if you are indeed
>>>>>> using the binary distribution.
>>>>>>
>>>>>> Thanks!
>>>>>> Karl
>>>>>>
>>>>>> On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I'm actually using the binary version. For security reasons, I can't
>>>>>>> send any files from my computer. I have copied the stack trace and
>>>>>>> scanned it with my cellphone. I hope it will be helpful. Meanwhile, I
>>>>>>> have read the documentation about how to restrict the crawling and I
>>>>>>> don't think the '|' works in the specified. For instance, I would like
>>>>>>> to restrict the crawling to documents that contain the word 'sound'. I
>>>>>>> proceed as follows: *(SON)* . The document is in capital letters and I
>>>>>>> noticed that it was not taken into consideration.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Othman
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, 31 Aug 2017 at 12:40, Karl Wright <daddywri@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Othman,
>>>>>>>>
>>>>>>>> The way you restrict documents with the windows share connector is
>>>>>>>> by specifying information on the "Paths" tab in jobs that crawl windows
>>>>>>>> shares.  There is end-user documentation, both online and distributed
>>>>>>>> with all binary distributions, that describes how to do this.  Have you
>>>>>>>> found it?
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hello Karl,
>>>>>>>>>
>>>>>>>>> Thank you for your response. I will start using ZooKeeper and I
>>>>>>>>> will let you know if it works. I have another question to ask.
>>>>>>>>> I need to apply some filters while crawling: I don't want to crawl
>>>>>>>>> certain files and folders. Could you give me an example of how to
>>>>>>>>> use the regex? Does the regex allow /i to ignore case?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Othman
>>>>>>>>>
>>>>>>>>> On Wed, 30 Aug 2017 at 19:53, Karl Wright <daddywri@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Beelz,
>>>>>>>>>>
>>>>>>>>>> File-based sync is deprecated because people often have problems
>>>>>>>>>> with getting file permissions right, and they do not understand how
>>>>>>>>>> to shut processes down cleanly, and zookeeper is resilient against
>>>>>>>>>> that.  I highly recommend using zookeeper sync.
>>>>>>>>>>
>>>>>>>>>> ManifoldCF is engineered to not put files into memory so you do
>>>>>>>>>> not need huge amounts of memory.  The default values are more than
>>>>>>>>>> enough for 35,000 files, which is a pretty small job for ManifoldCF.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <
>>>>>>>>>> i93othman@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I'm actually not using ZooKeeper. I want to know how ZooKeeper
>>>>>>>>>>> differs from file-based sync. I also need guidance on how to
>>>>>>>>>>> manage my PC's memory. How many GB should I allocate for the
>>>>>>>>>>> start-agent of ManifoldCF? Is 4 GB enough to crawl 35K files?
>>>>>>>>>>>
>>>>>>>>>>> Othman.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 30 Aug 2017 at 16:11, Karl Wright <daddywri@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Your disk is not writable for some reason, and that's
>>>>>>>>>>>> interfering with ManifoldCF 2.8 locking.
>>>>>>>>>>>>
>>>>>>>>>>>> I would suggest two things:
>>>>>>>>>>>>
>>>>>>>>>>>> (1) Use Zookeeper for sync instead of file-based sync.
>>>>>>>>>>>> (2) Have a look if you still get failures after that.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Karl
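
One quick way to check the "disk is not writable" diagnosis is to try creating a file in the synch directory under the same account that runs the ManifoldCF processes. A minimal sketch follows. The helper name is hypothetical, and the directory to test is whatever synch directory the installation is configured with.

```python
import tempfile
from pathlib import Path

def is_writable(directory):
    """Return True if a file can be created (and removed) in `directory`."""
    try:
        # TemporaryFile creates a real file in `directory` and deletes it on close.
        with tempfile.TemporaryFile(dir=Path(directory)):
            return True
    except OSError:
        # Covers permission errors, a missing directory, and a full disk.
        return False
```

A `False` result for the synch directory would be consistent with the "Access is denied" warning in the log, although it does not say which of the possible causes applies.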
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Mr Karl,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you for your quick response. I have looked into the
>>>>>>>>>>>>> ManifoldCF log file and extracted the following warnings:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Attempt to set file lock
>>>>>>>>>>>>> 'D:\xxxx\apache_manifoldcf-2.8\multiprocess-file-example\.\.\synch
>>>>>>>>>>>>> area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES (Lowercase)
>>>>>>>>>>>>> Synapses.lock' failed : Access is denied.
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Couldn't write to lock file; disk may be full. Shutting down
>>>>>>>>>>>>> process; locks may be left dangling. You must cleanup before restarting.
>>>>>>>>>>>>>
>>>>>>>>>>>>> "ES (lowercase) synapses" is the Elasticsearch output
>>>>>>>>>>>>> connection. Moreover, the job uses Tika to extract metadata and a
>>>>>>>>>>>>> file system as a repository connection. During the job, I don't
>>>>>>>>>>>>> extract the content of the documents. I was wondering if the issue
>>>>>>>>>>>>> comes from Elasticsearch?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 14:08, Karl Wright <daddywri@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ManifoldCF aborts a job if there's an error that looks like
>>>>>>>>>>>>>> it might go away on retry, but does not.  It can be either on the
>>>>>>>>>>>>>> repository side or on the output side.  If you look at the Simple
>>>>>>>>>>>>>> History in the UI, or at the manifoldcf.log file, you should be able
>>>>>>>>>>>>>> to get a better sense of what went wrong.  Without further
>>>>>>>>>>>>>> information, I can't say any more.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm Othman Belhaj, a software engineer from Société Générale
>>>>>>>>>>>>>>> in France. I'm using the recent ManifoldCF 2.8 release. I'm
>>>>>>>>>>>>>>> working on an internal search engine, and I use ManifoldCF to
>>>>>>>>>>>>>>> index documents on Windows shares. I encountered a serious
>>>>>>>>>>>>>>> problem while crawling 35K documents. Most of the time, when
>>>>>>>>>>>>>>> ManifoldCF starts crawling a large document (19 MB, for example),
>>>>>>>>>>>>>>> it ends the job with the following error: repeated service
>>>>>>>>>>>>>>> interruptions - failure processing document : software caused
>>>>>>>>>>>>>>> connection abort: socket write error.
>>>>>>>>>>>>>>> Can you give me some tips on how to solve this problem, please?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I use PostgreSQL 9.3.x and Elasticsearch 2.1.0.
>>>>>>>>>>>>>>> I'm looking forward to your response.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Othman BELHAJ
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>>
>>
