manifoldcf-user mailing list archives

From Beelz Ryuzaki <i93oth...@gmail.com>
Subject Re: Question about ManifoldCF 2.8
Date Thu, 31 Aug 2017 15:23:44 GMT
I moved back both of the jars you mentioned, and a different error is now
showing. You will find the stack trace attached.

Thanks,
Othman

On Thu, 31 Aug 2017 at 17:09, Karl Wright <daddywri@gmail.com> wrote:

> I've looked at the dependencies; you should not have moved poi-3.15.jar.
> Please move that back, and commons-collections4-4.1.jar too.
>
> You *will* need to move curvesapi-1.04.jar though.
>
> Thanks,
> Karl
>
>
> On Thu, Aug 31, 2017 at 11:04 AM, Karl Wright <daddywri@gmail.com> wrote:
>
>> If you include poi.jar, then all dependencies of poi.jar must also be
>> included.  This would mean that curvesapi-1.04.jar and
>> commons-collections4-4.1.jar should also be included.
>>
>> Karl
>>
>> On Thu, Aug 31, 2017 at 10:23 AM, Beelz Ryuzaki <i93othman@gmail.com>
>> wrote:
>>
>>> Hi Karl,
>>>
>>> I added the two jars that you mentioned and another one:
>>> poi-3.15.jar. Unfortunately, another error is now showing. This time, it
>>> concerns Excel files. You will find the stack trace attached.
>>>
>>> Othman.
>>>
>>> On Thu, 31 Aug 2017 at 15:32, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> Hi Othman,
>>>>
>>>> Yes, this shows that the jar we moved calls back into another jar,
>>>> which will also need to be moved.  *That* jar has yet another dependency
>>>> too.
>>>>
>>>> The list of jars is thus extended to include:
>>>>
>>>> poi-ooxml-3.15.jar
>>>> dom4j-1.6.1.jar
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Thu, Aug 31, 2017 at 9:25 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>>> wrote:
>>>>
>>>>> You will find attached the stack trace. My apologies for the bad
>>>>> quality of the image, I'm doing my best to send you the stack trace as I
>>>>> don't have the right to send documents outside the company.
>>>>>
>>>>> Thank you for your time,
>>>>>
>>>>> Othman
>>>>>
>>>>> On Thu, 31 Aug 2017 at 15:16, Karl Wright <daddywri@gmail.com> wrote:
>>>>>
>>>>>> Once again, I need a stack trace to diagnose what the problem is.
>>>>>>
>>>>>> Thanks,
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 31, 2017 at 9:14 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Oh, actually it didn't solve the problem. I looked into the log file
>>>>>>> and saw the following error:
>>>>>>>
>>>>>>> Error tossed : org/apache/poi/POIXMLTypeLoader
>>>>>>> java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTypeLoader.
>>>>>>>
>>>>>>> Maybe another jar is missing ?
>>>>>>>
>>>>>>> Othman.
>>>>>>>
>>>>>>> On Thu, 31 Aug 2017 at 15:01, Beelz Ryuzaki <i93othman@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I have tried what you told me to do and, as you expected, the
>>>>>>>> crawling resumed. How about the regular expressions? How can I write
>>>>>>>> complex regular expressions in the job's Paths tab?
>>>>>>>>
>>>>>>>> Thank you very much for your help.
>>>>>>>>
>>>>>>>> Othman.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, 31 Aug 2017 at 14:47, Beelz Ryuzaki <i93othman@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Ok, I will try it right away and let you know if it works.
>>>>>>>>>
>>>>>>>>> Othman.
>>>>>>>>>
>>>>>>>>> On Thu, 31 Aug 2017 at 14:15, Karl Wright <daddywri@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Oh, and you also may need to edit your options.env files to
>>>>>>>>>> include them in the classpath for startup.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Aug 31, 2017 at 7:53 AM, Karl Wright <daddywri@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> If you are amenable, there is another workaround you could try.
>>>>>>>>>>> Specifically:
>>>>>>>>>>>
>>>>>>>>>>> (1) Shut down all MCF processes.
>>>>>>>>>>> (2) Move the following two files from connector-common-lib to lib:
>>>>>>>>>>>
>>>>>>>>>>> xmlbeans-2.6.0.jar
>>>>>>>>>>> poi-ooxml-schemas-3.15.jar
>>>>>>>>>>>
>>>>>>>>>>> (3) Restart everything and see if your crawl resumes.
>>>>>>>>>>>
>>>>>>>>>>> Please let me know what happens.
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Aug 31, 2017 at 7:33 AM, Karl Wright <daddywri@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I created a ticket for this: CONNECTORS-1450.
>>>>>>>>>>>>
>>>>>>>>>>>> One simple workaround is to use the external Tika server
>>>>>>>>>>>> transformer rather than the embedded Tika Extractor.  I'm still
>>>>>>>>>>>> looking into why the jar is not being found.
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Aug 31, 2017 at 7:08 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, I'm actually using the latest binary version, and my job
>>>>>>>>>>>>> got stuck on that specific file.
>>>>>>>>>>>>> The job status is still Running. You can see it in the
>>>>>>>>>>>>> attached file. For your information, the job started yesterday.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, 31 Aug 2017 at 13:04, Karl Wright <daddywri@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> It looks like a dependency of Apache POI is missing.
>>>>>>>>>>>>>> I think we will need a ticket to address this, if you are
>>>>>>>>>>>>>> indeed using the binary distribution.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm actually using the binary version. For security reasons,
>>>>>>>>>>>>>>> I can't send any files from my computer. I have copied the
>>>>>>>>>>>>>>> stack trace and scanned it with my cellphone. I hope it will
>>>>>>>>>>>>>>> be helpful. Meanwhile, I have read the documentation about
>>>>>>>>>>>>>>> how to restrict the crawling and I don't think the '|' works
>>>>>>>>>>>>>>> in the specified. For instance, I would like to restrict the
>>>>>>>>>>>>>>> crawling to the documents that contain the word 'sound'. I
>>>>>>>>>>>>>>> proceed as follows: *(SON)* . The document is in capital
>>>>>>>>>>>>>>> letters and I noticed that it didn't take it into
>>>>>>>>>>>>>>> consideration.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, 31 Aug 2017 at 12:40, Karl Wright <daddywri@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The way you restrict documents with the windows share
>>>>>>>>>>>>>>>> connector is by specifying information on the "Paths" tab in
>>>>>>>>>>>>>>>> jobs that crawl windows shares.  There is end-user
>>>>>>>>>>>>>>>> documentation both online and distributed with all binary
>>>>>>>>>>>>>>>> distributions that describe how to do this.
>>>>>>>>>>>>>>>> Have you found it?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki
>>>>>>>>>>>>>>>> <i93othman@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hello Karl,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thank you for your response. I will start using zookeeper
>>>>>>>>>>>>>>>>> and I will let you know if it works. I have another
>>>>>>>>>>>>>>>>> question to ask. Actually, I need to apply some filters
>>>>>>>>>>>>>>>>> while crawling: I don't want to crawl some files and some
>>>>>>>>>>>>>>>>> folders. Could you give me an example of how to use the
>>>>>>>>>>>>>>>>> regex? Does the regex syntax allow /i to ignore case?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 19:53, Karl Wright
>>>>>>>>>>>>>>>>> <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Beelz,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> File-based sync is deprecated because people often have
>>>>>>>>>>>>>>>>>> problems with getting file permissions right, and they do
>>>>>>>>>>>>>>>>>> not understand how to shut processes down cleanly, and
>>>>>>>>>>>>>>>>>> zookeeper is resilient against that.  I highly recommend
>>>>>>>>>>>>>>>>>> using zookeeper sync.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ManifoldCF is engineered to not put files into memory so
>>>>>>>>>>>>>>>>>> you do not need huge amounts of memory.  The default
>>>>>>>>>>>>>>>>>> values are more than enough for 35,000 files, which is a
>>>>>>>>>>>>>>>>>> pretty small job for ManifoldCF.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki
>>>>>>>>>>>>>>>>>> <i93othman@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I'm actually not using zookeeper. I want to know how
>>>>>>>>>>>>>>>>>>> zookeeper is different from file-based sync. I also need
>>>>>>>>>>>>>>>>>>> guidance on how to manage my PC's memory. How many GB
>>>>>>>>>>>>>>>>>>> should I allocate for the start-agent of ManifoldCF? Is
>>>>>>>>>>>>>>>>>>> 4 GB enough to crawl 35K files?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 16:11, Karl Wright
>>>>>>>>>>>>>>>>>>> <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Your disk is not writable for some reason, and that's
>>>>>>>>>>>>>>>>>>>> interfering with ManifoldCF 2.8 locking.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I would suggest two things:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> (1) Use Zookeeper for sync instead of file-based sync.
>>>>>>>>>>>>>>>>>>>> (2) Have a look if you still get failures after that.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki
>>>>>>>>>>>>>>>>>>>> <i93othman@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Mr Karl,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thank you Mr Karl for your quick response. I have
>>>>>>>>>>>>>>>>>>>>> looked into the ManifoldCF log file and extracted the
>>>>>>>>>>>>>>>>>>>>> following warnings:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> - Attempt to set file lock
>>>>>>>>>>>>>>>>>>>>> 'D:\xxxx\apache_manifoldcf-2.8\multiprocess-file-example\.\.\synch
>>>>>>>>>>>>>>>>>>>>> area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES (Lowercase)
>>>>>>>>>>>>>>>>>>>>> Synapses.lock' failed : Access is denied.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> - Couldn't write to lock file; disk may be full.
>>>>>>>>>>>>>>>>>>>>> Shutting down process; locks may be left dangling. You
>>>>>>>>>>>>>>>>>>>>> must cleanup before restarting.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> ES (lowercase) synapses being the elasticsearch output
>>>>>>>>>>>>>>>>>>>>> connection. Moreover, the job uses Tika to extract
>>>>>>>>>>>>>>>>>>>>> metadata and a file system as a repository connection.
>>>>>>>>>>>>>>>>>>>>> During the job, I don't extract the content of the
>>>>>>>>>>>>>>>>>>>>> documents. I was wondering if the issue comes from
>>>>>>>>>>>>>>>>>>>>> elasticsearch?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 14:08, Karl Wright
>>>>>>>>>>>>>>>>>>>>> <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> ManifoldCF aborts a job if there's an error that
>>>>>>>>>>>>>>>>>>>>>> looks like it might go away on retry, but does not.
>>>>>>>>>>>>>>>>>>>>>> It can be either on the repository side or on the
>>>>>>>>>>>>>>>>>>>>>> output side.  If you look at the Simple History in
>>>>>>>>>>>>>>>>>>>>>> the UI, or at the manifoldcf.log file, you should be
>>>>>>>>>>>>>>>>>>>>>> able to get a better sense of what went wrong.
>>>>>>>>>>>>>>>>>>>>>> Without further information, I can't say any more.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki
>>>>>>>>>>>>>>>>>>>>>> <i93othman@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I'm Othman Belhaj, a software engineer from Société
>>>>>>>>>>>>>>>>>>>>>>> Générale in France. I'm actually using your recent
>>>>>>>>>>>>>>>>>>>>>>> version of ManifoldCF, 2.8. I'm working on an
>>>>>>>>>>>>>>>>>>>>>>> internal search engine. For this reason, I'm using
>>>>>>>>>>>>>>>>>>>>>>> ManifoldCF in order to index documents on Windows
>>>>>>>>>>>>>>>>>>>>>>> shares. I encountered a serious problem while
>>>>>>>>>>>>>>>>>>>>>>> crawling 35K documents. Most of the time, when
>>>>>>>>>>>>>>>>>>>>>>> ManifoldCF starts crawling a large document (19 MB,
>>>>>>>>>>>>>>>>>>>>>>> for example), it ends the job with the following
>>>>>>>>>>>>>>>>>>>>>>> error: repeated service interruptions - failure
>>>>>>>>>>>>>>>>>>>>>>> processing document : software caused connection
>>>>>>>>>>>>>>>>>>>>>>> abort: socket write error.
>>>>>>>>>>>>>>>>>>>>>>> Can you give me some tips on how to solve this
>>>>>>>>>>>>>>>>>>>>>>> problem, please?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I use PostgreSQL 9.3.x and elasticsearch 2.1.0.
>>>>>>>>>>>>>>>>>>>>>>> I'm looking forward to your response.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Othman BELHAJ
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>
>>>>
>>
>
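
The workaround discussed in this thread (find the jar that provides a class
named in a NoClassDefFoundError, then move that jar from connector-common-lib
to lib) could be scripted roughly as follows. This is only a minimal Python
sketch and not part of ManifoldCF: the function names jars_providing and
move_jars are hypothetical, the directory names are the ones mentioned in the
thread, and it assumes all MCF processes have been shut down first.

```python
import shutil
import zipfile
from pathlib import Path


def jars_providing(class_name, jar_dir):
    """Return the names of the jars in jar_dir that contain class_name.

    class_name may use the slash form seen in the log
    (org/apache/poi/POIXMLTypeLoader) or the dotted form.
    """
    # A jar is a zip archive; a class is stored as <package path>/<Name>.class
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for jar in sorted(Path(jar_dir).glob("*.jar")):
        with zipfile.ZipFile(jar) as zf:
            if entry in zf.namelist():
                hits.append(jar.name)
    return hits


def move_jars(names, src_dir, dst_dir):
    """Move the named jars from src_dir to dst_dir.

    Jars that are not present in src_dir are skipped.
    Returns the list of names that were actually moved.
    """
    moved = []
    for name in names:
        src = Path(src_dir) / name
        if src.is_file():
            shutil.move(str(src), str(Path(dst_dir) / name))
            moved.append(name)
    return moved
```

For example, jars_providing("org/apache/poi/POIXMLTypeLoader",
"connector-common-lib") lists the candidate jars, and move_jars(candidates,
"connector-common-lib", "lib") relocates them. As noted in the thread, the
options.env files may also need editing so that the moved jars are on the
startup classpath; the sketch does not do that.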
