manifoldcf-user mailing list archives

From Beelz Ryuzaki <i93oth...@gmail.com>
Subject Re: Question about ManifoldCF 2.8
Date Thu, 31 Aug 2017 15:33:55 GMT
Yes, I added it to the options.env.win file. Should it be the one in the
multiprocess-zk-example directory or the one in multiprocess-file-example?

On Thu, 31 Aug 2017 at 17:30, Karl Wright <daddywri@gmail.com> wrote:

> It's not related at all to elasticsearch.
> Karl
>
>
> On Thu, Aug 31, 2017 at 11:26 AM, Beelz Ryuzaki <i93othman@gmail.com>
> wrote:
>
>> Could it be a problem with the Elasticsearch version? I'm actually using
>> 2.1.0, which is pretty old for this new version of ManifoldCF.
>>
>> Othman.
>>
>> On Thu, 31 Aug 2017 at 17:23, Beelz Ryuzaki <i93othman@gmail.com> wrote:
>>
>>> I moved back both the jars you mentioned and a different error is showing.
>>> You will find the stack trace attached.
>>>
>>> Thanks,
>>> Othman
>>>
>>> On Thu, 31 Aug 2017 at 17:09, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> I've looked at the dependencies; you should not have moved
>>>> poi-3.15.jar.  Please move that back, and commons-collections4-4.1.jar too.
>>>>
>>>> You *will* need to move curvesapi-1.04.jar though.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>
>>>> On Thu, Aug 31, 2017 at 11:04 AM, Karl Wright <daddywri@gmail.com>
>>>> wrote:
>>>>
>>>>> If you include poi.jar, then all dependencies of poi.jar must also be
>>>>> included.  This would mean that curvesapi-1.04.jar and
>>>>> commons-collections4-4.1.jar should also be included.
>>>>>
>>>>> Karl
>>>>>
>>>>> On Thu, Aug 31, 2017 at 10:23 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Karl,
>>>>>>
>>>>>> I added the two jars that you have mentioned and another one:
>>>>>> poi-3.15.jar. Unfortunately, there is another error showing. This
>>>>>> time, it concerns Excel files. You will find the stack trace attached.
>>>>>>
>>>>>> Othman.
>>>>>>
>>>>>> On Thu, 31 Aug 2017 at 15:32, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Othman,
>>>>>>>
>>>>>>> Yes, this shows that the jar we moved calls back into another jar,
>>>>>>> which will also need to be moved.  *That* jar has yet another
>>>>>>> dependency too.
>>>>>>>
>>>>>>> The list of jars is thus extended to include:
>>>>>>>
>>>>>>> poi-ooxml-3.15.jar
>>>>>>> dom4j-1.6.1.jar
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Aug 31, 2017 at 9:25 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> You will find the stack trace attached. My apologies for the bad
>>>>>>>> quality of the image; I'm doing my best to send you the stack trace, as
>>>>>>>> I don't have the right to send documents outside the company.
>>>>>>>>
>>>>>>>> Thank you for your time,
>>>>>>>>
>>>>>>>> Othman
>>>>>>>>
>>>>>>>> On Thu, 31 Aug 2017 at 15:16, Karl Wright <daddywri@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Once again, I need a stack trace to diagnose what the problem is.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Aug 31, 2017 at 9:14 AM, Beelz Ryuzaki <
>>>>>>>>> i93othman@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Oh, actually it didn't solve the problem. I looked into the log
>>>>>>>>>> file and saw the following error:
>>>>>>>>>>
>>>>>>>>>> Error tossed : org/apache/poi/POIXMLTypeLoader
>>>>>>>>>> java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTypeLoader.
>>>>>>>>>>
>>>>>>>>>> Maybe another jar is missing ?
>>>>>>>>>>
>>>>>>>>>> Othman.
>>>>>>>>>>
>>>>>>>>>> On Thu, 31 Aug 2017 at 15:01, Beelz Ryuzaki <i93othman@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I have tried what you told me to do and, as you expected, the
>>>>>>>>>>> crawling resumed. How about the regular expressions? How can I
>>>>>>>>>>> write complex regular expressions in the job's Paths tab?
>>>>>>>>>>>
>>>>>>>>>>> Thank you very much for your help.
>>>>>>>>>>>
>>>>>>>>>>> Othman.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 31 Aug 2017 at 14:47, Beelz Ryuzaki <i93othman@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Ok, I will try it right away and let you know if it works.
>>>>>>>>>>>>
>>>>>>>>>>>> Othman.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, 31 Aug 2017 at 14:15, Karl Wright <daddywri@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Oh, and you also may need to edit your options.env files to
>>>>>>>>>>>>> include them in the classpath for startup.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 7:53 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> If you are amenable, there is another workaround you could
>>>>>>>>>>>>>> try.  Specifically:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (1) Shut down all MCF processes.
>>>>>>>>>>>>>> (2) Move the following two files from connector-common-lib to lib:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> xmlbeans-2.6.0.jar
>>>>>>>>>>>>>> poi-ooxml-schemas-3.15.jar
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (3) Restart everything and see if your crawl resumes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please let me know what happens.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 7:33 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I created a ticket for this: CONNECTORS-1450.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> One simple workaround is to use the external Tika server
>>>>>>>>>>>>>>> transformer rather than the embedded Tika Extractor.  I'm still looking
>>>>>>>>>>>>>>> into why the jar is not being found.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 7:08 AM, Beelz Ryuzaki <i93othman@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes, I'm actually using the latest binary version, and my
>>>>>>>>>>>>>>>> job got stuck on that specific file.
>>>>>>>>>>>>>>>> The job status is still Running. You can see it in the
>>>>>>>>>>>>>>>> attached file. For your information, the job started yesterday.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, 31 Aug 2017 at 13:04, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> It looks like a dependency of Apache POI is missing.
>>>>>>>>>>>>>>>>> I think we will need a ticket to address this, if you are
>>>>>>>>>>>>>>>>> indeed using the binary distribution.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki <i93othman@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'm actually using the binary version. For security
>>>>>>>>>>>>>>>>>> reasons, I can't send any files from my computer. I have copied the stack
>>>>>>>>>>>>>>>>>> trace and scanned it with my cellphone. I hope it will be helpful.
>>>>>>>>>>>>>>>>>> Meanwhile, I have read the documentation about how to restrict the crawling
>>>>>>>>>>>>>>>>>> and I don't think the '|' works in the specified. For instance, I would
>>>>>>>>>>>>>>>>>> like to restrict the crawling to the documents that contain the word
>>>>>>>>>>>>>>>>>> 'sound'. I proceeded as follows: *(SON)* . The document is in capital
>>>>>>>>>>>>>>>>>> letters, and I noticed that it was not taken into consideration.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, 31 Aug 2017 at 12:40, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The way you restrict documents with the windows share
>>>>>>>>>>>>>>>>>>> connector is by specifying information on the "Paths" tab in jobs that
>>>>>>>>>>>>>>>>>>> crawl windows shares.  There is end-user documentation both online and
>>>>>>>>>>>>>>>>>>> distributed with all binary distributions that describes how to do this.
>>>>>>>>>>>>>>>>>>> Have you found it?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <i93othman@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hello Karl,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thank you for your response. I will start using
>>>>>>>>>>>>>>>>>>>> zookeeper and I will let you know if it works. I have another question to
>>>>>>>>>>>>>>>>>>>> ask. Actually, I need to apply some filters while crawling. I don't want to
>>>>>>>>>>>>>>>>>>>> crawl some files and some folders. Could you give me an example of how to
>>>>>>>>>>>>>>>>>>>> use the regex? Does the regex allow using /i to ignore case?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 19:53, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Beelz,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> File-based sync is deprecated because people often
>>>>>>>>>>>>>>>>>>>>> have problems with getting file permissions right, and they do not
>>>>>>>>>>>>>>>>>>>>> understand how to shut processes down cleanly, and zookeeper is resilient
>>>>>>>>>>>>>>>>>>>>> against that.  I highly recommend using zookeeper sync.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> ManifoldCF is engineered to not put files into memory
>>>>>>>>>>>>>>>>>>>>> so you do not need huge amounts of memory.  The default values are more
>>>>>>>>>>>>>>>>>>>>> than enough for 35,000 files, which is a pretty small job for ManifoldCF.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <i93othman@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I'm actually not using zookeeper. I want to know how
>>>>>>>>>>>>>>>>>>>>>> zookeeper is different from file-based sync. I also need guidance on how
>>>>>>>>>>>>>>>>>>>>>> to manage my PC's memory. How many GB should I allocate for the start-agent
>>>>>>>>>>>>>>>>>>>>>> of ManifoldCF? Is 4 GB enough to crawl 35K files?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 16:11, Karl Wright <daddywri@gmail.com>
>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Your disk is not writable for some reason, and that's interfering
>>>>>>>>>>>>>>>>>>>>>>> with ManifoldCF 2.8 locking.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I would suggest two things:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> (1) Use Zookeeper for sync instead of file-based sync.
>>>>>>>>>>>>>>>>>>>>>>> (2) Have a look if you still get failures after that.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hi Mr Karl,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thank you for your quick response. I have looked into the
>>>>>>>>>>>>>>>>>>>>>>>> ManifoldCF log file and extracted the following warnings:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> - Attempt to set file lock
>>>>>>>>>>>>>>>>>>>>>>>> 'D:\xxxx\apache_manifoldcf-2.8\multiprocess-file-example\.\.\synch
>>>>>>>>>>>>>>>>>>>>>>>> area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES (Lowercase)
>>>>>>>>>>>>>>>>>>>>>>>> Synapses.lock' failed : Access is denied.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> - Couldn't write to lock file; disk may be full.
>>>>>>>>>>>>>>>>>>>>>>>> Shutting down process; locks may be left dangling. You must cleanup before
>>>>>>>>>>>>>>>>>>>>>>>> restarting.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> "ES (lowercase) synapses" is the Elasticsearch output connection.
>>>>>>>>>>>>>>>>>>>>>>>> Moreover, the job uses Tika to extract metadata and a file system as
>>>>>>>>>>>>>>>>>>>>>>>> a repository connection. During the job, I don't extract the content
>>>>>>>>>>>>>>>>>>>>>>>> of the documents. I was wondering whether the issue comes from
>>>>>>>>>>>>>>>>>>>>>>>> Elasticsearch?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 14:08, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> ManifoldCF aborts a job if there's an error that
>>>>>>>>>>>>>>>>>>>>>>>>> looks like it might go away on retry, but does not.  It can be either on
>>>>>>>>>>>>>>>>>>>>>>>>> the repository side or on the output side.  If you look at the Simple
>>>>>>>>>>>>>>>>>>>>>>>>> History in the UI, or at the manifoldcf.log file, you should be able to get
>>>>>>>>>>>>>>>>>>>>>>>>> a better sense of what went wrong.  Without further information, I can't
>>>>>>>>>>>>>>>>>>>>>>>>> say any more.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> I'm Othman Belhaj, a software engineer from Société Générale in
>>>>>>>>>>>>>>>>>>>>>>>>>> France. I'm currently using your recent version, ManifoldCF 2.8.
>>>>>>>>>>>>>>>>>>>>>>>>>> I'm working on an internal search engine. For this reason, I'm using
>>>>>>>>>>>>>>>>>>>>>>>>>> ManifoldCF to index documents on Windows shares. I encountered a
>>>>>>>>>>>>>>>>>>>>>>>>>> serious problem while crawling 35K documents. Most of the time, when
>>>>>>>>>>>>>>>>>>>>>>>>>> ManifoldCF starts crawling a large document (19 MB, for example),
>>>>>>>>>>>>>>>>>>>>>>>>>> it ends the job with the following error: repeated service
>>>>>>>>>>>>>>>>>>>>>>>>>> interruptions - failure processing document : software caused connection
>>>>>>>>>>>>>>>>>>>>>>>>>> abort: socket write error.
>>>>>>>>>>>>>>>>>>>>>>>>>> Can you give me some tips on how to solve this problem, please?
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> I use PostgreSQL 9.3.x and Elasticsearch 2.1.0.
>>>>>>>>>>>>>>>>>>>>>>>>>> I'm looking forward to your response.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Othman BELHAJ
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>>
>
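
The jar-moving workaround discussed in the quoted thread could be scripted
roughly as below. This is only a sketch under stated assumptions: the jar
names and the connector-common-lib and lib directory names come from the
thread, while the function name move_jars, the mcf_home argument, and the
dry_run switch are hypothetical. Per the thread, poi-3.15.jar and
commons-collections4-4.1.jar are left where they are, all MCF processes
should be shut down first, and the thread does not confirm that this list
fully resolves the problem.

```python
import shutil
from pathlib import Path

# Jar names as listed in the thread (assumed to match the files on disk).
JARS_TO_MOVE = [
    "xmlbeans-2.6.0.jar",
    "poi-ooxml-schemas-3.15.jar",
    "poi-ooxml-3.15.jar",
    "dom4j-1.6.1.jar",
    "curvesapi-1.04.jar",
]


def move_jars(mcf_home, jars=JARS_TO_MOVE, dry_run=True):
    """Move jars from connector-common-lib to lib under mcf_home.

    Returns (moved, missing). With dry_run=True nothing is changed on
    disk; 'moved' then lists the jars that would be moved.
    """
    src = Path(mcf_home) / "connector-common-lib"
    dst = Path(mcf_home) / "lib"
    moved, missing = [], []
    for name in jars:
        source = src / name
        if not source.is_file():
            # Jar not present in the source directory; report it.
            missing.append(name)
            continue
        if not dry_run:
            dst.mkdir(exist_ok=True)
            shutil.move(str(source), str(dst / name))
        moved.append(name)
    return moved, missing
```

Calling move_jars with the installation directory and the default
dry_run=True only reports what would happen; passing dry_run=False performs
the move. The classpath in the options.env files may still need to be edited
by hand, as noted in the thread.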
