From: Karl Wright
Date: Thu, 31 Aug 2017 11:45:49 -0400
Subject: Re: Question about ManifoldCF 2.8
To: "user@manifoldcf.apache.org"

Are you using the zookeeper example, or the file-based example?

If these jars have all been moved, and the options.env includes them, then I have to conclude that Apache POI's pom.xml is incorrect too. It will take a while to figure out what's missing that poi-ooxml.jar needs that is not listed.

Karl

On Thu, Aug 31, 2017 at 11:39 AM, Beelz Ryuzaki wrote:

> All the dependencies you mentioned have already been added in the options.env.win file in the multiprocess-file-example directory.
>
> On Thu, 31 Aug 2017 at 17:33, Beelz Ryuzaki wrote:
>
>> Yes, I added it in the options.env.win file. Should it be the one in the multiprocess-zk-example directory or multiprocess-file-example?
>>
>> On Thu, 31 Aug 2017 at 17:30, Karl Wright wrote:
>>
>>> It's not related at all to elasticsearch.
>>> Karl
>>>
>>> On Thu, Aug 31, 2017 at 11:26 AM, Beelz Ryuzaki wrote:
>>>
>>>> Could it be a problem with elasticsearch's version? I'm actually using 2.1.0, which is pretty old for this new version of ManifoldCF.
>>>>
>>>> Othman.
>>>>
>>>> On Thu, 31 Aug 2017 at 17:23, Beelz Ryuzaki wrote:
>>>>
>>>>> I moved back both the jars you mentioned and a different error is showing. You will find the stack trace attached.
>>>>>
>>>>> Thanks,
>>>>> Othman
>>>>>
>>>>> On Thu, 31 Aug 2017 at 17:09, Karl Wright wrote:
>>>>>
>>>>>> I've looked at the dependencies; you should not have moved poi-3.15.jar. Please move that back, and commons-collections4-4.1.jar too.
>>>>>>
>>>>>> You *will* need to move curvesapi-1.04.jar though.
>>>>>>
>>>>>> Thanks,
>>>>>> Karl
>>>>>>
>>>>>> On Thu, Aug 31, 2017 at 11:04 AM, Karl Wright wrote:
>>>>>>
>>>>>>> If you include poi.jar, then all dependencies of poi.jar must also be included. This would mean that curvesapi-1.04.jar and commons-collections4-4.1.jar should also be included.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>> On Thu, Aug 31, 2017 at 10:23 AM, Beelz Ryuzaki wrote:
>>>>>>>
>>>>>>>> Hi Karl,
>>>>>>>>
>>>>>>>> I added the two jars that you mentioned and another one: poi-3.15.jar. Unfortunately, there is another error showing. This time, it concerns Excel files. You will find the stack trace attached.
>>>>>>>>
>>>>>>>> Othman.
>>>>>>>>
>>>>>>>> On Thu, 31 Aug 2017 at 15:32, Karl Wright wrote:
>>>>>>>>
>>>>>>>>> Hi Othman,
>>>>>>>>>
>>>>>>>>> Yes, this shows that the jar we moved calls back into another jar, which will also need to be moved. *That* jar has yet another dependency too.
>>>>>>>>>
>>>>>>>>> The list of jars is thus extended to include:
>>>>>>>>>
>>>>>>>>> poi-ooxml-3.15.jar
>>>>>>>>> dom4j-1.6.1.jar
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>> On Thu, Aug 31, 2017 at 9:25 AM, Beelz Ryuzaki <i93othman@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> You will find the stack trace attached. My apologies for the bad quality of the image; I'm doing my best to send you the stack trace, as I don't have the right to send documents outside the company.
>>>>>>>>>>
>>>>>>>>>> Thank you for your time,
>>>>>>>>>>
>>>>>>>>>> Othman
>>>>>>>>>>
>>>>>>>>>> On Thu, 31 Aug 2017 at 15:16, Karl Wright wrote:
>>>>>>>>>>
>>>>>>>>>>> Once again, I need a stack trace to diagnose what the problem is.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Aug 31, 2017 at 9:14 AM, Beelz Ryuzaki <i93othman@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Oh, actually it didn't solve the problem. I looked into the log file and saw the following error:
>>>>>>>>>>>>
>>>>>>>>>>>> Error tossed : org/apache/poi/POIXMLTypeLoader
>>>>>>>>>>>> java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTypeLoader.
>>>>>>>>>>>>
>>>>>>>>>>>> Maybe another jar is missing?
>>>>>>>>>>>>
>>>>>>>>>>>> Othman.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, 31 Aug 2017 at 15:01, Beelz Ryuzaki <i93othman@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I have tried what you told me to do and, as you expected, the crawling resumed. How about the regular expressions? How can I write complex regular expressions in the job's Paths tab?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you very much for your help.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, 31 Aug 2017 at 14:47, Beelz Ryuzaki <i93othman@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ok, I will try it right away and let you know if it works.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, 31 Aug 2017 at 14:15, Karl Wright wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Oh, and you also may need to edit your options.env files to include them in the classpath for startup.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 7:53 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If you are amenable, there is another workaround you could try. Specifically:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (1) Shut down all MCF processes.
>>>>>>>>>>>>>>>> (2) Move the following two files from connector-common-lib to lib:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> xmlbeans-2.6.0.jar
>>>>>>>>>>>>>>>> poi-ooxml-schemas-3.15.jar
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (3) Restart everything and see if your crawl resumes.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Please let me know what happens.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 7:33 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I created a ticket for this: CONNECTORS-1450.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> One simple workaround is to use the external Tika server transformer rather than the embedded Tika Extractor. I'm still looking into why the jar is not being found.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 7:08 AM, Beelz Ryuzaki <i93othman@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Yes, I'm actually using the latest binary version, and my job got stuck on that specific file.
>>>>>>>>>>>>>>>>>> The job status is still Running. You can see it in the attached file. For your information, the job started yesterday.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, 31 Aug 2017 at 13:04, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> It looks like a dependency of Apache POI is missing.
>>>>>>>>>>>>>>>>>>> I think we will need a ticket to address this, if you are indeed using the binary distribution.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki <i93othman@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I'm actually using the binary version. For security reasons, I can't send any files from my computer. I have copied the stack trace and scanned it with my cellphone. I hope it will be helpful. Meanwhile, I have read the documentation about how to restrict the crawling and I don't think the '|' works in the specification. For instance, I would like to restrict the crawling to the documents that contain the word 'sound'. I proceed as follows: *(SON)* . The document is in capital letters and I noticed that it was not taken into consideration.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Thu, 31 Aug 2017 at 12:40, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The way you restrict documents with the windows share connector is by specifying information on the "Paths" tab in jobs that crawl windows shares. There is end-user documentation both online and distributed with all binary distributions that describes how to do this. Have you found it?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <i93othman@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hello Karl,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thank you for your response. I will start using zookeeper and I will let you know if it works. I have another question to ask. Actually, I need to apply some filters while crawling. I don't want to crawl some files and some folders. Could you give me an example of how to use the regex? Does the regex allow /i to ignore case?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 19:53, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi Beelz,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> File-based sync is deprecated because people often have problems with getting file permissions right, and they do not understand how to shut processes down cleanly, and zookeeper is resilient against that. I highly recommend using zookeeper sync.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> ManifoldCF is engineered to not put files into memory so you do not need huge amounts of memory. The default values are more than enough for 35,000 files, which is a pretty small job for ManifoldCF.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <i93othman@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I'm actually not using zookeeper. I want to know how zookeeper is different from file-based sync. I also need guidance on how to manage my PC's memory. How many GB should I allocate for the start-agent of ManifoldCF? Is 4 GB enough to crawl 35K files?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 16:11, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Your disk is not writable for some reason, and that's interfering with ManifoldCF 2.8 locking.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I would suggest two things:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> (1) Use Zookeeper for sync instead of file-based sync.
>>>>>>>>>>>>>>>>>>>>>>>>> (2) Have a look if you still get failures after that.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <i93othman@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Mr Karl,
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Thank you Mr Karl for your quick response.
>>>>>>>>>>>>>>>>>>>>>>>>>> I have looked into the ManifoldCF log file and extracted the following warnings:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> - Attempt to set file lock 'D:\xxxx\apache_manifoldcf-2.8\multiprocess-file-example\.\.\synch area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES (Lowercase) Synapses.lock' failed : Access is denied.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> - Couldn't write to lock file; disk may be full. Shutting down process; locks may be left dangling. You must cleanup before restarting.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> "ES (lowercase) synapses" is the elasticsearch output connection. Moreover, the job uses Tika to extract metadata and a file system as a repository connection. During the job, I don't extract the content of the documents. I was wondering if the issue comes from elasticsearch?
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 14:08, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> ManifoldCF aborts a job if there's an error that looks like it might go away on retry, but does not. It can be either on the repository side or on the output side.
>>>>>>>>>>>>>>>>>>>>>>>>>>> If you look at the Simple History in the UI, or at the manifoldcf.log file, you should be able to get a better sense of what went wrong. Without further information, I can't say any more.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <i93othman@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm Othman Belhaj, a software engineer from Société Générale in France. I'm actually using your recent version of ManifoldCF, 2.8. I'm working on an internal search engine. For this reason, I'm using ManifoldCF to index documents on Windows shares. I encountered a serious problem while crawling 35K documents. Most of the time, when ManifoldCF starts crawling a large document (19 MB, for example), it ends the job with the following error: repeated service interruptions - failure processing document : software caused connection abort: socket write error.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Can you give me some tips on how to solve this problem, please?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I use PostgreSQL 9.3.x and elasticsearch 2.1.0.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm looking forward to your response.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Othman BELHAJ
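
For the NoClassDefFoundError on org/apache/poi/POIXMLTypeLoader reported in this thread, one generic way to see whether a class is visible on a given classpath, and which jar supplies it, is a small Java check like the sketch below. ClassLocator and locate are hypothetical names used only for illustration; they are not part of ManifoldCF or POI. The check only reflects the classpath it is started with, so it is only meaningful when run with the same classpath as the MCF process, and it does not model any separate class loading that ManifoldCF may apply to connector jars.

```java
import java.security.CodeSource;

public class ClassLocator {
    /**
     * Returns the location (jar or directory) a class is loaded from,
     * or a "NOT FOUND" message if it cannot be resolved on this classpath.
     */
    static String locate(String className) {
        try {
            // initialize=false: resolve the class without running its static initializers
            Class<?> c = Class.forName(className, false, ClassLocator.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            // Classes from the JDK itself have no code source
            return src == null ? "bootstrap/JDK" : src.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "NOT FOUND: " + e;
        }
    }

    public static void main(String[] args) {
        // Default to the class named in the error message from the thread
        String name = args.length > 0 ? args[0] : "org.apache.poi.POIXMLTypeLoader";
        System.out.println(name + " -> " + locate(name));
    }
}
```

Running it once per suspect class (dotted name, not the slash form shown in the log) gives a quick picture of which jars are actually being picked up.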
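
On the question of "/i" for ignoring case: the thread does not establish what syntax the job's Paths tab accepts, so the following is only a general java.util.regex sketch, not a statement about ManifoldCF. In Java regular expressions there is no trailing /i modifier; case-insensitivity is requested with the inline flag (?i) or with Pattern.CASE_INSENSITIVE. The class name and the file names below are made up for illustration.

```java
import java.util.regex.Pattern;

public class CaseInsensitiveMatch {
    // Inline flag form: (?i) at the start makes the whole pattern case-insensitive
    static final Pattern INLINE = Pattern.compile("(?i).*(son|sound).*");
    // Equivalent form using a compile-time flag
    static final Pattern FLAGGED = Pattern.compile(".*(son|sound).*", Pattern.CASE_INSENSITIVE);
    // Same pattern without any flag, for comparison
    static final Pattern STRICT = Pattern.compile(".*(son|sound).*");

    static boolean matches(Pattern p, String fileName) {
        // matches() requires the whole input to match, hence the leading/trailing .*
        return p.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        // Hypothetical file names
        String[] names = {"RAPPORT_SON_2017.docx", "sound-check.pdf", "budget.xlsx"};
        for (String name : names) {
            System.out.println(name
                + " inline=" + matches(INLINE, name)
                + " flagged=" + matches(FLAGGED, name)
                + " strict=" + matches(STRICT, name));
        }
    }
}
```

The alternation (son|sound) also shows '|' working as "or" inside a group in this regex dialect; whether a given connector field interprets its input as a regex at all is a separate question answered by that connector's documentation.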