Mailing-List: contact user-help@manifoldcf.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@manifoldcf.apache.org
Delivered-To: mailing list user@manifoldcf.apache.org
MIME-Version: 1.0
From: Karl Wright
Date: Wed, 30 Aug 2017 13:53:25 -0400
Subject: Re: Question about ManifoldCF 2.8
To: "user@manifoldcf.apache.org"
Content-Type: multipart/alternative; boundary="94eb2c03b21464fb280557fc3404"
archived-at: Wed, 30 Aug 2017 17:53:32 -0000

--94eb2c03b21464fb280557fc3404
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi Beelz,

File-based sync is deprecated because people often have problems getting
file permissions right, and they do not understand how to shut processes
down cleanly. ZooKeeper is resilient against both problems. I highly
recommend using ZooKeeper sync.

ManifoldCF is engineered to not put files into memory, so you do not need
huge amounts of memory. The default values are more than enough for
35,000 files, which is a pretty small job for ManifoldCF.

Thanks,
Karl

On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki wrote:

> I'm actually not using ZooKeeper. I want to know how ZooKeeper sync
> differs from file-based sync. I also need guidance on how to manage my
> PC's memory. How many GB should I allocate for the start-agent of
> ManifoldCF? Is 4 GB enough to crawl 35K files?
>
> Othman.
>
> On Wed, 30 Aug 2017 at 16:11, Karl Wright wrote:
>
>> Your disk is not writable for some reason, and that's interfering with
>> ManifoldCF 2.8 locking.
>>
>> I would suggest two things:
>>
>> (1) Use ZooKeeper for sync instead of file-based sync.
>> (2) Check whether you still get failures after that.
>>
>> Thanks,
>> Karl
>>
>>
>> On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki wrote:
>>
>>> Hi Mr Karl,
>>>
>>> Thank you for your quick response. I have looked into the ManifoldCF
>>> log file and extracted the following warnings:
>>>
>>> - Attempt to set file lock 'D:\xxxx\apache_manifoldcf-2.8\multiprocess-file-example\.\.\synch area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES (Lowercase) Synapses.lock' failed : Access is denied.
>>>
>>> - Couldn't write to lock file; disk may be full. Shutting down process; locks may be left dangling. You must cleanup before restarting.
>>>
>>> "ES (lowercase) synapses" is the Elasticsearch output connection.
>>> Moreover, the job uses Tika to extract metadata and a file system as
>>> the repository connection. During the job, I don't extract the content
>>> of the documents. I was wondering if the issue comes from
>>> Elasticsearch?
>>>
>>> Othman.
>>>
>>>
>>> On Wed, 30 Aug 2017 at 14:08, Karl Wright wrote:
>>>
>>>> Hi Othman,
>>>>
>>>> ManifoldCF aborts a job if there's an error that looks like it might
>>>> go away on retry, but does not. It can be either on the repository
>>>> side or on the output side. If you look at the Simple History in the
>>>> UI, or at the manifoldcf.log file, you should be able to get a better
>>>> sense of what went wrong. Without further information, I can't say
>>>> any more.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>
>>>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I'm Othman Belhaj, a software engineer from Société Générale in
>>>>> France. I'm currently using your recent version of ManifoldCF, 2.8.
>>>>> I'm working on an internal search engine, and for this reason I'm
>>>>> using ManifoldCF to index documents on Windows shares. I encountered
>>>>> a serious problem while crawling 35K documents. Most of the time,
>>>>> when ManifoldCF starts crawling a large document (19 MB, for
>>>>> example), it ends the job with the following error: repeated service
>>>>> interruptions - failure processing document : software caused
>>>>> connection abort: socket write error.
>>>>> Can you give me some tips on how to solve this problem, please?
>>>>>
>>>>> I use PostgreSQL 9.3.x and Elasticsearch 2.1.0.
>>>>> I'm looking forward to your response.
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Othman BELHAJ
>>>>>
>>>>
>>>>
>>

--94eb2c03b21464fb280557fc3404
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
--94eb2c03b21464fb280557fc3404--