From: Karl Wright
To: user@manifoldcf.apache.org
Date: Fri, 12 Dec 2014 09:55:41 -0500
Subject: Re: ElastiSearch missing doc

I've created CONNECTORS-1120 for this fix. I should have something to try shortly.

Karl


On Fri, Dec 12, 2014 at 9:41 AM, Kamil Żyta <kamil.zyta@pwr.edu.pl> wrote:
On Fri, Dec 12, 2014 at 09:14:40AM -0500, Karl Wright wrote:
> Hi Kamil,
>
> You are getting a 404 error when ManifoldCF tries to delete a document from
> the ElasticSearch index:
>
> >>>>>>
>     else if (code == 404)
>     {
>       setResult(IOutputHistoryActivity.HTTP_ERROR, Result.ERROR, "Page not found: " + response);
>       throw new ManifoldCFException("Server/page not found");
>     }
> <<<<<<
>
> The URL it is using is constructed as follows:
>
> >>>>>>
>       String idField = URLEncoder.encode(documentURI);
>       HttpDelete method = new HttpDelete(config.getServerLocation() +
>           "/" + config.getIndexName() + "/" + config.getIndexType()
>           + "/" + idField);
>       call(method);
> <<<<<<
>
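
The URL construction quoted above can be tried in isolation. The following is only a minimal sketch: the class name `DeleteUrlSketch`, the method `buildDeleteUrl`, and the server, index, type, and document values are all hypothetical and are not taken from the connector. It uses the two-argument `URLEncoder.encode` with an explicit UTF-8 charset name rather than the one-argument call in the quoted snippet.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone sketch; not part of the ManifoldCF connector.
final class DeleteUrlSketch {

  // Builds <server>/<index>/<type>/<encoded id>, mirroring the quoted concatenation.
  static String buildDeleteUrl(String serverLocation, String indexName,
                               String indexType, String documentURI) {
    try {
      // Form-encode the document URI so that its own '/' and ':' characters
      // do not add extra path segments to the request URL.
      String idField = URLEncoder.encode(documentURI, StandardCharsets.UTF_8.name());
      return serverLocation + "/" + indexName + "/" + indexType + "/" + idField;
    } catch (UnsupportedEncodingException e) {
      // UTF-8 is a charset that every Java platform is required to support.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    // All four argument values are made up for illustration.
    System.out.println(buildDeleteUrl("http://localhost:9200", "index", "generic",
        "http://example.com/docs/a b.pdf"));
    // prints http://localhost:9200/index/generic/http%3A%2F%2Fexample.com%2Fdocs%2Fa+b.pdf
  }
}
```

Because the whole document URI becomes a single encoded path segment, the ID that ES sees on delete has to match the ID used when the document was indexed.
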
> So there are a number of possibilities. The first possibility is that ES was
> down entirely when this job ended, and so document removal requests failed
> for a legitimate reason. Second, it may be that the document in question
> has already been deleted, and while this would formerly return a 200 status
> code in the version of ES the connector was written for, it now returns a
> 404. Finally, maybe the REST API changed so much that it is no longer
> possible to delete a document from the index this way. What version of
> ElasticSearch are you using, and can you find REST API documentation for
> that version that you could point me at? Can you do enough research to
> find out what should work here?
>

  "version" : {
    "number" : "1.4.1",
    "build_hash" : "89d3241d670db65f994242c8e8383b169779e2d4",
    "build_timestamp" : "2014-11-26T15:49:29Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.2"
  },

The URL for deleting is correct:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-delete.html
and I found this:
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/delete-doc.html

"If the document isn’t found, we get a 404 Not Found response code and a body like (...)"

K
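
Given the documented behaviour quoted above, one plausible way for a client to handle delete responses is to treat a 404 as "document already absent" instead of as a failure. The sketch below is illustrative only and is not the change made under CONNECTORS-1120; the class `DeleteStatusSketch`, the enum `Outcome`, and the method `classifyDeleteStatus` are hypothetical names.

```java
// Hypothetical sketch; names are illustrative and not taken from the connector source.
final class DeleteStatusSketch {

  enum Outcome { DELETED, ALREADY_ABSENT, FAILED }

  // Maps the HTTP status code of a delete request to an outcome.
  static Outcome classifyDeleteStatus(int code) {
    if (code >= 200 && code < 300) {
      return Outcome.DELETED;         // the document was removed
    }
    if (code == 404) {
      return Outcome.ALREADY_ABSENT;  // assumption: nothing left to delete, so not an error
    }
    return Outcome.FAILED;            // any other status is reported as a failure
  }

  public static void main(String[] args) {
    for (int code : new int[] {200, 404, 500}) {
      System.out.println(code + " -> " + classifyDeleteStatus(code));
    }
  }
}
```

This assumes that a 404 on a delete always means the document itself is missing. That may not hold in every case, for example if the request URL is wrong for some other reason, so a real fix might want to inspect the response body as well.
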

>
>
>
> On Fri, Dec 12, 2014 at 8:56 AM, Kamil Żyta <kamil.zyta@pwr.edu.pl> wrote:
> >
> > Hi,
> > When I test ES as an indexer, some jobs end with 'Error: Server/page not
> > found'. In the ES log I have some exceptions about documents being too
> > big. How does this affect the job? Full MCF logs:
> >
> > ERROR 2014-12-12 14:45:24,915 (Document cleanup thread '2') - Exception tossed: Server/page not found
> > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Server/page not found
> >         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection.handleResultCode(ElasticSearchConnection.java:234)
> >         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection.call(ElasticSearchConnection.java:203)
> >         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchDelete.execute(ElasticSearchDelete.java:45)
> >         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.removeDocument(ElasticSearchConnector.java:578)
> >         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.removeDocument(IncrementalIngester.java:2350)
> >         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentDeleteMultiple(IncrementalIngester.java:1059)
> >         at org.apache.manifoldcf.crawler.system.DocumentCleanupThread.run(DocumentCleanupThread.java:189)
> >
> > Thanks,
> > Kamil
> >