From: Karl Wright
Date: Wed, 6 Mar 2019 01:54:18 -0500
Subject: Re: Sharepoint Crawl - Missing documents
To: user@manifoldcf.apache.org

Hi Gaurav,

Which version of SharePoint is this?
And, did you install the SharePoint plugin for ManifoldCF, and select the correct version of SharePoint in the connection configuration?

Versions of SharePoint after 2010 limited the number of documents that could be returned from the Lists service. The MCF plugin for SharePoint not only includes the ability to obtain user permissions, but also provides our own implementation of Lists that is not so limited.

Karl

On Wed, Mar 6, 2019 at 12:39 AM Gaurav G wrote:

> Hi Karl,
>
> There are no subsites as such. It is one big library with all documents in it in a flat structure. The same goes for the list.
> We enabled the logging for the connector and ran the list job. Below is the exception that it throws after it has crawled the list partially. It looks like after it gets this exception it tries to start over from the beginning, does that a few times, and then quits.
>
> DEBUG 2019-03-05T23:48:18,099 (Worker thread '6') - SharePoint: Checking whether to include list item '/CONTENT/145120_.000'
> DEBUG 2019-03-05T23:48:18,099 (Worker thread '6') - SharePoint: Checking whether to include list item '/CONTENT/145121_.000'
> DEBUG 2019-03-05T23:48:18,099 (Worker thread '6') - SharePoint: Checking whether to include list item '/CONTENT/145122_.000'
> DEBUG 2019-03-05T23:50:15,599 (Worker thread '6') - SharePoint: Got an unknown remote exception getting child documents for site guid {A6079591-4150-410E-9C12-B5CAEF02D400} - axis fault = Server.userException, detail = org.xml.sax.SAXException: Processing instructions are not allowed within SOAP messages - retrying
> org.apache.axis.AxisFault: ; nested exception is:
>         org.xml.sax.SAXException: Processing instructions are not allowed within SOAP messages
>         at org.apache.axis.AxisFault.makeFault(AxisFault.java:101) ~[axis-1.4.jar:?]
>         at org.apache.axis.SOAPPart.getAsSOAPEnvelope(SOAPPart.java:701) ~[axis-1.4.jar:?]
>         at org.apache.axis.Message.getSOAPEnvelope(Message.java:435) ~[axis-1.4.jar:?]
>         at org.apache.axis.handlers.soap.MustUnderstandChecker.invoke(MustUnderstandChecker.java:62) ~[axis-1.4.jar:?]
>         at org.apache.axis.client.AxisClient.invoke(AxisClient.java:206) ~[axis-1.4.jar:?]
>         at org.apache.axis.client.Call.invokeEngine(Call.java:2784) ~[axis-1.4.jar:?]
>         at org.apache.axis.client.Call.invoke(Call.java:2767) ~[axis-1.4.jar:?]
>         at org.apache.axis.client.Call.invoke(Call.java:2443) ~[axis-1.4.jar:?]
>         at org.apache.axis.client.Call.invoke(Call.java:2366) ~[axis-1.4.jar:?]
>         at org.apache.axis.client.Call.invoke(Call.java:1812) ~[axis-1.4.jar:?]
>         at com.microsoft.sharepoint.webpartpages.PermissionsSoapStub.getListItems(PermissionsSoapStub.java:234) ~[mcf-sharepoint-connector.jar:?]
>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getChildren(SPSProxyHelper.java:661) [mcf-sharepoint-connector.jar:?]
>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:898) [mcf-sharepoint-connector.jar:?]
>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]
> Caused by: org.xml.sax.SAXException: Processing instructions are not allowed within SOAP messages
>         at org.apache.axis.encoding.DeserializationContext.startDTD(DeserializationContext.java:1161) ~[?:?]
>         at org.apache.xerces.parsers.AbstractSAXParser.doctypeDecl(Unknown Source) ~[xercesImpl-2.10.0.jar:?]
>         at org.apache.xerces.impl.dtd.XMLDTDValidator.doctypeDecl(Unknown Source) ~[xercesImpl-2.10.0.jar:?]
>         at org.apache.xerces.impl.XMLDocumentScannerImpl.scanDoctypeDecl(Unknown Source) ~[xercesImpl-2.10.0.jar:?]
>         at org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(Unknown Source) ~[xercesImpl-2.10.0.jar:?]
>         at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source) ~[xercesImpl-2.10.0.jar:?]
>         at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) ~[xercesImpl-2.10.0.jar:?]
>         at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) ~[xercesImpl-2.10.0.jar:?]
>         at org.apache.xerces.parsers.XMLParser.parse(Unknown Source) ~[xercesImpl-2.10.0.jar:?]
>         at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source) ~[xercesImpl-2.10.0.jar:?]
>         at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source) ~[xercesImpl-2.10.0.jar:?]
>         at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source) ~[xercesImpl-2.10.0.jar:?]
>         at org.apache.axis.encoding.DeserializationContext.parse(DeserializationContext.java:227) ~[?:?]
>         at org.apache.axis.SOAPPart.getAsSOAPEnvelope(SOAPPart.java:696) ~[?:?]
>         ... 12 more
> WARN 2019-03-05T23:50:15,602 (Worker thread '6') - Service interruption reported for job 1551357423253 connection 'Finance Test List': Remote procedure exception: ; nested exception is:
>         org.xml.sax.SAXException: Processing instructions are not allowed within SOAP messages
>
> Thanks,
> Gaurav
>
> On Mon, Mar 4, 2019 at 5:11 PM Karl Wright wrote:
>
>> Hi Gaurav,
>> There is no document count threshold value.
>> If you can identify libraries or subsites that aren't being crawled, you can turn on connector debugging to see why the connector is skipping them. There could be many reasons for a library or site to be skipped, e.g. bad specification rules, or permissions insufficient to read them.
>>
>> Karl
>>
>> On Mon, Mar 4, 2019 at 4:03 AM Gaurav G wrote:
>>
>>> Hi,
>>>
>>> We are trying to crawl a SharePoint list with about 150,000 items and a library with about 125,000 documents.
>>> We have separate jobs for both.
>>> The list job only crawls about 50,000 items and completes cleanly, while the library job crawls about 40,000 documents and completes cleanly.
>>> We are trying to figure out why we are not getting the complete list. Is there a threshold value beyond which the crawling doesn't happen?
>>> For smaller repos (<30,000 items) we are not facing any issue. Those get crawled completely.
>>>
>>> Thanks,
>>> Gaurav
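
A note on the quoted stack trace: the SAXException is raised from Axis's DeserializationContext.startDTD, called from Xerces's doctypeDecl, so the response body apparently began with a DOCTYPE declaration rather than a SOAP envelope. One possible cause, which is only an assumption and is not confirmed anywhere in this thread, is that the server returned an HTML error page. The sketch below is a standalone illustration of that parser behaviour using the JDK's SAX API. DoctypeProbe and hasDoctype are hypothetical names and are not part of ManifoldCF or Axis.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical helper for illustration only; not part of ManifoldCF or Axis.
final class DoctypeProbe {

    /** Returns true if the SAX parser reports a DOCTYPE declaration in the given text. */
    static boolean hasDoctype(String xml) {
        final boolean[] sawDoctype = {false};
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void startDTD(String name, String publicId, String systemId) throws SAXException {
                // Record the DOCTYPE and abort the parse, much as a SOAP consumer
                // that refuses DOCTYPE declarations would.
                sawDoctype[0] = true;
                throw new SAXException("DOCTYPE '" + name + "' found");
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            // startDTD is a LexicalHandler callback, so it has to be registered as a property.
            reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
            reader.setContentHandler(handler);
            reader.setErrorHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return false;
        } catch (SAXException e) {
            if (sawDoctype[0]) {
                return true;
            }
            throw new IllegalStateException("Input is not well-formed XML", e);
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String htmlErrorPage = "<!DOCTYPE html><html><body>Server Error</body></html>";
        String soapEnvelope =
            "<soap:Envelope xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\">"
            + "<soap:Body/></soap:Envelope>";
        System.out.println("HTML page has DOCTYPE: " + hasDoctype(htmlErrorPage));
        System.out.println("SOAP envelope has DOCTYPE: " + hasDoctype(soapEnvelope));
    }
}
```

If that assumption holds, capturing the raw HTTP response for the failing getListItems call (for example from the web server logs or a network trace) should show what the server actually sent back.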