lucene-solr-user mailing list archives

From "Musshorn, Kris T CTR USARMY RDECOM ARL (US)" <kris.t.musshorn....@mail.mil>
Subject RE: [Non-DoD Source] Re: SimplePostTool error (UNCLASSIFIED)
Date Fri, 15 Jul 2016 17:08:21 GMT
CLASSIFICATION: UNCLASSIFIED

Thanks Yonik and Erick,

If I set -filetypes csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,rtf,htm,html,txt
would this prevent XML files from being indexed?

Why does the simple post tool index .cfm files, both with this setting and with the default settings?

Thanks,
Kris

~~~~~~~~~~~~~~~~~~~~~~~~~~
Kris T. Musshorn
FileMaker Developer - Contractor – Catapult Technology Inc.      
US Army Research Lab 
Aberdeen Proving Ground 
Application Management & Development Branch 
410-278-7251
kris.t.musshorn.ctr@mail.mil
~~~~~~~~~~~~~~~~~~~~~~~~~~


-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: Friday, July 15, 2016 12:30 PM
To: solr-user <solr-user@lucene.apache.org>
Subject: [Non-DoD Source] Re: SimplePostTool error (UNCLASSIFIED)

SimplePostTool is just that: simple. It's intended to get you started.
It is not a full-featured web crawler. As such, if you're encountering wonky web pages that
are not well-formed HTML, there's no guarantee that it'll handle them gracefully.
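
To illustrate the failure mode: the stack trace quoted below shows the page being fed to a strict XML parser (javax.xml.parsers.DocumentBuilder), which throws on the first structural error. The following is a minimal sketch, not SimplePostTool's actual code, of a link extractor that catches the parse exception and skips the page instead of terminating. The class name TolerantLinkExtractor and the method extractLinks are hypothetical and exist only for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration only; not part of Solr.
public class TolerantLinkExtractor {

    /**
     * Returns the href values of all <a> elements in the page.
     * If the page is not well-formed XML, logs the problem and
     * returns an empty list instead of propagating the exception.
     */
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));
            NodeList anchors = doc.getElementsByTagName("a");
            for (int i = 0; i < anchors.getLength(); i++) {
                String href = ((Element) anchors.item(i)).getAttribute("href");
                if (!href.isEmpty()) {
                    links.add(href);
                }
            }
        } catch (SAXException | IOException | ParserConfigurationException e) {
            // A strict XML parser rejects HTML with unclosed or mismatched tags.
            System.err.println("Skipping page that could not be parsed: " + e.getMessage());
        }
        return links;
    }

    public static void main(String[] args) {
        // Well-formed markup: the link is found.
        System.out.println(extractLinks(
            "<html><body><a href=\"/ok\">ok</a></body></html>"));   // prints [/ok]
        // Unclosed <a>: the parser throws, the page is skipped.
        System.out.println(extractLinks(
            "<html><body><a href=\"/broken\">oops</body>"));        // prints []
    }
}
```

Skipping a page this way loses its outgoing links, so a crawl that has to cope with real-world HTML is still better served by a lenient HTML parser or a dedicated crawler.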

Crawling websites is a pain, so if you require something robust I'd investigate Nutch (which
integrates with Solr/Lucene) or similar.

Best,
Erick

On Fri, Jul 15, 2016 at 9:01 AM, Musshorn, Kris T CTR USARMY RDECOM ARL (US) <kris.t.musshorn.ctr@mail.mil> wrote:
> CLASSIFICATION: UNCLASSIFIED
>
> How do I correct this error when running the simple post tool against a website?
> The tool successfully indexed for about 30 mins before throwing this error and terminating.
>
> [Fatal Error] :642:15: XML document structures must start and end within the same entity.
> Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException; lineNumber: 642; columnNumber: 15; XML document structures must start and end within the same entity.
>         at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1219)
>         at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:601)
>         at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:618)
>         at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:618)
>         at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:618)
>         at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:618)
>         at org.apache.solr.util.SimplePostTool.postWebPages(SimplePostTool.java:548)
>         at org.apache.solr.util.SimplePostTool.doWebMode(SimplePostTool.java:351)
>         at org.apache.solr.util.SimplePostTool.execute(SimplePostTool.java:182)
>         at org.apache.solr.util.SimplePostTool.main(SimplePostTool.java:167)
> Caused by: org.xml.sax.SAXParseException; lineNumber: 642; columnNumber: 15; XML document structures must start and end within the same entity.
>         at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
>         at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
>         at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
>         at org.apache.solr.util.SimplePostTool.makeDom(SimplePostTool.java:1028)
>         at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1201)
>         ... 9 more
>
> Thanks,
> Kris
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
> Kris T. Musshorn
> FileMaker Developer - Contractor - Catapult Technology Inc.
> US Army Research Lab
> Aberdeen Proving Ground
> Application Management & Development Branch
> 410-278-7251
> kris.t.musshorn.ctr@mail.mil
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>
>
>
> CLASSIFICATION: UNCLASSIFIED


CLASSIFICATION: UNCLASSIFIED