manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: URISyntaxException
Date Thu, 17 Feb 2011 13:03:04 GMT
Hi,
You've done nothing wrong; the stack trace is being dumped because of
a debugging line that was inadvertently left in the code recently.  It
should not change the way the crawl occurs.  Regardless, I've removed
the offending line from trunk now.

In case you are curious, what is happening is that the page link the
crawler has located is not properly URI encoded.  Space characters are
illegal in URIs.  Normally, the web connector would skip this link
and note that in the log.
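
For illustration only, here is a minimal sketch of the "skip and log" behaviour described above, using nothing but java.net.URI. The class name LinkFilter and the method parseOrNull are hypothetical and are not taken from the ManifoldCF source; the real connector uses its own logging and link handling.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class LinkFilter {
    /**
     * Hypothetical helper: returns the parsed URI, or null when the raw
     * link is not a legal URI (for example, when it contains a space).
     */
    public static URI parseOrNull(String rawLink) {
        try {
            // The single-argument constructor rejects illegal characters
            // such as an unencoded space.
            return new URI(rawLink);
        } catch (URISyntaxException e) {
            // Stand-in for the connector's own log; note the link and skip it.
            System.err.println("Skipping malformed link: " + e.getMessage());
            return null;
        }
    }

    public static void main(String[] args) {
        String[] links = {
            "/link/to/the/page/alan smithee.xls",   // unencoded space: rejected
            "/link/to/the/page/alan%20smithee.xls"  // properly encoded: accepted
        };
        for (String link : links) {
            URI uri = parseOrNull(link);
            System.out.println(link + " -> " + (uri == null ? "skipped" : "accepted"));
        }
    }
}
```

A link with the space written as %20 parses without complaint, so the fix on the content side is to encode the link properly in the page that contains it.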

Thanks,
Karl


On Thu, Feb 17, 2011 at 7:27 AM,  <bull1985@gmx.de> wrote:
> Hi all,
>
> I just checked out the newest version of MCF and now I am getting this error
> while crawling certain pages. What can I do about that?
>
> Error Message:
>
> java.net.URISyntaxException: Illegal character in path at index 73:
> /link/to/the/page/alan smithee.xls
>         at java.net.URI$Parser.fail(URI.java:2809)
>         at java.net.URI$Parser.checkChars(URI.java:2982)
>         at java.net.URI$Parser.parseHierarchical(URI.java:3066)
>         at java.net.URI$Parser.parse(URI.java:3024)
>         at java.net.URI.<init>(URI.java:578)
>         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.makeDocumentIdentifier(WebcrawlerConnector.java:4774)
>         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector$ProcessActivityLinkHandler.noteDiscoveredLink(WebcrawlerConnector.java:5586)
>         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector$ProcessActivityHTMLHandler.noteAHREF(WebcrawlerConnector.java:5701)
>         at org.apache.manifoldcf.crawler.connectors.webcrawler.LinkParseState.noteNonscriptTag(LinkParseState.java:44)
>         at org.apache.manifoldcf.crawler.connectors.webcrawler.FormParseState.noteNonscriptTag(FormParseState.java:48)
>         at org.apache.manifoldcf.crawler.connectors.webcrawler.ScriptParseState.noteTag(ScriptParseState.java:50)
>         at org.apache.manifoldcf.crawler.connectors.webcrawler.BasicParseState.dealWithCharacter(BasicParseState.java:223)
>         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.handleHTML(WebcrawlerConnector.java:6492)
>         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.extractLinks(WebcrawlerConnector.java:5553)
>         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:1132)
>         at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:585)
>
>
> How I set it up (hope that it helps):
>
> installed postgreSQL 8.3.11-1
> checked out the project into the MCF folder
> added jcifs1.2.15.jar at /connectors/jcifs/jcifs and renamed it to jcifs.jar
> built the project with ant at /mcf
> copied the content of "dist" to c:/documents and settings/myUserAccount/lcf
> added the properties.xml and the logging.ini there
> created a synchronization folder
> set MCF_HOME to the folder above
>
> executed in /processes/scripts these commands:
>
> org.apache.manifoldcf.core.DBCreate postgres p0sTgres
> org.apache.manifoldcf.agents.Install
> org.apache.manifoldcf.agents.Register org.apache.manifoldcf.crawler.system.CrawlerAgent
> org.apache.manifoldcf.agents.RegisterOutput org.apache.manifoldcf.agents.output.solr.SolrConnector "SOLR Connector"
> org.apache.manifoldcf.authorities.RegisterAuthority org.apache.manifoldcf.authorities.authorities.activedirectory.ActiveDirectoryAuthority "Active Directory Authority"
> org.apache.manifoldcf.crawler.Register org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector "Filesystem Connector"
> org.apache.manifoldcf.crawler.Register org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector "Database Connector"
> org.apache.manifoldcf.crawler.Register org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector "Windows Share Connector"
> org.apache.manifoldcf.crawler.Register org.apache.manifoldcf.crawler.connectors.rss.RSSConnector "RSS Connector"
> org.apache.manifoldcf.crawler.Register org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector "Web Connector"
>
> and copied the content of /lcf/web/war to my /tomcat/webapps
>
> Thanks for your help and Best regards,
> Julian
>
>
