manifoldcf-user mailing list archives

From "fred fredson" <bull1...@gmx.de>
Subject Re: URISnytaxException
Date Thu, 17 Feb 2011 13:09:27 GMT
Hi,
thanks for your quick reply and your explanation.

~Julian

>
> -------- Original Message --------
> Date: Thu, 17 Feb 2011 08:03:04 -0500
> From: Karl Wright <daddywri@gmail.com>
> To: connectors-user@incubator.apache.org
> Subject: Re: URISnytaxException
>
> Hi,
> You've done nothing wrong; the stack trace is being dumped because of
> a debugging line that was inadvertently left in the code recently.  It
> should not change the way the crawl occurs.  Regardless, I've removed
> the offending line from trunk now.
> 
> In case you are curious, what is happening is that the page link the
> crawler has located is not properly URI encoded.  Space characters are
> illegal in URIs.  Normally, the web connector would skip this link
> and note that in the log.
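
As an illustration of the point above, here is a minimal Java sketch, independent of the ManifoldCF code base. The class name UriSpaceDemo and its helper methods are hypothetical. It shows that the single-argument java.net.URI constructor rejects a path containing a space, while the multi-argument constructor percent-encodes illegal characters. It is not how the web connector itself handles such links.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriSpaceDemo {
    // Returns true if the single-argument URI constructor accepts the string.
    // That constructor expects input that is already properly encoded.
    static boolean parses(String raw) {
        try {
            new URI(raw);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // Builds a URI from an unencoded path. The multi-argument constructor
    // percent-encodes characters that are illegal in a path, such as spaces.
    static String encodePath(String path) {
        try {
            return new URI(null, null, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String raw = "/link/to/the/page/alan smithee.xls";
        System.out.println(parses(raw));      // false: the space is illegal
        System.out.println(encodePath(raw));  // /link/to/the/page/alan%20smithee.xls
    }
}
```
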
> 
> Thanks,
> Karl
> 
> 
> On Thu, Feb 17, 2011 at 7:27 AM,  <bull1985@gmx.de> wrote:
> > Hi all,
> >
> > I just checked out the newest version of MCF and now I am getting this error
> > while crawling certain pages. What can I do about that?
> >
> > Error Message:
> >
> > java.net.URISyntaxException: Illegal character in path at index 73: /link/to/the/page/alan smithee.xls
> >         at java.net.URI$Parser.fail(URI.java:2809)
> >         at java.net.URI$Parser.checkChars(URI.java:2982)
> >         at java.net.URI$Parser.parseHierarchical(URI.java:3066)
> >         at java.net.URI$Parser.parse(URI.java:3024)
> >         at java.net.URI.<init>(URI.java:578)
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.makeDocumentIdentifier(WebcrawlerConnector.java:4774)
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector$ProcessActivityLinkHandler.noteDiscoveredLink(WebcrawlerConnector.java:5586)
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector$ProcessActivityHTMLHandler.noteAHREF(WebcrawlerConnector.java:5701)
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.LinkParseState.noteNonscriptTag(LinkParseState.java:44)
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.FormParseState.noteNonscriptTag(FormParseState.java:48)
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.ScriptParseState.noteTag(ScriptParseState.java:50)
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.BasicParseState.dealWithCharacter(BasicParseState.java:223)
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.handleHTML(WebcrawlerConnector.java:6492)
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.extractLinks(WebcrawlerConnector.java:5553)
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:1132)
> >         at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
> >         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:585)
> >
> >
> > How I set it up (I hope this helps):
> >
> > installed PostgreSQL 8.3.11-1
> > checked out the project into the MCF folder
> > added jcifs1.2.15.jar at /connectors/jcifs/jcifs and renamed it to jcifs.jar
> > built the project with ant at /mcf
> > copied the content of "dist" to c:/documents and settings/myUserAccount/lcf
> > added the properties.xml and the logging.ini there
> > created a synchronization folder
> > set MCF_HOME to the folder above
> >
> > executed in /processes/scripts these commands:
> >
> > org.apache.manifoldcf.core.DBCreate postgres p0sTgres
> > org.apache.manifoldcf.agents.Install
> > org.apache.manifoldcf.agents.Register org.apache.manifoldcf.crawler.system.CrawlerAgent
> > org.apache.manifoldcf.agents.RegisterOutput org.apache.manifoldcf.agents.output.solr.SolrConnector "SOLR Connector"
> > org.apache.manifoldcf.authorities.RegisterAuthority org.apache.manifoldcf.authorities.authorities.activedirectory.ActiveDirectoryAuthority "Active Directory Authority"
> > org.apache.manifoldcf.crawler.Register org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector "Filesystem Connector"
> > org.apache.manifoldcf.crawler.Register org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector "Database Connector"
> > org.apache.manifoldcf.crawler.Register org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector "Windows Share Connector"
> > org.apache.manifoldcf.crawler.Register org.apache.manifoldcf.crawler.connectors.rss.RSSConnector "RSS Connector"
> > org.apache.manifoldcf.crawler.Register org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector "Web Connector"
> >
> > and copied the content of /lcf/web/war to my /tomcat/webapps
> >
> > Thanks for your help and best regards,
> > Julian
> >
> >
