lucene-solr-user mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Solr response error 403 when I try to index medium.com articles
Date Wed, 30 Mar 2016 17:05:05 GMT

403 means "forbidden" 

Something about the request Solr is sending -- or something about the IP 
address Solr is connecting from when talking to medium.com -- is causing 
the medium.com web server to reject the request.

This is something that servers may choose to do if they detect (via 
headers, or missing headers, or reverse IP lookup, or other 
distinctive nuances of how the connection was made) that the 
client connecting to their server isn't a "human browser" (e.g. Firefox, 
Chrome, Safari) and is a robot that they don't want to cooperate with (e.g. 
they might be happy to serve their pages to the Googlebot crawler, but not 
to some third party they've never heard of).

The specifics of how/why you might get a 403 for any given URL are hard to 
debug -- it might literally depend on how many requests you've sent to that 
domain in the past X hours.
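
If you want to check whether the User-Agent header is what the server objects
to, a minimal sketch like the one below (Python standard library only) compares
the status code returned for different User-Agent values. The function names
and the agent strings are illustrative assumptions, not anything Solr provides.
The JVM's built-in HTTP client typically identifies itself as "Java/<version>",
which is why a value of that shape is worth trying. Since the server may also
look at your IP, your request rate, or other headers, a difference here is a
hint rather than proof.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent):
    """Build a GET request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def probe_status(url, user_agent, timeout=10):
    """Return the HTTP status code the server sends back for this User-Agent."""
    try:
        req = build_request(url, user_agent)
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the code is what we want to compare.
        return err.code


# Example usage (performs real network requests; agent strings are placeholders):
#
# url = "https://medium.com/@producthunt/10-mac-menu-bar-apps-you-can-t-live-without-df087d2c6b1"
# for agent in ("Java/1.8.0", "Mozilla/5.0 (compatible; my-test-crawler/0.1)"):
#     print(agent, "->", probe_status(url, agent))
```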

In general Solr's ContentStream indexing from remote hosts isn't intended 
to be a super robust solution for crawling arbitrary websites on the web 
-- if that's your goal, then I would suggest you look into running a more 
robust crawler (Nutch, Droids, Lucidworks Fusion, etc.) that has more 
features and debugging options (notably: rate limiting) and use that code 
to fetch the content, then push it to Solr.
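
To make the "fetch it yourself, then push it to Solr" idea concrete, here is a
rough sketch in Python (standard library only). It is not a replacement for a
real crawler: it only shows a fixed delay between requests, an explicit
User-Agent, and a POST of the fetched bytes to Solr's extracting handler. The
core URL (http://localhost:8983/solr/mycore), the helper names, and the choice
of the page URL as the document id are all assumptions made for illustration,
and it assumes the extracting handler is registered at /update/extract.

```python
import time
import urllib.parse
import urllib.request


class MinDelay:
    """Tiny rate limiter: enforce at least `seconds` between calls to wait()."""

    def __init__(self, seconds, clock=time.monotonic, sleep=time.sleep):
        self.seconds = seconds
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.seconds - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


def fetch(url, user_agent, timeout=10):
    """Download a page with an explicit User-Agent; return (bytes, content type)."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read(), resp.headers.get("Content-Type", "text/html")


def build_extract_request(solr_core_url, doc_id, body, content_type):
    """Build a POST to the extracting handler (assumed to be at /update/extract)."""
    params = urllib.parse.urlencode({"literal.id": doc_id, "commit": "true"})
    return urllib.request.Request(
        f"{solr_core_url}/update/extract?{params}",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )


def crawl_and_index(urls, solr_core_url, user_agent, delay_seconds=5.0):
    """Fetch each URL politely, then push the raw content to Solr."""
    limiter = MinDelay(delay_seconds)
    for url in urls:
        limiter.wait()
        body, content_type = fetch(url, user_agent)
        req = build_extract_request(solr_core_url, url, body, content_type)
        with urllib.request.urlopen(req, timeout=30) as resp:
            resp.read()  # Solr's response; inspect it if you need the status
```

Committing on every document keeps the sketch short; for more than a handful
of pages you would probably want to commit once at the end instead.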


: Date: Tue, 29 Mar 2016 20:54:52 -0300
: From: Jeferson dos Anjos <jefersonanjos@packdocs.com>
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Solr response error 403 when I try to index medium.com articles
: 
: I'm trying to index some pages from Medium, but I get a 403 error. I
: believe it is because Medium does not accept Solr's user agent. Has
: anyone experienced this? Do you know how to change it?
: 
: I appreciate any help
: 
: <lst name="responseHeader">
: <int name="status">500</int>
: <int name="QTime">94</int>
: </lst>
: <lst name="error">
: <str name="msg">
: Server returned HTTP response code: 403 for URL:
: https://medium.com/@producthunt/10-mac-menu-bar-apps-you-can-t-live-without-df087d2c6b1
: </str>
: <str name="trace">
: java.io.IOException: Server returned HTTP response code: 403 for URL:
: https://medium.com/@producthunt/10-mac-menu-bar-apps-you-can-t-live-without-df087d2c6b1
: at sun.reflect.GeneratedConstructorAccessor314.newInstance(Unknown Source)
: at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
: at java.lang.reflect.Constructor.newInstance(Unknown Source)
: at sun.net.www.protocol.http.HttpURLConnection$10.run(Unknown Source)
: at sun.net.www.protocol.http.HttpURLConnection$10.run(Unknown Source)
: at java.security.AccessController.doPrivileged(Native Method)
: at sun.net.www.protocol.http.HttpURLConnection.getChainedException(Unknown Source)
: at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown Source)
: at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
: at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(Unknown Source)
: at org.apache.solr.common.util.ContentStreamBase$URLStream.getStream(ContentStreamBase.java:87)
: at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:158)
: at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
: at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:144)
: at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:291)
: at org.apache.solr.core.SolrCore.execute(SolrCore.java:2006)
: at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
: at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:413)
: at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:204)
: at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
: at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
: at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
: at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
: at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
: at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
: at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
: at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
: at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
: at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
: at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
: at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
: at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
: at org.eclipse.jetty.server.Server.handle(Server.java:368)
: at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
: at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
: at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
: at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
: at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
: at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
: at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
: at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
: at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
: at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
: at java.lang.Thread.run(Unknown Source)
: Caused by: java.io.IOException: Server returned HTTP response code: 403 for URL:
: https://medium.com/@producthunt/10-mac-menu-bar-apps-you-can-t-live-without-df087d2c6b1
: at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown Source)
: at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
: at sun.net.www.protocol.http.HttpURLConnection.getHeaderField(Unknown Source)
: at java.net.URLConnection.getContentType(Unknown Source)
: at sun.net.www.protocol.https.HttpsURLConnectionImpl.getContentType(Unknown Source)
: at org.apache.solr.common.util.ContentStreamBase$URLStream.getStream(ContentStreamBase.java:84)
: ... 33 more
: </str>
: <int name="code">500</int>
: </lst>
: </response>
: 
: 
: Jeferson M. dos Anjos
: CEO of Packdocs
: P.S.: Keep your files alive with Packdocs (www.packdocs.com)
: 

-Hoss
http://www.lucidworks.com/
